68040 BlockMove
Volume Number:9
Issue Number:5
Column Tag:Coding Efficiently

Related Info: Memory Manager

An Efficient 68040 BlockMove

Moving data at light speed and only when you need to

By Mike Scanlin, MacTech Magazine Regular Contributing Author

Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.

From an application writer’s point of view, Apple’s 68040 version of BlockMove is slower than it needs to be for three reasons: (1) trap dispatch overhead, (2) clearing the 040 data cache after it’s finished moving the bytes and (3) excessive calculations when determining the optimal set of Move instructions to use. This article presents an alternative version of BlockMove that doesn’t have these performance problems and is up to 3x faster than Apple’s as a result, but it comes with the restriction that you can’t move code with it. Also in this article are faster versions of the system’s NewPtrClear and NewHandleClear (which have been optimized for the 68040, too, and are 1.5x faster than Apple’s versions). You can link these functions with your programs and call them directly for improved performance.

PROFILING BLOCKMOVE

The first thing I did when I sat down to write a fast BlockMove was to profile the parameters that the existing BlockMove was receiving. I wrote an INIT that patched BlockMove and gathered statistics on the number of bytes being moved each time and the relative alignment between the source and destination pointers. Once my INIT was installed I ran a variety of applications and performed a variety of typical tasks. I did some spreadsheet recalcs, scrolled around and printed from within several word processors, played with some painting and drawing programs, connected to AppleLink, browsed a few HyperCard stacks, compiled several programs, did a Finder copy (spent most of my time waiting for this to finish ;-) ), ran ResEdit and other utilities, etc. After 15 minutes of non-stop user activity, I dumped my statistics.

The executive summary is this: for these types of operations, (1) BlockMove is called about 500 times per second, (2) most BlockMoves are 64 bytes or less and (3) the source and destination addresses are usually divisible by 16 or equal to <an address divisible by 16> plus four. One amazing fact I learned was that 3.6% of the time BlockMove is called to move zero bytes (!).

Here is a summary of the value of D0 on entry to BlockMove (the number of bytes to be moved). The missing ranges occurred less than 1% of the time:

Bytes to move    % of calls to BlockMove

0                 3.6
1-31             46.0
32-63            21.4
64-95             7.9
96-127            2.7
128-159           2.0
160-191           1.2
192-223           1.7
224-255           1.2
256-287           1.1
512-543           3.2
8192 or more      1.5

So it’s pretty clear that any replacement BlockMove needs to check for and deal with small moves efficiently before it goes off and processes large moves. The routine MyBlockMoveData given here does this.
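The bucketing behind the table above can be sketched in C. This is only an illustration, not the INIT’s actual code (which isn’t shown here); the names size_bucket, record_size and gSizeCounts are hypothetical, and the 32-byte bucket width is inferred from the ranges in the table.

```c
#include <stddef.h>

/* Bucket 0 holds zero-byte moves, buckets 1..256 hold 32-byte-wide
 * ranges (1-31, 32-63, ...), and the last bucket holds 8192 or more. */
#define NUM_BUCKETS 258

unsigned long gSizeCounts[NUM_BUCKETS];   /* hypothetical tally array */

/* Map a byte count (the value of D0 on entry) to a bucket index. */
size_t size_bucket(unsigned long byteCount)
{
    if (byteCount == 0)
        return 0;
    if (byteCount >= 8192)
        return NUM_BUCKETS - 1;
    return (size_t)(byteCount / 32) + 1;
}

/* Called once per intercepted move to update the tally. */
void record_size(unsigned long byteCount)
{
    gSizeCounts[size_bucket(byteCount)]++;
}
```

Dividing each bucket’s count by the total number of calls gives percentages like the ones in the table.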

A 68040 optimized BlockMove is certainly going to have to take advantage of the special Move16 instruction if it wants to do well for large moves. Unfortunately, the Move16 instruction cannot be used for every large move. It depends on the direction you’re moving the data (Move16 only supports post-increment, not pre-decrement, addressing mode) and whether or not you have 16-byte aligned source and destination addresses. So, in order to get a feel for how often I’d be able to use the Move16 instruction I made an array of 16x16 elements where the indexes into the array were the low four bits of the source address and destination address. My INIT was set up so that every call to BlockMove would add 1 to the appropriate element in my array.

The results of this alignment test show that only two cases come up regularly: (1) the source and destination addresses are divisible by 16, and (2) the source and destination addresses are equal to <an address divisible by 16> plus 4. The first case can probably be explained by the fact that Apple’s 040 memory manager only allocates blocks on 16-byte boundaries but I’m not sure what’s causing the second case. Perhaps it’s something the memory manager does when it shuffles blocks around in the heap and needs to take the block header with them. See Figure 1 for the complete alignment statistics matrix.
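For readers who want to reproduce the alignment test, the 16x16 tally can be sketched in C as follows. The names gAlignCounts and record_alignment are hypothetical, and the sketch leaves out the trap patching that a real INIT would need.

```c
#include <stdint.h>

/* Hypothetical 16x16 tally: the row is the low four bits of the source
 * address, the column is the low four bits of the destination address. */
unsigned long gAlignCounts[16][16];

/* Called once per intercepted move. Only the addresses are inspected;
 * nothing is read or written through the pointers. */
void record_alignment(const void *srcPtr, const void *destPtr)
{
    unsigned src = (unsigned)((uintptr_t)srcPtr & 15u);
    unsigned dst = (unsigned)((uintptr_t)destPtr & 15u);
    gAlignCounts[src][dst]++;
}
```

A move from an address ending in 4 to another address ending in 4 lands in gAlignCounts[4][4], which is one of the two cells that came up regularly.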

ALIGNMENT

The nice thing about these results is that in most cases we’ll be able to use the Move16 instruction. Even in the case where the <source and destination addresses> mod 16 == 4 we can use Move16, after we’ve first moved 12 bytes to sync up on a 16-byte boundary. We can use Move16 for any cases that lie on the diagonal of Figure 1, for a total of 53.1% of the time.
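The arithmetic for syncing up on a 16-byte boundary can be worked through in a few lines of C. This is a sketch under stated assumptions: Move16Plan and plan_move16 are hypothetical names, and it counts single 16-byte lines, whereas MyBlockMoveData moves 64 bytes per loop iteration and hands the remainder to its Move.L and Move.B loops.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical breakdown of a front-to-back copy into three parts. */
typedef struct {
    size_t headBytes;   /* byte moves needed to reach a 16-byte boundary */
    size_t lines;       /* full 16-byte lines that Move16 could transfer */
    size_t tailBytes;   /* bytes left over after the last full line */
} Move16Plan;

/* Returns 1 and fills in *plan if the low four bits of the two addresses
 * are equal (a diagonal cell of the alignment matrix); returns 0 and
 * leaves *plan untouched otherwise. */
int plan_move16(uintptr_t src, uintptr_t dst, size_t byteCount,
                Move16Plan *plan)
{
    if (((src ^ dst) & 15u) != 0)
        return 0;

    /* Distance from src up to the next 16-byte boundary (0 if aligned). */
    size_t head = (size_t)((16u - (src & 15u)) & 15u);
    if (head > byteCount)
        head = byteCount;

    plan->headBytes = head;
    plan->lines     = (byteCount - head) / 16;
    plan->tailBytes = (byteCount - head) % 16;
    return 1;
}
```

For a 100-byte move between two addresses that both end in 4, the plan is 12 head bytes, five 16-byte lines and 8 tail bytes.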

Similar to how I’m concerned with 16-byte boundaries in the 68040 version of MyBlockMoveData, pure 68000 versions of BlockMove spend a lot of time worrying about odd addresses. This is because they want to use Move.W or Move.L to copy the data wherever possible (for efficiency). But because the 68000 won’t allow word or long word accesses to odd addresses (you’ll get an address error if you try it) they can’t always do that. Consider the case where the source and destination addresses differ by 1 (maybe you’re sliding a buffer up or down by one byte): the 68000 version of BlockMove is forced to use a Move.B loop because one of the source or destination pointers will be odd at every point during the loop. However, if the source and destination pointers are both odd to begin with, BlockMove can move one byte, increment the pointers and then fall into its Move.W or Move.L loop because it has word-aligned pointers at that point. Since it’s optimal to reference longs on addresses divisible by four, BlockMove could check for long-aligned pointers at this point and if they aren’t then move one word so that they are long-word aligned for subsequent Move.L’s. Even on machines like the 68020 and 68030 you want to pay attention to non-aligned memory reads and writes for maximum performance.

On the 68040, however, we can forget about all of this. The reason is that for cachable reads and writes there is no performance penalty for reading or writing a long to an odd address. (Actually, that’s a slight lie; but the extra time caused by two cache misses instead of one (when the 4 bytes you’re accessing overlap two cache lines) is barely noticeable if you time several million such unaligned accesses -- and it is certainly less than the time to check for and correct for this case with byte and word moves.) This greatly simplifies the preflighting code of the MyBlockMoveData routine because all it has to check for is the Move16 case (where the low 4 bits of the source and destination addresses are equal) and then, if we’re not in that case, drop into a Move.L loop. It does have to worry about the few bytes at the beginning or end that won’t be handled by the Move16 or Move.L loop but other than that it’s pretty simple (and fast).

Note that I said “cachable” reads and writes in the paragraph above. If someone is running with the 040 data cache turned off (via the Cache Switch control panel, for instance) then MyBlockMoveData is only about 1.5x faster than Apple’s for moves of 128 bytes or less (with arbitrary alignment) because I don’t worry about the non-cachable case and Apple does. For large unaligned moves my routine is slightly slower than Apple’s BlockMove. But I think it’s true that most people run with the caches on most of the time so I’m not bothered by this limitation of my code. The performance gains in the cachable case make this non-cachable degradation worth it. Also, it makes the code quite a bit smaller by not trying to arrange for word and long word alignment (but code size is not as important as speed in a routine as core as BlockMove): my routine is 226 bytes while Apple’s is over 900 (not including the data cache flushing code they call).

If I were Apple, I would put at least four versions of BlockMove in the ROM, with two external entry points: one for moving code or things that might be code (like BlockMove as we know it today) and one for moving data that is definitely not code (like MyBlockMoveData given here). Each of those two versions, on the 68040, would have two internal variants: one for the data-cache-is-on case that doesn’t worry about reading and writing words and longs at odd addresses (like MyBlockMoveData given here) and one for the data-cache-is-off case that does worry about reading and writing at odd addresses (which would be very similar to the ideal 680x0 version where ‘x’ is less than ‘4’). Programmers would then have to decide whether to call BlockMoveCode or BlockMoveData on a per-instance basis as they are coding. The calls themselves would make a run-time decision (based on the data cache on/off) about which of their two internal versions they should use. And only the BlockMoveCode cache-is-on variant would ever flush the data cache.

MOVE16

There are two nuances involving the Move16 instruction that you should know about. The first one is not in any book that I’ve seen and is something I only learned of recently from DTS (thanks Dave Radcliffe): there’s a bug in some early versions of the 68040 chip (including some shipped Quadras) that requires you to use a Nop instruction before any set of Move16 instructions. The problem is that if you have a pending write to an address subsequently referenced by a Move16 instruction that executes before the pending write completes you’ll get bogus data. The Nop instruction flushes the instruction pipeline (including the pending write) and eliminates the possibility that the bug will show up. Strictly speaking, I don’t think I need the Nop given the instructions that execute in my code before the first Move16 but I left it in there for instructional purposes and because it might be needed for some other rare set of circumstances on certain batches of 040’s.

The other thing to know about Move16 is that because it doesn’t affect the condition codes you can interleave it with an instruction that does affect the condition codes (but doesn’t reference memory) for better performance. For instance, this is one obvious way to write a 64 byte transfer loop with Move16:

@1 Move16 (A0)+,(A1)+
 Move16 (A0)+,(A1)+
 Move16 (A0)+,(A1)+
 Move16 (A0)+,(A1)+
 Sub.L  #64,D1
 Bne.S  @1

But that loop can be improved like this:

@1 Move16 (A0)+,(A1)+
 Move16 (A0)+,(A1)+
 Move16 (A0)+,(A1)+
 Sub.L  #64,D1
 Move16 (A0)+,(A1)+
 Bne.S  @1

Now the Sub.L instruction executes in parallel with the third of the four Move16 instructions and the Bne.S instruction executes in parallel with the fourth Move16 instruction. This kind of interleaving is common in optimized 68040 code but does tend to make the code somewhat harder to read.

24-bit vs. 32-bit MODE

Normally when dealing with pointers, which MyBlockMoveData does, you can forget about 24-bit vs. 32-bit memory mode. However, if you’re going to use pointers in greater-than or less-than comparisons (as opposed to simple equality and non-equality comparisons) then you have to strip them before you compare them. Stripping is the process of clearing the upper 8-bits of a 32-bit pointer when you’re in 24-bit memory mode. The best way to do this is to call StripAddress (or ‘and’ the pointer with the cached result of StripAddress(-1) for better performance). But executing StripAddress within MyBlockMoveData is too slow, and we need to do it twice: once for the source pointer and once for the destination pointer.

To solve this problem I resorted to using a global Boolean, gIn24BitMode, that you set once during program initialization and then again each time you change the memory mode with SwapMMUMode:

gIn24BitMode = GetMMUMode() == 0;

The other choice was to reference the low memory global MMU32bit (0x0CB2) directly from within MyBlockMoveData but that would certainly bring the code police down upon my head. You could, of course, set gIn24BitMode to FALSE during program initialization and then just make sure you pass stripped pointers to MyBlockMoveData if you don’t want to worry about gIn24BitMode maintenance.

In any case, if gIn24BitMode is non-zero then MyBlockMoveData will ‘and’ the two pointers with 0x00FFFFFF to make them 32-bit clean for subsequent comparison (code police please note the lack of any reference to Lo3Bytes; how am I doing? Can I go out and play now?).
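The masking step can be modeled in portable C. This is a sketch only: addresses are represented as plain 32-bit integers, and gIn24BitModeSketch and strip_for_compare are hypothetical stand-ins for the gIn24BitMode flag and the two Andi.L instructions in MyBlockMoveData.

```c
#include <stdint.h>

/* Stand-in for the gIn24BitMode flag; nonzero means 24-bit mode. */
int gIn24BitModeSketch = 0;

/* In 24-bit mode, clear the upper 8 bits of a 32-bit address so that two
 * addresses can be compared with < and >. In 32-bit mode all 32 bits are
 * significant, so the address is returned unchanged. */
uint32_t strip_for_compare(uint32_t address)
{
    return gIn24BitModeSketch ? (address & 0x00FFFFFFu) : address;
}
```

Without the mask, a 24-bit-mode pointer such as 0x80001000 would compare as greater than 0x00002000 even though it refers to a lower address.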

WHY COMPARE POINTERS?

What’s all this fuss with comparing pointers? Why not just move the bytes and forget about it? The reason is that you can’t always move the bytes in order from front to back. Specifically, if the source and destination areas overlap and srcPtr < dstPtr then you have to move the bytes back to front to avoid overwriting bytes you haven’t yet moved.

Suppose srcPtr = 1000, dstPtr = 1001 and byteCount = 2. You could move the bytes like this:

dstPtr[0] = srcPtr[0];
dstPtr[1] = srcPtr[1];

but if you did that you’d get wrong results because dstPtr[0] maps to the same byte location as srcPtr[1]. The first instruction sets dstPtr[0] okay but in the process overwrites srcPtr[1] (because they’re the same thing). Instead, you need to move the bytes back-to-front like this:

dstPtr[1] = srcPtr[1];
dstPtr[0] = srcPtr[0];

This back-to-front move is only necessary if (1) srcPtr < dstPtr and (2) (dstPtr - srcPtr) < byteCount. If either of these conditions is not true then you can use the front-to-back move. You want to use front-to-back moves when you can because that’s the only case when you have a chance at using the Move16 instruction (since it doesn’t have a pre-decrement mode).
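The direction test is easy to express in C. The following is a byte-at-a-time sketch meant only to show the decision; must_move_back_to_front and move_bytes are hypothetical names, and none of the Move16 or Move.L optimizations are included.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when the copy must run back to front, 0 when front to back
 * is safe. Mirrors the two conditions in the text: src < dst and
 * (dst - src) < byteCount. */
int must_move_back_to_front(uintptr_t src, uintptr_t dst, size_t byteCount)
{
    return src < dst && (dst - src) < byteCount;
}

/* Overlap-safe copy, one byte at a time. */
void move_bytes(const unsigned char *src, unsigned char *dst,
                size_t byteCount)
{
    if (must_move_back_to_front((uintptr_t)src, (uintptr_t)dst, byteCount)) {
        /* Start at the last byte and work toward the first. */
        while (byteCount > 0) {
            byteCount--;
            dst[byteCount] = src[byteCount];
        }
    } else {
        for (size_t i = 0; i < byteCount; i++)
            dst[i] = src[i];
    }
}
```

With srcPtr = 1000, dstPtr = 1001 and byteCount = 2, the test returns 1 and the bytes are moved last to first, as in the example above.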

BLOCKMOVE RESULTS

Comparing MyBlockMoveData to the system’s BlockMove is not fair in raw terms because BlockMove has the trap overhead and always clears the data cache. Still, those are two reasons for MyBlockMoveData to exist. From an application point of view, it doesn’t really matter who is moving the bytes as long as they get moved in a quick and orderly fashion. So, given a couple of functional differences that won’t affect application non-code moves, here’s how they compare on a Quadra 700:

Test                     Improvement over BM

Small moves test         3.2x
Weighted average test    1.8x
Big moves test           1.1x

The small moves test involved moving 0 to 128 bytes with every possible combination of alignment for srcPtr and dstPtr (there are 256 combinations of alignment and 129 possible values for byteCount). This represents, among other small moves, your typical string manipulation calls to BlockMove. Some would argue that you shouldn’t be using BlockMove for such small moves but, like the 68040 designers themselves, we should optimize for the installed base of existing code (it’s not a perfect world).

The weighted average test used the results of my INIT statistics gathering to make the set of calls to both BlockMove and MyBlockMoveData that represent typical usage by a variety of apps.

The big moves test represents shuffling large blocks (8K-64K) that begin on 4-byte addresses. This is where the smallest improvement is noticed because the preflighting code time is insignificant when compared to the actual Move.L or Move16 data transfer loop time.

DBRA OPTIMIZATION

There are a couple of non-obvious optimizations in MyBlockMoveData. One of them has to do with this sequence of instructions:

 Andi   #LongBytesPerLoop-1,D2
 Beq.S  @2
 Subq   #1,D2
@1 Move.B (A0)+,(A1)+
 Dbra   D2,@1
@2

Some might think that this would be better:

 Andi   #LongBytesPerLoop-1,D2
 Bra.S  @2
@1 Move.B (A0)+,(A1)+
@2 Dbra D2,@1

While it’s true that this second version is two bytes smaller and perhaps uses the Dbra instruction more like it was intended (there was a letter to MacTech in the August 1992 issue along these lines complaining that I had used the first construct where I should have used the second), it is also true that this second version is slower because of two branches taken in the common case where D2 is not zero. Assume D2 is not zero and compare the first case: Andi, Beq (not taken), Subq, Move.B loop, with the second case: Andi, Bra (taken), Dbra (branch taken), Move.B loop. Because of the 68040’s pipelining the first case cruises right along uninterrupted because no branches are taken before entering the Move.B loop. However, in the second case the instruction pipeline is disrupted twice with two taken branches before any bytes are moved.

EOR OPTIMIZATION

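Just before the Move16 loop, after I’ve determined that the low four bits of srcPtr and dstPtr are the same, I need to move (16 - D0) bytes one at a time to reach a 16-byte boundary, and the Dbra loop that moves them wants a count one less than that: 15 - D0. If you ever need to evaluate an expression like x = (p - 1) - x where p is a compile-time constant power of 2 and x < p, you can use exclusive or to do it faster and in fewer instructions: x ^= p - 1. So, instead of this (which yields 16 - D0 and would still need a Subq #1,D0 before the Dbra):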

 Neg    D0
 Add    #16,D0

I use this:

 Eori   #0x000F,D0
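The exclusive-or identity is easy to check in C. One detail is worth spelling out: for a power of two p and 0 <= x < p, x ^ (p - 1) equals (p - 1) - x, which is one less than p - x. That is exactly the count a Dbra loop wants, since Dbra runs its loop body count + 1 times. The function names below are hypothetical.

```c
/* For p a power of two and 0 <= x < p, flipping the low bits of x
 * yields (p - 1) - x. */
unsigned flip_low_bits(unsigned x, unsigned p)
{
    return x ^ (p - 1u);
}

/* Exhaustively check the identity for one power of two p.
 * Returns 1 if it holds for every x in 0..p-1, 0 otherwise. */
int check_identity(unsigned p)
{
    for (unsigned x = 0; x < p; x++)
        if (flip_low_bits(x, p) != (p - 1u) - x)
            return 0;
    return 1;
}
```

With x = 4 and p = 16 the result is 11, and a Dbra loop entered with a count of 11 moves 12 bytes, which is the distance from an address ending in 4 to the next 16-byte boundary.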

FORWARD vs. BACKWARD OPTIMIZATION

When you look at the forward Move.L loop and compare it to the backwards Move.L loop you’ll notice that I use four Move.L instructions in the forward case and only two in the backwards case. As much as I’d like to, I can’t explain this. It’s not that I don’t want to tell you, it’s just that I don’t know the answer. I timed it both ways (as well as several other ways) and these two came up as the optimal number of Move.L instructions for their given cases. From what little timing information is given in the MC68040 Designer’s Handbook it looks like Move.L (A0)+,(A1)+ should be exactly equal in time to Move.L -(A0),-(A1). Maybe the observed difference has to do with sunspots or something...

CLEARING MEMORY

Apple’s NewPtrClear and NewHandleClear routines (as well as NewPtrSysClear and NewHandleSysClear) use a Clr.B instruction at their core to clear the memory they’ve just allocated. This is inefficient for two reasons: (1) the Clr instruction is slow (you should Move a register whose value is zero) and (2) they’re dealing with bytes when they could be using longs.

The MyNewPtrClear and MyNewHandleClear routines solve both of these problems. They take advantage of the fact that the memory manager always allocates blocks that begin on 4-byte boundaries (in all Macs) by not checking for odd bytes at the beginning of the area to be cleared -- they just start off clearing longs. They clear 64 bytes at a time initially, then 8 bytes at a time and lastly 1 byte at a time (in each case they do it by using Move to move a register whose value is zero to memory). Because of more efficient clearing, MyNewPtrClear and MyNewHandleClear are 1.5x faster than NewPtrClear and NewHandleClear.

The four clearing functions I give here take advantage of Think C’s multiple entry points feature. This is an efficient way of sharing code because it saves the overhead of calling a separate ClearMemory function from within each of those four functions (saves pushing/popping parameters as well as the Bsr/Rts pair). Efficient as it may be, it can lead to bizarre crashes if you modify those four routines without understanding how Think allocates/deallocates parameters, registers and stack variables. If you want to change them but don’t understand precisely how Think works then you should factor out the clearing code to a ClearMemory function that takes a pointer and a byteCount and then make all four functions call that ClearMemory function. Once you’ve done that (and removed the multiple entry points from everything), you can change them as much as you like without worrying about what Think is doing behind the scenes.
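A factored-out ClearMemory along those lines might look like the following portable C sketch. It is not the listing’s code: it assumes the pointer is 4-byte aligned (as Memory Manager blocks are said to be above), it uses size_t rather than Size, and it leaves the unrolling of the 64-byte loop to the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared helper: zero byteCount bytes starting at ptr.
 * Assumes ptr is 4-byte aligned. */
void ClearMemory(void *ptr, size_t byteCount)
{
    uint32_t *lp = (uint32_t *)ptr;
    const uint32_t zero = 0;   /* plays the role of the zeroed D0 */

    /* Clear 64 bytes (16 longs) at a time. */
    while (byteCount >= 64) {
        for (int i = 0; i < 16; i++)
            *lp++ = zero;
        byteCount -= 64;
    }

    /* Clear 8 bytes (2 longs) at a time. */
    while (byteCount >= 8) {
        *lp++ = zero;
        *lp++ = zero;
        byteCount -= 8;
    }

    /* Clear the last 0 to 7 bytes one at a time. */
    unsigned char *bp = (unsigned char *)lp;
    while (byteCount > 0) {
        *bp++ = 0;
        byteCount--;
    }
}
```

Each of the four allocation functions would then call ClearMemory on the new block after a successful allocation, at the cost of one extra function call per allocation.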

/*****************************************************
 * MyMemMgr.h
 ****************************************************/

void    MyBlockMoveData (const void *srcPtr,
 void *destPtr, Size byteCount);
Handle  MyNewHandleClear (Size theSize);
Handle  MyNewHandleSysClear (Size theSize);
Ptr     MyNewPtrClear (Size theSize);
Ptr     MyNewPtrSysClear (Size theSize);

extern Boolean   gIn24BitMode;

/*****************************************************
 * MyMemMgr.c
 *
 * Optimized 68040 versions of BlockMove, NewPtrClear
 * and NewHandleClear.
 *
 * Mike Scanlin  8 Mar 1992
 ****************************************************/

#include "MyMemMgr.h"

/* Think C doesn’t know this instruction:
 * Move16 (A0)+,(A1)+
 */
#define Move16A0ToA1 DC.W 0xF620, 0x9000

/* Move16BytesPerLoop must be a power of 2 and
 * agree with # of Move16 instructions per loop
 * in MyBlockMoveData
 */
#define Move16BytesPerLoop 64

/* LongBytesPerLoop must be a power of 2 and 
 * agree with # of Move.L instructions per loop 
 * in MyBlockMoveData
 */
#define LongBytesPerLoop  16

/* SmallNumBytes can’t be less than 16 */
#define SmallNumBytes 16

/* ClearLongBytesPerLoop must be a power of 2 and
 * agree with # of Move.L instructions per loop
 * in MyNewPtrClear
 */
#define ClearLongBytesPerLoop 64

/* ClearBytesPerLoop must be a power of 2 and
 * agree with # of Move.L instructions per loop
 * in MyNewPtrClear
 */
#define ClearBytesPerLoop 8

/* gIn24BitMode needs to be set to TRUE if the
 * system is in 24-bit mode when you call
 * MyBlockMoveData; FALSE if in 32-bit mode
 */
Boolean gIn24BitMode;

/*****************************************************
 * MyBlockMoveData
 *
 * Equivalent to the system’s BlockMove except that
 * it doesn’t flush the 68040 data cache on exit so
 * it shouldn’t be used to BlockMove code.
 ****************************************************/
void MyBlockMoveData(const void *srcPtr, 
 void *destPtr, Size byteCount)
{
 asm {

 ;get parameters into registers

 Move.L srcPtr,D0
 Move.L destPtr,D1
 Move.L byteCount,D2

 ;if we’re in 24-bit mode we need to StripAddress
 ;the addresses before comparison
 
 Tst.B  gIn24BitMode
 Beq.S  @1
 Andi.L #0x00FFFFFF,D0
 Andi.L #0x00FFFFFF,D1
 
 @1 Move.L D0,A0
 Move.L D1,A1
 
 ;if dst > src then we might need to move bytes
 ;last to first
 
 Cmp.L  A0,A1
 Bhi.S  @CheckForNonOverlapping
 
 ;if dst == src then we’re done
 
 Beq.S  @8

 MoveFrontToBack:
 
 ;if we’re only doing a few bytes, skip the
 ;tricky stuff
 
 Cmp.L  #SmallNumBytes,D2
 Bls.S  @10

 ;if the low 4 bits of src and dst are equal
 ;then we can use Move16

 Andi   #0x000F,D0
 Andi.L #0x0000000F,D1
 Cmp    D0,D1
 Bne.S  @ForwardLongLoop
 Tst    D0
 Beq.S  @4

 ;do the first few bytes at the beginning,
 ;until we’re 16-byte aligned

 Eori   #0x000F,D0 ;D0 = 15 - D0 (Dbra count for 16 - D0 bytes)
 @2 Move.B (A0)+,(A1)+
 @3 Dbra   D0,@2
 Add.L  D1,D2      ;D2 -= the (16 - D1) bytes just moved
 Sub.L  #16,D2

 ;move groups of Move16BytesPerLoop bytes
 
 @4 Move.L D2,D1
 Andi.L #Move16BytesPerLoop-1,D2
 Sub.L  D2,D1
 Beq.S  @ForwardLongLoop
 
 Nop    ;compensate for chip bug on some 040s
 
 @5 Move16A0ToA1
 Move16A0ToA1
 Move16A0ToA1
 Sub.L  #Move16BytesPerLoop,D1
 Move16A0ToA1
 Bne.S  @5

 ForwardLongLoop:

 ;move groups of LongBytesPerLoop bytes

 Move.L D2,D1
 Bra.S  @7

 @6 Move.L (A0)+,(A1)+
 Move.L (A0)+,(A1)+
 Move.L (A0)+,(A1)+
 Move.L (A0)+,(A1)+

 @7 Sub.L  #LongBytesPerLoop,D1
 Bpl.S  @6
 
 ;move the last few remaining bytes
 
 Andi   #LongBytesPerLoop-1,D2
 @8 Beq.S  @Exit
 Subq   #1,D2
 @9 Move.B (A0)+,(A1)+
 @10 Dbra  D2,@9

 Bra.S  @Exit

 CheckForNonOverlapping:
 
 ;if we’re not overlapping, use the front-to-back
 ;loops so Move16 has a chance
 
 Sub.L  A0,D1
 Cmp.L  D1,D2
 Bhi.S  @MoveBackToFront
 Move   A1,D1
 Bra.S  @MoveFrontToBack

 MoveBackToFront:

 ;set the pointers to one past the last byte
 
 Add.L  D2,A0
 Add.L  D2,A1

 ;if we’re only doing a few bytes, skip the
 ;tricky stuff

 Cmp.L  #SmallNumBytes,D2
 Bls.S  @14

 ;move groups of LongBytesPerLoop/2 bytes

 Move.L D2,D1
 Bra.S  @12
 
 @11 Move.L -(A0),-(A1)
 Move.L -(A0),-(A1)
 
 @12  Sub.L #LongBytesPerLoop/2,D1
 Bpl.S  @11
 
 ;move the last few remaining bytes

 Andi   #(LongBytesPerLoop/2)-1,D2
 Beq.S  @Exit
 Subq   #1,D2
 @13 Move.B -(A0),-(A1)
 @14 Dbra   D2,@13
 
 Exit:
 
 }
}

/* Declare extra entry points so that functions
 * can share the memory clearing code.
 */
void ClearHandleBytes(void);
void ClearPtrBytes(void);
void ClearBytes(void);

/*****************************************************
 * MyNewHandleClear
 *
 * Faster version of the system’s NewHandleClear.
 ****************************************************/
Handle MyNewHandleClear(Size theSize)
{
 register Handle h;
 
 if (h = NewHandle(theSize)) {
 
 asm {
 Bra    ClearHandleBytes;
 }
 }
 
 return (h);
}

/*****************************************************
 * MyNewHandleSysClear
 *
 * Faster version of the system’s NewHandleSysClear.
 ****************************************************/
Handle MyNewHandleSysClear(Size theSize)
{
 register Handle h;
 
 if (h = NewHandleSys(theSize)) {

 asm {
 
 extern ClearHandleBytes:
 
 Move.L h,A0
 Move.L (A0),A0
 Bra    ClearBytes
 }
 }
 
 return (h);
}

/*****************************************************
 * MyNewPtrSysClear
 *
 * Faster version of the system’s NewPtrSysClear.
 ****************************************************/
Ptr MyNewPtrSysClear(Size theSize)
{
 register Ptr    p;
 
 if (p = NewPtrSys(theSize)) {
 
 asm {
 Bra    ClearPtrBytes
 }
 }
 
 return (p);
}

/*****************************************************
 * MyNewPtrClear
 *
 * Faster version of the system’s NewPtrClear.
 ****************************************************/
Ptr MyNewPtrClear(Size theSize)
{
 register Ptr    p;
 
 if (p = NewPtr(theSize)) {
 
 asm {

 extern ClearPtrBytes:
 
 Move.L p,A0
 
 extern ClearBytes:

 ;init bytesUntilDone
 
 Move.L theSize,D2
 
 ;init the seed value
 
 Moveq  #0,D0
 
 ;clear groups of ClearLongBytesPerLoop bytes
 
 Move.L D2,D1
 Bra.S  @2
 
 @1 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 Move.L D0,(A0)+
 
 @2 Sub.L  #ClearLongBytesPerLoop,D1
 Bpl.S  @1
 
 ;clear groups of ClearBytesPerLoop bytes

 Andi   #ClearLongBytesPerLoop-1,D2
 Move   D2,D1
 Bra.S  @4

 @3 Move.L D0,(A0)+
 Move.L D0,(A0)+
 
 @4 Subq   #ClearBytesPerLoop,D1
 Bpl.S  @3
 
 ;clear the last few remaining bytes
 
 Andi   #ClearBytesPerLoop-1,D2
 Beq.S  @Exit
 Subq   #1,D2
 @5 Move.B D0,(A0)+
 Dbra   D2,@5
 
 Exit:
 
 }
 }
 
 return (p);
}

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.