Efficient 68000
Volume Number: 8
Issue Number: 2
Column Tag: Assembly workshop

Efficient 68000 Programming

If a new CPU speeds up inefficient code, what do you think it will do to efficient code?

By Mike Scanlin, MacTutor Regular Contributing Author

The dew is cold. It is quiet. I hear nothing except for crackling sounds coming from the little fire burning two inches to the left of my keyboard. It wasn’t there a minute ago. Seems that Doo-Dah, the god of efficient programming, is upset with me for typing “Adda.W #10,A0” and just sent me a warning in the form of a lightning bolt. I hate it when he does that. You’d think that after three years in his service, researching which 68000 assembly language instructions are the most efficient for any given job, he would lighten up a little. I guess that’s what makes him a god and me a mere mortal striving for enlightenment through the use of optimal instructions. As I extinguish the fire with a little Mountain Dew, I reflect upon the last three years.

My first lesson in the service of Doo-Dah was that proficiency in assembly language is a desirable skill in programmers so long as performance is a desirable attribute of software. The nay-sayers who depend upon faster and faster CPUs to make their sluggish software run at acceptable speeds don’t realize the underlying relativeness of the universe. If a new CPU will speed up a set of non-optimal instructions by 10%, then it will also speed up a set of optimal instructions by 10%. One should strive to be right on the edge of absolute maximum performance all the time. Users may not notice the difference in a 2K document but when they start working with 20MB documents they will soon be able to separate the optimal software from the non-optimal.

In the months following that lesson, I was given the task of compiling a list of instructions that should only very rarely appear in any program executing on a 68000 (and only then because you’re dealing with either self-modifying code or special hardware that depends on certain types of reads and writes from the processor). They are:

Don't Use                 Use                       Save

Move.B #0,Dx              Clr.B Dx                  4 cycles, 2 bytes

Move.W #0,Dx              Clr.W Dx                  4 cycles, 2 bytes

Clr.L Dx                  Moveq #0,Dx               2 cycles

Move.L #0,Dx              Moveq #0,Dx               8 cycles, 4 bytes

Move.L #0,Ax              Suba.L Ax,Ax              4 cycles, 4 bytes

Move.L #[-128..127],Dx    Moveq #[-128..127],Dx     8 cycles, 4 bytes

Move.L #[-128..127],ea    Moveq #[-128..127],Dx     4 cycles, 2 bytes
                          Move.L Dx,ea

Move.L #[128..254],Dx     Moveq #[64..127],Dx       4 cycles, 2 bytes
                          Add Dx,Dx

Move.L #[-256..-130],Dx   Moveq #[-128..-65],Dx     0 cycles, 2 bytes
                          Add.L Dx,Dx

Lea [1..8](Ax),Ax         Addq #[1..8],Ax           0 cycles, 2 bytes

Add.W #[9..32767],Ax      Lea [9..32767](Ax),Ax     4 cycles

Lea [-8..-1](Ax),Ax       Subq #[1..8],Ax           0 cycles, 2 bytes

Sub.W #[9..32767],Ax      Lea [-32767..-9](Ax),Ax   4 cycles

Asl.W #1,Dx               Add.W Dx,Dx               4 cycles

Asl.L #1,Dx               Add.L Dx,Dx               2 cycles

Cmp.x #0,ea               Tst.x ea                  4-10 cycles, 2 bytes

And.L #$0000FFFF,Dx       Swap Dx                   4 cycles
                          Clr.W Dx
                          Swap Dx
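The two Moveq-then-Add rows do not spell out which constants they can actually build. The following Python sketch models only the arithmetic (it is not 68000 code, and the function name is made up for illustration), assuming Moveq sign-extends an 8-bit immediate and the Add simply doubles it:

```python
def moveq_then_double(k):
    """Model of Moveq #k,Dx followed by Add Dx,Dx (hypothetical helper)."""
    if not -128 <= k <= 127:
        raise ValueError("Moveq only takes an immediate in -128..127")
    return k + k

# Constants reachable from the immediates listed in the two table rows.
positive = sorted(moveq_then_double(k) for k in range(64, 128))
negative = sorted(moveq_then_double(k) for k in range(-128, -64))

print(positive[0], positive[-1])
print(negative[0], negative[-1])
print(all(v % 2 == 0 for v in positive + negative))
```

Under those assumptions only the even values in each range come out, so an odd constant such as 129 still needs some other sequence.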

In addition, if you don’t care about the values of the condition codes then the following may be optimized:

Don't Use                 Use                       Save

Move.W #nnnn,-(SP)        Move.L #ppppnnnn,-(SP)    4 cycles, 2 bytes
Move.W #pppp,-(SP)

Move.L #$0000nnnn,-(SP)   Pea $nnnn                 4 cycles, 2 bytes

Move.B #255,Dx            St Dx                     2 cycles, 2 bytes

Move.L #$00nn0000,Dx      Moveq #[0..127],Dx        4 cycles, 2 bytes
                          Swap Dx

Movem (SP)+,Dx            Move (SP)+,Dx             4 cycles
                          Ext.L Dx

Movem.L Dx,-(SP)          Move.L Dx,-(SP)           4 cycles, 2 bytes

Movem.L (SP)+,Dx          Move.L (SP)+,Dx           8 cycles, 2 bytes

Movem.L (SP)+,<2 regs>    Move.L (SP)+,<reg 1>      4 cycles
                          Move.L (SP)+,<reg 2>

Note that pushing 2 regs or popping 3 with Movem.L is equivalent in cycles to doing it with multiple Move.L’s, but popping 3 regs with Move.L’s costs you two extra bytes. An easy rule to remember is to always use Movem.L whenever you’re dealing with 3 or more registers.

There are other optimizations you can make with minimal assumptions. For instance, if you are making room for a function result then don’t use Clr:

Don't Use                 Use                       Save

Clr.W -(SP)               Subq #2,SP                6 cycles
_Random                   _Random

Clr.L -(SP)               Subq #4,SP                14 cycles
_FrontWindow              _FrontWindow

If you’re trying to set, clear, or change one of the low 16 bits of a data register and you don’t need to test it first, then don’t use these:

Don't Use                 Use                       Save

Bset #n,Dx                Or.W #mask,Dx             4 cycles
Bclr #n,Dx                And.W #mask,Dx            4 cycles
Bchg #n,Dx                Eor.W #mask,Dx            4 cycles
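The mask in each replacement is 1 shifted left by the bit number, inverted for the And. As a rough check of that equivalence, here is a small Python model of the three word-sized operations on a 32-bit register value. The helper names are invented for this sketch, and it assumes n is in 0..15:

```python
WORD = 0xFFFF  # the .W forms only touch the low 16 bits of Dx

def set_mask(n):
    """Mask for Or.W (set bit n) or Eor.W (flip bit n)."""
    return (1 << n) & WORD

def clear_mask(n):
    """Mask for And.W: every low-word bit set except bit n."""
    return ~(1 << n) & WORD

def or_w(dx, mask):
    # Or.W #mask,Dx: high word of Dx is left alone
    return dx | (mask & WORD)

def and_w(dx, mask):
    # And.W #mask,Dx: keep the high word, mask only the low word
    return (dx & ~WORD) | (dx & mask & WORD)

def eor_w(dx, mask):
    # Eor.W #mask,Dx: flip the selected low-word bits
    return dx ^ (mask & WORD)

print(hex(or_w(0x12340000, set_mask(3))))
print(hex(and_w(0x1234FFFF, clear_mask(3))))
print(hex(eor_w(0x12340008, set_mask(3))))
```

Unlike Bset, Bclr, and Bchg, the masked forms do not report the old state of the bit in the Z flag, which is why the text limits them to cases where you don't need to test the bit first.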

You should use registers wherever possible, not memory (because memory is much slower to access). If you need to test for a NIL handle or pointer, for instance, do this:

Don't Use                 Use                       Save

Move.L A0,-(SP)           Move.L A0,D0              16 cycles, 2 bytes
Addq #4,SP                Beq.S ItsNil
Beq.S ItsNil

Use the “quick” operations wherever you can. Many times you can reverse the order of two instructions to use a Moveq (since Moveq handles bigger numbers than Addq/Subq):

Don't Use                 Use                       Save

Move.L D0,D1              Moveq #10,D1              6 cycles, 4 bytes
Add.L #10,D1              Add.L D0,D1

Also, use two Addq’s or Subq’s when dealing with longs in the range of 9..16:

Don't Use                 Use                       Save

Addi.L #10,D0             Addq.L #2,D0              4 cycles, 2 bytes
                          Addq.L #8,D0

The following three optimizations will reduce the size of your program but at the expense of a few cycles. This is good for user interface code, but you probably don’t want to use these optimizations in tight loops where speed is important:

Don't Use                 Use                       Save

Move.B #0,-(SP)           Clr.B -(SP)               -2 cycles, 2 bytes
Move.W #0,-(SP)           Clr.W -(SP)               -2 cycles, 2 bytes
Move.L #0,-(SP)           Clr.L -(SP)               -2 cycles, 4 bytes

Most of the optimizations from here onward are only applicable in some cases. Many times you can use a slightly different version of the exact code given here to get an optimization that works well for your particular set of circumstances. These optimizations don’t always have the same set of side effects or overflow/underflow conditions that the original code has, so use them with caution.

Shifting left by 2 bits (to multiply by 4) should be avoided if you’re coding for speed:

Don't Use                 Use                       Save

Asl.W #2,Dx               Add.W Dx,Dx               2 cycles, -2 bytes
                          Add.W Dx,Dx

Use bytes for booleans instead of bits. They’re faster to access (and less code in some cases). If you have many booleans, though, bits may be the way to go because of reduced memory requirements (of the data, that is, not the code).

Don't Use                 Use                       Save

Btst #1,myBools(A6)       Tst.B aBool(A6)           4 cycles, 2 bytes
Btst #1,D0                Tst.B D0                  6 cycles, 2 bytes

Avoid the use of multiply and divide instructions like the plague. Use shifts and adds for immediate operands or loops of adds and subtracts for variable operands. For instance, to multiply by 14 you could do this:

Don't Use                 Use                       Save

Mulu #14,D0               Add D0,D0                 many cycles, -4 bytes
                          Move D0,D1
                          Lsl #3,D0
                          Sub D1,D0
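The sequence works because 14x = 16x - 2x. A quick Python model of the four instructions, with every step masked to 16 bits to stand in for the default word size, makes that easy to check. The function name is hypothetical and this is only a sketch of the arithmetic:

```python
WORD = 0xFFFF  # word-sized operations wrap at 16 bits

def times_14(d0):
    d0 = (d0 + d0) & WORD   # Add D0,D0   -> 2x
    d1 = d0                 # Move D0,D1  -> keep 2x
    d0 = (d0 << 3) & WORD   # Lsl #3,D0   -> 16x
    d0 = (d0 - d1) & WORD   # Sub D1,D0   -> 16x - 2x = 14x
    return d0

print(times_14(100))
print(times_14(4681))
```

One difference from Mulu is visible in the model: Mulu produces a full 32-bit product, while the word-sized shift-and-subtract version is only exact while 14x still fits in 16 bits (x up to 4681).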

If you have a variable source operand, but you know that it is typically small (and positive, for this example), then use a loop instead of a multiply instruction. This works really well in the case of a call to FixMul if you know one of the operands is a small integer -- you can avoid the trap overhead and the routine itself by using a loop similar to this one (in fact, the FixMul routine itself checks if either parameter is 1.0 before doing any real work):

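Don't Use                 Use                       Save

Mulu D1,D0                Move D0,D2                many cycles, -8 bytes
                          Neg D2
                          @1 Add D0,D2
                          Subq #1,D1
                          Bpl.S @1

The loop starts D2 at -D0 and exits only when D1 goes negative, so it adds D0 a total of D1+1 times. The product is left in D2, and a count of zero correctly yields zero.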
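One way to arrange such an add loop so that a count of zero also works is to start the accumulator at minus the multiplicand and add it count+1 times. The Python sketch below models that arrangement with 16-bit wraparound. It is an illustration under that assumption rather than a cycle-accurate model, and the function name is invented:

```python
WORD = 0xFFFF  # word-sized registers wrap at 16 bits

def multiply_by_adds(d0, d1):
    """Multiply d0 by a small non-negative count d1 using only adds."""
    d2 = (-d0) & WORD            # start at -d0 (Move D0,D2 / Neg D2)
    while True:
        d2 = (d2 + d0) & WORD    # add the multiplicand once per pass
        d1 -= 1                  # count down
        if d1 < 0:               # stop once the count has gone negative
            break
    return d2

print(multiply_by_adds(300, 5))
print(multiply_by_adds(7, 0))
```

The running time grows with the count, which is why this only pays off when the variable operand is known to be small.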

Likewise, for division, use a subtract loop if you know that the quotient isn’t going to be huge (and if the destination fits in 16 bits):

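Don't Use                 Use                       Save

Divu D1,D0                Moveq #0,D2               many cycles, -10 bytes
                          Bra.S @2
                          @1 Addq #1,D2
                          Sub D1,D0
                          @2 Cmp D1,D0
                          Bcc.S @1

The compare runs before every subtraction, so the loop stops as soon as D0 is smaller than D1. The quotient is left in D2 and the remainder in D0.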
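The idea behind a subtract loop is to compare before each subtraction and count how many times the divisor can be removed. A minimal Python sketch of that idea follows. The function name is made up, and it assumes unsigned operands and a nonzero divisor:

```python
def divide_by_subtracts(d0, d1):
    """Return (quotient, remainder) of d0 / d1 using only compare and subtract."""
    if d1 == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    d2 = 0                 # quotient counter
    while d0 >= d1:        # compare first, so the loop never over-subtracts
        d2 += 1
        d0 -= d1
    return d2, d0          # what is left in d0 is the remainder

print(divide_by_subtracts(10, 3))
print(divide_by_subtracts(3, 3))
```

The number of passes equals the quotient, so the loop only beats a divide instruction when the quotient is known to be small.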

Don’t use Bsr/Rts in tight loops where speed is important. Put the return address in an unused address register instead.

Don't Use                 Use                       Save

Bsr MyProc                Lea @1,A0                 8 cycles, -4 bytes
;<blah>                   Bra MyProc
                          @1 ;<blah>
MyProc:                   MyProc:
;<blah blah>              ;<blah blah>
Rts                       Jmp (A0)

You can eliminate a complete Bsr/Rts pair (or equivalent above) if the Bsr is the last instruction before an Rts by changing the Bsr to a Bra:

Don't Use                 Use                       Save

Bsr MyProc                Bra MyProc                24 cycles, 2 bytes
Rts

Don’t use BlockMove for moves of 80 bytes or less where you know the source and destination don’t overlap. The trap overhead and preflighting that BlockMove does make it inefficient for such small moves. Use this loop instead (assuming Dx > 0 on entry):

Don't Use                 Use                       Save

_BlockMove                Subq #1,Dx                many cycles, -6 bytes
                          @1 Move.B (A0)+,(A1)+
                          Dbra Dx,@1

I base this conclusion on time trials done on a Mac IIci with a cache card. The actual results were (for several thousand iterations):

Figure 1: How fast do blocks move?

I did the same tests on a Mac SE and found that it was only beneficial to call BlockMove on that machine for moves of 130 bytes or more. However, since you should optimize for the lowest common denominator across all machines, you should only use the Dbra loop for non-overlapping moves of 80 bytes or less.

Be warned, though: on the Quadras, BlockMove has been modified to flush the 040 caches because of the possibility that you (or the memory manager) are BlockMoving executable code. So don’t use the above loop for moving small amounts of code (like you might do in some INIT installation code). Apple did this for compatibility reasons with existing non-040 aware applications running in 040 copy-back mode (high performance mode). However, because of this, your non-code BlockMoves are unnecessarily clearing the caches, too. I don’t know if it’s worth it to write a dedicated BlockMove for non-code moves, but it seems like it’s worth doing and then timing to see if there’s a difference.

Unroll loops. At the expense of a few extra bytes you can make any tight loop run faster. This is because short branch instructions that are not taken are faster than those that are taken. Here’s an even faster version of the above loop:

   Subq #1,Dx
@1 Move.B (A0)+,(A1)+
   Subq #1,Dx
   Bcs.S @2
   Move.B (A0)+,(A1)+
   Subq #1,Dx
   Bcs.S @2
   Move.B (A0)+,(A1)+
   Dbra Dx,@1
@2

Beware when using the above trick, though, because it doesn’t work for long branches. In that case, a taken branch is faster than a branch not taken.
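To see that a three-way unrolled copy still moves exactly the requested number of bytes for any positive count, the control flow can be modeled in Python. This sketch mirrors a structure with two early exits and one decrement-and-loop at the bottom. The function name is invented, and it assumes a count greater than zero on entry:

```python
def unrolled_copy(src, count):
    """Copy `count` bytes from src using a loop body unrolled three times."""
    dst = bytearray()
    i = 0
    dx = count - 1                 # pre-decrement, as before a Dbra loop
    while True:
        dst.append(src[i]); i += 1 # first Move.B
        dx -= 1
        if dx < 0:                 # early exit (the not-taken short branch)
            break
        dst.append(src[i]); i += 1 # second Move.B
        dx -= 1
        if dx < 0:                 # second early exit
            break
        dst.append(src[i]); i += 1 # third Move.B
        dx -= 1                    # Dbra: decrement, fall out at -1
        if dx == -1:
            break
    return bytes(dst)

data = bytes(range(40))
print(unrolled_copy(data, 7) == data[:7])
```

Most passes fall through the two early-exit tests without branching, and only the bottom of the body takes a branch.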

Preserving pointers into relocatable blocks across code that moves memory: If you need to lock a handle because you’re going to call a routine that moves memory but the handle (and the dereferenced handle) isn’t a parameter to that routine, then you can usually avoid locking the handle with a trick (which has the desirable side effect of reducing memory fragmentation). Assume the handle is in A3 and the pointer into the middle of the block is in A2. All you really have to do is save/restore the offset into the block; you don’t care if the block moves or not:

Don't Use                 Use                       Save

Move.L A3,A0              Sub.L (A3),A2             many cycles, 4 bytes
_HLock
;<move memory>            ;<move memory>
Move.L A3,A0              Add.L (A3),A2
_HUnlock
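The trick rests on the fact that an offset into a block stays valid even when the block's address changes. The Python sketch below models a handle as an object holding a master pointer that a memory-moving call may rewrite. The class and function names are hypothetical and nothing here calls a real Memory Manager routine:

```python
class Handle:
    """Toy handle: `master` plays the role of the master pointer, (A3)."""
    def __init__(self, address):
        self.master = address

def keep_pointer_across(handle, pointer, move_memory):
    """Turn pointer into an offset, let memory move, then rebuild the pointer."""
    offset = pointer - handle.master   # Sub.L (A3),A2
    move_memory()                      # anything that may relocate the block
    return offset + handle.master      # Add.L (A3),A2

h = Handle(0x1000)

def relocate():
    # Assumption for the demo: the block is moved to a new address.
    h.master = 0x2400

new_pointer = keep_pointer_across(h, 0x1010, relocate)
print(hex(new_pointer))
```

If the block does not move, the subtraction and addition cancel and the pointer comes back unchanged, so the same two instructions are safe either way.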

If the end of a routine is executing the same set of instructions two or more times, then you may be able to use this trick to save some bytes (at the expense of a few cycles). If the end of the routine looks like a subroutine, then have it Bsr to itself, like this (this example is drawing a BCD byte in D3):

Don't Use                 Use                       Save

Ror #4,D3                 Ror #4,D3                 many bytes
Move.B D3,D0              Bsr @1
And #$000F,D0             Rol #4,D3
Add #'0',D0
Move D0,-(SP)
_DrawChar
Rol #4,D3
Move.B D3,D0              @1 Move D3,D0
And #$000F,D0             And #$000F,D0
Add #'0',D0               Add #'0',D0
Move D0,-(SP)             Move D0,-(SP)
_DrawChar                 _DrawChar
Rts                       Rts

Use multiple entry points to set common parameters. Suppose you have a routine that takes a boolean value in D0 as an input and suppose you call this routine 20 times with the value of True and 30 times with the value of False. It would save code if you made two entry points that each set D0, and then branched to common code. For instance:

Don't Use                 Use                       Save

St D0                     Bsr MyProcTrue            many bytes
Bsr MyProc
Sf D0                     Bsr MyProcFalse
Bsr MyProc
                          MyProcTrue:
                          St D0
                          Bra.S MyProc
                          MyProcFalse:
                          Sf D0
MyProc:                   MyProc:
;<blah>                   ;<blah>
Rts                       Rts

Clean up the stack with Unlk. If your routine already has a stack frame and you create some temporary data on the stack (in addition to the stack frame) then you don’t always need to remove it when you’re done with it -- the Unlk will clean it up for you. For instance, suppose you make a temporary Rect on the stack. You would normally remove it with Addq #8,SP but if it’s near the end of a function that does an Unlk, then leave the Rect there; it’ll be gone when the Unlk executes.
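To see why the leftover data disappears, it helps to model what Link and Unlk do to SP and A6. The Python sketch below is a simplified model (a dictionary stands in for stack memory, and the function name and addresses are made up) in which temporaries are pushed after the Link and never popped:

```python
def call_with_frame(sp, a6, locals_size, temp_size):
    """Return (sp, a6) after Link, some un-popped temporaries, and Unlk."""
    stack = {}
    # Link A6,#-locals_size: push old A6, point A6 at it, reserve locals
    sp -= 4
    stack[sp] = a6
    a6 = sp
    sp -= locals_size
    # Temporary data (for example an 8-byte Rect) pushed and never removed
    sp -= temp_size
    # Unlk A6: reload SP from A6, then pop the saved A6
    sp = a6
    a6 = stack[sp]
    sp += 4
    return sp, a6

print(call_with_frame(0x8000, 0x9000, 16, 8) == (0x8000, 0x9000))
```

Because Unlk reloads SP from A6 rather than adjusting it by a fixed amount, the result is the same no matter how many temporary bytes were left below the frame.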

Well, hopefully Doo-Dah has many more learned disciples now. Don’t forget to sacrifice a copy of FullWrite in his honor at least once a year. That makes him happy.

P.S. If you want even more 68000 optimizations there is an excellent article by Mike Morton in the September, 1986, issue of Byte magazine called “68000 Tricks and Traps” (pgs. 163-172). There are more than half a dozen tricks in that article that are not covered here (sorry for not listing them, but I didn’t want to get sued for plagiarism).

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.