Tips and Tidbits

Blistering Blitting

The June issue of MacTech highlighted some ways to move pixels around, but it did not mention a pair of little-known PowerPC assembly instructions, lmw and stmw (load multiple word and store multiple word), that can speed up pixel blitting and, for that matter, any memory-moving function.

When the data is aligned, a normal memory move takes 4 cycles. These two instructions are powerful because they take 3 + n cycles to move n words, so each additional word costs only 1 cycle. They behave differently on different PowerPC processors: the 601 treats them like a series of individual word loads and stores, while the 603e, 604, and G3 treat them as a true multiple move.
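To make the arithmetic concrete, here is a small C sketch of the cycle model quoted above. The function names are hypothetical, the figures are the ones stated in this tip rather than measurements, and applying the 4-cycle cost to every word moved individually is an assumption.

```c
/* Cycle model built only from the figures quoted in the text.
   Assumption: the 4-cycle cost applies to each word moved individually.
   Function names are illustrative, not part of any API. */
unsigned long single_move_cycles(unsigned long nWords)
{
    return 4UL * nWords;        /* 4 cycles per word */
}

unsigned long multiple_move_cycles(unsigned long nWords)
{
    return 3UL + nWords;        /* 3 + n cycles for n words */
}
```

For six words, this model gives 24 cycles for individual moves against 9 for a single multiple move.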

There are a few restrictions on the following code:

It works fastest when the baseAddrs are aligned.
The code is for 8-bit pixels (tip: change r23 to 12 for 16-bit pixels and to 6 for 32-bit pixels).
The width must be a multiple of 24 for 8-bit pixels (12 for 16-bit pixels, and 6 for 32-bit pixels). If it isn't, the copy will overrun the end of each line. Most of the overrun is overwritten by the next scan line, but not all of it. You have been warned! (Any destination world should have a few extra words at the end of its baseAddr buffer.)
This assumes the same palette for the source and destination worlds (in 8-bit mode).
This assumes the same depth for the source and destination worlds.
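The width and alignment restrictions above can be checked before calling the routine. The following C sketch shows one way to do it. Both helper names are hypothetical, and treating "aligned" as 4-byte alignment is an assumption based on the instructions moving 32-bit words.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when the width is a whole number of
   24-byte passes (24, 12, or 6 pixels at 8, 16, or 32 bits per pixel). */
int width_is_safe(long widthPixels, int depthBits)
{
    long pixelsPerPass;

    if (depthBits != 8 && depthBits != 16 && depthBits != 32)
        return 0;
    pixelsPerPass = 24 / (depthBits / 8);   /* 24, 12, or 6 */
    return widthPixels > 0 && widthPixels % pixelsPerPass == 0;
}

/* Hypothetical helper: returns 1 when the address is 4-byte aligned.
   Assumption: "aligned" in the text means word alignment. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```

A 640-pixel-wide world fails the width test at every depth, so it would need padding, or a separate copy for the leftover pixels on each line.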

The following is an example of how to move pixels three times faster than the fastest method presented in Fast Blit Strategies.

export SpeedCopy[DS]
export .SpeedCopy[PR]

toc
	tc SpeedCopy[TC],	SpeedCopy[DS]	;TOC entry "SpeedCopy" for
																	;transition vector "SpeedCopy"

		csect	SpeedCopy[DS]			 		;Define transition vector "SpeedCopy"
		dc.l		.SpeedCopy[PR]	 				;Pointer to code
		dc.l		TOC[tc0]								;Pointer to TOC
		dc.l		0

# Prolog: SpeedCopy
;void SpeedCopy(long height, long width, long destRowbytes,
;		unsigned long *dest, long srcRowbytes, unsigned long *src);

		csect	.SpeedCopy[PR]			;Prolog begins here

		;r3	= dest.height
		;r4	= dest.width
		;r5	= dest.rowbytes
		;r6	= dest.baseAddr
		;r7	= src.rowbytes
		;r8	= src.baseAddr

		stmw		r22,-40(SP)					;Save r22 thru r31 (10 words) below the stack pointer
		li			r22, 1							;Constant 1 for the line counter
		li			r23, 24							;Pixels per pass (24 bytes of 8-bit pixels)

@lineLoop:
		mr			r12,r4							;x = dest.width
		mr			r10,r6							;tmpdest = dest
		mr			r11,r8							;tmpsrc = src

@pixelLoop:
		lmw		r26,0(r11)					;Load 4 + 4 + 4 + 4 + 4 + 4 bytes from 
														;tmpsrc into r26 thru r31

		subf.	r12,r23,r12					;Subtract num pixels from total, test against 0
		addi		r11,r11,24					;Advance tmpsrc by 24 bytes, whatever the	
														;pixel size

		stmw		r26,0(r10)					;Store r26 thru r31 (24 bytes) to tmpdest
		addi		r10,r10,24					;Advance tmpdest by 24 bytes, whatever the
														;pixel size

		bgt		@pixelLoop					;Loop if the subtraction is greater than 0

		subf.	r3, r22, r3				;Subtract one line from height, test against 0
		add		r8,r8,r7						;Add src.rowbytes to src.baseAddr
		add		r6,r6,r5						;Add dest.rowbytes to dest.baseAddr

		bne		@lineLoop					;Loop if height not equal to 0

		lmw		r22,-40(SP)					;Restore r22 thru r31
		blr

end
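For checking results on a development machine, a portable C stand-in for the routine above can be handy. This is only a sketch. SpeedCopyRef is a hypothetical name, the argument order follows the register assignments in the listing (r3 through r8), and taking the width in bytes rather than pixels is an assumption made to keep the example independent of pixel depth.

```c
#include <string.h>

/* Hypothetical reference version of the copy loop, for testing only.
   Like the assembly, it always moves whole 24-byte chunks, so a width
   that is not a multiple of 24 bytes overruns the end of each line.
   Assumes height and widthBytes are positive. */
void SpeedCopyRef(long height, long widthBytes, long destRowBytes,
                  unsigned char *dest, long srcRowBytes,
                  const unsigned char *src)
{
    while (height-- > 0) {
        long x;

        for (x = 0; x < widthBytes; x += 24)
            memcpy(dest + x, src + x, 24);   /* one 6-word pass */

        dest += destRowBytes;                /* next destination line */
        src += srcRowBytes;                  /* next source line */
    }
}
```

Comparing the output of the two versions with memcmp on a few test worlds is a quick way to catch register or offset mistakes in the assembly.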

With a few changes this code can pixel double, copy every other line, or both.
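The tip does not show those variants, so the following C sketch is only one possible reading of what they do, written for 8-bit pixels. The function names and parameter layout are hypothetical, and the sketch illustrates the pixel arithmetic rather than the lmw/stmw technique itself.

```c
#include <string.h>

/* Copy every other source line: the source advances two rows for each
   destination row. Hypothetical helper, 8-bit pixels assumed. */
void copy_every_other_line(long destHeight, long widthBytes,
                           long destRowBytes, unsigned char *dest,
                           long srcRowBytes, const unsigned char *src)
{
    long y;

    for (y = 0; y < destHeight; y++)
        memcpy(dest + y * destRowBytes,
               src + 2 * y * srcRowBytes,
               (size_t)widthBytes);
}

/* Pixel double: each 8-bit source pixel becomes a 2 x 2 block in the
   destination. Hypothetical helper. */
void pixel_double_8(long srcHeight, long srcWidth,
                    long destRowBytes, unsigned char *dest,
                    long srcRowBytes, const unsigned char *src)
{
    long x, y;

    for (y = 0; y < srcHeight; y++) {
        const unsigned char *s = src + y * srcRowBytes;
        unsigned char *d0 = dest + (2 * y) * destRowBytes;
        unsigned char *d1 = d0 + destRowBytes;

        for (x = 0; x < srcWidth; x++) {
            d0[2 * x] = s[x];                /* double horizontally */
            d0[2 * x + 1] = s[x];
        }
        memcpy(d1, d0, (size_t)(2 * srcWidth));  /* double vertically */
    }
}
```

Copying every other line needs no new inner loop at all: stepping the source by twice its rowbytes and halving the height gives the same result.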

The best way to make a machine go faster is to make it do less. This is one of only a handful of cases where you can do more with less.

Brad Anderson anderson@rpmmusic.com
