Source Modding: Shifting Sprites as they are Drawn.

Request patches for Keens 4-6.
Post Reply
lemm
Posts: 554
Joined: Sun Jul 05, 2009 12:32 pm

Source Modding: Shifting Sprites as they are Drawn.

Post by lemm » Mon Aug 24, 2015 9:20 pm

A while ago, I modified NY00123's Keen Dreams tech demo by including some of the code from Omnispeak in order to produce a code base that replicated Keen:Galaxy that compiles and runs in a 16-bit DOS environment.

Recently, I fiddled with the drawing code so that sprites could be shifted as they are drawn. The result is that 100's of KB of memory are saved at the expense of the speed of drawing sprites to the screen.

Image


Here is the assembly for the new and the old loops. Sorry, the indentation is all messed up.

Code: Select all

New inner loop

@@byteloop:
	mov	ax,[es:di]				; 5 	
	rol ax, cl					; 8 (5 + n; avg case n = 3)
	and	al,[si]				; 7 
	or	al,[bx+si]				; 7
	ror ax, cl					; 8 (5 + n; avg case n = 3)	
	inc	si					; 2
	mov [es:di], ax				; 3
	inc di						; 2
	dec bp					; 2
	jnz @@byteloop				; 10 (7 + m; m = 3) or 3 if no jump

// TOTAL = 52 (45 if last byte on line, but bp needs to be loaded again)


Old Inner loop

@@byteloop:
	mov	al,[es:di]				; 5
	and	al,[si]				; 7
	or	al,[bx+si]				; 7
	inc	si					; 2
	stosb						; 3
	loop	@@byteloop			; 11 (8 + m; m = 3) or 4 if no jump

// TOTAL = 35 (28 if last byte on line, but cx needs to be loaded again)
// UNROLLED TOTAL = 24 (Unrolled loops have no loop instruction at the end:)


OPTIMIZED New Inner Loop

@@byteloop:
	mov	ax,[es:di]				; 5 	
REPEAT shifts 
	rol ax, 1					; 6 (rol reg, 1 = 2; avg case shifts = 3)
ENDM

	and	al,[si]				; 7 
	or	al,[bx+si]				; 7

REPEAT shifts
	ror ax, 1 					; 6 (rol reg, 1 = 2; avg case shifts = 3)
ENDM
	inc	si					; 2
	mov [es:di], ax				; 3
	inc di						; 2
	loop	@@byteloop			; 11 (8 + m; m = 3) or 4 if no jump

// TOTAL = 49 (42 if last byte on line, but cx needs to be loaded again)
// UNROLLED Total = 38 (No loop instruction at end of byte)
Drawing a 16x16 block:
32 bytes x 4 planes = 128 inner loops.

The OLD loop is optimized to draw WORDS. In best case, a 16x16 block takes 64 inner loops
But on average case (word mis-alignment; 8x8 sprites, etc) probably takes about 96 loops.

Secondly, the OLD algorithm has unrolled loops which eliminate the conditional jump after drawing a
byte or word to the screen. This could be added to the new algorithm, though.

Finally, there are instructions outside of the inner loop, but the time that they take to execute
is negligible compared to the inner loop.

How many 16x16 blocks of sprites would need to be drawn at once?
A "crowded" scene in keen might involve keen + 4 actors + a scorebox. Let's say that this takes
40 16x16 blocks. We're updating at 35 FPS, ideally. This table shows the number of clock cycles per
second that each algorithm would consume.

Code: Select all

 Loop                                       Per Block                  Crowded Screen @ 35 FPS (Block * 40 * 35)
 OLD (and unrolled)                         24 * 96 =  2304 		 3 225 600
 NEW                                        52 * 128 = 6656              9 318 400
 NEW OPTIMIZED (and unrolled)               38 * 128 = 4864	         6 809 600

By these calculations, the old code uses over 3 million cycles to draw a few sprites. But, this assumes
that every sprite is being updated 35 times per second, which is never the case. Some sprites are drawn smoothly,
(i.e., to "slide"), but others are drawn to "step," so they are only updated when they move. The number of
cycles that is used is much lower than what is listed in the table, but it is probably still a sizable portion
of the cycle budget. Nevertheless, these calculations show that the new inner loop, which shifts sprites as
they are drawn, would take around twice the number of cycles as the existing inner loop. Would it run on a 286? Maybe, though with some possible slowdown.

Post Reply