Recently, I fiddled with the drawing code so that sprites could be shifted as they are drawn. The result is that 100's of KB of memory are saved at the expense of the speed of drawing sprites to the screen.
Here is the assembly for the new and the old loops. Sorry, the indentation is all messed up.
Code: Select all
New inner loop @@byteloop: mov ax,[es:di] ; 5 rol ax, cl ; 8 (5 + n; avg case n = 3) and al,[si] ; 7 or al,[bx+si] ; 7 ror ax, cl ; 8 (5 + n; avg case n = 3) inc si ; 2 mov [es:di], ax ; 3 inc di ; 2 dec bp ; 2 jnz @@byteloop ; 10 (7 + m; m = 3) or 3 if no jump // TOTAL = 52 (45 if last byte on line, but bp needs to be loaded again) Old Inner loop @@byteloop: mov al,[es:di] ; 5 and al,[si] ; 7 or al,[bx+si] ; 7 inc si ; 2 stosb ; 3 loop @@byteloop ; 11 (8 + m; m = 3) or 4 if no jump // TOTAL = 35 (28 if last byte on line, but cx needs to be loaded again) // UNROLLED TOTAL = 24 (Unrolled loops have no loop instruction at the end:) OPTIMIZED New Inner Loop @@byteloop: mov ax,[es:di] ; 5 REPEAT shifts rol ax, 1 ; 6 (rol reg, 1 = 2; avg case shifts = 3) ENDM and al,[si] ; 7 or al,[bx+si] ; 7 REPEAT shifts ror ax, 1 ; 6 (rol reg, 1 = 2; avg case shifts = 3) ENDM inc si ; 2 mov [es:di], ax ; 3 inc di ; 2 loop @@byteloop ; 11 (8 + m; m = 3) or 4 if no jump // TOTAL = 49 (42 if last byte on line, but cx needs to be loaded again) // UNROLLED Total = 38 (No loop instruction at end of byte)
32 bytes x 4 planes = 128 inner loops.
The OLD loop is optimized to draw WORDS. In best case, a 16x16 block takes 64 inner loops
But on average case (word mis-alignment; 8x8 sprites, etc) probably takes about 96 loops.
Secondly, the OLD algorithm has unrolled loops which eliminate the conditional jump after drawing a
byte or word to the screen. This could be added to the new algorithm, though.
Finally, there are instructions outside of the inner loop, but the time that they take to execute
is negligible compared to the inner loop.
How many 16x16 blocks of sprites would need to be drawn at once?
A "crowded" scene in keen might involve keen + 4 actors + a scorebox. Let's say that this takes
40 16x16 blocks. We're updating at 35 FPS, ideally. This table shows the number of clock cycles per
second that each algorithm would consume.
Code: Select all
Loop Per Block Crowded Screen @ 35 FPS (Block * 40 * 35) OLD (and unrolled) 24 * 96 = 2304 3 225 600 NEW 52 * 128 = 6656 9 318 400 NEW OPTIMIZED (and unrolled) 38 * 128 = 4864 6 809 600
By these calculations, the old code uses over 3 million cycles to draw a few sprites. But, this assumes
that every sprite is being updated 35 times per second, which is never the case. Some sprites are drawn smoothly,
(i.e., to "slide"), but others are drawn to "step," so they are only updated when they move. The number of
cycles that is used is much lower than what is listed in the table, but it is probably still a sizable portion
of the cycle budget. Nevertheless, these calculations show that the new inner loop, which shifts sprites as
they are drawn, would take around twice the number of cycles as the existing inner loop. Would it run on a 286? Maybe, though with some possible slowdown.