Finding HEDs and DCTs

Tools, assembly, and file formats.
Post Reply
levellass
Posts: 3001
Joined: Wed Oct 11, 2006 12:03 pm
Location: Ngaruawahia New Zealand

Finding HEDs and DCTs

Post by levellass »

Andy Andy Andy.

I have said his name, will he come?


I have been informed by someone that there are easier ways of getting things such as the locations of the AUDIODCT or EGAHED files in Keen executables not to mention the GFXINFOE files and number of sounds, songs, etc.

I therefore ask any and all with knowledge of how thee things are found to tell me please. In simple language. Think BASIC, and a moron.

DO NOT give me source code, I never get it right, and barely understand it. Napalm will tell you how much pounding it took to get the simple notions of RLE into my head. This is not a smart girl you are dealing with. Baby steps.


So, will anyone help me in my hour of need?
User avatar
adurdin
Site Founder
Posts: 549
Joined: Fri Aug 29, 2003 11:27 pm
Location: Edinburgh, Scotland
Contact:

Post by adurdin »

I have been informed by someone that there are easier ways of getting things such as the locations of the AUDIODCT or EGAHED files in Keen executables not to mention the GFXINFOE files and number of sounds, songs, etc.
A DOS executable is often made up of a number of parts ("segments"). Some of these will be code, and some data. Some information will be found in the code segment, for example, the number of songs might have been compiled into a CMP AX, <number of songs> instruction. Finding these without disassembling the code and carefully tracing through it can be very difficult, so usually something like the number of songs is easier to determine by examining the song file format.

Larger bits of data, like the huffman compression dictionaries, end-of-game ASCII screens will have been compiled into the exe as tables of data. If they're very big, they will have a segment all of their own (actually, that is usually because they were in a source code file all of their own when the exe was compiled), or they'll be in the main data segment for the program (AZ in the Keen 4 disassembly, for example). The locations of segments can be found by reading the exe headers, but usually I ran the exe through a disassembler, which identified the segments for me, and gave me the offsets of each segment.

Finding or identifying the data requires that you know what you're looking for, or at least something about what it will looks like.

Knowing the huffman compression algorithm, I knew that the data (the pairs of 'left node' and 'right node' indices) would look like a bunch of short ints, starting with smaller numbers x00nn, and gradually bringing in higher numbers, x01nn. The huffman table would have a total of 255 entries (draw a row of 256 nodes for the 256 possible byte values; build a tree, by drawing new nodes and joining up pairs of nodes to its left and right. Keep doing this until there's only one node at the top of the tree. You'll have drawn 255 new nodes), making 511 total indices. Since the nodes are numbered from 0, the index of the top node would be 510, x01FE. But since it's the top node, its index won't be in the table, only those of its left and right node indices, one of which will have been the second-last new node, index 509, 0x1FD. And this will be the last or second-last entry in the table. The whole table would be 1020 bytes long (255 new nodes, times 2 short ints--left and right--per node, times 2 bytes per short int), probably padded with nulls to 1024 bytes (since padding data to multiples of 16 is commonly done to make data access quicker).

So having worked all that out, just started browsing through the exe. My hex editor fortunately showed the graphical symbols for x01, x02, ... (see http://en.wikipedia.org/wiki/Code_page_437), so I'd be able to identify the short ints in the character view in the hex editor by seeing every second column either being blanks (x00) or black faces (x01). So I could search for xFD x01, and as soon as I saw the pattern in every second column, I could be pretty sure I'd found it. A quick check that the data structure looked like it was 1020 bytes long (it was), and I was ready to try it.

I'd already written my huffman decompression algorithm, so I loaded the data table in and started trying to decompress the graphics. It failed. I was confused for a bit, then looked back at the hex editor and noticed that there were two huffman dictionaries there, one right after the other. So I tried the second one to decompress the graphics, and it worked (I still had issues with the graphics format, but I could see enough to make out parts of tiles and sprites). I guessed that the first of the dictionaries was for compressing the sound, which since turned out to be the case, but didn't pursue that then.

Finding EGAHEAD was easier. I knew that it would be a list of offsets into the EGAGRAPH file. I'd already managed to decompress all the chunks from EGAGRAPH, just by working on the assumption that they'd all be in the file stored one after the other. Each chunk began with a long integer, its decompressed size, so most of the chunks, being small, had a sequence ?? ?? 00 00, which is easy to identify, as the 00 00 was going to be very rare in the huffman-compressed data. Just looking at the file in the hex editor, I could see that the offsets of the first few chunks were: x00000000, x00000115, x00000121, x00000E9F. I used my hex editor to search through the exe for 00 00 00 00 15 01 00 00 21 01 00 00 9F 0E 00 00, but found nothing. I knew the offset had to be there--the egagraph was too large for the exe to load it all into memory, so the exe had to have an index of the chunks so it could load just the ones it needed--so I picked the offset which had the rarest-looking number, 00000E9F, and just searched for the last bit, 9F 0E. There was one occurrence of this in the exe, in the series of bytes 00 00 00 15 01 00 21 01 00 9F 0E 00 6F 13 00 01 17 00. I could immediately see the familiar 21 01 00 and 15 01 00 in here, so it was now obvious that the EGAHEAD was only storing three bytes of the long ints, to save space, since the egagraph was not large enough, so the topmost byte would always have been 00 anyway. I quickly checked the next offsets, x00136F and x001701, which were indeed where the next chunks started, so I knew I had found where EGAHEAD started. But I didn't know where it ended.

I scrolled down, looking through the hex dump, and noticed among the offsets quite a lot of FF FF FF entries, which I concluded were for chunk indices which were not actually in the file. (I still have no idea why they're there, why the chunks were not indexed from 0 to number-of-chunks - 1!). Anyway, the size of EGAGRAPH was 0x7F185, so the highest offsets would be like ?? ?? 07, so I scrolled down till I found offsets like this. They went as far as 85 F1 07 -- which would be the first byte past the end of the EGAGRAPH file, so that had to be the end of the EGAGRAPH.

Now interestingly, I could see the next two bytes in the exe were CD AB, followed by what looked like a bunch of long int offsets. I'd been experimenting with decompressing GAMEMAPS already, finding the start of the plane data by looking for the "!ID!", and extracting the plane offsets and sizes by reading bits of the header before that. I didn't understand the Carmack compression yet, but had been trying to decompress the levels with just the RLEW algorithm which had given me short bits of gibberish mixed in with actual rows of tiles from level. I knew that 0xABCD was the magic number indicating the start of an RLEW run, so seeing it in the exe did not feel like a coincidence. And it was not: I quickly looked at a few of the longs I was guessing were offsets, and found they were indeed the offsets of the map headers in GAMEMAPS. So now I'd found MAPHEAD in the exe, more by coincidence than searching.

I managed to figure out bits of the decarmackization (the near references), but didn't get it working fully until I'd looked at the Wolfenstein 3D source code, which shared that (along with a surprising amount of other code) with Keen 4. It wasn't until much later that I came across the TED5 source code.

The format of GFXINFOE was found in TED5.H (someone else told me about it, but I forget who) from the TED5 source code (it used to be on 3drealms' FTP site, but can still be found on the internet, such as here: http://cd.textfiles.com/cream/cream11/c ... s/ted5.zip). It was a bunch of ints named "num8", "num8m", "num16", "num16m", "off8", "off8m" etc., pretty obviously meaning "number of 8x8 tiles", "number of 8x8 masked tiles", "number of 16x16 tiles", "number of 16x16 masked tiles", "offset of 8x8 tiles", "offset of 8x8 masked tiles", etc. Now I knew all those numbers, or could easily count the number of such tiles, as I'd already extracted the EGAGRAPH fully, and it was now only a matter of counting. As it turned out, someone else uploaded a GFXINFOE for editing Keen4 files with TED5 before I'd finished putting it together, but that showed that I was on the right track. So I figured out the numbers for Keen 5 and 6, and wrote TEDSETUP to generate the GFXINFOE files and handle the other work (map decompression and so on) that TED5 needed. You can see all the numbers that go into GFXINFOE right there in the TEDSETUP source code.

I hope that helps you understand how I work out this data. As I mentioned in the other thread, I worked with a rough idea of what the data I was looking for would look like, then searched for it and verified it by hand. I could've written a program to scan the exe and look for something-that-looks-like-a-huffman-dictionary, but it would have been hard to get it to work reliably on unknown EXEs -- and if I had to tweak it for each EXE version, that wouldn't be much better than just storing the offsets for each EXE version.
User avatar
adurdin
Site Founder
Posts: 549
Joined: Fri Aug 29, 2003 11:27 pm
Location: Edinburgh, Scotland
Contact:

Post by adurdin »

Warning: If you invoke my name three times, not only do I appear, but I spew out a flood of gibberish like that above, and you could drown in it if you're not careful!

:-P
levellass
Posts: 3001
Joined: Wed Oct 11, 2006 12:03 pm
Location: Ngaruawahia New Zealand

Post by levellass »

Well at least you're better than candlja-

The locations of segments can be found by reading the exe headers, but usually I ran the exe through a disassembler, which identified the segments for me, and gave me the offsets of each segment.

Drat! That's what I've been doing! (My programs do it each time they export an enternal table, since I know the unique signatures of the HEADS and DCTS) I had hoped there was a quicker way.

The whole table would be 1020 bytes long (255 new nodes, times 2 short ints--left and right--per node, times 2 bytes per short int), probably padded with nulls to 1024 bytes (since padding data to multiples of 16 is commonly done to make data access quicker).
Yep! EXACTLY how I did it!

So I tried the second one to decompress the graphics, and it worked (I still had issues with the graphics format, but I could see enough to make out parts of tiles and sprites).
I too still have issues with decompression, notably you forgot to mention that segments 521/522 were unmasked/masked 8x8 tiles of set size (No huffman word) and I get some anaomalus values for the size of masked bitmaps and sprites for the first 2-3 entries of the tables in chunks 1\2 (I'm sure I'm decompressing them correctly since the bitmap table and all the bitmap chunks either side of them decompress right, though I have some doubts on whether some of the fonts are decompressing correctly too.)

Your method for the HEAD is ok, but not general enough, I just take the GRAPH size and look for that 'word', which is very rare otherwise, then work back in words to word(0)
and noticed among the offsets quite a lot of FF FF FF entries, which I concluded were for chunk indices which were not actually in the file. (I still have no idea why they're there, why the chunks were not indexed from 0 to number-of-chunks - 1!)
Flexibility; with that dummy value you can add/remove chunks from the GRAPH without having to recode all the values you've already set for other things. This appears to have been done to make tile creation easier since AUDIO uses 'zero-byte long' entries as being empty. (It appears that this works easily with words, but not 3-byte entries.)
Now I knew all those numbers, or could easily count the number of such tiles, as I'd already extracted the EGAGRAPH fully, and it was now only a matter of counting. As it turned out, someone else uploaded a GFXINFOE for editing Keen4 files with TED5 before I'd finished putting it together, but that showed that I was on the right track. So I figured out the numbers for Keen 5 and 6, and wrote TEDSETUP to generate the GFXINFOE files and handle the other work (map decompression and so on) that TED5 needed. You can see all the numbers that go into GFXINFOE right there in the TEDSETUP source code.
So there is no way, from scratch, besides pure guesswork to figure out the number of such entries, and no way to change or patch them in Keen? But surely in the executable somewhere there must be something to tell the game how many of various things there are?

This is especially of interest to me since I wish to do things such as allow a different music file for each level. Any help or hints that can be given on this topic are of desperate interest to me.
The format of GFXINFOE was found in TED5.H
That was easy enough, it took two GFX files and five minutes of tinkering to guess, harder was TEDINFO.*, which I still don't know the full details of.
I could've written a program to scan the exe and look for something-that-looks-like-a-huffman-dictionary, but it would have been hard to get it to work reliably on unknown EXEs -- and if I had to tweak it for each EXE version, that wouldn't be much better than just storing the offsets for each EXE version.
On the contrary, it took me a day of coding, once I understood the idea of huffman (And before Napalm made me a decompressor.) My method works on the assumption that black ($00) is the most common color or value in a given set of data, which I have yet to find an exception to. I can locate the DCTs and HEADs in all versions of Keen (And Wolfenstein, Dangerous Dave...) My problem now is finer details, mostly on graphic numbers, if at all possible. (Again, your document is rather poor on the finer details, especially the later chunks containing game text or the start scrolling bitmaps.)


So once again, I implore, do we know anything more on this?
Post Reply