RIP to BT Garner of MindRec.com... BT passed away early 2023 from health problems. He was one of the top PCE homebrew developers and founder of the OG Turbo List, then PCECP.com. Condolences to family and friends.

Graphic, Sound & Coding Tips/Tricks/FX/Etc. Tools for Development

Started by TurboXray, 12/03/2014, 01:39 PM


TurboXray

Can we get a sticky for graphic, sound, and coding tips/tricks/effects/etc?

 Since the TED info isn't stickied, here are the relevant links for it:
pcengine-fx.com/forums/index.php?topic=21121.0
pcengine-fx.com/forums/index.php?topic=20120.msg436168#msg436168


TOOLS

 https://tasvideos.org/Bizhawk.html BizHawk PCE emulator with Lua script support. Perfect for debugging your games, or hacking existing games. Check out the sample Lua script for Neutopia. It shows collision boxes and the HP values for them.


TurboXray

Quote from: guestDone.
Thanks :)

 Saw this article on Contra for NES: http://tomorrowcorporation.com/posts/retro-game-internals-contra-collision-detection

 A 7-part in-depth look at Contra and the engine it runs on. One interesting note is how collision detection is handled object-to-object: the main character is a single point and all other objects are boxes. The bullets are considered 'spawned' enemies. The collision box for the enemies changes depending on the mode the character is in (lying down, standing/running, and jumping). I don't think it saves a whole lot of cycles, but it's an interesting approach nonetheless.
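The point-vs-box scheme described above reduces each pair test to two range checks per axis. A minimal C sketch of the idea (my illustration, not code from the article; the `Box` type and names are invented):

```c
#include <stdbool.h>

/* Contra-style test: the player is a single point, every other
   object is an axis-aligned box, so a hit is just four comparisons. */
typedef struct { int x, y, w, h; } Box;

static bool point_hits_box(int px, int py, const Box *b)
{
    return px >= b->x && px < b->x + b->w &&
           py >= b->y && py < b->y + b->h;
}
```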

touko

Very interesting, and useful.
I think a box for the player character and a single point for enemy bullets could save some cycles.

I have a very generic box collision detection routine (it works for all types of games); it takes 192 cycles when a collision between 2 sprites occurs. It's not the fastest routine ever, but it's not slow either.
I think in real use it takes between 5,000 and 10,000 cycles for a fairly good number of checks, like a shmup needs; in my case 10,000 cycles is 50 effective collisions at the same time in a frame.
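touko's actual routine is HuC6280 assembly, but the logic of a generic box-vs-box test boils down to this (a hedged C illustration with invented names; two axis-aligned boxes overlap iff they overlap on both axes):

```c
#include <stdbool.h>

/* Generic AABB overlap test, the C equivalent of the kind of
   box collision routine described above. */
typedef struct { int x, y, w, h; } Hitbox;

static bool boxes_overlap(const Hitbox *a, const Hitbox *b)
{
    return a->x < b->x + b->w && b->x < a->x + a->w &&
           a->y < b->y + b->h && b->y < a->y + a->h;
}
```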

TurboXray

10k cycles isn't bad at all. That's only 8.3% cpu resource for the frame. The PCE could easily handle double that if the VDC could handle it without flickering. A shame really.

 Touko, do you have a game engine that handles object to map collision... for slopes? Maybe something along the lines of Sonic or Mario?

touko

QuoteTouko, do you have a game engine that handles object to map collision... for slopes? Maybe something along the lines of Sonic or Mario?
Yes i have; it's (for now) 200 cycles for testing a 32x16 sprite against BG tiles. I tested this routine for flappy bird SGX.
The slow part is translating the player's X/Y coordinates into tile coordinates; after that i test directly in VRAM with auto-increment. It's fast and easy i think, and even better if you have a dynamic tilemap.

But it's easy to make it faster: only test some points. For example, there's no need to test an entire 32-pixel-wide sprite if the player moves forward or backward; the same goes for each direction.
For testing, with flappy i test the entire player's sprite.
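The "slow part" touko mentions (pixel position to tilemap entry) can be sketched like this. The tile size and virtual-screen width here are assumptions for illustration; on real hardware you would then fetch the entry through the VDC's auto-incrementing VRAM read pointer:

```c
#include <stdint.h>

/* Convert a sprite's pixel position into a BAT (tilemap) entry index.
   Assumes 8x8 tiles and a 64-tile-wide virtual screen. */
#define TILE_SHIFT 3   /* 8x8 tiles */
#define BAT_WIDTH  64  /* virtual screen width in tiles (assumed) */

static uint16_t bat_index(int px, int py)
{
    return (uint16_t)((py >> TILE_SHIFT) * BAT_WIDTH + (px >> TILE_SHIFT));
}
```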

QuoteThe PCE could easily handle double that if the VDC could handle it without flickering. A shame really.
Yes of course, even with my routine you can test more collisions without impacting the cpu too much.
You can easily improve performance by calculating each box into a RAM array when you move each sprite; this is not currently the case with my routine.

And like you, i have some SGX games in mind  :wink:

TurboXray

I had something I wanted to show off on the PCE, but I don't have a slope collision engine (only block tiles/physics and collision).

touko

Quotebut I don't have a slope collision engine
same here; my routine is only for shmups, not for platformers.

TurboXray

Maybe someone from the NESdev scene has something I can borrow/use.

touko

For now i have no need for a complex object/map collision routine; my next project will be a shmup and perhaps a brawler. ;-)

TurboXray

Expanded color palette mode on the PCE/SGX

 This specific effect only works over composite (or RF) output on the original consoles: no RGB or s-video mods.
This is based on the same effect as CGA composite artifacting, and is specific to the NTSC signal only.

 Since the mid-res mode (7.159MHz) is double the color burst frequency, you get direct artifacting between the two signals (Y and C). The trick here is to turn off the PCE's XOR color burst alternating bit (this bit removes the artifact for still/non-moving screens).

 Two things to consider:
1) The color burst is still XORed going down the screen; it just doesn't swap on the next frame. Your dithering will have to compensate for this (pretty sure vertical line dithering doesn't work for this, so XOR checkerboard dithering is needed).
2) I haven't found a way to directly control which 'phase' you are in on startup (two phases: slightly cool blue tint or slightly warm tint). IIRC, you can switch/change the phase by enabling and then disabling the XOR pattern bit on the VCE for a single frame. So this requires user input based on a visual aid, so the program knows which phase it's in (exactly like the CoCo 1/2 red/blue mode, except you don't need to reset the console).

 Unlike Genesis dithering (in either res mode), this appears thoroughly solid. 

 So you get more colors per tile from averaging dithered 16-color patterns, and you get colors outside the default 512-color master palette. Limited, but a cool effect. It could be great for a raycast-style engine (since the screen doesn't scroll), or some other demo effect.

touko

I have a question in mind: is it feasible to do self-modifying code on the AC???
Or simply to execute some code from it???

I ask this because i want to use compiled sprites for my brawler.

TurboXray

You mean like ST1/ST2 opcodes executing sequentially from AC ports (the 8k $40-43 banked ports)?

 In theory it should work, but in practice it'll most likely lock up. The arcade card has no way of stalling the cpu (with /RDY), and AFAIK - it buffers (reads ahead) from the port for the next value after the first is read (or written). At minimum, that's something like 6 cycles per byte (on Txx), which I think is fast enough for any type of sequential or random access of the memory through the ports (it's fast DRAM, but it's still DRAM with a refresh cycle hidden in there). But when you execute sequential code from the 8k banked ports - it hits that port on every cycle of the instruction (1 cycle per byte), instead of 6 or 10 or whatever cycles per byte. I'm almost positive that's too fast (120ns for byte access on the PCE side. I think the DRAM was 70ns, but couple that with a refresh hit and I'm pretty sure you'll exceed the 120ns window).

 You'd have to copy the embedded opcode/sprite data to local ram first, then execute it. Unless you mean something else..

touko

QuoteYou mean like ST1/ST2 opcodes executing sequentially from AC ports (the 8k $40-43 banked ports)?
Yes, that is what i mean.

Thanks for the explanation; the fastest way is to use Txx for transferring data, and for a SGX brawler it will be harder than I thought.
i think i'll need a double sprite buffer in vram; vblank is not enough to update sprites with Txx.

QuoteYou'd have to copy the embedded opcode/sprite data to local ram first, then execute it. Unless you mean something else..
I think it's better to copy directly into vram; doing the copy twice is not the best way (but there's no need for a double buffer in that case) :-k

TurboXray

If it's a brawler, you could probably get away with interleaving updates on specific frame intervals. I mean, it's pretty rare that frame updates need to be 60hz for a character. Someone else had mentioned this (something like 4 frame slots, so frame/pixel animation was limited to 15fps).

 Though if it's SGX, and you're using both sprite planes from both VDCs as a single virtual/pseudo SATB (cause you need to process sprite priority layering) - then I can see where you would need to update between both VDCs at a faster rate. I.e. it's fairly dynamic as to which VDC would receive the frame update (or even need a redundant update; the character moved from VDC2 SATB to VDC1 SATB and VDC1 needs the frame that VDC2 already had in memory).

  But being a brawler, surely you could fit most enemy frames (as ST1/ST2) in local memory to keep those as fast updates. Since most of the game data is going to sit in AC ram anyway, you have lots of room for stuff in local ram. 32k of ram should be plenty for a brawler engine, and 16k of work ram (8k of that being original sys ram). That leaves you with 216k. And most brawlers break up a stage into subsections, which you could take advantage of and replace/update enemy opcode sprites in local ram (from AC ram). Same for bosses. You can use mini 'transition' scenes to hide this 'loading'. Or do it as a background process in an area that's lite on enemies. Etc.

 If you can't tell, I've thought a lot about this before (SGX+AC+brawler) :P

touko

QuoteIf it's a brawler, you could probably get away with interleaving updates on specific frame intervals. I mean, it's pretty rare that frame updates need to be 60hz for a character. Someone else had mentioned this (something like 4 frame slots, so frame/pixel animation was limited to 15fps).
Of course you're right ..

QuoteThough if it's SGX, and you're using both sprite planes from both VDCs as a single virtual/pseudo SATB (cause you need to process sprite priority layering) - then I can see where you would need to update between both VDCs at a faster rate. I.e. it's fairly dynamic as to which VDC would receive the frame update (or even need a redundant update; the character moved from VDC2 SATB to VDC1 SATB and VDC1 needs the frame that VDC2 already had in memory).
Ehehe, yes that's my problem, i want to maximise sprites on screen. It's not difficult for most types of games, but for a brawler with Y ordering it's slightly difficult to manage across two separate layers.

Y ordering will be done with a dynamic sprite list, like a linked list; for now this is the best i've found  :P
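The dynamic Y-ordered sprite list can be sketched as a sorted linked-list insert (a C illustration of the idea, with invented names; the real thing would live in 6280 asm):

```c
#include <stddef.h>

/* Keep sprites in a singly linked list sorted by Y, so draw order
   follows screen depth in a brawler. */
typedef struct Sprite {
    int y;
    struct Sprite *next;
} Sprite;

/* Insert 's' into the list at 'head', keeping ascending Y order;
   returns the (possibly new) head. */
static Sprite *insert_by_y(Sprite *head, Sprite *s)
{
    if (!head || s->y < head->y) {   /* new front of list */
        s->next = head;
        return s;
    }
    Sprite *cur = head;
    while (cur->next && cur->next->y <= s->y)
        cur = cur->next;
    s->next = cur->next;
    cur->next = s;
    return head;
}
```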

QuoteBut being a brawler, surely you could fit most enemy frames (as ST1/ST2) in local memory to keep those as fast updates. Since most of the game data is going to sit in AC ram anyway, you have lots of room for stuff in local ram. 32k of ram should be plenty for a brawler engine, and 16k of work ram (8k of that being original sys ram). That leaves you with 216k. And most brawlers break up a stage into subsections, which you could take advantage of and replace/update enemy opcode sprites in local ram (from AC ram). Same for bosses. You can use mini 'transition' scenes to hide this 'loading'. Or do it as a background process in an area that's lite on enemies. Etc.
Now i think i'll do something like that; for now i do not have any gfx, and i don't think the sprite sizes will be close to FF, but more like double dragon with much more different moves for the players and around 10 enemies on screen (and 2-player co-op of course).
I calculated about 50% of max cpu use for an 8.5 KB/frame transfer; this should be enough, and it leaves me ~60,000 cycles for game logic. The drawback is the need to double-buffer sprite data in vram; not really a problem with SGX, but it can be tedious with Y ordering and the 2 sprite layers.
I can probably use scdram for the players' compiled sprites and AC for the rest, with a good sprite data organisation to avoid transferring empty tiles ..
The sprite buffer can be cleared with fast VDC DMA and a DMA list driven by interrupts (which i already have).

QuoteIf you can't tell, I've thought a lot about this before (SGX+AC+brawler) :P
I see lol, you describe exactly the same problems i have :wink:
This has been on my mind for a while; no code or anything for now, only theory and some basics, and i dream of what this combo could allow ..

touko

hi, i converted an lz4 decompressor to the 6280.
It's a real-time decompressor, it has a good compression/speed ratio, and it can compress all kinds of files.
For example, i compressed the first level of my game chuck no rise; the original file is 23 KB, and it comes down to 13 KB compressed.
A sprite pattern (not very optimised) of about 7 KB is 1.79 KB compressed.
My routine is functional and works very well, but the bank rollover is not yet finished, and for now it lacks the part to decompress directly to vram, and the use of the block transfer instruction rather than a simple copy (lda=>sta).
You can summarise this decompression as a simple byte copy; this is why it's so fast.

Has anyone already tested this in a game?

Example, code, algorithm, and benchmarks for the Apple IIgs (65816):
http://www.brutaldeluxe.fr/products/crossdevtools/lz4/

lz4 creator's blog:
http://fastcompression.blogspot.fr/2011/05/lz4-explained.html
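For reference, the LZ4 block format described at the links above boils down to a short loop. This is a hedged C sketch of the standard block format (not touko's 6280 routine), with no bounds checking:

```c
#include <stdint.h>
#include <stddef.h>

/* Each LZ4 sequence: a token with two 4-bit length fields, the
   literals, a 2-byte little-endian match offset, then a match copied
   byte-by-byte from already-decoded output (overlap-safe). 15 in a
   length field means "add extension bytes until one isn't 255". */
static size_t lz4_decode(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    const uint8_t *end = src + srclen;
    uint8_t *out = dst;
    while (src < end) {
        uint8_t token = *src++;
        size_t len = token >> 4;                 /* literal length */
        if (len == 15)
            { uint8_t b; do { b = *src++; len += b; } while (b == 255); }
        while (len--) *out++ = *src++;           /* copy literals */
        if (src >= end) break;                   /* last sequence: literals only */
        size_t off = src[0] | (size_t)(src[1] << 8); src += 2;
        size_t mlen = (size_t)(token & 15) + 4;  /* minimum match is 4 */
        if ((token & 15) == 15)
            { uint8_t b; do { b = *src++; mlen += b; } while (b == 255); }
        const uint8_t *match = out - off;
        while (mlen--) *out++ = *match++;        /* overlap-safe byte copy */
    }
    return (size_t)(out - dst);
}
```

The byte-by-byte match copy is what maps so directly onto a 6280 lda/sta loop (or Txx, when source and destination don't overlap).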

elmer

Quote from: touko on 02/06/2015, 12:14 PMSomeone has already tested this in a game ??
Yes, it's very suitable for games. We were using various LZ77-variants (such as LZ4) all through the 1980's-1990's for compressing game data.

I've been meaning to look up LZ4 for a while now to see what the fuss is about, thanks for the links to the explanation.

The trick with all the LZ77-variants is what scheme you use to store the literal/match lengths and offsets ... LZ4 seems to offer a good balance of compression and performance.

I can't see that you'd want to decompress directly to VRAM when you need to keep the sliding window of previous data around for copying the "match" bytes ...  but you can certainly play with the code to achieve that effect if you want to.

If you're short of decompression space, what we usually did was to just split larger data into separately-compressed blocks of a fixed size (say 8KB), and then decompress a block at a time. That's what I did to make a cacheable-filesystem-on-a-cartridge for a few N64 games.

touko

Thanks for the feedback ;-)

QuoteI can't see that you'd want to decompress directly to VRAM when you need to keep the sliding window of previous data around for copying the "match" bytes ...  but you can certainly play with the code to achieve that effect if you want to.
It's easy (i think) because the match and literal stay in the source file in ram, not in vram; only the destination will be ..

QuoteIf you're short of decompression space, what we usually did was to just split larger data into separately-compressed blocks of a fixed size (say 8KB), and then decompress a block at a time. That's what I did to make a filesystem-on-a-cartridge for a few N64 games.
I want to avoid copying data into a buffer first and later into vram for data that needs it, like sprite patterns, and instead do it directly in vram.
As you can access vram at any time, why not do it?? ;-)

elmer

Quote from: touko on 02/06/2015, 02:38 PMI want to avoid copying datas in a buffer first, and later in vram for datas need to be like sprites pattern, and directly doing it in vram .
As you can access vram any time,why do not do it ?? ;-)
Why not?? ... because I'm afraid that you've misunderstood the LZ4 algorithm. Here's a quote from one of the pages that you linked ...

QuoteWith the offset and the match length, the decoder can now proceed to copy the repetitive data from the already decoded buffer.
That's how all LZ77-variants work ... by exploiting the repetitive nature of the decompressed data.

You must keep a sliding window buffer of the decompressed data available to copy from.

The match offset and match length in the compressed data refers to offsets and lengths in the decompressed data.

Now ... you certainly can modify the algorithm to use a sliding window in the compressed data instead of the decompressed data and thus enable decompression directly into VRAM ... but your compression-ratio will almost-certainly suffer very, very badly (I tried this back in the 1980's).

You are welcome to try this yourself ... your data may be different enough that it will work ... but don't be surprised if it doesn't!

elmer

Quote from: elmer on 02/06/2015, 02:02 PMYes, it's very suitable for games. We were using various LZ77-variants (such as LZ4) all through the 1980's-1990's for compressing game data.

I've been meaning to look up LZ4 for a while now to see what the fuss is about, thanks for the links to the explanation.
Since one of touko's links actually gave a testsuite, I thought that I'd run my old SWD compressor on it. Just like LZ4, my compressor's LZ77-style-encoding is designed for fast game-time decompression.

I'm insufferably happy to present the following results (compressed size in bytes) ...  :wink:

Test File       LZ4     SWD
---------------------------
ANGELFISH     6,505   5,799
ASTRONUT     23,517  21,426
BEHEMOTH     14,799  14,068
BIG           2,800   2,571
BUTTERFLY     8,862   8,137
CD            6,651   6,164
CLOWN        18,873  16,934
COGITO        7,659   9,666
COTTAGE      15,297  13,628
FIGHTERS     13,099  12,182
FLOWER       13,217  12,338
JAZZ          9,970   9,074
KNIFE        14,807  13,707
LORI         20,258  18,610
MAX           8,640   8,171
OWL          18,471  15,347
RED.DRAGON   20,592  18,903
TAJ          16,303  13,953
TUT          12,548  11,476

touko

QuoteYou must keep a sliding window buffer of the decompressed data available to copy from.

The match offset and match length in the compressed data refers to offsets and lengths in the decompressed data.
Ah ok, i see  :wink:, it's not a problem because you have 2 independent VRAM pointers, 1 for read and 1 for write.
You can easily point at the match position in VRAM (with the read pointer), copy the match byte into the A reg, and write it repeatedly in VRAM (with the write one); i treat vram like a buffer. Don't forget we have unlimited access to vram, and not only in vblank.
And as the write pointer is auto-incremented, you do not have to set it each time or increment the destination word by word as we do in ram..

Of course you cannot use block transfer instructions in this case.

QuoteNow ... you certainly can modify the algorithm to use a sliding window in the compressed data instead of the decompressed data and thus enable decompression directly into VRAM ... but your compression-ratio will almost-certainly suffer very, very badly (I tried this back in the 1980's).
The LZ4 algorithm is very simple, and i already have a faster version than the 65C02 one (for ram decompression only).
My version is based on this one:
http://pferrie.host22.com/misc/appleii.htm
It will inevitably increase the code size (it's already the case), but it should be faster than copying data twice i think..

QuoteYou are welcome to try this yourself ... your data may be different enough that it will work ... but don't be surprised if it doesn't!
I'll try  :wink:, but for now the difficulty is how to manage the 2 options, RAM and VRAM, efficiently without convoluted code - not whether writing directly to vram is possible (because it is) ..

QuoteI'm insufferably happy to present the following results (compressed size in bytes) ...  :wink:
[table snipped]
Wahou, your compressor beats lz4 in almost every case; what about the speed?? ..

elmer

Quote from: touko on 02/07/2015, 04:07 AMAh ok,i see  :wink:, it's not a problem because you have 2 independant VRAM pointer, 1 for read and 1 for write .
Excellent! Yes, it'll work directly to/from VRAM. ... I'm still not used to the intricacies of the PCE's VDC and was thinking about other (much more limited) machines.

Just remember that you are copying a string of bytes from the previous data, so it's a sequence of read/write pairs and not just read-once, write-many.

That's going to get ugly very quickly with even/odd byte boundaries ... so what I'd suggest is to hack up a customized version of LZ4 that processes 16-bit words instead of 8-bit bytes, it'll be a much better match for the VRAM data that way and avoid lots of ugly code.

I seem to remember losing a few % of compression when I tried that on the Gameboy, but it'll make your life much easier ... I think that it's a good trade off for your usage.

QuoteMy version is based on this one :
http://pferrie.host22.com/misc/appleii.htm
It will inevitably increase the code size (it's already the case),but it should be faster than copying datas twice i think..
His code is written for clarity and not speed, so you can definitely do better.

QuoteWahou, your compressor is better in any case than lz4,and what about the speed ?? ..
It is almost-certainly a bit slower, because I bit-pack the offset/length encodings, but in my experience most of the compressed data is single-byte literals which should be just as fast (or faster) than LZ4.

I'll have to clean up the code a bit and release it on github, and then you can run some tests!  :wink:

Remember ... there is always a tradeoff between compression and speed, that's why LZ4 is so fast ... it uses a very simple encoding for the runs/offsets/lengths.

My encoding is a bit more complex, and usually gets an extra few % of compression, but not always ... you can see that SWD is actually considerably larger than LZ4 in one of the tests.

It all depends upon the data, and LZ4 is more resilient to different data sets than my encoding, which was originally hand-tuned for the character/map/sprite data in one specific game.

The test suite that the AppleII guys used is, IMHO, not a very good representation of the character/map/sprite data used on the PCE/Genesis/SNES/Gameboy ... it contains way too many runs of single-color or simple-pattern pixels.

touko

QuoteJust remember that you are copying a string of bytes from the previous data, so it's a sequence of read/write pairs and not just read-once, write-many.
Yes, that's the case for literals, but not for matches, no??

QuoteThat's going to get ugly very quickly with even/odd byte boundaries ... so what I'd suggest is to hack up a customized version of LZ4 that processes 16-bit words instead of 8-bit bytes, it'll be a much better match for the VRAM data that way and avoid lots of ugly code.
Of course, i'm not sure that copying directly into VRAM will be practical, and like i said, the 2 cases (RAM/VRAM) are not easy to handle together and (maybe) imply dirty code and an increase in decompressor code size ..  :?
The buffer in ram is by far the simplest solution, but it's not optimal in terms of speed.

QuoteI seem to remember losing a few % of compression when I tried that on the Gameboy, but it'll make your life much easier ... I think that it's a good trade off for your usage.
I do not exclude any solution  :wink:

QuoteHis code is written for clarity and not speed, so you can definitely do better.
Exactly - and for size too, but definitely not for speed.

QuoteI'll have to clean up the code a bit and release it on github, and then you can run some tests!  :wink:
Thanks so much  :wink:

QuoteRemember ... there is always a tradeoff between compression and speed, that's why LZ4 is so fast ... it uses a very simple encoding for the runs/offsets/lengths.
You're right; i'm not a fan of compression in general, and i look for a good compromise between size and speed - i don't like to spend too many cycles decompressing data. :P

QuoteIt all depends upon the data, and LZ4 is more resilient to different data sets than my encoding, which was originally hand-tuned for the character/map/sprite data in one specific game.
This is why i've gone with LZ4: not too bad for all kinds of data, and easy to implement.
But yours is very good too ..

QuoteThe test suite that the AppleII guys used is, IMHO, not a very good representation of the character/map/sprite data used on the PCE/Genesis/SNES/Gameboy ... it contains way too many runs of single-color or simple-pattern pixels.
Of course; i ran some tests on my pce graphics data, mainly tilesets and sprites, and it was very good for my use - not the best of course, but with a compression factor of 2 to 2.5 in most cases.

TurboXray

Wow, that's a really simple compression algorithm (LZ4). I love looking at and taking apart different compression schemes (they all have their own advantages).

 Planar graphics never compress that well compared to packed pixel. I wonder how it does with 4bit packed pixel nibbles. Gate of Thunder uses LZSS and has the sprites (and IIRC, tiles) all in packed pixel format. The compression algorithm knows ahead of time whether the graphic data is a 16x16 native sprite cell or an 8x8 tile cell, and has an internal counter that, when expired, converts the decompressed graphics data back into PCE format and writes it to vram. On top of that, it does this in real time as the game engine is playing along. I didn't fully investigate how the game engine does this, but making a time-sliced background 'process' isn't too difficult. Definitely something you can do if the game is structured such that you have 'lead time' before the graphics are due - thus decompress them in the background process over quite a few frames as the normal game logic is running.
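The time-sliced background 'process' idea can be sketched as a decoder whose state persists in a struct and which gets a small work budget per frame. This is my illustration of the concept (a trivial byte copy stands in for the real LZSS step; all names are invented):

```c
#include <stdint.h>
#include <stddef.h>

/* Background decompression job: call bg_decomp_step() once per frame
   with a cycle-friendly budget, and the data unpacks over many frames
   while the game logic keeps running. */
typedef struct {
    const uint8_t *src;
    uint8_t *dst;
    size_t remaining;
} BgDecomp;

/* Do at most 'budget' bytes of work; returns nonzero while unfinished. */
static int bg_decomp_step(BgDecomp *job, size_t budget)
{
    while (budget-- && job->remaining) {
        *job->dst++ = *job->src++;   /* real code: one LZSS decode step here */
        job->remaining--;
    }
    return job->remaining != 0;
}
```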

 I've used LZSS, pucrunch, and packfire for the PCE, all with a circular buffer to decode directly to vram. With Pucrunch, I was able to get really good results with 512k and 1024k window sizes. But man.. it's slow. Especially with the packed pixel to planar counter/conversion implemented ;>_>

 Some other later-gen PCE CD games that use LZSS prime half or more of the 'window' with a special set of values every time, before the decompression process starts. The compression algorithm knows this ahead of time and can reference it (usually great for tilemap data and such).

 Thinking about all of this in the context of CD ram, reserving a larger decompression buffer can negate better compression savings (because you're taking away 'storage' ram for 'work' ram to create a local decompression area). I think in this context, decompressing directly to vram can save more overall CDRAM space, even with a slightly worse compression ratio/scheme. Of course, it's all really relative to what you need for your project. 

elmer: I'm looking forward to your 'SWD' compression tools when you release them.

elmer

Quote from: touko on 02/07/2015, 02:08 PMYes it's the case for litteral not for match, no ??
No, I'm afraid that you often copy multiple bytes from the match position; that's how it gets its good compression.

QuoteOf course, i'am not sure that copying directly in VRAM will be pratical
Because you have both read and write pointers to VRAM that actually auto-increment ... it will be blindingly fast compared to the regular-RAM version. But doing the compression with 16-bit words instead of 8-bit bytes is likely to hurt the compression quite a bit.

You can still do it with bytes, but you'll probably end up with 4 different routines to cope with the various combinations of even/odd source/destination.

elmer

Quote from: TurboXray on 02/07/2015, 02:32 PMWow, that's a really simple compression algorithm (LZ4). I love looking and taking apart different compression schemes (they all have their own advantage).
They're fun aren't they! I like it that LZ4 actually implements a run-length for the literal data, I'd always meant to try that, but never got around to it.
 
QuotePlanar graphics never compress that well, compared packed pixel. I wonder how it does with 4bit packed pixel nibbles.
Very, very true. I expect that it'll do extremely well with packed data ... but OMG, the terrible overhead!!!!

QuoteGate of Thunder uses LZSS and has the sprite (and IIRC, tiles) all in pack pixel format. ...
That's cool, I certainly didn't know that. The background process is a very nice solution to the problem of unpacking the pixels if you don't need an as-fast-as-possible data-rate.

QuoteSome other later gen PCE CD games that use LZss, prime the half or more the 'window' with a special set of values every time, before the decompression process starts.
That's a cool trick ... especially if you have the preload data in ROM or VRAM somewhere. It's always fun to hear what ideas people came up with to wring the best performance out of a machine.

Quoteelmer: I'm looking forward to your 'SWD' compression tools when you release them.
It's really just another LZ77/LZSS variant. I always mix up LZ77 and LZSS since they're basically the same thing in my mind ... LZSS is such a trivial (but useful) improvement to the LZ77 concept.

As the wikipedia page on LZSS says ...

QuoteMany popular archivers like PKZip, ARJ, RAR, ZOO, LHarc use LZSS rather than LZ77 as the primary compression algorithm; the encoding of literal characters and of length-distance pairs varies, with the most common option being Huffman coding.
My first Amiga games used Huffman-encoded LZSS as the article suggests, but it was a bit slow and also a pain because you had to include the Huffman table along with the compressed data.

When I had to do a Gameboy game, I ran all the data through the LZSS/Huffman encoding and took a look at the bit-lengths of each length/offset encoding used. After a bit of eyeballing and tweaking I came up with a static encoding of the lengths/offsets that gave approx 80% as good results, but was trivial to decode in Z80/6502 assembler.

SWD data is encoded as LZSS length/offset pairs.

Lengths are encoded ...

    1       : 0  dddddddd
    2       : 10
    3-5     : 11 xx
    6-20    : 11 00 xxxx
   21-275   : 11 00 0000 xxxxxxxx

Offsets are encoded ...

$0001-$0020 : 00       x xxxx
$0021-$00A0 : 01     xxx xxxx
$00A1-$02A0 : 10  x xxxx xxxx
$02A1-$06A0 : 11 xx xxxx xxxx


In order to avoid too many bit-shifts, bytes of the encoded bit-stream are interleaved with data bytes, so that literal values and the low 8-bits of long lengths and offsets can be read directly from the compressed stream without any shifting.

With this encoding, any 2-byte or longer match is a win ... whereas with LZ4 the minimum match is 4-bytes.
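One plausible reading of the length table above is that each all-zero field escapes to the next, wider field (so `11 xx` with nonzero `xx` gives 3-5, `11 00 xxxx` with nonzero `xxxx` gives 6-20, and so on). This is my interpretation, not elmer's actual code; `next_bits()` is an assumed MSB-first bit reader:

```c
#include <stdint.h>

/* Minimal MSB-first bit reader (invented helper for this sketch). */
typedef struct { const uint8_t *p; int bit; } BitReader;

static unsigned next_bits(BitReader *br, int n)
{
    unsigned v = 0;
    while (n--) {
        v = (v << 1) | ((br->p[br->bit >> 3] >> (7 - (br->bit & 7))) & 1);
        br->bit++;
    }
    return v;
}

/* Decode a match length for a '11'-prefixed code, i.e. call this
   after the two leading '11' bits have already been consumed. */
static unsigned swd_match_length(BitReader *br)
{
    unsigned v = next_bits(br, 2);
    if (v) return 2 + v;            /* 3-5   */
    v = next_bits(br, 4);
    if (v) return 5 + v;            /* 6-20  */
    return 20 + next_bits(br, 8);   /* 21-275 */
}
```

Note how the stated ranges fall out of the arithmetic: 2 + (1..3) = 3-5, 5 + (1..15) = 6-20, 20 + (1..255) = 21-275.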

TurboXray

I think you could keep the compression scheme byte-based for LZ4. Yeah, you need to work out some case logic, but something like this could handle the bulk of it:

QuoteMaybe for something like setting up the 'read' address, you could shift out the byte offset into a word base offset, and then take the 'carry' and shift it into the index register. This would automatically set up your even/odd offset for the read pointer.

      lsr .sm1+1      ; shift the 16-bit byte offset right once...
      ror .sm0+1      ; ...for the VRAM word offset (high byte, then low)
      cla
      rol a           ; carry (the even/odd byte bit) into A
      tax             ; X = 0 or 1 starting byte index

      st0 #$01        ; select MARR (VRAM read address)
.sm0
      st1 #$00        ; low byte (self-modified above)
.sm1
      st2 #$00        ; high byte (self-modified above)
      lda #$02
      sta <vdc_reg
      st0 #$02        ; select the VRAM data register

.loop
      lda $0002,x     ; read low/high data byte
      sta $0002,y     ; write low/high data byte

      txa             ; toggle read byte index
      eor #$01
      tax

      tya             ; toggle write byte index
      eor #$01
      tay

      dec <counter
    bne .loop
   
 
I didn't show setting up Y, but it should be continuous - since you're writing forward with this compression scheme. Same for the VRAM write pointer. That's set at the start of the block of data to decompress. It's the read pointer that needs to be modified, hence the above code.

 Starting off with reading from $0002 or $0003 handles the even/odd byte offset reading issue (by indexing on the base read address $0002). I mean, you're never skipping bytes - just starting with an even or odd byte offset.



 Though a jump table with multiple renditions of the same code, but handling/priming the starting offset read/write, would definitely be faster (you wouldn't have to deal with indexing, and modifying those index regs) - at the expense of some code space.

elmer

Quote from: TurboXray on 02/07/2015, 11:10 PMI think you could keep the compression scheme byte based for LZ4. Yeah, you need work out some case logic, but something like this could handle the bulk of it:
NICE!!!! Those auto-incrementing VDC registers really take a lot of the niggling-cr*p out of the inner loop. The PCE is such a beautifully designed piece of hardware.  :)

But, really ... you know that you can get those eor's out of the inner loop if you really want to!

TurboXray

Yeah, you can optimize out those eor's. I just wanted to show the approach.

 Yeah, the PCE architecture is pretty simple and clean. Being able to read and write to vram during active display is pretty nice IMO. It might not have a fast local to vram DMA like the SNES and Genesis, but in a good amount of cases open vram access can balance that out (games like Sapphire with large area animation updates show this off).


On a related note.. (source code layout optimization?)
 I used to think the lack of a bigger linear PC address range (local to the cpu) was a design hindrance, but then I realized that all my optimizations were local anyway, and macros for 'far jsr' help the code structure lend itself to a more linear-like layout (kinda - in the source it looks that way). I typically have a layout of 8k I/O, 16k of ram, 16k of code, 16k of data, and 8k of fixed library.

 I have multiple vector banks, with the top 4k holding repeated code/data and the lower 4k holding different stuff - the lower 4k usually being tables for speeding up code/etc, relative to the subroutine called. The upper 4k always has the code (along with the macro) to do the far calls and far returns, while always keeping the fixed lib funcs and video/timer interrupt routines, etc. So you get a 16k code + 4k fast-table mapping, and still have 16k for other 'data'. Or an 8k code + 24k data mapping, etc. Or 8k code + another 8k code, etc. It works out pretty well. I'm usually not concerned with wasting a little bit of fat on code, since code generally takes up a small percentage compared to data.

 Do you guys ever map anything in the typical I/O bank area? After working on nes2pce stuff, I've found myself mapping other banks to this area (MPR0). Interrupt routines that need access to the I/O bank can map it back in for that interval. I mean, if a specific subroutine isn't reading/writing vram or writing to the sound hardware, why not map something else there? Matter of fact, having done nes2pce stuff, I don't find it odd to map the I/O bank to something like the 4000-5fff or 6000-7fff range either ($6002, $6003, $7403, $6404, etc). It gives you another 8k of address range to work with otherwise (ram, data, code, etc).

touko

QuoteDo you guys ever map anything in the typical I/O bank area?
No, because I use a custom version of HuC and I stay as close as possible to its scheme. I use some custom mapping when necessary, but not for I/O.

For compression your experiences are great; I'm a big noob for now and my first step was with LZ4 .. :wink:
I have already experimented with an easy scheme for PCM samples, the packed pixel. It's not the best compression ever, but it's very well suited for 5-bit PCM: it allows you to encode your samples in 2 bytes rather than three.
My PCM routine is 30/50 cycles per sample with decompression and mapping. It's a simple PCM playback with volume setting.
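A sketch of what that packing might look like, assuming three 5-bit samples per 16-bit word (15 of 16 bits used, so 3 samples cost 2 bytes instead of 3) - the exact bit layout touko uses is a guess:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: sample 0 in bits 0-4, sample 1 in bits 5-9,
 * sample 2 in bits 10-14; the top bit is unused. */
static uint16_t pack3(uint8_t s0, uint8_t s1, uint8_t s2)
{
    return (uint16_t)((s0 & 0x1F) | ((s1 & 0x1F) << 5) | ((s2 & 0x1F) << 10));
}

static void unpack3(uint16_t w, uint8_t out[3])
{
    out[0] = w & 0x1F;
    out[1] = (w >> 5) & 0x1F;
    out[2] = (w >> 10) & 0x1F;
}
```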

elmer

Quote from: elmer on 02/07/2015, 03:53 PMMy first Amiga games used Huffman-encoded LZSS as the article suggests, but it was a bit slow and also a pain because you had to include the Huffman table along with the compressed data.
Whoops ... I checked the source code and apparently I was having a bit of a "Brian Williams" moment with my memory!

One Amiga game was pure Huffman compressed, then LZW for a Gameboy game, then SWD for later games.

Anyway, the SWD source should be on github today and I'll send you both links.

Please forgive its crusty old style, lack of documentation, and general limits ... remember that it was written for internal use and not for public use.

I'd be really interested to hear how it does on *your* specific data in comparison to LZ4.

touko

QuotePlease forgive its crusty old style, lack of documentation, and general limits ... remember that it was written for internal use and not for public use.
Thanks a lot ..

QuoteI'd be really interested to hear how it does on *your* specific data in comparison to LZ4.
No problem  :wink:

elmer

Quote from: TurboXray on 02/08/2015, 03:30 PMBeing able to read and write to vram during active display is pretty nice IMO. It might not have a fast local to vram DMA like the SNES and Genesis, but in a good amount of cases open vram access can balance that out.
IMHO it's a HUGE win for the PCE! From my memory, the limits on the SNES and Genesis DMA were really annoying.

Yes, they can transfer a lot more in the vblank period than the PCE can ... but the vblank period is short, and the PCE can catch up and far outstrip them during the frame itself.

Doing any more complex scatter-gather copying to VRAM should be a huge win on the PCE compared to the SNES/Genesis.

QuoteOn a related note.. (source code layout optimization?)
Your experience on this platform is so much greater than mine, so you're the expert.

Your thinking makes a lot of sense ... especially for anything written in assembler.

In my very personal opinion,  assembler is the only sensible language for this platform, but then, everyone is entitled to their own opinion, so YMMV.

I'll only be able to really say anything useful after I've gotten some more experience.

Arkhan Asylum

Using C for PC Engine is acceptable, given the right game.   

For speed you need assembly.   Atlantean only optimized what was needed to gain speed.   Some functions are still 100% C.

No point causing brain damage where it isn't needed.
This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

elmer

Quote from: guest on 02/09/2015, 06:06 PMFor speed you need assembly.   Atlantean only optimized what was needed to gain speed.
Wise words ... basically don't over-optimize, and don't fret over what isn't a blockage.

Quote from: guest on 02/09/2015, 06:06 PMNo point causing brain damage where it isn't needed.
Some of us don't define assembly language (particularly with a good macro assembler), as causing brain damage. As I said before ... everyone has their own (entirely valid) opinion of what is in their comfort-zone.

For your entertainment, here are 2 interesting posts by Mick West (one of the founders of Neversoft) about game programming in 1991 and 1995 ...

My coding practices in 1991:
http://cowboyprogramming.com/2008/11/15/my-coding-practices-in-1991/

1995 Programming on the Sega Saturn:
http://cowboyprogramming.com/2010/06/03/1995-programming-on-the-sega-saturn/

TurboXray

QuoteIn my very personal opinion,  assembler is the only sensible language for this platform, but then, everyone is entitled to their own opinion, so YMMV.
Considering there's practically zero advantage to using C vs ASM for 65x stuff, I just stick with ASM. When I need to write working code quickly (prototyping), I just use an advanced set of macros that simulate a more advanced processor - like the 68k (it makes the source code very compact and much easier to read). Then I rewrite stuff as needed for speed, etc.

 Still, I think it would be kind of cool to have a C directive for an assembler. Of course, one could just write an external preprocessor app to parse the source code and hand that off to something like CC65, then put the 'assembly' result back into the .S file, and assemble. Thus, C prototyping support for assembly.

elmer

Quote from: TurboXray on 02/09/2015, 07:34 PMOf course, one could just write an external preprocessor app to parse the source code and hand that off to something like CC65, then put the 'assembly' result back into the .S file, and assemble. Thus, C prototyping support for assembly.
I'm looking forward to setting up a full PCE CC65 build environment (mainly for the assembler), as soon as I finish messing around trying to get GCC for the PC-FX's V810 a lot more up-to-date than it currently is.

touko

@elmer:
QuoteI wrote this in 1991, when I was writing Amiga and Atari ST games for Ocean Software in Manchester, UK. I think at the time I was working on Parasol Stars. It's an interesting look at a simpler time in games programming.
Respect ...  :shock:

And you're right, the match bytes part is bytes and not byte  :wink:

QuoteConsidering there's practically zero advantage of using C vs ASM for 65x stuffs, I just stick with ASM.
Same here, I don't use C at all. I'm faster with ASM than C now, and my code is directly optimised ..

TurboXray

Phase locking/syncing the cpu with the VDC. I have some mid-scanline effects that would lend themselves better if the cpu was in sync with the VDC (it's not, because of whatever instruction is mid-execution when the VDC interrupt fires).

 I remember something about a part of the hsync area where the VDC is busy fetching all the sprite pixels for the current scanline. This is a very short period, but if you write or read vram during this phase, the cpu will be stalled. VDC regs don't count; it has to be vram. Now, I also remember hearing that this period is variable, because the number of sprite pixels per scanline is variable.

 I have an idea that might work. I do know that the VDC doesn't care if a sprite is 'on screen' or off screen when parsing for the horizontal line. In other words, off-screen sprites will also have their pixel data fetched until the 64-word pixel buffer is full (which is why you should hide/clip sprites with the Y reg and not the X reg). I'm thinking that you could put a bunch of extra sprites off screen, coming after your normal displayed sprites - in other words, forcibly fill that sprite pixel buffer every scanline. Then, during the interrupt call, set up to start reading vram for a period of time, to put the cpu in that window where it would be stalled. When it comes out of the stall, it will be in sync with the VDC and at the same spot every scanline. The timing would be tight, but it might be doable. Most long-executing instructions are 7 cycles on the max side (besides a few), so you'd have to calculate a +1 to +7 cycle index into this window.

 Basically, the idea is to get rid of jitter for mid scanline effects. Not that there is a lot you can do mid scanline, but there a handful of things ;)

TurboXray

Clipping for sprites that are larger in width of 16 pixels.

There are only two widths available on the PCE: 16 and 32 pixels wide. Anything larger is a meta-sprite.

On the PCE (and on nes, sms, genesis, snes), clipping is important because sprites that are to the left or right of the screen, but off screen, still count toward the sprite overflow total.

Think of the sprite overflow as one large 256 pixel buffer. This includes transparent pixels as well (consoles really didn't have the luxury of only including opaque pixels). So the PCE will process all 64 sprite entries every scanline, but it's the buffer that limits how many sprites can be shown on any given line.

So, clipping is pretty easy in general. When a sprite has fully left the screen, be it right or left, you drop it from the entry list (usually by setting the Y coord to something outside the range, or zeroing out the whole entry). The sprite (X,Y) coordinates are taken from the top-left corner. So if X_coord+sprite_width <= left_border, clip it - etc.

The obvious reason for clipping is to reduce sprite drop, or flicker if you implement it, or something along those lines. If you look at this chart here:
IMG
You'll notice that all 32 wide sprites can be easily divided into columns of 16 pixels wide.

Take a 32x64 sprite, for example, and notice how differently the 16x64 sprites can be defined. Normally, for such a large sprite, vram alignment is every 0x200 words. If you use a "cell offset" that falls in the middle, those lower bits of the offset are masked off to force a 0x200-word alignment in vram.

If you look at the 16x64 column specifications there, you'll see that you can choose which column to address. It basically halves the 32x64 sprite into two columns. This makes clipping not only easy, but also means you don't need two copies of the same sprite... or to always use the 16-wide configuration.

So for left-side clipping, you check (X_coord+$10) <= Left_limit. If true, clear the X_width bit in the sprite attribute entry (bit #7, which sets the width), add $10 to X_coord, and set bit #1 of the Cell_Offset/sprite_num entry (not bit #0). Right-side clipping is even easier: check (X_coord+$10) >= Right_limit, then clear the X_width bit. For right-side clipping, you don't need to reposition the X value to compensate, nor point to the next column.
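As a sketch, the same clipping logic in C (the struct fields here are an illustration of how you might store a sprite entry, not the raw SATB bit layout):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int16_t  x;      /* left edge in screen coords           */
    uint16_t cell;   /* pattern/cell offset                  */
    uint8_t  wide32; /* 1 = 32px wide, 0 = 16px wide         */
} Spr;

/* Clip a 32px-wide sprite to 16px when one column is fully offscreen. */
static void clip32(Spr *s, int16_t left, int16_t right)
{
    if (!s->wide32) return;
    if (s->x + 0x10 <= left) {          /* left column fully offscreen  */
        s->wide32 = 0;                  /* drop to 16px wide            */
        s->x     += 0x10;               /* re-anchor at the right half  */
        s->cell  |= 0x02;               /* select the right 16px column */
    } else if (s->x + 0x10 >= right) {  /* right column fully offscreen */
        s->wide32 = 0;                  /* left column only; no re-anchor */
    }
}
```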

    You can't clip any tighter than that. Something to keep in mind, though, is that the 256 pixel buffer is fixed in size. If you make the screen size smaller, that buffer size stays the same. So a 256 pixel wide screen has a 1:1 ratio to the buffer. You couldn't make a huge sprite layer/image scroll from left to right, because at certain points it would need 17 sprite cells (16 pixels wide) to fill the edges: 17x16 = 272, which is greater than 256. But the PCE is extremely flexible in how you define the visible display area of the video frame. If you set the display to 240 pixels wide, that's 15 sprite cells. The widest scroll point would be 16 cells, not 17, and thus you could do a seamless scroll of a huge sprite (given the tight clipping I showed above). 240 pixels really isn't that noticeable a difference from 256; some SNES games ran with this size clipped window. Now, imagine a "tate" mode vertical shmup. You could set the width to something like 192 pixels wide and fake some decent looking BG layer effects. Imagine this idea taken to the SGX - imagine how many individual moving layers you could fake.
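The scroll arithmetic above can be sketched as a one-liner: a scrolling sprite "layer" needs at most screen_width/16 + 1 columns of 16px cells mid-scroll, against the fixed 256-pixel (16-cell) per-line buffer.

```c
#include <assert.h>

/* Worst-case number of 16px sprite columns needed to cover a scrolling
 * screen of the given width (both edges partially covered). */
static int cells_needed(int screen_width)
{
    return screen_width / 16 + 1;
}
```

So a 256-wide screen needs 17 cells at the worst scroll position (overflow), while 240-wide needs only 16, which just fits the buffer.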

Or take another approach. You could make a game like Sonic, where the character is 32 pixels wide, with the screen res clipped to something like 224 wide, and clip the vertical height with a larger display box (to make the screen area appear wider and less square). Besides a couple of key issues, you could basically create a whole BG layer of sprites.

elmer

Quote from: TurboXray on 09/16/2015, 12:47 PMOr take another approach. You could make a game like Sonic, where the character is 32 pixels wide, with the screen res clipped to something like 224 wide, and clip the vertical height with a larger display box (to make the screen area appear wider and less square). Besides a couple of key issues, you could basically create a whole BG layer of sprites.
I don't know why I missed this when you posted it a week ago, but nice explanation and what a great idea!  :D

TurboXray

I had worked out a system with a sprite layer/map that used 32x64 entries. Of course, these were meta-tile entries for a look-up table into segments of sprites, etc. But the optimization was such that you could use the 32x64 sprite sizes for larger areas - in other words, cut down on the SATB usage. Having a 32x64 metatile setup would also mean less work for the cpu compared to 32x32. The cpu overhead is going to be greater than that of a tilemap, but it's still doable.

 Take a 224x192 screen (playable area; a status bar can fill the rest if needed). In terms of 32x32 sprites, that's a 7x6 area, so 42 sprite entries. That's also assuming a full solid screen of sprites, which isn't really what I had in mind, so it would actually be lower than that. And of course, you might not want to use 32x32 segments, but maybe 16x16/32x16/16x32 here and there - then the number goes back up. So I figured even if you get close to the mid 50s, that's plenty left over for a platformer, given the large sizes PCE can throw out there. If done right, it could come off looking pretty good.

 Like I said, there would definitely have to be some limitations on what the character can move over/walk through. For instance, Chemical Plant (level 2 in Sonic 2) is easy to do with this setup. But Emerald Hill (the first level) presents some problems, as the main character is able to walk through/over the foreground area in some parts (busting the sprite limit). It would need a further clipped screen of 216 wide to handle that, or gaps added in the map area to relieve sprite congestion on those scanlines where the main character would be. So it has some design limitations. Heh, you could get crazy and make such areas lower color count sprites and manually composite the main character against that area each frame. Might be doable at 60fps, but definitely at 30fps.
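The SATB budget math above, written out (a trivial sketch - play-area width and height divided by the meta-cell size, against the 64-entry SATB):

```c
#include <assert.h>

/* Number of SATB entries to tile a w x h play area solid with cw x ch
 * sprite cells (assumes exact division, as in the 224x192 example). */
static int satb_entries(int w, int h, int cw, int ch)
{
    return (w / cw) * (h / ch);
}
```

224x192 in 32x32 cells is 42 entries; in 32x64 meta-entries it drops to 21, both comfortably under the 64-entry limit.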

TailChao

Quote from: TurboXray on 09/16/2015, 12:47 PMThink of the sprite overflow as one large 256 pixel buffer. This includes transparent pixels as well (consoles really didn't have the luxury of only including opaque pixels). So, PCE will process all 64 sprite entries every scanline, but it's the buffer that prevents all sprites to be shown on any given line.
I've wondered for quite some time if Hudson designed the sprite system on the PCE closer to the NES (which uses eight shift registers for each sprite on a line) rather than the SNES or Genesis which have true linebuffers. That would make more sense given the multiple resolutions.

Quote from: TurboXray on 09/16/2015, 12:47 PMYou'll notice that all 32 wide sprites can be easily divided into columns of 16 pixels wide.
This is such good design, and makes it so easy to split 32px wide sprites into 16px columns when they near the screen edges.

touko

I have an idea for reordering sprites in a brawler, for example.
You keep a copy of your actual SATB in VRAM, refreshed with DMA (VRAM->VRAM) each frame (so you have 2 SATBs in VRAM: the real one and a copy).
You sort your sprites and build a DMA list of each sprite to copy from your false SATB to your real one.
It's free and really fast, even more so if you put the VDC in 10.74 MHz mode.
If the 1 frame delay is a problem, you can change the SATB DMA mode (VRAM SATB -> VDC SATB) from automatic to manual, and do it when all the VRAM DMAs are complete.
I already have a DMA list driven by interrupts; I'll just have to do the sorting routine.

TurboXray

Touko, can you explain a little more?
touko

Quote from: TurboXray on 10/13/2015, 11:55 AMTouko, can you explain a little more?
I'm going to try .  :wink:

The idea is to have a copy in VRAM (call it SAT2) of your main SAT (call it SAT1, also in VRAM).
First you copy your SAT1 to your SAT2 each frame with DMA (VRAM -> VRAM).
Next, in your engine, you sort all your sprites. For example, if your sprite1 should come in front of sprite2, you add a DMA entry copying the 4 words of sprite1 (from SAT2) to an earlier position in SAT1 than sprite2 (which must be copied too).
When all your sprites are sorted (and your DMA list is complete), you do a manual DMA VRAM->SATB once all the transfers in your DMA list are done.
Of course, all the DMAs must be in a DMA list (except the VRAM-to-SATB one), with the SAT1-to-SAT2 copy first, and auto SATB DMA must be off.

The goal is to use DMA to copy the sprite attributes, not the CPU.
The CPU is only used for sorting the sprites and building the DMA list in RAM.

Quotebefore sorting, first DMA transfer in your list
  SAT1                            SAT2
   spr3                              spr3
   spr2                              spr2
   spr1                              spr1

then after sorting, one transfer per sprite in your list (here we want spr1 in front of spr2)
  SAT1                            SAT2
   spr3                              spr3
   spr1                              spr2
   spr2                              spr1
Of course, if you are using meta-sprites (with aligned sprites) this will work even better.
I don't know if I'm clear :-k (but for me I am :mrgreen:)
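A rough C sketch of the list-building step (the SAT addresses and the DmaOp struct are made up for illustration - a real version would live in asm and feed the VDC's VRAM-to-VRAM DMA registers):

```c
#include <assert.h>
#include <stdint.h>

#define SAT1 0x7000u  /* word address of the live SAT in VRAM (example) */
#define SAT2 0x7100u  /* word address of the snapshot copy    (example) */

typedef struct { uint16_t src, dst, words; } DmaOp;

/* order[i] = index (into SAT2) of the sprite to draw at priority slot i.
 * Emit one 4-word VRAM->VRAM copy per sprite, writing SAT1 in draw order. */
static int build_list(const uint8_t *order, int n, DmaOp *list)
{
    for (int i = 0; i < n; i++) {
        list[i].src   = (uint16_t)(SAT2 + order[i] * 4u);
        list[i].dst   = (uint16_t)(SAT1 + i * 4u);
        list[i].words = 4;
    }
    return n;
}
```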

TurboXray

 What's the savings in cpu cycles? Something slow like (unoptimized, straight code)... LDA [vector],y -> STA port -> INY would be 15 cpu cycles per byte, or 120 cpu cycles per SAT entry (8 bytes).

 I had a linked-list system and SAT in local ram with embedded opcodes (ST1/ST2). It was 5 cycles a byte + one JMP ~return~. The overhead was one JMP [table,x]. The table was a list of JMP $address entries jumping to the start of each ST1/ST2 list. That's 44 cycles for a single SAT copy into vram, without the overhead of calling/sorting. Of course, the downside is a bloated SAT array in local ram, and accessing the SAT is a little more complex (but still doable and optimizable). Let's see: JMP [addr,x] is 7 cycles, JMP is 4 cycles, so that bumps it up to 44+7+4 = 55 cycles per SAT entry. 455/55 = ~8 SAT entries per line.

 So 8 SAT entries per scanline. We know V-V DMA is 336 bytes per scanline in 10MHz mode (going by the other thread), so the max theoretical SAT updates per scanline via DMA is 42. It's going to be lower in practice, but even if it was something like 30 realistically... that's way faster than 8.

 I think you've got a winner here Touko! So keep a linked list in local ram with a reference to a DMA table. 8 bytes takes 16 VDC cycles. At 10MHz, you wouldn't even need to poll the status flag. Nice.

 The nice thing too is that this DMA approach lends itself nicely to meta-sprite objects. A single DMA call could handle all the meta-cell entries for one object.
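The budget comparison above, as arithmetic (using the post's numbers: ~455 CPU cycles per line, 55 cycles per entry via the ST1/ST2 jump-list trick, vs ~336 DMA bytes per line at the 10MHz dot clock, 8 bytes per SAT entry):

```c
#include <assert.h>

/* SAT entries the CPU can push per scanline with the embedded-opcode trick. */
static int cpu_entries_per_line(int line_cycles, int cycles_per_entry)
{
    return line_cycles / cycles_per_entry;
}

/* Peak SAT entries per scanline via VRAM->VRAM DMA. */
static int dma_entries_per_line(int bytes_per_line)
{
    return bytes_per_line / 8;  /* 8 bytes per SAT entry */
}
```

That's the 8-vs-42 gap being discussed: even at a realistic 30 entries, DMA is several times faster than the best CPU copy.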

touko

QuoteI had a link list system and SAT in local ram with embedded opcodes (ST1/ST2).
For me it's in my DMA list  :wink:
It's this embedded list that gave me the idea.

QuoteThe nice thing too, is that this DMA approach lends itself nicely to meta-sprite objects too. A single DMA call could handle all meta-cell entries for that object.
Yes, that's the main idea; this is why I took a brawler as the example.

QuoteAt 10mhz, you wouldn't even need to poll the status flag.
I use interrupts, no need for the status flag. :wink:
For now my SATB DMA interrupt starts my DMA list, which then continues and finishes by itself.

elmer

I can totally understand that getting your SAT sorted for display is quick with this system ... but haven't you just dramatically increased the complexity of the code that updates the actual sprite positions (and palette, if you're going to flash it)?

Are you expecting position updates to be written directly to VRAM, or are you still expecting to update a RAM-based SAT and then copy that to VRAM each frame (before doing the sorting)?