
Faster fade out code?

Started by DildoKKKobold, 09/20/2015, 11:30 AM


DildoKKKobold

Here is my fade out code:


   
fade_out()
{
   int i, base, clr;
   char j;

   for (j = 0; j < 8; j++)
   {
      for (base = 0; base < 512; base += 64)
      {
         for (i = base; i < base + 64; i++)
         {
            clr = get_color(i);
            if ((clr & 7) > 0)   clr = clr - 1;
            if ((clr & 56) > 0)  clr = clr - 8;
            if ((clr & 448) > 0) clr = clr - 64;
            set_color(i, clr);
         }
         vsync();
      }
   }
   cls();
   reset_satb();
   satb_update();
   vsync();
}


Unfortunately, even when dividing it up into 64-color blocks, it still causes flicker on real hardware. I'm guessing it's just too slow written in C, but I'm not good enough with assembly. Any help would be appreciated!
For a good time with the legendary DarkKobold, email: kylethomson@gmail.com
Dildos provided free of charge, no need to bring your own! :lol:
DoxPhile .com / chat

touko

#1
I think the best way to optimise your routine is to do the fade in a buffer for all the palettes, then transfer the whole buffer after vsync in asm with a TIA block transfer.

Like this:

/* A 256 bytes buffer is enough for 8 palettes */
int my_buffer[1024];

#asm

   stz $402
   stz $403

   tia _my_buffer , $404 , 1023

#endasm

My fade routine is close to yours and is very fast, but it's in asm.



OldMan

int dv;
.
.
.
         clr = get_color(i);
         dv = 0;

         /* low color bits : 0000 0111  */
         if (clr & 0x0007 ) dv ++;

         /* mid color bits : 0011 1000 */
         if (clr & 0x0038 ) dv = dv + 8;  /* iirc, this is faster than += in huc. try both */

         /* hi color bits : 1 1100 0000 */
          if (clr & 0x01c0 ) dv = dv +  64;

          clr = clr - dv;
           set_color(i, clr);         
.
.
.
...............................................................................
that's off the top of my head, so double check the hex values :) And everything else....

Assuming it works, the next step is to move that to asm; iirc, an int parameter will come in in the A and X registers, so the color won't have to be loaded; the rest is a pretty straightforward conversion.
( look up the asm code in the listing to see how it's done. That's how I learned 650x asm :)

For the low sets of bits, you can just check the low byte; for the high set, shift the color right (?) for the check (then you can and with 0xe0 ) But that's a bit of asm optimization you may not need (or want to do).
.................................................................................
Just out of curiosity, why are you checking for >0 anyway? clr can't be negative, and an & won't change that. Any nonzero value is true in C, so you don't have to compare the result to anything....
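For anyone following along outside HuC, OldMan's per-color step can be sketched in plain hosted C like this (fade_step is my name, not from the thread; note the mid-field mask matching the 0011 1000 comment is 0x38):

```c
#include <assert.h>

/* One fade-out step on a 9-bit PCE color word (3 bits each of G, R, B).
 * Hosted-C sketch of OldMan's idea, not HuC code. */
int fade_step(int clr)
{
    int dv = 0;
    if (clr & 0x0007) dv += 1;   /* low field nonzero: subtract 1  */
    if (clr & 0x0038) dv += 8;   /* mid field nonzero: subtract 8  */
    if (clr & 0x01c0) dv += 64;  /* high field nonzero: subtract 64 */
    return clr - dv;
}
```

Since each 3-bit field holds at most 7, seven calls take any color to black.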

TurboXray

#3
What touko said. Wait for vblank, read all the colors (sprite and BG, if desired) into a buffer in RAM. Do your alterations to those values in RAM, then wait for vsync and upload the changes during vblank. Rinse, repeat. HuC is still going to be slow, but if you do the operations during normal/whole frames and only update the changes during vblank, it should do the job.

 Fading out is easy, fading in is a bit more complex.

 On a side note: I had an idea for an RGB to YUV conversion table, for special fading-type effects. YUV is nicer to work with IMO and gives you a wider range of features. Of course, going from YUV to RGB will need a different set of tables and take a bit longer.

spenoza

Quote from: TurboXray on 09/20/2015, 04:23 PMFading out is easy, fading in is a bit more complex.
When you fade in you know what your colors are going to be, so couldn't you pre-calculate/prepare the fade-in and then just cycle through the known palettes?

TurboXray

Yeah, you need to know the values to reach rather than just checking for overflow or floor (a fixed value). It requires a little more logic to test for overflow during the addition. There are quite a few ways to approach fade-in, but they are all more complex, with more operations, than a simple fade-out.

 Fixed point deltas are one way. That requires a very large buffer (entries * RGB * delta, or 512*3*2) in ram, and you have to set up the initial distance calculation for each color (which can take some time). It's wasteful on ram, but every R/G/B is faded in equally.

 Rate of change delta is another way. It requires a smaller size buffer, but doesn't fade in equally. You basically take the R/G/B color of the destination and copy these as the deltas to subtract from the destination palette block (your first initial setup). On every call, you subtract 1 from each delta, then subtract from the RGB block and write to the buffer to be uploaded. The R/G/B destination values are never altered, just used as a value to subtract from. When the delta reaches 0, because on each call you subtract by one, it means that particular R/G/B is at full value. As you can see, lower values will reach their max value faster than large values. It looks decent, and it's fast.
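The "rate of change" delta scheme described above can be sketched for a single 3-bit channel like this (plain C; the struct and function names are mine, purely illustrative): the delta starts as a copy of the destination value and is decremented each step, so a channel is done when its delta hits 0, and small values finish sooner than large ones.

```c
#include <assert.h>

/* Illustrative sketch of the rate-of-change fade-in, one channel. */
typedef struct { int target; int delta; } channel_t;

void fadein_init(channel_t *ch, int target)
{
    ch->target = target;
    ch->delta  = target;   /* start fully subtracted, i.e. at black */
}

int fadein_step(channel_t *ch)
{
    if (ch->delta > 0)
        ch->delta--;
    return ch->target - ch->delta;   /* value to display this frame */
}
```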

 

touko

@DarkKobold: And maybe you don't need to fade all the palettes.

Gredler

Quote from: touko on 09/21/2015, 08:02 AM@DarkKobold:And may be you don't need to fade all palettes .
This is what I was thinking: only apply the fade to specific sprites that can't be animated onto the screen (background and UI elements).

touko

Yes, it's useless to do it on all 32 palettes 99% of the time.
In practice it's 4-6 palettes max.

DildoKKKobold

#9
Quote from: TurboXray on 09/20/2015, 06:37 PMYeah, you need to know the values to reach rather than just checking for overflow or floor (fixed value). It requires a little more logic for testing for overflow for the addition process. There are quite a few ways to do fade in approach, but they are all more complex with more operations than a simple fade out.

 Fixed point deltas are one way. That requires a very large buffer (entries * RGB * delta or 512*3*2) in ram and setup the initial distance calculation for each color (which can take some time). It's wasteful on ram, but every R/G/B is faded in equally.

 Rate of change delta is another way. It requires a smaller size buffer, but doesn't fade in equally. You basically take the R/G/B color of the destination and copy these as the deltas to subtract from the destination palette block (your first initial setup). On every call, you subtract 1 from each delta, then subtract from the RGB block and write to the buffer to be uploaded. The R/G/B destination values are never altered, just used as a value to subtract from. When the delta reaches 0, because on each call you subtract by one, it means that particular R/G/B is at full value. As you can see, lower values will reach their max value faster than large values. It looks decent, and it's fast.

 
This is way too complex; you just need to subtract each color from 511 and do it backwards. Excuse the pseudocode.

int palette_holder[512];

for c = 0 to 511
   palette_holder[c] = 511 - get_color(c);

for i = 1 to 7
for j = 0 to 511
       clr = palette_holder[j];
       dv = 0;

       /* low color bits : 0000 0111  */
       if (clr & 0x0007 ) dv ++;

       /* mid color bits : 0011 1000 */
       if (clr & 0x0038 ) dv = dv + 8;  /* iirc, this is faster than += in huc. try both */

       /* hi color bits : 1 1100 0000 */
       if (clr & 0x01c0 ) dv = dv + 64;
       clr2 = get_color(j);
       clr = clr - (dv XOR 511);
       clr2 = clr2 + dv;
       set_color(j, clr2);
       palette_holder[j] = clr;


Not sure if HuC has XOR though.

touko

If I remember correctly, XOR is A ^ B.

DildoKKKobold

#11
Quote from: touko on 09/20/2015, 01:05 PMI think the best way for optimising your routine is doing the fade in a buffer for all palettes and transfer all the buffer after a vsync in asm with a tia bloc transfer .

like that:

/* A 256 bytes buffer is enough for 8 palettes */
int my_buffer[1024];

#asm

   stz $402
   stz $403

   tia _my_buffer , $404 , 1023

#endasm

My fade routine is close to yours, and is very fast,but in ASM .
So, rather than create a buffer for all the palettes at once, I'd like to split it up into chunks of 64 "colors" at a time. The problem is, I have no idea what the stz $402 / stz $403 lines do. Set addresses to zero... but why?


cabbage

#12
#include "huc.h"
int my_buffer[1024];
main(){
#asm
   stz $402
   stz $403
   tia _my_buffer, $404, 511
#endasm
}
Compiles successfully...
A variable declared in C, e.g. my_buffer, is accessed with a leading underscore in asm, as in _my_buffer.

touko

#13
QuoteThis fails to compile:

19000 2D:B710 tia _my_buffer, $404, 511
Undefined symbol in operand field!
int my_buffer[1024] must be a global variable (declared before the main procedure).
Tested and it works fine for me, but beware: #asm and #endasm mustn't be at the left edge; you must add one or two space characters before them.

EDIT: cabbage has answered.

@DarkKobold: You can also copy the contents of your VCE palettes into the buffer the same way:

 tai $404 , _my_buffer , 511

Much faster than in C.

DildoKKKobold

Quote from: touko on 10/01/2016, 08:05 AM
QuoteThis fails to compile:

19000 2D:B710 tia _my_buffer, $404, 511
Undefined symbol in operand field!
int my_buffer[1024] must be a global variable (declared before the main procedure).
Tested and it works fine for me, but beware: #asm and #endasm mustn't be at the left edge; you must add one or two space characters before them.

EDIT:cabbage has answered .

@DarkKobold:You can also copy the content of your VCE palettes in the buffer the same way

 tai $404 , _my_buffer , 511

Much faster than in C.
Oh, I didn't realize it needed to be a global. That is 2k of ram dedicated just to fadeout then...

touko

#15
QuoteOh, I didn't realize it needed to be a global. That is 2k of ram dedicated just to fadeout then...
Why 2k?
You really only need 1k, and in fact less, because all the palettes are rarely (if ever) used at the same time.
I think a 6-palette buffer is enough; that's 192 bytes, and it should speed up your fade routine a lot (no need to copy/fade unused palettes).

TurboXray

Some PCE games just have pre-compiled fades for palettes. Sure, it takes up rom (not a whole lot though) but it doesn't need any ram.

Arkhan Asylum

The trick is to write one fade routine.  It's not a fade in or a fade out.  It's a "fade to this palette" routine.

It will work for fading to black, or for fading to a color.    Just make sure you pay attention to the way PCE lays out a palette, and inc/dec accordingly.    It's a little easier if you work with octal number formatting.  It seems more intuitive that way.

I do this and use a work palette (the current screen palette), and a target palette, and simply fade to it, and copy when ready.
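A minimal sketch of that "fade to this palette" step, per 3-bit field of a 9-bit GRB color word (plain C; fade_toward and step_field are my names, not Arkhan's code). Fading to 0x000 gives the fade-out; fading from 0x000, the fade-in:

```c
#include <assert.h>

/* Move one 3-bit field a single unit toward its target. */
static int step_field(int cur, int tgt)
{
    if (cur < tgt) return cur + 1;
    if (cur > tgt) return cur - 1;
    return cur;
}

/* One step of "fade to this palette" on a 9-bit GRB color word. */
int fade_toward(int clr, int target)
{
    int out = 0, shift;
    for (shift = 0; shift < 9; shift += 3) {
        int c = (clr    >> shift) & 7;
        int t = (target >> shift) & 7;
        out |= step_field(c, t) << shift;
    }
    return out;
}
```

Since each field spans at most 7 units, any color converges to any target in at most seven steps.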



This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

touko

#18
QuoteThe trick is to write one fade routine.  It's not a fade in or a fade out.  It's a "fade to this palette" routine.
Yes, I did this too.
You can fade from/to a palette; for example, a fade-out is considered a fade to a black palette.

TurboXray

Yeah, a fade "to". Not that anyone probably cares, but there are weighted and unweighted fade methods (as in, how the R/G/B elements reach a target value). Obviously weighted takes more time, but looks better for fades to black or from black IMO.

  And if you're ambitious enough, you can even do real time RGB->YUV, make changes, and then back again for additional effects. Though that's probably more fancy than practical - "look what I can do" type effect.

touko

#20
QuoteObviously weighted takes more time, but looks better for fades to black or from black IMO.
I fade each RGB value, not a single inc/dec for the whole color.

TurboXray

Quote from: touko on 10/04/2016, 03:04 PM
QuoteObviously weighted takes more time, but looks better for fades to black or from black IMO.
I fade each RGB values,and not a single inc/dec for the whole color .
Yeah, but do you use a fixed point value (8bit:8bit) for each R/G/B element? That's what I mean by weighted. Otherwise, you'll get to the target value of each R/G/B faster on one element vs the others. I.e. it's not uniform across all elements within a color slot's R/G/B system.

touko

#22
QuoteYeah, but do you use a fixed point value (8bit:8bit) for each R/G/B element? That's what I mean by weighted.
No, mine is simpler: better than a simple inc/dec, but definitely not as advanced as what you call weighted.

Quoteit's not uniform across all elements within a color slot's R/G/B system.
Yes, but it's very acceptable.
For the best result with an RGB fade, you must normalise all the values before fading.

Arkhan Asylum

Weighted "looks" better, but in practice it's never noticeable to anyone playing a game.

I put in a weighted one into Inferno and ended up not using it.

elmer

Quote from: Psycho Arkhan on 10/04/2016, 08:21 PMWeighted "looks" better, but in practice is never noticeable to anyone playing a game.
That's been my experience, too.

If the fade is running fast enough that people aren't given the time to stare at and analyze each individual step, then you can actually get away with a really simple algorithm and find that it still looks good on the screen (as in, I've never, ever, heard anyone complain about it).


Arkhan Asylum

Atlantean's fades aren't weighted IIRC.   They just do your basic "fade each color til it gets where it needs to be".

People claim it "makes things go grey", but, whatever.


elmer

Quote from: Psycho Arkhan on 10/05/2016, 02:41 AMPeople claim it "makes things go grey", but, whatever.
Yep, that's the simple way of doing it that I got away with for many years (until hardware fades took over).  :wink:

If I ever wanted to be "clever" and do it properly, then I'd probably use the classic integer version of the Bresenham line-draw algorithm to decide when to change each RGB component.
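That Bresenham idea can be sketched like this (plain C; this is my reading of the suggestion, with made-up names, not elmer's actual code): accumulate each channel's total distance against the step count, so every channel lands on its target in the same number of frames, with its changes spread evenly.

```c
#include <assert.h>

/* One color channel driven by a Bresenham-style error accumulator. */
typedef struct { int value, target, d, err, steps; } bres_chan_t;

void bres_init(bres_chan_t *c, int start, int target, int steps)
{
    int d = target - start;
    c->value  = start;
    c->target = target;
    c->d      = d < 0 ? -d : d;   /* total distance to cover */
    c->err    = 0;
    c->steps  = steps;            /* frames the fade should take */
}

void bres_step(bres_chan_t *c)
{
    c->err += c->d;
    while (c->err >= c->steps && c->value != c->target) {
        c->err -= c->steps;
        c->value += (c->target > c->value) ? 1 : -1;
    }
}
```

After `steps` calls, a channel has moved exactly `d` units, so channels with small and large distances finish together.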

TurboXray

That's basically what I was referring to with the fixed point values (a tiny LUT for a bresenham line algo).

 Sometimes you just write stuff... just because you can. I think most fade-to routines don't need to run during realtime gameplay (i.e. end of stage, or transition into another area, etc. - where the action is paused). At least that's what I've needed it for, but I can see some situations where it might be needed during normal gameplay. I always just used precalculated subpalettes for that, but I guess if you set up a background process system or "thread", you could call such a routine ahead of time and have it do the fade steps as needed (kind of like a background time-sliced decompression process).

 Arkhan, did you use the fade routines while the gameplay was in action, or did you pause the action and use the fade as a transition? Just curious.

Arkhan Asylum

Atlantean fades while SOME stuff is happening on screen.   I am pretty sure the fades will work mid-game, though.

Most of the fading was for transitioning, though.   No point in fading mid-game in a shootything

Reflectron had a bunch of palette cycling during the entire game.

DildoKKKobold

Quote from: touko on 09/20/2015, 01:05 PMI think the best way for optimising your routine is doing the fade in a buffer for all palettes and transfer all the buffer after a vsync in asm with a tia bloc transfer .

like that:

/* A 256 bytes buffer is enough for 8 palettes */
int my_buffer[1024];

#asm

   stz $402
   stz $403

   tia _my_buffer , $404 , 1023

#endasm

My fade routine is close to yours, and is very fast,but in ASM .
So, I'm really confused.

First, why the store zero in $402 and $403?

Second, if I want to split this into chunks of 64, how do I tell it $404 plus 64 etc?

This is why I hate assembly. I know, I'm definitely the idiot of the thread.

Gredler

I feel the need to post something to make DK feel like less of an idiot.

Can't you just store all the palettes to an array, then create a loop that lerps from black to the array color or the array to black?

#Python
#C#
#Scripting

OldMan

QuoteFirst, why the store zero in $402 and $403?
Select palette slot 0

QuoteSecond, if I want to split this into chunks of 64, how do I tell it $404 plus 64 etc
You don't use $404+anything. $404/$405 is the color.
You set $402/$403 to the slot number you want (0, 64, 128, etc.).

In general you set the starting slot # in $402/$403, and send the colors to $404/$405.
The slot increments after every write to the high byte (I think), so you can just loop through the
palette and pump them out. If you want to break it into 64-color chunks, write the first 64 colors,
then the next 64 colors, etc.

You can set the slot number to start at any slot, even in the middle of a palette, afaik.


QuoteCan't you just store all the palettes to an array, then create a loop that lerps from black to the array color or the array to black?
Yes. But that won't update the palettes until you send them to the VCE.
If you really want to, you can read the color from the VCE, fade it, and write it back, without any intermediate array. It's just quicker to use an array (less overhead).

TurboXray

Quote from: DildoKKKobold on 10/06/2016, 04:46 PM
Quote from: touko on 09/20/2015, 01:05 PMI think the best way for optimising your routine is doing the fade in a buffer for all palettes and transfer all the buffer after a vsync in asm with a tia bloc transfer .

like that:

/* A 256 bytes buffer is enough for 8 palettes */
int my_buffer[1024];

#asm

   stz $402
   stz $403

   tia _my_buffer , $404 , 1023

#endasm

My fade routine is close to yours, and is very fast,but in ASM .
So, I'm really confused.

First, why the store zero in $402 and $403?

Second, if I want to split this into chunks of 64, how do I tell it $404 plus 64 etc?

This is why I hate assembly. I know, I'm definitely the idiot of the thread.
It's not assembly, but direct hardware interfacing. You could access the ports directly in C.

 $402/$403 make up a single port for the VCE color # or slot. Because there are 512 color slots in the VCE, the index is larger than an 8-bit value for a single port to handle. So the 16-bit port is spread across two 8-bit ports; 0x402 is the LSByte and 0x403 is the MSByte.

 VCE and VDC ports tend to have what is known as a "latch" system. This means that when you write the upper address of a 16-bit port, it triggers the transfer of the contents of the two ports to the internal place they need to go (be it VCE or VDC).

 In the case of the VCE, 0x402 is the LSB, and 0x403 is the MSB and latch. Once 0x403 is written to, the contents are transferred to whatever reg is internal to the VCE. But not until that latch port is accessed - so the order of port-pair access is very important.

 On the VCE, here are some ports:
 0x402/0x403 = is the color slot you want to update.
 0x404/0x405 = the color value to update on the corresponding color slot.

 One other thing to note: While you can constantly tell the VCE what specific color you want to update, it does have an "auto increment" internal mechanism that automatically advances to the next color slot after a successful write/update (i.e. latch port). Same with reading color data from the VCE.
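To make the latch and auto-increment behaviour concrete, here is a tiny software model of the port protocol as described above - emphatically not hardware code, and the names (vce_t, vce_write) are mine. The model simplifies by sharing one pending-LSB latch between the two port pairs; the point is that the MSB write is what commits, and a committed color advances the slot.

```c
#include <assert.h>

/* Toy model of the VCE color-port protocol described in the thread. */
typedef struct {
    int colors[512];
    int slot;   /* internal color-slot pointer */
    int lo;     /* pending LSB from the last $402/$404 write */
} vce_t;

void vce_write(vce_t *v, int port, int byte)
{
    switch (port) {
    case 0x402: v->lo = byte; break;                         /* slot LSB */
    case 0x403: v->slot = ((byte & 1) << 8) | v->lo; break;  /* latch slot */
    case 0x404: v->lo = byte; break;                         /* color LSB */
    case 0x405: /* latch: commit 9-bit color, then auto-increment slot */
        v->colors[v->slot] = ((byte & 1) << 8) | v->lo;
        v->slot = (v->slot + 1) & 511;
        break;
    }
}
```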

DildoKKKobold

int my_buffer[64];

fade_out()
{
   int i, clr;
   char j, k;

   for (j = 0; j < 8; j++)
   {
      #asm
         stz $402
         stz $403
      #endasm
      for (i = 0; i < 512; i += 64)
      {
         for (k = 0; k < 64; k++)
         {
            clr = get_color(i + k);
            if (clr & 7)   clr = clr - 1;
            if (clr & 56)  clr = clr - 8;
            if (clr & 448) clr = clr - 64;
            my_buffer[k] = clr;
         }
         vsync();
      #asm
         tia _my_buffer , $404 , 64
      #endasm
      }
   }
   cls();
   reset_satb();
   satb_update();
   vsync();
}


So, here's my attempt to split it into 64-color chunks. It... fails. Miserably. I'd assume it's latching fine, and should store the current state of the latches through each vsync.

OldMan

Quoteint my_buffer[64];
.
.
.
 tia _my_buffer , $404 , 64
Ints are 2 bytes, so you're only transferring 32 colors. Try using 128 as the length.

Also, be careful mixing ints and chars. HuC doesn't promote chars to ints.
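The byte math can be checked in portable C, with uint16_t standing in for HuC's 16-bit int (names are mine): the tia length operand counts bytes, not colors, so a 64-entry buffer needs a length of 128.

```c
#include <assert.h>
#include <stdint.h>

/* HuC's int is 16 bits, so every color entry occupies 2 bytes.
 * 64 entries -> 128 bytes for the tia length operand. */
#define NUM_COLORS 64
static uint16_t my_buffer[NUM_COLORS];

/* byte length to pass to the tia block transfer for this buffer */
unsigned tia_length(void) { return (unsigned)sizeof my_buffer; }
```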

touko

#35
And be careful: you have only one pointer into the VCE's hardware color table.
You must select the color entry for writing AND for reading (unless you want to read one palette and write the next), like this:

         #asm
          ; maybe not needed here, because get_color() selects the palette entry every time
            stz $402
            stz $403

         #endasm
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
              if (clr&448) clr = clr - 64;
              my_buffer[k] = clr;                 
           }
          vsync();       
       #asm
          stz $402
          stz $403

          tia _my_buffer , $404 , 64
       #endasm

Otherwise you read from palette 0 and write to palette 2, as you did.

DildoKKKobold

For posterity, here is the final function. I will be testing it on hardware shortly. Whoever sees this in the near or far future, feel free to use it.


fade_out()
{
   int i, clr;
   char j, k;

   for (j = 0; j < 8; j++)
   {
      for (i = 0; i < 512; i += 64)
      {
         for (k = 0; k < 64; k++)
         {
            clr = get_color(i + k);
            if (clr & 7)   clr = clr - 1;
            if (clr & 56)  clr = clr - 8;
            if (clr & 448) clr = clr - 64;
            my_buffer[k] = clr;
         }
         vsync();
         /* reading slot i-1 leaves the VCE pointer at slot i via auto-increment */
         clr = get_color(i - 1);
      #asm
         tia _my_buffer , $404 , 128
      #endasm
      }
   }
   cls();
   reset_satb();
   satb_update();
   vsync();
}


EDIT: Oh, and a huge thanks to everyone for working with me. I'm glad that a simple fade didn't kill Catastrophy.

Gredler

Quote from: DildoKKKobold on 11/02/2016, 08:20 PMfor posterity, here is the final function. I will be testing it on hardware shortly. Whoever sees this in the near or far future, feel free to use it.


fade_out()
{
   
   int i, clr;
   char j,k;
   for (j = 0; j <8; j++)
   {         
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
              if (clr&448) clr = clr - 64;
              my_buffer[k] = clr;               
           }
          vsync();       
          clr = get_color(i-1);
       #asm   
          tia _my_buffer , $404 , 128
       #endasm
       }
   }   
     cls();
     reset_satb();
     satb_update();
     vsync();
}   


EDIT: Oh, and a huge thanks to everyone for working with me. I'm glad that a simple fade didn't kill catastrophy.
That would have been catastrophic

touko

#38
If you want your routine to be even faster, you can translate this into assembly:

Quotefor (j = 0; j <8; j++)
   {         
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
              if (clr&448) clr = clr - 64;
              my_buffer[k] = clr;               
           }
This loop is slow as hell.

TurboXray

#39
Quote from: DildoKKKobold on 11/02/2016, 08:20 PMfor posterity, here is the final function. I will be testing it on hardware shortly. Whoever sees this in the near or far future, feel free to use it.


fade_out()
{
   
   int i, clr;
   char j,k;
   for (j = 0; j <8; j++)
   {         
      for (i=0;i<512;i+=64)
      {
         for (k=0; k<64; k++)
         {
            clr = get_color(i+k);
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
              if (clr&448) clr = clr - 64;
              my_buffer[k] = clr;               
           }
          vsync();       
          clr = get_color(i-1);
       #asm   
          tia _my_buffer , $404 , 128
       #endasm
       }
   }   
     cls();
     reset_satb();
     satb_update();
     vsync();
}   


EDIT: Oh, and a huge thanks to everyone for working with me. I'm glad that a simple fade didn't kill catastrophy.
It's going to cause snow on the real system if this takes too long (goes into active display), which I think it will.

 Use the Txx instructions and read the color data directly into your my_buffer during vblank. Then do your modifications on the array - when each iteration has completed (one iteration of j), wait for vsync and Txx the buffer back to the color RAM port. That should prevent any snow on screen.


 If you slightly modify your code like this:
Quotefade_out()
{
   
   int i, clr;
   char j,k;
   
   vsync();
   
   for (j = 0; j <8; j++)
   {         
      for (i=0;i<512;i+=64)
      {
 
       temp = i;
       #asm
          lda temp
          sta $402
          lda temp+1
          sta $403
          tai $404, _my_buffer, 128
        #endasm
 
         for (k=0; k<64; k++)
         {
            clr = my_buffer[k];
            if (clr&7) clr = clr - 1;
            if (clr&56) clr = clr - 8;
            if (clr&448) clr = clr - 64;
            my_buffer[k] = clr;               
           }
          vsync();       
       #asm
          lda temp
          sta $402
          lda temp+1
          sta $403   
          tia _my_buffer , $404 , 128
       #endasm
       }
   }   
     cls();
     reset_satb();
     satb_update();
     vsync();
}   
It should get everything done during vblank and avoid snow on screen on the real system. That also includes reading in the next 64 colors as well (both transfers together only take about 1.5k cpu cycles). Note: You'll need a global variable "temp" or some such name, so that you can access the function's instance variable in asm. Also, I think the read port is $404 and not $406. If not, then change it to $406.

DildoKKKobold

Quote from: TurboXray on 11/03/2016, 02:07 PMIt's going to cause snow on the real system if this takes too long (goes into active display), which I think it will.

 Use the Txx instruction and read the color data directly into your my_buffer during vblank. Then do your modifications on the array - when each iteration is has completed (one iteration of j), then wait for vsync and Txx the buffer back to color ram port. That should prevent any snow on screen.
So, that is why I put the vsync right before the transfer - even if the code takes two frames, the transfer will still only occur right after a vblank. I'm actually concerned that if I make the code faster, I'll have to put in delays, as the fade is already pretty fast now. It's a minimum of 64 frames already.

Granted, I need to try my new code in hardware. I'll also try yours.

As a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?
AvatarDildoKKKobold.jpg
For a good time with the legendary DarkKobold, email: kylethomson@gmail.com
Dildos provided free of charge, no need to bring your own! :lol:
DoxPhile .com / chat
IMG

OldMan

QuoteI'm actually concerned that if I make the code faster, i'll have to put in delays,  as the fade is already pretty fast now.
Suck up the need to add a delay. Use a down-counter so it's tunable. It's not too bad if you use a dedicated fade routine. You can use the wait time to do other things, like loading new gfx....
Just my opinion.

QuoteAs a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?
You can do that already. It's just a pain, as they have to be accessed via the HuC stack pointer, which is slow...

TurboXray

Quote from: DildoKKKobold on 11/03/2016, 03:19 PM
Quote from: TurboXray on 11/03/2016, 02:07 PMIt's going to cause snow on the real system if this takes too long (goes into active display), which I think it will.

 Use the Txx instruction and read the color data directly into your my_buffer during vblank. Then do your modifications on the array - when each iteration has completed (one iteration of j), wait for vsync and Txx the buffer back to the color RAM port. That should prevent any snow on screen.
So, that is why I put the vsync right before the transfer - even if the code takes two frames, the transfer will still only occur right after a vblank. I'm actually concerned that if I make the code faster, I'll have to put in delays, as the fade is already pretty fast now. It's a minimum of 64 frames already.
Hmm.. there's an issue you might not be aware of: every time you read or write any VCE reg (that includes $400 and $401), you stop the VCE from reading the pixel bus that the VDC is constantly outputting to. Since it can't read from the pixel bus, it outputs the last color (pixel) that it did read. You get horizontal 'stretches' of colors across the screen - i.e. snow. Not just at the borders, but anywhere on a scanline. This actually happens when you turn the display "off" too, but since the whole screen is one color, you can't see the pixel "stretching". This is different from the color-update interference on other systems, where updating a color during active display shows that color as corruption on screen. The VCE doesn't do that, but reading or writing any VCE port gives the same stretching behavior regardless (read or write, color-update regs or other VCE regs).

 Here's an example video where I purposely do it:
So any access to the VCE does this, not just color writes. If your routine does manage to read in and modify all 64 colors within vblank, and you update on the following frame, then you'll be fine. And if that's the case, then don't worry about the code changes I made (unless you want more resources during vblank to do something else, but it doesn't look like it. You'd have to make a completely different system/function for that).


QuoteGranted, I need to try my new code in hardware. I'll also try yours.
Test yours first, and if it's good then don't worry about mine. Just keep my code in mind - i.e. the approach I took - as you might want to do more flexible fade routines in the future.

QuoteAs a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?
What TheOldMan said. It's a pain, because you have to generate the .s file, look at what index represents that variable, then go back and write an indirect-indexed load from it. It's not just that the instance variable inside the function is a stack object; there's no clean way to access it in asm without knowing its index on that stack for that specific instance variable. Indeed, it would be nice for HuC to pass this on to the asm block. If the assembler had a way to make scoped equates, then HuC could generate a function-scope equate list for each function (the index into the stack). Globals are just easier to transfer stuff to.

touko

QuoteAs a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?
Actually, as declared:
int i, clr;
char j,k;

are treated as local variables. If you want locals in asm, you must use the stack ($100 -> $106 should be safe enough), or you can use the classic push/pop.
But in fact you already have a bunch of temporary global variables reserved (like __temp, <_al, <_bl, etc.).

Arkhan Asylum

Quote from: TheOldMan on 11/03/2016, 03:44 PM
QuoteI'm actually concerned that if I make the code faster, i'll have to put in delays,  as the fade is already pretty fast now.
Suck up the need to add a delay. Use a down-counter so it's tunable. It's not too bad if you use a dedicated fade routine. You can use the wait time to do other things, like loading new gfx....
Just my opinion.

QuoteAs a question - I keep needing to add globals to do ASM. Could a future version of HuC do locals?
You can do that already. It's just a pain, as they have to be accessed via the HuC stack pointer, which is slow...
The entirety of Atlantean is written with global variables.

Just saying.

lol

ASM doesn't have a concept of local, really. You push/pop things to make them "local", simply by saving the state of all of the registers so you can fuck around with them again before popping the stack back to reset everything, but yeah

global = <3

you'll get faster code. 
This "max-level forum psycho" (:lol:) destroyed TWO PC Engine groups in rage: one by Aaron Lambert on Facebook "Because Chris 'Shadowland' Runyon!," then the other by Aaron Nanto "Because Le NightWolve!" Him and PCE Aarons don't have a good track record together... Both times he blamed the Aarons in a "Look-what-you-made-us-do?!" manner, never himself nor his deranged, destructive, toxic turbo troll gang!

elmer

I'm thinking that this sort of stuff is so commonly needed, that it really should be built into the HuC library.

Fading down is easy ... but the fun comes when you're fading back in.  :wink:

There you need to know what the desired palette is, and have that stored in memory somewhere.

And you need the buffer where you're going to calculate the next set of colors to send to the VCE.

Has anyone avoided needing *both* of these "target" and "current" buffers to be in memory at once?

The classic "cheap" fade-down looks OK, but the same thing run in reverse (increment a color component if it's not at target) has the effect of greying everything out, and then having the stronger colors appear later on.

It's not horrible, and I've done that before, and I believe that Arkhan does it, too.

A more sophisticated fade is nearly 3 times slower ... but that's still fast enough to calculate all 512 colors in only 2/3 of a 60Hz frame, and nobody runs a fade at 60-steps-per-second.

Does anyone have any opinion about including a decent fade in HuC?

TurboXray

So.. what are the dynamics at play in this design? Speed and memory size? Is the routine going to be an automatic thing? As in, it only takes a rate argument for fade in/out? If so, does it have complete control of the main code until it's finished? Or is it a lighter process that only does up to 64 sets of colors per call and is divided into prep and update functions, allowing game code to run at the same time (or at least the same frame)? How much memory are you going to require (important for HuCard projects)? Is the work buffer user-defined and passed along as a pointer (so it can be reused for something else in the project)? Or is it an internal statically-defined size that takes away from RAM regardless?

 I never really liked this trying to make a one-size-fits-all thing when designing libs/stuff for HuC. It'd be nice if it was something you directly include into the main source file (different small libs), rather than trying to attach it to the existing library (in startup). Though I think doing that would require restructuring the main lib bank, and having support for a bank directive directly in HuC.

DildoKKKobold

As an update, of course my code didn't work (for the reasons Bonknuts illustrated). His did. No surprise there.

I don't think this needs to be in the core of HuC. It would serve better as example code, which someone could work to their needs.

elmer

Quote from: TurboXray on 12/01/2016, 09:42 PMSo.. what are the dynamics at play in this design? Speed and memory size? Is the routine going to be an automatic thing? As in, it only takes a rate argument for fade in/out? If so, does it have complete control of the main code until it's finished? Or is it a lighter process that only does up to 64 sets of colors per call and is divided into prep and update functions, allowing game code to run at the same time (or at least the same frame)? How much memory are you going to require (important for HuCard projects)? Is the work buffer user-defined and passed along as a pointer (so it can be reused for something else in the project)? Or is it an internal statically-defined size that takes away from RAM regardless?
Good questions!  :-k

Taking control of the system would be impolite.

The goal would be to provide function calls that the HuC user can use to provide fast alternatives to writing their own code.

For example ...

void __fastcall get_colors( int *pbuffer<__td> );
void __fastcall get_colors( int index<color_reg>, int *pbuffer<__td>, unsigned char count<__tl> );

void __fastcall set_colors( int *pbuffer<__ts> );
void __fastcall set_colors( int index<color_reg>, int *pbuffer<__ts>, unsigned char count<__tl> );

void __fastcall fade_colors( int *psource<__si>, int *pdestination<__di>, unsigned char count<__al>, unsigned char fade<acc> );

Those make up a simple set of functions that do everything that DK wanted in Catastrophy, and ended up writing in either slow C code, or fast inline-assembly.

They do it fast, and they keep things flexible enough that you can use as-much or as-little resources as you need.

The "get" and "set" functions use TAI & TIA instructions for fast processing.

"count" is limited to a maximum of 128 for fast indexing.

"fade" is a value 0-7.

I *think* that's enough basic functionality for the end-user to build pretty much whatever they want.

Can you think of a better *practical* design?


QuoteI never really liked this trying to make a one-size-fits-all thing when designing libs/stuff for HuC. It'd be nice if it was something you directly include into the main source file (different small libs), rather than trying to attach it to the existing library (in startup). Though I think doing that would require restructuring the main lib bank, and having support for a bank directive directly in HuC.
Making the libraries modular would be great ... but it's going to take a significant time-investment from whoever wants to do it.

Since there's no linker phase or dead-code elimination, from what I'm seeing, HuC is pretty much a behemoth right now.

But ... there is some argument for providing common functionality within the library itself, especially since the code that HuC generates when you do the same stuff in C (like DK did for Catastrophy) is going to be much larger and slower than the same code hand-written in assembly.


Quote from: DildoKKKobold on 12/01/2016, 10:14 PMI don't think this needs to be in the core of HuC. It would serve better as example code, which someone could work to their needs.
It's not like a "fade" routine is an uncommon requirement.

1) Your C code is big and slow, and generates a lot of HuC6280 code that a hand-written assembly function doesn't. That's not you ... that's just HuC.

2) Have you got a fade-up working, yet?  :wink:

ccovell

Quote from: elmer on 12/01/2016, 10:34 PM"fade" is a value 0-7.
I have no stake in this, but thinking down the road, it might be better to add more granularity (0..15 or more) right now. For example, many Sega games have more levels of fading by fading out Red & Green at different speeds before finally doing Blue... and it looks fantastic, far smoother than 8 steps as on the PCE.

I did something similar (using lookup tables) for my HuZero game.