Converting a floating point texture to an RGBA texture so it's ready to be flipped to the screen ?! ;)

Hello,

One thing is annoying me a little bit with my current Delphi 
program/example/opengl acceleration experiment ;)

I cannot enjoy the speed of OpenGL because for now I am using 
TCanvas.Pixels[x,y] to draw the texture map to the screen. And since the 
color components of the texture map are in the range 0.0 to 1.0, they first 
need to be converted to 8-bit RGB values, which means many multiplications 
and roundings.

But the biggest problem is the slowness of Tcanvas.Pixels.

Anyway I probably already have a "CopyMemoryToBitmap" routine somewhere 
which would help with flipping the memory into bitmap format ;)

So the remaining problem is:

Converting floating point textures to rgba textures so they can be flipped 
to screen.

I guess I could use an additional render to texture target... in rgba 
mode... and use an extra shader... just for recalculating the floating 
points to rgba's...

However doing this seems a bit weird... but it would probably be possible as 
follows:

1. Draw a quad with 4 vertices which would activate all pixel shaders.
2. Shade the pixels and output them to the texture... preferably y-flipped 
if necessary.
3. Read texture to cpu/system memory.
4. Flip memory to Tbitmap/canvas etc.
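
The y-flip mentioned in step 2 can also be done on the CPU between steps 3 and 4. OpenGL hands back pixel rows bottom-up (origin in the lower-left corner), while the first scanline of a bitmap is normally the top row. The following is a minimal sketch in C, assuming tightly packed rows; the name `flip_rows` is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Copy an image while reversing the row order.
   src and dst must not overlap; both hold `height` rows of `row_bytes` bytes. */
void flip_rows(const unsigned char *src, unsigned char *dst,
               size_t row_bytes, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        memcpy(dst + (height - 1 - y) * row_bytes,
               src + y * row_bytes,
               row_bytes);
    }
}
```

For a 500x400 image with 4 bytes per pixel the call would be `flip_rows(src, dst, 500 * 4, 400)`.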

However I wonder if OpenGL has a better method of converting a floating 
point texture/framebuffer into a bitmap ?!

So that it doesn't need to go through the vertex and pixel shaders ?!?

Hmmmm...

Maybe there is even a faster way ?

Maybe by re-enabling the "default framebuffer" ? But it would be empty... 
hmm...

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 5:59:20 AM

Ok,

For now I am gonna get rid of the Tcanvas.Pixels... and simply use an extra 
memory buffer to convert the floating point texture (3x16 bit or 3x32 bit 
floating point) to RGBA (4x8 bit) on the cpu.
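
A minimal sketch in C of what such a CPU-side conversion could look like, assuming 32-bit floats in R,G,B order and a tightly packed source buffer. The function names are made up for this example, and 16-bit half floats would first need to be decoded to 32-bit floats.

```c
#include <stddef.h>

/* Convert one color component from [0.0, 1.0] to [0, 255].
   Out-of-range input is clamped; NaN maps to 0. */
unsigned char float_to_byte(float v)
{
    if (!(v > 0.0f)) return 0;                  /* also catches NaN */
    if (v >= 1.0f) return 255;
    return (unsigned char)(v * 255.0f + 0.5f);  /* round to nearest */
}

/* Convert tightly packed RGB floats into RGBA bytes with opaque alpha. */
void rgb_float_to_rgba8(const float *src, unsigned char *dst,
                        size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i++) {
        dst[4 * i + 0] = float_to_byte(src[3 * i + 0]);
        dst[4 * i + 1] = float_to_byte(src[3 * i + 1]);
        dst[4 * i + 2] = float_to_byte(src[3 * i + 2]);
        dst[4 * i + 3] = 255;
    }
}
```

The clamp matters because a shader can easily write values slightly outside 0.0 to 1.0, and an unclamped cast would wrap around instead of saturating.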

That way cpu can do something too... hopefully cpu not gonna be too slow at 
it ! LOL.

Would be funny if the cpu is still fricking slow even for something like...

I have bad feeling about that ! ;) :)

But gonna try anyway... have to do this anyway ! ;)

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:03:17 AM


Hmmm something fishy going on here...

The color component order of the Delphi form seems to be:

R,G,B,A

The color component order of the Delphi bitmap seems to be:

B,G,R,A

?!?!?

What the fuck ?!?

Weirdness !

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:38:25 AM

Or maybe I just flipped the texture color components...

Me confused...

Should it be record:
    r,g,b,a : float
end;

or

should it be record
    b,g,r,a : float
end;

for floating point texture maps for GL_RGBA ?!? (and/or GL_RGB)

Hmm...

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:40:59 AM

Well I think I got the floating point format correct if I recall 
correctly...

Since the OpenGL window seemed to draw ok...

So record for floating point texture format is probably:

r,g,b,a : float;

Then why does Delphi need it the other way around ?!

Weird...

Especially form vs bitmap... double weird ?!

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:43:17 AM

Anyway I am using Tbitmap.Scanline for fast access...

According to other postings it indeed seems to be reversed: B,G,R,A...

The reason for this I don't understand...

For now I will have to use a special type for it...

TbgraByte = record b, g, r, a : byte; end; // matches the bitmap scanline byte order ;)

And use the one which is appropriate ;)
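
As far as I know, the reversed order comes from Windows itself: 32-bit device independent bitmaps store each pixel as blue, green, red, then a fourth byte, while a TColor value has red in its lowest byte. A minimal sketch in C of the two layouts and the swap between them; the type and function names are made up for this example.

```c
#include <stdint.h>

/* Byte order of a pixel read back as RGBA bytes. */
typedef struct { uint8_t r, g, b, a; } RgbaByte;

/* Byte order of a 32-bit Windows bitmap scanline pixel. */
typedef struct { uint8_t b, g, r, a; } BgraByte;

/* The two structs are distinct types, so the compiler rejects assigning
   one to the other; every conversion has to go through this function. */
BgraByte rgba_to_bgra(RgbaByte in)
{
    BgraByte out;
    out.b = in.b;
    out.g = in.g;
    out.r = in.r;
    out.a = in.a;
    return out;
}
```

OpenGL also accepts GL_BGRA as a pixel format when reading pixels back, which may make the swap unnecessary; that is worth testing rather than assuming.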

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:46:37 AM

Ok this solves problem nicely ! ;)

Delphi even helped me prevent a stupid error thanks to strong type checking 
like so:

Faulty:

var
    vBitmapColor : TbgraByte;

begin

TrgbaByte( scanline pointer etc ) := vBitmapColor; // compiler type error ;) 
:)

Good:

TbgraByte( scanline pointer etc ) := vBitmapColor;

end;

:)

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:50:20 AM

"Skybuck Flying" <BloodyShame@hotmail.com> wrote in message 
news:d6518$4ac59fd4$d53372a9$20198@cache6.tilbu1.nb.home.nl...
> Hmmm something fishy going on here...
>
> The color component order of the Delphi form seems to be:
>
> R,G,B,A

^ Not sure about that...

Form does not seem to have a scanline property...

Maybe its internal format is also b,g,r,a...

Canvas.Pixels might be doing a conversion as well...

Tcolor is in rgba mode at least... so it might be doing conversions like so:

RGBA to BGRA.

^ This might be another reason why .Pixels[x,y] is slow...

>
> The color component order of the Delphi bitmap seems to be:
>
> B,G,R,A

When accessing the scanline pointer at least...

Maybe .Pixels[ ] does a conversion ;) <- probably.

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:54:12 AM

Anyway using the Tbitmap.Scanline property + rounding seems to be fast 
enough for now... for 500x400 pixels ;)

Canvas.Draw( 0, 0, mScreenBuffer ); // mScreenBuffer : TBitmap;

Draws it real fast... like under one second at least ! ;)

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 6:56:01 AM

Time for a little performance testing ;)

With delphi conversion code included it's:

0.0254 seconds for one frame of 500x400 with 32 bit colors, and 3x32 bit 
floating point texture format r,g,b.

That's 25 milliseconds...

Even using this shitty conversion code... this could mean:

1000 / 25.4 = 39 frames per second haha ! :)

Actually the first time it seems to require 50 milliseconds not sure why... 
maybe cpu cache getting filled or maybe it's disk activity from delphi ide 
or so... could be...
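
Since the first frame is an outlier, the measurement is more stable when a few warm-up frames are skipped and the rest are averaged. A minimal sketch in C, assuming the frame times were already collected in milliseconds; the function names are made up for this example.

```c
#include <stddef.h>

/* Mean frame time in milliseconds, ignoring the first `warmup` samples.
   Returns 0.0 when no samples remain. */
double mean_frame_ms(const double *samples_ms, size_t count, size_t warmup)
{
    if (warmup >= count) return 0.0;
    double sum = 0.0;
    for (size_t i = warmup; i < count; i++) {
        sum += samples_ms[i];
    }
    return sum / (double)(count - warmup);
}

/* Frames per second for a given frame time in milliseconds. */
double frames_per_second(double frame_ms)
{
    return 1000.0 / frame_ms;
}
```

With samples of 50, 25.4 and 25.4 milliseconds and one warm-up frame skipped, the mean is 25.4 milliseconds, which gives the 39 frames per second quoted above.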

Now let's see how fast draw is without this shitty conversion code and 
canvas draw code.

It's about 0.006 seconds for opengl draw + read from texture.

That's about 6 milliseconds (500x400 vertex points as well ! ;))

Now let's leave texture reading out of it... to see how fast it goes then ! 
;)

It's 0.00032 seconds.

That's 0.32 milliseconds !

Holyshit batman... that's already pretty fast ! ;) :) And this includes 
500x400 vertices ! HAHA.

Let's see what frame rate would be for this:

1000 / 0.32 = 3125

Not bad.

I did notice the occasional hiccup... this could be the first time because 
of loading the texture... and/or disk activity... I think it's disk activity 
mostly.

Ok... now I reduce vertex points to something more realistic...

According to my last calculations posted in another sub thread... the number 
of simulators would be:

1198.

There are two fields... each one can probably do a number of cell updates... 
one pointer needed for itself...
one pointer for somewhere else... and again a pointer for somewhere else...

So for itself at least 1, then maybe 2 then maybe another 2... so I think at 
most 5 pointers needed.

So 1195 * 5 = 5975 vertices needed... Now I go test its speed:

Time is: 0.0003126 seconds.

So it remains at 0.32 milliseconds per frame. Hmmm...

This is not so good.

This means:

3125 frames per second * 1198 simulators = 3.743.750 cycles per second.

Which probably needs to be divided by 2, which gives 1.871.875 cycles per second.

Dual core CPU was probably something like 80.000 * 200 = 16.000.000 cycles 
per second.

I must know for sure so I am going to start it to make sure.

Yup confirmed...

Dual core CPU can do: 10 battles of 1 v 1 warriors with 100 rounds with 
80.000 cycles in 5 seconds.

This means the dual core is executing:

10 * 100 * 80.000 cycles in 5 seconds = 80.000.000 (no early kills those 
were disabled)

Which means it's executing: 80.000.000 / 5 = 16.000.000 cycles for dual 
core.

Which means roughly 8.000.000 cycles per core.

So far a single cpu core seems about 4.2 times faster than this gpu estimate...
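
The comparison above can be written down as two small formulas. This is a minimal sketch in C with made-up function names. The `divisor` parameter stands for the "divide by 2" step, whose reason the post does not state, so treat it as an assumption.

```c
/* CPU: total cycles executed divided by wall-clock seconds. */
double cpu_cycles_per_second(unsigned battles, unsigned rounds,
                             unsigned cycles_per_round, double seconds)
{
    return (double)battles * rounds * cycles_per_round / seconds;
}

/* GPU estimate: one step per simulator per frame, scaled down by `divisor`
   (the post halves the figure; why exactly is an assumption here). */
double gpu_cycles_per_second(double frames_per_second, unsigned simulators,
                             double divisor)
{
    return frames_per_second * simulators / divisor;
}
```

With the numbers from the post, `cpu_cycles_per_second(10, 100, 80000, 5.0)` is 16.000.000 for both cores and `gpu_cycles_per_second(3125.0, 1198, 2.0)` is 1.871.875, so one cpu core at 8.000.000 is a bit more than 4 times ahead.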

However I don't have a decent gpu implementation yet...

However seeing these numbers for this simple test raises big doubts if gpu 
version is gonna be any faster...

Maybe the clearing of the framebuffer has something to do with it... gonna 
disable it and retest...

depth test was disabled... clearing was disabled...

Time is now:

0.000237 seconds.

0.237 milliseconds

1000 / 0.237 = 4219

about 35% more performance.. still not enough me thinks.

Ok reloading the identity matrix does not seem necessary...

setting the cg world view thingy only needs to be done once it seems...

Performance is now
0.00019 seconds. (fluctuating a bit... maybe at full speed it would be lower 
not sure)

which is about 0.19 milliseconds

1000 / 0.19 = 5263 frames per second.

Let's see what this would give:

5263 * 1198 simulators = 6.305.074 cycles
divide by 2 : 3.152.537

Still very poor.

Hmmm I see I have three frame buffer textures active... I only need 2...

Gonna disable one to see if that helps.

Yes that helped a bit...
0.000156 seconds.

Ok I don't need 32 bit floating points...

Only 16 bit floating points... gonna change textures to 16 bit...
This should give good improvement.

Hmmm nope still 0.00015 seconds...

I am starting to wonder if the time I am measuring is actually the cpu 
overhead of the api calls... hmm...

Hmm could be... if that's the case... then reducing the number of api calls 
could give more speed... I wonder if enabling the cg profiles every frame is 
necessary, maybe not... and binding the programs every frame, is that 
necessary ? I don't know...

For now the speed would be:
1000 / 0.156 = 6410 assuming the bigger texture is no problem.

Final speed would be:
(1198 * 6410) / 2 = 3.839.743 something like that.

Almost half of what a single cpu core would achieve.

This is assuming the vertex/pixel shaders and texturing lookups don't add 
any significant
delays or overheads... for the largest possibility.

I'll do a quick test with a large texture to see what happens ;)
4096x4096 gives about the same speed...

For now I am worried this is not a good situation.

But there might be a solution... instead of stuffing the entire cores into 
the frame buffer... the opposite
could happen...

only instructions are stuffed into the frame buffer and vertices and such 
for processing... and the cores themselves are stuffed into textures which 
will not be render targets...

Instead something else will update those textures... this could be done by 
cpu...

I was thinking about doing a single 4096x4096 texture map executor with 1198 
simulators... but seeing these presumably api call overheads or round trip 
times to gpu makes it doubtful that it would be any faster... it probably 
would not... therefore the strategy has to be rethought and changed.

For now the number of texture inputs seems to be limited to 6 to 8 
TEXCOORDs... these texcoords are necessary to supply the necessary 
information to tex2D or texRECT or something like that...

But not really... each vertex could simply ignore those texcoord 
semantics... and simply use the first one... The texture coordinates 
themselves could use an additional coordinate, for example the Z coordinate 
or an additional TEXCOORD2 or maybe NORMAL coordinate, to indicate which 
texture to use.

This way the number of texture maps in the gpu could be endless... however 
the memory is limited.

Now let's try to fit as many simulators/cores/warriors as possible into the 
gpu... new calculation becomes:

512 MB / (14000 * 6 bytes) =

536870912 / 84000 = 6391 simulators in gpu. the necessary instruction 
pointers plus additional fields per simulator should fit easy in one 
framebuffer so I am not worried about that.

Using this new figure the number of cycles would be:
6391 * 6410 = 40.966.310, divided by 2 = 20.483.155
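
The capacity figure is a plain integer division. This is a minimal sketch in C with a made-up function name, assuming each simulator needs nothing but its core cells (14000 cells of 6 bytes each, as in the calculation above).

```c
#include <stdint.h>

/* How many simulators fit in a memory budget, assuming each simulator
   stores cells_per_sim cells of bytes_per_cell bytes and nothing else. */
uint64_t simulators_that_fit(uint64_t memory_bytes,
                             uint64_t cells_per_sim,
                             uint64_t bytes_per_cell)
{
    return memory_bytes / (cells_per_sim * bytes_per_cell);
}
```

For 512 MB this gives `simulators_that_fit(536870912, 14000, 6)`, which is 6391. The same helper works for any other memory budget.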

Compared to the dual core cpu which has 16.000.000 cycles this is still very 
weak ?!

However the gpu code I used to test is not running at full speed.. so who 
knows what will happen... but for now I base it on what I see...

At best with a little bit of luck... two extra threads could be added which 
feed the necessary data to the gpu... so that the gpu can do some processing 
as well... this way the final speed would be doubled.

However this does require feeding the gpu with 6391 battles ?! which is 
quite a lot..

This would be a battlefield size of almost 80x80  then for the cpu a little 
bit of extra fields are necessary...
2x2 or so...

Hmm... this makes it difficult to distribute the battles across gpu or 
cpu...

Gpu needs a lot of battles to be efficient... when gpu is done it would have 
to wait for the cpu to finish the remaining battles...

Then the next round of battles could occur..

Figuring out the sweet spot for gpu and cpu is what would be necessary...

Also the results for gpu could vary... it might finish sooner because of 
favorite battles... then the gpu has nothing to do anymore... not enough 
extra battles... unless gpu is made flexible... but that would be dangerous 
because then it could take long...

For now it does seem like gpu might add some performance benefits... but the 
cpu would also be tasked with doing lots of api calls which might eat into 
the performance of the cpu simulators themselves which would not be a good 
thing...

I have also started wondering if the cpu code/simulator code can be changed 
to represent a more gpu like approach... maybe the cpu can act like a gpu as 
well... more like in a streaming fashion...

Two approaches are possible:

1. Assume this theory and change/reimplement the simulator on cpu to see if 
cpu can do stream processing.

or

2. Convert delphi simulator code to c/c++ to analyze it in visual studio to 
see what the actual bottleneck is on the cpu... is it bandwidth ? is it cpu 
execution ? is it stalls because of reads and writes ?

If it's the last then maybe the cpu code can be altered to run even faster.

I am very curious about that... seeing these poor results for gpu has made 
me doubt if I should continue with this corewars/gpu project ;)

GPU might be nice for video codec though :)

But I am even more curious about getting more simulator speed... so I have 
different direction to try... I am having heavy doubts about which path to 
choose...

I think it's best if I try to do some cpu benchmarking test to try out the 
"cpu streaming" theory ! ;)

Streaming vs non-streaming cpu test ! ;)

That's what I should do first ! ;)

And if no difference is found... maybe an analysis in c/c++ to see what 
actual bottleneck is ?! ;)

However there is another possibility...

What if the gpu code could run entirely inside a shader with multiple passes 
?!

Then all these opengl api calls would not be necessary anymore...

However I am not sure if such an algorithm is possible... it might be 
though... if the output from the pixel shaders can be redirected into the 
vertex shaders/textures for the next pass...

I don't know if that's possible so that's also another "research/benchmark" 
direction to try out.

So two early benchmarks to try out...:

1. Running code entirely inside gpu shader with multiple passes and frame 
buffer/texture feedback ?!
^ Somehow only a few texels on the textures need to be updated for good 
speed though ?!? If not possible use full frame shaders maybe that not so 
bad for performance ? But I doubt it ! ;) :)

2. Running streaming like vs non-streaming like code on the cpu to see if 
there is a difference.

^ Final conclusion: more research needed into possibilities ! ;)

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 8:05:18 AM

Hmm,

Maybe I am in luck... I already have pmars source code in C... didn't write 
it myself... but it's similar to mine ;)

If I can get the source code compiling in visual studio I might be able to 
do some profiling with amd analyzer/profiler to hopefully see what the 
bottleneck might be ;)

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 9:04:17 AM

I think I already tried that in the past... so screw pmars code...  if there 
was an amd analyzer for c++ builder then it would have been useful.

Converting my code to c/c++ shouldn't be too hard except maybe for the 
threading code... but that could be interesting as well if that fails I can 
always switch to non-multi threaded code...

For now I am going to do a conversion to C/C++ because it's badly needed: I 
have no idea what the real bottleneck is... and that makes it hard to come 
up with a good bottleneck-fighting-strategy ;) :)

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 9:23:12 AM

Having to use visual studio and converting my code to c/c++ is depressing.

However I could use other editors like c++ builder to ease the pain 
somewhat...

Then finally I would have to use visual studio.

To get over this depression I am now going to play some "CoH ToS" ;) :)

For inspiration and happiness :) LOL.

^ downtime coming ;) :) ^ =D

Bye,
  Skybuck ;) 


Reply Skybuck 10/2/2009 9:39:16 AM

Hmm ok

1. played a bit of CoH some time ago...
2. Tried a Delphi to C++ conversion tool 1.5 trial it said... But it was no 
good.
3. And investigated the possibility of writing my own Delphi to C/C++ 
converter...
It might be possible but it would require a whole lot of time and a whole 
lot of testing.
Mostly to figure out how the used parser/lexer works.

It would require too much time for now me thinks... but could be an 
interesting project for the future...

As far as I know there is no decent Delphi to C++ tool ? Thus such a good 
tool could be popular ?!? And sell well ? Any potential buyers out there ? 
;) :)

Anyway the tool could be useful for myself as well... to quickly convert 
Delphi code to C/C++ code... to then benchmark/profile it with AMD 
CodeAnalyst and any other future tools.

However maybe it's possible to make Delphi interface with AMD 
CodeAnalyst... haven't explored that... but that would probably be even more 
difficult if not impossible ?! ;)

Bye,
  Skybuck. 


Reply Skybuck 10/2/2009 1:24:17 PM

Some other ideas to consider:

1. Speculative execution of all core cells, would probably lead to many 
conflicts, however output to different cells could be stored separately per 
input/output core so at least all results would be ok. <- many unnecessary 
executions at first and maybe later too

2. Speculative execution of all processes in the list <- different way of 
parallelism, could produce more useful executions but still very limited

These two ideas above are more "fun" ideas, they are not very serious... 
but could be easy to implement.

Time for a totally different idea:

3. CPU does preprocessing of all-to-be-executed instructions per 
core/simulator.

CPU could have access to 2 GB of ram (the virtual memory limit of a 32 bit 
process; the other 2 GB of the 4 GB address space are reserved for the 
kernel).

Total amount of simulators for 1 v 1 warrior fights would be:

2 GB / 84.000 bytes = 2147483648 / 84000 = 25.565 simulators.

Possibilities for memory locations per instruction are roughly:
1. A=A+B,
2. A=A+1,
3. A=A-1,
4. B=B+1,
5. B=B-1,
6. A=A/B,
7. A=B/A,
8. B=B/A,
9. B=A/B,
10. A=A*B,
11. B=B*A,
12. A=A mod A
13. A=B mod B
14. A=A mod B
15. A=B mod A
16. B=A mod A
17. B=B mod B
18. B=A mod B
19. B=B mod A

Maybe even all of these+1...

I am not sure how many possibilities there are...

Maybe 100 ? Maybe more ?

For now let's assume 100 or so.

This could mean 100 memory locations have to be read to be sure that all 
locations are present for complete instruction execution and memory input 
data and memory output data...

Actually the possibilities aren't that great... the pre-processor should be 
able to know exactly which instruction type will be executed so the number 
of possibilities will be very small... and can be pre-computed. However this 
would almost be the same as actually executing it...

So another idea could be to do the pre-processor on the gpu as well... so I 
guess this comes down to simply:

1. Processing the instructions on the gpu for as far as possible
2. Falling back to cpu to get any necessary code or locations and supplying 
them again to the gpu... or maybe another gpu pass can actually do all that.
3. Go back to gpu and execute the remaining part of the instructions.

(Had this idea while letting this post "idle" for a while on my pc LOL :))

Yeah so to keep this story short:

1. Process instructions on the gpu for as far as possible, then try to do 
anything else in secondary/tertiary passes/multiple passes and so forth.

Yeah this is pretty much how I designed the original core gpu algorithm... 
which also included loading/using multiple textures in the gpu up to 512 MB 
! ;)

So I was hoping to do just one texture map or so... but now it turns out 
that would not give enough performance.

So to make long story short: I must go back to the original core gpu design 
and implement it massively ! ;) :)

However easier said than done... because more passes probably means more api 
delay... and then the target might not be reached as well.

Target of course being insane speed ! ;)

Let's do some calculations...

Number of steps estimated for core gpu executor design: 21

21 passes * 0.152 milliseconds = 3.192 milliseconds required for all 
steps...

1000 / 3.192 = 313 cycles per second... let's divide this by 2 just in 
case... roughly 155 cycles per second.


25565 simulators * 155 cycles = 3.962.575
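
The multi-pass estimate above can be redone without the intermediate rounding. This is a minimal sketch in C; the function names and the `safety_factor` parameter (the "divide by 2 just in case") are made up for this example.

```c
/* Cycles per second for one simulator when each cycle needs `passes`
   render passes that each cost pass_ms milliseconds. */
double cycles_per_second_per_simulator(unsigned passes, double pass_ms)
{
    return 1000.0 / (passes * pass_ms);
}

/* Total across all simulators, divided by a safety factor. */
double total_cycles_per_second(unsigned passes, double pass_ms,
                               unsigned simulators, double safety_factor)
{
    return cycles_per_second_per_simulator(passes, pass_ms)
           * simulators / safety_factor;
}
```

With 21 passes of 0.152 milliseconds, 25565 simulators and a safety factor of 2, the unrounded result is still roughly 4 million cycles per second, so the rounding is not what causes the limit.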

Again 4 million cycles ?!?!? wtf ?!

Kinda funny how I keep hitting this 4 million limit ! ;)

Bye,
  Skybuck ;) =D 


Reply Skybuck 10/2/2009 2:25:40 PM

Ok,

I just did some testing of the draw routine...

The speed in a tight loop without any data changes is about 20.000 frames 
per second...

I am not sure if OpenGL actually renders each one or if it detects that 
nothing changed...

For now I will assume it renders each frame.

This means the actual speed in the scenario described could be 3 times 
higher...

About 12.000.000 cycles per second.

However the scenario described is probably totally unrealistic since the cpu 
could never supply 2 GB per frame...

At 20.000 frames per second that would be like 40 TB per second haha ! ;)

However I have some new ideas which might work by feedback to gpu.

But I am getting a bit tired of all these different models/scenarios...

Maybe I describe one later on or maybe not and keep it secret :)

Bye,
  Skybuck. 


Reply Skybuck 10/3/2009 8:48:30 AM

Euhm actually not 40 TB... because the cpu could upload only those things 
which would be necessary and that's definitely not everything... only a small 
portion...

So many different ways of implementing it... makes me dizzy and nervous ! :)

Bye,
  Skybuck. 


Reply Skybuck 10/3/2009 8:53:08 AM

So more interesting techniques to investigate:

0. Feedback buffers (already mentioned in previous post ;)) (only for 
texture0?)
1. Pixel rectangles
2. Bitmaps
3. Stencil buffer can be used to exclude certain fragments
(if multiple cores in the texture...then stencil buffer would need at least 
a few bits to indicate which core instruction pointer is to be enabled and 
which instruction pointer/location so another 14 to 16 bits so many bits 
needed for stencil buffer... not sure what the maximum is...)

4. The z buffer also has some bits... 24 bits... not sure if that could 
somehow be used.

5. Logical operations... could be used to quickly replace certain values in 
the framebuffer.
(Only for integers ???) Could be used to do copy ?

6. Buffer updates color masks... <- could be interesting to split planes or 
to get a certain bitplane.

For other projects ;)
7. Bits can be written into the stencil by using mask... not sure if it 
means color bits, depth bits or any...
8. Accumulation buffer, can for example add up bits it seems... could be 
handy for counting bits in parallel.
9. Pixel store parameters could be used to swap bytes, or switch bit order.

10. CopyPixels could be used to copy from read framebuffer to drawbuffer ? 
To then display it ? It could also be used as an easy scroller or so.

11. BlitFrameBuffer can be used to visualize the special buffers like 
stencils and depth buffers and so.
(Supports stretching) (If buffers specified is the same then overlapping in 
same buffer results are undefined).
Pixel formats for both buffers must be the same.

12. Not sure but: render buffers might be able to have 16 stencil bits ?! 
(Probably still too little for what I might use them for...) Maybe stencil 
and depth can be combined to form one large special buffer. It mentions 
depth_stencil or so.

13. Texture objects can be bound to frame buffer via FrameBufferTexture1D 
(probably equivalent of _EXT version or so ;))

14. Attaching the texture buffer to framebuffer and using that same texture 
buffer as a texture access could produce undefined results... it could lead 
to a feedback loop which in itself is kinda interesting... could be used to 
try to do sequential execution... or it could be used for random noise 
generation (?).

15. Generally interesting: polynomials... used to generate vertices and 
such... I assume across the polynomial/curve ? ;) :) (See map command)

16. Specifying hint: POINT SMOOTH HINT could be enough and might provide 
some performance benefits.

17. I am not sure what a histogram is... but might be interesting for 
"belongs to group" visualizations.
Page 420 has a word about histogram... apparently it's counting the 
occurrence of certain color values ;)
(min and max pixel values can also be determined)

18. State tables could be interesting to learn what "state" the opengl is in 
?! ;) when in doubt I presume ;) :)
(tremendous amount of state/information can be examined.)

19. Multi texture could be interesting even highly interesting ?!? It 
mentions the possibility of "pipelining" and using the output of one texture 
as input for the next texture ?!? It mentions this is controlled through 
texture environments ?!? not sure what that is all about... ;)

20. This is explained further: Texture Combine Environment Mode <- Could be 
highly interesting !!! Page 427 says it's possible to arrange these textures 
in all kind of ways ! Very very interesting ! ;) :)
I hope that each pixel shader gets to act on them ?!?!? Or maybe it's just 
an opengl api thingy ? Not sure...

21. Funny thingy: "point parameter" controls characteristics of points (?!?) 
;) :)

22. Pixel buffer object might give more performance for pixel drawing and 
reading... not sure how useful this would be...

And that ladies and gentlemen concludes my "analysis" of the current 
opengl 3.0 spec...

Most interesting concept/feature I came across is the concept of "pipelining 
textures".

I am not sure if it's possible but that would be a very interesting concept:

texture->shader->texture->shader->texture->shader->texture->shader->texture->shader

Only problem would be shaders can't write to certain locations... but that 
can be solved by using an "output address" then the next shader can use that 
and simply "read" and pretend that it came from itself or so...
But then again... it doesn't know where to read so this wouldn't work haha !
It would only work for vertex shaders which can displace themselves so then 
the pipeline would look like:

texture->vertex shader->pixel shader->texture->vertex shader->pixel 
shader->texture.

I think multi-textures are probably limited to pixel shaders only ? Or maybe 
not even that... so don't know about all this.

Attention: Framebuffers have no accumulation buffer... so much for that ! ;)
Though for other projects the default framebuffer could probably be used if 
this was necessary ?!? ;) (Maybe a hidden frame or so ;))


Here is an idea to use the depth buffer:

Different cores could be at different depths... maybe by setting some depth 
value a specific core could be selected... this could be used to reset cores 
or to update them with new battles, this way the gpu could run multiple 
cores/simulators asynchronously... and it doesn't have to wait until all are 
done... could be pretty damn handy ! ;) depth and stencil values can be 
combined, which could also be used as a place to store more information. 
However a simple copy to a certain location of the framebuffer could work 
just as well so maybe this is unnecessary complexity or so ;) depends on what 
is faster I guess ;)

^ These are all opengl api calls...

It's like "saying" to the cpu do a := a xor b; except now it's told to the 
gpu and the gpu does a := a xor b...

But a in this case is not just a field... it could be a whole buffer... like 
one million pixels ! ;) :)
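
As a CPU-side picture of what such a whole-buffer operation means, here is a minimal sketch in C with a made-up function name. On the GPU the closest match is probably the color logic op (`glLogicOp(GL_XOR)` with `GL_COLOR_LOGIC_OP` enabled), which as far as I know only applies to integer or fixed-point color buffers and not to floating point ones.

```c
#include <stddef.h>
#include <stdint.h>

/* a := a xor b, applied to every element of a buffer instead of one field. */
void xor_buffers(uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] ^= b[i];
    }
}
```

The point of doing it on the gpu is that this loop becomes one api call and the per-element work runs in parallel.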

Bye,
  Skybuck. 


Reply Skybuck 10/3/2009 12:03:09 PM

Ok,

I just tested the "streaming" idea for the cpu.

Streaming idea: "do many reads, do many writes, repeat".

Non streaming idea: "do single read, do single write repeat".

The non streaming idea works faster.
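
The two access patterns can be sketched as follows. This is a minimal example in C, not the actual test code: the function names, the `+ 1` update and the batch size of 64 are all assumptions made for illustration.

```c
#include <stddef.h>

#define BATCH 64  /* hypothetical batch size, not taken from the post */

/* Non-streaming: one read, one write, repeat. */
void update_interleaved(const int *src, int *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] + 1;
    }
}

/* Streaming: many reads into a scratch buffer, then many writes, repeat. */
void update_batched(const int *src, int *dst, size_t n)
{
    int scratch[BATCH];
    for (size_t base = 0; base < n; base += BATCH) {
        size_t count = (n - base < BATCH) ? (n - base) : BATCH;
        for (size_t i = 0; i < count; i++) {
            scratch[i] = src[base + i];        /* read phase */
        }
        for (size_t i = 0; i < count; i++) {
            dst[base + i] = scratch[i] + 1;    /* write phase */
        }
    }
}
```

Both functions produce the same output. The batched one does extra copying through the scratch buffer and extra loop bookkeeping, which is one plausible reason for it to be slower.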

(Streaming idea requires multiplications and some extra looping, not sure if 
that slows it down... most likely reason is that streaming idea requires 
extra memory to hold the reads... cannot directly read into cache ?!?)
Possible solution: try doing prefetches instead <- nice idea.

Going to try version 0.02 with prefetches only ;) :) and then some normal 
read write cycles or so

Bye,
  Skybuck. 


Reply Skybuck 10/3/2009 6:58:54 PM

Ok tried it...

The prefetching "streaming" version is also slower than the non streaming 
version...

Maybe the pattern of writing/reading wasn't identical for both versions... 
but it's the best I could do for now...

So for now I give up on this idea ! ;)

Bye,
  Skybuck. 


Reply Skybuck 10/3/2009 7:10:15 PM

I thiiiiiiiink I am going to attempt a Delphi to C/C++ converter tool.

The idea of having such a tool which would work very well seems very 
attractive to me ! ;) :)

Bye,
  Skybuck =D 


Reply Skybuck 10/4/2009 9:48:40 AM

With the parser I have it's gonna be a piece of cake and then I am gonna be 
filthy rich ! LOL :)

Bye,
  Skybuck ;) =D 


Reply Skybuck 10/4/2009 10:48:26 AM

I just gave this pipeline simulation a test... without actually using any 
simulator code yet...

And it seems very limited... only 100.000 instructions can be recorded or 
so... maybe a 1.000.000 but that's very little... just an initializing loop 
takes like 8000 * 10 instructions = 80.000 instructions or so...

So this pipeline simulation is not worth much... though maybe it could give 
some insight into some cycles or so...

All in all probably not worth investigating any further since it's pretty 
clear that memory lookups slow it down... and other tests already show the 
cpu can't do anything else while it's waiting for memory or so ?!?

At least it seemed like that for me... I could be wrong though ;)

Bye,
  Skybuck. 


Reply Skybuck 10/10/2009 3:25:07 PM

My latest insights into the possibility of executing corewars on a gpu have 
made me doubt if the performance is going to be any good... it's probably 
not going to be any faster than a cpu... maybe even significantly slower 
depending on the number of passes that are needed.

Calculations also assume that all executors would actually run in parallel 
at full speed which is also probably a flawed assumption... this could mean 
that ultimate performance could even be far worse for gpu.

Conclusions for parallel processors:

1. Huge memory requirements just to be able to store stuff and also cache 
stuff.

This is mostly where my current graphics card is kinda lacking... only 512 
MB... that's not really that much for parallel stuff... where for each 
parallel stuff only a little bit of work would be done ;)

I could continue trying to develop something... but I now have serious 
doubts that it would achieve any good speed... at least with the current 
design... which is probably a very good design... maybe the best one... only 
the other idea, the speculative execution one, might give some performance 
benefit... but I doubt that will be any good for sequential warriors... 
unless something more complex is done with loop iteration prediction per 
processing element or so... that's a bit too advanced for my taste...

I think it's time to start spending my time on other projects...

Maybe in the future when programming has become easier... and when more 
resources are available I might give it another try... but using opengl/cg 
shaders probably has too much programming overhead and especially too few 
resources available... hardware wise as well.. too little memory.

It's kinda a bummer...

I shall do one last calculation which would be an optimistic calculation 
just to see if something can be done:

(4 input textures + 4 output textures) * 4 elements per texture * 3 bytes = 
96 bytes.

512 MB / 96 = 5.33333333 mega elements per texture.

sqrt(5.3333333 mega) = 2364, so a 2364x2364 texture size or so.

core size 8000 + 2 warriors * (8000 processes + 500 pspace) = 8.000 + 
17.000 = 25.000 elements + 10 for little overhead or so...

Means 2364*2364 / 25010 = 223 simulators in gpu at best.
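
The memory budget above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic in this post: the function and parameter names are made up for illustration, 512 MB is taken as 512 * 1024 * 1024 bytes, and the texture is assumed to be square.

```python
import math

MIB = 1024 * 1024

def texture_budget(memory_bytes=512 * MIB, textures=8, elements=4, element_bytes=3):
    """Return (elements available per texture, side of the largest square texture)."""
    bytes_per_position = textures * elements * element_bytes  # (4 + 4) * 4 * 3 = 96
    per_texture = memory_bytes // bytes_per_position
    return per_texture, math.isqrt(per_texture)

def simulators_that_fit(side, core=8000, warriors=2, processes=8000, pspace=500, overhead=10):
    """Number of complete simulators that fit into a side x side texture."""
    per_simulator = core + warriors * (processes + pspace) + overhead  # 25010
    return (side * side) // per_simulator

per_texture, side = texture_budget()
print(per_texture, side, simulators_that_fit(side))
```

Changing `memory_bytes` shows quickly how the simulator count scales with a bigger card.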

cycles per simulator could be anywhere from 1000 to 100.000 per second.

Worst case scenario: 223 * 1000 = 223.000 cycles per second... could even be 
worse if not fully executed in parallel... but gpu does have many cores... 
like 200 so might actually execute in parallel.

Best case scenario: 223 * 100.000 = 22.300.000 cycles per second for entire 
gpu (22.352.161 with the unrounded simulator count).

This is pretty optimistic... probably a bit too optimistic... probably more 
passes required... or maybe not...

but let's say 22 million cycles per second for gpu.

Cpu achieves 16 million for dual core... so gpu is not really spectacular... 
and I need something spectacular...

The 100.000 above is assuming that opengl doesn't need to bind the cg 
program all the time...

It probably would need to re-bind... so that would make it 10 times slower 
or so... so gpu might actually achieve only 2 million cycles per second 
which would be bad.

So the conclusion in short:

It's like having a cpu which can do 223 cycles in parallel... but it can 
only do that 10.000 times per second or so... so final speed would be: 
2.230.000 cycles per second... which is just miserable.

So that's my latest guess at what the performance would be... miserable ! ;)
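
To put the three guesses side by side, here is a small Python sketch. It assumes, as the estimate above does, that every pass advances each of the 223 simulators by exactly one cycle; the function name and the scenario labels are invented for this example.

```python
def gpu_cycles_per_second(simulators, passes_per_second):
    """Total simulated cycles per second if each pass steps every simulator once."""
    return simulators * passes_per_second

SIMULATORS = 223
CPU_DUAL_CORE = 16_000_000  # cycles per second quoted for the dual core cpu

scenarios = [("worst case", 1_000), ("with re-binding", 10_000), ("best case", 100_000)]
for label, rate in scenarios:
    total = gpu_cycles_per_second(SIMULATORS, rate)
    print(f"{label}: {total:,} cycles/s = {total / CPU_DUAL_CORE:.2f} x cpu")
```

Only the best case gets past the cpu figure, and only by a factor of about 1.4.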

Bye,
  Skybuck. 


0
Reply Skybuck 10/22/2009 11:48:00 PM

However I just had a radically new idea...

What if the shader itself uses 50.000 local integers or so...

Then the shader could use all those local integers as if it was local 
memory... and simply execute everything in one pass... this would/should 
greatly increase the execution speed.

The question is now how much local memory/integers/variables can a shader 
have ?!

A simple test with an array of ints could shed some light on this for 
example:

void myshader()
{
    int myvar[50000];

}

^ if something like that compiles then that could be very interesting ! ;)

Bye,
  Skybuck. 


0
Reply Skybuck 10/23/2009 12:06:56 AM

Ok,

I tested this theory (from last posting) and it seems to compile with some 
slight modifications.

It seems for loops are limited to 4096 ? Not sure what that is...

What if it was a while loop ?

Maybe ints limited to range 4096 ? I am not sure...

For now the core could be split into a lower and upper half and then the 
code below works.

No idea yet of what performance would be... also no idea how many of these 
could run in parallel without blowing things up ?! ;)

Time will tell... now time for some performance indication testing with fx 
composer 2.5.

Fingers crossed, code example:

/*

% Description of my shader.
% Second line of description for my shader.

keywords: material classic

date: YYMMDD

*/

struct Tinstruction
{
 short mWord1;
 short mWord2;
 short mWord3;
};

typedef short Tprocess;

float4x4 WorldViewProj : WorldViewProjection;

float4 mainVS(float3 pos : POSITION) : POSITION{
 return mul(WorldViewProj, float4(pos.xyz, 1.0));
}

float4 mainPS() : COLOR
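{
 // Compile test only: the arrays below are never initialized, so the
 // values they hold are undefined; the point is whether fp40 accepts them.
 int vIndex;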
/*
 // works:
 int vLowerCore[4000];
 int vHigherCore[4000];

 for (vIndex=0; vIndex < 4000; vIndex++)
 {
  vLowerCore[vIndex] = vLowerCore[vIndex] + 1;
 }

 for (vIndex=0; vIndex < 4000; vIndex++)
 {
  vHigherCore[vIndex] = vHigherCore[vIndex] + 1;
 }
*/

 // works as well... highly interesting !
 Tinstruction vLowerCore[4000];
 Tinstruction vHigherCore[4000];

 for (vIndex=0; vIndex < 4000; vIndex++)
 {
  vLowerCore[vIndex].mWord1 = vLowerCore[vIndex].mWord1 + 1;
 }

 for (vIndex=0; vIndex < 4000; vIndex++)
 {
  vHigherCore[vIndex].mWord1 = vHigherCore[vIndex].mWord1 + 1;
 }

 Tprocess vLowerProcess[4000];
 Tprocess vHigherProcess[4000];

 for (vIndex=0; vIndex < 4000; vIndex++)
 {
  vLowerProcess[vIndex] = vLowerProcess[vIndex] + 1;
 }

 for (vIndex=0; vIndex < 4000; vIndex++)
 {
  vHigherProcess[vIndex] = vHigherProcess[vIndex] + 1;
 }


 return float4(1.0, 1.0, 1.0, 1.0);
}

technique technique0 {
 pass p0 {
  CullFaceEnable = false;
  VertexProgram = compile vp40 mainVS();
  FragmentProgram = compile fp40 mainPS();
 }
}

Bye,
  Skybuck. 


0
Reply Skybuck 10/23/2009 12:34:55 AM

I just tried to do some performance testing with fx composer 2.5...

It gives some error "GPuPerformanceUnsupported" ?!?

It did give some indication 10 Gpixels / sec ?!?

Probably flawed indication...

I think I could use this technique to try and implement a parallel corewar 
simulator...

The data would be loaded from a texture map just once at the start of the 
shader...

Then the shader runs a full simulator battle, maybe even multiple in one 
go/pass.

And then it simply returns the battle results in a little output texture...

Could be nice if it works ! ;)

Example for two warriors in core:

This way the constraints would be:

First constraint:

Maximum amount of simulators in gpu memory possible:

512 MB / ( 8000*6 bytes + 2 * (8000 + 500+4) * 2  ) =
512 MB / (48000 + 34016) =
512 MB / 82016 =
536870912 / 82016 = 6545 simulators in gpu memory !
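
The same division written out in Python, as a sketch only. The parameter names are hypothetical and simply label the terms of the formula above: 6 bytes per core instruction, 2 bytes per process or pspace entry, and 4 extra entries per warrior.

```python
MIB = 1024 * 1024

def simulator_bytes(core_cells=8000, instruction_bytes=6, warriors=2,
                    processes=8000, pspace=500, extra=4, entry_bytes=2):
    """Bytes one simulator needs: the core plus per-warrior process queue and pspace."""
    core = core_cells * instruction_bytes                      # 48000
    per_warrior = (processes + pspace + extra) * entry_bytes   # 17008
    return core + warriors * per_warrior

def simulators_in_memory(memory_bytes=512 * MIB):
    """How many whole simulators fit into the given amount of gpu memory."""
    return memory_bytes // simulator_bytes()

print(simulator_bytes(), simulators_in_memory())
```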

Now the pixel shaders would simply run each simulator side by side for as 
far as possible...

I have no idea what the performance for the pixel shader would be...

But for now I will take a guess...

6545 simulators * 80.000 cycles * 2 warriors * 100 battles =

104.720.000.000 instructions to execute at least.

Each instruction is about 6 bytes...

So that's a bandwidth requirement of:

628.320.000.000 bytes

The true bandwidth is something like:

50 GB/sec which is: 5.368.709.1200 bytes

So clearly the bandwidth is a limiter/constraint...

So estimated time for shader to complete based on bandwidth constraint would 
be:

628.320.000.000 bytes / 5.368.709.1200 bytes / sec =

628320000000 bytes / 53687091200 bytes / sec = 11.7 seconds.

So instructions per second executed would be:

104.720.000.000 / 11.7 = 8.950.427.350 instructions per second.

For two warriors that would mean 4.475.213.675 cycles per second.

Let's see.. a dual core cpu achieves 16.000.000 cycles per second.

The gpu performance would be staggering/very good.. however I have a feeling 
there must be another bottleneck/constraint somewhere....

There could also be an execution constraint for the gpu.

Stats/specs say something like: Fill rate: 15.7 billion pixels/sec.

I think that's about:
15.7 * 1000 * 1000 * 1000 = 15.700.000.000

So far this seems within range of the number above.

Conclusion: performance could be staggering/super speed !

Speed up over cpu would be:

4.475.213.675 / 16.000.000 =
4475213675 / 16000000 = 279.7

The gpu would be about 280 times faster than a cpu !
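
The whole bandwidth estimate fits in one small Python function. This is a sketch under the same assumptions as the post: every executed instruction costs one 6 byte memory access, nothing is cached, and 50 GB/sec means 50 * 1024^3 bytes per second. The names are invented for illustration.

```python
GIB = 1024 ** 3

def bandwidth_bound_estimate(simulators=6545, cycles=80_000, warriors=2, battles=100,
                             instruction_bytes=6, bandwidth=50 * GIB,
                             cpu_cycles_per_second=16_000_000):
    """Estimate run time and speed-up if memory bandwidth is the only limit."""
    instructions = simulators * cycles * warriors * battles
    traffic = instructions * instruction_bytes
    seconds = traffic / bandwidth
    cycles_per_second = instructions / seconds / warriors
    return {
        "instructions": instructions,
        "traffic_bytes": traffic,
        "seconds": seconds,
        "speedup": cycles_per_second / cpu_cycles_per_second,
    }

for name, value in bandwidth_bound_estimate().items():
    print(name, value)
```

Without rounding the time to 11.7 seconds the speed-up comes out a hair under 279.7, so "about 280" still holds. Note that under these assumptions the instruction rate is simply bandwidth / 6 bytes, whatever the number of simulators or battles.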

That's the kind of performance gain I am looking for ! ;)

Me very happy about that number ! =D

As long as the code compiles this should definitely be achievable !

However there is still a little catch... these numbers do not include the 
initialization... this would need to be done for each battle... but that's 
probably pretty quickly done as well...

Even a 200 speed up would be real nice ! ;)

So these numbers are very encouraging and I will definitely continue my 
development efforts to get a parallel gpu corewars executor going ! ;) 
:):):):):)

Bye,
  Skybuck =D 


0
Reply Skybuck 10/23/2009 12:55:18 AM

"Skybuck Flying" <BloodyShame@hotmail.com> wrote in message 
news:8d20b$4ae0fef5$d53372a9$11893@cache5.tilbu1.nb.home.nl...
> The true bandwidth is something like:
>
> 50 GB/sec which is: 5.368.709.1200 bytes

I made a little typo there in the dots:

Correct dotted value is:

53.687.091.200

However the calculations were still done properly... because I removed
the dots later on ! ;)

So calculations are correct ! ;)

Bye,
  Skybuck ! ;) :)



0
Reply Skybuck 10/23/2009 1:00:18 AM

The error was probably related to the 7900 GTX which doesn't support certain 
performance benchmarks... the 8800 GTX does...

Anyway back to the story...:

Even more interesting could be to completely leave the core, processes and 
pspace out of the texture maps...

Since those "entities" can be done/initialized in the shader itself.

What remains is the warrior's code... that could be supplied into the 
texture map... parameters maybe not possible... I would be worried that it 
would be pre-compiled/computed which is unwanted.

To keep it simple each warrior could be stuffed into 100 cells... even if 
they are not all used... plus a size indicating how large it really is...

This means the number of simulators could be:

512 MB / (100 * 6 bytes + 2) =
536870912 / 602 = 891812 simulators ! LOL.

This could allow a "battlefield" of 944 x 944 ;)

Hmm seems a bit overkill for now... my battlefield would be 60x60 or so... 
but maybe later I try 944x944 or so...

For now I shall not do any calculations how long this would take... just 
want to "document" the idea a little bit ;)

Bye,
  Skybuck.






0
Reply Skybuck 10/24/2009 11:10:14 PM

"Skybuck Flying" <BloodyShame@hotmail.com> wrote in message 
news:67037$4ae38952$d53372a9$1360@cache4.tilbu1.nb.home.nl...
> This means the number of simulators could be:

Hmm, the program start is needed as well (2 more bytes per warrior).

So this becomes:

512 MB / (100 * 6 bytes + 4) =

536870912 / 604 = 888859 simulators

Max battlefield 942 x 942
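
Both variants of this estimate (2 byte header with only the size, 4 byte header with size plus program start) can be checked with a short Python sketch. The names are hypothetical, and a square battlefield is assumed.

```python
import math

MIB = 1024 * 1024

def warrior_slots(header_bytes, memory_bytes=512 * MIB, cells=100, instruction_bytes=6):
    """Return (number of warrior slots, side of the largest square battlefield)."""
    slot_bytes = cells * instruction_bytes + header_bytes
    count = memory_bytes // slot_bytes
    return count, math.isqrt(count)

print(warrior_slots(header_bytes=2))  # size only
print(warrior_slots(header_bytes=4))  # size + program start
```

The extra 2 bytes per warrior cost about 3000 slots, which shrinks the battlefield side by two.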

Bye,
  Skybuck. 


0
Reply Skybuck 10/24/2009 11:42:00 PM

I was losing confidence in whether it's gonna work because I don't know what 
will happen if a shader uses many variables...

So I decided to do a little test... a little input texture... and some local 
variables like 8000*4*32 bits.

And some code to try and force the gpu/cg compiler to actually use all of 
them and not eliminate them...

Surprisingly it did seem to work... only problem is that FX Composer takes 
multiple seconds to render something... it also allocates gigabytes of 
memory... and then the whole application freezes.

I tried to make the shader only work for a few pixels... but alas.. it still 
uses gigabytes.

It does seem to render some white now and then which was probably the result 
of the shader which summed everything up more or less.

Maybe I need to develop my own minimalistic cg editor/development 
environment which is more aimed at large scale shaders or so...

Hmmm..

Bye,
  Skybuck. 


0
Reply Skybuck 10/27/2009 9:28:52 PM
