Hello,
One thing is annoying me a little bit with my current Delphi
program/example/opengl acceleration experiment ;)
I cannot enjoy the fast speed of opengl because for now I am using
Tcanvas.Pixels[x,y] to draw the texture map to the screen. And since the
texture map is in range 0.0 to 1.0 for the color components these first need
to be converted to RGB's which means many multiplications and rounds.
But the biggest problem is the slowness of Tcanvas.Pixels.
Anyway I probably already have a "CopyMemoryToBitmap" routine somewhere
which would help with flipping the memory into bitmap format ;)
So the remaining problem is:
Converting floating point textures to rgba textures so they can be flipped
to screen.
I guess I could use an additional render to texture target... in rgba
mode... and use an extra shader... just for recalculating the floating
points to rgba's...
However doing this seems a bit weird... but it would probably be possible as
follows:
1. Draw a quad with 4 vertices which would activate all pixel shaders.
2. Shade the pixels and output them to the texture... preferably y-flipped
if necessary.
3. Read texture to cpu/system memory.
4. Flip memory to Tbitmap/canvas etc.
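A minimal C sketch of the y-flip from steps 2 to 4, done on the cpu after the readback. It assumes the pixels already sit in a tightly packed buffer whose first row is the bottom of the image (the usual OpenGL row order) and that the target bitmap wants the top row first. `flip_rows` and its parameters are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `height` rows of `row_bytes` bytes each from src to dst,
   reversing the row order. src and dst must be different buffers. */
void flip_rows(const uint8_t *src, uint8_t *dst, size_t row_bytes, size_t height)
{
    assert(src != dst);
    for (size_t y = 0; y < height; ++y) {
        /* row y of the source becomes row (height - 1 - y) of the target */
        memcpy(dst + (height - 1 - y) * row_bytes, src + y * row_bytes, row_bytes);
    }
}
```

Doing the flip in the shader (step 2) would make this copy unnecessary; which of the two is cheaper would have to be measured.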
However I wonder if OpenGL has a better method of converting a floating
point texture/framebuffer into a bitmap ?!
So that it doesn't need to go through the vertex and pixel shaders ?!?
Hmmmm...
Maybe there is even a faster way ?
Maybe by re-enabling the "default framebuffer" ? But it would be empty...
hmm...
Bye,
Skybuck.
Skybuck, 10/2/2009 5:59:20 AM
Ok,
For now I am gonna get rid of the Tcanvas.Pixels... and simply use an extra
memory buffer to convert the 3x16 bit or 3x32 bit floating point texture
to rgba 4x8 bits on the cpu.
That way cpu can do something too... hopefully cpu not gonna be too slow at
it ! LOL.
Would be funny if the cpu is still fricking slow even for something like...
I have bad feeling about that ! ;) :)
But gonna try anyway... have to do this anyway ! ;)
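A rough C sketch of that cpu-side conversion, assuming 32 bit floats in r,g,b order and an r,g,b,a byte target with alpha forced to 255 (16 bit half floats would need an extra unpacking step that is not shown). `float_to_byte` and `rgb_float_to_rgba8` are made-up names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to [0.0, 1.0], scale to [0, 255] and round to nearest. */
static uint8_t float_to_byte(float v)
{
    if (!(v > 0.0f)) return 0;       /* negative values, zero and NaN */
    if (v >= 1.0f) return 255;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Convert pixel_count r,g,b float triples into r,g,b,a bytes. */
void rgb_float_to_rgba8(const float *src, uint8_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        dst[4 * i + 0] = float_to_byte(src[3 * i + 0]);
        dst[4 * i + 1] = float_to_byte(src[3 * i + 1]);
        dst[4 * i + 2] = float_to_byte(src[3 * i + 2]);
        dst[4 * i + 3] = 255;        /* opaque alpha, an assumption */
    }
}
```

That is one multiply, one add and one truncation per component, so 500x400 pixels cost about 600.000 of each per frame.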
Bye,
Skybuck.
Skybuck, 10/2/2009 6:03:17 AM
Hmmm something fishy going on here...
The color component order of the Delphi form seems to be:
R,G,B,A
The color component order of the Delphi bitmap seems to be:
B,G,R,A
?!?!?
What the fuck ?!?
Weirdness !
Bye,
Skybuck.
Skybuck, 10/2/2009 6:38:25 AM
Or maybe I just flipped the texture color components...
Me confused...
Should it be record:
r,g,b,a : float
end;
or
should it be record
b,g,r,a : float
end;
for floating point texture maps for GL_RGBA ?!? (and/or GL_RGB)
Hmm...
Bye,
Skybuck.
Skybuck, 10/2/2009 6:40:59 AM
Well I think I got the floating point format correct if I recall
correctly...
Since the OpenGL window seemed to draw ok...
So record for floating point texture format is probably:
r,g,b,a : float;
Then why does Delphi need it the other way around ?!
Weird...
Especially form vs bitmap... double weird ?!
Bye,
Skybuck.
Skybuck, 10/2/2009 6:43:17 AM
Anyway I am using Tbitmap.Scanline for fast access...
According to other postings it indeed seems to be reversed: B,G,R,A...
The reason for this I don't understand...
For now I will have to use a special type for it...
TbgraByte = record b,g,r,a etc; // considered a bitmap rgba ;)
And use the one which is appropriate ;)
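The same two-record idea sketched in C. As far as I know, 32 bit Windows bitmaps store each pixel as blue, green, red and then a fourth byte, which would explain the b,g,r,a order seen through Scanline. `RgbaByte`, `BgraByte` and `rgba_to_bgra_row` are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } RgbaByte;  /* byte order of GL_RGBA data */
typedef struct { uint8_t b, g, r, a; } BgraByte;  /* byte order of a 32 bit bitmap row */

/* Copy one row of pixels, swapping the red and blue bytes. */
void rgba_to_bgra_row(const RgbaByte *src, BgraByte *dst, size_t width)
{
    assert(src != NULL && dst != NULL);
    for (size_t x = 0; x < width; ++x) {
        dst[x].b = src[x].b;
        dst[x].g = src[x].g;
        dst[x].r = src[x].r;
        dst[x].a = src[x].a;
    }
}
```

Assigning by field name keeps the swap implicit: the compiler places each component at the right byte offset for its record type.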
Bye,
Skybuck.
Skybuck, 10/2/2009 6:46:37 AM
Ok this solves problem nicely ! ;)
Delphi even helped me prevent a stupid error thanks to strong type checking
like so:
Faulty:
var
vBitmapColor : TbgraByte;
begin
TrgbaByte( scanline pointer etc ) := vBitmapColor; // compiler type error ;)
:)
Good:
TbgraByte( scanline pointer etc ) := vBitmapColor;
end;
:)
Bye,
Skybuck.
Skybuck, 10/2/2009 6:50:20 AM
"Skybuck Flying" <BloodyShame@hotmail.com> wrote in message
news:d6518$4ac59fd4$d53372a9$20198@cache6.tilbu1.nb.home.nl...
> Hmmm something fishy going on here...
>
> The color component order of the Delphi form seems to be:
>
> R,G,B,A
^ Not sure about that...
Form does not seem to have a scanline property...
Maybe its internal format is also b,g,r,a...
Canvas.Pixels might be doing a conversion as well...
Tcolor is in rgba mode at least... so it might be doing conversions like so:
RGBA to BGRA.
^ This might be another reason why .Pixels[x,y] is slow...
>
> The color component order of the Delphi bitmap seems to be:
>
> B,G,R,A
When accessing the scanline pointer at least...
Maybe .Pixels[ ] does a conversion ;) <- probably.
Bye,
Skybuck.
Skybuck, 10/2/2009 6:54:12 AM
Anyway using the Tbitmap.Scanline property + rounding seems to be fast
enough for now... for 500x400 pixels ;)
Canvas.Draw( 0, 0, mScreenBuffer ); // mScreenBuffer : TBitmap;
Draws it real fast... like under one second at least ! ;)
Bye,
Skybuck.
Skybuck, 10/2/2009 6:56:01 AM
Time for a little performance testing ;)
With delphi conversion code included it's:
0.0254 seconds for one frame of 500x400 with 32 bit colors, and 3x32 bit
floating point texture format r,g,b.
That's 25 milliseconds...
Even using this shitty conversion code... this could mean:
1000 / 25.4 = 39 frames per second haha ! :)
Actually the first time it seems to require 50 milliseconds not sure why...
maybe cpu cache getting filled or maybe it's disk activity from delphi ide
or so... could be...
Now let's see how fast draw is without this shitty conversion code and
canvas draw code.
It's about 0.006 seconds for opengl draw + read from texture.
That's about 6 milliseconds (500x400 vertex points as well ! ;))
Now let's leave texture reading out of it... to see how fast it goes then !
;)
It's 0.00032 seconds.
That's 0.32 milliseconds !
Holyshit batman... that's already pretty fast ! ;) :) And this includes
500x400 vertices ! HAHA.
Let's see what frame rate would be for this:
1000 / 0.32 = 3125
Not bad.
I did notice the occasional hiccup... this could be the first time because
of loading the texture... and/or disk activity... I think it's disk activity
mostly.
Ok... now I reduce vertex points to something more realistic...
According to my last calculations posted in another sub thread... the number
of simulators would be:
1198.
There are two fields... each one can probably do a number of cell updates...
one pointer needed for itself...
one pointer for somewhere else... and again a pointer for somewhere else...
So for itself at least 1, then maybe 2 then maybe another 2... so I think at
most 5 pointers needed.
So 1195 * 5 = 5975 vertices needed... Now I go test its speed:
Time is: 0.0003126 seconds.
So it remains at 0.32 milliseconds per frame. Hmmm...
This is not so good.
This means:
3125 frames per second * 1198 simulators = 3.743.750 cycles per second.
Which need to divide by 2 probably which gives 1.871.875 cycles per second.
Dual core CPU was probably something like 80.000 * 200 = 16.000.000 cycles
per second.
I must know for sure so I am going to start it to make sure.
Yup confirmed...
Dual core CPU can do: 10 battles of 1 v 1 warriors with 100 rounds with
80.000 cycles in 5 seconds.
This means the dual core is executing:
10 * 100 * 80.000 cycles in 5 seconds = 80.000.000 (no early kills those
were disabled)
Which means it's executing: 80.000.000 / 5 = 16.000.000 cycles for dual
core.
Which means roughly 8.000.000 cycles per core.
So far the cpu seems 4.2 times faster...
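The comparison above comes down to two small formulas, sketched here in C. The divisor of 2 on the gpu side is the post's own guess ("need to divide by 2 probably"), so it is passed in as a parameter; both function names are invented for this sketch.

```c
#include <assert.h>

/* GPU estimate: every frame advances all simulators by one step,
   and frames_per_cycle frames are assumed to make up one cycle. */
double gpu_cycles_per_second(double frame_ms, unsigned simulators,
                             unsigned frames_per_cycle)
{
    assert(frame_ms > 0.0 && frames_per_cycle > 0);
    double frames_per_second = 1000.0 / frame_ms;
    return frames_per_second * simulators / frames_per_cycle;
}

/* CPU baseline: total executed cycles divided by the wall clock time. */
double cpu_cycles_per_second(unsigned battles, unsigned rounds,
                             unsigned cycles_per_round, double seconds)
{
    assert(seconds > 0.0);
    return (double)battles * rounds * cycles_per_round / seconds;
}
```

With the measured values, `gpu_cycles_per_second(0.32, 1198, 2)` comes out at about 1.871.875 and `cpu_cycles_per_second(10, 100, 80000, 5.0)` at 16.000.000 for both cores, so one core at 8.000.000 is a bit over 4 times the gpu estimate.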
However I don't have a decent gpu implementation yet...
However seeing these numbers for this simple test raises big doubts if gpu
version is gonna be any faster...
Maybe the clearing of the framebuffer has something to do with it... gonna
disable it and retest...
depth test was disabled... clearing was disabled...
Time is now:
0.000237 seconds.
0.237 milliseconds
1000 / 0.237 = 4219
about 33.3% more performance.. still not enough me thinks.
Ok reloading the identity matrix does not seem necessary...
setting the cg world view thingy only needs to be done once it seems...
Performance is now
0.00019 seconds. (fluctuating a bit... maybe at full speed it would be lower
not sure)
which is about 0.19 milliseconds
1000 / 0.19 = 5263 frames per second.
Let's see what this would give:
5263 * 1198 simulators = 6.305.074 cycles
divide by 2 : 3.152.537
Still very poor.
Hmmm I see I have three frame buffer textures active... I only need 2...
Gonna disable one to see if that helps.
Yes that helped a bit...
0.000156 seconds.
Ok I don't need 32 bit floating points...
Only 16 bit floating points... gonna change textures to 16 bit...
This should give good improvement.
Hmmm nope still 0.00015 seconds...
I am starting to wonder if the time I am measuring is actually the api
calling
cpu overhead... hmm...
Hmm could be... if that's the case... then optimizing the number of api
calls
could give more speed... I wonder if enable profilings all the time is
necessary
maybe not... binding programs is that necessary ? I don't know...
For now the speed would be:
1000 / 0.156 = 6410 assuming the bigger texture is no problem.
Final speed would be:
(1198 * 6410) / 2 = 3.839.743 something like that.
Almost half of what a single cpu core would achieve.
This is assuming the vertex/pixel shaders and texturing lookups don't add
any significant
delays or overheads... for the largest possibility.
I go do a little large test to see what happens ;)
4096x4096 gives about same speed...
For now I am worried this is not a good situation.
But there might be a solution... instead of stuffing the entire cores into
the frame buffer... the opposite
could happen...
only instructions are stuffed into the frame buffer and vertices and such
for processing... and the cores themselves are stuffed into textures which
will not be rendered to...
Instead something else will update those textures... this could be done by
cpu...
I was thinking about doing a single 4096x4096 texture map executor with 1198
simulators... but seeing these presumably api call overheads or round trip
times to gpu makes it doubtful that it would be any faster... it probably
would not... therefore the strategy has to be rethought and changed.
For now the number of texture inputs seems to be limited to 6 to 8
TEXCOORDS... these texcoords are necessary
to supply the necessary information to tex2D or rect2D or something like
that...
But not really... each vertex could simply ignore those texcoord
semantics... and simply use the first one...
The texture coordinates themselves could use an additional coordinate for example
the Z coordinate or an additional TEXCOORD2 or maybe NORMAL coordinate to
indicate which texture to use.
This way the number of texture maps in the gpu could be endless... however
the memory is limited.
Now let's try to fit as many simulators/cores/warriors as possible into the
core... new calculation becomes:
512 MB / (14000 * 6 bytes) =
536870912 / 84000 = 6391 simulators in gpu. the necessary instruction
pointers plus additional fields per simulator should fit easy in one
framebuffer so I am not worried about that.
Using this new figure the number of cycles would be:
6391 * 6410 = 40.966.310, and 40.966.310 / 2 = 20.483.155
Compared to the dual core cpu which has 16.000.000 cycles this is still very
weak ?!
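The capacity figure above is a plain integer division; a small C sketch makes it easy to try other memory sizes. The 14000 cells per simulator and 6 bytes per cell are the values from the calculation above, and `simulators_that_fit` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole simulators that fit into memory_bytes. */
uint64_t simulators_that_fit(uint64_t memory_bytes,
                             uint64_t cells_per_simulator,
                             uint64_t bytes_per_cell)
{
    uint64_t bytes_per_simulator = cells_per_simulator * bytes_per_cell;
    assert(bytes_per_simulator > 0);
    return memory_bytes / bytes_per_simulator;   /* rounds down */
}
```

For 512 MB this gives 6391, matching the number used in the cycle estimate.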
However the gpu code I used to test is not running at full speed.. so who
knows what will happen... but for now I base it on what I see...
At best with a little bit of luck... two extra threads could be added which
feed the necessary data to the gpu... so that the gpu can do some processing
as well... this way the final speed would be doubled.
However this does require feeding the gpu with 6391 battles ?! which is
quite a lot..
This would be a battlefield size of almost 80x80 then for the cpu a little
bit of extra fields are necessary...
2x2 or so...
Hmm... this makes it difficult to distribute the battles across gpu or
cpu...
Gpu needs a lot of battles to be efficient... when gpu is done it would have
to wait for the cpu to finish the remaining battles...
Then the next round of battles could occur..
Figuring out the sweet spot for gpu and cpu is what would be necessary...
Also the results for gpu could vary... it might finish sooner because of
favorite battles... then the gpu has nothing to do anymore... not enough
extra battles... unless gpu is made flexible... but that would be dangerous
because then it could take long...
For now it does seem like gpu might add some performance benefits... but the
cpu would also be tasked with doing lots of api calls which might eat into
the performance of the cpu simulators themselves which would not be a good
thing...
I have also started wondering if the cpu code/simulator code can be changed
to represent a more gpu like approach... maybe the cpu can act like a gpu as
well... more like in a streaming fashion...
Two approaches are possible:
1. Assume this theory and change/reimplement the simulator on cpu to see if
cpu can do stream processing.
or
2. Convert delphi simulator code to c/c++ to analyze it in visual studio to
see what the actual bottleneck is on the cpu... is it bandwidth ? is it cpu
execution ? is it stalls because of reads and writes ?
If it's the last then maybe the cpu code can be altered to run even faster.
I am very curious about that... seeing these poor results for gpu has made
me doubt if I should continue with this corewars/gpu project ;)
GPU might be nice for video codec though :)
But I am even more curious about getting more simulator speed... so I have
different direction to try... I am having heavy doubts about which path to
choose...
I think it's best if I try to do some cpu benchmarking test to try out the
"cpu streaming" theory ! ;)
Streaming vs non-streaming cpu test ! ;)
That's what I should do first ! ;)
And if no difference is found... maybe an analysis in c/c++ to see what
actual bottleneck is ?! ;)
However there is another possibility...
What if the gpu code could run entirely inside a shader with multiple passes
?!
Then all these opengl api calls would not be necessary anymore...
However I am not sure if such an algorithm is possible... it might be
though... if the output from the pixel shaders can be redirected into the
vertex shaders/textures for the next pass...
I don't know if that's possible so that's also another "research/benchmark"
direction to try out.
So two early benchmarks to try out...:
1. Running code entirely inside gpu shader with multiple passes and frame
buffer/texture feedback ?!
^ Somehow only a few texels on the textures need to be updated for good
speed though ?!? If not possible use full frame shaders maybe that not so
bad for performance ? But I doubt it ! ;) :)
2. Running streaming like vs non-streaming like code on the cpu to see if
there is a difference.
^ Final conclusion: more research needed into possibilities ! ;)
Bye,
Skybuck.
Skybuck, 10/2/2009 8:05:18 AM
Hmm,
Maybe I am in luck... I already have pmars source code in C... didn't write
it myself... but it's similar to mine ;)
If I can get the source code compiling in visual studio I might be able to
do some profiling with amd analyzer/profiler to hopefully see what the
bottleneck might be ;)
Bye,
Skybuck.
Skybuck, 10/2/2009 9:04:17 AM
I think I already tried that in the past... so screw pmars code... if there
was an amd analyzer for c++ builder then it would have been useful.
Converting my code to c/c++ shouldn't be too hard except maybe for the
threading code... but that could be interesting as well if that fails I can
always switch to non-multi threaded code...
For now I am going to do a conversion to C/C++ because it's highly needed: I
have no idea what the real bottleneck is... and that makes it hard to come
up with a good bottleneck-fighting-strategy ;) :)
Bye,
Skybuck.
Skybuck, 10/2/2009 9:23:12 AM
Having to use visual studio and converting my code to c/c++ is depressing.
However I could use other editors like c++ builder to ease the pain
somewhat...
Then finally I would have to use visual studio.
To get over this depression I am now going to play some "CoH ToS" ;) :)
For inspiration and happiness :) LOL.
^ downtime coming ;) :) ^ =D
Bye,
Skybuck ;)
Skybuck, 10/2/2009 9:39:16 AM
Hmm ok
1. played a bit of CoH some time ago...
2. Tried a Delphi to C++ conversion tool 1.5 trial it said... But it was no
good.
3. And investigated the possibility of writing my own Delphi to C/C++
converter...
It might be possible but it would require a whole lot of time and a whole
lot of testing.
Mostly to figure out how the used parser/lexer works.
It would require too much time for now me thinks... but could be an interesting
project for the future...
As far as I know there is no decent Delphi to C++ tool ? Thus such a good
tool could be popular ?!? And sell well ? Any potential buyers out there ?
;) :)
Anyway the tool could be useful for myself as well... to quickly convert
Delphi code to C/C++ code... to then benchmark/profile it with AMD
CodeAnalyst and any other future tools.
However maybe it's possible to make Delphi interface with AMD
CodeAnalyst... haven't explored that... but that would probably be even more
difficult if not impossible ?! ;)
Bye,
Skybuck.
Skybuck, 10/2/2009 1:24:17 PM
Some other ideas to consider:
1. Speculative execution of all core cells, would probably lead to many
conflicts, however output to different cells could be stored separately per
input/output core so at least all results would be ok. <- many unnecessary
executions at first and maybe later too
2. Speculative execution of all processes in the list <- different way of
parallelism, could produce more useful executions but still very limited
These two ideas above are more "fun" ideas, they are not very serious...
but could be easy to implement.
Time for a totally different idea:
3. CPU does preprocessing of all-to-be-executed instructions per
core/simulator.
CPU could have access to 2 GB of ram (virtual memory limit) 4 GB of ram
would need to be enabled for kernel memory.
Total amount of simulators for 1 v 1 warrior fights would be:
2 GB / 84.000 bytes = 2147483648 / 84000 = 25.565 simulators.
Possibilities for memory locations per instruction are roughly:
1. A=A+B,
2. A=A+1,
3. A=A-1,
4. B=B+1,
5. B=B-1,
6. A=A/B,
7. A=B/A,
8. B=B/A,
9. B=A/B,
10. A=A*B,
11. B=B*A,
12. A=A mod A
13. A=B mod B
14. A=A mod B
15. A=B mod A
16. B=A mod A
17. B=B mod B
18. B=A mod B
19. B=B mod A
Maybe even all of these+1...
I am not sure how many possibilities there are...
Maybe 100 ? Maybe more ?
For now let's assume 100 or so.
This could mean 100 memory locations have to be read to be sure that all
locations are present for complete instruction execution and memory input
data and memory output data...
Actually the possibilities aren't that great... the pre-processor should be
able to know exactly which instruction type will be executed so the number
of possibilities will be very small... and can be pre-computed. However this
would almost be the same as actually executing it...
So another idea could be to do the pre-processor on the gpu as well... so I
guess this comes down to simply:
1. Processing the instructions on the gpu for as far as possible
2. Falling back to cpu to get any necessary code or locations and supplying
them again to the gpu... or maybe another gpu pass can actually do all that.
3. Go back to gpu and execute the remaining part of the instructions.
(Had this idea while letting this post "idle" for a while on my pc LOL :))
Yeah so to keep this story short:
1. Process instructions on the gpu for as far as possible, then try to do
anything else in secondary/tertiary passes/multiple passes and so forth.
Yeah this is pretty much how I designed the original core gpu algorithm...
which also included loading/using multiple textures in the gpu up to 512 MB
! ;)
So I was hoping to do just one texture map or so... but now it turns out
that would not give enough performance.
So to make long story short: I must go back to the original core gpu design
and implement it massively ! ;) :)
However easier said than done... because more passes probably means more api
delay... and then the target might not be reached as well.
Target of course being insane speed ! ;)
Let's do some calculations...
Number of steps estimated for core gpu executor design: 21
21 passes * 0.152 milliseconds = 3.192 milliseconds required for all
steps...
1000 / 3.192 = 313 cycles per second... let's divide this by 2 just in
case... 155 cycles per second.
25565 simulators * 155 cycles = 3.962.575
Again 4 million cycles ?!?!? wtf ?!
Kinda funny how I keep hitting this 4 million limit ! ;)
Bye,
Skybuck ;) =D
Skybuck, 10/2/2009 2:25:40 PM
Ok,
I just did some testing of the draw routine...
The speed in a tight loop without any data changes is about 20.000 frames
per second...
I am not sure if OpenGL actually renders each one or that it detect that
nothing changed...
For now I will assume it renders each frame.
This means the actual speed in the scenario described could be 3 times
higher...
About 12.000.000 cycles per second.
However the scenario described is probably totally unrealistic since the cpu
could never supply 2 gb per frame...
That would be like 40.000 gb (40 tb) per second haha ! ;)
However I have some new ideas which might work by feedback to gpu.
But I am getting a bit tired of all these different models/scenarios...
Maybe I describe one later on or maybe not and keep it secret :)
Bye,
Skybuck.
Skybuck, 10/3/2009 8:48:30 AM
Euhm actually not 40 tb... because cpu could upload only those things which
would be necessary and that's definitely not everything... only a small
portion...
So many different ways of implementing it... makes me dizzy and nervous ! :)
Bye,
Skybuck.
Skybuck, 10/3/2009 8:53:08 AM
So more interesting techniques to investigate:
0. Feedback buffers (already mentioned in previous post ;)) (only for
texture0?)
1. Pixel rectangles
2. Bitmaps
3. Stencil buffer can be used to exclude certain fragments
(if multiple cores in the texture...then stencil buffer would need at least
a few bits to indicate which core instruction pointer is to be enabled and
which instruction pointer/location so another 14 to 16 bits so many bits
needed for stencil buffer... not sure what the maximum is...)
4. The z buffer also has some bits... 24 bits... not sure if that could
somehow be used.
5. Logical operations... could be used to quickly replace certain values in
the framebuffer.
(Only for integers ???) Could be used to do copy ?
6. Buffer updates color masks... <- could be interesting to split planes or
to get a certain bitplane.
For other projects ;)
7. Bits can be written into the stencil by using mask... not sure if it
means color bits, depth bits or any...
8. Accumulation buffer, can for example add up bits it seems... could be
handy for counting bits in parallel.
9. Pixel store parameters could be used to swap bytes, or switch bit order.
10. CopyPixels could be used to copy from read framebuffer to drawbuffer ?
To then display it ? It also could be used as an easy scroller or so.
11. BlitFrameBuffer can be used to visualize the special buffers like
stencils and depth buffers and so.
(Supports stretching) (If buffers specified is the same then overlapping in
same buffer results are undefined).
Pixel formats for both buffers must be the same.
12. Not sure but: render buffers might be able to have 16 stencil bits ?!
(Probably still too little for what I might use them for...) Maybe stencil
and depth can be combined to form one large special buffer. It mentions
depth_stencil or so.
13. Texture objects can be bound to frame buffer via FrameBufferTexture1D
(probably equivalent of _EXT version or so ;))
14. Attaching the texture buffer to framebuffer and using that same texture
buffer as a texture access could produce undefined results... it could lead
to a feedback loop which in itself is kinda interesting... could be used to
try to do sequential execution... or it could be used for random noise
generation (?).
15. Generally interesting: polynomials... used to generate vertices and
such... I assume across the polynomial/curve ? ;) :) (See map command)
16. Specifying hint: POINT SMOOTH HINT could be enough and might provide
some performance benefits.
17. I am not sure what a histogram is... but might be interesting for
"belongs to group" visualizations.
Page 420 has a word about histogram... apparently it's counting the
occurrence of certain color values ;)
(min and max pixel values can also be determined)
18. State tables could be interesting to learn what "state" the opengl is in
?! ;) when in doubt I presume ;) :)
(tremendous amount of state/information can be examined.)
19. Multi texture could be interesting even highly interesting ?!? It
mentions the possibility of "pipelining" and using the output of one texture
as input for the next texture ?!? It mentions this is controlled through
texture environments ?!? not sure what that is all about... ;)
20. This is explained further: Texture Combine Environment Mode <- Could be
highly interesting !!! Page 427 says it's possible to arrange these textures
in all kind of ways ! Very very interesting ! ;) :)
I hope that each pixel shader gets to act on them ?!?!? Or maybe it's just
an opengl api thingy ? Not sure...
21. Funny thingy: "point parameter" controls characteristics of points (?!?)
;) :)
22. Pixel buffer object might give more performance for pixel drawing and
reading... not sure how useful this would be...
And that ladies and gentlemen concludes my "analyzation" of the current
opengl 3.0 spec...
Most interesting concept/feature I came across is the concept of "pipelining
textures".
I am not sure if it's possible but that would be a very interesting concept:
texture->shader->texture->shader->texture->shader->texture->shader->texture->shader
Only problem would be shaders can't write to certain locations... but that
can be solved by using an "output address" then the next shader can use that
and simply "read" and pretend that it came from itself or so...
But then again... it doesn't know where to read so this wouldn't work haha !
It would only work for vertex shaders which can displace themselves so then
the pipeline would look like:
texture->vertex shader->pixel shader->texture->vertex shader->pixel
shader->texture.
I think multi-textures are probably limited to pixel shaders only ? Or maybe
not even that... so don't know about all this.
Attention: Framebuffers have no accumulation buffer... so much for that ! ;)
Though for other project the default frame could probably be used if this
was necessary ?!? ;) (Maybe a hidden frame or so ;))
Here is an idea to use the depth buffer:
Different cores could be at different depths... maybe by setting some depth
value a specific core could be selected... this could be used to reset cores
or to update them with new battles, this way the gpu could run multiple
cores/simulators asynchronously... and it doesn't have to wait until all are
done... could be pretty and pretty damn handy ! ;) depth and stencil values
can be combined, could also be used as a place to store more information.
However a simple copy to a certain location of the framebuffer could work just as
well so maybe this is unnecessary complexity or so ;) depends on what is faster
I guess ;)
^ These are all opengl api calls...
It's like "saying" to the cpu do a := a xor b; except now it's told to the
gpu and the gpu does a := a xor b...
But a in this case is not just a field... it could be a whole buffer... like
one million pixels ! ;) :)
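A cpu-side C sketch of what such a whole-buffer "a := a xor b" amounts to. On the gpu this would be a single api call (OpenGL has glLogicOp with GL_XOR for integer color buffers) instead of a loop; the function below only illustrates the semantics and `xor_buffers` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] := dst[i] xor src[i] for every element of the buffer. */
void xor_buffers(uint32_t *dst, const uint32_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; ++i) {
        dst[i] ^= src[i];
    }
}
```

With one 32 bit value per pixel, `count` would be one million for the buffer mentioned above.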
Bye,
Skybuck.
Skybuck, 10/3/2009 12:03:09 PM
Ok,
I just tested the "streaming" idea for the cpu.
Streaming idea: "do many reads, do many writes, repeat".
Non streaming idea: "do single read, do single write repeat".
The non streaming idea works faster.
(Streaming idea requires multiplications and some extra looping, not sure if
that slows it down... most likely reason is that streaming idea requires
extra memory to hold the reads... cannot directly read into cache ?!?)
Possible solution: try doing fetches instead <- nice idea.
Going to try version 0.02 with fetches only ;) :) and then some normal read
write cycles or so
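The actual test code is not shown in this thread, so the following C sketch is only a guess at the two access patterns being compared, using a trivial "add one" as the work per cell. `BATCH` and both function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH 8   /* assumed batch size for the streaming variant */

/* Non streaming: do single read, do single write, repeat. */
void increment_single(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1;
    }
}

/* Streaming: do many reads into a local buffer, do many writes, repeat. */
void increment_batched(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    uint32_t tmp[BATCH];
    size_t i = 0;
    while (i < n) {
        size_t len = (n - i < BATCH) ? (n - i) : BATCH;
        for (size_t k = 0; k < len; ++k) tmp[k] = src[i + k];      /* reads */
        for (size_t k = 0; k < len; ++k) dst[i + k] = tmp[k] + 1;  /* writes */
        i += len;
    }
}
```

Both produce the same output; the batched one needs the extra `tmp` buffer and an extra loop level, which fits the suspicion above about where the slowdown comes from.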
Bye,
Skybuck.
Skybuck, 10/3/2009 6:58:54 PM
Ok tried it...
The prefetching "streaming" version is also slower than the non streaming
version...
Maybe the pattern of writing/reading wasn't identical for both versions...
but it's the best I could do for now...
So for now I give up on this idea ! ;)
Bye,
Skybuck.
Skybuck, 10/3/2009 7:10:15 PM
I thiiiiiiiink I am going to attempt a Delphi to C/C++ converter tool.
The idea of having such a tool which would work very well seems very
attractive to me ! ;) :)
Bye,
Skybuck =D
Skybuck, 10/4/2009 9:48:40 AM
With the parser I have it's gonna be a piece of cake and then I am gonna be
filthy rich ! LOL :)
Bye,
Skybuck ;) =D
Skybuck, 10/4/2009 10:48:26 AM
I just gave this pipeline simulation a test... without actually using any
simulator code yet...
And it seems very limited... only 100.000 instructions can be recorded or
so... maybe a 1.000.000 but that's very little... just an initializing loop
takes like 8000 * 10 instructions = 80.000 instructions or so...
So this pipeline simulation is not worth much... though maybe it could give
some insight into some cycles or so...
All in all probably not worth investigating any further since it's pretty
clear that memory lookups slow it down... and other tests already show the
cpu can't do anything else while it's waiting for memory or so ?!?
At least it seemed like that for me... I could be wrong though ;)
Bye,
Skybuck.
Skybuck, 10/10/2009 3:25:07 PM
My latest insights into the possibility of executing corewars on a gpu have
made me doubt if the performance is going to be any good... it's probably
not going to be any faster than a cpu... maybe even significantly slower
depending on the number of passes that are needed.
Calculations also assume that all executors would actually run in parallel
at full speed which is also probably a flawed assumption... this could mean
that ultimate performance could even be far worse for gpu.
Conclusions for parallel processors:
1. Huge memory requirements just to be able to store stuff and also cache
stuff.
This is mostly where my current graphics card is kinda lacking... only 512
MB... that's not really that much for parallel stuff... where for each
parallel stuff only a little bit of work would be done ;)
I could continue trying to develop something... but I now have serious
doubts that it would achieve any good speed... at least with the current
design... which is probably a very good design... maybe the best one... only
the other idea might give some performance benefit, the speculative execution
one... but I doubt that will be any good for sequential warriors... unless
something more complex is done with loop iteration prediction per processing
element or so... that's a bit too advanced for my taste...
I think it's time to start spending my time on other projects...
Maybe in the future when programming has become more easy... and when more
resources are available I might give it another try... but using opengl/cg
shaders probably has too much programming overhead and especially too little
resources available... hardware wise as well.. too little memory.
It's kinda a bummer...
I shall do one last calculation which would be an optimistic calculation
just to see if something can be done:
(4 input textures + 4 output textures) * 4 elements per texture * 3 bytes = 96
bytes.
512 MB / 96 = 5.33333333 mega elements per texture.
sqrt( 5.3333333 mega ) = 2364x2364 texture size or so.
core size = 8000 + warriors 2 * (8000 processes + 500 pspace) = 8.000 +
17.000 = 25.000 elements + 10 for little overhead or so...
Means 2364*2364 / 25010 = 223 simulators in gpu at best.
Cycles per simulator could be anywhere from 1000 to 100.000 cycles per
second.
Worst case scenario: 223 * 1000 = 223.000 cycles per second... could even be
worse if not fully executed in parallel... but gpu does have many cores...
like 200 so might actually execute in parallel.
Best case scenario: 223 * 100.000 = 22.300.000 cycles per second for the
entire gpu.
This is pretty optimistic... probably a bit too optimistic... probably more
passes required... or maybe not...
but let's say 22 million cycles per second for gpu.
Cpu achieves 16 million for dual core... so gpu is not really spectacular...
and I need something spectacular...
The 100.000 above is assuming that opengl doesn't need to bind the cg
program all the time...
It probably would need to rebind... so that would make it 10 times slower
or so... so gpu might actually achieve only 2 million cycles per second
which would be bad.
So conclusion in other short words:
It's like having a cpu which can do 223 cycles in parallel... but it can
only do it 10.000 per second or so... so finally speed would be: 2.230.000
cycles per second... which is just miserable.
So that's my latest guess at what the performance would be... miserable ! ;)
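The estimate above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic in this post: the function and parameter names are made up for illustration, 512 MB is taken as 512 * 1024 * 1024 bytes, and the cycle rates (1000 to 100.000 per simulator per second, 10 times slower when rebinding) are the guesses from the post, not measurements.

```python
import math

MIB = 1024 * 1024

def simulators_from_textures(memory_bytes, textures=8, elements=4,
                             bytes_per_element=3, core_size=8000, warriors=2,
                             processes=8000, pspace=500, overhead=10):
    """Return (texture side, cells per simulator, simulators that fit)."""
    bytes_per_texel_set = textures * elements * bytes_per_element  # 96 bytes
    texels = memory_bytes // bytes_per_texel_set
    side = math.isqrt(texels)  # largest square texture that fits
    cells = core_size + warriors * (processes + pspace) + overhead
    return side, cells, (side * side) // cells

side, cells, sims = simulators_from_textures(512 * MIB)

worst_case = sims * 1_000          # cycles per second, pessimistic guess
best_case = sims * 100_000         # cycles per second, optimistic guess
with_rebinding = best_case // 10   # guess: rebinding costs a factor of 10

print(side, cells, sims)
print(worst_case, best_case, with_rebinding)
```

Changing `memory_bytes` or the per-warrior sizes shows how quickly the simulator count moves; the cycle rates remain the weakest part of the estimate.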
Bye,
Skybuck.
Skybuck
|
10/22/2009 11:48:00 PM
|
|
However I just had a radically new idea...
What if the shader itself uses 50.000 local integers or so...
Then the shader could use all those local integers as if it was local
memory... and simply execute everything in one pass... this would/should
greatly increase the execution speed.
The question is now how much local memory/integers/variables can a shader
have ?!
A simple test with an array of ints could shed some light on this for
example:
void myshader()
{
int myvar[50000];
}
^ If something like that compiles then that could be very interesting ! ;)
Bye,
Skybuck.
Skybuck
|
10/23/2009 12:06:56 AM
|
|
Ok,
I tested this theory (from last posting) and it seems to compile with some
slight modifications.
It seems for loops are limited to 4096 ? Not sure what that is...
What if it was a while loop ?
Maybe ints limited to range 4096 ? I am not sure...
For now the core could be split into a lower and upper half, and then the
code below works.
No idea yet of what performance would be... also no idea how many of these
could run in parallel without blowing things up ?! ;)
Time will tell... now time for some performance indication testing with fx
composer 2.5.
Fingers crossed, code example:
/*
% Description of my shader.
% Second line of description for my shader.
keywords: material classic
date: YYMMDD
*/
// One core cell: three short words.
struct Tinstruction
{
    short mWord1;
    short mWord2;
    short mWord3;
};

typedef short Tprocess;

float4x4 WorldViewProj : WorldViewProjection;

float4 mainVS(float3 pos : POSITION) : POSITION
{
    return mul(WorldViewProj, float4(pos.xyz, 1.0));
}

// Compile test only: the arrays are never initialized and nothing computed
// here reaches the returned color, so the compiler may optimize it all away.
float4 mainPS() : COLOR
{
    int vIndex;

    /*
    // works:
    int vLowerCore[4000];
    int vHigherCore[4000];

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vLowerCore[vIndex] = vLowerCore[vIndex] + 1;
    }

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vHigherCore[vIndex] = vHigherCore[vIndex] + 1;
    }
    */

    // works as well... highly interesting !
    // The 8000-cell core is split into two arrays of 4000 cells each.
    Tinstruction vLowerCore[4000];
    Tinstruction vHigherCore[4000];

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vLowerCore[vIndex].mWord1 = vLowerCore[vIndex].mWord1 + 1;
    }

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vHigherCore[vIndex].mWord1 = vHigherCore[vIndex].mWord1 + 1;
    }

    // The process queue is split the same way.
    Tprocess vLowerProcess[4000];
    Tprocess vHigherProcess[4000];

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vLowerProcess[vIndex] = vLowerProcess[vIndex] + 1;
    }

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vHigherProcess[vIndex] = vHigherProcess[vIndex] + 1;
    }

    return float4(1.0, 1.0, 1.0, 1.0);
}

technique technique0 {
    pass p0 {
        CullFaceEnable = false;
        VertexProgram = compile vp40 mainVS();
        FragmentProgram = compile fp40 mainPS();
    }
}
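Splitting the core into a lower and an upper half means every core address has to be routed to one of the two arrays. A minimal sketch of that routing, written in Python rather than Cg so it can be run anywhere; the class and method names are hypothetical and the wrap-around at 8000 cells is the usual corewar convention, assumed here rather than taken from the shader above.

```python
HALF = 4000           # size of each half, as in the shader above
CORE_SIZE = 2 * HALF  # 8000 cells in total

class SplitCore:
    """Core memory stored as two arrays, addressed as one circular core."""

    def __init__(self):
        self.lower = [0] * HALF
        self.upper = [0] * HALF

    def _locate(self, address):
        # Wrap the address into the core, then pick the half and local index.
        address %= CORE_SIZE
        if address < HALF:
            return self.lower, address
        return self.upper, address - HALF

    def read(self, address):
        half, index = self._locate(address)
        return half[index]

    def write(self, address, value):
        half, index = self._locate(address)
        half[index] = value

core = SplitCore()
core.write(3999, 1)   # last cell of the lower half
core.write(4000, 2)   # first cell of the upper half
core.write(8000, 3)   # wraps around to cell 0
core.write(-1, 9)     # wraps around to cell 7999
```

In a shader the same routing would cost a compare and a branch (or a select) on every core access, which is worth keeping in mind for the performance guesses.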
Bye,
Skybuck.
Skybuck
|
10/23/2009 12:34:55 AM
|
|
I just tried to do some performance testing with fx composer 2.5...
It gives some error "GPuPerformanceUnsupported" ?!?
It did give some indication 10 Gpixels / sec ?!?
Probably flawed indication...
I think I could use this technique to try and implement a parallel corewar
simulator...
The data would be loaded from a texture map just once at the start of the
shader...
Then the shader runs a full simulator battle, maybe even multiple in one
go/pass.
And then it simply returns the battle results in a little output texture...
Could be nice if it works ! ;)
Example for two warriors in core:
This way the constraints would be:
First constraint:
Maximum amount of simulators in gpu memory possible:
512 MB / ( 8000*6 bytes + 2 * (8000 + 500+4) * 2 ) =
512 MB / (48000 + 34016) =
512 MB / 82016 =
536870912 / 82016 = 6545 simulators in gpu memory !
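The memory constraint above is easy to re-run for other sizes. A small Python sketch, assuming the layout from this post (6 bytes per core cell, 2 bytes per process/pspace entry, 4 extra entries per warrior); the function and parameter names are only illustrative.

```python
MIB = 1024 * 1024

def bytes_per_simulator(core_cells=8000, instruction_bytes=6, warriors=2,
                        processes=8000, pspace=500, extra=4, entry_bytes=2):
    """Bytes needed for one simulator: the core plus per-warrior tables."""
    core = core_cells * instruction_bytes                     # 48000
    per_warrior = (processes + pspace + extra) * entry_bytes  # 17008
    return core + warriors * per_warrior

footprint = bytes_per_simulator()
max_simulators = (512 * MIB) // footprint

print(footprint, max_simulators)
```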
Now the pixel shaders would simply run each simulator side by side for as
far as possible...
I have no idea what the performance for the pixel shader would be...
But for now I will take a guess...
6545 simulators * 80.000 cycles * 2 warriors * 100 battles =
104.720.000.000 instructions to execute at least.
Each instruction is about 6 bytes...
So that's a bandwidth requirement of:
628.320.000.000 bytes
The true bandwidth is something like:
50 GB/sec which is: 5.368.709.1200 bytes
So clearly the bandwidth is a limiter/constraint...
So estimated time for shader to complete based on bandwidth constraint would
be:
628.320.000.000 bytes / 5.368.709.1200 bytes / sec =
628320000000 bytes / 53687091200 bytes / sec = 11.7 seconds.
So instructions per second executed would be:
104.720.000.000 / 11.7 = 8.950.427.350 instructions per second.
For two warriors that would mean 4.475.213.675 cycles per second.
Let's see.. a dual core cpu achieves 16.000.000 cycles per second.
The gpu performance would be staggering/very good.. however I have a feeling
there must be another bottleneck/constraint somewhere....
There could also be an execution constraint for the gpu.
Stats/specs say something like: Fill rate: 15.7 billion pixels/sec.
I think that's about:
15.7 * 1000 * 1000 * 1000 = 15.700.000.000
So far this seems within range of the number above.
Conclusion: performance could be staggering/super speed !
Speed up over cpu would be:
4.475.213.675 / 16.000.000 =
4475213675 / 16000000 = 279.7
The gpu would be about 280 times faster than a cpu !
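The bandwidth-based estimate can be reproduced step by step. A Python sketch under the assumptions of this post: every executed instruction moves about 6 bytes, memory bandwidth is 50 GB/sec counted as 50 * 1024^3 bytes, and the cpu reference is 16.000.000 cycles per second. The variable names are illustrative. Without rounding the time to 11.7 seconds first, the intermediate figures differ slightly from the ones above, but the speed-up still comes out at about 280.

```python
GIB = 1024 ** 3

simulators = 6545
cycles_per_battle = 80_000
warriors = 2
battles = 100
bytes_per_instruction = 6        # assumption from the post
bandwidth = 50 * GIB             # bytes per second
cpu_cycles_per_second = 16_000_000

instructions = simulators * cycles_per_battle * warriors * battles
traffic = instructions * bytes_per_instruction   # bytes to move in total
seconds = traffic / bandwidth                    # time if bandwidth-bound

instructions_per_second = instructions / seconds
cycles_per_second = instructions_per_second / warriors
speedup = cycles_per_second / cpu_cycles_per_second

print(instructions, traffic, bandwidth)
print(round(seconds, 1), round(speedup, 1))
```

Note that the simulator count cancels out: under these assumptions the instruction rate is simply bandwidth divided by bytes per instruction, so the 280 figure depends only on those two guesses and the cpu reference.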
That's the kind of performance gain I am looking for ! ;)
Me very happy about that number ! =D
As long as the code compiles this should definitely be achievable !
However there is still a little catch... these numbers do not include the
initialization... this would
need to be done for each battle... but that's probably pretty quickly done
as well...
Even a 200 speed up would be real nice ! ;)
So these numbers are very encouraging and I will definitely continue my
development efforts to get a parallel gpu corewars executor going ! ;)
:):):):):)
Bye,
Skybuck =D
Skybuck
|
10/23/2009 12:55:18 AM
|
|
"Skybuck Flying" <BloodyShame@hotmail.com> wrote in message
news:8d20b$4ae0fef5$d53372a9$11893@cache5.tilbu1.nb.home.nl...
> The true bandwidth is something like:
>
> 50 GB/sec which is: 5.368.709.1200 bytes
I made a little typo there in the dots:
Correct dotted value is:
53.687.091.200
However the calculations were still done properly... because I removed
the dots later on ! ;)
So calculations are correct ! ;)
Bye,
Skybuck ! ;) :)
Skybuck
|
10/23/2009 1:00:18 AM
|
|
The error was probably related to gtx 7900 which doesn't support certain
performance benchmarks... the gtx 8800 does...
Anyway back to the story...:
Even more interesting could be to completely leave the core, processes and
pspace out of the texture maps...
Since those "entities" can be done/initialized in the shader itself.
What remains is the warrior's code... that could be supplied into the
texture map... parameters maybe not possible... I would be worried that it
would be pre-compiled/computed which is unwanted.
To keep it simple each warrior could be stuffed into 100 cells... even if
they are not all used... plus a size indicating how large it really is...
This means the number of simulators could be:
512 MB / (100 * 6 bytes + 2) =
536870912 / 602 = 891812 simulators ! LOL.
This could allow a "battlefield" of 944 x 944 ;)
Hmm seems a bit overkill for now... my battlefield would be 60x60 or so...
but maybe later I try 944x944 or so...
For now I shall not do any calculations how long this would take... just
want to "document" the idea a little bit ;)
Bye,
Skybuck.
Skybuck
|
10/24/2009 11:10:14 PM
|
|
"Skybuck Flying" <BloodyShame@hotmail.com> wrote in message
news:67037$4ae38952$d53372a9$1360@cache4.tilbu1.nb.home.nl...
> To keep it simple each warrior could be stuffed into 100 cells... even if
> they not all used... plus a size indicating how large it really is...
>
> This means the number of simulators could be:
Hmm program start needed as well
So this becomes:
512 MB / (100 * 6 bytes + 4) =
536870912 / 604 = 888859 simulators
Max battlefield 942 x 942
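Both variants of this calculation (2 bytes for the size only, 4 bytes for size plus program start) can be checked with one small Python function. The names are illustrative, and the "battlefield" is assumed to be the largest square grid of simulators that fits.

```python
import math

MIB = 1024 * 1024

def warrior_only_capacity(memory_bytes, cells=100, instruction_bytes=6,
                          header_bytes=2):
    """Simulators that fit when only the warrior code is stored per simulator."""
    per_simulator = cells * instruction_bytes + header_bytes
    simulators = memory_bytes // per_simulator
    return simulators, math.isqrt(simulators)  # count, square battlefield side

size_only = warrior_only_capacity(512 * MIB, header_bytes=2)
size_and_start = warrior_only_capacity(512 * MIB, header_bytes=4)

print(size_only, size_and_start)
```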
Bye,
Skybuck.
Skybuck
|
10/24/2009 11:42:00 PM
|
|
I was losing confidence that it's gonna work because I don't know what will
happen if a shader uses many variables...
So I decided to do a little test... a little input texture... and some local
variables like 8000*4*32 bits.
And some code to try and force the gpu/cg compiler to actually use all of
them and not eliminate them...
Surprisingly it did seem to work... only problem is that FX Composer takes
multiple seconds to render something... it also allocates gigabytes of
memory... and then the whole application freezes.
I tried to make the shader only work for a few pixels... but alas.. it still
uses gigabytes.
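One possible reading of the gigabytes, purely an assumption and not something confirmed anywhere in this thread: if scratch space for the local variables were reserved per pixel being shaded, the totals add up very quickly. A Python sketch of that arithmetic; the function names and the 128x128 example size are hypothetical.

```python
def scratch_bytes_per_pixel(elements=8000, components=4, bits=32):
    """Local variable footprint of one shader invocation: 8000*4*32 bits."""
    return elements * components * bits // 8

def total_scratch_bytes(pixels):
    # Assumption: one private copy of the locals per pixel being shaded.
    return pixels * scratch_bytes_per_pixel()

per_pixel = scratch_bytes_per_pixel()
small_target = total_scratch_bytes(128 * 128)  # hypothetical 128x128 target

print(per_pixel, small_target, small_target / 1024 ** 3)
```

Under that assumption even a small render target would already need close to 2 GB, which is at least in the same ballpark as what was observed. Whether the driver or FX Composer really allocates this way is not known.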
It does seem to render some white now and then which was probably the result
of the shader which summed everything up more or less.
Maybe I need to develop my own minimalistic cg editor/development
environment which is more aimed at large scale shaders or so...
Hmmm..
Bye,
Skybuck.
Skybuck
|
10/27/2009 9:28:52 PM