Fastest way to draw lots of triangles -- redux (FYI)


Folks,
   After our flurry of Display List/VBO comparisons, I went and put VBO's
into a large commercial app I work on. I took 'em back out. Here's why:
while indexed VBO's are much -- sometimes MUCH -- faster than Display Lists
for small synthetic tests, when the polygon count gets high (and you add
normals, lighting & textures), the VBO swapping is evidently less efficient
than Display Lists -- by about a factor of 2 in overall frame rate in my
case (NVIDIA Quadros on Linux). VBO's in my case have a real hard "knee
point" where I go from 35M lit, colored, textured tris/sec to around 5M
with just a few more vertices. I'd love to have more memory, but we have
to allocate so much to big dual screens, quad buffers, etc. that there's
scant little left to jam in 10 million+ triangles.
   Display Lists start out slower, but have a much smoother tris/second
curve as I load more & more millions into the scene, and (usually)
significantly outperform VBO's when I get up to operational loads.
   Similarly, while VBO strips are initially a bit slower than indexed
VBO's, when you have way too much VBO data to fit on the card, the more
efficient strip organization holds up better for the same # of triangles.
   I haven't tried this comparison yet on hot PCI-E interfaces; I would
expect the data paging difference to be smaller with that architecture --
but maybe not -- it depends on how smart the Display List writers were.
   In one article I read, there was talk about keeping VBO's under the T&L
cache size for best performance. I was too lazy to programmatically break
up randomly-sized VBO's to fit some indeterminate T&L cache size -- I can't
spend that much time fiddling with the code and benchmarking over dozens
of commercial configs and types of cards.

   Which all goes to show that YMMV, and synthetic benchmarks don't
necessarily tell you anything about how your app will behave.

  Onwards...

JB West



Reply JB 7/9/2005 2:38:09 PM

JB West wrote:
> 
>    In one article I read, there was talk about keeping VBO's under the T&L
> cache size for best performance.

And how big is that, exactly...? Only the driver knows.

If there's a "problem" it has to do with
memory organisation. I'm betting you could
do a lot better by fiddling a bit.

> I was too lazy to programmatically break up randomly-sized VBO's to some
> indeterminate T&L cache size --
> I can't spend that much time fiddling with the code and benchmarking over
> dozens of commercial configs and
> types of cards.
> 

You should really have it as a program option
and let customers fiddle with it by themselves,
maybe even on a dataset-by-dataset basis. Some
customers might have cards with more memory than
yours and VBOs will be much faster for them.

>    Which all goes to show that YMMV, and synthetic benchmarks don't
> necessarily mean anything to how your
> app will behave.
> 

This is always true for *any* program and falls
under "premature optimisation is the root of
all evil".

Unless you're working with real-world datasets then
any kind of cleverness or "optimization" usually
turns out to be useless as soon as the program
goes out the door. Much better to write a flexible
program that can work in several different ways.


-- 
<\___/>
/ O O \
\_____/  FTB.    For email, remove my socks.

In science it often happens that scientists say, 'You know
that's a really good argument; my position is mistaken,'
and then they actually change their minds and you never
hear that old view from them again.  They really do it.
It doesn't happen as often as it should, because scientists
are human and change is sometimes painful.  But it happens
every day.  I cannot recall the last time something like
that happened in politics or religion.

- Carl Sagan, 1987 CSICOP keynote address

Reply fungus 7/9/2005 5:32:09 PM


"fungus" <umailMY@SOCKSartlum.com> wrote in message
news:tSTze.55549$US.37215@news.ono.com...
> JB West wrote:
> >
> >    In one article I read, there was talk about keeping VBO's under the
> > T&L cache size for best performance.
>
> And how big is that, exactly...? Only the driver knows.
>
> If there's a "problem" it has to do with
> memory organisation. I'm betting you could
> do a lot better by fiddling a bit.
>
> > I was too lazy to programmatically break up randomly-sized VBO's to some
> > indeterminate T&L cache size --
> > I can't spend that much time fiddling with the code and benchmarking
> > over dozens of commercial configs and
> > types of cards.
> >
>
> You should really have it as a program option
> and let customers fiddle with it by themselves,
> maybe even on a dataset-by-dataset basis. Some
> customers might have cards with more memory than
> yours and VBOs will be much faster for them.
>
>
[ trimmed]

Fungus,
  Nice in theory, pretty useless in practice. Customers load indeterminate
amounts of dozens of datasets at once & manipulate them, so they'd have to
fiddle constantly. And they'd be fiddling with some bizarre constant that
has nothing to do with their workflow.

  VBO's also suffer, greatly, if you have one wonderfully tuned superfast
big object -- and then plop in another. Poof, all gone. Worse than loading
up DL's.

  Since DL's don't have such nasty behavior, I'm hopeful that further VBO
optimizations *at the driver level* can overcome these problems.

-jbw


Reply JB 7/10/2005 4:48:37 AM

JB West wrote:
> "fungus" <umailMY@SOCKSartlum.com> wrote in message
> 
>>You should really have it as a program option
>>and let customers fiddle with it by themselves,
>>
>   Nice in theory, pretty useless in practice.

Why? It's just a setting in program preferences.

>   VBO's also suffer, greatly, if you have one wonderfully tuned superfast
> big object -- and then
> plop in another. Poof, all gone. Worse than loading up DL's.
> 

Ah, so you're making huge VBOs. Of course they're
going to swap really badly. The problem is with
your program if this is the case.

>   Since DL's don't have such a nasty behavior, I'm hopeful that further VBO
> optimizations *at the driver level*
> can overcome these problems.
> 

Not really possible. If you bind a huge VBO then
OpenGL has no option but to page the whole
thing at once, maybe paging out some other stuff
to make room for it.


Reply fungus 7/10/2005 1:53:52 PM

"fungus" <umailMY@SOCKSartlum.com> wrote in message
news:RL9Ae.55605$US.38592@news.ono.com...
> JB West wrote:
> > "fungus" <umailMY@SOCKSartlum.com> wrote in message
> >
> >>You should really have it as a program option
> >>and let customers fiddle with it by themselves,
> >>
> >   Nice in theory, pretty useless in practice.
>
> Why? It's just a setting in program preferences.
>
> >   VBO's also suffer, greatly, if you have one wonderfully tuned
> > superfast big object -- and then
> > plop in another. Poof, all gone. Worse than loading up DL's.
> >
>
> Ah, so you're making huge VBOs. Of course they're
> going to swap really badly. The problem is with
> your program if this is the case.
>
> >   Since DL's don't have such a nasty behavior, I'm hopeful that further
> > VBO optimizations *at the driver level*
> > can overcome these problems.
> >
>
> Not really possible. If you bind a huge VBO then
> OpenGL has no option but to page the whole
> thing at once, maybe paging out some other stuff
> to make room for it.
>
>

I don't want to belabor this issue too much just between us, but there's
something of value here. Yes, the app can make huge VBO's. "Breaking up is
hard to do" -- I don't necessarily have the scratch memory available to
sort & reorder the data from a 100 million cell model *on the fly*. And I
can't tell the database to go reorder itself for my sake -- it's ordered
its own way for numerical processing reasons. This isn't static read-only
stuff that can be stripified, optimized & saved away -- it's dynamic
scientific data that changes while the user runs the workflow.

Pretend you are running Quake III and dynamically adding to & editing the
scene, maybe altering what you see by an order of magnitude in polygon
count from frame to frame.

The app can also make large numbers of "reasonable" VBO's. Either way, if
the sum total of the scene is > available memory,
--->VBO's seem to suffer much more from paging slowdown than Display
Lists.<---
That's my early conclusion, and that "shouldn't be". It's also a warning to
folks who might assume that indexed, unstripified VBO's are *always* the
fastest way to go.

Indeed, in my world, indexed VBO's are the worst way to go in some cases.
Bummer!

The original post was about "large" or "huge" (now I forget) numbers of
triangles. The VBO benchmark did not have "large" (in my universe) numbers,
hence my further stirring of the pot.

Indexed VBO's, like display lists, don't necessarily have to be paged in
all at once to start feeding the pipeline. Both have semantics that would
allow a driver writer to re-order/parcel the data for better paging based
on cache characteristics (which only the driver knows). He'd do what I'd
do, if I knew what the cache size was: take the first magic-number N
indices & collect all the normals, colors & texture indices for them,
and make that one chunk. Repeat. It appears to me that display lists do
just that -- they bring in chunks at a time.
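
For what it's worth, the "take N indices, collect their attributes, make
that one chunk, repeat" idea can be sketched on the application side as
below. This is only a rough illustration, not what any driver is known to
do: the Vertex layout, the Chunk struct and the buildChunks name are all
made up for the example, and maxVertices stands in for the unknown magic
number. The chunk test is deliberately conservative (it starts a new chunk
whenever three more vertices might not fit).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical vertex record: position, normal and color kept together.
struct Vertex {
    float position[3];
    float normal[3];
    std::uint8_t color[4];
};

// One self-contained chunk: its own copy of the vertices it references,
// plus small chunk-local indices into that copy.
struct Chunk {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Walk the global triangle index list and start a new chunk whenever the
// current one might not have room for three more vertices.
// maxVertices must be in [3, 65536] so local indices fit in 16 bits.
std::vector<Chunk> buildChunks(const std::vector<Vertex>& vertices,
                               const std::vector<std::uint32_t>& indices,
                               std::size_t maxVertices)
{
    assert(maxVertices >= 3 && maxVertices <= 65536);
    assert(indices.size() % 3 == 0);

    std::vector<Chunk> chunks;
    std::unordered_map<std::uint32_t, std::uint16_t> remap;  // global -> local

    for (std::size_t i = 0; i < indices.size(); i += 3) {
        if (chunks.empty() ||
            chunks.back().vertices.size() + 3 > maxVertices) {
            chunks.emplace_back();
            remap.clear();  // local numbering restarts in every chunk
        }
        Chunk& chunk = chunks.back();
        for (std::size_t j = 0; j < 3; ++j) {
            const std::uint32_t global = indices[i + j];
            assert(global < vertices.size());
            auto it = remap.find(global);
            if (it == remap.end()) {
                // First use in this chunk: copy the vertex and number it.
                const auto local =
                    static_cast<std::uint16_t>(chunk.vertices.size());
                chunk.vertices.push_back(vertices[global]);
                it = remap.emplace(global, local).first;
            }
            chunk.indices.push_back(it->second);
        }
    }
    return chunks;
}
```

Each Chunk could then go into its own VBO or display list. Note that
vertices shared across a chunk boundary get duplicated, so total memory
grows somewhat as the chunks get smaller.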

jbw



Reply JB 7/10/2005 4:08:57 PM

JB West wrote:
> 
> This isn't static read-only stuff that can be stripified & optimized saved
> away -- it's dynamic
> scientific data that changes while the user runs the workflow.
> 

Ok, now I'm confused. How do you do this with
display lists...?



Reply fungus 7/10/2005 5:02:02 PM

fungus wrote:
> Ok, now I'm confused. How do you do this with
> display lists...?

There are lots of ways to use display lists for dynamic geometry. Skinning
is an obvious one (and nVidia's Linux SDK uses display lists for their
skinning demo). You can probably also use vertex programs, although I've
never tried.

But the most obvious way (and the way I suspect JB is using) is to use lots
of display lists. This is particularly relevant for tree-based rendering
(much more common in sophisticated scientific visualisation but
comparatively rare in games, although there is some overlap) where you can
have huge data sets that cannot just be thrown at the GPU and often can't
even be stored in RAM.

In this case, you use tree-based approaches (e.g. multiresolution
representations) and "chop off" the bottom few levels of nodes in the tree,
replacing them with small batches which are most easily (and very
efficiently) stored in display lists. The bigger the batches, the less
efficient the tree becomes but the more efficient the CPU->GPU transfer
becomes and the more work is shifted from the CPU onto the GPU.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
Reply Jon 7/10/2005 8:11:50 PM

Jon Harrop wrote:
> fungus wrote:
> 
>>Ok, now I'm confused. How do you do this with
>>display lists...?
>
> But the most obvious way (and the way I suspect JB is using) is to use lots
> of display lists.

He said:

"Yes, the app can make huge VBO's. "Breaking up is hard to
do"" --

So...how can he make small display lists but
not small VBOs?



Reply fungus 7/10/2005 10:07:45 PM

"fungus" <umailMY@SOCKSartlum.com> wrote in message
news:T_gAe.55647$US.19040@news.ono.com...
> Jon Harrop wrote:
> > fungus wrote:
> >
> >>Ok, now I'm confused. How do you do this with
> >>display lists...?
> >
> > But the most obvious way (and the way I suspect JB is using) is to use
> > lots of display lists.
>
> He said:
>
> "Yes, the app can make huge VBO's. "Breaking up is hard to
> do"" --
>
> So...how can he make small display lists but
> not small VBOs?
>
>
>

<begin background>
There are lots of display lists (thousands, sometimes tens of thousands),
and they vary from, oh, 10-million-triangle honkers to 10,000-polygon bits
& pieces. This isn't real-time, folks; it can take minutes for algorithms
to run & new datasets to be generated. But then one might want to spin the
result around & analyze it for tens of minutes, make some edits & repeat.
Adding more tens of minutes to tri-strip a big dataset (even if there were
memory to do so -- there isn't), or other "fancy" or memory-hungry
optimizations, isn't in the cards. It might make the graphics display
faster, but it would make the workflow slower.
 I'd love graphics to be faster, wouldn't we all, but frankly that would
just mean the users would use bigger models until the same threshold of
pain was reached. So any optimization for size N that is much worse for
size 2*N doesn't work for me. Hence my disappointment so far with the VBO
thread: swell for some N, much worse than DL's for 10*N. Ditto for other
algorithms that aren't linear-time (or better).
<end background>

Relevant to patient readers:

   I have found that in the case of tri-stripped Display Lists, having one
giant tri-strip is a bad idea. Arbitrarily breaking the strip & restarting
it gives significantly better performance. (Duh -- you'd have thought the
driver writers would have figured that one out.) The same could well be
true of breaking up VBO's, even with the added bookkeeping.
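
As a rough sketch of what "breaking the strip & restarting it" can look
like: each new piece repeats the last two vertices of the previous one, and
every piece starts at an even offset in the original strip so the triangle
winding stays the same. The splitStrip name and the maxVerts parameter are
made up for this example; the best piece length is something you'd have to
measure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split one long triangle strip (given as vertex indices) into pieces of at
// most maxVerts vertices that draw exactly the same triangles.
std::vector<std::vector<std::uint32_t>>
splitStrip(const std::vector<std::uint32_t>& strip, std::size_t maxVerts)
{
    assert(maxVerts >= 4);
    // Use an even piece length: with a 2-vertex overlap, every piece then
    // starts at an even offset, which preserves the winding order.
    const std::size_t pieceLen = maxVerts & ~static_cast<std::size_t>(1);

    std::vector<std::vector<std::uint32_t>> pieces;
    std::size_t start = 0;
    while (start + 2 < strip.size()) {  // at least one triangle left
        const std::size_t end = std::min(start + pieceLen, strip.size());
        pieces.emplace_back(
            strip.begin() + static_cast<std::ptrdiff_t>(start),
            strip.begin() + static_cast<std::ptrdiff_t>(end));
        if (end == strip.size()) break;
        start = end - 2;  // repeat the last two vertices in the next piece
    }
    return pieces;
}
```

Each piece would then be issued as its own GL_TRIANGLE_STRIP. The cost is
two repeated vertices per break.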

   When I have more time I'll try to analyze the effect of taking some Big
Daddy that might be easy to bust up & see if it makes any difference when
it's parcelled into smaller VBO bits. I do know from one run that 2
medium-sized VBO's (each of which runs swell stand-alone) are just as slow
as the same amount of data in one giant VBO -- not a good sign, but that's
just one datapoint.

  Unfortunately, some of the worst data is indeed pretty worst-case:
flat-shaded, per-face normal & color, unstripped cellular models with
"holes" and irregular n-gons thrown in just for fun. That means quite a bit
of data growth to build indexed VBO's in the 1st place. Sadly, Display
Lists exhibit pathological growth in this case too -- judging from the
increase in memory, it looks like driver writers expand the colors &
normals out to be per-vertex also (?) (arghh).

  If anyone has any other ideas, I'm game to try 'em out. Chasing
performance (on a *really* low budget) is a tough one.

jbw


Reply JB 7/11/2005 2:27:19 AM

JB West wrote:
> 
>   If anyone has any other ideas, I'm game to try em out. Chasing performance
> (on a *really* low budget) is a tough one.
> 

1) Make a big list of all triangles.

2) If list is smaller than some size eg. 10,000
triangles goto 5.

3) Split the list in two in biggest axis.

4) goto 2

5) Make display lists/VBOs.


This "split in two until it's small" approach
means the algorithms in step 5 (stripping, indexing)
will run quite fast. I'd make it less than 64k
vertices in a chunk because you can use 16 bit
indices and this saves you some memory.

Be careful when you split the list that you
don't allocate more memory, otherwise you'll
end up using huge amounts of RAM. Better to
store the triangles as a linked list and
rebuild the list into two new ones.
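
A minimal sketch of the "split in two until it's small" steps, with some
assumptions spelled out: it partitions an array in place with
std::nth_element instead of rebuilding linked lists (which also avoids
extra allocations), it splits at the median along the widest axis of the
triangle centroids, and the Triangle struct, Range alias and function names
are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical triangle record; real code would also carry vertex data.
struct Triangle {
    float centroid[3];
};

using Range = std::pair<std::size_t, std::size_t>;  // [first, last)

// Axis (0, 1 or 2) along which the centroids in [first, last) spread most.
static int widestAxis(const std::vector<Triangle>& tris,
                      std::size_t first, std::size_t last)
{
    float lo[3], hi[3];
    for (int a = 0; a < 3; ++a) lo[a] = hi[a] = tris[first].centroid[a];
    for (std::size_t i = first + 1; i < last; ++i) {
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], tris[i].centroid[a]);
            hi[a] = std::max(hi[a], tris[i].centroid[a]);
        }
    }
    int axis = 0;
    for (int a = 1; a < 3; ++a) {
        if (hi[a] - lo[a] > hi[axis] - lo[axis]) axis = a;
    }
    return axis;
}

// Steps 2-4: keep halving [first, last) until each piece is small enough.
static void splitRange(std::vector<Triangle>& tris, std::size_t first,
                       std::size_t last, std::size_t maxTris,
                       std::vector<Range>& out)
{
    if (last - first <= maxTris) {
        if (last > first) out.emplace_back(first, last);
        return;
    }
    const int axis = widestAxis(tris, first, last);
    const std::size_t mid = first + (last - first) / 2;
    // In-place partition around the median: no second copy of the data.
    std::nth_element(tris.begin() + static_cast<std::ptrdiff_t>(first),
                     tris.begin() + static_cast<std::ptrdiff_t>(mid),
                     tris.begin() + static_cast<std::ptrdiff_t>(last),
                     [axis](const Triangle& a, const Triangle& b) {
                         return a.centroid[axis] < b.centroid[axis];
                     });
    splitRange(tris, first, mid, maxTris, out);
    splitRange(tris, mid, last, maxTris, out);
}

// Reorders tris and returns the index ranges that would each become one
// display list or VBO (step 5).
std::vector<Range> splitIntoChunks(std::vector<Triangle>& tris,
                                   std::size_t maxTris)
{
    assert(maxTris >= 1);
    std::vector<Range> out;
    splitRange(tris, 0, tris.size(), maxTris, out);
    return out;
}
```

Since the array is only reordered, each resulting range is a spatially
compact batch of at most maxTris triangles.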


Reply fungus 7/11/2005 2:53:04 AM

Hi JB,

I am also working on something similar, but my quad count is only 2-3
million.

I use one big DL, which is slow. I was planning to use VBO's -- but now
I'm thinking it over :)

The other alternative I am going to try is moving lighting,
transformation and coloring into a shader. I am hoping to get some benefit
from that.

I will post an update if I get good results.
Bye,
ketan

Reply Ketan 7/11/2005 8:18:05 AM

"fungus" <umailMY@SOCKSartlum.com> wrote in message
news:lalAe.55664$US.38586@news.ono.com...
> JB West wrote:
> >
> >   If anyone has any other ideas, I'm game to try em out. Chasing
> > performance (on a *really* low budget) is a tough one.
> >
>
> 1) Make a big list of all triangles.
>
> 2) If list is smaller than some size eg. 10,000
> triangles goto 5.
>
> 3) Split the list in two in biggest axis.
>
> 4) goto 2
>
> 5) Make display lists/VBOs.
>
>
> This "split in two until it's small" approach
> means the algorithms in step 5 (stripping, indexing)
> will run quite fast. I'd make it less than 64k
> vertices in a chunk because you can use 16 bit
> indices and this saves you some memory.
>
> Be careful when you split the list that you
> don't allocate more memory, otherwise you'll
> end up using huge amounts of RAM. Better to
> store the triangles as a linked list and
> rebuild the list into two new ones.
>
>


Well, I split the VBO into chunks of about 14K indices & used shorts for
the array indices. It was 25% slower than one big VBO w/int indices (or
just a drawArray).
I also found out that the ATI FireGLs go into *software* rendering mode
when I use byte normals in a VBO (14.13 driver, both win32 & Linux-64).
NVIDIA drivers run byte normals at ~the same speed as float normals.

I realize that my case is pathological (flat-shaded single-colored faces,
expanded out to independent quads/tris w/4 normals & colors to meet
drawArray/drawElement limitations), but VBO's are all a bust.

Also one caveat: somewhere between 1 million and 2 million triangles in a
single VBO, ATI's driver quietly just doesn't render anything. It goes
through all the motions, and takes as much time to draw as you'd expect,
but nothing gets to the framebuffer. Grr. (Same story with Display Lists.)

So it goes.

jbw


Reply JB 7/14/2005 1:16:15 AM

JB West wrote:
> Well, I split the VBO to be about 14K indices & used a short for array
> indices. It was slower by 25%
> than one big VBO w/int indices (or just a drawarray).

You mean it's slower than a drawArray from
main RAM?

If so then your problem must be that the
VBOs don't fit and paging them to/from the
card is killing you.

I don't see why display lists would be any
better though. They can't possibly use less
memory.

Now that you've got array-splitting code and
have the program organised, you can try
some other stuff....

a) Try sending the indices from main RAM (not
from a VBO). 16 bit indices are twice as fast
as 32 bit ones for this.

b) Is your array data interleaved or not? If
not, try it with interleaved vertex/normal/color.
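
In case "interleaved" isn't clear, here is a small sketch of one possible
layout, assuming float positions and normals plus 4-byte colors. The struct,
the constants and the interleave() helper are made up for illustration, and
the GL calls in the comments only show where the stride and offsets would
go.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One vertex with all its attributes side by side in memory.
struct InterleavedVertex {
    float position[3];      // 12 bytes
    float normal[3];        // 12 bytes
    std::uint8_t color[4];  //  4 bytes, RGBA
};
static_assert(sizeof(InterleavedVertex) == 28,
              "expected a tightly packed 28-byte vertex");

constexpr std::size_t kStride = sizeof(InterleavedVertex);
constexpr std::size_t kPositionOffset = offsetof(InterleavedVertex, position);
constexpr std::size_t kNormalOffset = offsetof(InterleavedVertex, normal);
constexpr std::size_t kColorOffset = offsetof(InterleavedVertex, color);

// With the VBO bound, the stride and offsets would be used roughly like:
//   glVertexPointer(3, GL_FLOAT, kStride, (const GLvoid*)kPositionOffset);
//   glNormalPointer(GL_FLOAT, kStride, (const GLvoid*)kNormalOffset);
//   glColorPointer(4, GL_UNSIGNED_BYTE, kStride, (const GLvoid*)kColorOffset);

// Pack three separate attribute arrays into one interleaved array.
std::vector<InterleavedVertex> interleave(
    const std::vector<float>& positions,       // 3 floats per vertex
    const std::vector<float>& normals,         // 3 floats per vertex
    const std::vector<std::uint8_t>& colors)   // 4 bytes per vertex
{
    const std::size_t count = positions.size() / 3;
    assert(positions.size() == count * 3);
    assert(normals.size() == count * 3);
    assert(colors.size() == count * 4);

    std::vector<InterleavedVertex> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t k = 0; k < 3; ++k) {
            out[i].position[k] = positions[3 * i + k];
            out[i].normal[k] = normals[3 * i + k];
        }
        for (std::size_t k = 0; k < 4; ++k) {
            out[i].color[k] = colors[4 * i + k];
        }
    }
    return out;
}
```

Whether this beats separate arrays depends on the card and driver, so it's
worth timing both.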

> I realize that my case is pathological (flat-shaded single-colored faces,
> expanded out to independent
> quads/tris w/4 normals & colors to meet drawArray/drawElement limitations),
> but VBO's are all a bust.
> 

It doesn't make any sense that display lists
are faster though. They can't possibly be
using less overall memory to store the
vertex data.

Now that you've got array-splitting code, and
assuming that memory paging really is the
problem, it might be worth making just two
write-only VBOs which are big enough to hold
your biggest chunk of vertex data. To draw
a chunk you use glBufferSubDataARB() to copy
vertex data into one of the VBOs and draw
it from there. Alternate between the two
VBOs in a "double buffered" sort of way
(triple buffering might be even better).

This is the same amount of AGP bus traffic
as letting the driver page in all the VBOs
every frame, but it _might_ give the driver
enough of a clue to enable the graphics
card and CPU to work in parallel (ie. graphics
card can be busy rendering while the CPU is
uploading the next chunk of data).
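
A bare-bones sketch of the alternating scheme, with the GL work hidden
behind two callbacks so the control flow can be read (and run) without a GL
context. StreamHooks, ChunkView and drawStreamed are names invented for this
example. In a real build, upload would bind the VBO for that slot and call
glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, bytes, data), and draw would set
the array pointers and issue the draw call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical hooks standing in for the actual OpenGL calls.
struct StreamHooks {
    std::function<void(std::size_t slot, const void* data,
                       std::size_t bytes)> upload;
    std::function<void(std::size_t slot, std::size_t bytes)> draw;
};

// A chunk of vertex data living in main RAM.
struct ChunkView {
    const void* data;
    std::size_t bytes;
};

// Draw every chunk by copying it into one of slotCount fixed-size buffers,
// rotating through the buffers so the next upload never targets the buffer
// that was just submitted for drawing.
void drawStreamed(const std::vector<ChunkView>& chunks,
                  std::size_t slotCount, std::size_t slotBytes,
                  const StreamHooks& hooks)
{
    assert(slotCount >= 2);  // 2 = double buffering, 3 = triple buffering
    std::size_t slot = 0;
    for (const ChunkView& chunk : chunks) {
        assert(chunk.bytes <= slotBytes);  // buffers sized for biggest chunk
        hooks.upload(slot, chunk.data, chunk.bytes);
        hooks.draw(slot, chunk.bytes);
        slot = (slot + 1) % slotCount;
    }
}
```

Whether the driver actually overlaps the upload with rendering is up to the
driver, so this is something to benchmark rather than assume.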


Reply fungus 7/14/2005 4:08:14 AM

fungus wrote:
> It doesn't make any sense that display lists
> are faster though. They can't possibly be
> using less overall memory to store the
> vertex data.

How do you know the display lists aren't compiled into a more efficient
form, e.g. reduced precision?

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
Reply Jon 7/14/2005 10:20:32 PM
