memory copying onto video memory

Hello all,

I'm writing a program that renders video using OpenGL.
I started running some tests with Intel VTune and noticed that my
bottleneck is the memcpy.
The worst impact was 64K aliasing, which I have already fixed. Now I have a
very big problem with Second-Level Cache Load Misses,
which account for 89% of the workload, followed by First-Level Cache Load
Misses (4.89%) and then Blocked Store Forwards (0.27 sec processor time,
0.39% of the workload).
If I compile with Visual Studio 2002, I get up to 79.39% of clockticks with
Store Buffer Stalls.

My application receives samples from a video stream, which are copied to
video memory in the following way:

Here I get the pointer for my video memory:
void * pboMemory = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, GL_WRITE_ONLY);

Here I test for 64K Aliasing:

if ((((unsigned int)pboMemory) & 0xFFF8) == (((unsigned int)m_pImage) & 0xFFF8)) {
   pboMemory = (void*)(((unsigned int)pboMemory) + 4096);
   bufferOffset = m_PBufferOffset;
}

And then I copy the image (RGB32) 1024x384 :

memcpy(pboMemory, m_pImage, m_textureSize); // textureSize = 1024x384x4

I'm using the Visual C++ 2003 compiler with the following options:
/O2 /Og /Ot /Ob2 /GT /G7 /GA
The video should be rendered at 50 fps but I'm only achieving 40 - 45.

My question is whether non-temporal writes are being used here or not. If
not, how can I force them?
My application will be delivered only for P4 or AMD Athlon XP and above.

Does anybody know how I can get rid of these problems? Is there a memcpy
that goes around the cache?

wpr.


Reply whisper 6/21/2004 7:57:41 PM

whisper <mlopes_filho@br.com.yahoo> wrote:
> memcpy(pboMemory, m_pImage, m_textureSize); // textureSize = 1024x384x4
> 
> I'm using Visual C++ 2003 compiler 

I have no earthly clue about Visual C++.  Isn't that a post
increment operator, so it's `C` when used and only later improved :)
I would suggest you instruct the compiler to dump asm or disassemble
the source to see what instructions it is generating.

> The video should be rendered at 50 fps but I'm achieving 40 - 45.

Why do you believe the hardware is capable of 50 fps?  Vidram has
horrible demands upon it.  Achieving 40 - 45 doesn't sound that bad.
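
For reference, the copy rate implied by the numbers in the original post can be worked out directly. This is only arithmetic on the quoted frame size (1024x384, 4 bytes per pixel) and frame rates; the function names are made up for illustration, and nothing here says what the hardware can actually sustain.

```c
#include <stddef.h>

/* Bytes in one frame: width x height x bytes per pixel. */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Bytes per second that must be copied to sustain `fps` frames per second. */
static size_t required_bytes_per_second(size_t bytes_per_frame, size_t fps)
{
    return bytes_per_frame * fps;
}
```

With the thread's figures, one frame is 1,572,864 bytes (1.5 MiB), so 50 fps needs 78,643,200 bytes/s (75 MiB/s), and the observed 40 fps corresponds to 62,914,560 bytes/s (60 MiB/s).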

> Is there a memcopy that goes around the cache !?

I don't know.  The preferred way of transferring framebuffers
is to get the Graphics Processor to do the DMA.  It can then
do the necessary sync for display refresh.  I believe there
are OpenGL and DirectX calls to do this.

Vidram normally should not be flagged as cacheable

-- Robert

Reply Robert 6/21/2004 8:55:33 PM


"whisper" <mlopes_filho@br.com.yahoo> wrote in message
news:cb7e7s$9tl$1@online.de...
> Hello all,
>
> I'm writing a program that render video using OpenGL.
> I started making some tests using Intel Vtune and I noticed that my
> bottleneck is the memcpy.
> The worse impact was 64K aliasing which I already fixed, now I have a very
> big problem with Second-Level Cache Load misses,
> it accounts for 89% of the work load, followed by First-Level Cache Load
> Misses (4,89%) and then Blocked Store Forwards 0.27 sec processor time,
> 0.39% workload.
> If I compile with Visual Studio 2002, I get up to 79.39% of clockticks with
> Store Buffer Stalls.
....
> My question here is if non temporal writes are beeing used here or not. If
> not, how can I force it!?
> My application will be delivered only for (P4 or AMD Athlon XP) and above.
>
> Does anybody knows how I can get rid of these Problems, Is there a memcopy
> that goes around the cache !?

Unfortunately, there is no standard memcpy() that does prefetching or
non-temporal writes, so you'll need to roll your own replacement (probably
with inline asm).

The instructions you probably want are prefetchnta, movdqa/movdqu (for
loads), and movntdq (for stores).  Prefetches combined with loop unrolling
will help reduce read stalls.
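
A rough sketch of such a replacement, written with compiler intrinsics instead of inline asm. The helper name stream_copy_sse2, the 64-byte unroll and the 256-byte prefetch distance are all assumptions chosen for illustration, not tuned values. Note that movdqu and movntdq are SSE2 instructions, so this variant runs on a P4 but not on an Athlon XP.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 (MOVDQU), _mm_stream_si128 (MOVNTDQ) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch (PREFETCHNTA), _mm_sfence */

/* Hypothetical helper, not from the thread.
   dst must be 16-byte aligned because MOVNTDQ requires an aligned
   destination; src may be unaligned (loads use MOVDQU). */
static void stream_copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(((uintptr_t)d & 15u) == 0);

    /* Unrolled main loop: 64 bytes per iteration. */
    while (n >= 64) {
        /* Prefetch ahead only while the target is still inside the source.
           The 256-byte distance is a guess; tune it by measurement. */
        if (n > 256)
            _mm_prefetch((const char *)s + 256, _MM_HINT_NTA);

        __m128i a = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

        /* Non-temporal stores: written through write-combining buffers
           without allocating cache lines for the destination. */
        _mm_stream_si128((__m128i *)(d + 0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        s += 64;
        d += 64;
        n -= 64;
    }

    /* Copy the remaining 0..63 bytes with ordinary stores. */
    if (n)
        memcpy(d, s, n);

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

In the original code this would be called in place of memcpy(pboMemory, m_pImage, m_textureSize), provided the mapped pointer is 16-byte aligned. Whether it actually beats memcpy on a given driver has to be measured.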

S

-- 
Stephen Sprunk      "Those people who think they know everything
CCIE #3723         are a great annoyance to those of us who do."
K5SSS                                             --Isaac Asimov

Reply stephen (1128) 6/21/2004 9:11:58 PM

Hi again,

I also think that it shouldn't be flagged as cacheable. Here is an excerpt
from nvidia:

Some implementations may choose, for example,
        that they would rather stream vertex data from write-combined
        system memory, element (or index) data from video memory, and
        pixel data from video memory.

        In fact, one can imagine arbitrarily complicated heuristics for
        selecting the memory space, based on factors such as the target,
        the "usage" argument, and the application's observed behavior.

In this case I map the buffer with GL_WRITE_ONLY, which should give a hint
that I want memory to which I can copy very fast.

I just found the function memcpy_amd at amd.com and it looks good enough.
It was originally written in 2001. I don't know if it's up to date, but I'm
sure there is room for more improvement.
For example, the fact that the image is always RGB32 may help, right?

I'm just beginning with asm. I'll try to find a way to get the maximum
performance out of it.

wpr. 


Reply whisper 6/22/2004 2:36:24 AM

> My application receives samples from a video stream, that are copied to
> video memory in the folowing way:
I think you should try to avoid the memcpy. While receiving the samples, you
should draw directly into the buffer, or an off-screen buffer in video
memory, and then BLT the image.

H


Reply Herman 6/22/2004 8:08:09 AM

Hi,

This is what I think I'm doing.
OpenGL PBO is the fastest way to copy to the texture as far as I know.

Thanks anyway.

"Herman Dullink" <hd5@hetnet.nl> wrote in message 
news:cb8ovv$ign$1@reader10.wxs.nl...
>> My application receives samples from a video stream, that are copied to
>> video memory in the folowing way:
> I think you should try to avoid the memcpy. While receiving the samples, you
> should draw directly into the buffer, or an off-screen buffer in video
> memory, and then BLT the image.
>
> H
>
> 


Reply whisper 6/22/2004 6:07:47 PM

"whisper" <mlopes_filho@br.com.yahoo> wrote in message news:<cb7e7s$9tl$1@online.de>...
> Hello all,
> 
> I'm writing a program that render video using OpenGL.
> I started making some tests using Intel Vtune and I noticed that my 
> bottleneck is the memcpy.
> The worse impact was 64K aliasing which I already fixed, now I have a very 
> big problem with Second-Level Cache Load misses,
> it accounts for 89% of the work load, followed by First-Level Cache Load 
> Misses (4,89%) and then Blocked Store Forwards 0.27 sec processor time, 
> 0.39% workload.
> If I compile with Visual Studio 2002, I get up to 79.39% of clockticks with 
> Store Buffer Stalls.
> 
> My application receives samples from a video stream, that are copied to 
> video memory in the folowing way:
> 
> Here I get the pointer for my video memory:
> void * pboMemory = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, GL_WRITE_ONLY);
> 
> Here I test for 64K Aliasing:
> 
>   if ((((unsigned int)pboMemory) & 0xFFF8) == (((unsigned int)m_pImage) & 
> 0xFFF8)) {
>    pboMemory = (void*)(((unsigned int)pboMemory) + 4096);
>    bufferOffset = m_PBufferOffset;
> }

To avoid 64K aliasing, you do not need to introduce an offset of 4096.
A smaller number like 256 will probably suffice (it depends on the
processor you use).
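
As a sketch, the quoted check can be written with uintptr_t (so the pointer is not truncated on a 64-bit build) and with the offset passed in as a parameter. The mask is the same 0xFFF8 used in the quoted code; the function names are invented here, and the caller is assumed to have allocated at least `skew` spare bytes at the end of the destination buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bit mask as the quoted check: bits 3..15 of the address. */
#define ALIAS_MASK ((uintptr_t)0xFFF8u)

/* Nonzero if a and b agree in all masked address bits. */
static int addresses_alias_64k(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & ALIAS_MASK) == 0;
}

/* Returns dst advanced by `skew` bytes if it aliases src, else dst itself.
   Assumption: the destination buffer has at least `skew` spare bytes. */
static void *skew_if_aliased(void *dst, const void *src, size_t skew)
{
    if (addresses_alias_64k(dst, src))
        return (unsigned char *)dst + skew;
    return dst;
}
```

A skew of 256 keeps the destination 16-byte aligned, which matters if the copy itself uses aligned SSE stores.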

> And then I copy the image (RGB32) 1024x384 :
> 
> memcpy(pboMemory, m_pImage, m_textureSize); // textureSize = 1024x384x4
> 
> I'm using Visual C++ 2003 compiler with the following /O2 /Og /Ot /Ob2 /GT 
> /G7 /GA
> The video should be rendered at 50 fps but I'm achieving 40 - 45.
> 
> My question here is if non temporal writes are beeing used here or not. If 
> not, how can I force it!?
> My application will be delivered only for (P4 or AMD Athlon XP) and above.

Unlikely.  If I am right, Visual C++ 2003 memcpy() on x86 is
implemented as a REP MOVS string operation (MOVSD for the bulk, byte
moves for the remainder), which uses ordinary cached stores.  It is not
that the compiler developers are lazy or stupid.  The memcpy() they
implement has to be generic and applicable for all cases -- they don't
know if the source and destination addresses are aligned or unaligned,
cacheable or not, and they don't know if the number of bytes to be
copied is larger than 64, a multiple of 8, and so on.

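If you need non-temporal writes, you have to do your own x86 assembly
programming with instructions like MOVNTQ and MOVNTDQ.  MOVNTQ (64-bit,
MMX registers) and MOVNTPS belong to SSE1 (Katmai New Instructions);
MOVNTDQ (128-bit, XMM registers) belongs to SSE2.  The P4 supports both,
but the Athlon XP supports only SSE1, so if one code path has to run on
both CPUs it cannot rely on MOVNTDQ.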
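
Because the two target CPUs differ in which of these instructions they have, one option is to check the CPUID feature flags once at startup and pick a copy routine accordingly. A minimal sketch, assuming GCC/Clang's <cpuid.h> or the __cpuid intrinsic that newer Visual C++ versions provide in <intrin.h>; the helper names are made up. CPUID leaf 1 reports SSE in EDX bit 25 and SSE2 in EDX bit 26.

```c
#if defined(_MSC_VER)
#include <intrin.h>   /* __cpuid (newer Visual C++ versions) */
#else
#include <cpuid.h>    /* __get_cpuid (GCC/Clang) */
#endif

/* Returns the EDX feature flags from CPUID leaf 1, or 0 if unavailable. */
static unsigned cpuid1_edx(void)
{
#if defined(_MSC_VER)
    int regs[4];                 /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    return (unsigned)regs[3];
#else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx;
#endif
}

/* SSE (MOVNTQ, MOVNTPS, PREFETCHNTA, SFENCE): EDX bit 25. */
static int cpu_has_sse(void)  { return (int)((cpuid1_edx() >> 25) & 1u); }

/* SSE2 (MOVNTDQ, MOVDQA, MOVDQU): EDX bit 26. */
static int cpu_has_sse2(void) { return (int)((cpuid1_edx() >> 26) & 1u); }
```

An application could call cpu_has_sse2() once and fall back to an SSE1 routine, or to plain memcpy, when it returns 0.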

> Does anybody knows how I can get rid of these Problems, Is there a memcopy 
> that goes around the cache !?

By using non-temporal writes, you already bypass the cache.  But that
doesn't matter much because most video memory is not cacheable anyway.

Keep in mind that non-temporal writes on P4 still suffer from 64K
aliasing, though less than temporal writes do.

Reply alaskanmalamute 6/23/2004 6:20:27 PM
