JPEG decoding speeds: x86, ARM, DSP, FPGA, GPU

We're doing some image processing on standard PCs (dual core, 3+ GHz
x86, etc.), and we're looking for more performance.  We're looking at
various JPEG libraries, and considering hotter boxes (quad core), but
we also need to investigate more specialized hardware, such as a GPU
or an FPGA.  Anyone know of any benchmarks that compare the speed of
JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
GPU, etc.?  I know each generation of each type of chip gets more
powerful, and other system factors can make or break performance, but
I'd like to get an idea of what's possible for each chip type and
determine whether specialized hardware will be enough of a win to justify
the investment.

Thanks,

Aj

Reply amygdala.jones (1) 8/22/2007 4:31:22 PM

amygdala.jones@gmail.com wrote:

> We're doing some image processing on standard PCs (dual core, 3+ GHz
> x86, etc.), and we're looking for more performance.

Image processing in general or JPEG compression/decompression?

> We're looking at
> various JPEG libraries, and considering hotter boxes (quad core), but
> we also need to investigate more specialized hardware, such as a GPU
> or an FPGA.

I'd say it really depends on what you want to invest. You can buy very
fast software-only solutions (e.g. www.jpg.com has AFAIK the fastest
JPEG on the market), latest NVIDIA hardware supporting CUDA would be
able to run the DCT on the chip, and if you design your own FPGA, that's
of course fastest. Concerning GPUs I'm not so clear how much that will
buy you. While I haven't done any DCT on a GPU, I did test with DWT and
that wasn't worth it (my CPU was faster). The bottleneck is that you
need to load your data into the GPU first before performing the
operation, and that requires a larger overhead than to just do the
transform in the CPU right away. That is, unless you have specific
conditions where one or the other overhead can be avoided (i.e. the data
is on the GPU anyhow, or has to go there anyhow). This would also be my
next question.

> Anyone know of any benchmarks that compare the speed of
> JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
> GPU, etc.?  I know each generation of each type of chip gets more
> powerful, and other system factors can make or break performance, but
> I'd like to get an idea of what's possible for each chip type and
> determine whether specialized hardware will be enough of a win to justify
> the investment.

I think it would be worth discussing your constraints as well. My
thinking is that unless very specific conditions hold, your best bet is
likely a software solution with a specialized codec, but maybe you do
not want to run it on PC hardware?

So long,
	Thomas
Reply Thomas 8/23/2007 6:07:24 AM


> Image processing in general or JPEG compression/decompression?

Image processing in general, but one area of noticeable delay is
getting the images loaded.  Much of that is media read time (disk or
network time), but the decoding time is substantial.

> I'd say it really depends on what you want to invest. You can buy very
> fast software-only solutions (e.g. www.jpg.com has AFAIK the fastest
> JPEG on the market), latest NVIDIA hardware supporting CUDA would be
> able to run the DCT on the chip, and if you design your own FPGA, that's
> of course fastest. Concerning GPUs I'm not so clear how much that will
> buy you. While I haven't done any DCT on a GPU, I did test with DWT and
> that wasn't worth it (my CPU was faster). The bottleneck is that you
> need to load your data into the GPU first before performing the
> operation, and that requires a larger overhead than to just do the
> transform in the CPU right away. That is, unless you have specific
> conditions where one or the other overhead can be avoided (i.e. the data
> is on the GPU anyhow, or has to go there anyhow). This would also be my
> next question.

www.jpg.com fastest? Can you point to any data?

FPGA beats CUDA? how do you know? Even a low-end FPGA? What class of
FPGA?

The EEMBC's Consumerbench gives some data for embedded procs:
  http://www.eembc.org/benchmark/consumer_sl.asp

The Moscow State site (www.compression.ru) looks promising, but so far
hasn't yielded much:
   http://www.compression-links.info/BenchmarksTests

> I think it would be worth discussing your constraints as well. My
> thinking is that unless very specific conditions hold, your best bet is
> likely a software solution with a specialized codec, but maybe you do
> not want to run it on PC hardware?

I do want to run it on PC hardware, so an FPGA or other device would be on
a PCIx card, or conceivably act as a front end to a PC, but the images are
manipulated via a GUI on the PC.

Let's assume a specialized codec is used: how much faster will it be
than the stock Microsoft GDI+ classes?  Yes, this will vary by task,
but any data's better than speculation.
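
One way to get such data for a particular workload is to time each candidate decoder on the same JPEG bytes already held in memory, so that disk and network time are excluded. This is a minimal sketch: `decode` is a placeholder for whichever decoder call is under test, and the function name `images_per_second` is hypothetical.

```python
import time

def images_per_second(decode, blobs, repeats=3):
    """Best-of-`repeats` throughput of `decode` over a list of byte strings.

    `decode` is any callable taking the compressed bytes of one image.
    The data is already in memory, so only the decode call is timed.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for blob in blobs:
            decode(blob)
        best = min(best, time.perf_counter() - start)
    return len(blobs) / best if best > 0 else float("inf")

if __name__ == "__main__":
    # Stand-in "decoder" (sum of bytes) so the sketch runs without any
    # imaging library; replace it with a real decode call.
    fake_blobs = [bytes(1000)] * 100
    print(f"{images_per_second(sum, fake_blobs):.0f} images/s")
```

Taking the best of several runs reduces noise from other processes; the figures are only comparable between decoders run on the same machine and the same files.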

Would something like a cell processor or Tilera's new 64 core board
win?

To give a concrete example, say I had 1000 5-megapixel images: what would be
the fastest way to thumbnail them all to 300x300 pixels?  For a given
approach, there are different constraints: for a GPU, there are copy
and copy-back times; for an FPGA on a PCI card, there are similar data
communication costs.
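
To put rough numbers on the copy costs in that example, here is a back-of-envelope sketch. The 3 bytes per pixel (decoded 8-bit RGB) and the bus bandwidth figure are assumptions for illustration, not measurements; substitute values for the actual hardware.

```python
# Back-of-envelope data volume for the 1000-image example.
# Assumptions (not measured): decoded images are 8-bit RGB (3 bytes per
# pixel) and the bus sustains `bus_mb_per_s` megabytes per second.

def decoded_bytes(num_images, megapixels, bytes_per_pixel=3):
    """Total size of the decoded images in bytes."""
    return int(num_images * megapixels * 1_000_000 * bytes_per_pixel)

def transfer_seconds(num_bytes, bus_mb_per_s):
    """Time to move num_bytes over a bus with the given sustained rate."""
    return num_bytes / (bus_mb_per_s * 1_000_000)

if __name__ == "__main__":
    full = decoded_bytes(1000, 5)      # all images, fully decoded
    thumbs = 1000 * 300 * 300 * 3      # all thumbnails, decoded
    hypothetical_bus = 1000            # MB/s, placeholder value
    print(f"decoded full-size data: {full / 1e9:.1f} GB")
    print(f"decoded thumbnails:     {thumbs / 1e6:.0f} MB")
    print(f"full images over bus:   {transfer_seconds(full, hypothetical_bus):.1f} s")
    print(f"thumbnails over bus:    {transfer_seconds(thumbs, hypothetical_bus):.2f} s")
```

Under these assumptions, a device that hands back only the thumbnails moves far less data than one that hands back full decoded frames, so where the downscaling happens matters as much as how fast the decode is.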


Aj

Reply amygdala 8/25/2007 1:08:37 AM

amygdala.jones@gmail.com wrote:
>> Image processing in general or JPEG compression/decompression?
> 
> Image processing in general, but one area of noticeable delay is
> getting the images loaded.  Much of that is media read time (disk or
> network time),  but the decoding time is substantial.
> 
>> I'd say it really depends on what you want to invest. You can buy very
>> fast software-only solutions (e.g. www.jpg.com has AFAIK the fastest
>> JPEG on the market), latest NVIDIA hardware supporting CUDA would be
>> able to run the DCT on the chip, and if you design your own FPGA, that's
>> of course fastest. Concerning GPUs I'm not so clear how much that will
>> buy you. While I haven't done any DCT on a GPU, I did test with DWT and
>> that wasn't worth it (my CPU was faster). The bottleneck is that you
>> need to load your data into the GPU first before performing the
>> operation, and that requires a larger overhead than to just do the
>> transform in the CPU right away. That is, unless you have specific
>> conditions where one or the other overhead can be avoided (i.e. the data
>> is on the GPU anyhow, or has to go there anyhow). This would also be my
>> next question.
> 
> www.jpg.com fastest? Can you point to any data?

I'll try to provide you some, please stand by.

> FPGA beats CUDA? how do you know? Even a low-end FPGA? What class of
> FPGA?

I personally haven't done JPEG, but I'm doing JPEG2000, and there I
tried to speed up the transformation on GPUs. Back then CUDA wasn't
available, so I did this by hand in OpenGL. The implementation did run
the wavelet on the GPU, but it wasn't faster than my CPU-only
implementation; it was even a bit slower. With a higher-spec'd graphics
card you could improve the speed a bit, but the real bottleneck is the
memory bandwidth between CPU and GPU, whereas the CPU can access its
cache at full speed (if you can make use of the CPU cache, that is).
This means that a very fast implementation on the GPU is not
necessarily helpful.

The GPU in this test was an NVidia 6600GT, the CPU an AMD64x2 4800EE.

On the other hand, colleagues of mine at the Fraunhofer IIS institute
built, IIRC, an FPGA version of JPEG2000 "on chip" that compresses (it's
a compressor only) 4K or 8K images at 24 fps or more. Thus,
yes, if you can put it on a chip completely and do not need to exchange
data with the PC, you're better off with an FPGA. Of course, it depends
on the algorithms you need to run, but a DCT should be fairly simple.

> The EEMBC's Consumerbench gives some data for embedded procs:
>   http://www.eembc.org/benchmark/consumer_sl.asp

That's again a very different part of the story. Vendors of embedded
processors often supply hand-optimized versions of the DCT just for this
purpose - I know definitely that the TI chip libraries include a DCT.
But that's really a different beast compared to the general-purpose
processors in your PC.

> The Moscow State site (www.compression.ru) looks promising, but so far
> hasn't yielded much:
>    http://www.compression-links.info/BenchmarksTests

It's currently offline, too. I mainly checked that site for JPEG2000,
not JPEG.

>> I think it would be worth discussing your constraints as well. My
>> thinking is that unless very specific conditions hold, your best bet is
>> likely a software solution with a specialized codec, but maybe you do
>> not want to run it on PC hardware?
> 
> I do want to run it on PC hardware, so an FPGA or other device would be on
> a PCIx card, or conceivably act as a front end to a PC, but the images are
> manipulated via a GUI on the PC.

Uh, then again you have the bottleneck of the PCI bus. So how do
images arrive at the PC? Are they on the hard disk, or do you digitize
them? In the first case, a software-only version might be faster than
moving the data around over the bus; in the second case, a chip on the
digitizer might be the better solution.

Furthermore, sorry if I'm asking this again, but how large are your images
on average? If we're talking about 4K or above, you need to consider
bandwidth issues, and it's best to compress images as early in your
toolchain as possible. Otherwise, I/O time might dominate
compression time, especially for JPEG-1.


> Let's assume a specialized codec is used: how much faster will it be
> than the stock Microsoft GDI+ classes?  Yes, this will vary by task,
> but any data's better than speculation.

I'll try to deliver some figures.

> Would something like a cell processor or Tilera's new 64 core board
> win?

Depends on the code. With plain off-the-shelf code, no. But
if you can parallelize the code across the several CPUs of the Cell, it
might well be pretty fast. I would assume that JPEG-1 would profit
from this very much, as the blocks are more or less encoded individually.
But again, note that the data has to get into the CPU somehow, and this
might be more of a problem than actually processing it. I do not know
enough about the Cell's memory hub and cache architecture to judge.
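
A sketch of that block-level parallelism, under stated assumptions: the coefficient blocks have already been entropy-decoded (that stage is sequential within a scan unless the file contains restart markers), and `process_block` is a hypothetical stand-in for the per-block dequantize-and-IDCT work. Python threads will not speed up pure-Python arithmetic; the point is only the structure, in which each block is an independent work item.

```python
from concurrent.futures import ThreadPoolExecutor

def process_block(block):
    """Hypothetical per-block work; here it only level-shifts by 128."""
    return [value + 128 for value in block]

def process_blocks_parallel(blocks, workers=4):
    """Hand independent 8x8 blocks (flattened to 64 values) to workers.

    Executor.map returns results in input order, so the output can be
    reassembled into the image without extra bookkeeping.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_block, blocks))

if __name__ == "__main__":
    blocks = [[i] * 64 for i in range(6)]
    out = process_blocks_parallel(blocks)
    print(len(out), "blocks processed")
```

For a batch of many files, the same pattern applied per image rather than per block avoids the sequential entropy-decoding stage altogether, since separate files share no state.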

> To give a concrete example, say I had 1000 5-megapixel images: what would be
> the fastest way to thumbnail them all to 300x300 pixels?  For a given
> approach, there are different constraints: for a GPU, there are copy
> and copy-back times; for an FPGA on a PCI card, there are similar data
> communication costs.

Yes, exactly, and that's the major bottleneck. If the images are on a
hard disk, you also need to load them from there. If you want to thumbnail
them, AFAIK the fastest way is to decode them only partially:
extract only the DC part from each DCT block and combine those into a
new image. This gives you a downscaling of 1:8, which might be exactly
what you need. If that isn't sufficient, you can still apply a cubic
spline filter or something similar to the output. Compressing the
thumbnails should be less of a problem, as they are small. However,
I would still expect I/O time to be the limiting factor here.
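
Why the DC terms alone give a 1:8 thumbnail: with the 8x8 DCT as JPEG defines it, the DC coefficient of a block is the sum of its level-shifted samples divided by 8, i.e. 8 times the block mean. The pure-Python sketch below builds the equivalent thumbnail from an already-decoded grayscale array, purely for illustration; a real decoder would read the DC terms from the bitstream and skip the IDCT. The function names are hypothetical.

```python
N = 8  # JPEG block size

def dct_dc(block):
    """DC term of the 2-D DCT of an 8x8 block of level-shifted samples.

    For the zero-frequency term every cosine is 1 and the combined
    scale factor is 1/8, so DC = (sum of the 64 samples) / 8.
    """
    return sum(sum(row) for row in block) / N

def dc_thumbnail(image):
    """1:8 downscale of a grayscale image (list of rows) whose sides are
    multiples of 8: one output pixel per block, equal to the block mean."""
    height, width = len(image), len(image[0])
    thumb = []
    for by in range(0, height, N):
        row = []
        for bx in range(0, width, N):
            block = [[image[y][x] - 128 for x in range(bx, bx + N)]
                     for y in range(by, by + N)]
            # DC / 8 is the mean of the shifted samples; undo the shift.
            row.append(dct_dc(block) / N + 128)
        thumb.append(row)
    return thumb

if __name__ == "__main__":
    # 16x16 test image made of four flat 8x8 quadrants.
    image = [[10] * 8 + [20] * 8 for _ in range(8)] + \
            [[30] * 8 + [40] * 8 for _ in range(8)]
    print(dc_thumbnail(image))
```

Libraries expose this kind of reduced-size decode directly; libjpeg, for example, has `scale_num` and `scale_denom` decompression parameters for it, which is usually preferable to hand-rolling the extraction.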

So long,
	Thomas
Reply Thomas 8/27/2007 8:53:30 AM

<amygdala.jones@gmail.com> wrote in message 
news:1187800282.365363.188700@z24g2000prh.googlegroups.com...
> We're doing some image processing on standard PCs (dual core, 3+ GHz
> x86, etc.), and we're looking for more performance.  We're looking at
> various JPEG libraries, and considering hotter boxes (quad core), but
> we also need to investigate more specialized hardware, such as a GPU
> or an FPGA.  Anyone know of any benchmarks that compare the speed of
> JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
> GPU, etc.?  I know each generation of each type of chip gets more
> powerful, and other system factors can make or break performance, but
> I'd like to get an idea of what's possible for each chip type and
> determine whether specialized hardware will be enough of a win to justify
> the investment.
>

simplest may be to get a fairly new processor, and then attempt to write a
specialized jpeg decoder using SSE (SSE3 or SSSE3, which may offer a few
helpful instructions). it may be worthwhile, in particular, to
look into packed integers.

(or, limiting yourself to SSE2, you can have it run on most processors
released within the past few years).

never done this myself (I get plenty fast decode speeds using traditional
methods).
actually, for the typical task of decoding images and cramming them into
OpenGL, I suspect loading the images into GL is probably the main slow point
(I experimented with this kind of thing before, disabling everything but the
glTexImage call, which seemed to be where all the time was going).

(the experiment at the time was trying to decode video and cram it into a
texture map, but I found this hurt the framerate too badly to be of much
real use...).

so, little need to worry about it in my case...


but, what SSE allows is this:
you can do an IDCT as a number of vector operations, which may be faster 
than more traditional methods.
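
To illustrate why the IDCT maps onto vector instructions: the 8x8 inverse transform is separable, so it can be done as eight 1-D passes over the rows followed by eight over the columns, and each 1-D pass is a set of multiply-accumulates against fixed cosine constants. The sketch below is plain Python, not SSE code, and uses floating point rather than packed integers; it only shows the row/column structure that a SIMD version would vectorize.

```python
import math

N = 8

# COS[x][u] = c(u) * cos((2x + 1) * u * pi / 16), with c(0) = sqrt(1/8)
# and c(u) = sqrt(2/8) otherwise (the orthonormal DCT basis).
COS = [[(math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N))
        * math.cos((2 * x + 1) * u * math.pi / (2 * N))
        for u in range(N)] for x in range(N)]

def idct_1d(coeffs):
    """8-point inverse DCT: eight dot products against constant rows."""
    return [sum(COS[x][u] * coeffs[u] for u in range(N)) for x in range(N)]

def idct_2d(block):
    """Separable 2-D IDCT: a 1-D pass over each row, then each column."""
    rows = [idct_1d(row) for row in block]
    cols = [idct_1d([rows[v][x] for v in range(N)]) for x in range(N)]
    # cols[x][y] holds sample (y, x); transpose back to row-major order.
    return [[cols[x][y] for x in range(N)] for y in range(N)]

if __name__ == "__main__":
    # A block with only a DC coefficient decodes to a flat block.
    block = [[0.0] * N for _ in range(N)]
    block[0][0] = 80.0
    samples = idct_2d(block)
    print(round(samples[0][0], 6), round(samples[7][7], 6))
```

In a SIMD implementation each of those dot products, or several rows at once, becomes a handful of packed multiply-add operations, typically on 16-bit fixed-point values.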

just an idle thought really...


> Thanks,
>
> Aj
> 


Reply cr88192 8/28/2007 6:26:11 AM

On Aug 22, 12:31 pm, amygdala.jo...@gmail.com wrote:
> We're doing some image processing on standard PCs (dual core, 3+ GHz
> x86, etc.), and we're looking for more performance.  We're looking at
> various JPEG libraries, and considering hotter boxes (quad core), but
> we also need to investigate more specialized hardware, such as a GPU
> or an FPGA.  Anyone know of any benchmarks that compare the speed of
> JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
> GPU, etc.?  I know each generation of each type of chip gets more
> powerful, and other system factors can make or break performance, but
> I'd like to get an idea of what's possible for each chip type and
> determine whether specialized hardware will be enough of a win to justify
> the investment.
>
> Thanks,
>
> Aj

I've been studying software-based imaging performance on various CPUs
for the last few years and have recently rewritten my JPEG decoder in
C and ARM asm for more speed.  Please have a look at my articles here:
http://bitbank.wordpress.com for some info about the bottlenecks and
optimizations needed for faster image handling.
Larry B.

Reply bitbank 9/6/2007 2:57:23 PM
