We're doing some image processing on standard PCs (dual core, 3+ GHz
x86, etc.), and we're looking for more performance. We're looking at
various JPEG libraries and considering hotter boxes (quad core), but
we also need to investigate more specialized hardware, such as a GPU
or an FPGA. Anyone know of any benchmarks that compare the speed of
JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
GPU, etc.? I know each generation of each type of chip gets more
powerful, and other system factors can make or break performance, but
I'd like to get an idea of what's possible for each chip type and
determine whether specialized hardware will be enough of a win to justify
the investment.
Thanks,
Aj
amygdala.jones (1)
8/22/2007 4:31:22 PM
amygdala.jones@gmail.com wrote:
> We're doing some image processing on standard PCs (dual core, 3+ GHz
> x86, etc.), and we're looking for more performance.
Image processing in general or JPEG compression/decompression?
> We're looking at
> various jpeg libraries, and considering hotter boxes (quad core), but
> we also need to investigate more specialized hardware, such as a GPU
> or an fpga.
I'd say it really depends on what you want to invest. You can buy very
fast software-only solutions (e.g. www.jpg.com has, AFAIK, the fastest
JPEG codec on the market); the latest NVIDIA hardware supporting CUDA
should be able to run the DCT on the chip; and if you design your own
FPGA, that's of course fastest. Concerning GPUs, I'm not sure how much
that will buy you. While I haven't done any DCT on a GPU, I did test
with the DWT, and that wasn't worth it (my CPU was faster). The
bottleneck is that you need to load your data into the GPU before
performing the operation, and that costs more than just doing the
transform on the CPU right away. That is, unless specific conditions
hold where one or the other overhead can be avoided (i.e. the data is on
the GPU anyhow, or has to go there anyhow). This would also be my next
question.
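The trade-off described above can be written down as a rough break-even test. This is only a sketch: the function names are made up for illustration, and every number in the demo is a placeholder assumption, not a measurement from this thread.

```python
def offload_time(bytes_to_device, bytes_from_device, bus_bytes_per_s, device_compute_s):
    """Wall time for one offloaded job: copy in, compute on the device, copy back."""
    transfer_s = (bytes_to_device + bytes_from_device) / bus_bytes_per_s
    return transfer_s + device_compute_s


def offload_pays_off(cpu_compute_s, **offload_kwargs):
    """True only if the whole round trip beats doing the work on the CPU."""
    return offload_time(**offload_kwargs) < cpu_compute_s


# Hypothetical figures, chosen only to exercise the model.
job = dict(
    bytes_to_device=15_000_000,      # one decoded 5 Mpix RGB frame (assumed)
    bytes_from_device=15_000_000,    # transformed frame copied back (assumed)
    bus_bytes_per_s=1_000_000_000,   # assumed effective bus throughput
    device_compute_s=0.005,          # assumed transform time on the device
)
print("offload total (s):", offload_time(**job))
print("beats a 30 ms CPU transform:", offload_pays_off(0.030, **job))

# If the data already lives on the device, the copies drop out of the sum.
resident = dict(job, bytes_to_device=0, bytes_from_device=0)
print("beats it when data is resident:", offload_pays_off(0.030, **resident))
```

The point of the model is that a device with a much faster transform can still lose once both copies are counted, which matches the DWT experience reported above.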
> Anyone know of any benchmarks that compare the speed of
> JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
> GPU, etc.? I know each generation of each type of chip gets more
> powerful, and other system factors can make or break performance, but
> I'd like to get an idea of what's possible for each chip type and
> determine whether specialized hardware will be enough of a win to justify
> the investment.
I think it would be worth discussing your constraints as well. My
thinking is that unless very specific conditions hold, your best bet is
likely a software solution with a specialized codec, but maybe you do
not want to run it on PC hardware?
So long,
Thomas
Thomas
8/23/2007 6:07:24 AM
> Image processing in general or JPEG compression/decompression?
Image processing in general, but one area of noticeable delay is
getting the images loaded. Much of that is media read time (disk or
network time), but the decoding time is also substantial.
> I'd say it really depends on what you want to invest. You can buy very
> fast software-only solutions (e.g. www.jpg.com has AFAIK the fastest
> JPEG on the market), latest NVIDIA hardware supporting CUDA would be
> able to run the DCT on the chip, and if you design your own FPGA, that's
> of course fastest. Concerning GPUs I'm not so clear how much that will
> buy you. While I haven't done any DCT on a GPU, I did test with DWT and
> that wasn't worth it (my CPU was faster). The bottleneck is that you
> need to load your data into the GPU first before performing the
> operation, and that requires a larger overhead than to just do the
> transform in the CPU right away. That is, unless you have specific
> conditions where one or the other overhead can be avoided (i.e. the data
> is on the GPU anyhow, or has to go there anyhow). This would also be my
> next question.
www.jpg.com fastest? Can you point to any data?
FPGA beats CUDA? How do you know? Even a low-end FPGA? What class of
FPGA?
The EEMBC's Consumerbench gives some data for embedded procs:
http://www.eembc.org/benchmark/consumer_sl.asp
The Moscow State site (www.compression.ru) looks promising, but so far
hasn't yielded much:
http://www.compression-links.info/BenchmarksTests
> I think it would be worth discussing your constraints as well. My
> thinking is that unless very specific conditions hold, your best bet is
> likely a software solution with a specialized codec, but maybe you do
> not want to run it on PC hardware?
I do want to run it on PC hardware, so an FPGA or other device would sit
on a PCIx card, or conceivably act as a front end to a PC, but the images
are manipulated via a GUI on the PC.
Assuming a specialized codec is used, how much faster will it be
than the stock Microsoft GDI+ classes? Yes, this will vary by task,
but any data is better than speculation.
Would something like a Cell processor or Tilera's new 64-core board
win?
To give a concrete example, say I had 1000 5 Mpix images: what would be
the fastest way to thumbnail them all to 300x300 pixels? Each
approach has its own constraints: for a GPU, there are copy
and copy-back times; for an FPGA on a PCI card, there are similar data
communication costs.
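To put rough numbers on that example, here is a small sketch of the data volumes involved. It assumes 24-bit RGB after decoding, and the bus rate is a placeholder (133 MB/s, roughly the theoretical peak of classic 32-bit/33 MHz PCI); neither figure comes from a measurement, and the helper names are made up.

```python
BYTES_PER_PIXEL = 3  # assumption: 24-bit RGB once decoded


def decoded_bytes(pixels, count=1):
    """Uncompressed size in bytes of `count` images of `pixels` pixels each."""
    return pixels * BYTES_PER_PIXEL * count


def transfer_seconds(n_bytes, bytes_per_s):
    """Time to move n_bytes across a link with the given sustained throughput."""
    return n_bytes / bytes_per_s


full_total = decoded_bytes(5_000_000, count=1000)    # all 1000 decoded frames
thumb_total = decoded_bytes(300 * 300, count=1000)   # all 1000 thumbnails

ASSUMED_BUS_RATE = 133_000_000  # bytes/s, placeholder value

print("decoded frames (bytes):", full_total)
print("thumbnails (bytes):", thumb_total)
print("seconds to push the decoded frames over the bus:",
      transfer_seconds(full_total, ASSUMED_BUS_RATE))
print("seconds to push only the thumbnails:",
      transfer_seconds(thumb_total, ASSUMED_BUS_RATE))
```

The gap between the two totals suggests why it matters which side of the bus the decode and downscale happen on: shipping compressed input and small thumbnails is far cheaper than shipping full decoded frames.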
Aj
amygdala
8/25/2007 1:08:37 AM
amygdala.jones@gmail.com wrote:
>> Image processing in general or JPEG compression/decompression?
>
> Image processing in general, but one area of noticeable delay is
> getting the images loaded. Much of that is media read time (disk or
> network time), but the decoding time is substantial.
>
>> I'd say it really depends on what you want to invest. You can buy very
>> fast software-only solutions (e.g. www.jpg.com has AFAIK the fastest
>> JPEG on the market), latest NVIDIA hardware supporting CUDA would be
>> able to run the DCT on the chip, and if you design your own FPGA, that's
>> of course fastest. Concerning GPUs I'm not so clear how much that will
>> buy you. While I haven't done any DCT on a GPU, I did test with DWT and
>> that wasn't worth it (my CPU was faster). The bottleneck is that you
>> need to load your data into the GPU first before performing the
>> operation, and that requires a larger overhead than to just do the
>> transform in the CPU right away. That is, unless you have specific
>> conditions where one or the other overhead can be avoided (i.e. the data
>> is on the GPU anyhow, or has to go there anyhow). This would also be my
>> next question.
>
> www.jpg.com fastest? Can you point to any data?
I'll try to provide you with some; please stand by.
> FPGA beats CUDA? how do you know? Even a low-end FPGA? What class of
> FPGA?
I personally haven't done JPEG, but I'm doing JPEG2000, and there I
tried to speed up the transform on GPUs. Back then CUDA wasn't
available, so I did this manually in OpenGL. The implementation did run
the wavelet on the GPU, but it wasn't faster than my CPU-only
implementation; it was even a bit slower. With a higher-spec'd graphics
card you could improve the speed a bit, but the real bottleneck is the
memory bandwidth between CPU and GPU, whereas the CPU can access its
cache at full speed (if you can make use of the CPU cache, that is).
This means that having a very fast implementation on the GPU is not
necessarily helpful.
The GPU in this test was an NVIDIA 6600GT, the CPU an AMD64x2 4800EE.
On the other hand, colleagues of mine from the Fraunhofer IIS institute
did, IIRC, an FPGA version of JPEG2000 "on chip" that compresses (it's a
compressor only) 4K or 8K images at 24 fps or more. Thus,
yes, if you can put it on a chip completely and do not need to exchange
data with the PC, you're better off with an FPGA. Of course, it depends
on the algorithms you need to run, but a DCT should be fairly simple.
> The EEMBC's Consumerbench gives some data for embedded procs:
> http://www.eembc.org/benchmark/consumer_sl.asp
That's again a very different part of the story. Vendors of embedded
processors often supply hand-optimized versions of the DCT just for this
purpose - I know definitely that the TI chip libraries include a DCT.
But that's really a different beast compared to the general-purpose
processors in your PC.
> The Moscow State site (www.compression.ru) looks promising, but so far
> hasn't yielded much:
> http://www.compression-links.info/BenchmarksTests
It's currently off-line, too. I mainly checked that site for JPEG2000,
not JPEG.
>> I think it would be worth discussing your constraints as well. My
>> thinking is that unless very specific conditions hold, your best bet is
>> likely a software solution with a specialized codec, but maybe you do
>> not want to run it on PC hardware?
>
> I do want to run it on PC hardware, so an FPGA, or other would be on
> PCIx card, or conceivably as a front-end to a PC, but the images are
> manipulated via a GUI on the PC.
Uh, then again you have the bottleneck of the PCI bus. So, how do the
images arrive at the PC? Are they on the hard disk, or do you digitize
them? In the first case, a software-only version might be faster than
just moving data around over the bus; in the second case, a chip on the
digitizer might be the better solution.
Furthermore, sorry if I'm asking this again, but how large are your
images on average? If we're talking about 4K or above, you need to
consider bandwidth issues, and it's best to compress images as early in
your toolchain as possible. Otherwise, I/O time might dominate
compression time, especially for JPEG-1.
> Let's assume a specialized codec is used, how much faster will it be
> than the stock Microsoft gdi+ classes? Yes, this will vary by task,
> but any data's better than speculation.
I'll try to deliver some figures.
> Would something like a cell processor or Tilera's new 64 core board
> win?
Depends on the code. With plain off-the-shelf code, no. But
if you can parallelize the code across the several cores of the Cell, it
might well be pretty fast. I would assume that JPEG-1 would profit
from this very much, as the blocks are more or less encoded individually.
But again, note that the data has to get into the CPU somehow, and this
might be more of a problem than actually processing it. I do not know
enough about the Cell's memory hub and cache architecture to judge.
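As a rough illustration of what "encoded individually" buys you, the sketch below farms per-block work out to a pool of workers. It assumes entropy decoding has already produced a list of 64-coefficient blocks; that stage is largely serial in baseline JPEG (DC values are coded as differences from the previous block unless restart markers are present), which is probably what "more or less" hints at. The function names are hypothetical, and threads are used only to show the structure: in CPython a pure-Python loop like this will not actually run faster on several cores, so a real implementation would use native code per worker.

```python
from concurrent.futures import ThreadPoolExecutor


def dequantize_block(args):
    """Scale one block's 64 quantized coefficients by the quantization table."""
    coeffs, qtable = args
    return [c * q for c, q in zip(coeffs, qtable)]


def dequantize_all(blocks, qtable, workers=4):
    """Process every block independently; map() keeps the original block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dequantize_block, ((b, qtable) for b in blocks)))


# Tiny made-up input: 6 blocks of 64 coefficients and a flat table of 16s.
blocks = [[(i + k) % 5 - 2 for k in range(64)] for i in range(6)]
qtable = [16] * 64

parallel = dequantize_all(blocks, qtable)
sequential = [dequantize_block((b, qtable)) for b in blocks]
print("same result as the sequential loop:", parallel == sequential)
```

Because no block reads another block's output at this stage, the same split would apply to the inverse DCT and colour conversion that follow.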
> To give a concrete example, say I had 1000 5 Mpix images, what would be
> the fastest way to thumbnail them all to 300x300 pixels? For a given
> approach, there are different constraints: For a GPU, there're copy
> and copy-back times, for an fpga on a PCI card, there're similar data
> communication costs.
Yes, exactly, and that's the major bottleneck. If the images are on a
hard disk, you also need to load them from there. If you want to
thumbnail them, AFAIK the fastest way is to decode them only partially:
extract only the DC coefficient from each DCT block and combine those
into a new image. This gives you a downscaling of 1:8, which might be
exactly what you need. If that isn't sufficient, you can still apply a
cubic spline filter or something similar to the output. Compressing the
thumbnails should be less of a problem, as they are small enough.
However, I would still expect I/O time to be the limiting factor here.
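The DC-only trick works because, with the JPEG normalization of the 8x8 DCT, the DC coefficient is 8 times the mean of the (level-shifted) block, so one output pixel per block is just DC/8 + 128. Below is a minimal sketch of that relationship; it assumes the DC values have already been entropy-decoded and dequantized, and the helper names are invented for illustration.

```python
import math


def dct_dc(block):
    """DC term of the JPEG 8x8 forward DCT on level-shifted samples.

    F(0,0) = (1/4) * C(0) * C(0) * sum(f), with C(0) = 1/sqrt(2),
    which works out to sum(f) / 8, i.e. 8 times the block mean.
    """
    c0 = 1 / math.sqrt(2)
    return 0.25 * c0 * c0 * sum(sum(row) for row in block)


def dc_thumbnail(dc_grid):
    """One 8-bit pixel per block: undo the factor of 8 and the 128 level shift."""
    return [[min(255, max(0, round(dc / 8 + 128))) for dc in row] for row in dc_grid]


# A flat block of pixel value 200 and a gradient block with mean 107.
flat = [[200 - 128] * 8 for _ in range(8)]
ramp = [[100 + x + y - 128 for x in range(8)] for y in range(8)]

thumb = dc_thumbnail([[dct_dc(flat), dct_dc(ramp)]])
print(thumb)
```

For a hypothetical 2592x1944 frame (one common 5 Mpix size, not stated in the thread) this would give 324x243 pixels directly, close to the 300x300 target. The IJG libjpeg decoder exposes a comparable reduced-size decode through its scale_num and scale_denom fields.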
So long,
Thomas
Thomas
8/27/2007 8:53:30 AM
<amygdala.jones@gmail.com> wrote in message
news:1187800282.365363.188700@z24g2000prh.googlegroups.com...
> We're doing some image processing on standard PCs (dual core, 3+ GHz
> x86, etc.), and we're looking for more performance. We're looking at
> various JPEG libraries and considering hotter boxes (quad core), but
> we also need to investigate more specialized hardware, such as a GPU
> or an FPGA. Anyone know of any benchmarks that compare the speed of
> JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
> GPU, etc.? I know each generation of each type of chip gets more
> powerful, and other system factors can make or break performance, but
> I'd like to get an idea of what's possible for each chip type and
> determine whether specialized hardware will be enough of a win to justify
> the investment.
>
simplest may be to get a fairly new processor, and then attempt to write a
specialized jpeg decoder using SSE (SSE3 or SSSE3, which may offer a few
instructions which could be helpful). it may be worthwhile, in particular,
to look into packed integers.
(or, limiting yourself to SSE2, you can have it run on most processors
released within the past few years).
never done this myself (I get plenty fast decode speeds using traditional
methods).
actually, for the typical task of decoding images and cramming them into
OpenGL, I suspect loading the images into GL is probably the main slow point
(I experimented with this kind of thing before, disabling everything but the
glTexImage call, which seemed to be where all the time was going).
(the experiment at the time was trying to decode video and cram it into a
texture map, but I found this hurt the framerate too badly to be of much
real use...).
so, little need to worry about it in my case...
but, what SSE allows is this:
you can do an IDCT as a number of vector operations, which may be faster
than more traditional methods.
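To show the shape of that idea, here is a sketch of the 8x8 IDCT written as two passes of 8-point dot products (rows, then columns). This is plain floating-point Python, not SSE: it only illustrates the data flow that a packed-integer SIMD version would map onto vector lanes, and all names are made up for the example.

```python
import math

N = 8


def dct_matrix():
    """Orthonormal 8-point DCT-II basis; row u holds the u-th cosine vector."""
    rows = []
    for u in range(N):
        scale = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
        rows.append([scale * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                     for x in range(N)])
    return rows


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(p, q):
    """8x8 product: every output element is one 8-wide dot product."""
    return [[sum(p[i][k] * q[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


A = dct_matrix()
AT = transpose(A)


def fdct2(block):
    """Forward 2-D DCT as two 1-D passes: A * block * A^T."""
    return matmul(matmul(A, block), AT)


def idct2(coeffs):
    """Inverse 2-D DCT as two 1-D passes: A^T * coeffs * A."""
    return matmul(matmul(AT, coeffs), A)


# Round trip on a made-up block of level-shifted samples.
block = [[(3 * x + 5 * y) % 17 - 8 for x in range(N)] for y in range(N)]
restored = idct2(fdct2(block))
max_err = max(abs(restored[y][x] - block[y][x]) for y in range(N) for x in range(N))
print("max round-trip error:", max_err)
```

Each pass applies the same eight multiply-accumulate steps to eight independent rows or columns, which is the regular pattern that vector instructions handle well; a production decoder would typically use fixed-point constants and a factored fast IDCT rather than full matrix products.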
just an idle thought really...
> Thanks,
>
> Aj
>
cr88192
8/28/2007 6:26:11 AM
On Aug 22, 12:31 pm, amygdala.jo...@gmail.com wrote:
> We're doing some image processing on standard PCs (dual core, 3+ GHz
> x86, etc.), and we're looking for more performance. We're looking at
> various JPEG libraries and considering hotter boxes (quad core), but
> we also need to investigate more specialized hardware, such as a GPU
> or an FPGA. Anyone know of any benchmarks that compare the speed of
> JPEG decompression on an x86 CPU, an ARM CPU, a DSP chip, an FPGA, a
> GPU, etc.? I know each generation of each type of chip gets more
> powerful, and other system factors can make or break performance, but
> I'd like to get an idea of what's possible for each chip type and
> determine whether specialized hardware will be enough of a win to justify
> the investment.
>
> Thanks,
>
> Aj
I've been studying software-based imaging performance on various CPUs
for the last few years and have recently rewritten my JPEG decoder in
C and ARM asm for more speed. Please have a look at my articles here:
http://bitbank.wordpress.com for some info about the bottlenecks and
optimizations needed for faster image handling.
Larry B.
bitbank
9/6/2007 2:57:23 PM