Fastest Implementation of Integer Wavelets in VC++

  • Follow


Though Integer Wavelet Transform using Lifting Schemes offer less
computation hurdle by use of dyadic co-efficients(normally Bit shifts),
what other consideration may be take in account in fastest
implementation  in a software module in VC++.

George Wilson

0
Reply george_wilson_mit (24) 3/18/2005 5:40:06 AM

Hi George,

> Though Integer Wavelet Transform using Lifting Schemes offer less
> computation hurdle by use of dyadic co-efficients(normally Bit shifts),
> what other consideration may be take in account in fastest
> implementation  in a software module in VC++.

If your target platform is x86, then this is my consideration: 
"Don't do it". The x86 FPU is as fast as the CPU, and can compute the
wavelet lifting faster than a fixpoint implementation could. For
integer, you need a multiply, add, shift, add instruction flow, for
float it is just multiply,add. Besides that, on the P4 the integer
multiplication is carried out in the FPU anyhow, so even from that
point of view it doesn't make sense.

Other considerations: Carefully compute the output range of
the wavelet so you know how many fractional bits your fixpoint
representation can have to allow the full range. Carefully bias
the number of fixpoint bits of the lifting factors and the
wavelet coefficients.

So long,
	Thomas
0
Reply Thomas 3/21/2005 9:00:59 AM


Thomas Richter wrote:
> If your target platform is x86, then this is my consideration: 
> "Don't do it". The x86 FPU is as fast as the CPU, and can compute the
> wavelet lifting faster than a fixpoint implementation could. 

If computing the wavelet transform of an integer signal (eg. an image), 
then using a floating point transform is unnecessary.

The biggest factor in speed of the 2D wavelet transform is memory 
access. In a naieve implementation the image is processed several times 
(horizontal, vertical scale etc.). This causes cache issues with medium 
to large images, which significantly slows down the code.

The fastest way to implement the lifted wavelet transform is to use a 
ring buffer form. This form uses less memory and requires only a single 
pass over an image, which improves speed significantly (especially when 
the image is too large to fit in the cache).
There are several papers available on restricted memory implementations 
of the wavelet transform...

> For
> integer, you need a multiply, add, shift, add instruction flow, for
> float it is just multiply,add. Besides that, on the P4 the integer
> multiplication is carried out in the FPU anyhow, so even from that
> point of view it doesn't make sense.

It may seem that a FP transform is faster than it's integer equivalent, 
however in practice (at least for some integer transforms such as the 
D(2,2) and D(4,4) ) this is not so. One potential deciding factor is 
that integer transforms can be done on 16bit data (for 8bit images at 
least), thereby halving the memory requirements.

> Other considerations: Carefully compute the output range of
> the wavelet so you know how many fractional bits your fixpoint
> representation can have to allow the full range. Carefully bias
> the number of fixpoint bits of the lifting factors and the
> wavelet coefficients.

I have found no need for fractional bits or fixed point representation. 
To achieve the best results, 2-3 extra bits can be used all the way 
through the transform (ie. the 8bit values are promoted to 10-11bit 
values before the transform), however the loss by rounding after each 
level is not terribly big.

Malcolm

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
0
Reply Malcolm 3/22/2005 7:47:36 AM

Hi,

> If computing the wavelet transform of an integer signal (eg. an image), 
> then using a floating point transform is unnecessary.

Depends on the wavelet. If the wavelet allows integer lifting, then
yes. Otherwise no.

> The biggest factor in speed of the 2D wavelet transform is memory 
> access. In a naieve implementation the image is processed several times 
> (horizontal, vertical scale etc.). This causes cache issues with medium 
> to large images, which significantly slows down the code.

Absolutely correct.

> The fastest way to implement the lifted wavelet transform is to use a 
> ring buffer form. This form uses less memory and requires only a single 
> pass over an image, which improves speed significantly (especially when 
> the image is too large to fit in the cache).

A line based approach, to be precise.

> There are several papers available on restricted memory implementations 
> of the wavelet transform...

Sure.

> > For
> > integer, you need a multiply, add, shift, add instruction flow, for
> > float it is just multiply,add. Besides that, on the P4 the integer
> > multiplication is carried out in the FPU anyhow, so even from that
> > point of view it doesn't make sense.

> It may seem that a FP transform is faster than it's integer equivalent, 
> however in practice (at least for some integer transforms such as the 
> D(2,2) and D(4,4) ) this is not so. One potential deciding factor is 
> that integer transforms can be done on 16bit data (for 8bit images at 
> least), thereby halving the memory requirements.

Sorry, I don't see where you draw this conclusion from - it disagrees
with mine. This is not a matter "what it seems like", but rather a
matter of measurements. I've here a JPEG2000 encoder/decoder that
provides both a 16bit fixpoint and a single-precision floating point
version. Of the two, the FPU version is faster on x86, but slower on
other architectures (e.g. on PPC). Not significantly, but it is. The
data is of course organized in a way such that as much of it remains
in cache. According to my cachegrind measurements, the cache-miss
ratio is below 1% for the 1st level cache, so the width of the access
doesn't enter as an important factor at all. If the transform is
implemented in a stupid way (e.g. the "jasper" way), then this does
matter a lot, of course.

> I have found no need for fractional bits or fixed point representation. 

For the 9-7 wavelet, you need to assign fractional bits because the
wavelet lifting coefficients are not rationals with denominators
which are powers of two. Without fractional bits, you'd still have
a reversible wavelet decomposition simply because the lifting scheme
is always reversible, but it won't approximate the original wavelet
very well. In other words, if you need to implement the 9/7 wavelet
since the standard dictates this, you need to add fractional bits.

> To achieve the best results, 2-3 extra bits can be used all the way 
> through the transform (ie. the 8bit values are promoted to 10-11bit 
> values before the transform), however the loss by rounding after each 
> level is not terribly big.

Big enough in the 9/7 case so PSNR differences between float and
fixpoint rule this alternative out. Been there, done that.

So long,
	Thomas
0
Reply Thomas 3/22/2005 9:16:13 AM

Thomas Richter wrote:
>>It may seem that a FP transform is faster than it's integer equivalent, 
>>however in practice (at least for some integer transforms such as the 
>>D(2,2) and D(4,4) ) this is not so. One potential deciding factor is 
>>that integer transforms can be done on 16bit data (for 8bit images at 
>>least), thereby halving the memory requirements.
> 
> Sorry, I don't see where you draw this conclusion from - it disagrees
> with mine. This is not a matter "what it seems like", but rather a
> matter of measurements. 

I too have done those measurements, and found the opposite conclusion. 
The difference is that the basic wavelet I use is the D(4,4) wavelet 
which is much easier to implement than the 9/7 and has exact rational 
coefficients (it can be seen to be based on interpolation if analysed in 
one manner). The D(4,4) is not that different from the 9/7 in PSNR 
results and visual quality (9/7 has a slight edge in psnr results, 
however the D(4,4) has less ringing on edges).
I also make use of MMX for the vertical filtering.

> in cache. According to my cachegrind measurements, the cache-miss
> ratio is below 1% for the 1st level cache, so the width of the access
> doesn't enter as an important factor at all. If the transform is
> implemented in a stupid way (e.g. the "jasper" way), then this does
> matter a lot, of course.

I'd be interested to know exactly how you have organised the data 
(unless it is a trade secret :). What speed can you run the reverse 
transform including the coefficient decoding? Megapixels a second for a 
given processor and colour depth would be an interesting measurement. My 
own decodes a color image at around 5.5MPs on a 2GHz Athlon XP.

> 
>>I have found no need for fractional bits or fixed point representation. 
> 
> In other words, if you need to implement the 9/7 wavelet
> since the standard dictates this, you need to add fractional bits.

I suppose this is where the difference lies. I am not implementing 
JPEG2000, and so can have numerical innaccuracies that are compensated 
on decode (reversible lifting). However if you are trying to decode 
someone elses 9/7 wavelet then you would need the accuracy.

Hmm... I might just try a floating point 9/7 myself and see what speed I 
can achieve.

Malcolm

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
0
Reply Malcolm 3/22/2005 10:10:40 PM

Hi Malcolm,

> > Sorry, I don't see where you draw this conclusion from - it disagrees
> > with mine. This is not a matter "what it seems like", but rather a
> > matter of measurements. 

> I too have done those measurements, and found the opposite conclusion. 
> The difference is that the basic wavelet I use is the D(4,4) wavelet 
> which is much easier to implement than the 9/7 and has exact rational 
> coefficients (it can be seen to be based on interpolation if analysed in 
> one manner). The D(4,4) is not that different from the 9/7 in PSNR 
> results and visual quality (9/7 has a slight edge in psnr results, 
> however the D(4,4) has less ringing on edges).
> I also make use of MMX for the vertical filtering.

What I'm currently comparing is the C++ version of the integer and
the fixpoint lifting. There is an assembly (SSE2) version of the float
lifting, but that doesn't count here because it is naturally faster
than the C based solution, thus no fair comparison here. The compilers
I tried were VS 6.0 (where fixpoint is slightly faster) VS 7.1
(where float is slightly faster) and GCC 3.3 (at that time) with
float being faster, and Intel icc 7.1, also having a faster float.

VS6.0 uses the "classical" stack based FPU of the i386, so no miracle
its float version leaves something to desire. VS7.1 doesn't use
vector operations, but the "register based" FPU extensions of the
x86 and is able to outperform the 6.0 release. The assembly of GNU
looks pretty nice as well, depending on processor switches the
fixpoint multiplication ends up either as a series of additions
(slower for the more advanced processors) or as an integer multiply
(faster for my processors).

Concerning the 9-7 vs. the 4-4 wavelet is that I haven't tried to make
comparisons PSNR wise; I'm mainly interested in the standard; if you
could sent me the lifting steps for it I would be curious to try it. I
would believe that the 4-4 requires the more complicated HSS boundary
extension (even-sized support region) which might be one of the
reasons why the simpler WSS extension of the 9-7 filter was prefered.
I'd personally prefered the 13-7 since it is applicable for integer
lifting, early predicessors (the Ricoh Crew coder) used even
something wierder like the 2-10 wavelet.

> > in cache. According to my cachegrind measurements, the cache-miss
> > ratio is below 1% for the 1st level cache, so the width of the access
> > doesn't enter as an important factor at all. If the transform is
> > implemented in a stupid way (e.g. the "jasper" way), then this does
> > matter a lot, of course.

> I'd be interested to know exactly how you have organised the data 
> (unless it is a trade secret :). What speed can you run the reverse 
> transform including the coefficient decoding? Megapixels a second for a 
> given processor and colour depth would be an interesting measurement. My 
> own decodes a color image at around 5.5MPs on a 2GHz Athlon XP.

Concering the implementation I afraid I won't be allowed to say much
about it. It is a line based approach that tries to keep data as local
as possible by various tricks. Interestingly, the main problem in
JPEG2000 is not even the wavelet filter (unless it is really a very
simple-minded implementation), the major problem is the entropy coding
model (aka "EBCOT"). Concering the processor, I seem to remember I had
different results for AMD vs. Intel, and IIRC, the Pentium 4 float
version was faster since the AMD FPU is not quite as advanced, but
that's now a while ago so I might be wrong. 

> > In other words, if you need to implement the 9/7 wavelet
> > since the standard dictates this, you need to add fractional bits.

> I suppose this is where the difference lies. I am not implementing 
> JPEG2000, and so can have numerical innaccuracies that are compensated 
> on decode (reversible lifting). However if you are trying to decode 
> someone elses 9/7 wavelet then you would need the accuracy.

Yes, that's also where I agree to agree. (-; As long as you're
dealing with a proprietry implementation, you're free to pick the
lifting as you like, so there are less problems keeping the overhead
of the fixpoint version low. It's different for the standard since
you *need* to be compatible; that is, PSRN comparisons with
mixed float/fix encode/decode are imperative.

> Hmm... I might just try a floating point 9/7 myself and see what speed I 
> can achieve.

I'd personally like to try the 4-4 wavelet, just see how it performs. (-;
I could easely stick this into the JPEG internal vm software which allows
arbitrary wavelets. I'm mainly interested in the magnitude of the Gibbs
phenomena (aka "ringing") here.

So long,
	Thomas
0
Reply Thomas 3/23/2005 9:10:29 AM

Hi again,

Thomas Richter wrote:
> Hi Malcolm,
> Concerning the 9-7 vs. the 4-4 wavelet is that I haven't tried to make
> comparisons PSNR wise; I'm mainly interested in the standard; if you
> could sent me the lifting steps for it I would be curious to try it. I
> would believe that the 4-4 requires the more complicated HSS boundary
> extension (even-sized support region) which might be one of the
> reasons why the simpler WSS extension of the 9-7 filter was prefered.
> I'd personally prefered the 13-7 since it is applicable for integer
> lifting, early predicessors (the Ricoh Crew coder) used even
> something wierder like the 2-10 wavelet.

I derive the boundary conditions from the interpolation view of the 
D(4,4) which tends to be a mirrored extension if I remember correctly.

The D(2,2) is -(1,1)/2 for predict and +(1,1)/4 for update.
The D(4,4) is -(-1,9,9,-1)/16 for predict, and +(-1,9,9,-1)/32 for update.

The family extends to higher numbers of moments, but I have never 
bothered with that since the boundary conditions get really annoying :).

What are the lifting steps you are using for the 9-7 (and maybe the 13-7 
that you mentioned). I have seen different lifting factorings for the 9-7.

> Concering the implementation I afraid I won't be allowed to say much
> about it. It is a line based approach that tries to keep data as local
> as possible by various tricks. 

Fair enough, I thought this might be the case, since good design here is 
what separates slow from fast :).

> Interestingly, the main problem in
> JPEG2000 is not even the wavelet filter (unless it is really a very
> simple-minded implementation), the major problem is the entropy coding
> model (aka "EBCOT"). 

For my own work I have found the biggest speed bottleneck is the wavelet 
transform (especially on larger images). The entropy coding in my coder 
is designed to be as fast as possible to decode, and so doesn't have the 
fine rate scaleable feature that JP2K has.

> I'd personally like to try the 4-4 wavelet, just see how it performs. (-;
> I could easely stick this into the JPEG internal vm software which allows
> arbitrary wavelets. I'm mainly interested in the magnitude of the Gibbs
> phenomena (aka "ringing") here.

It has been several years since I compared them, however I do remember 
the D(4,4) also looking more 'square' than the 9-7 (if you know what I 
mean). The D(4,4) is also used in the LuraWave LWF and LuraDocument 
formats. It is mathematically similar to the 9-7 since it has the same 
number of vanishing moments on both the synthesis and analysis wavelets 
IIRC.

Malcolm

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
0
Reply Malcolm 3/23/2005 8:30:28 PM

6 Replies
251 Views

(page loaded in 0.205 seconds)

Similiar Articles:












7/22/2012 10:19:58 PM


Reply: