Monte Carlo speed - Compound Poisson

  • Follow


One of my colleagues claimed that "FORTRAN would be way faster than MATLAB for doing Monte Carlo simulation".

Here we're interested in generating Compound Poisson random variables - that is, random variables of the form X1+...+XN where N is random with the Poisson distribution.  I tested various codes for generating 1 million Poisson(10)-Lognormal(0,1) samples and with careful vectorization brought the MATLAB run time down from 200+ sec to about 1.2 sec, not far off the theoretical limit of about 0.8 sec imposed by randn (see FEX #26042 <http://www.mathworks.com/matlabcentral/fileexchange/26042> for details).

By comparison, SAS and Mathematica implementations took about 3 sec and 10 sec respectively, though I'm not enough of an expert to say if that is the fastest possible time in those languages.

The question is, how much further speed gain could be realized by going to the trouble of implementing an algorithm in a "faster" language such as C or FORTRAN?
0
Reply Ben 12/11/2009 3:42:20 AM

Ben Petschel <noreply@nospam.org> wrote:
> One of my colleagues claimed that "FORTRAN would be way faster than
MATLAB for doing Monte Carlo simulation".
> 
> Here we're interested in generating Compound Poisson random variables -
> that is, random variables of the form X1+...+XN where N is random with the
> Poisson distribution.  I tested various codes for generating 1 million
> Poisson(10)-Lognormal(0,1) samples and with careful vectorization brought
> the MATLAB run time down from 200+ sec to about 1.2 sec, not far off the
> theoretical limit of about 0.8 sec imposed by randn (see FEX #26042
> <http://www.mathworks.com/matlabcentral/fileexchange/26042> for details).
> 

So your process here was to generate 1 million Poisson samples, with mean
of 10, and then generate around 10 million Lognormal samples, and sum them?

What are you using to generate the Poisson samples?  How much of your cpu
time is spent doing Poisson, how much doing Lognormal, how much doing the
sums?

"Fast Generation of Discrete Random Variables", Marsaglia, Tsang and Wang,
Jornal of Statistical Software, vol 11, Issue 3, July 2004.

The above paper has an excellent method which is very fast.  Generation of
Poisson variates should approach the speed with which you can generate
uniform variates.


> By comparison, SAS and Mathematica implementations took about 3 sec and
> 10 sec respectively, though I'm not enough of an expert to say if that is
> the fastest possible time in those languages.
>  
> The question is, how much further speed gain could be realized by going
> to the trouble of implementing an algorithm in a "faster" language such as
> C or FORTRAN?

If you wnat to go multi-threaded, potentially quite a bit.  Even if not, if
the answer to my question above is that lots of time is required for
Poisson c.f. the genrating same number of Uniform variates, then also quite
a bit.

I have C implementations available, but don't do Fortran myself.  I doubt
that you ould see much difference at all between C and Fortran for this,
but there is certainly scope for improvement over what is available in
MATLAB, SAS or MAthematica.

-- 
Dr Tristram J. Scott               
Energy Consultant                  
0
Reply tristram 12/11/2009 10:34:43 AM


tristram.scott@ntlworld.com (Tristram Scott) wrote in message <7fpUm.6811$6O1.2081@newsfe23.ams2>...
> What are you using to generate the Poisson samples?  How much of your cpu
> time is spent doing Poisson, how much doing Lognormal, how much doing the
> sums?
> 
> "Fast Generation of Discrete Random Variables", Marsaglia, Tsang and Wang,
> Jornal of Statistical Software, vol 11, Issue 3, July 2004.
> 
> The above paper has an excellent method which is very fast.  Generation of
> Poisson variates should approach the speed with which you can generate
> uniform variates.

Tristram, thanks for the suggestions!

Marsaglia et al's compact table lookup algorithm is implemented in RANDRAW (FEX #7309) and using this instead of POISSRND reduced the total runtime from 3 sec to 1.2 sec.

The final split was about 0.2sec for Poisson, 0.8sec for exp(randn) and 0.2 sec for the aggregation.  Given that the built-ins randn/exp/plus are already highly optimized, could a C implementation be any faster than about 1 second without multithreading?

Sounds like multithreading is the way to go!

Regards,
Ben
0
Reply Ben 12/11/2009 2:24:20 PM

Ben Petschel <noreply@nospam.org> wrote:
> Marsaglia et al's compact table lookup algorithm is implemented in
> RANDRAW (FEX #7309) and using this instead of POISSRND reduced the total
> runtime from 3 sec to 1.2 sec.
> 

I haven't looked at the implementation within RANDRAW.  Is it using the
same algorithm for the underlying uniform generation as you are comparing
it with in POISSRND?  I use the Mersenne Twister throughout.



> The final split was about 0.2sec for Poisson, 0.8sec for exp(randn) and
> 0.2 sec for the aggregation.  Given that the built-ins randn/exp/plus are
> already highly optimized, could a C implementation be any faster than about
> 1 second without multithreading?

I'm not sure without actually trying this out, but potentially there would
be room for improvement.  How important is this to you?

The exp() is likely taking as much cpu time as the randn:
>> n = 1e6;
>> t = cputime;x = rand(10,n);t(end+1) = cputime;
>> y = exp(x);t(end+1) = cputime;z = sum(y);t(end+1) = cputime;
>> diff(t)

ans =

    1.1400    1.6600    0.2500

So, 1.14 seconds for generating 10 million randn, 1.66 to exp them, and
0.25 to sum them.

I guess it might be possible to adapt Marsaglia's Ziggurat method to cope
with non-decreasing PDFs, such as the Lognormal has, but I am not sure it
would be a cheap method in the end.   This would potentially allow you to
avoid calling exp() 10 million times.

Do you ask for exp(randn()), or do you ask for randn() and then take the
exp()?  This is quite a large array of data, so there might be efficiencies
in avoiding allocating intermediate storage. 

When you take the sums, are you doing this down columns, rather than across
rows?  

>> t = cputime;x = rand(n,10);t(end+1) = cputime;
>> y = exp(x);t(end+1) = cputime;z = sum(y,2);t(end+1) = cputime;
>> diff(t)

ans =

    1.1400    1.6200    0.4600

Across rows is taking 0.46 seconds instead of 0.25.

> 
> Sounds like multithreading is the way to go!
> 

It certainly can be, but there is always overhead in handling threads and
waiting for them all to finish etc.  If you are looking at 1 sec for this
part of the code, I would guess that probably the thread overhead is going
to take a big chunk out of your potential gains.  But, if you are doing
this chunk of code many times over within some other loops, all Monte Carlo
fashion, then multi-threading should be able to make a big difference.


-- 
Dr Tristram J. Scott               
Energy Consultant                  
0
Reply tristram 12/11/2009 5:28:44 PM

Not much to add here. But one must be careful when using cputime on a 
multicore processor.
---Bob.

"Tristram Scott" <tristram.scott@ntlworld.com> wrote in message 
news:gjvUm.6243$Ub.459@newsfe17.ams2...
> Ben Petschel <noreply@nospam.org> wrote:
>> Marsaglia et al's compact table lookup algorithm is implemented in
>> RANDRAW (FEX #7309) and using this instead of POISSRND reduced the total
>> runtime from 3 sec to 1.2 sec.
>>
>
> I haven't looked at the implementation within RANDRAW.  Is it using the
> same algorithm for the underlying uniform generation as you are comparing
> it with in POISSRND?  I use the Mersenne Twister throughout.
>
>
>
>> The final split was about 0.2sec for Poisson, 0.8sec for exp(randn) and
>> 0.2 sec for the aggregation.  Given that the built-ins randn/exp/plus are
>> already highly optimized, could a C implementation be any faster than 
>> about
>> 1 second without multithreading?
>
> I'm not sure without actually trying this out, but potentially there would
> be room for improvement.  How important is this to you?
>
> The exp() is likely taking as much cpu time as the randn:
>>> n = 1e6;
>>> t = cputime;x = rand(10,n);t(end+1) = cputime;
>>> y = exp(x);t(end+1) = cputime;z = sum(y);t(end+1) = cputime;
>>> diff(t)
>
> ans =
>
>    1.1400    1.6600    0.2500
>
> So, 1.14 seconds for generating 10 million randn, 1.66 to exp them, and
> 0.25 to sum them.
>
> I guess it might be possible to adapt Marsaglia's Ziggurat method to cope
> with non-decreasing PDFs, such as the Lognormal has, but I am not sure it
> would be a cheap method in the end.   This would potentially allow you to
> avoid calling exp() 10 million times.
>
> Do you ask for exp(randn()), or do you ask for randn() and then take the
> exp()?  This is quite a large array of data, so there might be 
> efficiencies
> in avoiding allocating intermediate storage.
>
> When you take the sums, are you doing this down columns, rather than 
> across
> rows?
>
>>> t = cputime;x = rand(n,10);t(end+1) = cputime;
>>> y = exp(x);t(end+1) = cputime;z = sum(y,2);t(end+1) = cputime;
>>> diff(t)
>
> ans =
>
>    1.1400    1.6200    0.4600
>
> Across rows is taking 0.46 seconds instead of 0.25.
>
>>
>> Sounds like multithreading is the way to go!
>>
>
> It certainly can be, but there is always overhead in handling threads and
> waiting for them all to finish etc.  If you are looking at 1 sec for this
> part of the code, I would guess that probably the thread overhead is going
> to take a big chunk out of your potential gains.  But, if you are doing
> this chunk of code many times over within some other loops, all Monte 
> Carlo
> fashion, then multi-threading should be able to make a big difference.
>
>
> -- 
> Dr Tristram J. Scott
> Energy Consultant
> 


0
Reply Bobby 12/15/2009 12:13:40 PM

tristram.scott@ntlworld.com wrote: 
> Ben Petschel wrote:
> > The final split was about 0.2sec for Poisson, 0.8sec for exp(randn) and 
> > 0.2 sec for the aggregation.  Given that the built-ins randn/exp/plus are 
> > already highly optimized, could a C implementation be any faster than about 
> > 1 second without multithreading? 
> 
> I'm not sure without actually trying this out, but potentially there would 
> be room for improvement.  How important is this to you? 

I guess if there was only 20% speedup or even 50% it wouldn't be worth the hassle of implementing in another language.
0
Reply Ben 12/20/2009 2:46:03 PM

5 Replies
341 Views

(page loaded in 1.743 seconds)

Similiar Articles:











7/22/2012 4:54:22 PM


Reply: