poor SPSS v.12 performance -- workspace, cache, execute

Listmembers --

I have encountered slow (hours of CPU time) performance trying to run a
long series of simple log10 transform COMPUTE statements (see below) on
a moderate-sized file (4788 numeric vars X 40 cases; <5Mb; stored on the
local hard disk) on my desktop and laptop PCs (>1.75GHz; >425Mb RAM; Win
XP Home SP1) using SPSS v.12.0.2. By contrast, I observe quite reasonable
performance when I run a Student's t test (2394 condition A's vs. 2394
condition B's) on the same data.

In attempting to diagnose the problem, I've scoured Raynald Levesque's
site and book, as well as SPSS's corporate site. From these sources,
I've experimented with modifying (1) the workspace size -- increasing
it to 400Mb, (2) the cache frequency -- increasing it from the default
n=20 to n=5000, and (3) the number of EXECUTE statements -- from 1 per
4788 COMPUTE statements to 1 per 100 COMPUTEs (see below). I have also
confirmed that the syntax runs on a small subset of variables (the
first 100), to rule out a simple syntax error.

The syntax for modifying the SPSS settings looks like:

CACHE.
SET WORKSPACE=399000.
SET CACHE 4788.
show all.

The syntax for the transformation looks like:

COMPUTE lgsf1=LG10(safe1) .
COMPUTE lgsf2=LG10(safe2) .
COMPUTE lgsf3=LG10(safe3) .
COMPUTE lgsf4=LG10(safe4) .
COMPUTE lgsf5=LG10(safe5) .
COMPUTE lgsf6=LG10(safe6) .
COMPUTE lgsf7=LG10(safe7) .
COMPUTE lgsf8=LG10(safe8) .
COMPUTE lgsf9=LG10(safe9) .
..
..
..
COMPUTE lgth2391=LG10(thrt2391) .
COMPUTE lgth2392=LG10(thrt2392) .
COMPUTE lgth2393=LG10(thrt2393) .
COMPUTE lgth2394=LG10(thrt2394) .
EXECUTE .

As I said, I've experimented with the frequency of interspersing
EXECUTE statements.

If anyone has any suggestions for improving performance or diagnosing
the problem, I would much appreciate it. Perhaps performance would be
improved by condensing the code into a more elegant form?

Thanks,
Alex Shackman
------------------------------------------------------------------
Alexander J. Shackman
Laboratory for Affective Neuroscience | W.M. Keck Laboratory for
Functional Brain Imaging & Behavior
University of Wisconsin-Madison
1202 West Johnson Street
Madison, Wisconsin 53706

PH: +1 (608) 358-5025 (cell)
FAX: +1 (608) 265-2875
EMAIL: ajshackman@gmail.com
WWW: http://psyphz.psych.wisc.edu/~shackman |
http://brainimaging.waisman.wisc.edu/~shackman/

shackman -- 1/16/2005 12:08:18 AM (comp.soft-sys.stat.spss)

"shackman@wisc.edu" <ajshackman@gmail.com> wrote in message
news:1105834098.917026.66130@f14g2000cwb.googlegroups.com...
> Listmembers --
>
> I have encountered slow (hours of CPU time) performance trying to run a
> long series of simple log10 transform COMPUTE statements (see below) on
> a moderate-sized file (4788 numeric vars X 40 cases; <5Mb; stored on the
> local hard disk) on my desktop and laptop PCs (>1.75GHz; >425Mb RAM; Win
> XP Home SP1) using SPSS v.12.0.2. By contrast, I observe quite reasonable
> performance when I run a Student's t test (2394 condition A's vs. 2394
> condition B's) on the same data.
>
> [snip: diagnostic attempts, SET syntax, and the COMPUTE statements]
>

WORKSPACE and CACHE have nothing to do with it.  I'm virtually certain the
problem  is simply the number of command lines being parsed and (more
importantly) echoed in the log.  If I'm right, you can make it much faster
just by not echoing the commands (SET PRINTBACK OFF.)
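
For example, something along these lines around the existing
transformation block should do it (and you can restore the echo
afterwards):

SET PRINTBACK OFF.
COMPUTE lgsf1=LG10(safe1) .
COMPUTE lgsf2=LG10(safe2) .
... (the remaining COMPUTE statements) ...
EXECUTE .
SET PRINTBACK ON.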

You should also be able to make it fast by replacing the long string of
computes with a much shorter sequence like

vector lgsf(2394).                 /* a vector of new variables
format lgsf1 to lgsf2394 (f8.5).
vector safe = safe1 to safe2394.   /* vector of existing variables
loop #i = 1 to 2394.
      compute lgsf(#i) = lg10(safe(#i)).
end loop.
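
The thrt series from the original syntax could be handled the same way.
Assuming those variables run thrt1 to thrt2394 (the post shows thrt2391
to thrt2394) and the targets are lgth1 to lgth2394, a parallel block
would be:

vector lgth(2394).                 /* a vector of new variables
format lgth1 to lgth2394 (f8.5).
vector thrt = thrt1 to thrt2394.   /* vector of existing variables
loop #i = 1 to 2394.
      compute lgth(#i) = lg10(thrt(#i)).
end loop.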

Jonathan Fry
SPSS Inc.


Jonathan Fry -- 1/17/2005 10:51:02 PM

On Mon, 17 Jan 2005 16:51:02 -0600, "Jonathan Fry"
<jon@NOSPAM.spss.com> wrote:

> 
> "shackman@wisc.edu" <ajshackman@gmail.com> wrote in message
> news:1105834098.917026.66130@f14g2000cwb.googlegroups.com...
> > Listmembers --
> >
> > I have encountered slow (hours of CPU time) performance trying to run a
> > long series of simple log10 transform COMPUTE statements (see below) on
> > a moderate-sized file (4788 numeric vars X 40 cases; <5Mb; stored on the
> > local hard disk) on my desktop and laptop PCs (>1.75GHz; >425Mb RAM; Win
> > XP Home SP1) using SPSS v.12.0.2. By contrast, I observe quite reasonable
> > performance when I run a Student's t test (2394 condition A's vs. 2394
> > condition B's) on the same data.

[snip, good documentation concerning attempts
to fix; example of code, etc.]

> 
> WORKSPACE and CACHE have nothing to do with it.  I'm virtually certain the
> problem  is simply the number of command lines being parsed and (more
> importantly) echoed in the log.  If I'm right, you can make it much faster
> just by not echoing the commands (SET PRINTBACK OFF.)

A few thousand lines need to be echoed, once.
This is trivial, isn't it?


> 
> You should also be able to make it fast by replacing the long string of
> computes with a much shorter sequence like
> 
> vector lgsf(2394).                 /* a vector of new variables
> format lgsf1 to lgsf2394 (f8.5).
> vector safe = safe1 to safe2394.   /* vector of existing variables
> loop #i = 1 to 2394.
>       compute lgsf(#i) = lg10(safe(#i)).
> end loop.
> 

I'm always interested in benchmarking and in what
affects performance.  The above looks like it could help.

But, Jon, and everyone, the original post says that
the task did not finish in  *hours*!

I think that I can understand how this *might* arise
under the control of an interpretive parser -- if this is,
indeed, the cause, I think that SPSS may want to put
serious effort into improving the handling of long
syntax files, for instance by some partial compilation
of commands:

1) When SPSS reads 2394 commands with unique variable
names, does it have to search the original variable list,
in order, to find the variable to operate on?  I remember
horrible performance from BMDP owing to this cause,
maybe 25 years ago, with a list of 200 variables.  I cured
it by using the variable 'number', which BMDP allowed --
rather similar in essence to the cure described above.
However, BMDP fixed that problem long before SPSS
bought them out in 1990 or so.

2) The early version of Paradox's programmable interface
had a problem with any long syntax, since it (seemingly)
saved all commands as a block of text and needed to
rescan all previous commands, in order to count lines,
to find where each next command was.  That seemed slow
for hundreds of lines on a 35 MHz computer.


While I've never had a few thousand commands for 40 cases,
I have run a few hundred commands on hundreds of cases,
with slower computers, and I would expect that to take
less than a minute.  I know that taking a log is slow,
compared to some operations, but it should not be that slow.

Another possibility?
The present problem has two 'WorkSpaces', of a sort, that
are unusually long -- over 32 thousand bytes for each
record, and perhaps twice that for the text of the COMPUTE statements.

I guess I can imagine a subtle problem of 'thrashing' if
those two spaces are not both completely in memory.
It does seem to me that SPSS should reserve those spaces
without difficulty, as Jon says.  However, if Windows is
finding some reason to mis-handle the allocations, could
Windows invoke ridiculous amounts of paging? -- This
should show up to the user as a disk-read light that stays
on during that long execution.

Still curious.

-- 
Rich Ulrich, wpilib@pitt.edu
http://www.pitt.edu/~wpilib/index.html
Rich Ulrich -- 1/18/2005 4:01:51 AM

shackman@wisc.edu <ajshackman@gmail.com> wrote:
> Listmembers --

> I have encountered slow (hours of CPU time) performance trying to run a
> long series of simple log10 transform COMPUTE statements (see below) on
> a moderate-sized file (4788 numeric vars X 40 cases; <5Mb; stored on the
> local hard disk) on my desktop and laptop PCs (>1.75GHz; >425Mb RAM; Win
> XP Home SP1) using SPSS v.12.0.2. By contrast, I observe quite reasonable
> performance when I run a Student's t test (2394 condition A's vs. 2394
> condition B's) on the same data.

..............snip, snip

Jonathan and Rich's comments made me curious.  It clearly
does seem to be a problem of SPSS's interpreter. I duplicated
the OP's problem for 1000 and 5000 variables and found:

1) Running per the OP's approach, with 1000s of syntax statements,
took about 10 sec. for 1000 variables, and about 270 sec for 5000 .
2) Using DO REPEAT
   do repeat x = x1 to x5000 / y = y1 to y5000 .
       compute y = lg10(x) .
   end repeat .
   execute .

took about 2.5 sec for 1000 variables, and about 100 sec for 5000
variables.
3) Loop, per Jonathan's suggestion, was essentially instantaneous
for 1000 variables, and took about 1.5 sec. for 5000 variables.
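
For anyone who wants to try this, a comparable test file -- 40 cases by
5000 positive numeric variables, assumed here to be named x1 to x5000 --
can be generated with an input program along these lines:

input program.
vector x(5000).                        /* x1 to x5000, new numeric variables
loop #c = 1 to 40.
   loop #v = 1 to 5000.
      compute x(#v) = rv.uniform(1,100).   /* positive, so lg10 is defined
   end loop.
   end case.
end loop.
end file.
end input program.
execute.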

Some comments:

SPSS *does* appear to be seriously interpretation bound.
I would never have guessed how much this was true.  What
is more surprising to me is the nonlinearity in degradation
of performance.  Why should it take 27 times as long
to run 5 times as many syntax statements, per 1), or  40
times as long to run a do repeat that is 5X as long?
I'm surprised, in fact, that the do repeat is any faster
than just straight syntax because I had generally presumed
that do repeat simply amplified the flow of syntax fed
to the SPSS "engine."

-- 
=-=-=-=-=-=-=-=-=-==-=-=-=
Mike Lacy,  Ft Collins CO 80523
Clean out the 'junk' to email me.
Mike Lacy -- 1/18/2005 7:06:04 PM

Michael.Lacy.junk@colostate.edu wrote:
> Jonathan and Rich's comments made me curious.  It clearly
> does seem to be a problem of SPSS's interpreter. I duplicated
> the OP's problem for 1000 and 5000 variables and found:

> 1) Running per the OP's approach, with 1000s of syntax statements,
> took about 10 sec. for 1000 variables, and about 270 sec for 5000 .


..... snip, snip of my own material

I did the preceding with PRINTBACK ON. (I had neglected Jonathan's
comment here about that, and Rich pointed this out offline.)

So, I reran the preceding with SET PRINTBACK OFF, and got
a time of 6.5 sec for 1000 variables, and 160 sec with 5000 variables.
(I'm running under Windows with Viewer output, which might matter
here.) Note that the scaling of the problem is still nonlinear.

I can't say as I can understand why this might be.  Perhaps
Jonathan can enlighten us here.
-- 
=-=-=-=-=-=-=-=-=-==-=-=-=
Mike Lacy,  Ft Collins CO 80523
Clean out the 'junk' to email me.
Mike Lacy -- 1/18/2005 9:16:31 PM

On 18 Jan 2005 14:16:31 -0700, Michael.Lacy.junk@colostate.edu wrote:

> Michael.Lacy.junk@colostate.edu wrote:
> > Jonathan and Rich's comments made me curious.  It clearly
> > does seem to be a problem of SPSS's interpreter. I duplicated
> > the OP's problem for 1000 and 5000 variables and found:
> 
> > 1) Running per the OP's approach, with 1000s of syntax statements,
> > took about 10 sec. for 1000 variables, and about 270 sec for 5000 .
> 
> 
> .... snip, snip of my own material
> 
> I did the preceding with PRINTBACK ON. (I had neglected Jonathan's
> comment here about that, and Rich pointed this out offline.)
> 
> So, I reran the preceding with SET PRINTBACK OFF, and got
> a time of 6.5 sec for 1000 variables, and 160 sec with 5000 variables.
> (I'm running under Windows with Viewer output, which might matter
> here.) Note that the scaling of the problem is still nonlinear.
> 
> I can't say as I can understand why this might be.  Perhaps
> Jonathan can enlighten us here.

This is a little slow, sure, but it is not the grave
PROBLEM originally cited, which was far slower, on a fast PC.

Just to be sure, I looked back at what I quoted
from the Original Post --

Original > >
> > I have encountered slow (hours of CPU time) performance trying to run a
> > long series of simple log10 transform COMPUTE statements (see below) on
> > a moderate-sized file (4788 numeric vars X 40 cases; <5Mb; stored on the
> > local hard disk) on my desktop and laptop PCs (>1.75GHz; >425Mb RAM; Win
> > XP Home SP1) using SPSS v.12.0.2.

It seems to me that Mike has confirmed that a total
performance time of *hours* of CPU time is outside
the scope of Jon's explanation.  And, if I read it
right, SPSS performs it that slowly on two computers, so
the explanation is not just "one screwed-up installation"
of SPSS and/or Windows.

Unless the OP was combining the times of a hundred runs?
However, he was generally quite specific in
his documentation.

-- 
Rich Ulrich, wpilib@pitt.edu
http://www.pitt.edu/~wpilib/index.html
Rich Ulrich -- 1/18/2005 10:06:14 PM