not that random

Hi,

Quick googling didn't turn anything up, so I thought I'd share my experience with the "uniform" function. This is the only randomization function called "pseudorandom" in the IBM Knowledge Center. We just discovered this has some real implications.

We need to make a weekly random sample of 500 from about 300.000. In a first generation of the script, I used uniform(1) to first randomize the order of the file, then SAMPLE 500 from 250.000.

We removed the second part from the script as it seemed redundant (and also for the practical problem of finding out what went wrong over 1000 lines of code when the N for some reason drops below 250.000). Uniform is executed, then the file is sorted by the uniform number and the first 500 cases of the file are selected.
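
The single-step approach described above (random key, sort, keep the first 500) can be sketched outside SPSS as follows. This is only a minimal Python illustration, not the original script: the helper name `sample_by_random_key`, the seed value, and the use of plain case numbers as records are assumptions made for the example.

```python
import random


def sample_by_random_key(cases, k, rng):
    """Return k cases chosen by sorting on a uniform random key.

    Hypothetical helper mirroring "compute uniform, sort, keep first k".
    """
    # Attach one uniform(0, 1) key to every case, in file order.
    keyed = [(rng.random(), case) for case in cases]
    # Sort on the random key only, then keep the first k cases.
    keyed.sort(key=lambda pair: pair[0])
    return [case for _, case in keyed[:k]]


# Assumed stand-in for the file: case numbers 1..300000.
cases = range(1, 300_001)
rng = random.Random(20160718)  # arbitrary fixed seed, for repeatability
sample = sample_by_random_key(cases, 500, rng)
print(len(sample), len(set(sample)))  # 500 500
```

If the keys really are independent of file order, every case has the same chance of landing in the first 500, which is why the second SAMPLE step looked redundant.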

This turns out to have been A Bad Idea. We all know random often isn't truly random. But there is a strong correlation between the random number generated and the case number of the record for which it is generated. As our files happen to be sorted roughly by year of birth, this has profoundly impacted the randomness of our sample.

This tendency to give lower random numbers to certain cases happens every single time you run uniform. Which cases are affected depends on the number of random operations you have done in the current SPSS session. In fact, given a certain dataset, uniform will generate the exact same "random" numbers the next time - given that you close and re-open SPSS entirely. It will do so even on a different machine. Running and re-running the same uniform function a few times will then generate a sample differently correlated with the case number.

From what I read at the Knowledge Center, basically any other random generator would suffice for our needs. Given our use case, would you recommend any particular method or methods?

Joost
joost
7/18/2016 8:01:40 AM
comp.soft-sys.stat.spss



I do not see any evidence of such a correlation.  The random number generators pass the usual tests for randomness.  However, the Mersenne Twister generator has better properties than the other one, which is the default.  You can set it as the one to use with SET RNG=MT or in Edit > Options > Data.
Jon
7/18/2016 1:05:23 PM
Hi Jon,

What would you consider evidence?

I'm not entirely sure it is the casenum that is correlated, but there is a very very clear correlation between year of birth and chance of being selected in our data generated by about 120 random samples. And our data happens to be sorted by year of birth, hence the assumption. Given a variant of the dataset we use, I can predict an over-representation of 4% of the 40-64 year olds in every first sample within an SPSS session.
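
One way to make "evidence" concrete is to measure the correlation between case number and random key directly, and to check how the lowest keys are spread over the file. The sketch below does this in Python with Python's own generator, so it says nothing about SPSS itself; to test SPSS one would export the case numbers and uniform values and feed them to the same checks. The helper `pearson`, the seed, and the file size are assumptions for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(12345)  # arbitrary seed
n_cases = 300_000           # assumed file size
case_numbers = list(range(1, n_cases + 1))
keys = [rng.random() for _ in case_numbers]

# Check 1: linear correlation between position in the file and key.
r = pearson(case_numbers, keys)
print(f"correlation(case number, key) = {r:.4f}")

# Check 2: share of the 500 lowest keys that sit in the first half
# of the file; an unbiased generator should give roughly 0.5.
lowest = sorted(range(n_cases), key=keys.__getitem__)[:500]
share_first_half = sum(1 for i in lowest if i < n_cases // 2) / 500
print(f"share of sample from first half of file = {share_first_half:.3f}")
```

Replacing "first half of the file" with the rows of the 40-64 age bracket would give the over-representation figure discussed here.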

I could elaborate in a blog post if you like.

Is the Mersenne Twister a procedure behind all the randomizers? Because I was hoping the RV. functions would be better than the UNIFORM function.

Joost
joost
7/18/2016 2:15:31 PM
On Mon, 18 Jul 2016 01:01:40 -0700 (PDT), joost schouppe
<joost.schouppe@gmail.com> wrote:

.... [ about the SPSS Pseudo-Random-Number-Generator, PRNG]

> In fact, given a certain dataset, uniform will generate the exact 
> same "random" numbers the next time - given that you close and   
> re-open SPSS entirely. It will do so even on a different machine. 
> Running and re-running the same uniform function a few times, will 
> then generate a sample differently correlated with the casenumber.
....

This is a /proper/ feature of a PRNG:  You give it a specific seed
and it allows you to exactly replicate the sequence.  You document
the 'seed' so that anyone can replicate on another machine. 

To get a different sequence, you use a different seed.  I used to
read some recommendation, like, "pick a 10+ digit, odd number" 
when reading the documentation for various routines.  Of course,
if you do a second randomization in the same run, it will pick up 
its generation from where it left off, so that would be a different
sequence. 
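
The seed behaviour described above can be illustrated with a short Python sketch. Python's `random` module happens to use a Mersenne Twister generator, but its seeding and output are not those of SPSS; the seed value below is just an arbitrary 10-digit odd number chosen for the example.

```python
import random

SEED = 1234567891  # arbitrary 10-digit odd number, assumed for illustration

first_run = random.Random(SEED)
second_run = random.Random(SEED)

# Same seed, fresh generator: the sequence is replicated exactly.
a = [first_run.random() for _ in range(5)]
b = [second_run.random() for _ in range(5)]
print(a == b)  # True

# A second randomization in the same run picks up where the first
# one left off, so it yields a different sequence.
c = [first_run.random() for _ in range(5)]
print(a == c)  # False
```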

Now, from the definition of "random", it is going to be true that 
1 in 20 sequences should fail any given 5% test for randomness. 
And 1 in 1000 will fail at the 0.1% level.  - a PRNG that does /not/
produce that sort of test results is a faulty PRNG.

"Correlating with case number" is potentially a test for randomness.  
It could be that your (default) Seed has some correlation with 
Casenumber, which is purely a random occurrence.  However,
if you use the same seed, you will always get the same result. 

Unless your application is trivial, do not use the default for Seed. 

-- 
Rich Ulrich 

Rich
7/18/2016 5:01:14 PM
Hi Rich,

Thanks a lot! I was already under the impression this was a feature, not a bug. It is more the documentation which is lacking, not the software. What is the source for your citation? I cannot find this in any official documentation from SPSS.

One of the solutions I was thinking of, was in fact having a user generated seed fed to the script. Which is an impossible task for humans, but anyway.

But I thought the RV. functions are "truly" random, eliminating the need for manual seeding?

joost
7/19/2016 9:38:33 AM

Please see the following.

http://www.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/randvarfunctionlist.htm

http://www.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/syn_set_rng_seed_mtindex.htm

https://www.random.org/randomness/

Near the top of the first page given above, note the following:

"The following functions give a random variate from a specified distribution. The arguments are the parameters of the distribution. You can repeat the sequence of *pseudorandom* numbers by setting a seed in the Preferences dialog box before each sequence. Note the period in each function name."

HTH.
Bruce
7/19/2016 1:04:53 PM

The RNGs can be set to start at a fixed seed, for reproducibility, or a random seed.  See (Data > Random Number Generators) or SET SEED RANDOM.  The MT generator is entirely separate.  The RV functions will use whichever generator you specify.

If you want to send me a data file and appropriate syntax (peck AT gmail.com), I could try to see why you are getting overrepresentation (but do check the SEED initialization setting).  Of course, if you switch the generator and you still get nonuniform sampling, then you can be sure that the problem is elsewhere.
Jon
7/19/2016 8:36:37 PM
I have set the RNG, with no other changes to randomization process: 

SET RNG=MT MTINDEX=RANDOM.

Results are looking good. The graph shows the samples in chronological order, with the % of the anomalous age bracket in each sample. At sample 86 we changed from double to single randomization. Sample 206 is when we introduced the RNG setting above.

http://i.imgur.com/kobZx1N.png
joost
8/16/2016 9:14:15 AM