Hi,

Quick googling didn't turn anything up, so I thought I'd share my experience with the "uniform" function. This is the only randomization function called "pseudorandom" in the IBM Knowledge Center. We just discovered this has some real implications.

We need to make a weekly random sample of 500 from about 300.000. In a first generation of the script, I used uniform(1) to first randomize the order of the file, then SAMPLE 500 from 250.000.

We removed the second part from the script as it seemed redundant (and also for the practical problem of finding out what went wrong over 1000 lines of code when the N for some reason drops below 250.000). Uniform is executed, then the file is sorted by the uniform number and the first 500 cases of the file are selected.

This turns out to have been A Bad Idea. We all know random often isn't truly random. But there is a strong correlation between the random number generated and the casenumber of the record for which it is generated. As our files happen to be sorted roughly by year of birth, this has profoundly impacted the randomness of our sample.

This tendency to give lower random numbers to certain cases happens every single time you run uniform. This is dependent on the number of random operations you have done in the current SPSS session. In fact, given a certain dataset, uniform will generate the exact same "random" numbers the next time - given that you close and re-open SPSS entirely. It will do so even on a different machine. Running and re-running the same uniform function a few times will then generate a sample differently correlated with the casenumber.

From what I read at the Knowledge Center, basically any other random generator would suffice for our needs. Given our use-case, would you recommend any particular method or methods?

Joost
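For readers who want to see the "sort by a uniform number, keep the first 500" procedure in isolation, here is a minimal sketch in Python rather than SPSS syntax. The function name, the seed value and the case-number range are illustrative assumptions, and Python's `random` module uses the Mersenne Twister, not the SPSS default generator, so this shows the technique only and does not reproduce the SPSS behaviour described above.

```python
import random


def sample_by_random_sort(case_ids, n, rng):
    """Attach one uniform draw to each case, sort by it, keep the first n cases."""
    keyed = [(rng.random(), case_id) for case_id in case_ids]
    keyed.sort()  # orders the cases by their random key
    return [case_id for _, case_id in keyed[:n]]


# Hypothetical seed and population size, for illustration only.
rng = random.Random(20160718)
sample = sample_by_random_sort(range(1, 300_001), 500, rng)

# Re-running with the same seed reproduces the same sample.
repeat = sample_by_random_sort(range(1, 300_001), 500, random.Random(20160718))
```

If the draws are independent of case order, every case has the same chance of ending up among the first 500 after sorting.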

7/18/2016 8:01:40 AM

On Monday, July 18, 2016 at 2:01:48 AM UTC-6, joost schouppe wrote:
> Hi,
>
> Quick googling didn't turn anything up, so I thought I'd share my experience with the "uniform" function. This is the only randomization function called "pseudorandom" in the IBM Knowledge Center. We just discovered this has some real implications.
>
> We need to make a weekly random sample of 500 from about 300.000. In a first generation of the script, I used uniform(1) to first randomize the order of the file, then SAMPLE 500 from 250.000.
>
> We removed the second part from the script as it seemed redundant (and also for the practical problem of finding out what went wrong over 1000 lines of code when the N for some reason drops below 250.000). Uniform is executed, then the file is sorted by the uniform number and the first 500 cases of the file are selected.
>
> This turns out to have been A Bad Idea. We all know random often isn't truly random. But there is a strong correlation between the random number generated and the casenumber of the record for which it is generated. As our files happen to be sorted roughly by year of birth, this has profoundly impacted the randomness of our sample.
>
> This tendency to give lower random numbers to certain cases happens every single time you run uniform. This is dependent on the number of random operations you have done in the current SPSS session. In fact, given a certain dataset, uniform will generate the exact same "random" numbers the next time - given that you close and re-open SPSS entirely. It will do so even on a different machine. Running and re-running the same uniform function a few times will then generate a sample differently correlated with the casenumber.
>
> From what I read at the Knowledge Center, basically any other random generator would suffice for our needs. Given our use-case, would you recommend any particular method or methods?
>
> Joost

I do not see any evidence of such a correlation. The random number generators pass the usual tests for randomness. However, the Mersenne Twister generator has better properties than the other one, which is the default. You can set it as the one to use with SET RNG=MT or in Edit > Options > Data.
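One simple way to look for the kind of correlation under discussion is to compute the Pearson correlation between case number and the generated value. The sketch below does this in Python (not SPSS syntax) with an arbitrary seed and a hypothetical file of 300,000 cases; it is one possible check, not one of the "usual tests" referred to above. Under independence, r is approximately normal with standard deviation 1/sqrt(n), so values within a few multiples of that are unremarkable.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


n_cases = 300_000                      # hypothetical file size
rng = random.Random(12345)             # arbitrary seed, for illustration only
case_numbers = list(range(1, n_cases + 1))
draws = [rng.random() for _ in case_numbers]

r = pearson_r(case_numbers, draws)
sd_under_independence = 1 / math.sqrt(n_cases)
z = r / sd_under_independence          # rough z-score for "no correlation"
```

A linear correlation near zero does not rule out other patterns, so a check like this is a starting point rather than proof of randomness.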

7/18/2016 1:05:23 PM

Hi Jon,

What would you consider evidence?

I'm not entirely sure it is the casenum that is correlated, but there is a very very clear correlation between year of birth and chance of being selected in our data generated by about 120 random samples. And our data happens to be sorted by year of birth, hence the assumption. Given a variant of the dataset we use, I can predict an over-representation of 4% of the 40-64 year olds in every first sample within an SPSS session.

I could elaborate in a blog post if you like.

Is the Mersenne Twister a procedure behind all the randomizers? Because I was hoping the RV. functions would be better than the UNIFORM function.

Joost
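To put a figure like "4% over-representation" in context, it helps to compare it with the sampling error expected from a sample of 500. The Python sketch below does that arithmetic. The population share of the 40-64 bracket (33%) is a made-up assumption, the 4% is read as 4 percentage points, and the finite population correction is ignored because 500 out of roughly 300,000 is a tiny fraction; only the sample size of 500 and the count of about 120 samples come from the thread.

```python
import math

n = 500                  # cases per weekly sample (from the thread)
n_samples = 120          # approximate number of samples (from the thread)
p_population = 0.33      # ASSUMPTION: hypothetical share of 40-64 year olds
excess = 0.04            # ASSUMPTION: "4%" read as 4 percentage points

# Standard error of the bracket share in one simple random sample of size n.
se_single = math.sqrt(p_population * (1 - p_population) / n)

# How unusual is the excess in a single sample?
z_single = excess / se_single

# If the same excess is the average over many independent samples,
# the standard error of that average shrinks by sqrt(n_samples).
se_mean = se_single / math.sqrt(n_samples)
z_mean = excess / se_mean
```

Under these assumptions a 4-point excess in one sample is within ordinary sampling noise, while the same excess sustained as an average over 120 samples would not be.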

7/18/2016 2:15:31 PM

On Mon, 18 Jul 2016 01:01:40 -0700 (PDT), joost schouppe <joost.schouppe@gmail.com> wrote:

....
[ about the SPSS Pseudo-Random-Number-Generator, PRNG]

> In fact, given a certain dataset, uniform will generate the exact same "random" numbers the next time - given that you close and re-open SPSS entirely. It will do so even on a different machine. Running and re-running the same uniform function a few times, will then generate a sample differently correlated with the casenumber.

....

This is a /proper/ feature of a PRNG: You give it a specific seed and it allows you to exactly replicate the sequence. You document the 'seed' so that anyone can replicate on another machine. To get a different sequence, you use a different seed. I used to read some recommendation, like, "pick a 10+ digit, odd number" when reading the documentation for various routines.

Of course, if you do a second randomization in the same run, it will pick up its generation from where it left off, so that would be a different sequence.

Now, from the definition of "random", it is going to be true that 1 in 20 sequences should fail any given 5% test for randomness. And 1 in 1000 will fail at the 0.1% level.
- a PRNG that does /not/ produce that sort of test results is a faulty PRNG.

"Correlating with case number" is potentially a test for randomness. It could be that your (default) Seed has some correlation with Casenumber, which is purely a random occurrence. However, if you use the same seed, you will always get the same result.

Unless your application is trivial, do not use the default for Seed.

--
Rich Ulrich
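The seed behaviour described here can be sketched in a few lines of Python. This uses Python's own generator (a Mersenne Twister), not the SPSS one, and the seed values and helper name are arbitrary choices for illustration.

```python
import random


def draw(rng, n):
    """Return the next n uniform numbers from the given generator."""
    return [rng.random() for _ in range(n)]


# "Session 1": fixed seed, two randomizations in the same run.
session_1 = random.Random(2000001)
first_run = draw(session_1, 5)
second_run = draw(session_1, 5)    # continues where first_run left off

# "Session 2": a fresh start with the same seed replays the sequence.
session_2 = random.Random(2000001)
replayed = draw(session_2, 5)

# A different seed gives a different sequence.
other_seed = draw(random.Random(2000003), 5)
```

Here `replayed` equals `first_run`, while `second_run` and `other_seed` differ from it, matching the three cases in the post: same seed, continued sequence, and different seed.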

7/18/2016 5:01:14 PM

Hi Rich,

Thanks a lot! I was already under the impression this was a feature, not a bug. It is more the documentation which is lacking, not the software. What is the source for your citation? I cannot find this in any official documentation from SPSS.

One of the solutions I was thinking of, was in fact having a user-generated seed fed to the script. Which is an impossible task for humans, but anyway.

But I thought the RV. functions are "truly" random, eliminating the need for manual seeding?

7/19/2016 9:38:33 AM

On Tuesday, July 19, 2016 at 5:38:36 AM UTC-4, joost schouppe wrote:
> Hi Rich,
>
> Thanks a lot! I was already under the impression this was a feature, not a bug. It is more the documentation which is lacking, not the software. What is the source for your citation? I cannot find this in any official documentation from SPSS.
>
> One of the solutions I was thinking of, was in fact having a user generated seed fed to the script. Which is an impossible task for humans, but anyway.
>
> But I thought the RV. functions are "truly" random, eliminating the need for manual seeding?

Please see the following.

http://www.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/randvarfunctionlist.htm
http://www.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/syn_set_rng_seed_mtindex.htm
https://www.random.org/randomness/

Near the top of the first page given above, note the following:

"The following functions give a random variate from a specified distribution. The arguments are the parameters of the distribution. You can repeat the sequence of *pseudorandom* numbers by setting a seed in the Preferences dialog box before each sequence. Note the period in each function name."

HTH.

7/19/2016 1:04:53 PM

On Tuesday, July 19, 2016 at 3:38:36 AM UTC-6, joost schouppe wrote:
> Hi Rich,
>
> Thanks a lot! I was already under the impression this was a feature, not a bug. It is more the documentation which is lacking, not the software. What is the source for your citation? I cannot find this in any official documentation from SPSS.
>
> One of the solutions I was thinking of, was in fact having a user generated seed fed to the script. Which is an impossible task for humans, but anyway.
>
> But I thought the RV. functions are "truly" random, eliminating the need for manual seeding?

The RNGs can be set to start at a fixed seed, for reproducibility, or a random seed. See (Data > Random Number Generators) or SET SEED RANDOM. The MT generator is entirely separate. The RV functions will use whichever generator you specify.

If you want to send me a data file and appropriate syntax (peck AT gmail.com), I could try to see why you are getting overrepresentation (but do check the SEED initialization setting). Of course, if you switch the generator and you still get nonuniform sampling, then you can be sure that the problem is elsewhere.
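The choice between a fixed seed and a random seed does not have to be either/or: a script can pick a fresh seed on each run and record it, so no human has to invent one and any sample can still be reproduced later. Below is a minimal Python sketch of that idea. The helper name `make_logged_rng` and the 64-bit seed size are assumptions made for illustration, and this is not SPSS syntax.

```python
import random
import secrets


def make_logged_rng(seed=None):
    """Return (seed, generator); draw a fresh seed from the OS if none is given."""
    if seed is None:
        seed = secrets.randbits(64)   # unpredictable seed from OS entropy
    return seed, random.Random(seed)


# Weekly run: fresh seed each time; store `seed` alongside the sample.
seed, rng = make_logged_rng()
weekly_sample = rng.sample(range(1, 300_001), 500)

# Later audit: the recorded seed reproduces exactly the same sample.
_, replay_rng = make_logged_rng(seed)
replayed_sample = replay_rng.sample(range(1, 300_001), 500)
```

Writing the seed to the log or output file each week keeps the randomness of a random start while preserving the reproducibility of a fixed one.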

7/19/2016 8:36:37 PM

I have set the RNG, with no other changes to the randomization process: SET RNG=MT MTINDEX=RANDOM.

Results are looking good. The graph shows the samples in chronological order, with the % of the anomalous age bracket in each sample. At sample 86 we changed from double to single randomization. Sample 206 is when we introduced the RNG setting above.

http://i.imgur.com/kobZx1N.png

8/16/2016 9:14:15 AM