Re: Help! Accessing observations from a dataset using a lookup
Your suggestion should be
if (id in (<list of ids>)) then output;
with a macro to make the list of wanted IDs. However, that is the slowest
form of search, and I would not suggest trying it in this form.
Formats can be good with shorter lists, but remember that the whole list
must be held in memory. Several years ago Paul Dorfman ran a test with
450,000 entries in the list against a hand-tailored hash. Of course, the
hash won by a large factor, something like 40 times faster.
Making a format is easy once you have the wanted IDs. For example (the
records after CARDS are sample wanted IDs; the EOF: block adds the "other"
record so unmatched IDs format to "N"):
data wanted (rename = (id = start)) ;
   retain fmtname "$wanted" label "Y" hlo " " ;
   infile cards eof = eof ;
   input id $20. ;
   output ;
   return ;
 eof : hlo = "O" ; label = "N" ; output ;
 cards ;
2
4
;
proc format cntlin = wanted ;
run ;
data big ;
   do x = 1 to 10 ;
      id = put(x, 8. -l) ;
      output ;
   end ;
run ;
data got ;
   set big ;
   if put(id, $wanted.) = "Y" ;
run ;
The hash method is almost as simple but much faster.
data wanted ;
   infile cards ;
   input id $20. ;
cards ;
2
4
;
data got ;
   length id $ 20 ;
   declare hash h(dataset: "work.wanted", hashexp: 16) ;
   h.definekey("id") ;
   h.definedone() ;
   do until (eof) ;
      set big end = eof ;
      if h.check() = 0 then output ;
   end ;
   stop ;
run ;
All of the code has been tested. Size tests with real data are
left to you. I would suggest you start by taking the first million
records from your big file as test data and looking for subsets
of, say, every 1000th record, then every tenth, and finally every
other record. When you are comfortable with the code, then move to
the full file and the data set of wanted IDs.
Date: Thu, 5 Jul 2007 09:11:20 -0700
Reply-To: Alon <akadas@GMAIL.COM>
Sender: "SAS(r) Discussion"
From: Alon <akadas@GMAIL.COM>
Subject: Re: Help! Accessing observations from a dataset using a lookup
Comments: To: sas-l
Content-Type: text/plain; charset="us-ascii"
First I'd like to say thanks to all of you for the input.
Just to confirm for all of you, your gasps were not in vain; I did not
mistakenly change an 'm' to a 'b' or read an additional 3 zeroes. Yes,
these are on tapes as well.
The key to this problem is that Master is the large dataset, which I cannot
modify (there goes the index idea out the window), while Lookup is the
smaller dataset. It may have 100 IDs or 100,000 IDs, but to say the least it
is much, much smaller. Though copying the dataset over again and then
dealing with it works, the amount of redundancy associated with that is too
large for me to leave unaddressed (though unfortunately that is what I and
everyone else in my company resort to).
Even though the format method seems nice, I am not sure how I would
define the format. I also thought of a different approach that would store
the IDs in a macro variable and then do a 'where in (&id)' - but then
realized that cannot be scaled to larger numbers of IDs (or can it?).
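For what it's worth, a minimal sketch of that macro-variable approach
(assuming the lookup dataset is work.lookup with a character variable id,
and the big dataset is master - names are illustrative only). The scaling
limit is real: a single macro variable tops out at 65,534 characters.
proc sql noprint ;
   select quote(trim(id)) into :idlist separated by ","
   from work.lookup ;
quit ;
data got ;
   set master ;
   where id in (&idlist) ;
run ;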
One thing that came to mind is putting IDs into several macro variables in
chunks of, say, 1000 (I am not sure one macro variable can fit that many,
but for argument's sake let's say it can): &ID1 &ID2 ... &&ID&N (&N is the #
of macro variables needed - say N=6 if you have 5,453 IDs). So one solution
would be to just have a do loop with checks on the IDs (now I'm just
stretching out the code's capability; I've never done this before and have
never seen it done before, so I hope something like this is possible - if
any of you have suggestions then please comment).
do i = 1 to &N; if (id in (&&ID&I..)) then output; end;
**I realize the above has an issue since it is not in a %do loop but a
regular data step do loop
The nice thing about this is that you can do some stuff like add flags (i.e.
&&FLAG&I) which will serve as counters, and when &&flag&i = 1000 you short
out the statement '(id in (&&ID&I..))'.
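Since &&ID&I has to be resolved by the macro processor, the loop would
indeed have to be a %do loop generating one IF statement per chunk. A
hedged sketch (it assumes &N and the chunk variables &&ID&I, each a
comma-separated list of quoted IDs, have already been built, and that the
chunks do not overlap so no record is output twice):
%macro find_wanted ;
   data got ;
      set master ;
      %do i = 1 %to &N ;
         if id in (&&ID&i) then output ;
      %end ;
   run ;
%mend find_wanted ;
%find_wanted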
I will also test out the hash object method (it's good to learn how to do
Again, thanks everyone for the support.