Multiple data sets based on a variable

  • Follow


I need to create separate data sets from a single data set based on a
variable (column B) . Like thus:

Original Data Set 1:
19404,1987,4200439
19404,2329,5160848
19404,1987,6127673
19404,1987,7077287
19404,2329,5066985
19404,1987,6127673
19404,2329,4781209
19404,1987,0868172
19404,2329,6030901
19404,1987,9897579

Need:

Data Set 2:
19404,1987,4200439
19404,1987,6127673
19404,1987,7077287
19404,1987,6127673
19404,1987,0868172
19404,1987,9897579

Data Set 3:
19404,2329,5160848
19404,2329,5066985
19404,2329,4781209
19404,2329,6030901

There could be hundreds of data sets so I need a way to capture the
unique values and write those out to separate data sets.

Thanks for any help.

Mike S.
0
Reply MikeS 12/22/2009 6:51:44 PM

Preliminary steps: SORT the data set by column B, keeping only column B, and deleting duplicates.

Real work pseudo-code (refine till it comes out right):

DATA _NULL_;
    SET preliminary_data_set END=last;
    label = "value" || TRIM(PUT(_N_, 4.));
    CALL SYMPUT(label, column_B);
    IF last THEN
        CALL SYMPUT("count", _N_);
RUN;

DATA
    %DO i = 1 %TO &count;
        new_&&value&i
    %END;
    ;
    SET original_data_set;
    %DO i = 1 %TO &count
        IF column_B = &&value&i THEN
            OUTPUT new_&&value&i;
    %end
RUN;

It may be necessary to encapsulate both %DO loops in macros.

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of MikeS
Sent: Tuesday, December 22, 2009 10:52 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Multiple data sets based on a variable

I need to create separate data sets from a single data set based on a
variable (column B) . Like thus:

Original Data Set 1:
19404,1987,4200439
19404,2329,5160848
19404,1987,6127673
19404,1987,7077287
19404,2329,5066985
19404,1987,6127673
19404,2329,4781209
19404,1987,0868172
19404,2329,6030901
19404,1987,9897579

Need:

Data Set 2:
19404,1987,4200439
19404,1987,6127673
19404,1987,7077287
19404,1987,6127673
19404,1987,0868172
19404,1987,9897579

Data Set 3:
19404,2329,5160848
19404,2329,5066985
19404,2329,4781209
19404,2329,6030901

There could be hundreds of data sets so I need a way to capture the
unique values and write those out to separate data sets.
0
Reply barry.a.schwarz (608) 12/22/2009 8:24:37 PM


Mike,

This is comparable to the previously SAS-L posted and much discussed problem "Split dataset sashelp.class into separate datasets by the age variable".  Here's the concise %for macro solution with your data:

    proc sort data=original nodupkey out=bees; by b; run;

    data %for(b, in=[bees], do=%nrstr(b_&b(where=(b=&b))));
        set original;
    run;

These include the b value in the name of the output dataset.  To name the output datasets sequentially, change the data step above to:

    data %for(b, in=[bees], do=%nrstr(b_&_n_(where=(b=&b))));
        set original;
    run;

Jim Anderson
UCSF

(You can find the %for macro and a description of how to use it on sascommunity.org here http://20.fi/2843)

-----Original Message-----
From: MikeS [mailto:mikestout@ROCKETMAIL.COM]
Sent: Tuesday, December 22, 2009 10:52 AM
Subject: Multiple data sets based on a variable

I need to create separate data sets from a single data set based on a
variable (column B) . Like thus:

Original Data Set 1:
19404,1987,4200439
19404,2329,5160848
19404,1987,6127673
19404,1987,7077287
19404,2329,5066985
19404,1987,6127673
19404,2329,4781209
19404,1987,0868172
19404,2329,6030901
19404,1987,9897579

Need:

Data Set 2:
19404,1987,4200439
19404,1987,6127673
19404,1987,7077287
19404,1987,6127673
19404,1987,0868172
19404,1987,9897579

Data Set 3:
19404,2329,5160848
19404,2329,5066985
19404,2329,4781209
19404,2329,6030901

There could be hundreds of data sets so I need a way to capture the
unique values and write those out to separate data sets.

Thanks for any help.

Mike S.
0
Reply James.Anderson (26) 12/22/2009 9:36:02 PM

<sigh> You do not want to do this
And the code examples on this page
Along with other provided examples in other posts
Will eventually convince you that by-var processing
Is a better way to go.

Tiny url: http://tinyurl.com/5abz5h

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

but you have to squint at the later steps in your algorithm
to understand why

Ron Fehd  the process maven

-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of MikeS
Sent: Tuesday, December 22, 2009 10:52 AM
To: sas-l@uga.edu
Subject: Multiple data sets based on a variable

I need to create separate data sets from a single data set based on a
variable (column B) . Like thus:

Original Data Set 1:
19404,1987,4200439
19404,2329,5160848
19404,1987,6127673
19404,1987,7077287
19404,2329,5066985
19404,1987,6127673
19404,2329,4781209
19404,1987,0868172
19404,2329,6030901
19404,1987,9897579

Need:

Data Set 2:
19404,1987,4200439
19404,1987,6127673
19404,1987,7077287
19404,1987,6127673
19404,1987,0868172
19404,1987,9897579

Data Set 3:
19404,2329,5160848
19404,2329,5066985
19404,2329,4781209
19404,2329,6030901

There could be hundreds of data sets so I need a way to capture the
unique values and write those out to separate data sets.

Thanks for any help.

Mike S.
0
Reply rjf2 (3354) 12/23/2009 4:31:32 AM

I hope it helps you:


proc sql;
create table unique_B as
select distinct b from original;
run;

data _null_;
do until (fin);
set unique_B end=fin;
call symputx('myB',B);
call execute('data data_&myB; set original (where=(B=&myB)); run;');
end;
run;


Daniel Fernandez
Barcelona

2009/12/23 Fehd, Ronald J. (CDC/CCHIS/NCPHI) <rjf2@cdc.gov>:
> <sigh> You do not want to do this
> And the code examples on this page
> Along with other provided examples in other posts
> Will eventually convince you that by-var processing
> Is a better way to go.
>
> Tiny url: http://tinyurl.com/5abz5h
>
> http://www.sascommunity.org/wiki/Split_Data_into_Subsets
>
> but you have to squint at the later steps in your algorithm
> to understand why
>
> Ron Fehd  the process maven
>
> -----Original Message-----
> From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
> On Behalf Of MikeS
> Sent: Tuesday, December 22, 2009 10:52 AM
> To: sas-l@uga.edu
> Subject: Multiple data sets based on a variable
>
> I need to create separate data sets from a single data set based on a
> variable (column B) . Like thus:
>
> Original Data Set 1:
> 19404,1987,4200439
> 19404,2329,5160848
> 19404,1987,6127673
> 19404,1987,7077287
> 19404,2329,5066985
> 19404,1987,6127673
> 19404,2329,4781209
> 19404,1987,0868172
> 19404,2329,6030901
> 19404,1987,9897579
>
> Need:
>
> Data Set 2:
> 19404,1987,4200439
> 19404,1987,6127673
> 19404,1987,7077287
> 19404,1987,6127673
> 19404,1987,0868172
> 19404,1987,9897579
>
> Data Set 3:
> 19404,2329,5160848
> 19404,2329,5066985
> 19404,2329,4781209
> 19404,2329,6030901
>
> There could be hundreds of data sets so I need a way to capture the
> unique values and write those out to separate data sets.
>
> Thanks for any help.
>
> Mike S.
>
0
Reply fdezdan (222) 12/23/2009 10:31:19 AM

Our choices are:
[ ] accurate
[ ] cheap
[ ] fast

This solution is easy to code and obvious in what it does.

It is knot cheap:
You read the complete data set
Each time you create a subset.

Processing time: O(N(subsets)*N(rows))

Ron the a?c?f? maven

-----Original Message-----
From: Daniel Fern�ndez [mailto:fdezdan@gmail.com]
Sent: Wednesday, December 23, 2009 2:31 AM
To: Fehd, Ronald J. (CDC/CCHIS/NCPHI)
Cc: SAS-L@listserv.uga.edu
Subject: Re: Multiple data sets based on a variable

I hope it helps you:


proc sql;
create table unique_B as
select distinct b from original;
run;

data _null_;
do until (fin);
set unique_B end=fin;
call symputx('myB',B);
call execute('data data_&myB; set original (where=(B=&myB)); run;');
end;
run;


Daniel Fernandez
Barcelona

2009/12/23 Fehd, Ronald J. (CDC/CCHIS/NCPHI) <rjf2@cdc.gov>:
> <sigh> You do not want to do this
> And the code examples on this page
> Along with other provided examples in other posts
> Will eventually convince you that by-var processing
> Is a better way to go.
>
> Tiny url: http://tinyurl.com/5abz5h
>
> http://www.sascommunity.org/wiki/Split_Data_into_Subsets
>
> but you have to squint at the later steps in your algorithm
> to understand why
>
> Ron Fehd  the process maven
>
> -----Original Message-----
> From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
> On Behalf Of MikeS
> Sent: Tuesday, December 22, 2009 10:52 AM
> To: sas-l@uga.edu
> Subject: Multiple data sets based on a variable
>
> I need to create separate data sets from a single data set based on a
> variable (column B) . Like thus:
>
> Original Data Set 1:
> 19404,1987,4200439
> 19404,2329,5160848
> 19404,1987,6127673
> 19404,1987,7077287
> 19404,2329,5066985
> 19404,1987,6127673
> 19404,2329,4781209
> 19404,1987,0868172
> 19404,2329,6030901
> 19404,1987,9897579
>
> Need:
>
> Data Set 2:
> 19404,1987,4200439
> 19404,1987,6127673
> 19404,1987,7077287
> 19404,1987,6127673
> 19404,1987,0868172
> 19404,1987,9897579
>
> Data Set 3:
> 19404,2329,5160848
> 19404,2329,5066985
> 19404,2329,4781209
> 19404,2329,6030901
>
> There could be hundreds of data sets so I need a way to capture the
> unique values and write those out to separate data sets.
>
> Thanks for any help.
>
> Mike S.
>
0
Reply rjf2 (3354) 12/23/2009 2:53:28 PM

Hi, Fehd, Ronald J.
sorry I dodn�t read the linked solution.

After thinking about the same problem I got
my new proposal (it is fast and only read the no. observations must read):

* WE FIRST SORT BY B VARNAME;
proc sort ; by B ; run;

* WE FLAG FIRST AND LAST B VARNAME WITHOUT DUPLICATES;
data copy_of_original;
set original;
by b;
if first.b then first=_n_;
retain first;
if last.b then last=_n_;
if  last.b;
run;

* WE JUST SELECT THE NUMBER OF ROWS FOR EACH SUBSET;
data _null_;
do until(fin);
set copy_of_original end=fin;
by b;
call symputx('firstobs',first);
call symputx('obs',last);
call symputx('b',b);
call execute('
data dst_&b; set ORIGINAL (firstobs=&firstobs obs=&obs); run;');
end;
run;

I listen to your opinions about this method.

Daniel Fernandez
Barcelona






2009/12/23 Fehd, Ronald J. (CDC/CCHIS/NCPHI) <rjf2@cdc.gov>:
> Our choices are:
> [ ] accurate
> [ ] cheap
> [ ] fast
>
> This solution is easy to code and obvious in what it does.
>
> It is knot cheap:
> You read the complete data set
> Each time you create a subset.
>
> Processing time: O(N(subsets)*N(rows))
>
> Ron the a?c?f? maven
>
> -----Original Message-----
> From: Daniel Fern�ndez [mailto:fdezdan@gmail.com]
> Sent: Wednesday, December 23, 2009 2:31 AM
> To: Fehd, Ronald J. (CDC/CCHIS/NCPHI)
> Cc: SAS-L@listserv.uga.edu
> Subject: Re: Multiple data sets based on a variable
>
> I hope it helps you:
>
>
> proc sql;
> create table unique_B as
> select distinct b from original;
> run;
>
> data _null_;
> do until (fin);
> set unique_B end=fin;
> call symputx('myB',B);
> call execute('data data_&myB; set original (where=(B=&myB)); run;');
> end;
> run;
>
>
> Daniel Fernandez
> Barcelona
>
> 2009/12/23 Fehd, Ronald J. (CDC/CCHIS/NCPHI) <rjf2@cdc.gov>:
>> <sigh> You do not want to do this
>> And the code examples on this page
>> Along with other provided examples in other posts
>> Will eventually convince you that by-var processing
>> Is a better way to go.
>>
>> Tiny url: http://tinyurl.com/5abz5h
>>
>> http://www.sascommunity.org/wiki/Split_Data_into_Subsets
>>
>> but you have to squint at the later steps in your algorithm
>> to understand why
>>
>> Ron Fehd  the process maven
>>
>> -----Original Message-----
>> From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
>> On Behalf Of MikeS
>> Sent: Tuesday, December 22, 2009 10:52 AM
>> To: sas-l@uga.edu
>> Subject: Multiple data sets based on a variable
>>
>> I need to create separate data sets from a single data set based on a
>> variable (column B) . Like thus:
>>
>> Original Data Set 1:
>> 19404,1987,4200439
>> 19404,2329,5160848
>> 19404,1987,6127673
>> 19404,1987,7077287
>> 19404,2329,5066985
>> 19404,1987,6127673
>> 19404,2329,4781209
>> 19404,1987,0868172
>> 19404,2329,6030901
>> 19404,1987,9897579
>>
>> Need:
>>
>> Data Set 2:
>> 19404,1987,4200439
>> 19404,1987,6127673
>> 19404,1987,7077287
>> 19404,1987,6127673
>> 19404,1987,0868172
>> 19404,1987,9897579
>>
>> Data Set 3:
>> 19404,2329,5160848
>> 19404,2329,5066985
>> 19404,2329,4781209
>> 19404,2329,6030901
>>
>> There could be hundreds of data sets so I need a way to capture the
>> unique values and write those out to separate data sets.
>>
>> Thanks for any help.
>>
>> Mike S.
>>
>
0
Reply fdezdan (222) 12/23/2009 4:56:20 PM

This is a fairly common problem.  As Ron noted, it may not always be necessary/desirable to create separate data sets, but there certainly are times when there is no way around this.  For instance, there may be a requirement to supply an appropriate "subset" data set to each of a group of regional managers.

And, as the responses make clear, there is no easy, direct way to do this in SAS.  "Traditional" DATA step techniques require the output data set to be specified at compile time.  The hash object does provide a way around this, but at the cost of requiring all of the data to be stored in memory, which certainly isn't an inherent requirement of the problem.

Accepting the need to at least specify all possible output data sets up front in the DATA statement, I'd like to see something like:

/* Note this is a proposal, not working code */
DATA ... /* list of possible data sets, presumably generated */ ;
SET  Original;
OUTPUT _TEMPLATE_ (DSNAME = 'B_' || B);
RUN;

This proposes a new keyword, _TEMPLATE_, which would provide a dynamic capability to the OUTPUT statement.  Whenever the OUTPUT statement is executed at run time, SAS would evaluate the character expression associated with DSNAME, and write an observation out to the data set that matches.  It would be an error if the evaluated character string did not match one of the data set names listed on the DATA statement.  This would significantly simplify the code, compared to a SELECT block or IF-THEN-ELSE IF sequence, and might also allow the compiler to generate more efficient code.

I suspect it might require more changes to the underlying architecture to allow new output data sets to be opened on the fly, outside of the DATA step object extensions.  If that were possible, the syntax could be even simpler:

/* Note this is a proposal, not working code */
DATA _TEMPLATE_;
SET  Original;
OUTPUT _TEMPLATE_ (DSNAME = 'B_' || B);
RUN;

With this proposal, SAS would create output data sets as needed, based on the values in the original data sets, eliminating the need to specify all of the data set names on the DATA statement.

Who knows -- maybe this will be a welcome stocking-stuffer in SAS 9.x one of these years.  ;-)

Mike Rhoads
RhoadsM1@Westat.com



-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of MikeS
Sent: Tuesday, December 22, 2009 1:52 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Multiple data sets based on a variable

I need to create separate data sets from a single data set based on a variable (column B) . Like thus:

Original Data Set 1:
19404,1987,4200439
19404,2329,5160848
19404,1987,6127673
19404,1987,7077287
19404,2329,5066985
19404,1987,6127673
19404,2329,4781209
19404,1987,0868172
19404,2329,6030901
19404,1987,9897579

Need:

Data Set 2:
19404,1987,4200439
19404,1987,6127673
19404,1987,7077287
19404,1987,6127673
19404,1987,0868172
19404,1987,9897579

Data Set 3:
19404,2329,5160848
19404,2329,5066985
19404,2329,4781209
19404,2329,6030901

There could be hundreds of data sets so I need a way to capture the unique values and write those out to separate data sets.

Thanks for any help.

Mike S.
0
Reply RHOADSM1 (795) 12/23/2009 9:09:14 PM

7 Replies
296 Views

(page loaded in 0.097 seconds)

Similiar Articles:













7/30/2012 3:54:07 AM


Reply: