Oslo:
A few more particulars on the code I sent.
1. What if , for a given column, half of the values are missing, and
the other half are 1? Is that a constant? Or is it 50% vs 50%, just
as if it had been half 1's and half 2's? In the program I sent it will
be treated as a constant, because all the non-missing values are the
same. If you would rather treat it as 50/50, then change:
if min(of obs_: )=max(of obs_: ) then delete;
to
if nmiss(of obs_:) = 0 and min(of obs_: )=max(of obs_: ) then delete;
2. When you say to delete all columns with 95% a single value, my
code assumes that menas 95% of ALL values, not just 95% of non-missing
values. If you want 95% of non-missing values, change:
if max(of frq_:) >= .95*dim(obs) then delete;
to
if frq_dot >= .95*dim(obs) then delete;
else if frq_dot ^= max(of frq_:) and max(of fr
Regards,
Mark
From: oslo [mailto:hokut1@yahoo.com]
Sent: Wednesday, December 02, 2009 10:16 PM
To: Keintz, H. Mark
Cc: SAS-L@LISTSERV.UGA.EDU; hokut1@yahoo.com
Subject: Re: remove constant columns and small frequencies
Dear Kenitz;
Thanks so much. I appreciate for your response and your helps. Unfotunately I am not at my office to have chance to try for what you sugessted. I would be very happy if it work.
Regards,
Oslo
________________________________________
From: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
To: SAS-L@LISTSERV.UGA.EDU
Sent: Wed, December 2, 2009 9:06:00 PM
Subject: Re: remove constant columns and small frequencies
Oslo:
I think the trick to doing what you need is to transpose
your data, so that you can filter based on the values across
a row instead of down a column. After filtering out the
unwanted rows, you can transpose the remaining rows back
to columns.
The below is an example. The major assumption in the program
is that all your variables only take the values 0, 1, or 2.
I also assumed you wanted to treat missing values the same
as valid values. I modified the 95% rule to exclude columns
with one value exactly 95% of the time.
Regards,
Mark
** First make a dataset with 100 rows and 3 variables **;
data have;
do _n_=1 to 100;
x1=1; ** X1 is constant;
x2=mod(_n_,3); ** X2 has equal freqs of 0,1,2 **;
x3=1; ** X3 =1 95% of the time **;
if _n_ <= 2 then x3=0;
else if _n_ >= 98 then x3=2;
x4=.; ** X4 is all missing;
x5=.; ** X5 is 95% missing;
if 50 <= _n_ <=54 then x5=1;
output;
end;
run;
** Transposed dataset will have vars _NAME_ plus OBS1--OBS100 **;
proc transpose data=have out=thave prefix=obs_; run;
** Now filter out unwanted rows **;
data thave2;
set thave;
**Drop rows with all missing **;
if n(of obs_:) = 0 then delete;
** Filter out other constant rows **;
if min(of obs_: )=max(of obs_: ) then delete;
frq_dot=0; ** Frequency of missing values **;
** Make array for holding frequency of 0, 1, or 2 **;
array frq {0:2} frq_0-frq_2;
do v=0 to 2; frq{v}=0; end;
** Review entire row, and generate frequencies **;
array obs {*} obs_: ;
do n=1 to dim(obs);
if obs{n}=. then frq_dot+1;
else frq{obs{n}}+1;
end;
** If any frequency 95% or more delete the row **;
if max(of frq_:) >= .95*dim(obs) then delete;
drop frq: v n;
run;
** Transpose back the data with desired rows **;
proc transpose data=thave2 out=want (drop=_name_) name=_name_; run;
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> oslo
> Sent: Wednesday, December 02, 2009 8:21 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: remove constant columns and small frequencies
>
> Dear All;
>
> I have spent a couple hours to overcome with problem before decide to
> write you. Unfortunately I had no chance except asking SAS-L users. I
> have a data set is from 300 individuals which has 45000 (x1-x45000)
> columns (input) and each column consist of only 0, 1, 2 and missing
> values (data are from a molecular study). I need to do two things to
> clean the data;
>
> 1) I need to remove all input variables which has constant value
> (for example only has 1 OR only has 2 OR only has 0). There are about
> 2800 columns like this.
> 2) I need remove the input variables which the frequency of any
> value (0. 1 OR 2) accounts more than 95%. (For example in the variable
> x348, the percent of 1 is 3.6% and percent of 2 is 96.4%. So I have to
> remove x348 from data set).
> I would be greatly appreciated having your helps.
>
> Best regards,
>
> Oslo
>
>
|