
### Heckman Selection Model


```It seems that there are at least two ways of fitting a Heckman
selection model in SAS. One is a two-step procedure: compute the
inverse Mills ratio in the first step and include it as a regressor
in the second step. The other estimates the two equations jointly
and reports a correlation coefficient, rho.

Does anyone know how they differ, and under what circumstances each
should be used?

Thanks a lot,
Cathy

```
Reply ly_cat (8) 3/30/2007 7:39:05 PM

```
Heckman showed that, under the bivariate normal assumption, the
two-step approach converges to the same limit as ML in large samples.
The two-step approach was much easier to compute back in the late
1970s. That is the only difference I see.

HTH

```
Reply shiling99 (640) 3/30/2007 8:44:18 PM
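
As a rough sketch of why the inverse Mills ratio appears in the second step, the conditional mean of y in the selected sample can be written out. The notation is an assumption chosen to match the model setup posted later in this thread: w and x are the regressors, gamma and beta the coefficients, (u, e) bivariate normal with Var(u) normalised to 1, Var(e) = sigma^2 and corr(u, e) = rho.

```latex
\begin{aligned}
z^* &= w\gamma + u, \qquad z = \mathbf{1}[z^* > 0],\\
y &= x\beta + e \quad \text{(observed only when } z = 1\text{)},\\
E[y \mid x, w, z = 1] &= x\beta + E[e \mid u > -w\gamma]
  = x\beta + \rho\sigma\,\lambda(w\gamma),\\
\lambda(a) &= \frac{\phi(a)}{\Phi(a)}.
\end{aligned}
```

Under these assumptions, step one estimates gamma by probit, and step two regresses y on x and lambda evaluated at the fitted index; the coefficient on lambda estimates rho*sigma. OLS on the selected sample alone leaves out the lambda term, which is where the bias comes from when rho is not zero.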

```
Do you mean that they are identical? How about the selection of
variables in the two equations? I tried to estimate the two equations
with identical independent variables, but got a rho of 1. Is that
because the two sets of variables cannot be identical?

Thanks,
Cathy

```
Reply ly_cat (8) 3/31/2007 4:25:47 AM

```On Mar 31, 12:25 am, "cathy" <ly_...@hotmail.com> wrote:
> Do you mean that they are identical?
It means that the estimates will converge to the same limit (the true
parameter) in large samples.

>How about the selection of variables in the two equations.

I am not sure what you are referring to. Here is the standard Heckman
incidental selection model setup:

Selection mechanism: z* = gamma*w + u
                     z = 1 if z* > 0; otherwise z = 0
Regression model:    y = beta*x + e
                     y is observed only if z = 1
                     (u, e) are bivariate normal

>I tried to estimate two equations with
> identical independent variables, but got a rho of 1. Is that because
> they cannot be identical?

The two equations are linked through the errors u and e, which are
assumed to be bivariate normal. Rho is the correlation between u and
e, not between w and x. It should be fine for w and x to be the same,
although identification then rests only on the nonlinearity of the
inverse Mills ratio, so the estimates can be less stable.

There is no problem in estimating the selection equation, but the
regression equation will be biased if estimated by OLS, because it
uses only the subsample with z=1.

Here is an example in which x is set to the same as w.

HTH.

* create bivariate normal errors with sigma1=sigma2=1 and correlation rho;
%let rho=0.8;
%let size=10000;

data binormal;
  rho=&rho;
  a1=sqrt((1+rho)/2);
  a2=sqrt((1-rho)/2);
  do i=1 to &size;
    rd1=rannor(12390);
    rd2=rannor(12390);
    e1=a1*rd1+a2*rd2;
    e2=a1*rd1-a2*rd2;
    output;
  end;
run;

* verify the sample correlation of the simulated errors;
proc corr data=binormal;
  var e1 e2;
run;

data simu_data;
  set binormal;
  * participation equation;
  w=rannor(12340);
  z=(1+2*w>e1);
  * outcome equation with x set equal to w;
  * x=rannor(12340);
  x=w;
  * y is observed only when z=1;
  if z=1 then y=3+3*x+e2;
  else y=.;
run;

title '>>>>selection biased results with OLS<<<';
proc reg data=simu_data;
  model y=x;
  where y ne .;
run;
quit;

title '2-step approach 1-probit model 2-inverse mills ratio';
* step 1 is a probit for z, so a MODEL statement with LINK=PROBIT is needed;
proc logistic data=simu_data descending;
  model z = w / link=probit;
  output out=simu_data2 xbeta=xbeta;
run;

* calculate inverse mills ratio from the fitted probit index;
data simu_data2;
  set simu_data2;
  imr=pdf('NORMAL',xbeta)/cdf('NORMAL',xbeta);
run;

* step 2 is OLS on the selected sample with imr added;
* these OLS standard errors ignore that imr is itself estimated;
proc reg data=simu_data2;
  model y=x imr;
  where y ne .;
run;
quit;

title 'results from heckman approaches ---QLIM';
proc qlim data=simu_data;
  model z = w / discrete(d=normal);
  model y = x / select(z=1);
run;

title 'results from heckman approaches ---nlmixed';
proc nlmixed data=simu_data;
  bounds s >0, -1<r<1;
  parms a=2 b=2 c=1 d=1 s=1 r=0.5;
  * selection function;
  xbeta=c+d*w;
  p=probnorm(xbeta);
  * log likelihood contribution of each observation;
  if z=0 then l=log(1-p);
  else if z=1 then do;
    e=y-(a+b*x);
    l2=(1/(sqrt(2*3.1415927)*s))*exp(-(e**2)/(2*s**2));
    l3=probnorm((xbeta+r*e/s)/sqrt(1-r**2));
    l=log(l2)+log(l3);
  end;
  model z ~ general(l);
run;

title 'results from heckman approaches ---model';
proc model data=simu_data;
  bounds s >0;
  * the correlation parameter is named r to match its use in l3 below;
  parms a=2 b=2 c=1 d=1 s=1 r=0.5;
  * selection function;
  xbeta=c+d*w;
  p=probnorm(xbeta);
  * calculate the likelihood;
  z=p;
  if z=0 then l=log(1-p);
  else if z=1 then do;
    e=y-(a+b*x);
    l2=(1/(sqrt(2*3.1415927)*s))*exp(-(e**2)/(2*s**2));
    l3=probnorm((xbeta+r*e/s)/sqrt(1-r**2));
    l=log(l2)+log(l3);
  end;
  * negative log likelihood;
  l=-1*l;
  errormodel z ~ general(l);
  fit z / converge=1e-8;
run;
quit;

```
Reply shiling99 (640) 4/2/2007 3:14:14 PM
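
For reference, the log-likelihood that the NLMIXED step in the post above appears to code by hand can be written as follows. This is a sketch under the post's own setup (Var(u) = 1, Var(e) = sigma^2, corr(u, e) = rho); the symbol-to-variable mapping is my reading of the code: `xbeta` is the index w*gamma, `e` is the residual y - x*beta, `s` is sigma, `r` is rho, `l2` is the normal density term and `l3` is the Phi term.

```latex
\ell(\beta,\gamma,\sigma,\rho)
  = \sum_{i:\,z_i = 0} \log\bigl[1 - \Phi(w_i\gamma)\bigr]
  + \sum_{i:\,z_i = 1} \left\{
      \log\!\left[\frac{1}{\sigma}\,
        \phi\!\left(\frac{y_i - x_i\beta}{\sigma}\right)\right]
      + \log \Phi\!\left(
        \frac{w_i\gamma + \rho\,(y_i - x_i\beta)/\sigma}{\sqrt{1-\rho^2}}
      \right)
    \right\}
```

One thing to keep in mind when reading the output: the simulation sets z = 1 when 1 + 2w - e1 > 0, so the selection error in this parameterisation is u = -e1. Since corr(e1, e2) is 0.8 by construction, the correlation between u and the outcome error is -0.8, and the estimated rho should come out near -0.8 rather than +0.8.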

```Hi,

my 2 cents...

Did you see that PROC QLIM can fit the Heckman model directly?

Best regards,

Stephane COLAS

**************************
Société Datametric

Our site
http://www.datametric.fr
**************************

```
Reply scolas (213) 4/2/2007 3:47:50 PM

```
Shiling has done a great job of discussing the classic Heckman model and
showing how it works.

But I wanted to add something else. In my not-so-humble opinion,
Heckman gets used in a lot of situations where it is not warranted. But
(of course) you can't check the assumptions, since you don't have the
unsampled data. In particular, using it in survey sampling situations is
popular in some fields, but rests on assumptions that are probably not
true. The biggest one is the assumption that the 'refused to answer'
group even comes from the same population as those who answered the
question(s).

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

```
Reply davidlcassell (5579) 4/4/2007 5:37:07 AM
