COMPGROUPS.NET

### tanh or logistic


```I have a 3 class outcome and I am thinking whether I should use the
logistic function with 3 outputs (ie 1,0,0 or 0,1,0 or 0,0,1) or the
tanh function mapped onto 1 output (ie class 1 is -1, class 2 is 0,
and class 3 is 1)??

What do you think?

Dirk
```

```Dirk Nachbar wrote:
> I have a 3 class outcome and I am thinking whether I should use the
> logistic function with 3 outputs (ie 1,0,0 or 0,1,0 or 0,0,1)

Yes. Or Softmax.

> or the tanh function mapped onto 1 output (ie class 1 is -1, class 2 is 0,
> and class 3 is 1)??

No.

> What do you think?
>
> Dirk
```

```On Sep 22, 4:10 pm, Dirk Nachbar <dirk...@gmail.com> wrote:
> I have a 3 class outcome and I am thinking whether I should use the
> logistic function with 3 outputs (ie 1,0,0 or 0,1,0 or 0,0,1) or the
> tanh function mapped onto 1 output (ie class 1 is -1, class 2 is 0,
> and class 3 is 1)??
>
> What do you think?

Softmax.

If your 3-D training targets are columns (or rows)
of the unit matrix and the outputs are constrained
to sum to unity, then the outputs are consistent
estimators of the class posterior probabilities
conditional on the input.

Quite often the unipolar logistic sigmoid is used
to estimate the probabilities. However, the estimation
is not consistent because the unity-sum constraint
is not enforced during learning. Contrary to popular
thought, this learning flaw cannot be guaranteed to
be mitigated by normalizing the outputs of the
post-learning net.

See the section on output coding in the CANN FAQ.
ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat

where it warns that the outputs of post-learning
normalized logistic sigmoids are unreliable
probability estimates when the inputs are not
similar to the training data.

Note that, for some odd reason, MATLAB does not
offer the softmax option for learning because it
doesn't know that the formula for the derivative
is exactly the same as for the logistic. Therefore,
if you have access to the source codes of softmax.m
and dlogsig.m,

1. Make copies
2. Rename the copies softmaxgh.m and dsoftmaxgh.m
3. Define the derivative in softmaxgh.m as being
dsoftmaxgh.m

If you don't do this and use post-learning normalized
logistic sigmoid outputs, you can obtain incorrect
outputs when input data is not sufficiently similar
to training data.

Hope this helps.

Greg

```
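Greg's unity-sum point can be illustrated numerically. The sketch below (Python/NumPy, an illustration added here rather than anything from the thread) compares softmax outputs, which sum to one by construction, with independent logistic sigmoid outputs, which do not:

```python
import numpy as np

def softmax(n):
    # subtract the max for numerical stability; outputs sum to one by construction
    e = np.exp(n - n.max())
    return e / e.sum()

def logistic(n):
    return 1.0 / (1.0 + np.exp(-n))

n = np.array([2.0, -1.0, 0.5])   # hypothetical net inputs of three output units
p_soft = softmax(n)
p_log = logistic(n)

print(p_soft.sum())   # 1.0: the unity-sum constraint holds automatically
print(p_log.sum())    # about 1.77: independent logistics obey no such constraint
```

Dividing `p_log` by its sum after training forces the outputs to sum to one, but, as the FAQ warns, that constraint was never enforced during learning.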

```On Sep 23, 7:07 am, Greg Heath <he...@alumni.brown.edu> wrote:

> Note that, for some odd reason, MATLAB does not
> offer the softmax option for learning because it
> doesn't know that the formula for the derivative
> is exactly the same as for the logistic. Therefore,
> if you have access to the source codes of softmax.m
> and dlogsig.m,

The NETLAB toolbox for MATLAB does implement the softmax output
function, and I can strongly recommend the NETLAB toolbox over the
standard MATLAB NN toolbox.

http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/

It was written by Ian Nabney and Chris Bishop, and implements the
approach described in Bishop's excellent book.  There is also a book
describing the implementation (link on the above webpage) for those
wanting to extend the toolbox.  It is a great bit of kit and is also
free!

The Gaussian Process for Machine Learning toolbox is also a great bit
of kit that I would recommend to any user of neural networks; I have
found it much easier to get good results from GPs than MLPs for small
to medium sized datasets (less than about 4000 patterns).  It can be
downloaded from:

http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html

The book on GPs by Rasmussen and Williams, for which the software was
produced, is also excellent.
```

```Greg Heath wrote:
> On Sep 22, 4:10 pm, Dirk Nachbar <dirk...@gmail.com> wrote:
>> I have a 3 class outcome and I am thinking whether I should use the
>> logistic function with 3 outputs (ie 1,0,0 or 0,1,0 or 0,0,1) or the
>> tanh function mapped onto 1 output (ie class 1 is -1, class 2 is 0,
>> and class 3 is 1)??
>>
>> What do you think?
>
> Softmax.
>
> If your 3-D training targets are columns (or rows)
> of the unit matrix and the outputs are constrained
> to sum to unity, then the outputs are consistent
> estimators of the class posterior probabilities
> conditional on the input.
>
> Quite often the unipolar logistic sigmoid is used
> to estimate the probabilities. However, the estimation
> is not consistent because the unity-sum constraint
> is not enforced during learning. Contrary to popular
> thought, this learning flaw cannot be guaranteed to
> be mitigated by normalizing the outputs of the
> post-learning net.
>
> See the section on output coding in the CANN FAQ.
> ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat
>
> where it warns that the outputs of post-learning
> normalized logistic sigmoids are unreliable
> probability estimates when the inputs are not
> similar to the training data.

In traditional stats packages, where the model is MNL (multi-nomial
logistic), the normalisation comes from predicting N-1 of the outputs.
The N-th output is 1.0 - SUM (of the N-1 outputs).

For MNL, if this isn't done, the model is under-constrained and the
co-efficients can grow without bound during iteration of likelihood
maximisation.

I find it ironic that neural nets using a softmax layer don't have to worry.

> Note that, for some odd reason, MATLAB does not
> offer the softmax option for learning because it
> doesn't know that the formula for the derivative
> is exactly the same as for the logistic. Therefore,
> if you have access to the source codes of softmax.m
> and dlogsig.m,
>
> 1. Make copies
> 2. Rename the copies softmaxgh.m and dsoftmaxgh.m
> 3. Define the derivative in softmaxgh.m as being
>   dsoftmaxgh.m

If I had Matlab, I'd appreciate this.

Greg, do you know if Scilab works for softmax?

T.

> If you don't do this and use post-learning normalized
> logistic sigmoid outputs, you can obtain incorrect
> outputs when input data is not sufficiently similar
> to training data.
>
> Hope this helps.
>
> Greg
```
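Tomasso's N-1 parameterization can be sketched in a couple of lines (Python, purely illustrative values, not from any stats package):

```python
import numpy as np

# hypothetical fitted MNL probabilities for the first N-1 of N = 3 classes
p_head = np.array([0.2, 0.5])

# the N-th class is not modelled directly; it is implied by the unity-sum constraint
p_tail = 1.0 - p_head.sum()

p = np.append(p_head, p_tail)
print(p)        # three class probabilities
print(p.sum())  # they sum to one by construction
```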

```On Sep 23, 4:52 am, "Tomasso" <chi...@air.com.au> wrote:
> GregHeathwrote:
> > On Sep 22, 4:10 pm, Dirk Nachbar <dirk...@gmail.com> wrote:
> >> I have a 3 class outcome and I am thinking whether I should use the
> >> logistic function with 3 outputs (ie 1,0,0 or 0,1,0 or 0,0,1) or the
> >> tanh function mapped onto 1 output (ie class 1 is -1, class 2 is 0,
> >> and class 3 is 1)??
>
> >> What do you think?
>
> > Softmax.
>
> > If your 3-D training targets are columns (or rows)
> > of the unit matrix and the outputs are constrained
> > to sum to unity, then the outputs are consistent
> > estimators of the class posterior probabilities
> > conditional on the input.
>
> > Quite often the unipolar logistic sigmoid is used
> > to estimate the probabilities. However, the estimation
> > is not consistent because the unity-sum constraint
> > is not enforced during learning. Contrary to popular
> > thought, this learning flaw cannot be guaranteed to
> > be mitigated by normalizing the outputs of the
> > post-learning net.
>
> > See the section on output coding in the CANN FAQ.
> >ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat
>
> > where it warns that the outputs of post-learning
> > normalized logistic sigmoids are unreliable
> > probability estimates when the inputs are not
> > similar to the training data.
>
> In traditional stats packages, where the model is MNL (multi-nomial
> logistic), the normalisation comes from predicting N-1 of the output.
> The N-th output is 1.0 - SUM (of the N-1 outputs).
>
> For MNL, if this isn't done, the model is under-constrained and the
> co-efficients can grow without bound during iteration of likelihood
> maximisation.

Can you please post or send an example?
heath@alumni.brown.edu
I'd like to see why I have never experienced this.

> I find it ironic that neural nets using a softmax layer don't have to worry.
>
> > Note that, for some odd reason, MATLAB does not
> > offer the softmax option for learning because it
> > doesn't know that the formula for the derivative
> > is exactly the same as for the logistic. Therefore,
> > if you have access to the source codes of softmax.m
> > and dlogsig.m,
>
> > 1. Make copies
> > 2. Rename the copies softmaxgh.m and dsoftmaxgh.m
> > 3. Define the derivative in softmaxgh.m as being
> >    dsoftmaxgh.m
>
> If I had Matlab, I'd appreciate this.
>
> Greg, do you know if Scilab works for softmax?

Don't know.  Sorry,

Greg

P.S. I think I will start taking a look at MATLAB clones.
```

```Greg Heath wrote:
> On Sep 23, 4:52 am, "Tomasso" <chi...@air.com.au> wrote:
>> GregHeathwrote:
>>> On Sep 22, 4:10 pm, Dirk Nachbar <dirk...@gmail.com> wrote:
>>>> I have a 3 class outcome and I am thinking whether I should use the
>>>> logistic function with 3 outputs (ie 1,0,0 or 0,1,0 or 0,0,1) or the
>>>> tanh function mapped onto 1 output (ie class 1 is -1, class 2 is 0,
>>>> and class 3 is 1)??
>>
>>>> What do you think?
>>
>>> Softmax.
>>
>>> If your 3-D training targets are columns (or rows)
>>> of the unit matrix and the outputs are constrained
>>> to sum to unity, then the outputs are consistent
>>> estimators of the class posterior probabilities
>>> conditional on the input.
>>
>>> Quite often the unipolar logistic sigmoid is used
>>> to estimate the probabilities. However, the estimation
>>> is not consistent because the unity-sum constraint
>>> is not enforced during learning. Contrary to popular
>>> thought, this learning flaw cannot be guaranteed to
>>> be mitigated by normalizing the outputs of the
>>> post-learning net.
>>
>>> See the section on output coding in the CANN FAQ.
>>> ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat
>>
>>> where it warns that the outputs of post-learning
>>> normalized logistic sigmoids are unreliable
>>> probability estimates when the inputs are not
>>> similar to the training data.
>>
>> In traditional stats packages, where the model is MNL (multi-nomial
>> logistic), the normalisation comes from predicting N-1 of the output.
>> The N-th output is 1.0 - SUM (of the N-1 outputs).
>>
>> For MNL, if this isn't done, the model is under-constrained and the
>> co-efficients can grow without bound during iteration of likelihood
>> maximisation.
>
> Can you please post or send an example?
> heath@alumni.brown.edu
> I'd like to see why I have never experienced this.

Check out MNL in S-PLUS / R. It's described in Venables and Ripley's book.

T.

>> I find it ironic that neural net using a softmax layer don't have to worry
>>
>>> Note that, for some odd reason, MATLAB does not
>>> offer the softmax option for learning because it
>>> doesn't know that the formula for the derivative
>>> is exactly the same as for the logistic. Therefore,
>>> if you have access to the source codes of softmax.m
>>> and dlogsig.m,
>>
>>> 1. Make copies
>>> 2. Rename the copies softmaxgh.m and dsoftmaxgh.m
>>> 3. Define the derivative in softmaxgh.m as being
>>> dsoftmaxgh.m
>>
>> If I had Matlab, I'd appreciate this.
>>
>> Greg, do you know if Scilab works for softmax?
>
> Don't know.  Sorry,
>
> Greg
>
> P.S. I think I will start taking a look at MATLAB clones.
```

```Greg Heath wrote:
> On Sep 23, 4:52 am, "Tomasso" <chi...@air.com.au> wrote:
>> GregHeathwrote:
>>> On Sep 22, 4:10 pm, Dirk Nachbar <dirk...@gmail.com> wrote:
>>>> I have a 3 class outcome and I am thinking whether I should use the
>>>> logistic function with 3 outputs (ie 1,0,0 or 0,1,0 or 0,0,1) or the
>>>> tanh function mapped onto 1 output (ie class 1 is -1, class 2 is 0,
>>>> and class 3 is 1)??
>>
>>>> What do you think?
>>
>>> Softmax.
>>
>>> If your 3-D training targets are columns (or rows)
>>> of the unit matrix and the outputs are constrained
>>> to sum to unity, then the outputs are consistent
>>> estimators of the class posterior probabilities
>>> conditional on the input.
>>
>>> Quite often the unipolar logistic sigmoid is used
>>> to estimate the probabilities. However, the estimation
>>> is not consistent because the unity-sum constraint
>>> is not enforced during learning. Contrary to popular
>>> thought, this learning flaw cannot be guaranteed to
>>> be mitigated by normalizing the outputs of the
>>> post-learning net.
>>
>>> See the section on output coding in the CANN FAQ.
>>> ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat
>>
>>> where it warns that the outputs of post-learning
>>> normalized logistic sigmoids are unreliable
>>> probability estimates when the inputs are not
>>> similar to the training data.
>>
>> In traditional stats packages, where the model is MNL (multi-nomial
>> logistic), the normalisation comes from predicting N-1 of the output.
>> The N-th output is 1.0 - SUM (of the N-1 outputs).
>>
>> For MNL, if this isn't done, the model is under-constrained and the
>> co-efficients can grow without bound during iteration of likelihood
>> maximisation.
>
> Can you please post or send an example?
> heath@alumni.brown.edu
> I'd like to see why I have never experienced this.

This maths is pretty well described here:

http://www.nd.edu/~rwilliam/stats2/l92.pdf (who use the first class, M = 1, as the
reference); other treatments use the last class.

Otherwise, iteration or matrix inversion to get the co-efficients is undetermined
and the co-efficients (weights) can be all scaled by an arbitrary constant. Matrix
inversion won't work in this case, and iteration will be unstable, unless something
is done to make a reference level or to tie down the arbitrary constant.

Another way to think of this is that binary logistic regression is really predicting
TWO classes, Y = 1 and Y = 0. For a binary case, if you know P(Y=1), you
automatically know P(Y=0) (because it is 1 - P(Y=1)). However, if you try to
solve the coefficient matrix (for maximum likelihood, or whatever) you have
an ill-conditioned problem. You just solve for P(Y=1).

>> I find it ironic that neural net using a softmax layer don't have to worry
>>
>>> Note that, for some odd reason, MATLAB does not
>>> offer the softmax option for learning because it
>>> doesn't know that the formula for the derivative
>>> is exactly the same as for the logistic. Therefore,
>>> if you have access to the source codes of softmax.m
>>> and dlogsig.m,
>>
>>> 1. Make copies
>>> 2. Rename the copies softmaxgh.m and dsoftmaxgh.m
>>> 3. Define the derivative in softmaxgh.m as being
>>> dsoftmaxgh.m
>>
>> If I had Matlab, I'd appreciate this.
>>
>> Greg, do you know if Scilab works for softmax?
>
> Don't know.  Sorry,
>
> Greg
>
> P.S. I think I will start taking a look at MATLAB clones.
```
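Tomasso's under-constraint point has a simple softmax analogue: adding the same constant to every class score leaves the probabilities unchanged, so the coefficients are only determined up to such a shift unless a reference class is fixed. A small check (Python, hypothetical scores, added as an illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])   # hypothetical linear scores for three classes
c = 7.3                           # any constant

# shifting every score by c leaves the probabilities unchanged, so the
# coefficients are under-constrained without a reference level
print(np.allclose(softmax(z), softmax(z + c)))   # True

# MNL removes the redundancy by taking one class as the reference (score 0)
z_ref = z - z[-1]                 # last class as reference; z_ref[-1] == 0
print(np.allclose(softmax(z_ref), softmax(z)))   # True
```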

```On Sep 25, 7:12 pm, "Tomasso" <chi...@air.com.au> wrote:
> Greg Heath wrote:
-----SNIP
> > Can you please post or send an example?
> > he...@alumni.brown.edu
> > I'd like to see why I have never experienced this.
>
> Check out MNL in S-PLUS / R.
It's described in Venables and Ripley's book.

Is there an online reference?

However, I understand the topology.
The objective function should be cross-entropy.

What is it about the learning algorithm
(simple steepest descent?), learning goal and
stopping rules that would cause the prolonged
learning?

Typical MATLAB stopping rules are

1) The maximum number of EPOCHS (repetitions) is
reached.
2) The maximum amount of TIME has been exceeded.
3) Performance has been minimized to the GOAL.
4) The performance GRADIENT has fallen below MIN_GRAD.
5) Validation performance has increased more than
MAX_FAIL times since the last time it decreased
(when using validation).

I would think that 3 and 4 could be used to prevent
the prolonged learning phenomenon.

Anyway, I think I now see why it never happened
to me. I only use Levenberg-Marquardt, Conjugate
Gradient (large data sets) and Bayesian
Regularization learning algorithms. Moreover,
since MATLAB doesn't support the cross-entropy
objective functions, I use MSE (non regularized
learning) with the stopping goal MSEgoal
= 0.01*MSE00 where MSE00 is the MSE obtained when
each output is fixed at its prior probability
value. This yields R^2 = 1-MSE/MSE00 >= 0.99.

Nevertheless, if you can, please post or email an example.
I can write my own cross-entropy objective function
routine and try it out.

> >> I find it ironic that neural net using a softmax
> >> neural net has zero hidden neurons, it is an issue!

Again, it depends on algorithm specifics and
stopping rules.

Greg
```
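Greg mentions writing his own cross-entropy objective routine. For one-hot targets and softmax outputs that objective is just the mean negative log-likelihood; a minimal sketch (Python, made-up numbers, not Greg's actual routine):

```python
import numpy as np

def softmax(n):
    # column-wise softmax: one column of net inputs per sample
    e = np.exp(n - n.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(T, A, eps=1e-12):
    # mean negative log-likelihood; T holds one-hot target columns
    return -np.mean(np.sum(T * np.log(A + eps), axis=0))

# three classes, four samples; targets are columns of the unit matrix
T = np.eye(3)[:, [0, 1, 2, 0]]
N = np.array([[ 2.0, -1.0, 0.0, 3.0],
              [ 0.0,  2.0, 0.5, 0.1],
              [-1.0,  0.0, 1.5, 0.2]])
A = softmax(N)

print(cross_entropy(T, A))   # positive for imperfect outputs
print(cross_entropy(T, T))   # essentially zero when outputs equal the targets
```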

```Gavin Cawley <gavincawley@googlemail.com> wrote in message <26aa8061-0221-408d-aecf-7250d24568d5@q2g2000vbk.googlegroups.com>...
> On Sep 23, 7:07 am, Greg Heath <he...@alumni.brown.edu> wrote:
>
> > Note that, for some odd reason, MATLAB does not
> > offer the softmax option for learning because it
> > doesn't know that the formula for the derivative
> > is exactly the same as for the logistic. Therefore,
> > if you have access to the source codes of softmax.m
> > and dlogsig.m,

Actually the Neural Network Toolbox does implement SOFTMAX derivatives.  All the transfer functions have derivatives and are trainable with the backprop algorithms.

The derivative is quite different from the logistic sigmoid, as the sigmoid is an element-wise operation (so for net input and output vectors with N elements the derivative has N elements) whereas each output element of SOFTMAX is dependent on each input element (the derivative has NxN elements).
```
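The NxN structure described above is the softmax Jacobian: Ai*(1 - Ai) on the diagonal and -Ai*Aj off it. A numerical check (Python sketch, not Toolbox code):

```python
import numpy as np

def softmax(n):
    e = np.exp(n - n.max())
    return e / e.sum()

def softmax_jacobian(a):
    # J[i, j] = a_i*(1 - a_i) if i == j, else -a_i*a_j
    return np.diag(a) - np.outer(a, a)

n = np.array([0.3, -1.2, 0.8, 0.1])
a = softmax(n)
J = softmax_jacobian(a)

# central-difference check of the full N x N Jacobian
eps = 1e-6
J_fd = np.empty_like(J)
for j in range(n.size):
    d = np.zeros_like(n)
    d[j] = eps
    J_fd[:, j] = (softmax(n + d) - softmax(n - d)) / (2 * eps)

print(np.allclose(J, J_fd, atol=1e-8))   # True
```

The diagonal entries reproduce the element-wise logistic formula a.*(1-a); the off-diagonal entries -Ai*Aj are exactly what an element-wise derivative misses.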

```On Sep 30, 1:10 am, "Mark Hudson Beale" <matlabcentral....@mhbinc.com>
wrote:
> Gavin Cawley <gavincaw...@googlemail.com> wrote in message <26aa8061-0221-408d-aecf-7250d24568d5@q2g2000vbk.googlegroups.com>...
> > On Sep 23, 7:07 am, Greg Heath <he...@alumni.brown.edu> wrote:
>
> > > Note that, for some odd reason, MATLAB does not
> > > offer the softmax option for learning because it
> > > doesn't know that the formula for the derivative
> > > is exactly the same as for the logistic. Therefore,
> > > if you have access to the source codes of softmax.m
> > > and dlogsig.m,
>
> Actually the Neural Network Toolbox does implement SOFTMAX derivatives.  All the
> transfer functions have derivatives and are trainable with the backprop algorithms.

In what version did you implement dsoftmax?
Why isn't it mentioned in the documentation of newff etc.?
In ver 6.5.1, you have

% Mark Beale, 11-31-97
% Copyright 1992-2002 The MathWorks, Inc.
% $Revision: 1.8 $

if nargin < 1, error('Not enough arguments.'); end

% FUNCTION INFO
if isstr(n)
  switch (n)
    case 'deriv', a = '';  % GEH: can just change to a = dlogsig;
    case 'name', a = 'Soft Max';
    case 'output', a = [0 1];
    case 'active', a = [-inf inf];
    case 'type', a = 2;
    otherwise, error('Unrecognized code.')
  end
  return
end

> The derivative is quite different from the logistic sigmoid, as the sigmoid
> is an element-wise operation (so for net input and output vectors with N
> elements the derivative has N elements) whereas each output element of
> SOFTMAX is dependent on each input element (the derivative has NxN elements).

If size(output) = [N 1], the derivative w.r.t. a previous
layer weight has the same size... the product of the
above mentioned NxN matrix and the Nx1 derivative
of the softmax input w.r.t. the weight.

Although the analytic chain rule calculation is messy,
when the smoke clears, it is the same equation as
dlogsig, i.e.,

d = a.*(1-a);

Hope this helps.

Greg

```

```On Sep 30, 1:32 pm, Greg Heath <he...@alumni.brown.edu> wrote:
> On Sep 30, 1:10 am, "Mark Hudson Beale" <matlabcentral....@mhbinc.com>
> wrote:
>
> > Gavin Cawley <gavincaw...@googlemail.com> wrote in message <26aa8061-0221-408d-aecf-7250d24568d5@q2g2000vbk.googlegroups.com>...
> > > On Sep 23, 7:07 am, Greg Heath <he...@alumni.brown.edu> wrote:
>
> > > > Note that, for some odd reason, MATLAB does not
> > > > offer the softmax option for learning because it
> > > > doesn't know that the formula for the derivative
> > > > is exactly the same as for the logistic. Therefore,
> > > > if you have access to the source codes of softmax.m
> > > > and dlogsig.m,
>
> > Actually the Neural Network Toolbox does implement SOFTMAX derivatives.  All the
> > transfer functions have derivatives and are trainable with the backprop algorithms.
>
> In what version did you implement dsoftmax?
> Why isn't it mentioned in the documentation of newff etc.?
> In ver 6.5.1, you have
>
> % Mark Beale, 11-31-97
> % Copyright 1992-2002 The MathWorks, Inc.
> % $Revision: 1.8 $
>
> if nargin < 1, error('Not enough arguments.'); end
>
> % FUNCTION INFO
> if isstr(n)
>   switch (n)
>     case 'deriv', a = '';  % GEH: can just change to a = dlogsig;
>     case 'name', a = 'Soft Max';
>     case 'output', a = [0 1];
>     case 'active', a = [-inf inf];
>     case 'type', a = 2;
>     otherwise, error('Unrecognized code.')
>   end
>   return
> end
>
> > The derivative is quite different from the logistic sigmoid, as the sigmoid
> > is an element-wise operation (so for net input and output vectors with N
> > elements the derivative has N elements) whereas each output element of
> > SOFTMAX is dependent on each input element (the derivative has NxN elements).
>
> If size(output) = [N 1], the derivative w.r.t. a previous
> layer weight has the same size... the product of the
> above mentioned NxN matrix and the Nx1 derivative
> of the softmax input w.r.t. the weight.
>
> Although the analytic chain rule calculation is messy,
> when the smoke clears, it is the same equation as
> dlogsig, i.e.,
>
> d = a.*(1-a);
>

OK. I see the change was made in 2006a.

http://www.mathworks.com/help/toolbox/nnet/rn/rn11.html

Greg
```
Reply heath (3991) 9/30/2010 5:45:08 PM

```> > Although the analytic chain rule calculation is messy,
> > when the smoke clears, it is the same equation as
> > dlogsig, i.e.,
> >
> > d = a.*(1-a);

You are right the SOFTMAX derivative of output i with respect
to net input i is the same as LOGSIG:

dAi/dNi = Ai * (1 - Ai)

Unlike LOGSIG, the derivatives of each SOFTMAX output i with
respect to each net input j, where j is not equal to i, are not zero.
They are:

dAi/dNj = -Ai*Aj,  for each i not equal to j

Can I ask what version you are working with?
```