Backpropagation Network Question Regarding Weights/Biases

Dear all,

My knowledge of neural networks is probably just a month old. I have a
question regarding the training of the network using the
backpropagation (BP) algorithms in Matlab R14 (SP1).

I have a fixed training set, and none of my settings for the BP network
change. Training the network on that set generates weights and biases
that depend on my target outputs. My question: if I reinitialize the
same network with identical settings and retrain with the same training
set and output targets (nothing has changed at all), I get different
weights and biases.

1. Is this normal for a BP algorithm and why?

2. Is there some random number in the code of the BP algorithm in
Matlab that causes this difference in generating the weights and
biases?

My impression was that the network is just a network of mathematical
code which will generate the same solution if I present it with the
same numerical inputs.

I used the default settings that Matlab has for the BP algorithms and
tried changing the layer initialization function from "initnw"
(Nguyen-Widrow) to "initwb" (by weight and bias), but the weights still
differ from run to run. The default settings are:

net.inputWeights{1,1}.initFcn = ''
net.biases{1}.initFcn = ''

which baffles me. I tried changing them to "initzero", but the final
weights and biases are still just as random.

This doesn't happen with the "newlin" network, though. There the
weights and biases come out the same on every run.

Thanks for reading my post and your comments.

Regards,
Edmund

Reply galusdomestica (1) 8/5/2005 8:38:27 AM

Edmund wrote:
> Dear all,
>
> My knowledge of neural networks is probably just a month old. I have a
> question regarding the training of the network using the
> backpropagation (BP) algorithms in Matlab R14 (SP1).
>
> I have a fixed training set and all my settings for the BP network does
> not change. Now using that training set to train my BP network it will
> generate the weights and biases depending on my target outputs. The
> question is if I reinitialize the same network with all the similar
> settings and redo my training with the same training set and output
> targets(nothing has changed at all), I seem to be getting different
> weights and biases.
>
> 1. Is this normal for a BP algorithm and why?
>
> 2. Is there some random number in the code of the BP algorithm in
> Matlab that causes this difference in generating the weights and
> biases?
>
> My impression was that the network is just a network of mathematical
> code which will generate the same solution if I present it with the
> same numerical inputs.
>
> I used the default setting that Matlab has for the BP algorithms and
> tried changing the layer initialization function from "initnw"
> (Nguyen-Widrow) to "initwb" (By weight and bias) but results are still
> the same. The default settings are:
>
> net.inputWeights{1,1}.initFcn = ''
> net.biases{1}.initFcn = ''
>
> which baffles me. I tried changing that to "initzero" but still no
> change in the randomness in the final weights and biases.
>
> This doesn't happen with the "newlin" network though. Weights and
> biases are consistent with the training set here.

You will get the same results if you reset the state of the random
number generators.

help randn
help rand

Hope this helps.

Greg

Reply Greg 8/9/2005 9:39:33 AM


Thanks a lot Greg!

Before setting up the network, I just needed to initialize the state of
the random number generator (rand('state',0)), and I got the same
results for my BP network.
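
A minimal sketch of that fix, assuming the legacy rand('state',k) syntax
used in this thread. Plain rand stands in for the toolbox's weight
initialization, and the 20x57 matrix size is made up for illustration.

```matlab
% Reset the uniform generator, then draw stand-in "initial weights".
% (randn keeps its own state; reset it with randn('state', 0) if needed.)
rand('state', 0);
W1 = rand(20, 57);      % hypothetical 20x57 weight matrix

% Reset to the same state and draw again: the numbers are identical.
rand('state', 0);
W2 = rand(20, 57);

% Without a reset, the next draw continues the stream and differs.
W3 = rand(20, 57);

sameAfterReset = isequal(W1, W2);
sameNoReset    = isequal(W1, W3);
```

The reset has to happen before the network is created or initialized,
since that is where the random draws are made.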

Another question:
Why does Matlab, for the BP network, create this randomness in the
output results? Is it to show that very different results are possible
when the initial values differ, or is it simply that BP networks in
general SHOULD generate random output results?

Thanks.

Regards,
Edmund

Reply Edmund 8/10/2005 7:20:57 AM

Edmund wrote:
> Thanks a lot Greg!
>
> Before setting up the network I just need to initialize the state of
> random number (rand('state',0)) and I got the same results for my BP
> network.

Yes. Also, you can replace '0' by any number from 1 to 2^31.
Sometimes I use 1e9*sum(clock) for a "random" initialization.

> Another question :
> Why Matlab, for the BP network, creates this randomness in the output
> results? Is it to show that it is possible to get very different
> results if the initial values are different or simply in general that
> BP networks SHOULD generate random output results?

Nothing peculiar to either MATLAB or NNs. For many nonlinear
optimization problems, without specific prior information, the
starting point should be random. Moreover, since most minimization
algorithms just search for local minima, multiple runs with different
random initializations are sometimes required to get a satisfactory
result.
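
The point about local minima can be sketched on a toy problem. The
one-dimensional "error surface" below is an assumption chosen for
illustration (it is not a neural network); it has two minima, near
w = -1.30 and w = 1.13, and plain gradient descent ends in one or the
other depending on the random starting point.

```matlab
% Toy nonconvex error surface with two minima (illustrative only).
f  = @(w) w.^4 - 3*w.^2 + w;
df = @(w) 4*w.^3 - 6*w + 1;

rand('state', 1);                 % reproducible set of starting points
nStarts = 10;
w = 4*rand(1, nStarts) - 2;       % random starts in [-2, 2]

lr = 0.01;                        % step size
for it = 1:2000
    w = w - lr*df(w);             % gradient descent, all starts at once
end

cost = f(w);                      % final cost reached from each start
[bestCost, bestIdx] = min(cost);  % keep the best of the runs
bestW = w(bestIdx);
```

Each start converges to whichever minimum lies in its basin, which is
why several random restarts are needed before trusting the best result.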

Hope this helps.

Greg

Reply Greg 8/10/2005 9:38:13 PM

Greg Heath wrote:
> Edmund wrote:
>
> > Another question :
> > Why Matlab, for the BP network, creates this randomness in the output
> > results? Is it to show that it is possible to get very different
> > results if the initial values are different or simply in general that
> > BP networks SHOULD generate random output results?
>
> Nothing peculiar to either MATLAB or NNs. For many nonlinear
> optimization problems, without specific prior information, the
> starting point should be random. Moreover, since most minimization
> algorithms just search for local minima, multiple runs with different
> random initializations are sometimes required to get a satisfactory
> result.
>
> Hope this helps.
>
> Greg

Thanks again, Greg for the invaluable input.

I see what you mean now. I was trying to hold the initial conditions
constant so I could compare "equally" the performance of different
BP algorithms and different numbers of neurons in the hidden layer. But
from what you say, I think a better approach is to leave the initial
conditions random, run a fixed number of passes through the same
network (with a fresh random initialization on each pass), and then
compute the average error between the output and target results. This
definitely takes longer, depending on the number of passes I set, but I
believe it is the better approach.

Regards,
Edmund

Reply Edmund 8/11/2005 1:12:43 PM

"Edmund" <galusdomestica@gmail.com> wrote in message 
news:1123765963.738754.187170@g47g2000cwa.googlegroups.com...
>
> Greg Heath wrote:
>> Edmund wrote:
>>
>> > Another question :
>> > Why Matlab, for the BP network, creates this randomness in the output
>> > results? Is it to show that it is possible to get very different
>> > results if the initial values are different or simply in general that
>> > BP networks SHOULD generate random output results?
>>
>> Nothing peculiar to either MATLAB or NNs. For many nonlinear
>> optimization problems, without specific prior information, the
>> starting point should be random. Moreover, since most minimization
>> algorithms just search for local minima, multiple runs with different
>> random initializations are sometimes required to get a satisfactory
>> result.
>>
>> Hope this helps.
>>
>> Greg
>
> Thanks again, Greg for the invaluable input.
>
> I see what you mean now. I was trying to set the initial conditions
> constant so I could compare "equally" the performance between different
> BP algorithms and different number of neurons in the hidden layer. But
> hearing what you say now, I think a better approach is to leave the
> initial conditions random and have a specific number of passes going
> through the same network (with random initialization of the initial
> conditions with each pass) and then compute the average error between
> output and target results. Although this definitely takes a longer
> time, depending on the number of passes I set, it will be the better
> approach, I believe.
>
> Regards,
> Edmund
>
To make the results more comparable you could select 10 'seed' numbers for 
the 'randn' command. Then, run each algorithm with each one of the selected 
seed numbers and then compare the average error between them.

Regards

Vasileios 


Reply vasilisp 8/11/2005 1:18:54 PM

vasilisp@otenet.gr wrote:
> To make the results more comparable you could select 10 'seed' numbers for
> the 'randn' command. Then, run each algorithm with each one of the selected
> seed numbers and then compare the average error between them.
>
> Regards
>
> Vasileios


I think that's an excellent idea, Vasileios. In other words, we are
using the same set of random numbers for each network. I was a little
confused at first by the terms 'seed' and 'state' for the rand and
randn functions in Matlab, so I did a little research of my own.
rand('seed') (or randn('seed')) uses the Matlab 4 uniform generator,
while rand('state') (or randn('state')) uses the Matlab 5 generator,
which actually covers a larger range of random numbers.

Also, I did a little experiment of my own and found that if I
initialize using rand('state'), Matlab changes the state every time it
generates a random number and the seed does not change. If I
initialize using rand('seed'), the seed changes every time a random
number is generated while the state remains unchanged. In other words,
Matlab uses whichever generator you initialize at the start.

So instead of using Vasileios' idea, I initialize the initial
conditions for the network by setting the "state" (I chose the Matlab 5
generator for the larger range) to its initial value
(rand('state',0)). Then I run a 'for' loop of 10 passes (I hope 10 is
enough for a good average), and it will consider the first 10 random
numbers from the initial state. This allows a more 'comparable'
approach between networks as well as an easier way to keep track of
which random numbers were used. Hope it's good.

Regards,
Edmund

Reply Edmund 8/11/2005 2:53:45 PM

Edmund wrote:
> vasilisp@otenet.gr wrote:
> > To make the results more comparable you could select 10 'seed' numbers for
> > the 'randn' command. Then, run each algorithm with each one of the selected
> > seed numbers and then compare the average error between them.
>
> I think that's an excellent idea, Vasileios. In other words, we are
> using the same set of random numbers for each network.

No.

The constraint of using the same initial states for different
architectures is a red herring. Forget it.

Also forget about averaging all MSEs as a goodness criterion.

For each architecture

1. Loop over 30 to 100 pseudorandom initializations.
2. Use only "integer states". See my Feb 21 post for one
   way to do this

Optimal Number of Neural Network Hidden Nodes using "integer" RNG ...

3. Save the integer state index and MSE for each trial
4. Plot the MSE histogram and resulting CDF. Then, for a given MSE
   you can see what % of the trials yielded better performances.
5. I think the best criterion for comparing different architectures
   is the CDF curve.
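
The five steps above can be sketched as follows. This is only an
illustration under stated assumptions: the toy data, the hand-written
one-hidden-layer tanh network, the learning rate, and the epoch count
are all made up, and the Feb 21 scheme itself is not reproduced here.

```matlab
% Toy data: 1 input, 1 output, 50 samples (illustrative only).
x = linspace(-1, 1, 50);
t = sin(pi*x);
N = numel(x);

H = 4;               % hidden nodes for the architecture under test
nTrials = 30;        % lower end of the 30 to 100 range
nEpochs = 200;
lr = 0.05;

stateIdx = zeros(nTrials, 1);
mseTrial = zeros(nTrials, 1);

for k = 1:nTrials
    rand('state', k);                       % integer state for this trial
    W1 = 2*rand(H, 1) - 1;  b1 = 2*rand(H, 1) - 1;
    W2 = 2*rand(1, H) - 1;  b2 = 2*rand - 1;

    for ep = 1:nEpochs                      % batch gradient descent on MSE/2
        h = tanh(W1*x + b1*ones(1, N));     % H x N hidden activations
        e = W2*h + b2 - t;                  % 1 x N output errors
        dh  = (W2'*e).*(1 - h.^2);          % backpropagated error, H x N
        dW2 = (e*h')/N;   db2 = mean(e);
        dW1 = (dh*x')/N;  db1 = mean(dh, 2);
        W1 = W1 - lr*dW1;  b1 = b1 - lr*db1;
        W2 = W2 - lr*dW2;  b2 = b2 - lr*db2;
    end

    h = tanh(W1*x + b1*ones(1, N));
    e = W2*h + b2 - t;
    stateIdx(k) = k;                        % step 3: save state index ...
    mseTrial(k) = mean(e.^2);               % ... and MSE for each trial
end

% Step 4: empirical CDF, the fraction of trials with MSE <= each value.
[mseSorted, order] = sort(mseTrial);
cdfVals = (1:nTrials)'/nTrials;
bestState = stateIdx(order(1));             % state that reproduces the best trial

% Fraction of trials that beat a given MSE level (here the median).
level = median(mseTrial);
fracBetter = mean(mseTrial < level);

% To view the curve: stairs(mseSorted, cdfVals)
```

Repeating this for each architecture gives one CDF curve per
architecture, and any single trial can be rerun from its saved integer
state.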

> I was a little
> confused at the start with the terms 'seed' and 'state' for the rand
> and randn functions in Matlab so I did a little research on my own. The
> rand('seed') (or randn('seed')) uses the Matlab 4 uniform generator
> while the rand('state') (or randn('state')) uses the Matlab 5 generator
> which actually covers a larger range of random numbers.

There is no relationship between 'state' and 'seed'.

> Also, I did a little experiment on my own and found that if I
> initialize using rand('state'), Matlab changes the state everytime it
> generates a random number and the seed does not change. And if I
> initialize using rand('seed'), the seed changes everytime a random
> number is generated while the state remains unchanged. So in other
> words, Matlab uses whatever number generator you start initializing at
> the start.

Use the MATLAB 5 generator and see my July 2004 thread

State of the random number generator

> So instead of using Vasileios' idea, I actually initialize my initial
> conditions for the network by setting the "state" (I choose to use the
> Matlab 5 generator for the larger range) to its initial
> (rand('state',0)). Then I do a 'for' loop for 10 passes (I hope 10 is
> enough for a good average) and it will consider the first 10 random
> numbers from the initial state. This allows a more 'comparable'
> approach between networks as well as an easier way to keep track of
> what random numbers were being used. Hope it's good.

1. 10 is not enough
2. Only your 1st run is an integer state. Better to use a
   scheme like the one I outlined on Feb 21.
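
A small sketch of why only the first pass starts from an integer state,
assuming the legacy rand('state') syntax. It is not the Feb 21 scheme,
which is not reproduced in this thread; it only shows that the state
becomes a vector once the generator has been used.

```matlab
rand('state', 0);          % integer state: easy to record as "0"
a = rand(1, 5);            % first pass; drawing numbers advances the state

s = rand('state');         % the current state is now a vector, not an integer
b = rand(1, 5);            % second pass continues from that vector state

rand('state', s);          % restoring the saved vector replays the second pass
bReplay = rand(1, 5);

rand('state', 1);          % alternative: give each pass its own integer state
c = rand(1, 5);
```

Recording one integer per trial is simpler than storing a state vector
for every pass after the first.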

Hope this helps.

Greg

Reply Greg 8/12/2005 3:48:51 AM

Greg Heath wrote:
> The constraint of using the same initial states for different
> architectures is a red herring. Forget it.
>
> Also forget about averaging all MSEs as a goodness criterion.

I am not averaging MSEs. I am finding the error between the output and
the actual targets and then averaging that error between trials/passes.
Is this bad too?


>
> For each architecture
>
> 1. Loop over 30 to 100 pseudorandom initializations.
> 2. Use only "integer states". See my Feb 21 post for one
>    way to do this
>
> Optimal Number of Neural Network Hidden Nodes using "integer" RNG ...
>
> 3. Save the integer state index and MSE for each trial
> 4. Plot the MSE histogram and resulting CDF. Then, for a given MSE
>    you can see what % of the trials yielded better performances.
> 5. I think the best criterion for comparing different architectures
>    is the CDF curve.

I have read your Feb 21 post, and thanks for it. I will use the integer
state scheme suggested there. By the way, what does CDF stand for?
Cumulative Distribution Function, Continuous Distribution Function, or
something else? Maybe I am ignorant, but how do I get the CDF from the
MSE histogram? (I might not use it, but it's good general knowledge,
and I might consider it if it's really good.)


>
> 1. 10 is not enough
> 2. Only your 1st run is an integer state. Better to use a
>    scheme like the one I outlined on Feb 21.
>

The reason why I am using only 10 trials is because my input matrix is
large (57x594) and the computation time is very long even with just one
hidden layer. But I might consider 30 (maximum 50) and this will
probably take a couple of days for training the many architectures of
NN I am considering!

By the way, what is a reasonable number of epochs you would consider
for a comparison test such as this?

Thanks again.

Regards,
Edmund

Reply Edmund 8/12/2005 5:12:58 AM

Greg Heath wrote:
> The constraint of using the same initial states for different
> architectures is a red herring. Forget it.

> Also forget about averaging all MSEs as a goodness criterion.

I am not averaging MSEs. I am finding the error between the output and
the actual targets and then averaging that error between trials/passes.

Is this bad too?


> For each architecture
>
> 1. Loop over 30 to 100 pseudorandom initializations.
> 2. Use only "integer states". See my Feb 21 post for one
>    way to do this
>
> Optimal Number of Neural Network Hidden Nodes using "integer" RNG ...
>
> 3. Save the integer state index and MSE for each trial
> 4. Plot the MSE histogram and resulting CDF. Then, for a given MSE
>    you can see what % of the trials yielded better performances.
> 5. I think the best criterion for comparing different architectures
>    is the CDF curve.

I have read your Feb 21 post, and thanks for it. I will use the integer
state scheme suggested there.
1. In your Feb 21 post, you have all these "fascinating" equations
(e.g. Nw = O+(I+O+1)*H) and inequalities/rules that govern the
performance of the NN. I have been looking for equations like these.
Can you point me to references or papers that cover these rules of
thumb, which could be useful in finding a good NN architecture?
2. By the way, what does CDF stand for? Cumulative Distribution
Function, Continuous Distribution Function, or something else?
3. Maybe I am ignorant, but how do I get the CDF from the MSE histogram?
(I might not use it, but it's good general knowledge, and I might
consider it if it's really good.)


> 1. 10 is not enough
> 2. Only your 1st run is an integer state. Better to use a
>    scheme like the one I outlined on Feb 21.

The reason I am using only 10 trials is that my input matrix is large
(57x594) and the computation time is very long even with just one
hidden layer. But I might consider 30 (maximum 50), and this will
probably take a couple of days to train the many NN architectures I am
considering!
1. However, what is your justification for recommending 30 or more
trials, as also pointed out in your Feb 21 post?
2. By the way, what is a reasonable number of epochs you would consider
for a comparison test such as this?

Thanks again. 

Regards, 
Edmund

Reply Edmund 8/12/2005 6:02:35 AM

"Edmund" <galusdomestica@gmail.com> wrote in message 
news:1123826555.656796.42330@z14g2000cwz.googlegroups.com...
> Greg Heath wrote:
>> The constraint of using the same initial states for different
>> architectures is a red herring. Forget it.
>
>> Also forget about averaging all MSEs as a goodness criterion.
>
> I am not averaging MSEs. I am finding the error between the output and
> the actual targets and then averaging that error between trials/passes.
>
> Is this bad too?
>
>
>> For each architecture
>>
>> 1. Loop over 30 to 100 pseudorandom initializations.
>> 2. Use only "integer states". See my Feb 21 post for one
>>    way to do this
>>
>> Optimal Number of Neural Network Hidden Nodes using "integer" RNG ...
>>
>> 3. Save the integer state index and MSE for each trial
>> 4. Plot the MSE histogram and resulting CDF. Then, for a given MSE
>>    you can see what % of the trials yielded better performances.
>> 5. I think the best criterion for comparing different architectures
>>    is the CDF curve.
>
> I have read your Feb 21 post and thanks for it. I will use the integer
> state scheme suggested here.
> 1. In your Feb 21 post, you have all these "fascinating" equations (for
> e.g. Nw = O+(I+O+1)*H) and inequalities/rules that governs the
> performance of the NN. I have been looking for equations like such. Can
> you point me to references/literatures/papers that involves these rules
> of thumb that could be useful in finding a good NN architecture.
> 2. By the way, what does CDF stand for? Cumulative Distribution
> Function or Continuous Distribution Function or something else?
> 3. Maybe I am ignorant but how do I get the CDF from the MSE histogram
> (although I might not be using this but it's good general knowledge and
> I might consider using it if it's really good).
>
>
>> 1. 10 is not enough
>> 2. Only your 1st run is an integer state. Better to use a
>>    scheme like the one I outlined on Feb 21.
>
> The reason why I am using only 10 trials is because my input matrix is
> large (57x594) and the computation time is very long even with just one
>
> hidden layer. But I might consider 30 (maximum 50) and this will
> probably take a couple of days for training the many architectures of
> NN I am considering!
> 1. However, what is your justification of saying 30 or more trials, as
> also pointed out in your Feb 21 post?
> 2. By the way, what is a reasonable number of epochs you would consider
>
> for a comparison test such as this?
>
> Thanks again.
>
> Regards,
> Edmund
>

Well....

I must admit (and I thought about it when I was writing my reply) that
10 trials are not enough to constitute a sound basis for conclusions.
However, from my experience in my PhD thesis, 10 is enough to give a
very good indication of how things are. Ideally you need no fewer than
50 trials, provided that there are no 'practical limitations'.

Regards

Vasileios 


Reply vasilisp 8/12/2005 1:19:48 PM

Edmund wrote:
> Greg Heath wrote:
> > The constraint of using the same initial states for different
> > architectures is a red herring. Forget it.
>
> > Also forget about averaging all MSEs as a goodness criterion.
>
> I am not averaging MSEs. I am finding the error between the output and
> the actual targets and then averaging that error between trials/passes.
>
> Is this bad too?

I'm not sure what you mean. Please elucidate.

Each run, starting with weight initialization and ending (hopefully) in
convergence, is a trial.

Applying one training input vector and determining the resulting output
vector is a pass.

An epoch consists of one pass per training vector.

The most typical criteria average either the absolute value of the
error or the square of the error over an epoch.
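
A minimal sketch of those two criteria. The targets T and outputs Y are
hypothetical values for 2 output nodes and 4 training vectors, so one
epoch here is 4 passes.

```matlab
% Hypothetical targets and network outputs (one column per training vector).
T = [1 0 1 0;
     0 1 0 1];
Y = [0.9 0.2 0.8 0.1;
     0.1 0.7 0.3 0.0];

E = T - Y;                      % one column of errors per pass
maeEpoch = mean(abs(E(:)));     % mean absolute error over the epoch (0.1625 here)
mseEpoch = mean(E(:).^2);       % mean squared error over the epoch (0.03625 here)
```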

> > For each architecture
> >
> > 1. Loop over 30 to 100 pseudorandom initializations.
> > 2. Use only "integer states". See my Feb 21 post for one
> >    way to do this
> >
> > Optimal Number of Neural Network Hidden Nodes using "integer" RNG ...
> >
> > 3. Save the integer state index and MSE for each trial
> > 4. Plot the MSE histogram and resulting CDF. Then, for a given MSE
> >    you can see what % of the trials yielded better performances.
> > 5. I think the best criterion for comparing different architectures
> >    is the CDF curve.
>
> I have read your Feb 21 post and thanks for it. I will use the integer
> state scheme suggested here.
> 1. In your Feb 21 post, you have all these "fascinating" equations (for
> e.g. Nw = O+(I+O+1)*H) and inequalities/rules that governs the
> performance of the NN. I have been looking for equations like such. Can
> you point me to references/literatures/papers that involves these rules
> of thumb that could be useful in finding a good NN architecture.

The only place I have seen these is in my c.a.n-n and c.s-s.m posts.
Try searching these groups archives using Google Groups and greg-heath
as one of the search terms.


> 2. By the way, what does CDF stand for? Cumulative Distribution
> Function

Yes.

> or Continuous Distribution Function or something else?
> 3. Maybe I am ignorant but how do I get the CDF from the MSE histogram
> (although I might not be using this but it's good general knowledge and
> I might consider using it if it's really good).

Integrating a piecewise constant histogram (an approximation to the pdf,
the probability density function) will yield a piecewise linear CDF.
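
A sketch of that integration, with the bins counted by hand so that it
needs no toolbox. The 12 MSE values and the choice of 4 bins are made
up for illustration.

```matlab
% Hypothetical MSE values from 12 trials.
mseTrial = [0.02 0.03 0.03 0.05 0.06 0.06 0.07 0.10 0.13 0.15 0.30 0.42];

nBins = 4;
edges = linspace(min(mseTrial), max(mseTrial), nBins + 1);

% Count trials per bin; the last bin includes its right edge.
counts = zeros(1, nBins);
for k = 1:nBins
    if k < nBins
        inBin = mseTrial >= edges(k) & mseTrial <  edges(k+1);
    else
        inBin = mseTrial >= edges(k) & mseTrial <= edges(k+1);
    end
    counts(k) = sum(inBin);
end

binWidth   = edges(2) - edges(1);
pdfHeights = counts/(sum(counts)*binWidth);     % piecewise-constant density
cdfAtEdges = [0, cumsum(counts)]/sum(counts);   % integral of the density at each edge

% Linear interpolation between edges gives the piecewise-linear CDF.
query = 0.10;
cdfAtQuery = interp1(edges, cdfAtEdges, query);
```

Because of the binning this is only an approximation: the interpolated
value at 0.10 is about 0.53, while 8 of the 12 raw values (0.67) are at
or below 0.10. Sorting the raw MSE values gives the exact empirical CDF.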

> > 1. 10 is not enough
> > 2. Only your 1st run is an integer state. Better to use a
> >    scheme like the one I outlined on Feb 21.
>
> The reason why I am using only 10 trials is because my input matrix is
> large (57x594) and the computation time is very long even with just one
> hidden layer.

That is not a large matrix. Aren't you using a good batch algorithm
like Levenberg-Marquardt or conjugate gradient instead of ordinary
backprop?

What are the summary statistics (e.g., min,median,mean,max) for
computation time and number of epochs?

> But I might consider 30 (maximum 50) and this will
> probably take a couple of days for training the many architectures of
> NN I am considering!

It depends on what you mean by many. Please elucidate. Sometimes, with
a little planning, you don't have to try as many as you originally
thought.

> 1. However, what is your justification of saying 30 or more trials, as
> also pointed out in your Feb 21 post?

Trial and error experience.

The old statistical rule of thumb of treating 30 as the dividing
line between finite and infinite (i.e., t-test vs z-test) had nothing
to do with my choice. However, it makes you wonder. Doesn't it?

> 2. By the way, what is a reasonable number of epochs you would consider
> for a comparison test such as this?

Each problem is different. I generally start with the MATLAB default
value and adjust as needed.

Hope this helps.

Greg

Reply Greg 8/13/2005 12:42:26 AM

vasilisp@otenet.gr wrote:
> "Edmund" <galusdomestica@gmail.com> wrote in message
> news:1123826555.656796.42330@z14g2000cwz.googlegroups.com...
> > Greg Heath wrote:
> >> The constraint of using the same initial states for different
> >> architectures is a red herring. Forget it.
> >
> >> Also forget about averaging all MSEs as a goodness criterion.
> >
> > I am not averaging MSEs. I am finding the error between the output and
> > the actual targets and then averaging that error between trials/passes.
> >
> > Is this bad too?
> >
> >
> >> For each architecture
> >>
> >> 1. Loop over 30 to 100 pseudorandom initializations.
> >> 2. Use only "integer states". See my Feb 21 post for one
> >>    way to do this
> >>
> >> Optimal Number of Neural Network Hidden Nodes using "integer" RNG ...
> >>
> >> 3. Save the integer state index and MSE for each trial
> >> 4. Plot the MSE histogram and resulting CDF. Then, for a given MSE
> >>    you can see what % of the trials yielded better performances.
> >> 5. I think the best criterion for comparing different architectures
> >>    is the CDF curve.
> >
> > I have read your Feb 21 post and thanks for it. I will use the integer
> > state scheme suggested here.
> > 1. In your Feb 21 post, you have all these "fascinating" equations (for
> > e.g. Nw = O+(I+O+1)*H) and inequalities/rules that governs the
> > performance of the NN. I have been looking for equations like such. Can
> > you point me to references/literatures/papers that involves these rules
> > of thumb that could be useful in finding a good NN architecture.
> > 2. By the way, what does CDF stand for? Cumulative Distribution
> > Function or Continuous Distribution Function or something else?
> > 3. Maybe I am ignorant but how do I get the CDF from the MSE histogram
> > (although I might not be using this but it's good general knowledge and
> > I might consider using it if it's really good).
> >
> >
> >> 1. 10 is not enough
> >> 2. Only your 1st run is an integer state. Better to use a
> >>    scheme like the one I outlined on Feb 21.
> >
> > The reason why I am using only 10 trials is because my input matrix is
> > large (57x594) and the computation time is very long even with just one
> >
> > hidden layer. But I might consider 30 (maximum 50) and this will
> > probably take a couple of days for training the many architectures of
> > NN I am considering!
> > 1. However, what is your justification of saying 30 or more trials, as
> > also pointed out in your Feb 21 post?
> > 2. By the way, what is a reasonable number of epochs you would consider
> >
> > for a comparison test such as this?
> >
> > Thanks again.
> >
> > Regards,
> > Edmund
> >
>
> Well....
>
> I must admit (and I thought about it, when I was writing my reply) that 10
> trials are not enough to constitute a sound basis for conclusions. However,
> from my experience in my PhD thesis, 10 is enough to give a very good
> indication of how things are. You ideally need no less than 50 trials
> provided that there are no 'practical limitations'.

It depends on the complexity of the I/O mapping, the noise level
and the number of hidden nodes. Sometimes 10 works, sometimes it
takes 40.

Hope this helps.

Greg

Reply Greg 8/13/2005 12:49:08 AM

Greg Heath wrote:
> Edmund wrote:
> > Greg Heath wrote:
>
> > I am not averaging MSEs. I am finding the error between the output and
> > the actual targets and then averaging that error between trials/passes.
> >
> > Is this bad too?
>
> I'm not sure what you mean. Please elucidate.

Ok, I will try my best to explain. Actually, the method I am using is
very similar to what is shown in the character (alphabet) recognition
example in Matlab's Neural Network Toolbox User's Guide (this is my 1st
reference for learning NN). So basically I first train the NN with the
57 X 594 matrix, with noise and without added noise. Then I have the
same matrix, add a different amount of noise and try to test (the
correct term would be 'validate') the network. I compute the error
(abs(output-target)/target) between the output of this test matrix and
the actual targets which I know. I do this for different number of
neurons and different batch algorithms.

> Each run starting with weight initialization and ending (hopefully) in
> convergence is a trial.
>
> Applying one training input vector and determining the resulting output
> vector
> is a pass
>
> An epoch consists of one pass per training vector.
>
> Averaging either the absolute value of the error or the square of the
> error
> over an epoch are the most typical criteria that are used.

So far my NN training (mostly) ends in convergence. With a sufficient
number of epochs, the NN converges. So when the MSE settles at a
constant value, meaning it won't get significantly smaller with more
epochs, that's when I calculate the error between the outputs and the
targets. Sometimes I set a particular goal for the MSE, and when the NN
has been trained to that goal, I collect the outputs and compare them
with the targets. But I also like the way you mentioned, and I will do
that too in another test run in the near future.

Question: My understanding is that one epoch means passing my entire 57
X 594 matrix (since this is batch training) into the NN all at once to
compute the weights and biases. The next epoch passes the entire matrix
again to calculate new weights and biases, and so on and so forth. So
when you say "per training vector", the training vector is my entire
57 X 594 matrix with all its elements in its full glory. Am I correct?


> > 1. In your Feb 21 post, you have all these "fascinating" equations (for
> > e.g. Nw = O+(I+O+1)*H) and inequalities/rules that governs the
> > performance of the NN. I have been looking for equations like such. Can
> > you point me to references/literatures/papers that involves these rules
> > of thumb that could be useful in finding a good NN architecture.
>
> The only place I have seen these is in my c.a.n-n and c.s-s.m posts.
> Try searching these groups archives using Google Groups and greg-heath
> as one of the search terms.

Thanks. I can't believe no one has written a paper or a book about
this. Maybe you should do it! And if you do, I will buy it!


> > The reason why I am using only 10 trials is because my input matrix is
> > large (57x594) and the computation time is very long even with just one
> > hidden layer.
>
> That is not a large matrix. Aren't you using a good batch algorithm
> like Levenberg-Marquardt or conjugate gradient instead of ordinary
> backprop?
>
> What are the summary statistics (e.g., min,median,mean,max) for
> computation time and number of epochs?

The Levenberg-Marquardt (LM) algorithm doesn't work: I always get
the "Out of Memory" error. I tried all the memory reduction methods
in Matlab's User Guide but nothing works. The algorithms that give me
the best results and the smallest error so far are Resilient
Backpropagation (RP) and Scaled Conjugate Gradient (SCG). The Matlab
User's Guide says both of these work best with large networks, and I
guess it is right.

For a 57-20-24 (tansig,purelin) NN (SCG algorithm), the computation
time is between 5 and 10 seconds for every 25 epochs. This is fast in
my opinion, but with more than 50 neurons it will take minutes. I got
the time from my trusty wrist watch, and if there is a better way of
doing this, I am all ears (or eyes).

> > But I might consider 30 (maximum 50) and this will
> > probably take a couple of days for training the many architectures of
> > NN I am considering!
>
> It depends on what you mean by many. Please elucidate. Sometimes, with
> a little planning, you don't have to try as many as you originally
> thought.

I am trying to train my network for different numbers of neurons (from
5 to 100 in steps of 5 neurons), and I am doing this for 7 different
batch algorithms. Each NN architecture goes through 500 epochs. If you
add all the numbers up, it's a lot. The reason I am doing so many is
not that I have no idea which NNs are good or where I should focus my
tests; it is to get an overall plot of how the different algorithms
perform and how the number of neurons affects the NN. Something like a
benchmark test.

Thanks.

Regards,
Edmund

Reply Edmund 8/13/2005 3:38:21 PM

Edmund wrote:
> Greg Heath wrote:
> > Edmund wrote:
> > > Greg Heath wrote:
> >
> > > I am not averaging MSEs. I am finding the error between the output and
> > > the actual targets and then averaging that error between trials/passes.
> > >
> > > Is this bad too?
> >
> > I'm not sure what you mean. Please elucidate.
>
> Ok, I will try my best to explain. Actually, the method I am using is
> very similar to what is shown in the character (alphabet) recognition
> example in Matlab's Neural Network Toolbox User's Guide (this is my 1st
> reference for learning NN). So basically I first train the NN with the
> 57 X 594 matrix, with noise and without added noise. Then I have the
> same matrix, add a different amount of noise and try to test (the
> correct term would be 'validate') the network. I compute the error
> (abs(output-target)/target)

I don't have access to MATLAB on this machine. However, I think
the only learning objective functions are mse, sse, mae (mean (or
median?) absolute error), and regularized mse.

One of the criteria you use should be the same one you trained on.
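
A minimal sketch of the criteria named above, written in plain Python rather than MATLAB so that it needs no toolbox. The function name `error_measures` and the sample numbers are made up for illustration; the relative error follows the `abs(output-target)/target` expression from the quoted post.

```python
def error_measures(outputs, targets):
    """Return sse, mse, mae and mean relative error for two equal-length lists."""
    errors = [o - t for o, t in zip(outputs, targets)]
    n = len(errors)
    sse = sum(e * e for e in errors)        # sum of squared errors
    mse = sse / n                           # mean squared error
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    # Relative error exactly as posted: abs(output - target) / target.
    # Note: undefined for a zero target and negative for a negative target.
    mre = sum(abs(e) / t for e, t in zip(errors, targets)) / n
    return {"sse": sse, "mse": mse, "mae": mae, "mre": mre}


# Made-up outputs and targets, only to exercise the function.
result = error_measures([1.0, 2.5, 4.0], [1.0, 2.0, 5.0])
print(result)
```

Reporting the training criterion (for example mse) alongside a custom measure such as the relative error keeps the test numbers comparable with what the training algorithm actually minimized.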

> between the output of this test matrix and
> the actual targets which I know.

You mean you average that quantity over the 594 cases.

> I do this for different number of
> neurons and different batch algorithms.
>
> > Each run starting with weight initialization and ending (hopefully) in
> > convergence is a trial.
> >
> > Applying one training input vector and determining the resulting output
> > vector is a pass
> >
> > An epoch consists of one pass per training vector.
> >
> > Averaging either the absolute value of the error or the square of the
> > error over an epoch are the most typical criteria that are used.
>
> So far (mostly) my NN training ends in convergence. With sufficient
> number of epochs, the NN converges. So basically when the MSE is at a
> constant value, meaning it won't go significantly smaller with the no.
> of epochs, that's when I calculate the error between the outputs and
> the targets. Sometimes I set a particular goal for the MSE, and when
> the NN has been trained to that goal, I collect the outputs and compare
> them with the targets. But I also like the way you mentioned and I will
> do that too for another test run in the near future.
>
> Question: My understanding is that one epoch means passing my entire 57
> X 594 matrix (since this is batch training) into the NN all at once to
> compute the weights and biases. And the next epoch will be to pass the
> entire matrix again to calculate new weights and biases, and so on
> and so forth.

Correct.

> So when you say "per training vector", the training vector is my
> entire 57 X 594 with all its elements in its full glory. Am I correct?

No. I was referring to a single 57-dim vector. Save the terminology
"pass" for pattern learning. Use "epoch" for batch learning.
It is less confusing.
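
To make the bookkeeping concrete, here is a small Python sketch (not MATLAB) that counts passes and epochs for the 57 X 594 matrix discussed in this thread. The zero-filled matrix is placeholder data and the helper `column` is hypothetical; only the dimensions and the 500-epoch figure come from the thread.

```python
n_inputs, n_cases = 57, 594   # matrix dimensions quoted in the thread

# Placeholder data: one column per training vector (values are arbitrary).
training_matrix = [[0.0] * n_cases for _ in range(n_inputs)]


def column(matrix, j):
    """Return column j, i.e. one training vector."""
    return [row[j] for row in matrix]


passes_per_epoch = 0
for j in range(n_cases):
    vector = column(training_matrix, j)   # a single 57-dim input vector
    assert len(vector) == n_inputs
    passes_per_epoch += 1                 # presenting one vector is one pass

n_epochs = 500                            # epochs per architecture, from the thread
total_passes = n_epochs * passes_per_epoch
print(passes_per_epoch, total_passes)
```

So one epoch corresponds to 594 single-vector presentations, whether the weights are updated after each one (pattern learning) or once at the end (batch learning).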

> > > 1. In your Feb 21 post, you have all these "fascinating" equations (for
> > > e.g. Nw = O+(I+O+1)*H) and inequalities/rules that governs the
> > > performance of the NN. I have been looking for equations like such. Can
> > > you point me to references/literatures/papers that involves these rules
> > > of thumb that could be useful in finding a good NN architecture.
> >
> > The only place I have seen these is in my c.a.n-n and c.s-s.m posts.
> > Try searching these groups archives using Google Groups and greg-heath
> > as one of the search terms.
>
> Thanks. I can't believe no one has written a paper or a book about
> this. Maybe you should do it! And if you do, I will buy it!

I prefer to use my retirement in a more relaxed manner.

> > > The reason why I am using only 10 trials is because my input matrix is
> > > large (57x594) and the computation time is very long even with just one
> > > hidden layer.
> >
> > That is not a large matrix. Aren't you using a good batch algorithm
> > like Levenberg-Marquardt or conjugate gradient instead of ordinary
> > backprop?
> >
> > What are the summary statistics (e.g., min,median,mean,max) for
> > computation time and number of epochs?
>
> The Levenberg-Marquardt (LM) algorithm doesn't work and I always get
> the "Out of Memory" statement. I tried all the memory reduction methods
> in Matlab's User Guide but nothing works.

OK. I've had the same problem (I don't remember what the exact
dimensions were). However, if LM works you should also use the Bayesian
Regularization algorithm. It will give you more confidence w.r.t. not
overfitting.

> The algorithms that
> give me the best results and the smallest error so far are
> Resilient Backpropagation (RP) and Scaled Conjugate Gradient (SCG).
> Both of these algorithms were said in the Matlab User's Guide to work
> best with large networks and I guess they are right.

Good. I use SCG. Really haven't tried RP.

> For a 57-20-24 (tansig,purelin) NN (SCG algorithm), the computation
> time is between 5-10 seconds for every 25 epochs. This is considered
> fast in my opinion but if it has more than 50 neurons, it will be in
> minutes. I got the time from my trusty wrist watch and if there is a
> better way of doing this, I am all ears or eyes.

help tic
help toc

There are several other functions that calculate elapsed time and CPU
time (e.g., clock, etime, cputime).
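
As a rough illustration of the same idea outside MATLAB, the Python sketch below times repeated trials and reports the min, median, mean and max asked for earlier. `time.perf_counter` plays the role of tic/toc; `dummy_trial` is a hypothetical stand-in for one training run, and `time_trial` and `summarize` are names invented for this sketch.

```python
import statistics
import time


def time_trial(run_trial):
    """Return elapsed wall-clock seconds for one call of run_trial()."""
    start = time.perf_counter()           # plays the role of tic
    run_trial()
    return time.perf_counter() - start    # plays the role of toc


def summarize(values):
    """Summary statistics: min, median, mean, max."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


def dummy_trial():
    # Hypothetical stand-in for one training run (initialize, then train).
    sum(i * i for i in range(10000))


elapsed = [time_trial(dummy_trial) for _ in range(10)]   # 10 trials, as in the thread
timing_summary = summarize(elapsed)
print(timing_summary)
```

The same `summarize` helper can be applied to the number of epochs each trial needed, which gives both sets of summary statistics from one loop.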

> > > But I might consider 30 (maximum 50) and this will
> > > probably take a couple of days for training the many architectures of
> > > NN I am considering!
> >
> > It depends on what you mean by many. Please elucidate. Sometimes, with
> > a little planning, you don't have to try as many as you originally
> > thought.
>
> I am trying to train my network for different number of neurons (from 5
> to 100 at intervals of 5 neurons)

Too conservative. Don't be more conservative than a binary search.
For example, first do Hmin = 5 and Hmax determined by a reasonable
limit on your weight-to-equation ratio

r = [(57+1)*Hmax+(Hmax+1)*24]/(594*24) <= 1/2

Hmax = floor[(594/2-1)*24/(57+24+1)] = floor(86.6) = 86

Next try Havg = (Hmin+Hmax)/2. If the result is not significantly
worse than the one using Hmax, set Hmax = Havg, define a new
Havg and so on.
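
The arithmetic and the search above can be sketched in Python (not MATLAB) as follows. All function names are invented for this sketch. `is_not_worse` is a hypothetical user-supplied test that would train both sizes and compare them; the post does not say what to do when the midpoint is significantly worse, so raising Hmin in that case is an assumption.

```python
def n_weights(n_in, n_hidden, n_out):
    """Weights and biases in an n_in-n_hidden-n_out network."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def max_hidden(n_in, n_out, n_cases, max_ratio=0.5):
    """Largest H with n_weights / (n_cases * n_out) <= max_ratio."""
    n_equations = n_cases * n_out
    # Solve (n_in + n_out + 1) * H + n_out <= max_ratio * n_equations for H.
    return int((max_ratio * n_equations - n_out) // (n_in + n_out + 1))


def shrink_hidden(h_min, h_max, is_not_worse):
    """Bisect [h_min, h_max]; return the final h_max and the sizes tried."""
    tried = [h_min, h_max]
    while h_max - h_min > 1:
        h_avg = (h_min + h_max) // 2
        tried.append(h_avg)
        if is_not_worse(h_avg, h_max):
            h_max = h_avg    # midpoint is good enough: search below it
        else:
            h_min = h_avg    # assumption: otherwise search above the midpoint
    return h_max, tried


h_max = max_hidden(57, 24, 594)

# Toy criterion, purely illustrative: pretend performance saturates at H >= 20.
best, tried = shrink_hidden(5, h_max, lambda h, h_ref: h >= 20)
print(h_max, best, tried)
```

With this toy criterion the search visits 9 sizes instead of the 20 in a 5-to-100 sweep; a real run would replace the lambda with an actual comparison of trained networks.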

> and I am doing this for 7 different
> batch algorithms. Each NN architecture goes through 500 epochs. If you
> add all the numbers up, it's a lot and the reason why I am doing so
> many is not to show that I don't know at all which NN are good and
> where I should focus my tests on but is to get an overall plot to see
> how the performance of different algorithms and the effects of the
> number of neurons on the NN. Something like a benchmark test.

I understand. However, increasing or decreasing sizes by a factor
of 2 and using binary searches can usually save the time that you are
complaining about. After you plot your results you can fill in
perceived gaps.
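
A quick count, sketched in Python, of how much a factor-of-2 grid saves over the 5-to-100 sweep. The 7 algorithms and the step of 5 come from the quoted post, and 86 is the Hmax bound computed earlier in this reply; the variable names are made up.

```python
n_algorithms = 7                        # batch algorithms compared in the thread

linear_sizes = list(range(5, 101, 5))   # 5, 10, ..., 100: the original plan

h_max = 86                              # bound derived earlier in this reply
doubling_sizes = []
h = 5
while h <= h_max:
    doubling_sizes.append(h)            # 5, 10, 20, 40, 80
    h *= 2

linear_runs = len(linear_sizes) * n_algorithms      # architectures per trial
doubling_runs = len(doubling_sizes) * n_algorithms
print(linear_runs, doubling_runs)
```

That is 140 architecture/algorithm combinations per trial against 35, before any gaps are filled in.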

Hope this helps.

Greg

Reply Greg 8/16/2005 5:20:15 PM
