
### data splitting: train&test or train/test/validate


```Hi, I have some questions. Big thanks in advance to anyone willing to answer.

a) Which is the best way to split data? What are the advantages and disadvantages?
1. train & test
2. train/test/validate

b) What is the difference between test and validate? Aren't they both supposed to be data that differ from the training set but still fall within the same range?

c) How do we decide whether a data set is small? E.g., I have a data set with 6 columns (5 inputs and 1 output), and every column has 31 data points. Is this considered small for MLP and RBF networks?

d) Does data splitting come first, or should I normalize the data first?

I really appreciate any help and explanation.
Thank you.
```
Reply aurikel 11/19/2009 5:57:03 AM

```On Nov 19, 12:57 am, "aurikel Radzali" <suriar...@gmail.com> wrote:
> Hi, I have some questions. Anyone willing to answer - big thanks in advance.
>
> a)Which one is the best way to split data?Advantages & disadvantages?
> 1.train & test
> 2.train/test/validate

Neither. Best is

train/validate/test

> b)What is the different between test & validate? Aren't they supposed to
> be the data that different from train set but still fall within the same range?

All data are assumed to be random draws from the same
probability distribution.

Ntotal = Ndes + Ntst          % Total  = Design + Test
Ndes   = Ntrn + Nval          % Design = Train + Validate

Training data is used to estimate weights. Ntrn must be
sufficiently large to obtain accurate weight estimates.

Test data is nondesign data used to obtain an unbiased
estimate of generalization error. If the test result is used to
improve the design, the test set has become part of the design
and can no longer give an unbiased estimate. In that case the
data has to be repartitioned into new trn/val/tst subsets. Ntst
must be large enough to obtain sufficiently precise error estimates.

Validation data is nontraining design data used to estimate
correct values for training parameters (e.g., learning rate,
momentum coefficient, number of epochs, etc). Typically,
Ntst/2 <= Nval <= Ntst.

If Ntotal = Ntrn + Nval + Ntst cannot be split into large enough
subsets, then cross-validation should be considered.

> c)How we define the data set is small? eg: I have a set of
> data contains 6column (represent 5 input & 1 output) which
> every column has 31 data point. Is this consider small for
> MLP & RBF network?

Yes. Search in comp.ai.neural-nets and comp.soft-sys.matlab
(c.s-s.m) using search words like

"greg-heath" split
"greg-heath" partition
"greg-heath" Ntrn Neq Nw
"greg-heath" Nval

> d)Data splitting come first or i should normalize the data first?

Data splitting.

Normalization parameters are obtained from the training set only.
They are then applied to all three subsets.

> I really appreciate any help & explanation.
> Thank you.

See the comp.ai.neural-nets FAQ.

Hope this helps.

Greg
```
Reply Greg 11/19/2009 7:01:47 AM
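
A minimal sketch of the order of operations described in the reply above: partition first, then compute the normalization parameters from the training subset only and apply them to all three subsets. The function names, the 60/20/20 fractions, the random data, and the choice of z-score normalization are illustrative assumptions and do not come from the post.

```python
import numpy as np

def split_trn_val_tst(n_total, frac_val=0.2, frac_tst=0.2, seed=0):
    """Return shuffled index arrays for the train, validation and test subsets.

    The fractions are assumptions for illustration; they satisfy
    Ntst/2 <= Nval <= Ntst when frac_val == frac_tst.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)            # random order, no repeats
    n_tst = int(round(frac_tst * n_total))
    n_val = int(round(frac_val * n_total))
    n_trn = n_total - n_val - n_tst
    return idx[:n_trn], idx[n_trn:n_trn + n_val], idx[n_trn + n_val:]

def fit_zscore(x_trn):
    """Estimate per-column mean and standard deviation from training rows only."""
    mean = x_trn.mean(axis=0)
    std = x_trn.std(axis=0)
    std = np.where(std == 0, 1.0, std)        # avoid dividing by zero for constant columns
    return mean, std

def apply_zscore(x, mean, std):
    """Apply previously fitted parameters to any subset."""
    return (x - mean) / std

# Hypothetical data: 200 rows, 5 input columns.
data_rng = np.random.default_rng(1)
X = data_rng.normal(loc=5.0, scale=2.0, size=(200, 5))

trn, val, tst = split_trn_val_tst(len(X))     # 1. split first
mean, std = fit_zscore(X[trn])                # 2. parameters from the training set only
X_trn = apply_zscore(X[trn], mean, std)       # 3. same parameters for all three subsets
X_val = apply_zscore(X[val], mean, std)
X_tst = apply_zscore(X[tst], mean, std)
```

Only the training columns come out with exactly zero mean and unit standard deviation; the validation and test columns will be close but not exact, which is expected because they took no part in estimating the parameters.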

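For a data set as small as the 31 points in the question, the reply suggests considering cross-validation. A rough sketch of k-fold index generation follows, assuming k = 5 and a hypothetical helper named `kfold_indices`; neither the value of k nor the helper comes from the post.

```python
import numpy as np

def kfold_indices(n_total, k=5, seed=0):
    """Yield (train_idx, holdout_idx) pairs; each point is held out exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    parts = np.array_split(idx, k)            # fold sizes differ by at most one
    for i in range(k):
        holdout = parts[i]
        train = np.concatenate([parts[j] for j in range(k) if j != i])
        yield train, holdout

# 31 data points, as in the question.
folds = list(kfold_indices(31, k=5))
for train_idx, holdout_idx in folds:
    print(len(train_idx), len(holdout_idx))
```

Within each fold, any normalization parameters would be re-estimated from that fold's training indices only, in line with the advice given in the reply.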