### data mining

```In data mining you are typically dealing with millions of rows of data,
and if you are talking about internet browsing data, even 100 million
rows or more.

Let us say that you have 30 attributes (explanatory variables) in each
row plus a response variable (0 = no response, 1 = response).

There are all kinds of analysis one can do on such a data set and I
would like some advice on designing a program to do one of them
("Decision Trees") with Fortran.

We want to split the original data set into two subsets (after which
the analysis can be repeated on each of the two subsets) by splitting
on any one of the thirty attributes.  If it is a true-false type
attribute then there is only one way to split on that attribute, but in
other cases there would be more choices.  The aim of the split is to
create "pure" subsets so that one subset has more responders (based on
count or the proportion or perhaps other measures) than the other and
the top level attribute to be split on would be the one that makes the
difference as high as possible.  There might be some constraints as to
how big or small each subset can be.

If we are talking about a small number of rows then this is a pretty
elementary problem as far as I can see it.  There are free and
commercial packages that offer to do this - but if one were to do this
from scratch in Fortran, I would appreciate the group's suggestions as
to how this kind of volume of data can be handled.

```
analyst41 (233) 7/1/2007 1:05:03 PM
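
A minimal sketch of the split search described in the post above, written in Python rather than Fortran only to keep it short. It assumes every attribute is categorical (a continuous attribute would have to be binned first) and that each candidate split is "one value versus all the rest". The names `tally`, `best_split`, and `min_size` are made up for this illustration. The point for large volumes is that one sequential pass over the rows is enough: only the per-value count tables stay in memory, not the rows themselves.

```python
from collections import defaultdict


def tally(rows, n_attrs):
    """One pass over the data: per attribute, map value -> [row count, responder count]."""
    counts = [defaultdict(lambda: [0, 0]) for _ in range(n_attrs)]
    for *attrs, response in rows:          # rows may be any iterator, e.g. a file reader
        for j, value in enumerate(attrs):
            cell = counts[j][value]
            cell[0] += 1                   # rows seen with this value
            cell[1] += response            # responders (response is 0 or 1)
    return counts


def best_split(counts, min_size=1):
    """Return (gap, attribute index, value) for the best 'value vs. rest' split.

    gap is the absolute difference in response rate between the two subsets.
    min_size (assumed >= 1) is the smallest subset allowed on either side.
    Returns None if no split satisfies the size constraint.
    """
    best = None
    for j, table in enumerate(counts):
        total_n = sum(n for n, _ in table.values())
        total_r = sum(r for _, r in table.values())
        for value, (n, r) in table.items():
            rest_n, rest_r = total_n - n, total_r - r
            if n < min_size or rest_n < min_size:
                continue                   # violates the subset-size constraint
            gap = abs(r / n - rest_r / rest_n)
            if best is None or gap > best[0]:
                best = (gap, j, value)
    return best


# Toy data: two attributes followed by the 0/1 response.
rows = [
    ("a", 0, 1),
    ("a", 1, 1),
    ("b", 0, 0),
    ("b", 1, 0),
    ("a", 1, 1),
    ("b", 0, 1),
]
result = best_split(tally(rows, n_attrs=2), min_size=2)
print(result)
```

On the toy data the first attribute wins, because splitting on the value "a" separates a subset with a response rate of 3/3 from one with 1/3. To repeat the analysis on each subset, the same pass would be run again over only the rows that fall on that side of the split.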

invalid (121) 7/1/2007 6:16:24 PM

analyst41 (233) 7/1/2007 7:54:50 PM

none1568 (7455) 7/1/2007 9:16:16 PM

beliavsky (2211) 7/1/2007 9:43:57 PM

```What you are after is cluster analysis and segmentation analysis
methods.

I'm interested in such problems since I'm CEO of Tau Systems, which
has supplied market research data analysis software since 1972.
And yes, all we sell is written in Fortran F77.

It is very easy to arrive at statistically incorrect conclusions by
using accurate tools the wrong way. To continue in your project you
MUST use a professional, experienced statistician as advisor.

By all means write programs to manage and massage your data, and to
sort and  select data into specified subsets. But don't try writing
the statistical analysis part.
Buy a commercial dedicated segmentation analysis package, or use SAS
or similar vast systems.

Lastly I would point out that ascii-based systems are readable, but
binary systems are inherently 128 times faster in parallel processing
and occupy one eighth of the storage medium space and access time. Also
binary systems can have fixed format fields, whereas ascii systems
with multiple-response replies taking up variable lengths need a data
map as part of the file.

tbwright (1098) 7/1/2007 10:53:13 PM

beliavsky (2211) 7/2/2007 12:43:42 AM

analyst41 (233) 7/2/2007 10:49:00 AM

invalid (121) 7/2/2007 8:58:27 PM

```> And how is clustering going to restore order in these subsets of strip-mined
> data, take for instance {'U','K','C','F'}?  Your output will be exactly what
> you tell it to be.
> --

The point is, cluster analysis in n-dimensional space will locate the

Segmentation analysis allows the identification and marking (with a
code) of members of those groups for selecting out the members, or by
reaching ranking score levels of partial sums of given scoring
coefficients.

Most commercial software (for surveys) will handle up to about 65k
cases (6000 is rare; the whole USA is covered by 1024 households by
A.C. Nielsen; the largest I ever handled was a survey of US Veterans
with several hundreds of thousands).

Our data mining poster has no idea of what he is getting into with
100M cases.
But I DO have a sorting program for that.....

( And just what is  {'U','K','C','F'}?? )

tbwright (1098) 7/3/2007 5:08:33 AM

invalid (121) 7/3/2007 5:07:06 PM

invalid163 (957) 7/4/2007 12:08:23 AM

```> > ( And just what is  {'U','K','C','F'}?? )
>
> (You asked).  It's a clusterfuck, or any other of 4! outcomes.

Ah, No! It's a sorting problem, or even a permutation problem, but not
clustering!

tbwright (1098) 7/4/2007 2:49:42 AM

```> > Let us assume that the data being "mined" can be done so legally.
> > From what I understand, when credit card companies send those letters
> > outlining our "privacy rights", our usage of the card after receiving
> > those letters amounts to giving them our permission to "mine" our
> > purchase behavior.

If you read the really small print, you'll find they "might only pass
it on to associated companies".
And guess what companies they are associated with?

tbwright (1098) 7/4/2007 2:51:45 AM

```Terence <tbwright@cantv.net> wrote:

> > > Let us assume that the data being "mined" can be done so legally.
> > > From what I understand, when credit card companies send those letters
> > > outlining our "privacy rights", our usage of the card after receiving
> > > those letters amounts to giving them our permission to "mine" our
> > > purchase behavior.
>
> If you read the really small print, you'll find they "might only pass
> it on to associated companies".
> And guess what companies they are associated with?

You'll also find that they might do other things that they are legally
allowed to. It is often worded in such a way that you might think it is
things that they are legally required to do, but it invariably says
"allowed", not "required". They say that they can do anything with the
data that they are allowed to do, and they are allowed to do pretty much
anything with it as long as they cover it in the letter. Translation:
they reserve the right to do whatever they feel like.

My "language lawyer" skills help me in reading the fine print of
contracts and laws as well.

I'm afraid I'm wandering too far off topic, so I'll stop. Couldn't resist
the one post. (Besides, from the above short note I get the impression
it is one area where Terence and I have at least similar perspectives.
:-))

--
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain
nospam47 (9744) 7/4/2007 3:21:58 AM

``` ...one area where Terence and I have at least similar perspectives.
Richard Maine

(Perspective) == (same object seen from different viewpoints)?

>> obj <<

But remove the object and head-on crash?   :o)>

tbwright (1098) 7/4/2007 5:02:42 AM

```"Terence" <tbwright@cantv.net> wrote in message
> ...one area where Terence and I have at least similar perspectives.
> Richard Maine
>
> (Perspective) == (same object seen from different viewpoints)?
>
>>> obj <<
>
> But remove the object and head-on crash?   :o)>
But that's part of the good news with geometries that are post-Gaussian.
The object defines the point of view, so almost anything works for frames of
reference to jibe.
--
WW

invalid (121) 7/4/2007 8:12:38 AM

```On Jul 3, 11:21 pm, nos...@see.signature (Richard Maine) wrote:
> Terence <tbwri...@cantv.net> wrote:
> > > > Let us assume that the data being "mined" can be done so legally.
> > > > From what I understand, when credit card companies send those letters
> > > > outlining our "privacy rights", our usage of the card after receiving
> > > > those letters amounts to giving them our permission to "mine" our
> > > > purchase behavior.
>
> > If you read the really small print, you'll find they "might only pass
> > it on to associated companies".
> > And guess what companies they are associated with?
>
> You'll also find that they might do other things that they are legally
> allowed to. It is often worded in such a way that you might think it is
> things that they are legally required to do, but it invariably says
> "allowed", not "required". They say that they can do anything with the
> data that they are allowed to do, and they are allowed to do pretty much
> anything with it as long as they cover it in the letter. Translation:
> they reserve the right to do whatever they feel like.
>
> My "language lawyer" skills help me in reading the fine print of
> contracts and laws as well.
>
> I'm afraid I'm wandering too far off topic, so I'll stop. Couldn't resist
> the one post. (Besides, from the above short note I get the impression
> it is one area where Terence and I have at least similar perspectives.
> :-))
>

Yup - the regulars here will be ready with a million posts on hoary
trivialities like " Are 'go to' statements good or bad ?" - but pose
a  practical problem such as repeatedly partitioning a large data set
- and one mostly gets infantile responses.

> --
> Richard Maine                    | Good judgement comes from experience;
> email: last name at domain . net | experience comes from bad judgement.
> domain: summertriangle           |  -- Mark Twain

analyst41 (233) 7/4/2007 5:00:18 PM

```analyst41@hotmail.com wrote:

> On Jul 3, 11:21 pm, nos...@see.signature (Richard Maine) wrote:
>
>>Terence <tbwri...@cantv.net> wrote:
>>
>>>>>Let us assume that the data being "mined" can be done so legally.
>>>>>From what I understand, when credit card companies send those letters
>>>>>outlining our "privacy rights", our usage of the card after receiving
>>>>>those letters amounts to giving them our permission to "mine" our
>>>>>purchase behavior.
>>
>>>If you read the really small print, you'll find they "might only pass
>>>it on to associated companies".
>>>And guess what companies they are associated with?
>>
>>You'll also find that they might do other things that they are legally
>>allowed to. It is often worded in such a way that you might think it is
>>things that they are legally required to do, but it invariably says
>>"allowed", not "required". They say that they can do anything with the
>>data that they are allowed to do, and they are allowed to do pretty much
>>anything with it as long as they cover it in the letter. Translation:
>>they reserve the right to do whatever they feel like.
>>
>>My "language lawyer" skills help me in reading the fine print of
>>contracts and laws as well.
>>
>>I'm afraid I'm wandering too far off topic, so I'll stop. Couldn't resist
>>the one post. (Besides, from the above short note I get the impression
>>it is one area where Terence and I have at least similar perspectives.
>>:-))
>>
>
>
> Yup - the regulars here will be ready with a million posts on hoary
> trivialities like " Are 'go to' statements good or bad ?" - but pose
> a  practical problem such as repeatedly partitioning a large data set
> - and one mostly gets infantile responses.

I consider the tone of responses mostly well reasoned and civil.  I do
agree that there seems, at times, to be a misconception of reality with
regard to over-emphasis on a goal of total portability.

>
>
>>--
>>Richard Maine                    | Good judgement comes from experience;
>>email: last name at domain . net | experience comes from bad judgement.
>>domain: summertriangle           |  -- Mark Twain
>
>
>

--

Gary Scott
mailto:garylscott@sbcglobal dot net

Fortran Library:  http://www.fortranlib.com

Support the Original G95 Project:  http://www.g95.org
-OR-
Support the GNU GFortran Project:  http://gcc.gnu.org/fortran/index.html

If you want to do the impossible, don't hire an expert because he knows
it can't be done.

-- Henry Ford
garylscott (1357) 7/4/2007 5:27:29 PM

```analyst41@hotmail.com wrote:
....

> Yup - the regulars here will be ready with a million posts on hoary
> trivialities like " Are 'go to' statements good or bad ?" - but pose
> a  practical problem such as repeatedly partitioning a large data set
> - and one mostly gets infantile responses.

Hey, it's still usenet... :)

I know there are folks doing humongous data mining of the sort you're
discussing but other than the rudiments of theory don't have enough
direct application experience in the area to provide any more on the
algorithms specifically for such large datasets.  I do recall back when
I was still reading JASA and Technometrics and such regularly I came
across some of that kind of thing but I've not even unpacked them since
the move back to the farm so no practical way to look for what I'm
thinking I recall as it wasn't in direct area of interest so wouldn't
have made it out of the original journals into the clippings files...

I guess I'd start w/ some literature searches though, if posed w/ the task.

--
none1568 (7455) 7/4/2007 5:42:23 PM
