data mining


In data mining you are typically dealing with millions of rows of data,
and if you are talking about internet browsing data, even 100 million
rows or more.

Let us say that you have 30 attributes (explanatory variables) in each
row plus a response variable (0 = no response, 1 = response).

There are all kinds of analyses one can do on such a data set, and I
would like some advice on designing a program to do one of them
("Decision Trees") with Fortran.

We want to split the original data set into two subsets (after which
the analysis can be repeated on each of the two subsets) by splitting
on any one of the thirty attributes.  If it is a true-false type
attribute then there is only one way to split on that attribute, but in
other cases there would be more choices.  The aim of the split is to
create "pure" subsets, so that one subset has more responders (based on
count, proportion, or perhaps other measures) than the other.  The
top-level attribute to split on would be the one that makes that
difference as large as possible.  There might be some constraints as to
how big or small each subset can be.
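
To make the split rule above concrete, here is a minimal sketch in Python (chosen only for brevity; the loops map directly onto Fortran arrays). It assumes each attribute has already been reduced to per-value counts, that attribute values are ordered codes 0..K-1, and that "difference" means the absolute gap in response proportion between the two subsets. The names best_threshold, best_attribute and min_size are made up for illustration.

```python
def best_threshold(totals, responders, min_size=1):
    """Best cut point for one attribute.

    totals[v] and responders[v] are the row count and responder count for
    attribute value v. Rows with value <= cut go to the left subset.
    Returns (gap, cut), or None if no cut satisfies the size constraint.
    """
    min_size = max(min_size, 1)          # also guards against division by zero
    n_all, r_all = sum(totals), sum(responders)
    n_left = r_left = 0
    best = None
    for cut in range(len(totals) - 1):
        n_left += totals[cut]
        r_left += responders[cut]
        n_right, r_right = n_all - n_left, r_all - r_left
        if n_left < min_size or n_right < min_size:
            continue
        gap = abs(r_left / n_left - r_right / n_right)
        if best is None or gap > best[0]:
            best = (gap, cut)
    return best


def best_attribute(tables, min_size=1):
    """tables maps attribute name -> (totals, responders).

    Returns (gap, name, cut) for the attribute whose best cut gives the
    largest gap in response rate, or None if nothing can be split.
    """
    best = None
    for name, (totals, responders) in tables.items():
        split = best_threshold(totals, responders, min_size)
        if split is not None and (best is None or split[0] > best[0]):
            best = (split[0], name, split[1])
    return best
```

A true-false attribute has exactly one candidate cut, as described above; an attribute with K ordered values has K-1. Other purity measures (Gini, entropy, chi-square) could replace the gap calculation without changing the structure.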

If we are talking about a small number of rows then this is a pretty
elementary problem as far as I can see it.  There are free and
commercial packages that offer to do this - but if one were to do this
from scratch in Fortran, I would appreciate the group's suggestions as
to how this kind of volume of data can be handled.
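
One possible way to cope with the volume, sketched below under stated assumptions: a candidate split can be scored from counts alone, so a single sequential pass over the file is enough to build a small table per attribute, and memory use does not grow with the number of rows. The sketch assumes whitespace-separated integer codes with the response as the last field; accumulate_counts and parse_lines are illustrative names, not part of any package.

```python
from collections import defaultdict


def parse_lines(lines):
    """Yield (attrs, response) from whitespace-separated integer fields."""
    for line in lines:
        *attrs, response = (int(tok) for tok in line.split())
        yield attrs, response


def accumulate_counts(rows, n_attrs):
    """One pass over the rows.

    Returns a list with one table per attribute; each table maps an
    attribute value to [row_count, responder_count].
    """
    tables = [defaultdict(lambda: [0, 0]) for _ in range(n_attrs)]
    for attrs, response in rows:
        for j in range(n_attrs):
            cell = tables[j][attrs[j]]
            cell[0] += 1          # rows seen with this value
            cell[1] += response   # responders among them (response is 0 or 1)
    return tables
```

Passing an open file object to parse_lines reads it line by line rather than loading it whole. In Fortran the same tables could be a plain integer array indexed by (value, attribute), filled inside a READ loop.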

Reply analyst41 (232) 7/1/2007 1:05:03 PM

<analyst41@hotmail.com> wrote in message 
news:1183295103.157455.15890@g4g2000hsf.googlegroups.com...
> <snip>

My own opinion is that so-called "data mining" is methodologically unsound,
to say nothing of its legality.  If you are "harvesting" millions of rows of
data, you might as well get your subsets by using the 29 dimensions of
compatibility.  My suggestion for what to do with the data is to throw it out
before a court tells you to.
--
Wade Ward 


Reply invalid (121) 7/1/2007 6:16:24 PM


On Jul 1, 2:16 pm, "Wade Ward" <inva...@invalid.nyet> wrote:
> <snip>
>
> My own opinion is that so-called "data mining" is methodologically unsound,
> to say nothing of its legality.  If you are "harvesting" millions of rows of
> data, you might as well get your subsets by using the 29 dimensions of
> compatibility.  My suggestion for what to do with the data is throw it out
> before a court tells you to.
> --
> Wade Ward

Let us assume that the data can be "mined" legally.
From what I understand, when credit card companies send those letters
outlining our "privacy rights", our usage of the card after receiving
those letters amounts to giving them our permission to "mine" our
purchase behavior.

If we split on all thirty attributes, the final subsets would most
likely be too small to be statistically stable.  The trick is to find
5-10 attributes that would give you the maximum differentiation among
the final subsets, while making sure that each subset is large enough
to allow a stable "scoring" of any customer/prospect who falls into
that subset.
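
The stopping rule described here can be sketched as a recursive partition that refuses any split leaving a subset below a minimum size and stops after a fixed depth. This is a toy, in-memory version for 0/1 attributes only, written in Python for brevity; grow, min_size and max_depth are illustrative names, and the response-rate gap is just one possible measure of differentiation.

```python
def rate(rows):
    """Fraction of responders in a non-empty list of (attrs, response)."""
    return sum(r for _, r in rows) / len(rows)


def grow(rows, attrs_left, min_size, max_depth):
    """Recursively split rows on 0/1 attributes.

    Each node records its size and response rate (its "score"). A split is
    accepted only if both children have at least min_size rows and their
    response rates actually differ.
    """
    min_size = max(min_size, 1)
    node = {"n": len(rows), "rate": rate(rows)}
    if max_depth == 0:
        return node
    best = None
    for j in attrs_left:
        left = [row for row in rows if row[0][j] == 0]
        right = [row for row in rows if row[0][j] == 1]
        if len(left) < min_size or len(right) < min_size:
            continue
        gap = abs(rate(left) - rate(right))
        if gap > 0 and (best is None or gap > best[0]):
            best = (gap, j, left, right)
    if best is None:
        return node                      # leaf: no admissible split
    _, j, left, right = best
    rest = [a for a in attrs_left if a != j]
    node["split"] = j
    node["left"] = grow(left, rest, min_size, max_depth - 1)
    node["right"] = grow(right, rest, min_size, max_depth - 1)
    return node
```

With max_depth somewhere in the 5-10 range and a generous min_size, the leaves play the role of the final subsets, and a new customer or prospect would be scored with the rate stored in the leaf they fall into.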

Reply analyst41 (232) 7/1/2007 7:54:50 PM

analyst41@hotmail.com wrote:
> <snip>
> 
> Let us assume that the data being "mined" can be done so legally.
> From what I understand, when credit card companies send those letters
> outlining our "privacy rights", our usage of the card after receiving
> those letters amounts to giving them our permission to "mine" our
> purchase behavior.
> 
> If we split on all thirty attributes, the final subsets would most
> likely be too small to be statistically stable.  The trick is to find
> 5-10 attributes that would give you the maximum differentiation among
> the final subsets, while making sure that each subset is large enough
> to allow a stable "scoring" of any customer/prospect who falls in to
> that subset.

From what little I know, this is essentially the k-means clustering problem.
I don't know of any Fortran source, but a related search some time ago
led me to a C/C++ open-source clustering library.  Whether it will be
of any benefit to your problem I don't know...

--

http://bonsai.ims.u-tokyo.ac.jp/%7Emdehoon/software/cluster/software.htm#source

Reply none1568 (6816) 7/1/2007 9:16:16 PM

On Jul 1, 4:16 pm, dpb <n...@non.net> wrote:

<snip>

> What little I know is it is essentially the k-means clustering problem.
>   I don't know of Fortran source, but a related search some time ago had
> led me to the C/C++ open-source clustering library.

There is plenty of Fortran code for clustering, for example the k-means
clustering code at http://people.scs.fsu.edu/~burkardt/f_src/kmeans/kmeans.html

Reply beliavsky (2207) 7/1/2007 9:43:57 PM

What you are after is cluster analysis and segmentation analysis
methods.

I'm interested in such problems since I'm CEO of Tau Systems, which has
supplied market research data analysis software since 1972.
And yes, all we sell is written in Fortran (F77).

It is very easy to arrive at statistically incorrect conclusions by
using accurate tools the wrong way. To continue your project you
MUST use a professional, experienced statistician as an advisor.

By all means write programs to manage and massage your data, and to
sort and select data into specified subsets. But don't try writing
the statistical analysis part.
Buy a commercial dedicated segmentation analysis package, or use SAS
or similar vast systems.

Lastly I would point out that ASCII-based systems are readable, but
binary systems are inherently 128 times faster in parallel processing
and occupy one eighth of the storage medium space and access time. Also,
binary systems can have fixed-format fields, whereas ASCII systems
with multiple-response replies taking up variable lengths need a data
map as part of the file.
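
As an illustration of the fixed-format point only (it makes no attempt to confirm the speed and space figures quoted above), here is a sketch of a fixed-width binary record in Python. It assumes every attribute can be coded as one unsigned byte (0-255) and that the response takes one more byte, giving 31 bytes per row; N_ATTRS, RECORD, pack_row and read_row are illustrative names.

```python
import struct

N_ATTRS = 30
# 30 one-byte attribute codes followed by a one-byte response flag.
RECORD = struct.Struct(f"{N_ATTRS + 1}B")


def pack_row(attrs, response):
    """Encode one row as a fixed-width 31-byte record."""
    return RECORD.pack(*attrs, response)


def read_row(buf, i):
    """Decode record number i (0-based) straight from its byte offset."""
    fields = RECORD.unpack_from(buf, i * RECORD.size)
    return list(fields[:N_ATTRS]), fields[N_ATTRS]
```

Because every record has the same length, row i sits at byte offset i * 31 and can be read without scanning or parsing what precedes it. The nearest Fortran equivalent would presumably be an unformatted direct-access file opened with a fixed record length.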


Reply tbwright (1098) 7/1/2007 10:53:13 PM

On Jul 1, 5:53 pm, Terence <tbwri...@cantv.net> wrote:

<snip>

> It is very easy to arrive at statistically incorrect conclusions,
> using accurarate tools the wrong way. To continue in your project you
> MUST use a professional, experienced statistician as advisor.
>
> By all means write programs to manage and massage your data, and to
> sort and  select data into specified subsets. But don't try writing
> the statistical analysis part.
> Buy a commercial dedicated segmentation analysis package, or use SAS
> or similar vast systems.

R, which is open source (written in C and Fortran), may be of comparable
quality to commercial statistical software.

Reply beliavsky (2207) 7/2/2007 12:43:42 AM

On Jul 1, 8:43 pm, Beliavsky <beliav...@aol.com> wrote:
> On Jul 1, 5:53 pm, Terence <tbwri...@cantv.net> wrote:
>
> <snip>
>
> > It is very easy to arrive at statistically incorrect conclusions,
> > using accurarate tools the wrong way. To continue in your project you
> > MUST use a professional, experienced statistician as advisor.
>
> > By all means write programs to manage and massage your data, and to
> > sort and  select data into specified subsets. But don't try writing
> > the statistical analysis part.
> > Buy a commercial dedicated segmentation analysis package, or use SAS
> > or similar vast systems.
>
> R, which is open source (C and Fortran) may be of comparable quality
> to commercial statistical software.

Thanks to all for the replies.  I'll check out the clustering Fortran
code and see if it can be adapted to what I want to do.

Reply analyst41 (232) 7/2/2007 10:49:00 AM

<analyst41@hotmail.com> wrote in message 
news:1183373340.063888.177930@o61g2000hsh.googlegroups.com...
> On Jul 1, 8:43 pm, Beliavsky <beliav...@aol.com> wrote:
>> On Jul 1, 5:53 pm, Terence <tbwri...@cantv.net> wrote:
>>
>> <snip>
>>
>> > It is very easy to arrive at statistically incorrect conclusions,
>> > using accurarate tools the wrong way. To continue in your project you
>> > MUST use a professional, experienced statistician as advisor.
>>
>> > By all means write programs to manage and massage your data, and to
>> > sort and  select data into specified subsets. But don't try writing
>> > the statistical analysis part.
>> > Buy a commercial dedicated segmentation analysis package, or use SAS
>> > or similar vast systems.
>>
>> R, which is open source (C and Fortran) may be of comparable quality
>> to commercial statistical software.
>
> Thanks to all for the replies.  I'll check out the clsutering Fortran
> code and see if it can be adapted to what I want to do.
And how is clustering going to restore order in these subsets of strip-mined 
data, take for instance {'U','K','C','F'}?  Your output will be exactly what 
you tell it to be.
-- 
ww 


Reply invalid (121) 7/2/2007 8:58:27 PM

> And how is clustering going to restore order in these subsets of strip-mined
> data, take for instance {'U','K','C','F'}?  Your output will be exactly what
> you tell it to be.
> --
> Wade Ward

The point is, cluster analysis in n-dimensional space will locate the
important groups (I suggest starting with n <= 10).

Segmentation analysis allows the identification and marking (with a
code) of members of those groups for selecting out the members, or by
reaching ranking score levels of partial sums of given scoring
coefficients.
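
One possible reading of the scoring idea, sketched in Python: each case gets a score that is a weighted sum of its attribute values, and the score is then ranked against a set of cutoffs to produce a segment code. The coefficients and cutoffs would have to come from a prior analysis; the function names and the numbers in the usage note are invented for illustration.

```python
from bisect import bisect_right


def score(attrs, coeffs):
    """Weighted sum of attribute values using the given scoring coefficients."""
    return sum(c * a for c, a in zip(coeffs, attrs))


def segment_code(value, cutoffs):
    """Rank a score against ascending cutoffs.

    Returns 0 for scores below the first cutoff, 1 for scores from the
    first cutoff up to (but not including) the second, and so on.
    """
    return bisect_right(cutoffs, value)
```

For example, with coefficients [2, 1, 3] and cutoffs [2, 4], a case with attributes [1, 0, 1] scores 5 and lands in segment 2, the top level.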

Most commercial software (for surveys) will handle up to about 65k
cases (6000 is rare; the whole USA is covered by 1024 households by
A.C. Nielsen; the largest I ever handled was a survey of US Veterans
with several hundreds of thousands).

Our data mining poster has no idea of what he is getting into with
100M cases.
But I DO have a sorting program for that.....

( And just what is  {'U','K','C','F'}?? )



Reply tbwright (1098) 7/3/2007 5:08:33 AM

"Terence" <tbwright@cantv.net> wrote in message 
news:1183439313.718825.260510@j4g2000prf.googlegroups.com...
>> And how is clustering going to restore order in these subsets of strip-mined
>> data, take for instance {'U','K','C','F'}?  Your output will be exactly what
>> you tell it to be.
>
> The point is, cluster analysis in n-dimensional space will locate the
> important groups (suggest start with n<=10) .
>
> Segmentation analysis allows the identification and marking (with a
> code) of members of those groups for selecting out the members, or by
> reaching ranking score levels of partial sums of given scoring
> coeefficients.
This sounds like statistics with a judge, which is not methodologically 
flawed.

> Most commercial software (for surveys) will handle up to about 65k
> cases (6000 is rare; the whole USA is covered by 1024 households by
> A.J. Nielson; the largest I ever handled was a survey of US Veterans
> with several hendreds of thousands).
>
> Our data mining poster has no idea of what he is getting into with
> 100M cases.
> But I DO have a sorting program for that.....
>
> ( And just what is  {'U','K','C','F'}?? )
(You asked).  It's a clusterfuck, or any other of 4! outcomes.
-- 
WW 


Reply invalid (121) 7/3/2007 5:07:06 PM

<analyst41@hotmail.com> wrote in message 
news:1183319690.068709.179330@c77g2000hse.googlegroups.com...
> On Jul 1, 2:16 pm, "Wade Ward" <inva...@invalid.nyet> wrote:
>> <snip>
>>
>> My own opinion is that so-called "data mining" is methodologically unsound,
>> to say nothing of its legality.  If you are "harvesting" millions of rows of
>> data, you might as well get your subsets by using the 29 dimensions of
>> compatibility.  My suggestion for what to do with the data is throw it out
>> before a court tells you to.
I just sorted my socks using the 29 dimensions and got all matches.  Does 
this make me self-compatible?

> Let us assume that the data being "mined" can be done so legally.
> From what I understand, when credit card companies send those letters
> outlining our "privacy rights", our usage of the card after receiving
> those letters amounts to giving them our permission to "mine" our
> purchase behavior.
Let's not.
-- 
mtp


Reply invalid163 (950) 7/4/2007 12:08:23 AM

> > ( And just what is  {'U','K','C','F'}?? )
>
> (You asked).  It's a clusterfuck, or any other of 4! outcomes.


Ah, No! It's a sorting problem, or even a permutation problem, but not
clustering!


Reply tbwright (1098) 7/4/2007 2:49:42 AM

> > Let us assume that the data being "mined" can be done so legally.
> > From what I understand, when credit card companies send those letters
> > outlining our "privacy rights", our usage of the card after receiving
> > those letters amounts to giving them our permission to "mine" our
> > purchase behavior.

If you read the really small print, you'll find they "might only pass
it on to associated companies".
And guess what companies they are associated with?

Reply tbwright (1098) 7/4/2007 2:51:45 AM

Terence <tbwright@cantv.net> wrote:

> > > Let us assume that the data being "mined" can be done so legally.
> > > From what I understand, when credit card companies send those letters
> > > outlining our "privacy rights", our usage of the card after receiving
> > > those letters amounts to giving them our permission to "mine" our
> > > purchase behavior.
> 
> If you read the really small print, you'll find they "might only pass
> it on to associated companies".
> And guess what companies they are associated with?

You'll also find that they might do other things that they are legally
allowed to. It is often worded in such a way that you might think it is
things that they are legally required to do, but it invariably says
"allowed", not "required". They say that they can do anything with the
data that they are allowed to do, and they are allowed to do pretty much
anything with it as long as they cover it in the letter. Translation:
they reserve the right to do whatever they feel like.

My "language lawyer" skills help me in reading the fine print of
contracts and laws as well.

I'm afraid I'm wandering too far off topic, so I'll stop. Couldn't resist
the one post. (Besides, from the above short note I get the impression
it is one area where Terence and I have at least similar perspectives.
:-))

-- 
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain
Reply nospam47 (9742) 7/4/2007 3:21:58 AM

 ...one area where Terence and I have at least simillar perspectives.
Richard Maine

(Perspective) == (same object seen from different viewpoints)?

>> obj <<

But remove the object and head-on crash?   :o)>

Reply tbwright (1098) 7/4/2007 5:02:42 AM

"Terence" <tbwright@cantv.net> wrote in message 
news:1183525362.206686.310490@o11g2000prd.googlegroups.com...
> ...one area where Terence and I have at least simillar perspectives.
> Richard Maine
>
> (Perspective) == (same object seen from different viewpoints)?
>
>>> obj <<
>
> But remove the object and head-on crash?   :o)>
But that's part of the good news with geometries that are post-Gaussian. 
The object defines the point of view, so almost anything works for frames of 
reference to jibe.
-- 
WW 


Reply invalid (121) 7/4/2007 8:12:38 AM

On Jul 3, 11:21 pm, nos...@see.signature (Richard Maine) wrote:
> <snip>

Yup - the regulars here will be ready with a million posts on hoary
trivialities like "Are 'go to' statements good or bad?" - but pose
a practical problem such as repeatedly partitioning a large data set
- and one mostly gets infantile responses.


Reply analyst41 (232) 7/4/2007 5:00:18 PM

analyst41@hotmail.com wrote:

> <snip>
>
> Yup - the regulars here will be ready with a million posts on hoary
> trivialities like " Are 'go to' statements good or bad ?" - but pose
> a  practical problem such as repeatedly partitioning a large data set
> - and one mostly gets infantile responses.

I consider the tone of responses mostly well reasoned and civil.  I do
agree that there seems to be a misconception of reality with regard to
over-emphasis of a goal of total portability, at times.



-- 

Gary Scott
mailto:garylscott@sbcglobal dot net

Fortran Library:  http://www.fortranlib.com

Support the Original G95 Project:  http://www.g95.org
-OR-
Support the GNU GFortran Project:  http://gcc.gnu.org/fortran/index.html

If you want to do the impossible, don't hire an expert because he knows 
it can't be done.

-- Henry Ford
Reply garylscott (1357) 7/4/2007 5:27:29 PM

analyst41@hotmail.com wrote:
....

> Yup - the regulars here will be ready with a million posts on hoary
> trivialities like " Are 'go to' statements good or bad ?" - but pose
> a  practical problem such as repeatedly partitioning a large data set
> - and one mostly gets infantile responses.

Hey, it's still usenet... :)

I know there are folks doing humongous data mining of the sort you're 
discussing but other than the rudiments of theory don't have enough 
direct application experience in the area to provide any more on the 
algorithms specifically for such large datasets.  I do recall back when 
I was still reading JASA and Technometrics and such regularly I came 
across some of that kind of thing but I've not even unpacked them since 
the move back to the farm so no practical way to look for what I'm 
thinking I recall as it wasn't in direct area of interest so wouldn't 
have made it out of the original journals into the clippings files...

I guess I'd start w/ some literature searches though, if posed w/ the task.

--
Reply none1568 (6816) 7/4/2007 5:42:23 PM
