Text Mining (Index)

Dear all,
I have two datasets.
One dataset (FIRST) includes word lists acquired from document
scanning.
The other dataset (SECOND) has the keywords.
My goal is to write a program showing whether each word in the FIRST
dataset contains a keyword from the SECOND dataset. Do you have any
ideas? Dataset examples are as follows.

data First;
input word$1-23;
cards;
public management
public administration
FinancialManagement
SecurityAdministration
;run;
data Second;
input keyword$1-20;
cards;
public
management
administration
;
run;
Reply kilkon (9) 12/2/2009 2:14:31 PM

What is your result dataset going to look like?  There are a lot of ways to
do this, but it depends on your desired results; and whether capitalization,
spaces, punctuation, etc. are relevant.  Give us a sample of what the result
dataset should look like, and answer these questions...

Is "Public" a match to "public"?

Is "FinancialAdministration" a match to "Administration"?

Is "Jaguar's" a match to "Jaguars"?

Etc...

-Joe

Reply snoopy369 (1752) 12/2/2009 2:41:23 PM


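Your problem has two parts.
First, a many-to-many merge of words and keywords. Second, how to test
whether a word contains a keyword.
One way is with a PROC SQL full join and the INDEX function.

proc sql noprint;
   /* TRIM drops the trailing blanks that pad keyword; without it INDEX
      searches for the padded value and finds no match.
      INDEX is case-sensitive. */
   create table wanted as select *
     from first full join second
     on index(first.word, trim(second.keyword)) > 0
   ;
quit;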
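
If capitalization should not matter, one possible variant of the join above is sketched here. It is only an illustration under assumptions: the table name PAIRS and the column name FOUND are made up, and a cross join is used so that every word/keyword pair gets an explicit 0/1 flag. The FIND function takes the 'i' modifier (ignore case) and the 't' modifier (trim trailing blanks from both arguments).

```sas
proc sql;
   /* one row per word/keyword pair; found = 1 when the word contains the keyword */
   create table pairs as
   select a.word,
          b.keyword,
          (find(a.word, b.keyword, 'it') > 0) as found
   from first as a, second as b
   order by a.word, b.keyword;
quit;
```

With the sample data, FinancialManagement is flagged for the keyword management only because case is ignored; a case-sensitive search would miss it.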


Reply Tom 12/2/2009 3:22:12 PM

There is a good reason that there is a product called
Text Miner.

This page gives some overview, but the program shown is not robust
with a large list of search words:

http://www.sascommunity.org/wiki/Finding_keywords_in_text
(short link: http://tinyurl.com/ygw38ma)

thinking out loud:

set up keywords:
* make a list of words to find
* standardize casing: lowcase, upcase
* load into lookup table:
  * hash table
  * temporary array
  * format

searching:
* read text to search
* standardize casing: lowcase, upcase
* remove special characters: translate
* remove double blanks: compbl
* search each line for each word
  * indexw
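
The outline above could be sketched roughly as follows, using the temporary-array option for the lookup table. This is a minimal sketch under assumptions, not a tested production program: the array is sized for the three sample keywords, the punctuation list passed to TRANSLATE is arbitrary, and the names WANTED, CLEAN, and FOUND are made up for illustration.

```sas
data wanted;
   /* lookup table: lowercased keywords in a temporary array
      (size 3 is an assumption that fits the sample data) */
   array kw[3] $20 _temporary_;
   if _n_ = 1 then do i = 1 to nkw;
      set second nobs=nkw;
      kw[i] = lowcase(keyword);
   end;

   set first;

   /* standardize the text: lowercase, punctuation to blanks, collapse blanks */
   length clean $23;
   clean = compbl(translate(lowcase(word), '      ', '.,;:!?'));

   /* test every keyword against the cleaned text; one output row per pair */
   do i = 1 to dim(kw);
      keyword = kw[i];
      found = (indexw(clean, strip(kw[i])) > 0);
      output;
   end;
   drop i clean;
run;
```

INDEXW matches whole words only, so "public management" matches management but FinancialManagement does not. If substring matches are wanted, INDEX or FIND would be used in its place.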

Ron Fehd  the algorithm maven  CDC Atlanta GA USA RJF2 at cdc dot gov
Reply rjf2 (3354) 12/2/2009 3:37:35 PM
