Text Mining (Index)
Dear all,
I have two datasets.
One dataset (FIRST) contains word lists acquired from document
scanning.
The other dataset (SECOND) contains the keywords.
My goal is to write a program that shows whether each word in the FIRST
dataset contains a keyword from the SECOND dataset. Do you have any
ideas? Example datasets follow.
data First;
input word$1-23;
cards;
public management
public administration
FinancialManagement
SecurityAdministration
;run;
data Second;
input keyword$1-20;
cards;
public
management
administration
;
run;
kilkon (9) | 12/2/2009 2:14:31 PM

What is your result dataset going to look like? There are a lot of ways to
do this, but it depends on your desired results; and whether capitalization,
spaces, punctuation, etc. are relevant. Give us a sample of what the result
dataset should look like, and answer these questions...
Is "Public" a match to "public"?
Is "FinancialAdministration" a match to "Administration"?
Is "Jaguar's" a match to "Jaguars"?
Etc...
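
To make these questions concrete, here is a minimal sketch of how the answers change the comparison. The variable names (exact, anycase, same) are made up for illustration; INDEX, LOWCASE and COMPRESS are standard SAS functions.

```sas
data _null_;
   /* INDEX is case-sensitive: lowercase "management" is not found */
   exact = index('FinancialManagement', 'management');
   /* Lowercasing the text first lets the same search succeed */
   anycase = index(lowcase('FinancialManagement'), 'management');
   /* COMPRESS can strip punctuation such as apostrophes before comparing */
   same = (compress("Jaguar's", "'") = 'Jaguars');
   put exact= anycase= same=;
run;
```

The log should show exact=0 anycase=10 same=1, so whether case and punctuation count as relevant decides which of these comparisons belongs in the final program.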
-Joe
snoopy369 (1752) | 12/2/2009 2:41:23 PM

Your problem has two parts.
First, a many-to-many merge of words and keywords. Second, a test of
whether a word contains a keyword.
One way is PROC SQL with a full join and the INDEX function.
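proc sql noprint;
/* Pair each word with every keyword it contains; the full join also */
/* keeps unmatched words and keywords, with missing values on the other side. */
create table wanted as select *
from first full join second
/* TRIM removes the trailing blanks that pad KEYWORD and would block a match. */
/* INDEX is case-sensitive; wrap both arguments in LOWCASE() to ignore case. */
on index(first.word, trim(second.keyword)) > 0
;
quit;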
Tom | 12/2/2009 3:22:12 PM

There is a good reason that there is a product called Text Miner.
This page gives some overview, but the program shown is not robust
with a large list of search words:
http://www.sascommunity.org/wiki/Finding_keywords_in_text
(short link: http://tinyurl.com/ygw38ma)
thinking out loud:

set up keywords:
* make a list of words to find
* standardize casing: lowcase, upcase
* load into lookup table:
  * hash table
  * temporary array
  * format

searching:
* read text to search
* standardize casing: lowcase, upcase
* remove special characters: translate
* remove double blanks: compbl
* search each line for each word:
  * indexw
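
The outline above could be sketched roughly as follows, assuming the FIRST and SECOND datasets from the original post. The names matches, clean and k are invented for illustration, and the punctuation list passed to TRANSLATE is only a guess at what "special characters" should cover. For brevity the keywords are re-read with POINT= rather than loaded into a hash table, temporary array or format.

```sas
data matches;
   set first;
   /* Standardize casing, turn punctuation into blanks, collapse repeated blanks */
   clean = compbl(translate(lowcase(word), ' ', '.,;:!?'));
   /* Compare the cleaned text against every keyword in SECOND */
   do k = 1 to nkw;
      set second point=k nobs=nkw;
      /* INDEXW matches whole words only; STRIP drops the keyword's padding */
      if indexw(clean, lowcase(strip(keyword))) > 0 then output;
   end;
   drop k;
run;
```

Because INDEXW only matches on word boundaries, a run-together entry such as FinancialManagement would not match management here. If such entries should count, INDEX on the cleaned text is the function to use instead.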
Ron Fehd the algorithm maven CDC Atlanta GA USA RJF2 at cdc dot gov
rjf2 (3354) | 12/2/2009 3:37:35 PM