Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.
Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"
I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.
Any help is much appreciated, thanks
|
|
0
|
|
|
|
Reply
|
sfu (8)
|
9/12/2003 12:22:10 PM |
|
sfu wrote:
> Hi
> I'd like to import a text file containing common words such as
> "at","the","and" etc.
>
> Then if i take a String "I want to visit the shops today and buy some
> clothes", i would like to strip out the common words to become "visit
> shops today buy clothes"
>
> I know to speed up searching for "at" within the string i can use a
> Boyer Moore algorithm which returns the index of the first "at". But
> im unsure how i would use this to remove all occurances of.
>
> Any help is much appreciated, thanks
Hi,
I am bored today anyway, so I made this little program.
Is this what you want?
Who is that chap Boyer Moore by the way?
Regards,
Erwin
public class WordFilter {
public WordFilter() {
}
public static void main(String[] args) {
String strOrg = "I want to visit the shops today and buy some clothes and";
String filterThis[] = {" at " , " the " , " and "};
System.out.println("before: "+strOrg);
for(int iFilt = 0; iFilt<filterThis.length; iFilt++) {
int nWordLength = filterThis[iFilt].length();
int nFoundPos = strOrg.indexOf(filterThis[iFilt]);
while (nFoundPos != -1){
System.out.println("found "+filterThis[iFilt]+" at position:"+nFoundPos);
strOrg = strOrg.substring(0,nFoundPos)+"
"+strOrg.substring(nFoundPos+nWordLength);
nFoundPos = strOrg.indexOf(filterThis[iFilt]);
}
}
System.out.println("after: "+strOrg);
}
}
|
|
0
|
|
|
|
Reply
|
since_humans_read_this_I_am_spammed_too_much (2368)
|
9/12/2003 2:26:13 PM
|
|
sfu wrote:
> Hi
> I'd like to import a text file containing common words such as
> "at","the","and" etc.
The regular expressions package in java.util.regex sounds like what
you're after. Use Pattern.compile to create a Pattern object for a
given regular expression. Create a Matcher from that Pattern with
Pattern.matcher, and then replaceAll on that Matcher (replacing with
""). That matcher will come back with a suitably formed String.
A regex like "\b((and)|(or)|(at))\b" is a starting point.
Hope that helps,
Danny
|
|
0
|
|
|
|
Reply
|
dannywoodz7839 (11)
|
9/12/2003 2:28:18 PM
|
|
"sfu" <sfu@stephenfu.co.uk> wrote in message
news:9d9e0d80.0309120422.5959de97@posting.google.com...
> Hi
> I'd like to import a text file containing common words such as
> "at","the","and" etc.
>
> Then if i take a String "I want to visit the shops today and buy some
> clothes", i would like to strip out the common words to become "visit
> shops today buy clothes"
>
> I know to speed up searching for "at" within the string i can use a
> Boyer Moore algorithm which returns the index of the first "at". But
> im unsure how i would use this to remove all occurances of.
>
> Any help is much appreciated, thanks
If I had that problem I would use a StringTokenizer to take apart
the String and a HashMap filled with the common words to detect
which to discard. Reassemble the words that pass in a StringBuffer.
WBB
|
|
0
|
|
|
|
Reply
|
wbrogden (214)
|
9/12/2003 3:39:49 PM
|
|
Erwin Moller <since_humans_read_this_I_am_spammed_too_much@spamyourself.com> wrote in message news:<3f61d774$0$58705$e4fe514c@news.xs4all.nl>...
> sfu wrote:
>
> > Hi
> > I'd like to import a text file containing common words such as
> > "at","the","and" etc.
> >
> > Then if i take a String "I want to visit the shops today and buy some
> > clothes", i would like to strip out the common words to become "visit
> > shops today buy clothes"
> >
> > I know to speed up searching for "at" within the string i can use a
> > Boyer Moore algorithm which returns the index of the first "at". But
> > im unsure how i would use this to remove all occurances of.
> >
> > Any help is much appreciated, thanks
>
> Hi,
>
>
> I am bored today anyway, so I made this little program.
> Is this what you want?
> Who is that chap Boyer Moore by the way?
>
> Regards,
> Erwin
>
>
>
> public class WordFilter {
>
> public WordFilter() {
> }
>
> public static void main(String[] args) {
> String strOrg = "I want to visit the shops today and buy some clothes and";
> String filterThis[] = {" at " , " the " , " and "};
> System.out.println("before: "+strOrg);
>
> for(int iFilt = 0; iFilt<filterThis.length; iFilt++) {
> int nWordLength = filterThis[iFilt].length();
> int nFoundPos = strOrg.indexOf(filterThis[iFilt]);
> while (nFoundPos != -1){
> System.out.println("found "+filterThis[iFilt]+" at position:"+nFoundPos);
> strOrg = strOrg.substring(0,nFoundPos)+"
> "+strOrg.substring(nFoundPos+nWordLength);
> nFoundPos = strOrg.indexOf(filterThis[iFilt]);
> }
> }
> System.out.println("after: "+strOrg);
> }
> }
Cool thats great, Boyer-Moore is a fast string searching algorithm
which i could replace indexOf to improve search times.
Thanks for everyones ideas and help
|
|
0
|
|
|
|
Reply
|
sfu (8)
|
9/12/2003 5:22:32 PM
|
|
On 12 Sep 2003 05:22:10 -0700, sfu@stephenfu.co.uk (sfu) wrote or
quoted :
>I know to speed up searching for "at" within the string i can use a
>Boyer Moore algorithm which returns the index of the first "at". But
>im unsure how i would use this to remove all occurances of.
I would think the way you would implement this is to tokenise, break
the text into words.
You can do this with a Regex.split or a StringTokenizer, or with a
simple indexOf( ' ' );
Then you look up each word in a HashSet. Only the words you want to
expunge are there. Then write the words you keep to a new file.
See http://mindprod.com/jgloss/hashtable.html
and http://mindprod.com/fileio.html
--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
|
|
0
|
|
|
|
Reply
|
roedy (1019)
|
9/12/2003 8:30:08 PM
|
|
sfu writes:
> Hi
> ...
> I know to speed up searching for "at" within the string i can use a
> Boyer Moore algorithm which returns the index of the first "at". But
Using BM for searching the "at" wouldn't speed up things. Actually, it
would significantly slow them down. If the speed is of concern, you
should look into building a DFA from the words to search.
Regards
|
|
0
|
|
|
|
Reply
|
samokhodkin (8)
|
9/13/2003 10:28:35 AM
|
|
"S.A. Samokhodkin" <samokhodkin@sourceforge.net> wrote in message news:<bjurgj$v74$1@news.aha.ru>...
> sfu writes:
> > Hi
> > ...
> > I know to speed up searching for "at" within the string i can use a
> > Boyer Moore algorithm which returns the index of the first "at". But
>
> Using BM for searching the "at" wouldn't speed up things. Actually, it
> would significantly slow them down. If the speed is of concern, you
> should look into building a DFA from the words to search.
>
> Regards
it would slow things down? doh! i thought it was meant to be one of
the fastest string searching algorithms, especially over java's
indexOf. :(
|
|
0
|
|
|
|
Reply
|
sfu (8)
|
9/13/2003 1:53:47 PM
|
|
"S.A. Samokhodkin" <samokhodkin@sourceforge.net> wrote in message news:<bjurgj$v74$1@news.aha.ru>...
> sfu writes:
> > Hi
> > ...
> > I know to speed up searching for "at" within the string i can use a
> > Boyer Moore algorithm which returns the index of the first "at". But
>
> Using BM for searching the "at" wouldn't speed up things. Actually, it
> would significantly slow them down. If the speed is of concern, you
> should look into building a DFA from the words to search.
>
> Regards
yeah im looking for a fast way of doing this because i want to strip
out all the common words from an essay which could be > 2000 words for
example.
excuse me for my ignorance but what is a DFA?
thanks for your suggestions thou!
|
|
0
|
|
|
|
Reply
|
sfu (8)
|
9/13/2003 2:03:59 PM
|
|
On 13 Sep 2003 06:53:47 -0700, sfu wrote:
> it would slow things down? doh! i thought it was meant to be one of
> the fastest string searching algorithms, especially over java's
> indexOf. :(
Like many things, it depends.
BM *is* fast, for long search strings. The longer the better in fact.
For search strings of length 1 it has the same complexity as
String.indexOf(), but with a higher constant factor due to some added
complexity in the search loop.
BM also has a high initial cost associated with setting up the two
delta tables, so for short strings it loses unless you are searching
for the same string many times in a relatively large body of text, and
have an implementation that lets you reuse the delta tables (to take
advantage of the fact that the search string hasn't changed).
String.indexOf() is fine when the body of text isn't huge and the
search string is relatively short.
/gordon
--
[ do not email me copies of your followups ]
g o r d o n + n e w s @ b a l d e r 1 3 . s e
|
|
0
|
|
|
|
Reply
|
not108 (1060)
|
9/13/2003 2:10:54 PM
|
|
On 13 Sep 2003 06:53:47 -0700, sfu@stephenfu.co.uk (sfu) wrote or
quoted :
>it would slow things down? doh! i thought it was meant to be one of
>the fastest string searching algorithms, especially over java's
>indexOf. :(
See http://mindprod.com/products.html#BOYER
It is only faster in circumstance where the overhead to set it up pays
off.
There is a variant that can search for many words at once.
Note you are not looking for "the" but " the ". If you looked for
"the" you would also hit "there".
--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
|
|
0
|
|
|
|
Reply
|
roedy (1019)
|
9/13/2003 5:31:27 PM
|
|
On 13 Sep 2003 07:03:59 -0700, sfu@stephenfu.co.uk (sfu) wrote or
quoted :
>yeah im looking for a fast way of doing this because i want to strip
>out all the common words from an essay which could be > 2000 words for
>example.
No matter which recommended technique you select, it will be over in a
blink of an eye. The only time consuming part will be reading and
writing and that will be under a second.
By far the most time consuming part of the task will be firing up
java.exe.
--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
|
|
0
|
|
|
|
Reply
|
roedy (1019)
|
9/13/2003 5:33:29 PM
|
|
"William Brogden" <wbrogden@bga.com> wrote in message news:<3f61ccaf$1@giga.realtime.net>...
> "sfu" <sfu@stephenfu.co.uk> wrote in message
> news:9d9e0d80.0309120422.5959de97@posting.google.com...
> > Hi
> > I'd like to import a text file containing common words such as
> > "at","the","and" etc.
> >
> > Then if i take a String "I want to visit the shops today and buy some
> > clothes", i would like to strip out the common words to become "visit
> > shops today buy clothes"
> >
> > I know to speed up searching for "at" within the string i can use a
> > Boyer Moore algorithm which returns the index of the first "at". But
> > im unsure how i would use this to remove all occurances of.
> >
> > Any help is much appreciated, thanks
>
> If I had that problem I would use a StringTokenizer to take apart
> the String and a HashMap filled with the common words to detect
> which to discard. Reassemble the words that pass in a StringBuffer.
>
> WBB
while(essayTokenizer.hasMoreTokens()) {
String currentToken = essayTokenizer.nextToken();
if (currentToken.equals(hashTable.get(currentToken))) {
System.out.println("removed: " + currentToken);
} else {
strippedText += currentToken + " ";
}
}
does anyone know if its faster to use
strippedText += currentToken + " ";
or
text = new StringBuffer().append(currentToken + " ")
to create the stripped body of text over say 2000 tokens for example?
|
|
0
|
|
|
|
Reply
|
sfu (8)
|
9/13/2003 6:42:36 PM
|
|
Roedy Green <roedy@mindprod.com> wrote in message news:<d4l6mvskrc30td993arktce74ktejc6djl@4ax.com>...
> On 13 Sep 2003 07:03:59 -0700, sfu@stephenfu.co.uk (sfu) wrote or
> quoted :
>
> >yeah im looking for a fast way of doing this because i want to strip
> >out all the common words from an essay which could be > 2000 words for
> >example.
>
> No matter which recommended technique you select, it will be over in a
> blink of an eye. The only time consuming part will be reading and
> writing and that will be under a second.
>
> By far the most time consuming part of the task will be firing up
> java.exe.
oh right, cool!
|
|
0
|
|
|
|
Reply
|
sfu (8)
|
9/13/2003 8:34:42 PM
|
|
Roedy Green wrote:
> By far the most time consuming part of the task will be firing up
> java.exe.
By the sound of it, by far the most time consuming thing will be writing the
blasted program in the first place...
(Which, btw, is a serious point for sfu -- trying to make a program run fast is
nearly always a time-wasting academic exercise unless the program *actually*
runs too slowly for your needs.)
-- chris
|
|
0
|
|
|
|
Reply
|
chris.uppal (3970)
|
9/14/2003 8:38:25 AM
|
|
"sfu" <sfu@stephenfu.co.uk> wrote in message
news:9d9e0d80.0309131042.6c26fdd4@posting.google.com...
> "William Brogden" <wbrogden@bga.com> wrote in message
news:<3f61ccaf$1@giga.realtime.net>...
> > "sfu" <sfu@stephenfu.co.uk> wrote in message
> > news:9d9e0d80.0309120422.5959de97@posting.google.com...
> > > Hi
> > > I'd like to import a text file containing common words such as
> > > "at","the","and" etc.
> > >
> > > Then if i take a String "I want to visit the shops today and buy some
> > > clothes", i would like to strip out the common words to become "visit
> > > shops today buy clothes"
> > >
> > > I know to speed up searching for "at" within the string i can use a
> > > Boyer Moore algorithm which returns the index of the first "at". But
> > > im unsure how i would use this to remove all occurances of.
> > >
> > > Any help is much appreciated, thanks
> >
> > If I had that problem I would use a StringTokenizer to take apart
> > the String and a HashMap filled with the common words to detect
> > which to discard. Reassemble the words that pass in a StringBuffer.
> >
> > WBB
>
> while(essayTokenizer.hasMoreTokens()) {
> String currentToken = essayTokenizer.nextToken();
> if (currentToken.equals(hashTable.get(currentToken))) {
> System.out.println("removed: " + currentToken);
> } else {
> strippedText += currentToken + " ";
> }
> }
>
> does anyone know if its faster to use
> strippedText += currentToken + " ";
> or
> text = new StringBuffer().append(currentToken + " ")
> to create the stripped body of text over say 2000 tokens for example?
GAH! why would you want to do that? Create one StringBuffer before entering
the loop, use the size of the input String so StringBuffer won't have to
enlarge itself.
Also, as Roedy suggested, put your words to be dropped in a HashSet, then
you can use the very fast contains( obj ) method. You may want to use
toLowerCase on your string tokens and have all lower case words in the
HashSet.
Another thought - if your input string includes punctuation, be sure to
create
the StringTokenizer with the punctuation as tokens - otherwise you will
get tokens consisting of words plus punctuation and the match will fail.
Bill
|
|
0
|
|
|
|
Reply
|
wbrogden (214)
|
9/14/2003 3:42:26 PM
|
|
|
15 Replies
55 Views
(page loaded in 0.298 seconds)
|