String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
I would like to parse the above text into its 'components' in an easy
and preferably native Java library fashion. I mean, I can implement a
custom parser, but it would be a little ugly.
Ultimately, I would like the following tokens:
1) Christian bongiorno
2) AND
3) Joe
4) OR
5) Electrical, plumbing
With StringTokenizer I can correctly get the quoted words, but it
doesn't distinguish the non-quoted ones. So,
StringTokenizer tokens = new StringTokenizer("\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"", "\"");
produces
1) christian bongiorno
2) AND Joe OR
3) Electrical, plumbing
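For reference, a minimal sketch that reproduces the behaviour described above: with the double quote as the only delimiter, StringTokenizer hands back everything between two quotes as one token, so the unquoted middle arrives as a single chunk. The class and method names (TokenizerDemo, splitOnQuotes) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the original post.
class TokenizerDemo {
    // Splits on the double-quote character only, as in the post above.
    static List<String> splitOnQuotes(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text, "\"");
        while (tokens.hasMoreTokens()) {
            out.add(tokens.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        // Brackets make the leading/trailing spaces of the middle token visible.
        for (String t : splitOnQuotes(testText)) {
            System.out.println("[" + t + "]");
        }
    }
}
```

Note that the middle token keeps its surrounding spaces, so it would still need to be trimmed and split again.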
As you guessed, this is for text searching. Also, no 3rd-party
libraries; it has to be all core Java.
Ideas?
cbongior (40)
8/15/2005 7:47:59 PM

There's a difference between parsing and tokenizing. A lot of the time
when people say parsing, they mean tokenizing (which is why the string
tokenizer solves their problem). The problem you're describing is actual,
real parsing.
If you don't want to use 3rd-party tools, then you'll just have to write
a parser by hand. Look up "recursive descent parsing". You may also want to
try posting future questions on this project to comp.compilers to learn more
about parsing theory.
- Oliver
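To make the recursive descent suggestion concrete, here is a rough sketch in core Java. It is only one possible reading of the problem: the grammar, the choice that AND binds tighter than OR, the bracketed output format, and the name QueryParser are all assumptions, not something stated in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch. Assumed grammar (AND binds tighter than OR):
//   expr := term ( "OR" term )*
//   term := atom ( "AND" atom )*
//   atom := quoted phrase | bare word
class QueryParser {
    // Group 1: text inside quotes. Group 2: a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");
    private final List<String> tokens = new ArrayList<>();
    private final List<Boolean> quoted = new ArrayList<>();
    private int pos = 0;

    QueryParser(String input) {
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            boolean isPhrase = m.group(1) != null;
            tokens.add(isPhrase ? m.group(1) : m.group(2));
            quoted.add(isPhrase);
        }
    }

    // True if the current token is the given keyword and was not quoted.
    private boolean atKeyword(String kw) {
        return pos < tokens.size() && !quoted.get(pos) && tokens.get(pos).equals(kw);
    }

    // One method per grammar rule; each returns a textual tree.
    String parseExpr() {
        String left = parseTerm();
        while (atKeyword("OR")) {
            pos++;
            left = "OR(" + left + ", " + parseTerm() + ")";
        }
        return left;
    }

    private String parseTerm() {
        String left = parseAtom();
        while (atKeyword("AND")) {
            pos++;
            left = "AND(" + left + ", " + parseAtom() + ")";
        }
        return left;
    }

    private String parseAtom() {
        if (pos >= tokens.size() || atKeyword("AND") || atKeyword("OR")) {
            throw new IllegalArgumentException("expected a word or phrase at token " + pos);
        }
        return "[" + tokens.get(pos++) + "]";
    }

    // Parses the whole input and rejects leftover tokens.
    static String parse(String input) {
        QueryParser p = new QueryParser(input);
        String tree = p.parseExpr();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + p.tokens.get(p.pos));
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\""));
    }
}
```

For the sample query this prints OR(AND([christian bongiorno], [Joe]), [Electrical, plumbing]). A real search front end would likely build node objects instead of strings and might add parentheses or NOT, which this sketch leaves out.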
owong (5281)
8/15/2005 8:06:39 PM

How about something with regular expressions, e.g.:
jc@soyuz:~/tmp$ cat bparse.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class bparse {
    // Usage: java bparse <regex> <input>...
    public static void main(String[] asArgs) {
        Pattern p = Pattern.compile(asArgs[0]);
        System.out.println(" regex: '" + asArgs[0] + "'");
        for (int i = 1; i < asArgs.length; ++i) {
            String sExpr = asArgs[i];
            System.out.println("input str: '" + sExpr + "'");
            Matcher m = p.matcher(sExpr);
            // Print each successive match of the regex in this argument.
            while (m.find()) {
                System.out.println(" match: '" + m.group() + "'");
            }
        }
    }
}
jc@soyuz:~/tmp$ java bparse '("[^"]*"|AND|OR|[A-Za-z0-9]+)' "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\""
regex: '("[^"]*"|AND|OR|[A-Za-z0-9]+)'
input str: '"christian bongiorno" AND Joe OR "Electrical, plumbing"'
match: '"christian bongiorno"'
match: 'AND'
match: 'Joe'
match: 'OR'
match: '"Electrical, plumbing"'
?
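If passing the pattern through a shell is awkward, the same regex can be written as a Java string literal instead. A small sketch, assuming a helper class called QueryTokens (a made-up name) that returns the matches as a list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class QueryTokens {
    // Same pattern as on the command line above, with the quotes escaped for Java.
    private static final Pattern TOKEN =
            Pattern.compile("(\"[^\"]*\"|AND|OR|[A-Za-z0-9]+)");

    // Returns every match in order; quoted phrases keep their quotes.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (String token : tokenize(testText)) {
            System.out.println("match: '" + token + "'");
        }
    }
}
```

One caveat with this pattern: alternatives are tried left to right, so a bare word that starts with AND or OR (for example ORANGE) is split into OR and ANGE. Dropping the AND|OR alternatives, or anchoring them with \b, avoids that, since the word alternative already matches both keywords.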
shakahshakah (188)
8/15/2005 8:14:54 PM

I was SURE regular expressions could do it, but my regexp skills SUCK!
As an aside, Linux interprets the command line differently than Windows.
Windows turned those command-line args into, like, 8 separate arguments.
So I adapted it, and it works!
Thanks
Christian
http://christian.bongiorno.org/resume.pdf
cbongior (40)
8/15/2005 9:13:02 PM

One question though? In the results, is it possible to easily throw out
the " " around a quoted part?
so...
instead of
match: '"christian bongiorno"'
I get
match: 'christian bongiorno'
cbongior (40)
8/15/2005 9:27:02 PM

cbongior@stny.rr.com wrote:
> One question though? In the results, is it possible to easily throw out
> the " " around a quoted part?
>
> so...
> instead of
> match: '"christian bongiorno"'
>
> I get
> match: 'christian bongiorno'
How about:
    while (m.find()) {
        String sMatch = m.group();
        // Strip the surrounding quotes from a quoted phrase.
        if (sMatch.startsWith("\"") && sMatch.endsWith("\"")) {
            sMatch = sMatch.substring(1, sMatch.length() - 1);
        }
        System.out.println(" match: '" + sMatch + "'");
    }
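Another option, sketched here as an alternative rather than a correction, is to let the regex do the stripping with capture groups: one group around the text inside the quotes and one around a bare word. The pattern and the class name UnquotedTokens are my own assumptions; the original pattern had no inner groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class UnquotedTokens {
    // Group 1: text inside quotes (quotes excluded). Group 2: a bare word.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|([A-Za-z0-9]+)");

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Exactly one of the two groups takes part in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (String token : tokenize(testText)) {
            System.out.println("match: '" + token + "'");
        }
    }
}
```

The trade-off is that once the quotes are gone, a quoted "AND" can no longer be told apart from the operator unless you also record which group matched.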
shakahshakah (188)
8/15/2005 9:45:19 PM

On 15 Aug 2005 12:47:59 -0700, cbongior@stny.rr.com wrote or quoted :
>As you guessed, this is for text searching. Also, No 3rd party
>libraries. But be all core Java
Here is how you could implement an el cheapo tokenizer.
Use a regex to split on spaces, teaching it to ignore spaces inside
quotes.
Use a HashMap of defined words and keywords, mapping each to an enum that
classifies it.
Look up each word to see if it is magic, e.g. AND or OR.
You now have an array of tokens that identify their general class.
That is a lot easier to parse, especially if you use a postfix
notation.
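The steps above could look roughly like this. It is a sketch under assumptions: the split regex, the Kind enum, the Token class, case-sensitive keywords, and the name CheapTokenizer are all mine, and the lookahead only behaves if the quotes in the input are balanced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the split-then-classify idea.
class CheapTokenizer {
    enum Kind { AND, OR, PHRASE, WORD }

    // A token's text plus its general class.
    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + ":" + text; }
    }

    // The "magic" words. Matching is case-sensitive here, which is an assumption.
    private static final Map<String, Kind> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("AND", Kind.AND);
        KEYWORDS.put("OR", Kind.OR);
    }

    // Whitespace is a separator only when an even number of quotes follows it,
    // i.e. when it does not sit inside a quoted phrase.
    private static final String SPLIT = "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        for (String raw : text.trim().split(SPLIT)) {
            if (raw.isEmpty()) {
                continue; // blank input yields one empty piece
            }
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                out.add(new Token(Kind.PHRASE, raw.substring(1, raw.length() - 1)));
            } else {
                Kind kind = KEYWORDS.get(raw);
                out.add(new Token(kind != null ? kind : Kind.WORD, raw));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\""));
    }
}
```

Because each token carries its class, a quoted "AND" stays a PHRASE and is never mistaken for the operator.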
The other approach is to use a parser generator, which will be much
easier than you imagine. See http://mindprod.com/jgloss/parser.html
look-on (3298)
8/15/2005 9:57:47 PM

Thanks, I was thinking that something in the regex could do it. I have
already worked around it in my own logic.
cbongior (40)
8/16/2005 3:17:30 PM