Parsing a Boolean expression easy?

String testText = "\"christian bongiorno\" AND Joe OR "
    + "\"Electrical, plumbing\"";

I would like to parse the above text into its components in an easy
way, preferably using only the native Java libraries. I could write a
custom parser, but it would be a little ugly.

Ultimately, I would like the following tokens:

1) christian bongiorno
2) AND
3) Joe
4) OR
5) Electrical, plumbing

With StringTokenizer I can correctly get the quoted phrases, but it
doesn't split up the unquoted words. So this:

StringTokenizer tokens = new StringTokenizer(
    "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"", "\"");

produces

1) christian bongiorno
2) AND Joe OR
3) Electrical, plumbing

As you guessed, this is for text searching. Also, no third-party
libraries: it has to be all core Java.

Ideas?

Reply cbongior (40) 8/15/2005 7:47:59 PM

<cbongior@stny.rr.com> wrote in message 
news:1124135279.803767.315210@z14g2000cwz.googlegroups.com...
> As you guessed, this is for text searching. Also, No 3rd party
> libraries. But be all core Java
>
> ideas?

    There's a difference between parsing and tokenizing. A lot of the time 
when people say parsing, they mean tokenizing (which is why the string 
tokenizer solves their problem). The problem you're describing is actual, 
real parsing.

    If you don't want to use 3rd party tools, then you'll just have to write 
a parser by hand. Look up "recursive descent parsing". You may also want to 
try posting future questions on this project to comp.compilers to learn more 
about parsing theory.
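
    For illustration, a hand-written recursive descent parser for this kind 
of query can be fairly small. The following is only a rough sketch under 
assumptions the original post does not state: that AND binds tighter than OR, 
that there are no parentheses or NOT, and that the text has already been split 
into tokens. The class name QueryParser and the bracketed output format are 
made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one method per grammar rule, each consuming tokens.
public class QueryParser {
    private final List<String> tokens;
    private int pos = 0;

    private QueryParser(List<String> tokens) {
        this.tokens = tokens;
    }

    // Parses the whole token list and rejects leftover tokens.
    public static String parse(List<String> tokens) {
        QueryParser p = new QueryParser(tokens);
        String result = p.parseExpr();
        if (p.pos != tokens.size()) {
            throw new IllegalArgumentException(
                "unexpected token: " + tokens.get(p.pos));
        }
        return result;
    }

    // expr := term ( "OR" term )*
    private String parseExpr() {
        String left = parseTerm();
        while (pos < tokens.size() && tokens.get(pos).equals("OR")) {
            pos++;
            left = "(" + left + " OR " + parseTerm() + ")";
        }
        return left;
    }

    // term := word ( "AND" word )*   (assumption: AND binds tighter than OR)
    private String parseTerm() {
        String left = parseWord();
        while (pos < tokens.size() && tokens.get(pos).equals("AND")) {
            pos++;
            left = "(" + left + " AND " + parseWord() + ")";
        }
        return left;
    }

    // word := any token that is not an operator
    private String parseWord() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("unexpected end of query");
        }
        String t = tokens.get(pos++);
        if (t.equals("AND") || t.equals("OR")) {
            throw new IllegalArgumentException("unexpected operator: " + t);
        }
        return "[" + t + "]";
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "christian bongiorno", "AND", "Joe", "OR", "Electrical, plumbing");
        System.out.println(parse(tokens));
    }
}
```

    Each grammar rule becomes one method, and the nesting of the brackets in 
the result shows the grouping the parser chose. A real search feature would 
build a tree of objects rather than a string.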

    - Oliver 


Reply owong (5281) 8/15/2005 8:06:39 PM


cbongior@stny.rr.com wrote:
> Ultimately, I would like the following tokens:
>
> 1) Christian bongiorno
> 2) AND
> 3) Joe
> 4) OR
> 5) Electrical, plumbing

How about something with regular expressions, e.g.:

jc@soyuz:~/tmp$ cat bparse.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class bparse {
  public static void main(String[] asArgs) {
    // First argument is the regex; the remaining arguments are inputs.
    Pattern p = Pattern.compile(asArgs[0]);
    System.out.println("    regex: '" + asArgs[0] + "'");
    for (int i = 1; i < asArgs.length; ++i) {
      String sExpr = asArgs[i];
      System.out.println("input str: '" + sExpr + "'");
      Matcher m = p.matcher(sExpr);
      // Each call to find() advances to the next match in the input.
      while (m.find()) {
        System.out.println("    match: '" + m.group() + "'");
      }
    }
  }
}

jc@soyuz:~/tmp$ java bparse '("[^"]*"|AND|OR|[A-Za-z0-9]+)' \
    "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\""
    regex: '("[^"]*"|AND|OR|[A-Za-z0-9]+)'
input str: '"christian bongiorno" AND Joe OR "Electrical, plumbing"'
    match: '"christian bongiorno"'
    match: 'AND'
    match: 'Joe'
    match: 'OR'
    match: '"Electrical, plumbing"'

?

Reply shakahshakah (188) 8/15/2005 8:14:54 PM

I was SURE a regular expression could do it, but my regexp skills SUCK!
As an aside, Linux interprets the command line differently than Windows.
Windows turned those command-line args into something like 8 separate
arguments. So I adapted it, and it works!

Thanks

Christian

http://christian.bongiorno.org/resume.pdf

Reply cbongior (40) 8/15/2005 9:13:02 PM

One question, though: in the results, is there an easy way to throw out
the " " around a quoted part?

so...
instead of
match: '"christian bongiorno"' 

I get
match: 'christian bongiorno'

Reply cbongior (40) 8/15/2005 9:27:02 PM

cbongior@stny.rr.com wrote:
> One question though? In the results, is it possible to easily throw out
> the " " around a quoted part?
>
> so...
> instead of
> match: '"christian bongiorno"'
>
> I get
> match: 'christian bongiorno'

How about:
  while (m.find()) {
    String sMatch = m.group();
    // Strip the surrounding quotes from a quoted phrase.
    if (sMatch.length() >= 2
        && sMatch.startsWith("\"") && sMatch.endsWith("\"")) {
      sMatch = sMatch.substring(1, sMatch.length() - 1);
    }
    System.out.println("    match: '" + sMatch + "'");
  }
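
Another option, sketched here as an alternative rather than as what the
program above does, is to let the regex drop the quotes itself by using
capture groups. The pattern below is a different one from the regex passed
on the command line earlier, and the class and method names are made up
for the example. Group 1 captures the text inside a pair of quotes, group 2
captures a bare word, and only one of them is non-null for any given match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: tokenizer that returns quoted phrases without quotes.
public class QuoteStrippingTokenizer {
    // Group 1: text inside double quotes. Group 2: a run of characters
    // that are neither whitespace nor a double quote.
    private static final Pattern TOKEN =
        Pattern.compile("\"([^\"]*)\"|([^\\s\"]+)");

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // The group that did not take part in the match is null.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String testText =
            "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (String token : tokenize(testText)) {
            System.out.println("    match: '" + token + "'");
        }
    }
}
```

Note that this sketch does not report an unterminated quote as an error;
the stray quote character is simply skipped.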

Reply shakahshakah (188) 8/15/2005 9:45:19 PM

On 15 Aug 2005 12:47:59 -0700, cbongior@stny.rr.com wrote or quoted :

>As you guessed, this is for text searching. Also, No 3rd party
>libraries. But be all core Java

Here is how you could implement an el cheapo tokenizer.

Use a regex to split on spaces, teaching it to ignore spaces inside
quotes.

Use a HashMap of defined words and keywords mapping to an enum that
classifies them.

Look up each word to see if it is magic, e.g. AND or OR.

You now have an array of tokens that identify their general class.
That is a lot easier to parse, especially if you use a postfix
notation.
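
The tokenizing steps above might be sketched roughly as follows. This is
an illustration only, not tested production code: the names CheapTokenizer,
Kind and Token are invented for the example, keyword lookup is assumed to
be case-insensitive, and only plain spaces (not tabs) separate tokens. The
split regex matches a run of spaces only when an even number of quote
characters follows it, which means the spaces are outside any quoted phrase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: split on unquoted spaces, then classify each piece.
public class CheapTokenizer {
    enum Kind { AND, OR, TERM }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + ":" + text; }
    }

    // Keywords mapped to the enum that classifies them.
    private static final Map<String, Kind> KEYWORDS = new HashMap<String, Kind>();
    static {
        KEYWORDS.put("AND", Kind.AND);
        KEYWORDS.put("OR", Kind.OR);
    }

    // Spaces followed by an even number of quotes, i.e. outside quotes.
    private static final String SPLIT = " +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<Token>();
        String trimmed = input.trim();
        if (trimmed.length() == 0) {
            return out;
        }
        for (String raw : trimmed.split(SPLIT)) {
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                // A quoted phrase is always a search term, never a keyword.
                out.add(new Token(Kind.TERM, raw.substring(1, raw.length() - 1)));
            } else {
                // Assumption: keywords are recognised regardless of case.
                Kind kind = KEYWORDS.get(raw.toUpperCase(Locale.ROOT));
                out.add(new Token(kind != null ? kind : Kind.TERM, raw));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String testText =
            "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (Token t : tokenize(testText)) {
            System.out.println(t);
        }
    }
}
```

Because each token carries its class, a later parsing step can branch on
the Kind instead of comparing strings again.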

The other approach is to use a parser generator, which will be much
easier than you imagine. See http://mindprod.com/jgloss/parser.html


Reply look-on (3298) 8/15/2005 9:57:47 PM

Thanks. I was thinking that something in the regex could do it, but I
had already worked around it in code.

Reply cbongior (40) 8/16/2005 3:17:30 PM
