f



compiling case insensitive regular expressions

Hello,

I need to compile regular expressions which are case insensitive,
there are two cases, the part which must be matched case insensitvely
might be just a portion but it can be the entire RE as well.  The RE
will be Unicode enabled and must be compiled to AFD.

I am wondering what is the "best practice" in this field, is it
generally more efficient to precompute for each RE its "case
insensitive RE" (by replacing each symbol by a ORed version of it,
i.e; hello becomes (h|H)(e|E)(l|L)(l|L)(o|O)..) and compile that in
place of the original or is it as good to simply replace the input
symbols by their lowercase version and compile a "lowercase only"
version of the RE?

any idea?

Best regards
Armel
[I doubt it would make much difference.  If you build the case folding into
the RE, the tables are bigger, if you do it at runtime, the code path is
slightly longer but the overall program is likely to be a little smaller.
-John]


0
11/1/2010 9:17:43 PM
comp.compilers 3310 articles. 1 followers. Post Follow

6 Replies
623 Views

Similar Articles

[PageSpeed] 27

Armel <armelasselin@hotmail.com> wrote:

> I need to compile regular expressions which are case insensitive,
> there are two cases, the part which must be matched case insensitvely
> might be just a portion but it can be the entire RE as well.  The RE
> will be Unicode enabled and must be compiled to AFD.

Another way to do it, which I have seen used in the Paracel
FDF/TextFinder hardware search processor is to supply a bit mask for
each character being compared.  Only bits with a '1' in the mask are
used in the comparison.  (Bitwise AND with the mask, and compare to
the previously masked query character.)

-- glen

0
glen
11/3/2010 1:22:47 AM
I found that it is more efficient to make the regex case insensitive
than to pre-process your input. If you use equivalence classes then
there is no increase in table size. The way to do it is to have a case
insensitive flag that you can turn on and off (see the flex syntax for
(?i: )  etc.) You donbt need to actually update the text of your regex,
you just test for the flag when you process each character set and add
in both characters then if necessary. So for example (?i:[abc])
resolves to [ABCabc].

HTH

Regards,

Ben Hanson (http://www.benhanson.net/lexertl.html)

0
benhanson2
11/3/2010 2:49:36 PM
"glen herrmannsfeldt" <gah@ugcs.caltech.edu> a icrit dans le message de
> Armel <armelasselin@hotmail.com> wrote:
>> I need to compile regular expressions which are case insensitive,
>> [..]
> Another way [...] is to supply a bit mask for
> each character being compared.  Only bits with a '1' in the mask are
> used in the comparison.

I used that technic to build a LL(1) based Z80 disassembler/decompiler years
ago, to decode "structural bits" (vs. values bits) in instructions, but I
don't think its applicable to a AFD-based regular expression engine, before
knowing which bits can be ignored the AFD will first need to determine a
class for the character. it will simply result in a "tolower(c)" (or
toupper(c) ) called on each character of the input, but expressed and coded
in another way. The preparation phase still has to verify that there are no
two paths leaving a state with 'a' and 'A' for example (as they would become
a single path once the mask applied).

it would advocate in favor of the tolower( ) solution.

Regards
Armel

0
Armel
11/4/2010 12:03:36 PM
> I need to compile regular expressions which are case insensitive,
> there are two cases, the part which must be matched case insensitvely
> might be just a portion but it can be the entire RE as well. B The RE
> will be Unicode enabled and must be compiled to AFD.
>
> I am wondering what is the "best practice" in this field, is it
> generally more efficient to precompute for each RE its "case
> insensitive RE" (by replacing each symbol by a ORed version of it,
> i.e; hello becomes (h|H)(e|E)(l|L)(l|L)(o|O)..) and compile that in
> place of the original or is it as good to simply replace the input
> symbols by their lowercase version and compile a "lowercase only"
> version of the RE?

My experience has been that you almost always want to do it in
the RE, not by fiddling with the input stream.

* If only some of the regexp is case-insensitive, transforming the
  input is incorrect.

* A tight byte-at-a-time DFA inner loop slows down noticeably
  (10% or so on a typical x86) when you have the extra conditional
  branch to turn upper case into lower.

* Computing "ToLower" of an arbitrary Unicode code point is
  quite expensive.  If you bake it into the RE and then the DFA
  then the work happens during table construction, and you get
  much faster execution.

All that said, it is also the case that turning /Hello, world/ into
/[Hh][Ee][Ll][Ll][Oo], [Ww][Oo][Rr][Ll][Dd]/ is perhaps not the best
use of memory, especially if you have an inefficient representation
of character classes and an efficient representation of literal strings.

In RE2 (http://code.google.com/p/re2/) I introduced a bit in the parsed
RE representation to mean "ASCII case-insenstiive" so that I could
continue to store the parsed form as the more compact literal strings
where possible.  Even so, ASCII is really the only easy case.  If you
type in /=CE=9A=CE=B1=CE=BB=CE=B7=CE=BC=CE=AD=CF=81=CE=B1 =CE=BA=CF=8C=CF=
=83=CE=BC=CE=B5/ then RE2 turns it into a sequence of
character classes.

One final warning: even ASCII is not completely straightforward:
according to the Unicode spec, a case-insensitive match for /sky/ should
match =C5=BFKy - that's a long s, a Kelvin symbol, and an ordinary y.

Russ
0
Russ
11/4/2010 7:15:24 PM
Armel <armelasselin@hotmail.com> wrote:
(snip, I wrote)

>> Another way [...] is to supply a bit mask for each character being
> compared.  Only bits with a '1' in the mask are used in the comparison.

> I used that technic to build a LL(1) based Z80 disassembler/decompiler years
> ago, to decode "structural bits" (vs. values bits) in instructions, but I
> don't think its applicable to a AFD-based regular expression engine, before
> knowing which bits can be ignored the AFD will first need to determine a
> class for the character. it will simply result in a "tolower(c)" (or
> toupper(c) ) called on each character of the input, but expressed and coded
> in another way. The preparation phase still has to verify that there are no
> two paths leaving a state with 'a' and 'A' for example (as they would become
> a single path once the mask applied).

Yes, it would have to be built into the evaluation engine.  In the
case of the Paracel FDF, it is part of the hardware, and also part of
PSL (Pattern Specification Language).

Also, the PSL compiler will figure out the optimal mask for a given
character position.  If, for example, one put [bc] where the
characters only differ by one bit, the compiler would generate 'b'
with a mask of X'3e'.  (The default mask for ASCII masks the high bit.
Case insensitivity is the default for lower case query characters.)

It seems to me that it could result in more compact compiled regular
expressions, that also might execute faster.

-- glen
0
glen
11/5/2010 4:20:11 AM
On 11/4/2010 12:15 PM, Russ Cox wrote:

<snip, regexes>

admittedly, I don't much use regexes, although a few times I have used
regex-like strategies (such as a special notation for matching/encoding
x86 instructions, ...), and have used regexes in a few cases.

most of my tokenizers/... are plain C code.


> One final warning: even ASCII is not completely straightforward:
> according to the Unicode spec, a case-insensitive match for /sky/ should
> match E?Ky - that's a long s, a Kelvin symbol, and an ordinary y.

rarely do I find it necessary to resort to the full level of Unicode
pedantics, especially WRT most code and data tasks (it would just slow
down the handling logic).

so, one can assume a single "sane" mapping between any upper and
lower-case characters, and ignore any cases where there is not a 1:1
mapping (random characters in other locales/... which may compare equal
in some cases, characters which map between multiple characters in upper
or lower-case, ...).

that or just treat Unicode space as a "big ASCII": there is all the
special case logic for ASCII, and everything else is often just treated
as "unknown solid characters" (this is how most of my compiler code works).

for example, my assembler is partly case-insensitive, but is only case
insensitive for certain things (opcode and register names, but not
labels or variables), and only in ASCII range (there are no greek or
cyrillic or similar opcode names anyways, so it doesn't matter).

I usually compare by forcing both sides to lower case during the
compare. or, case-insensitive string interning simply forcing the string
to lower-case prior to interning it (one can intern "FOO" or "Foo" and
get back "foo"). interning allows '==' rather than string compare
operations, and so is often faster (though not always, since if only
doing a small number of compares the cost of interning the string is
greater than the cost of doing the compares).

usually, I leave things case sensitive (and demand an exact 1:1
code-point match), since this makes things simpler and faster. also I
tend to see case-sensitive as more intuitive anyways, since to me 'A'
and 'a' seem like different letters anyways.


and, yes, I am also a fan of UTF-8, since it maps nicely to ASCII and
95% of the time takes less space than UTF-16.


decided to leave out a bit about various ways to implement
uppercase/lowercase mapping, as this is probably not the issue.

0
BGB
11/6/2010 6:46:47 PM
Reply:

Similar Artilces:

To compile or not to compile
Is there some benefit in compiling the apache source yourself? -- //Points ------------------------------------------------------------ http://underthebed.homeip.net ...

Compile or not compile?
There's something that sometimes in while it bores me a little bit. How do I know if I need to compile a function/procedure or it will be able to compile "on fly". At the beginning I thought this was related to the paths idl "knew", but this happens for two functions/routines in the same directory. Can someone tell when a function can be runned without be compile? Nuno Oliveira wrote: > There's something that sometimes in while it bores me a little bit. How > do I know if I need to compile a function/procedure or it will be able > to compile &quo...

Compiled or not compiled
I need to distinguish between two situations: one when a Matlab function is run from Matlab command window and the other one when it is run in compiled mode. Any ideas how to do this? Thanks Tomy Duby "Tomy Duby" <tomy.duby@agilent.com> wrote in message news:idq8q8$85k$1@fred.mathworks.com... > I need to distinguish between two situations: one when a Matlab function > is run from Matlab command window and the other one when it is run in > compiled mode. > Any ideas how to do this? HELP ISDEPLOYED. -- Steve Lord slord@mathworks.com comp.s...

which compiler compiled?
Is there a way I can tell which compiler compiled an executable? I looked at the file with a hex editor and didn't see anything obvious. Lawrence "Lawrence" <just4me@nowhere.com> writes: > Is there a way I can tell which compiler compiled an executable? Yes with some compilers and some executable formats. E.g. on HP-UX for PA-RISC: $ aCC leak.C $ odump -compunit a.out ... 6 0 ANSI C++ leak.C /tmp ctcom options = -inst compiletime -diags 523 -inline_power 1 -longbranch 2 -unique_strings on -cachesize 256 B...

Compiling Compiler
Hi there, I need some advice/opinion of the experts out there regarding compilers. I have sun cc compiler that supports 64 bit environment. Now, I compile ACK using this compiler.This should give a compiler that supports 64bit environment. Since the output files are in Solaris format, I compile the ACK source *once again* using the just compiled ACK compiler so that I get Minix object and binary files. Now my questions is that does this 2nd version of ACK be able to produce true 64bit code? Sanky wrote: > Hi there, > > I need some advice/opinion of the experts out there regarding...

To compile or not to compile
Is there some benefit in compiling the apache source yourself? -- //Points ------------------------------------------------------------ http://underthebed.homeip.net ...

compiling Qt .... compiling Qt ....compiling Qt.... compiling Qt
I got myself new kernel from kernel.org... make xconfig needs Qt (new system). So I got new Qt source this afternoon. It is still compiling.... Who said MS is bloat has not done this. What does Qt do? Dunno, how about xforms for xconfig? On Thu, 02 Feb 2006 19:18:09 +0000, Jan Panteltje wrote: > Dunno, how about xforms for xconfig? Why not curses based? menuform -- Hilsen/Regards Michael Rasmussen http://keyserver.veridis.com:11371/pks/lookup?op=get&search=0xE3E80917 On a sunny day (Thu, 02 Feb 2006 21:55:33 +0100) it happened Michael Rasmussen <mir@miras.or...

Case for MIB compilers : which compiler is right ?
Hello, the following excerpt in an SNMP MIB ASN.1 definition file is accepted without error by a MIB compiler (let's call it C1) and rejected by another (let's call it C2). Which compiler is right ? C2 complains on line 2 of the following excerpt. ... -- next line is line 1 errorStatus OBJECT-TYPE SYNTAX SEQUENCE OF error MAX-ACCESS not-accessible STATUS current DESCRIPTION "" ::= { subtree 1 } error OBJECT-TYPE SYNTAX errorStatusStructure MAX-ACCESS not-accessible STATUS ...

Compiling Regular Expressions
I have a co-worker who wrote a little program to search for particular phrases in a logfile to process the error messages. She was originally doing something like this: while (<INFILE>) { chomp; if ( /error:/ || /Error 1/ || /Unsatisfied symbols/ || /Unresolved/ || /fatal/ || /not recognized/ || /cannot/ || /not found/ || /does not exist/ || /syntax error/ || /selector undefined/ || /No such/ ) { <Some sort of processing> } } The problem is that the error ...

regular expression compilation
There is a recent thread on the speed of execution of sub, gsub, and gensub, which reminds me of a question: Does awk (or gawk) recompile the regexp each time such functions are evaluated? Even if the regexp is a variable? That could have a big influence on the time. -- glen Am 17.09.2015 um 21:42 schrieb Kenny McCormack: > In article <mtf4sv$30a$1@speranza.aioe.org>, > Janis Papanagnou <janis_papanagnou@hotmail.com> wrote: >> Am 17.09.2015 um 21:27 schrieb Aharon Robbins: >>> In article <mtd5m9$3vh$1@news.m-online.net>, >>> J...

program that compiles in C compiler but not in C++ compiler
Hi, I need a small program that compiles in C compiler but not in C++ compiler. Thx in advans, Karthik Balaguru KBG <karthik.balaguru@lntinfotech.com> wrote: > I need a small program that compiles in C compiler but not in C++ > compiler. No problem, just send $10 to paypal@zevv.nl and I'll do your homework for you. -- :wq ^X^Cy^K^X^C^C^C^C KBG said: > Hi, > > I need a small program that compiles in C compiler but not in C++ > compiler. Can you think of any syntactic differences between C and C++? For example, what about keywords? They are very, very sen...

Obj. : Case for MIB compilers : which compiler is right ?
Mr Fock and Mr PerKins, thank you very much for your answers. To Mr Perkins, about : "The language for defining SNMP MIB modules is NOT ASN.1". I think you mean that SMI (an adapted subset of ASN.1) is the language for defining SNMP MIB modules. So, you are right. Best regards. -- FR This message is monitored by axinews : http://www.axinews.com/ ...

Compiling with the eclipse compiler
Hey all,I have a weird problem (at least it's weird to me).I have a big project that I need to compile through ant with eclipse'scompiler.In the build.xml my compile target looks like this : <target name="compile" depends="release-settings, init" description="compiles everything from source"> <echo message="Compiling with debug = ${build.debug}"/> <depend srcdir="${src.dir}" destdir="${class.dir}" cache="${dependencies.dir}" closure="true"/> <javac srcdir=...

which compiler was used to compile?
Is there a slick way to invent a constant in a program that tells you which compiler you used. I distribute both JET and JavaC versions of my code, and it helps to know on any error dump which version they were using. Right now I manually change a static final just before compiling. I thought there ought to be something less error prone. -- Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary. Roedy Green wrote: > Is there a slick way to invent a constant in a program that tell...

Web resources about - compiling case insensitive regular expressions - comp.compilers

Insensitive munitions - Wikipedia, the free encyclopedia
Insensitive munitions (IM) will only burn (rather than explode) when subjected to fast or slow heating, bullets , shrapnel , shaped charges or ...

Koran Burning Is Insensitive, Unnecessary; Pastor Jones, Please Stand Down (Updated)
Update: Book burning is bad. But the Muslim cleric who is running for parliament in Afghanistan is calling for the murder of American children ...

Insensitive on the App Store on iTunes
Get Insensitive on the App Store. See screenshots and ratings, and read customer reviews.

'No Escape' review: Owen Wilson shines in insensitive thriller - YouTube
Ashley Dvorkin and Fox411 movie reviewer Justin Craig discuss the thriller 'No Escape' starring Owen Wilson, Lake Bell and Pierce Brosnan For ...

Paris attacks: Rob Lowe’s insensitive tweets - Francois Hollande - AdelaideNow Search Search
AMERICAN actor Rob Lowe has been slammed on Twitter after making insensitive comments about the Paris attacks.

Decision to remove memorials 'insensitive'
... roadside crash memorials will be removed from two Queensland motorways next month, despite mourning families describing the decision as insensitive. ...

Paris attacks: Rob Lowe’s insensitive tweets - Francois Hollande - The Courier-Mail Search Search
AMERICAN actor Rob Lowe has been slammed on Twitter after making insensitive comments about the Paris attacks.

How to avoid culturally insensitive photography
A guide to not looking fool on the interweb for the rest of your life.

Paris attacks: Rob Lowe’s insensitive tweets - Francois Hollande
AMERICAN actor Rob Lowe has been slammed on Twitter after making insensitive comments about the Paris attacks.

British Open 2015 golf: Marc Leishman subject of insensitive Twitter post - PerthNow Search Search
IF YOU’RE not familiar with Marc Leishman’s story, here is a quick recap.

Resources last updated: 1/24/2016 12:02:25 PM