Regular expressions, compilation & tcl 8.6...

Hi all,

Has anything changed noticeably in the cache Tcl uses for caching 
regular expressions?

I tried to verify a large set (thousands) of small & simple regular 
expressions with "regexp $one {}", and Tcl ended up occupying 800MB 
of RAM. Is this expected?

(I remembered that the last N expression compilations were kept in 
memory...)

George

set patterns [<return a list of regexp patterns>]
## Ensure all patterns are valid!
foreach one $patterns {
   if {[catch {regexp $one {}} error]} {
     error "Invalid pattern: $one\n$error"
   }
}
petasis (1405)
11/12/2009 12:21:13 AM
comp.lang.tcl

On Nov 12, 1:21 am, Georgios Petasis <peta...@iit.demokritos.gr>
wrote:
> Hi all,
>
> Has anything changed noticeably in the cache Tcl uses for caching
> regular expressions?
>
> I tried to verify a large set (thousands) of small & simple regular
> expressions, with "regexp $one {}" and Tcl resulted in occupying 800MB
> of RAM. Is this expected?
>
> (I remembered that the last N expression compilations were kept in
> memory...)
>
> George
>
> set patterns [<return a list of regexp patterns>]
> ## Ensure all patterns are valid!
> foreach one $patterns {
>    if {[catch {regexp $one {}} error]} {
>      error "Invalid pattern: $one\n$error"
>    }

I don't know of a specific cache in the RE engine; however there's the
Tcl_Obj internal rep which plays the same role. Since the compiled
automaton is then attached to the pattern value, it "sticks" to each
element of $patterns, which survives well over the lifecycle of the
loop variable. To verify this theory, you can try two things:

  (1) unset patterns and see the memory consumption drop
  (2) or, defeat the Tcl_Obj caching:

       if {[catch  {regexp [string range $one 0 end] {}} error]} {

     and see no more memory consumption than the list's storage.

-Alex
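
A way to check this theory directly is sketched below. It assumes Tcl 8.6 or later, where the unsupported introspection command ::tcl::unsupported::representation is available. The sample patterns are made up for illustration, and the exact wording of the report is an implementation detail that may differ between versions.

```tcl
# Hypothetical sample patterns; any list of valid REs would do.
set patterns [list {a+} {b*c} {[0-9]+}]

# How is the first list element stored before any matching?
set before [::tcl::unsupported::representation [lindex $patterns 0]]
puts "before: $before"

# The validation loop from the original post.
foreach one $patterns {
    if {[catch {regexp $one {}} err]} {
        error "Invalid pattern: $one\n$err"
    }
}

# The loop variable has moved on, but the list element is the same
# value that was handed to regexp, so it should now carry the
# compiled RE as its internal representation.
set after [::tcl::unsupported::representation [lindex $patterns 0]]
puts "after:  $after"
```

If the theory holds, the second report names a regexp type where the first named a plain string, which is what keeps one compiled automaton alive per list element.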

11/12/2009 1:13:05 AM
On 12 Nov, 00:21, Georgios Petasis <peta...@iit.demokritos.gr> wrote:
> Has anything changed noticeably in the cache Tcl uses for caching
> regular expressions?

Not for many years.

Tcl has two caches for compiled REs. There is a per-thread cache that
is indexed by the literal string form of the RE (I believe that holds
the last 20 compiled REs, but could be wrong) and compiled REs are
also cached in the internal representation of the values. If you put
all the REs in global variables (or use literal REs) and just use them
by reference then it will be those internal representation caches
which are used.

> I tried to verify a large set (thousands) of small & simple regular
> expressions, with "regexp $one {}" and Tcl resulted in occupying 800MB
> of RAM. Is this expected?

Yes, that will trigger the building of all those internal
representations. Mostly that's a good strategy, but you've found the
case where it isn't. Congratulations.

Donal.
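
For the case where it isn't a good strategy, a minimal sketch of a validator that tries not to leave compiled REs attached to the list elements. The proc name invalidPatterns is hypothetical, and the copy via "string range" is the workaround suggested earlier in this thread; whether that call really yields a fresh value may depend on the Tcl version, so treat it as an assumption to verify rather than a guarantee.

```tcl
# Hypothetical helper: return the patterns that fail to compile.
proc invalidPatterns {patterns} {
    set bad {}
    foreach one $patterns {
        # Match against a copy of the pattern (assumed to be a fresh
        # value), so that any compiled RE is attached to the copy and
        # not to the list element itself.
        if {[catch {regexp [string range $one 0 end] {}}]} {
            lappend bad $one
        }
    }
    return $bad
}

# Usage: "(" has unbalanced parentheses, so it does not compile.
puts [invalidPatterns [list {a+} "(" {b*}]]
```

With this shape, only the small per-thread cache should hold compiled REs once the loop has finished, at the price of recompiling a pattern if it is used again later.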
11/12/2009 9:36:34 AM
Alexandre Ferrieux wrote:
> I don't know of a specific cache in the RE engine; however there's the
> Tcl_Obj internal rep which plays the same role. Since the compiled
> automaton is then attached to the pattern value, it "sticks" to each
> element of $patterns, which survives well over the lifecycle of the
> loop variable. To verify this theory, you can try two things:
> 
>   (1) unset patterns and see the memory consumption drop
>   (2) or, defeat the Tcl_Obj caching:
> 
>        if {[catch  {regexp [string range $one 0 end] {}} error]} {
> 
>      and see no more memory consumption than the list's storage.
> 
> -Alex
> 

Dear Alex,

Indeed this solves the problem (i.e. wish stabilises at ~30MB no matter how 
many times I run the loop). But I cannot understand why.
There is a small cache per thread (as Donal also remembers), but the 
regular expressions are stored in the same variable. Since the string of 
the variable "one" changes, shouldn't the compiled regexp also be discarded?
Maybe there is a leak somewhere?

George
petasis (1405)
11/12/2009 12:01:06 PM
Donal K. Fellows wrote:
> On 12 Nov, 00:21, Georgios Petasis <peta...@iit.demokritos.gr> wrote:
>> Has anything changed noticeably in the cache Tcl uses for caching
>> regular expressions?
> 
> Not for many years.
> 
> Tcl has two caches for compiled REs. There is a per-thread cache that
> is indexed by the literal string form of the RE (I believe that holds
> the last 20 compiled REs, but could be wrong) and compiled REs are
> also cached in the internal representation of the values. If you put
> all the REs in global variables (or use literal REs) and just use them
> by reference then it will be those internal representation caches
> which are used.
> 
>> I tried to verify a large set (thousands) of small & simple regular
>> expressions, with "regexp $one {}" and Tcl resulted in occupying 800MB
>> of RAM. Is this expected?
> 
> Yes, that will trigger the building of all those internal
> representations. Mostly that's a good strategy, but you've found the
> case where it isn't. Congratulations.
> 
> Donal.

Dear Donal,

I am a little puzzled by this. Since the string of the variable changes 
with each loop to a new pattern, why is the old compilation of the 
regular expression kept?

George
petasis (1405)
11/12/2009 12:03:39 PM
Georgios Petasis wrote:
> Dear Alex,
> 
> Indeed this solves the problem (i.e. wish stabilises ~30MB no matter how 
> many times I run the loop). But I cannot understand why.
> There is a small cache per thread (as Donal also remembers), but the 
> regular expressions are stored in the same variable. Since the string of 
> the variable "one" changes, shouldn't the compiled regexp also be 
> discarded?
> Maybe there is a leak somewhere?
> 
> George

How stupid of me :D
Of course there is no leak: the regular expression objects are also 
referenced by the list object (they are the list elements!)...

George
petasis (1405)
11/12/2009 12:05:48 PM
Georgios Petasis wrote:
> Dear Donal,
> 
> I am a little puzzled by this. Since the string of the variable changes 
> with each loop to a new pattern, why is the old compilation of the 
> regular expression kept?
> 
> George

Dear Donal,

Forget the last e-mail. I now understand that the patterns are also 
reference-counted by the list object that holds them.
So their internal representation (the compiled regexp) stays cached for 
as long as the list keeps them alive...

George
petasis (1405)
11/12/2009 12:07:05 PM
On Nov 12, 1:05 pm, Georgios Petasis <peta...@iit.demokritos.gr>
wrote:
> How stupid of me :D
> Of course there is no leak: the regular expression objects are also
> referenced by the list object (they are the list elements!)...

No it's my fault. I miserably failed to convey that meaning with
"element of $patterns, which survives well over the lifecycle of the
loop variable"...

-Alex
11/12/2009 12:43:40 PM