ANSI terminal escape sequence regexp

I am looking for a regexp that matches the ANSI terminal escape sequences
(ESC [ ...) (for xterm), or alternatively for a tool (Linux) that replaces
ANSI terminal sequences by an arbitrarily chosen fixed replacement. Thanks.

Janis
Reply Janis 5/25/2010 10:43:55 AM

Janis Papanagnou wrote:

> I am looking for a regexp that matches the ANSI terminal escape sequences
> (ESC [ ...) (for xterm), or alternatively for a tool (Linux) that replaces
> ANSI terminal sequences by an arbitrarily chosen fixed replacement. Thanks.
> 
> Janis

Are these sequences "hardwired" into an application, or is the application
using curses? If the latter, you should be able to fudge a terminfo entry
to produce the required sequences. See terminfo(5).

Andrew
Reply Andrew 5/25/2010 12:13:52 PM


Janis Papanagnou wrote:

> I am looking for a regexp that matches the ANSI terminal escape sequences
> (ESC [ ...) (for xterm), or alternatively for a tool (Linux) that replaces
> ANSI terminal sequences by an arbitrarily chosen fixed replacement. Thanks.

I've never done that, but I suppose any regex flavor that can match the 
escape character would do, so for example with GNU sed's ERE to match 
coloring sequences:

\x1b\[[0-9]+;[0-9]+m

or something similar.

$ GREEN='\033[01;32m'; YELLOW='\033[01;33m'
$ printf "$GREEN - $YELLOW\n" | sed -r 's/\x1b\[[0-9]+;[0-9]+m/FOO/g'
FOO - FOO

Apologies if I didn't understand correctly what you're after.
Reply pk 5/25/2010 12:24:08 PM
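
A minimal Python sketch of pk's pattern, for readers who want to try it
outside sed. Python is not used anywhere in the thread; the function name
`replace_colors` and the one-parameter reset sequence in the second call are
illustrative assumptions. It shows that the pattern handles the two-parameter
colouring sequences from the sed session but leaves other sequences in place.

```python
import re

# pk's pattern: ESC [, exactly two numeric parameters, final character "m".
TWO_PARAM_SGR = re.compile(r"\x1b\[[0-9]+;[0-9]+m")


def replace_colors(text: str, replacement: str = "FOO") -> str:
    """Replace every match of the two-parameter colouring pattern."""
    return TWO_PARAM_SGR.sub(replacement, text)


green, yellow, reset = "\x1b[01;32m", "\x1b[01;33m", "\x1b[0m"

# Same input as the sed session above.
print(replace_colors(f"{green} - {yellow}"))      # FOO - FOO

# A sequence with a single parameter is not matched and stays in the output.
print(repr(replace_colors(f"{green}ok{reset}")))  # 'FOOok\x1b[0m'
```

So the pattern is fine for known colour codes, but it is not the "universal"
expression asked for.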

Andrew McDermott wrote:
> Janis Papanagnou wrote:
> 
>> I am looking for a regexp that matches the ANSI terminal escape sequences
>> (ESC [ ...) (for xterm), or alternatively for a tool (Linux) that replaces
>> ANSI terminal sequences by an arbitrarily chosen fixed replacement. Thanks.
> 
> Are these sequences "hardwired" into an application, or is the application
> using curses? If the latter, you should be able to fudge a terminfo entry
> to produce the required sequences. See terminfo(5).

I am telnet'ing to a server that emits those ANSI sequences in addition
to the data I am interested in. It's not specified what that server will
actually emit, therefore I am looking for a "universal" regexp for those
sequences. Probably something like  \033[[]\([0-9]*;\)+[A-Za-z0-9]  or so.

Since it's likely that I might make mistakes when defining this, and since
I believe that it's as well likely that someone else already invented that
wheel, I am asking.

Janis
Reply Janis 5/25/2010 1:15:21 PM

pk wrote:
> Janis Papanagnou wrote:
> 
>> I am looking for a regexp that matches the ANSI terminal escape sequences
>> (ESC [ ...) (for xterm), or alternatively for a tool (Linux) that replaces
>> ANSI terminal sequences by an arbitrarily chosen fixed replacement. Thanks.
> 
> I've never done that, but I suppose any regex flavor that can match the 
> escape character would do, so for example with GNU sed's ERE to match 
> coloring sequences:
> 
> \x1b\[[0-9]+;[0-9]+m
> 
> or something similar.
> 
> $ GREEN='\033[01;32m'; YELLOW='\033[01;33m'
> $ printf "$GREEN - $YELLOW\n" | sed -r 's/\x1b\[[0-9]+;[0-9]+m/FOO/g'
> FOO - FOO
> 
> Apologies if I didn't understand correctly what you're after.

Sorry for having been unclear.

I know that I just need some BRE/ERE tool, like sed, to substitute the
actual ANSI codes. I was interested in a regexp that covers all ANSI
sequences in one regexp expression because, actually, I don't know what
the telnet server will emit. (Please see also my response to Andrew.)

Janis
Reply Janis 5/25/2010 1:19:11 PM

Janis Papanagnou wrote:

> pk wrote:
>> Janis Papanagnou wrote:
>> 
>>> I am looking for a regexp that matches the ANSI terminal escape
>>> sequences (ESC [ ...) (for xterm), or alternatively for a tool (Linux)
>>> that replaces ANSI terminal sequences by an arbitrarily chosen fixed
>>> replacement. Thanks.
>> 
>> I've never done that, but I suppose any regex flavor that can match the
>> escape character would do, so for example with GNU sed's ERE to match
>> coloring sequences:
>> 
>> \x1b\[[0-9]+;[0-9]+m
>> 
>> or something similar.
>> 
>> $ GREEN='\033[01;32m'; YELLOW='\033[01;33m'
>> $ printf "$GREEN - $YELLOW\n" | sed -r 's/\x1b\[[0-9]+;[0-9]+m/FOO/g'
>> FOO - FOO
>> 
>> Apologies if I didn't understand correctly what you're after.
> 
> Sorry for having been unclear.
> 
> I know that I just need some BRE/ERE tool, like sed, to substitute the
> actual ANSI codes. I was interested in a regexp that covers all ANSI
> sequences in one regexp expression because, actually, I don't know what
> the telnet server will emit. (Please see also my response to Andrew.)

See if this expect tip helps:

http://wiki.tcl.tk/9673
Reply pk 5/25/2010 1:32:54 PM

pk wrote:
> Janis Papanagnou wrote:
> 
>> pk wrote:
>>> Janis Papanagnou wrote:
>>>
>>>> I am looking for a regexp that matches the ANSI terminal escape
>>>> sequences (ESC [ ...) (for xterm), or alternatively for a tool (Linux)
>>>> that replaces ANSI terminal sequences by an arbitrarily chosen fixed
>>>> replacement. Thanks.
>>> I've never done that, but I suppose any regex flavor that can match the
>>> escape character would do, so for example with GNU sed's ERE to match
>>> coloring sequences:
>>>
>>> \x1b\[[0-9]+;[0-9]+m
>>>
>>> or something similar.
>>>
>>> $ GREEN='\033[01;32m'; YELLOW='\033[01;33m'
>>> $ printf "$GREEN - $YELLOW\n" | sed -r 's/\x1b\[[0-9]+;[0-9]+m/FOO/g'
>>> FOO - FOO
>>>
>>> Apologies if I didn't understand correctly what you're after.
>> Sorry for having been unclear.
>>
>> I know that I just need some BRE/ERE tool, like sed, to substitute the
>> actual ANSI codes. I was interested in a regexp that covers all ANSI
>> sequences in one regexp expression because, actually, I don't know what
>> the telnet server will emit. (Please see also my response to Andrew.)
> 
> See if this expect tip helps:
> 
> http://wiki.tcl.tk/9673

Not sure. Quoting from the link (first example)...

  regexp -- {^\x1b(\[|\(|\))[;?0-9]*[0-9A-Za-z]} ${data} match

It seems that ANSI sequences can terminate in a digit. How could one
distinguish in a sequence like, say, \x1b[0A whether the A is part of
the ANSI sequence or part of the subsequent data?

Janis
Reply Janis 5/25/2010 2:39:29 PM

Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
> pk wrote:
<snip>
>> See if this expect tip helps:
>> 
>> http://wiki.tcl.tk/9673
>
> Not sure. Quoting from the link (first example)...
>
>   regexp -- {^\x1b(\[|\(|\))[;?0-9]*[0-9A-Za-z]} ${data} match
>
> It seems that ANSI sequences can terminate in a digit.

A quick scan of some online documents suggests that this is not so.  All
the sequences I've seen end in a letter.  Wikipedia suggests the last byte
must be between ASCII @ and ~ inclusive.
 
If you are prepared to use a very general regexp that will strip out
ill-formed escape sequences you could start with

  \x1b\[[^@-~]*[@-~]

You then need to catch the two-byte sequences:

  \x1b\[[^@-~]*[@-~]|\x1b[@-~]

This will go wrong for those sequences that can include quoted strings
like those that set key mappings.  Maybe you can ignore these.

There is also a one-byte alternative to \x1b[ which is \x9b so you might
want to try:

  (\x1b\[|\x9b)[^@-~]*[@-~]|\x1b[@-~]

-- 
Ben.
Reply Ben 5/25/2010 4:43:48 PM
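
A minimal Python sketch of Ben's final expression, under the assumption that
the data has already been decoded to text (so \x9b means the single
character U+009B). The name `strip_ansi` and the sample strings are
illustrative and not from the thread. The first call uses the \x1b[0A case
raised earlier: the first character in the @..~ range ends the sequence, so
the data that follows is left alone.

```python
import re

# Ben's pattern: a CSI introducer (ESC [ or the single character 0x9b),
# any run of characters outside @..~, then one final character in @..~;
# or a two-character sequence made of ESC plus one character in @..~.
ANSI = re.compile(r"(?:\x1b\[|\x9b)[^@-~]*[@-~]|\x1b[@-~]")


def strip_ansi(text: str, replacement: str = "") -> str:
    """Replace every matched sequence with a fixed replacement string."""
    return ANSI.sub(replacement, text)


print(strip_ansi("\x1b[0Adata"))              # data
print(strip_ansi("\x1b[01;32mgreen\x1b[0m"))  # green
print(strip_ansi("a\x1bMb\x9b2Jc", "<X>"))    # a<X>b<X>c
```

Because `[^@-~]*` accepts anything up to the next final character, an
ill-formed sequence is stripped as well, which is the trade-off Ben describes.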

Ben Bacarisse wrote:

> Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
>> pk wrote:
> <snip>
>>> See if this expect tip helps:
>>> 
>>> http://wiki.tcl.tk/9673
>>
>> Not sure. Quoting from the link (first example)...
>>
>>   regexp -- {^\x1b(\[|\(|\))[;?0-9]*[0-9A-Za-z]} ${data} match
>>
>> It seems that ANSI sequences can terminate in a digit.
> 
> A quick scan of some online documents suggests that this is not so.  All
> the sequences I've seen end in a letter.  Wikipedia suggests the last byte
> must be between ASCII @ and ~ inclusive.
>  
> If you are prepared to use a very general regexp that will strip out
> ill-formed escape sequences you could start with
> 
>   \x1b\[[^@-~]*[@-~]
> 
> You then need to catch the two-byte sequences:
> 
>   \x1b\[[^@-~]*[@-~]|\x1b[@-~]
> 
> This will go wrong for those sequences that can include quoted strings
> like those that set key mappings.  Maybe you can ignore these.
> 
> There is also a one-byte alternative to \x1b[ which is \x9b so you might
> want to try:
> 
>   (\x1b\[|\x9b)[^@-~]*[@-~]|\x1b[@-~]

For reference, here are some tables with most ANSI escape sequences:

http://isthe.com/chongo/tech/comp/ansi_escapes.html
http://ascii-table.com/ansi-escape-sequences.php 

Reply pk 5/25/2010 4:49:38 PM

Ben Bacarisse wrote:
> Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
>> pk wrote:
> <snip>
>>> See if this expect tip helps:
>>>
>>> http://wiki.tcl.tk/9673
>> Not sure. Quoting from the link (first example)...
>>
>>   regexp -- {^\x1b(\[|\(|\))[;?0-9]*[0-9A-Za-z]} ${data} match
>>
>> It seems that ANSI sequences can terminate in a digit.
> 
> A quick scan of some online documents suggests that this is not so.  All
> the sequences I've seen end in a letter.  Wikipedia suggests the last byte
> must be between ASCII @ and ~ inclusive.
>  
> If you are prepared to use a very general regexp that will strip out
> ill-formed escape sequences you could start with
> 
>   \x1b\[[^@-~]*[@-~]
> 
> You then need to catch the two-byte sequences:
> 
>   \x1b\[[^@-~]*[@-~]|\x1b[@-~]
> 
> This will go wrong for those sequences that can include quoted strings
> like those that set key mappings.  Maybe you can ignore these.

Yes, I think I can ignore those.

> 
> There is also a one-byte alternative to \x1b[ which is \x9b so you might
> want to try:
> 
>   (\x1b\[|\x9b)[^@-~]*[@-~]|\x1b[@-~]
> 

Looks good, and seems to work. Thanks, Ben. Thanks also to Andrew and
pk.

Just an additional note for those who try that expression and observe
problems; setting LANG=C might fix some issues in non-C locales.

Janis
Reply Janis 5/25/2010 5:52:57 PM
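
A related caveat, sketched in Python rather than sed: matching byte by byte
(which is roughly what a C locale gives you) treats every 0x9b byte as a CSI
introducer, even when that byte is the second half of a UTF-8 encoded
character. The sample text and the names `CSI_BYTES` and `CSI_TEXT` are
assumptions for illustration; the thread does not say which encoding the
telnet server uses.

```python
import re

PATTERN = r"(?:\x1b\[|\x9b)[^@-~]*[@-~]|\x1b[@-~]"
CSI_BYTES = re.compile(PATTERN.encode("ascii"))  # matches raw bytes
CSI_TEXT = re.compile(PATTERN)                   # matches decoded characters

# U+011B encodes to the two bytes 0xC4 0x9B in UTF-8.
raw = "\u011bA \x1b[1mbold\x1b[0m".encode("utf-8")

bytewise = CSI_BYTES.sub(b"", raw)
textwise = CSI_TEXT.sub("", raw.decode("utf-8"))

# Bytewise, the 0x9B inside the character is taken for CSI and "A" for its
# final byte, leaving a stray 0xC4 behind.
print(bytewise)         # b'\xc4 bold'

# After decoding, only the real escape sequences are removed.
print(ascii(textwise))  # '\u011bA bold'
```

If the stream is known to be UTF-8, decoding first (or dropping the \x9b
alternative) avoids this; for plain 7-bit data the question does not arise.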

In article <htgnf1$s5d$1@news.m-online.net>,
Janis Papanagnou  <janis_papanagnou@hotmail.com> wrote:
>It seems that ANSI sequences can terminate in a digit. How could one
>distinguish in a sequence like, say, \x1b[0A whether the A is part of
>the ANSI sequence or part of the subsequent data?

No, I don't think they can.  The patterns I've used in the past for excising
ANSI sequences:

    gsub(/\033\[[^a-zA-Z]*./, "")
    gsub(/\033./, "")

Apparently the terminating character can actually be any of characters 64
through 126 (@ through ~), not just letters, though I haven't seen that.
And of course you may also encounter the single-character CSI, character 155,
in place of \033[.

	John

-- 
John DuBois  spcecdt@armory.com  KC6QKZ/AE  http://www.armory.com/~spcecdt/
Reply spcecdt 5/25/2010 6:41:56 PM
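
For comparison, a Python rendering of John's two gsub() calls. The function
name `excise_john` and the test strings are illustrative assumptions;
`re.DOTALL` is used so that "." also matches a newline, as it does in awk.
The sketch shows that the patterns work for sequences ending in a letter, but
that a sequence whose final character is not a letter (here ESC [ 1 ~) makes
`[^a-zA-Z]*` run on into the data until the next letter.

```python
import re


def excise_john(text: str) -> str:
    """Apply the two substitutions from the awk snippet, in the same order."""
    text = re.sub(r"\x1b\[[^a-zA-Z]*.", "", text, flags=re.DOTALL)
    return re.sub(r"\x1b.", "", text, flags=re.DOTALL)


# Pattern that stops at the first character in the @..~ range instead.
FINAL_BYTE = re.compile(r"\x1b\[[^@-~]*[@-~]|\x1b[@-~]")

print(excise_john("\x1b[01;32mgreen\x1b[0m"))       # green

sample = "\x1b[1~ 123 abc"
print(repr(excise_john(sample)))                    # 'bc'
print(repr(FINAL_BYTE.sub("", sample)))             # ' 123 abc'
```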

pk <pk@pk.invalid> writes:
<snip>
> For reference, here are some tables with most ANSI escape sequences:
>
> http://isthe.com/chongo/tech/comp/ansi_escapes.html
> http://ascii-table.com/ansi-escape-sequences.php 

Yes, I found both of those but they seem less than comprehensive (my
test being if they tell you about \e[J and \e[1J as well as \e[2J).

ECMA-48 seems to be the most definitive reference I can find online.  It
gives a more restrictive pattern:

  (\x1b\[|\x9b)[\x30-\x3f]*[\x40-\x7e]

In fact, trailing bytes in the range \x70-\x7e ('p' to '~' in ASCII) are
reserved for private or experimental use so this could be made even more
restricted.

-- 
Ben.
Reply Ben 5/25/2010 9:15:57 PM
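
A Python sketch of the stricter ECMA-48 shape, with named groups so the parts
can be inspected. One addition that is not in Ben's expression: as far as I
read ECMA-48, optional intermediate bytes (\x20-\x2f) may sit between the
parameter bytes and the final byte, so the sketch allows them. The helper
name `describe` and the sample sequences are illustrative assumptions.

```python
import re

CONTROL_SEQUENCE = re.compile(
    r"(?:\x1b\[|\x9b)"                  # CSI: ESC [ or the single character 0x9b
    r"(?P<params>[\x30-\x3f]*)"         # parameter bytes: 0-9 : ; < = > ?
    r"(?P<intermediates>[\x20-\x2f]*)"  # intermediate bytes: space ! " # ... /
    r"(?P<final>[\x40-\x7e])"           # final byte: @ through ~
)


def describe(seq: str):
    """Return (params, intermediates, final, is_private), or None if seq is
    not exactly one control sequence."""
    m = CONTROL_SEQUENCE.fullmatch(seq)
    if m is None:
        return None
    final = m.group("final")
    # Final bytes \x70-\x7e are the private/experimental range.
    is_private = "\x70" <= final <= "\x7e"
    return m.group("params"), m.group("intermediates"), final, is_private


print(describe("\x1b[01;32m"))  # ('01;32', '', 'm', False)
print(describe("\x1b[2 q"))     # ('2', ' ', 'q', True)
print(describe("\x1b[0A"))      # ('0', '', 'A', False)
print(describe("\x1b[1"))       # None
```

Unlike the `[^@-~]*` version, this one refuses to match when something other
than parameter or intermediate bytes appears before the final byte.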

Ben Bacarisse wrote:
> pk <pk@pk.invalid> writes:
> <snip>
>> For reference, here are some tables with most ANSI escape sequences:
>>
>> http://isthe.com/chongo/tech/comp/ansi_escapes.html
>> http://ascii-table.com/ansi-escape-sequences.php 
> 
> Yes, I found both of those but they seem less than comprehensive (my
> test being if they tell you about \e[J and \e[1J as well as \e[2J).
> 
> ECMA-48 seems to be the most definitive reference I can find online.  It
> gives a more restrictive pattern:
> 
>   (\x1b\[|\x9b)[\x30-\x3f]*[\x40-\x7e]

I wonder, though, why, e.g.,

  ESC ( B
  ESC =
  ESC >

(which, incidentally, are all in the data that I parse) are not covered
by the pattern that you've found in the ECMA-48 reference.

> In fact, trailing bytes in the range \x70-\x7e ('p' to '~' in ASCII) are
> reserved for private or experimental use so this could be made even more
> restricted.
> 

BTW, in one of the references there are also escape sequences that seem
to be terminated by a digit; ESC 7 and ESC 8, for example.

Janis
Reply Janis 5/25/2010 10:25:24 PM

Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

> Ben Bacarisse wrote:
>> pk <pk@pk.invalid> writes:
>> <snip>
>>> For reference, here are some tables with most ANSI escape sequences:
>>>
>>> http://isthe.com/chongo/tech/comp/ansi_escapes.html
>>> http://ascii-table.com/ansi-escape-sequences.php 
>> 
>> Yes, I found both of those but they seem less than comprehensive (my
>> test being if they tell you about \e[J and \e[1J as well as \e[2J).
>> 
>> ECMA-48 seems to be the most definitive reference I can find online.  It
>> gives a more restrictive pattern:
>> 
>>   (\x1b\[|\x9b)[\x30-\x3f]*[\x40-\x7e]
>
> I wonder, though, why, e.g.,
>
>   ESC ( B
>   ESC =
>   ESC >
>
> (which, incidentally, are all in the data that I parse) are not covered
> by the pattern that you've found in the ECMA-48 reference.

What I quoted was a pattern for what ECMA-48 calls control sequences.
There are four other categories (the C0 set, the C1 set, independent
control functions and control strings) and I have not gone through and
worked them all out.  I think there is a lot of history being codified
here.

>> In fact, trailing bytes in the range \x70-\x7e ('p' to '~' in ASCII) are
>> reserved for private or experimental use so this could be made even more
>> restricted.
>> 
>
> BTW, in one of the references there are also escape sequences that seem
> to be terminated by a digit; ESC 7 and ESC 8, for example.

That may well be possible.  I was only describing "control sequences" --
those that start with CSI (the Control Sequence Introducer) \e[.

There ought to be an ANSI document, of course, but they are not always
easily available.  It might be easier to read, though, than ECMA-48, which
is rather hard going.

-- 
Ben.
Reply Ben 5/25/2010 11:06:07 PM
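
A hedged Python sketch that extends the control-sequence pattern to the
shorter sequences Janis lists (ESC ( B, ESC =, ESC >, ESC 7, ESC 8). It
assumes the general escape-sequence shape from ISO 2022 / ECMA-35 as I
understand it: ESC, optional intermediate bytes \x20-\x2f, then one final
byte \x30-\x7e. Control strings (sequences that carry free text up to a
terminator) are deliberately not handled. The names `ESCAPES` and
`strip_escapes` and the sample input are illustrative, not from the thread.

```python
import re

ESCAPES = re.compile(
    # Control sequences: CSI, parameter bytes, intermediate bytes, final byte.
    r"(?:\x1b\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    # Other escape sequences: ESC, intermediate bytes, final byte.
    r"|\x1b[\x20-\x2f]*[\x30-\x7e]"
)


def strip_escapes(text: str, replacement: str = "") -> str:
    """Replace control sequences and plain escape sequences."""
    return ESCAPES.sub(replacement, text)


sample = "\x1b(B\x1b=login: \x1b[1mjanis\x1b[0m\x1b>\x1b7\x1b8"
print(strip_escapes(sample))         # login: janis
print(strip_escapes("\x1b(B", "."))  # .
```

The CSI alternative is listed first so that ESC [ is consumed together with
its parameters rather than as a bare two-character sequence.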

Janis Papanagnou wrote:
> Ben Bacarisse wrote:
>> pk <pk@pk.invalid> writes:
>> <snip>
>>> For reference, here are some tables with most ANSI escape sequences:
>>>
>>> http://isthe.com/chongo/tech/comp/ansi_escapes.html
>>> http://ascii-table.com/ansi-escape-sequences.php 
>> 
>> Yes, I found both of those but they seem less than comprehensive (my
>> test being if they tell you about \e[J and \e[1J as well as \e[2J).
>> 
>> ECMA-48 seems to be the most definitive reference I can find online.  It
>> gives a more restrictive pattern:
>> 
>>   (\x1b\[|\x9b)[\x30-\x3f]*[\x40-\x7e]
>
> I wonder, though, why, e.g.,
>
>   ESC ( B
>   ESC =
>   ESC >

I don't know of a handy online reference but I have an old copy of an
actual VT100 user guide with a pretty good description that seems
comprehensive. For example
	      
    ESC ( B is shown as the ANSI SCS (Select Character Set) control, which
    designates the ASCII char set as G0.

    ESC = is shown as DECKPAM Keypad App Mode (DEC private)
    ESC > is shown as DECKPNM Keypad Numeric Mode (DEC private)

> (which, incidentally, are all in the data that I parse) are not covered
> by the pattern that you've found in the ECMA-48 reference.
>
>> In fact, trailing bytes in the range \x70-\x7e ('p' to '~' in ASCII) are
>> reserved for private or experimental use so this could be made even more
>> restricted.
>> 
> BTW, in one of the references there are also escape sequences that seem
> to be terminated by a digit; ESC 7 and ESC 8, for example.

Ok, I'm back and it seems there is a copy at:

www.piesoftwareinc.co.uk/textonly/VT100_User_Guide.pdf 

I don't know if it helps but it has a lot of pages :)


Reply stan 5/26/2010 9:23:35 PM
