f



1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648

Hi,

I have a question regarding lexical analysis. I recently came across a
bug in our lexical analyser in phc (www.phpcompiler.org), that I am
unsure how to solve. This is the problem: our current definition for
integer constant looks something like

INT    ([1-9][0-9]*)|0

In particular, note that it does not allow for an (optional) "+" or "-"
at the start of the integer. This means that the strings "1 - 1", "1
-1" and "1-1" all generate the same sequence of three tokens INT(1),
OP(-), INT(1), for which the syntax analyser generates the subtree
BIN_OP(-, 1, 1).

For the string "1 - -1", the lexer (unsurprisingly) generates INT(1),
OP(-), OP(-), INT(1). The syntax analyser recognises this as BIN_OP(1,
UNARY_OP(-, 1)). In other words, the second "-" is treated as a unary
operator, rather than as part of the number.

This works fine, with the sole exception of the number "-2147483648".
The problem is, of course, overflow: -2147483648 is a valid negative
number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
valid positive number. Thus, the above method of dealing with "-" as a
unary operator breaks down.

The solution is to interpret the "-" as part of the number, and
generate INT(-2147483648), rather than OP(-), INT(...). However,
changing the definition of INT to

INT    [+-]?([1-9][0-9]*)|0

causes "1-1" to be recognised as INT(1), INT(-1), which is incorrect.
Even requiring a space before the operator

INT    (" "[+-])?([1-9][0-9]*)|0

does not help, because "1 -1" will be recognised as INT(1), INT(-1)
instead of INT(1), OP(-), INT(1).

Is there a good solution that solves both problems? At the moment, I
have added a new rule to the lexer that matches this exact number,
which solves the problem - but it's a hack.. (actually, there are two
rules, one for 32-bit machines, one for 64-bit machines..)

Thanks,

Edsko
0
devriese (27)
1/31/2006 2:57:37 PM
comp.compilers 3310 articles. 1 followers. Post Follow

11 Replies
4381 Views

Similar Articles

[PageSpeed] 23

"Edsko de Vries" <devriese@cs.tcd.ie> writes:

> our current definition for integer constant looks something like
>
> INT    ([1-9][0-9]*)|0
>
> In particular, note that it does not allow for an (optional) "+" or "-"
> at the start of the integer. In other words, "-" is treated as a unary
> operator, rather than as part of the number.
>
> This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number.
>
> At the moment, I have added a new rule to the lexer that matches
> this exact number, which solves the problem - but it's a hack..

One solution is to let the lexer keep the number as a string and
handle unary minus applied to a constant in the parser (at this time
converting the string to a number).  Or you can use a bignum package
in the compiler, only converting to machine integers when you generate
code.

        Torben

0
torbenm
1/31/2006 5:47:16 PM
> This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number.  Thus, the above method of dealing with "-" as a
> unary operator breaks down.

I know a few vendor-supplied C compilers that get this wrong,
printing warnings about "positive constant overflows" or something
like that, which is all the more confusing an error when you look at
it and say "but that's *not* a positive constant!".

> The solution is to interpret the "-" as part of the number, and
> generate INT(-2147483648),

I think a better solution is to let the parser do its job and determine
whether the - is a negation or a subtraction, and then once that
is done, you can warn about INT(2147483648) that isn't inside a
negation.  I put this into the Plan 9 C compilers recently and it
worked well.

Russ

0
Russ
1/31/2006 5:47:25 PM
"Edsko de Vries" <devriese@cs.tcd.ie> writes:

> Is there a good solution that solves both problems?

I don't know if you consider it "good", but doing compile time
evaluation of constant expressions with an unbounded precision package
would solve the issue.  It immediately extend to having unbounded
precision integers at runtime when you'll add it.

--
Jean-Marc

0
Jean
1/31/2006 5:47:34 PM
On 31 Jan 2006 09:57:37 -0500, Edsko de Vries wrote:

[...]
> Is there a good solution that solves both problems? At the moment, I
> have added a new rule to the lexer that matches this exact number,
> which solves the problem - but it's a hack.. (actually, there are two
> rules, one for 32-bit machines, one for 64-bit machines..)

In Ada all literals are considered of some indefinite numeric type
[Universal_Integer]. The compiler evaluates all expressions exactly and
generates error when the result is converted to the target [finite] type.
So, for example, if Int_8 has range -128..127, nevertheless the following
is legal Ada:

   X : Int_8 := 1000 - 900; -- The result is in the range

With this approach, literals are unsigned and unary minus works just fine.

Of course, you will not be able to compile something like

   2**98756466 - 2**98756466

which is also legal. But it is not a big loss. (:-))

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

0
Dmitry
1/31/2006 5:47:46 PM
On 31 Jan 2006 09:57:37 -0500, "Edsko de Vries" <devriese@cs.tcd.ie>
wrote:

>-2147483648 is a valid negative number (assuming 32-bit numbers),
>but the integer 2147483648 is _not_ a valid positive number.

>Even requiring a space before the operator
>does not help, because "1 -1" will be recognised as INT(1), INT(-1)
>instead of INT(1), OP(-), INT(1).
>
>Is there a good solution that solves both problems?

You can solve it two ways: require that white space or another
operator (such as parentheses) follow the binary operator, or restrict
the range of integer constants so that unary negation is always
possible.

Or you could try a suitably condescending error message ... it's
obvious that "1-1" is a misspelling of "H"  8-)


George
0
George
2/1/2006 2:20:51 AM
Russ Cox wrote:
>>This works fine, with the sole exception of the number "-2147483648".
>>The problem is, of course, overflow: -2147483648 is a valid negative
>>number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
>>valid positive number.  Thus, the above method of dealing with "-" as a
>>unary operator breaks down.
>
>
> I know a few vendor-supplied C compilers that get this wrong,
> printing warnings about "positive constant overflows" or something
> like that, which is all the more confusing an error when you look at
> it and say "but that's *not* a positive constant!".
>

Your C compiler is correct.  In standard C there are no "negative"
integer constants.

See section 6.4.4.1 of the C99 standard.

What is happening here, on a 32-bit 2's compliment machine is that
the value is too large to fit into a signed 'int', so
it becomes an 'unsigned int', following the type hierarchy
in section 6.4.4.1.  You've then applied a unary negation
to an 'unsigned int'  and the result is also an 'unsigned int' (see
section 6.5.3.3).... it's not clear what the arithmetic
negation of an 'unsigned int' would be.

For this example, our compiled generates:

   Warning #2296: unary negation applied to an unsigned type

Some other compilers also generate warnings like "constant is so large
that it is unsigned", I think that's one I've seen from GCC before.


So - to go back to what the original post was about, the C language
has the same problem.

Our solution is to keep constants as strings.  When it is time for
the value of the constant to be evaluated, we have to examine it
to see which type can "hold" it, and place it in appropriate
"containers" for that type.

This can be particularly cumbersome in a cross-platform environment,
such as a 32-bit host compiling for a 64-bit machine... where
there would be a requirement for manipulating large values in
a non-native fashion (e.g. some "bignum" implemenation.)


	- Dave Rivers -

--
rivers@dignus.com                        Work: (919) 676-0847
Get your mainframe programming tools at http://www.dignus.com
0
Thomas
2/1/2006 2:21:08 AM
Edsko de Vries writes:
>> our current definition for integer constant looks something like
>>    INT    ([1-9][0-9]*)|0
>> In particular, note that it does not allow for an (optional) "+" or "-"
>> at the start of the integer. In other words, "-" is treated as a unary
>> operator, rather than as part of the number.
>>
>> This works fine, with the sole exception of the number "-2147483648".
>> The problem is, of course, overflow: -2147483648 is a valid negative
>> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
>> valid positive number.
>>
>> At the moment, I have added a new rule to the lexer that matches
>> this exact number, which solves the problem - but it's a hack..
>

Torben �gidius Mogensen wrote:
> One solution is to let the lexer keep the number as a string and
> handle unary minus applied to a constant in the parser (at this time
> converting the string to a number).  Or you can use a bignum package
> in the compiler, only converting to machine integers when you generate
> code.

Another method is to convert all but the last digit into an integer,
so that <2147483648> becomes <214748364, 8>.  Then a simple
range comparison can be made to determine if the token is too
big for a valid integer value.  This method avoids overflow.

It's interesting to note that the C and C++ grammars do not
treat leading signs as part of a constant but as a separate
unary operator.  Other languages such as COBOL do treat
leading signs as part of a literal constant, but these kinds of
languages also typically require additional spaces to separate
operands from operators.

-drt
0
David
2/1/2006 2:22:45 AM
Edsko de Vries <devriese@cs.tcd.ie> wrote:
>The problem is, of course, overflow: -2147483648 is a valid negative
>number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
>valid positive number...  Is there a good solution...?

The hack that has been used for this in the past is to compute and store a
number as the negative of its actual value (temporarily, until you know
whether there's a unary minus in front).

If that's too distasteful, you really have to find an extra bit somewhere:

+ Store the constant as an unsigned number (32-bit unsigned goes up to
  4_294_967_295 without overflow, so 2_147_483_648 is no problem),
  until you know the sign.

+ Use a larger signed type, if you have one.

+ Use a bignum (unlimited-length integer) library.

+ Store the number as a string, unconverted, until you know its sign.

--
spsystems.net is temporarily off the air;               |   Henry Spencer
mail to henry at zoo.utoronto.ca instead.               | henry@spsystems.net
0
henry
2/1/2006 2:23:00 AM
Edsko de Vries wrote:

> I have a question regarding lexical analysis. I recently came across a
> bug in our lexical analyser in phc (www.phpcompiler.org), that I am
> unsure how to solve. This is the problem: our current definition for
> integer constant looks something like

(snip)

> This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number. Thus, the above method of dealing with "-" as a
> unary operator breaks down.

(snip)

Look at the language definition.  C allows only positive constants
and unary operators, as you say, though a constant expression might
have the value -2147483648.

Fortran allows only positive constants in expressions, but optionally
signed constants in DATA statements.  It might not be required, but I
believe most compilers will allow the maximum negative number in a DATA
statement.

-- glen
0
glen
2/2/2006 4:34:02 PM
Edsko de Vries wrote:

> This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number. Thus, the above method of dealing with "-" as a
> unary operator breaks down.

When you separate the sign from the numbers, you should use unsigned
values to prevent overflows. Shouldn't your code also accept unsigned
integers > 2^31?

DoDi

0
Hans
2/2/2006 4:35:41 PM
David R Tribble wrote:
> Another method is to convert all but the last digit into an integer,
> so that <2147483648> becomes <214748364, 8>.  Then a simple
> range comparison can be made to determine if the token is too
> big for a valid integer value.  This method avoids overflow.

Just to add a note...
There is a rule (I thought it was Horner's Rule, but that's not right),
which allows you to check whether adding a digit to a number will
result in overflow before doing the actual multiply and add of the
digit.

-drt
0
David
2/3/2006 11:35:45 PM
Reply:

Web resources about - 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 - comp.compilers

Michael Bloomberg may launch independent US presidential bid
Reuters Michael Bloomberg may launch independent US presidential bid - source Reuters Michael Bloomberg, the billionaire former mayor of New ...

After Blizzard, Snowed-in East Coast Prepares to Dig Out
Millions of Americans were preparing to dig themselves out Sunday after a mammoth blizzard with hurricane-force winds and record-setting snowfall ...

Police charge 17-year-old in Canada after 4 shot dead
Police on Saturday charged a 17-year-old boy with four counts of first-degree murder and seven counts of attempted murder in a mass shooting ...

Panda at DC's National Zoo frolics in snow
Tian Tian, the giant panda bear at Washington, DC's National Zoo enjoys rolling around in the snow after a winter storm dumped nearly two feet ...

China, Iran upgrade ties to carry forward millennia-old friendship
Xinhua China, Iran upgrade ties to carry forward millennia-old friendship Xinhua TEHRAN, Jan. 23, 2016 (Xinhua) Chinese PresidentXi Jinping(R ...

Blue Origin Executes First Successful Relaunch And Landing Of Reusable Rocket
The private sector space race is heating up, and Amazon.com co-founder Jeff Bezos’ Blue Origin is back out in front with the first successful ...

Sundance Film Review: ‘The Greasy Strangler’
Comparisons can be drawn to John Waters and Harmony Korine, among others, but they do not flatter "Strangler."

Can Star Wars ‘Force’ Shah Rukh Khan’s Dilwale and Deepika Padukone’s Bajirao Mastani into corner?
Sean Mclain in the Wall Street Journal reported that the studio wanted to avoid clashing dates with blockbuster Bollywood releases — Shah Rukh ...

Joe Biden: Vice President Claims The U.S. Prepared For ‘Military Solution’ Against Islamic State In Syria ...
One of the biggest issues the Obama administration had to deal with throughout their eight-year tenure is the threat of the Islamic State in ...

John Legend & ‘Underground’ EP On “Slow Moving Change” Of Oscars & Diversity – Sundance
"I think what the Academy did this week is a step in the right direction but it's going to be a slow moving change because the Academy reflects ...

Resources last updated: 1/24/2016 8:05:25 AM