Thread: Token separation

Token separation

From: Tim Landscheidt
Date: 15 January 2012, 04:03:03

Hi,

I just tried to input a hexadecimal number in PostgreSQL
(8.4) and was rather surprised by the result:

| tim=# SELECT 0x13;
|  x13
| -----
|    0
| (1 row)

| tim=# SELECT 0abc;
|  abc
| -----
|    0
| (1 row)

| tim=#

The documentation says:

| A token can be a key word, an identifier, a quoted identifier,
| a literal (or constant), or a special character symbol.
| Tokens are normally separated by whitespace (space, tab,
| newline), but need not be if there is no ambiguity (which is
| generally only the case if a special character is adjacent
| to some other token type).

Is this behaviour really conforming to the standard?  Even
stranger is what MySQL (5.1.59) makes of it:

| mysql> SELECT 0x40;
| +------+
| | 0x40 |
| +------+
| | @    |
| +------+
| 1 row in set (0.00 sec)

| mysql> SELECT 0abc;
| ERROR 1054 (42S22): Unknown column '0abc' in 'field list'
| mysql>

Tim

Re: Token separation

From: Tom Lane
Date: 15 January 2012, 13:16:19

Tim Landscheidt <tim@tim-landscheidt.de> writes:
> [ "0x13" is lexed as "0" then "x13" ]

> Is this behaviour really conforming to the standard?

Well, it's pretty much the universal behavior of flex-based lexers,
anyway.  A token ends when the next character can no longer sensibly
be added to it.

Possibly the documentation should be tweaked to mention the
number-followed-by-identifier case.
        regards, tom lane
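
The longest-match rule described above can be illustrated with a small tokenizer. This is only a minimal sketch in Python, not PostgreSQL's actual flex grammar: the token classes, the `TOKEN_PATTERNS` table and the `tokenize` function are hypothetical names chosen for illustration, and key words are simply treated as identifiers.

```python
import re

# Hypothetical token classes (an assumption, not PostgreSQL's real rules).
# At each position every pattern is tried and the longest match wins.
TOKEN_PATTERNS = [
    ("number", re.compile(r"\d+")),
    ("identifier", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("symbol", re.compile(r"[;,()*]")),
]


def tokenize(text):
    """Split text into (kind, lexeme) pairs using the longest-match rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace separates tokens but is not a token itself
            continue
        best = None
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match and (best is None or match.end() > best[1].end()):
                best = (kind, match)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()  # the token ends where no pattern can extend it
    return tokens


print(tokenize("SELECT 0x13;"))
print(tokenize("SELECT 0abc;"))
```

Under these assumptions "0x13" comes out as the number "0" followed by the identifier "x13", with no whitespace needed between them. That is consistent with the psql output earlier in the thread, where the result column is labelled x13 and holds the value 0.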

Re: Token separation

From: Tim Landscheidt
Date: 16 January 2012, 11:20:05

Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> [ "0x13" is lexed as "0" then "x13" ]

>> Is this behaviour really conforming to the standard?

> Well, it's pretty much the universal behavior of flex-based lexers,
> anyway.  A token ends when the next character can no longer sensibly
> be added to it.

I know, but - off the top of my head - in most other languages
"0abc" will then give a syntax error.

> Possibly the documentation should be tweaked to mention the
> number-followed-by-identifier case.

Especially if you consider such cases:

| tim=# SELECT 1D1; SELECT 1E1; SELECT 1F1;
|  d1
| ----
|   1
| (1 row)

|  ?column?
| ----------
|        10
| (1 row)

|  f1
| ----
|   1
| (1 row)

| tim=#

I don't think it's common to hit this, but the documentation
surely could use a caveat.  I will write something up and
submit it to -docs.

Thanks,
Tim
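
The difference between 1D1, 1E1 and 1F1 can be reproduced with a single pattern for numeric literals. The sketch below is in Python and assumes a simplified literal shape (digits, optional fraction, optional E exponent); the `NUMERIC` pattern and the `split_number` helper are hypothetical and are not taken from the PostgreSQL sources.

```python
import re

# Assumed shape of a numeric literal: digits, optional fraction,
# optional exponent introduced by E or e. Only E can extend the number.
NUMERIC = re.compile(r"\d+(\.\d*)?([Ee][+-]?\d+)?")


def split_number(text):
    """Return (numeric_literal, remainder) for text that starts with a digit."""
    match = NUMERIC.match(text)
    if match is None:
        raise ValueError(f"{text!r} does not start with a digit")
    return match.group(), text[match.end():]


for sample in ("1D1", "1E1", "1F1"):
    literal, rest = split_number(sample)
    print(sample, "->", repr(literal), repr(rest))
```

With this pattern "1E1" is consumed whole as one literal (the value 10), while "1D1" and "1F1" stop after "1" and leave "D1" and "F1" behind as separate identifiers. PostgreSQL folds unquoted identifiers to lower case, which matches the d1 and f1 column labels in the output above.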

Re: Token separation

From: Jasen Betts
Date: 20 January 2012, 09:32:33

On 2012-01-16, Tim Landscheidt <tim@tim-landscheidt.de> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>>> [ "0x13" is lexed as "0" then "x13" ]
>
>>> Is this behaviour really conforming to the standard?
>
>> Well, it's pretty much the universal behavior of flex-based lexers,
>> anyway.  A token ends when the next character can no longer sensibly
>> be added to it.
>
> I know, but - off the top of my head - in most other languages
> "0abc" will then give a syntax error.

In most other languages "0 abc" would also be a syntax error.

"0and" doesn't give a syntax error in PHP, e.g.:

<? echo 0and 0;?>
 
-- 
⚂⚃ 100% natural