Thread: pg should ignore u+200b zero width space

pg should ignore u+200b zero width space

From
James Cloos
Date:
i had a hell of a night last night trying to figure out what was wrong
with an attempt to use on constraint.

this morning i tried again and it worked.

but pasting the failed line from .psql_history into psql still failed.

i had to resort to od(1) to discover a no break space showed up just
before on constraint in the failing lines.

pg should treat a no break space after whitespace as just more
whitespace.

i'm not sure about alone between printing characters, but definitely
immediately after current whitespace characters.

[sent directly to the list because the 'community login' form is b0rked.]

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6



Re: pg should ignore u+200b zero width space

From
Heikki Linnakangas
Date:
On 03/11/2020 15:41, James Cloos wrote:
> i had a hell of a night last night trying to figure out what was wrong
> with an attempt to use on constraint.
> 
> this morning i tried again and it worked.
> 
> but pasting the failed line from .psql_history into psql still failed.
> 
> i had to resort to od(1) to discover a no break space showed up just
> before on constraint in the failing lines.
> 
> pg should treat a no break space after whitespace as just more
> whitespace.
> 
> i'm not sure about alone between printing characters, but definitely
> immediately after current whitespace characters.

Hmm. I'm not sure if change the behavior is a good idea, but a hint in 
the error message would be nice. Something like:

postgres=# insert ​into foo values (1);
ERROR:  syntax error at or near "​into"
LINE 1: insert ​into foo values (1);
                ^
HINT: There is a zero-width space character (U+200B) on the line.

Patches are welcome. Are there other problematic characters like that?

- Heikki



Re: pg should ignore u+200b zero width space

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 03/11/2020 15:41, James Cloos wrote:
>> pg should treat a no break space after whitespace as just more
>> whitespace.

> Hmm. I'm not sure if change the behavior is a good idea, but a hint in 
> the error message would be nice. Something like:

The difficulty with doing anything in this space --- whether it be
ignoring, throwing an error, or whatever --- is that it makes the
lexer's behavior encoding-sensitive and potentially locale-sensitive.
That's problematic for all sorts of reasons.  One of the worst is
that frontend programs such as psql and ecpg also have SQL lexers,
and there'd be no way to keep their behavior in precise sync with
the backend's.  (They might not even be running in the same encoding,
never mind locale.)  It might even be possible to build security
holes around that; recall the fun we've had with trying to lock
down quoting rules in encodings where backslash can be part of a
multibyte character :-(.

Perhaps it'd be all right to confine the change in behavior to
just modifying the error text in cases where we were going to
throw an error anyway.  But I think this is much harder than
it sounds to do in a valid, safe way.

            regards, tom lane



Re: pg should ignore u+200b zero width space

From
Heikki Linnakangas
Date:
On 03/11/2020 16:52, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 03/11/2020 15:41, James Cloos wrote:
>>> pg should treat a no break space after whitespace as just more
>>> whitespace.
> 
>> Hmm. I'm not sure if change the behavior is a good idea, but a hint in
>> the error message would be nice. Something like:
> 
> The difficulty with doing anything in this space --- whether it be
> ignoring, throwing an error, or whatever --- is that it makes the
> lexer's behavior encoding-sensitive and potentially locale-sensitive.
> That's problematic for all sorts of reasons.  One of the worst is
> that frontend programs such as psql and ecpg also have SQL lexers,
> and there'd be no way to keep their behavior in precise sync with
> the backend's.  (They might not even be running in the same encoding,
> never mind locale.)  It might even be possible to build security
> holes around that; recall the fun we've had with trying to lock
> down quoting rules in encodings where backslash can be part of a
> multibyte character :-(.
> 
> Perhaps it'd be all right to confine the change in behavior to
> just modifying the error text in cases where we were going to
> throw an error anyway.  But I think this is much harder than
> it sounds to do in a valid, safe way.

Yeah, my thinking was to just add a hint when you're throwing a syntax 
error anyway. Something simple like check if client_encoding is utf8 and 
there is a U+200b in the query string, and add the hint if so. It 
doesn't need to catch all cases, and rare false positives are OK too.

- Heikki



Re: pg should ignore u+200b zero width space

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 03/11/2020 16:52, Tom Lane wrote:
>> Perhaps it'd be all right to confine the change in behavior to
>> just modifying the error text in cases where we were going to
>> throw an error anyway.  But I think this is much harder than
>> it sounds to do in a valid, safe way.

> Yeah, my thinking was to just add a hint when you're throwing a syntax 
> error anyway. Something simple like check if client_encoding is utf8 and 
> there is a U+200b in the query string, and add the hint if so. It 
> doesn't need to catch all cases, and rare false positives are OK too.

TBH, that's exactly the kind of under-baked solution I *don't* want.

For starters, &zwsp is hardly the only problem here; even more common
is  , which exists in most encodings not only UTF8.  And once we
had that, there'd no doubt be pressure to ignore BOMs.  And so on.

Also I don't really want us throwing such errors when the funny space is
inside a literal or comment; chasing false positives like that will cost
users way more time than they could save when the hint is on-point.

            regards, tom lane



Re: pg should ignore u+200b zero width space

From
Peter Eisentraut
Date:
On 2020-11-03 15:52, Tom Lane wrote:
> The difficulty with doing anything in this space --- whether it be
> ignoring, throwing an error, or whatever --- is that it makes the
> lexer's behavior encoding-sensitive and potentially locale-sensitive.
> That's problematic for all sorts of reasons.  One of the worst is
> that frontend programs such as psql and ecpg also have SQL lexers,
> and there'd be no way to keep their behavior in precise sync with
> the backend's.  (They might not even be running in the same encoding,
> never mind locale.)  It might even be possible to build security
> holes around that; recall the fun we've had with trying to lock
> down quoting rules in encodings where backslash can be part of a
> multibyte character :-(.

I think we could support whitespace properly if we're using Unicode, and 
use ASCII whitespace for all other encodings.  The combinations to deal 
with are finite.  But it is a tricky problem that needs to be approached 
with care.