Thread: pg should ignore u+200b zero width space
i had a hell of a night last night trying to figure out what was wrong with an attempt to use on constraint. this morning i tried again and it worked. but pasting the failed line from .psql_history into psql still failed. i had to resort to od(1) to discover a no break space showed up just before on constraint in the failing lines. pg should treat a no break space after whitespace as just more whitespace. i'm not sure about alone between printing characters, but definitely immediately after current whitespace characters. [sent directly to the list because the 'community login' form is b0rked.] -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 0x997A9F17ED7DAEA6
On 03/11/2020 15:41, James Cloos wrote: > i had a hell of a night last night trying to figure out what was wrong > with an attempt to use on constraint. > > this morning i tried again and it worked. > > but pasting the failed line from .psql_history into psql still failed. > > i had to resort to od(1) to discover a no break space showed up just > before on constraint in the failing lines. > > pg should treat a no break space after whitespace as just more > whitespace. > > i'm not sure about alone between printing characters, but definitely > immediately after current whitespace characters. Hmm. I'm not sure if change the behavior is a good idea, but a hint in the error message would be nice. Something like: postgres=# insert into foo values (1); ERROR: syntax error at or near "into" LINE 1: insert into foo values (1); ^ HINT: There is a zero-width space character (U+200B) on the line. Patches are welcome. Are there other problematic characters like that? - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 03/11/2020 15:41, James Cloos wrote: >> pg should treat a no break space after whitespace as just more >> whitespace. > Hmm. I'm not sure if change the behavior is a good idea, but a hint in > the error message would be nice. Something like: The difficulty with doing anything in this space --- whether it be ignoring, throwing an error, or whatever --- is that it makes the lexer's behavior encoding-sensitive and potentially locale-sensitive. That's problematic for all sorts of reasons. One of the worst is that frontend programs such as psql and ecpg also have SQL lexers, and there'd be no way to keep their behavior in precise sync with the backend's. (They might not even be running in the same encoding, never mind locale.) It might even be possible to build security holes around that; recall the fun we've had with trying to lock down quoting rules in encodings where backslash can be part of a multibyte character :-(. Perhaps it'd be all right to confine the change in behavior to just modifying the error text in cases where we were going to throw an error anyway. But I think this is much harder than it sounds to do in a valid, safe way. regards, tom lane
On 03/11/2020 16:52, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> On 03/11/2020 15:41, James Cloos wrote: >>> pg should treat a no break space after whitespace as just more >>> whitespace. > >> Hmm. I'm not sure if change the behavior is a good idea, but a hint in >> the error message would be nice. Something like: > > The difficulty with doing anything in this space --- whether it be > ignoring, throwing an error, or whatever --- is that it makes the > lexer's behavior encoding-sensitive and potentially locale-sensitive. > That's problematic for all sorts of reasons. One of the worst is > that frontend programs such as psql and ecpg also have SQL lexers, > and there'd be no way to keep their behavior in precise sync with > the backend's. (They might not even be running in the same encoding, > never mind locale.) It might even be possible to build security > holes around that; recall the fun we've had with trying to lock > down quoting rules in encodings where backslash can be part of a > multibyte character :-(. > > Perhaps it'd be all right to confine the change in behavior to > just modifying the error text in cases where we were going to > throw an error anyway. But I think this is much harder than > it sounds to do in a valid, safe way. Yeah, my thinking was to just add a hint when you're throwing a syntax error anyway. Something simple like check if client_encoding is utf8 and there is a U+200b in the query string, and add the hint if so. It doesn't need to catch all cases, and rare false positives are OK too. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 03/11/2020 16:52, Tom Lane wrote: >> Perhaps it'd be all right to confine the change in behavior to >> just modifying the error text in cases where we were going to >> throw an error anyway. But I think this is much harder than >> it sounds to do in a valid, safe way. > Yeah, my thinking was to just add a hint when you're throwing a syntax > error anyway. Something simple like check if client_encoding is utf8 and > there is a U+200b in the query string, and add the hint if so. It > doesn't need to catch all cases, and rare false positives are OK too. TBH, that's exactly the kind of under-baked solution I *don't* want. For starters, &zwsp is hardly the only problem here; even more common is  , which exists in most encodings not only UTF8. And once we had that, there'd no doubt be pressure to ignore BOMs. And so on. Also I don't really want us throwing such errors when the funny space is inside a literal or comment; chasing false positives like that will cost users way more time than they could save when the hint is on-point. regards, tom lane
On 2020-11-03 15:52, Tom Lane wrote: > The difficulty with doing anything in this space --- whether it be > ignoring, throwing an error, or whatever --- is that it makes the > lexer's behavior encoding-sensitive and potentially locale-sensitive. > That's problematic for all sorts of reasons. One of the worst is > that frontend programs such as psql and ecpg also have SQL lexers, > and there'd be no way to keep their behavior in precise sync with > the backend's. (They might not even be running in the same encoding, > never mind locale.) It might even be possible to build security > holes around that; recall the fun we've had with trying to lock > down quoting rules in encodings where backslash can be part of a > multibyte character :-(. I think we could support whitespace properly if we're using Unicode, and use ASCII whitespace for all other encodings. The combinations to deal with are finite. But it is a tricky problem that needs to be approached with care.