Re: UTF8 with BOM support in psql - Mailing list pgsql-hackers

From Chuck McDevitt
Subject Re: UTF8 with BOM support in psql
Date
Msg-id 2106D8DC89010842BABA5CD03FEA4061012E8BE3B9@EXVMBX018-10.exch018.msoutlookonline.net
In response to Re: UTF8 with BOM support in psql  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: UTF8 with BOM support in psql
List pgsql-hackers
>
> I don't know what the best solution is here.  The BOM encoded as UTF-8
> is valid data in other encodings.  Of course, there is your point that
> such data cannot be at the start of an SQL command.
>

Is the UTF-8 BOM (EF BB BF) actually valid data in any other multi-byte encoding (other than its intended use in
UTF-8)?

I realize that for a single-byte encoding, such as latin-1, it would be legal as data, since any bytes other than 00
are legal, although never legal outside a quoted string in an SQL command or psql command.
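
To make the latin-1 case concrete, here is a tiny sketch (Python purely for brevity; nothing here is psql code, and the function name is made up for illustration). It shows that the three BOM bytes decode without error as three ordinary latin-1 characters:

```python
UTF8_BOM = b"\xef\xbb\xbf"

def bom_as_latin1() -> str:
    # latin-1 maps every byte value to a character, so this never fails.
    return UTF8_BOM.decode("latin-1")

print(bom_as_latin1())
```

Those three characters are what shows up at the top of a BOM-prefixed file when it is read as latin-1.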

Certainly, no psql command input file can legitimately start with those bytes, since you would get an error (unless
psql is changed so that the BOM is ignored).

As to the zero-width non-breaking space: the BOM is supposed to be treated as such if it appears in the middle of a
string, but at the start it is just the BOM and isn't considered part of the data, if I'm reading the spec right.
Perhaps the lexers should allow for it as white space (along with other Unicode space characters, such as U+2060).
It's not really important, since allowing the BOM sequence in the middle of a file is "deprecated" according to the
Unicode standard.
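
As an illustration of that start-versus-middle distinction, Python's "utf-8-sig" codec behaves this way: it drops EF BB BF only when it is the very first thing in the input, and otherwise passes it through as U+FEFF. This is only a sketch of the behavior, not a proposal for how psql should implement it; read_script is a hypothetical helper name.

```python
def read_script(raw: bytes) -> str:
    """Decode UTF-8 script bytes, dropping a BOM only at the very start."""
    # "utf-8-sig" skips one leading EF BB BF; a later one stays as U+FEFF.
    return raw.decode("utf-8-sig")

leading = read_script(b"\xef\xbb\xbfSELECT 1;")
interior = read_script(b"SELECT\xef\xbb\xbf 1;")
print(repr(leading))
print(repr(interior))
```

In the second case the lexer would still see U+FEFF between SELECT and 1, which is where the question of treating it as white space comes in.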

And what if you see a real BOM, FF FE or FE FF or FF FE 00 00 or 00 00 FE FF?  Give an error saying UTF-16 and UTF-32
are not supported?

Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8, so psql and PostgreSQL understand it?
(BTW, that would actually be nice on Windows, where UTF-16 is common).

If we accept the UTF-8 BOM, we should at least detect the other BOM sequences and give an error or warning.
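
A minimal sketch of what such detection could look like, again in Python just to keep it short. The names sniff_bom and strip_or_reject are hypothetical and the error text is only a placeholder. The one subtle point is ordering: FF FE 00 00 (UTF-32 little-endian) begins with FF FE (UTF-16 little-endian), so the four-byte signatures have to be tested first.

```python
import codecs

# Four-byte signatures first: the UTF-32LE BOM starts with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]

def sniff_bom(head: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for signature, name in _BOMS:
        if head.startswith(signature):
            return name, len(signature)
    return None, 0

def strip_or_reject(raw: bytes) -> bytes:
    """Drop a leading UTF-8 BOM; refuse input carrying a UTF-16/UTF-32 BOM."""
    name, length = sniff_bom(raw[:4])
    if name is None:
        return raw
    if name == "UTF-8":
        return raw[length:]
    raise ValueError(f"input appears to be {name}, which is not supported")
```

Note that FF FE 00 00 could in principle also be a UTF-16LE BOM followed by a NUL character. Since NUL cannot appear in SQL input anyway, reading it as UTF-32LE seems a safe guess here.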

Overall, from my user point of view, having psql deal with the BOM (at least the UTF-8 one) would be friendlier than
the current behavior, as some editors (Notepad, for example) automatically add the BOM to the beginning of Unicode
files, and it's not obvious without dumping the file in hex.




