Re: making the backend's json parser work in frontend code - Mailing list pgsql-hackers

From Robert Haas
Subject Re: making the backend's json parser work in frontend code
Date
Msg-id CA+Tgmobo6N8=C3FB9q+u8_n2w23iVYNxrpOfO9na1dba+m7Udw@mail.gmail.com
In response to Re: making the backend's json parser work in frontend code  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: making the backend's json parser work in frontend code  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: making the backend's json parser work in frontend code  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Wed, Jan 22, 2020 at 10:00 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Hopefully, this addresses Robert's concern upthread about the filesystem name not necessarily being in utf8 format,
> though I might be misunderstanding the exact thrust of his concern. I can think of other possible interpretations of
> his concern as he expressed it, so I'll wait for him to clarify.

No, that's not it. Suppose that Álvaro Herrera has some custom
settings he likes to put on all the PostgreSQL clusters that he uses,
so he creates a file álvaro.conf and uses an "include" directive in
postgresql.conf to suck in those settings. If he also likes UTF-8,
then the file name will be stored in the file system as a 12-byte
value of which the first two bytes will be 0xc3 0xa1. In that case,
everything will be fine, because JSON is supposed to always be UTF-8,
and the file name is UTF-8, and it's all good. But suppose he instead
likes LATIN-1. Then the file name will be stored as an 11-byte value
and the first byte will be 0xe1. The second byte, representing a
lower-case 'l', will be 0x6c. But we can't put a byte sequence that
goes 0xe1 0x6c into a JSON manifest stored as UTF-8, because that's
not valid in UTF-8. UTF-8 requires that every lead byte in the range
0xc0-0xff be followed by one or more continuation bytes in the range
0x80-0xbf, and our hypothetical file name, which starts with 0xe1
0x6c, does not meet that criterion.
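To make the byte-level claim above concrete, here is a small illustrative sketch in Python (purely for demonstration, not part of any patch under discussion):

```python
# The same file name encodes to different byte sequences depending
# on which encoding the user's system happens to use.
name = "álvaro.conf"

utf8_bytes = name.encode("utf-8")
assert len(utf8_bytes) == 12 and utf8_bytes[:2] == b"\xc3\xa1"

latin1_bytes = name.encode("latin-1")
assert len(latin1_bytes) == 11 and latin1_bytes[:1] == b"\xe1"

# The LATIN-1 byte sequence is not valid UTF-8: 0xe1 is a UTF-8
# lead byte that must be followed by continuation bytes in the
# range 0x80-0xbf, but the next byte here is 0x6c ('l').
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

So the 11-byte LATIN-1 spelling of the name simply cannot appear verbatim inside a manifest that must be well-formed UTF-8.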

Now, you might say "well, why don't we just do an encoding
conversion?", but we can't. When the filesystem tells us what the file
names are, it does not tell us what encoding the person who created
those files had in mind. We don't know that they had *any* encoding in
mind. IIUC, a file in the data directory can have a name that consists
of any sequence of bytes whatsoever, so long as it doesn't contain
prohibited characters like a path separator or \0 byte. But only some
of those possible octet sequences can be stored in a manifest that has
to be valid UTF-8.
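As an illustrative sketch of why no conversion can be inferred (again just Python for demonstration): the same raw bytes handed to us by the filesystem decode successfully, but to entirely different names, under different single-byte encodings, and nothing in the bytes themselves says which one the file's creator intended.

```python
# Raw bytes as the filesystem hands them to us -- no encoding attached.
raw = b"\xe1lvaro.conf"

# Under LATIN-1 the first byte is 'á'; under KOI8-R that same byte
# is a Cyrillic letter.  Both decodes succeed, so the bytes alone
# cannot tell us which conversion would be correct.
as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8-r")
print(as_latin1)          # álvaro.conf
print(as_koi8r)           # a different, equally "valid" name
assert as_latin1 != as_koi8r
```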

The degree to which there is a practical problem here is limited by
the fact that most filenames within the data directory are chosen by
the system, e.g. base/16384/16385, and those file names are only going
to contain ASCII characters (i.e. code points 0-127) and those are
valid in UTF-8 and lots of other encodings. Moreover, most people who
create additional files in the data directory will probably use ASCII
characters for those as well, at least if they are from an
English-speaking country, and if they're not, they're likely going to
use UTF-8, and then they'll still be fine. But there is no rule that
says people have to do that, and if somebody wants to use file names
based around SJIS or whatever, the backup manifest functionality
should not for that reason break.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


