Re: Giving the shared catalogues a defined encoding - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Giving the shared catalogues a defined encoding
Date
Msg-id CA+hUKGL=F0pSLLf3nLpA_-sBwYsLg7s=FD6YFo_PDvS84FE_hw@mail.gmail.com
In response to Re: Giving the shared catalogues a defined encoding  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sat, Dec 7, 2024 at 7:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Over in the discussion of bug #18735, I've come to the realization
> that these problems apply equally to the filesystem path names that
> the server deals with: not only the data directory path, but the
> path to the installation files [1].  Can we apply the same sort of
> restrictions to those?  I'm envisioning that initdb would check
> either encoding-validity or all-ASCII-ness of those path names
> depending on which mode it's setting the server up in.

Here are some things I have learned about pathname encoding:

* Some systems enforce an encoding: macOS always requires UTF-8, ext4
does too if you turn on case insensitivity, zfs has a utf8only option,
and a few less interesting-to-us ones have relevant mount options.  On
the first three at least: open("cafe\xe9", ...) -> EILSEQ, independent
of user space notions like locales.

* Implications of such a system with non-ASCII data directory:
 * there is only one valid configuration at initdb time:
--encoding=UTF8 --cluster-encoding=DATABASE, which is probably the
default anyway, so doesn't need to be spelled out
 * --cluster-encoding=ASCII would fail with the attached patch
 * --encoding=EUCXXX/MULE --cluster-encoding=DATABASE might fail if
those encodings have picky verification
 * --encoding=LATIN1 --cluster-encoding=DATABASE would be wrong but
bogusly pass verification (LATIN1 and other single-byte encodings have
no invalid sequences), and SHOW data_directory would expose the
familiar UTF8-through-LATIN1 glasses distortion "café"→"cafÃ©".
All of that is perfectly reasonable, I think.  I just want to highlight
the cascading effect of the new constraint: with this design, Apple's
file system restricts your *database* encoding unless you stick to
plain ASCII pathnames.  It is an interesting case to compare with
when untangling the Windows mess, see below...

* Traditional Unix filesystems, e.g. ext4/xfs/ufs/..., just don't care:
beyond '/' being special, the encoding is in the eye of the beholder.
(According to POSIX 2024 you shouldn't have <newline> in a path
component either, see Austin Group issue #251 and others about
attempts to screw things down a bit and document EILSEQ in various
interfaces).  That's cool, just make sure --encoding matches what
you're actually using, accept default --cluster-encoding=DATABASE, and
everything should be OK.

* Windows has a completely different model.  Pathnames are really
UTF-16 in the kernel and on disk.  All char strings exchanged with the
system have a defined encoding, but it was non-obvious to this humble
Unix hacker what it is in each case.  I don't have Windows, so I spent
the afternoon firing test code at CI[1][2] to figure some of it out.
Some cases relevant to initdb: environ[] and argv[] are in ACP
encoding, even if the parent process used a different encoding by
calling setlocale() before putenv() or system() etc, or used the
UTF-16 variants _wputenv() or _wsystem().  You can also access them as
UTF-16 if you want.  So that's how the data directory pathname arrives
into initdb/postgres.  Then to use it, the filesystem functions seem
to be in two classes: the POSIXoid ones in the C runtime with
lowercase names like mkdir() are affected by setlocale() and use its
encoding, while the NT native ones with CamelCase like CreateFile()
don't care about locales and keep using the ACP.  That sounded like a
problem because we mix them: our open() is really a wrapper on
CreateFile() and yet elsewhere we also use unlink() etc, but
fortunately we keep the server locked in "C" locale and then the
lowercase ones appear to use the ACP in that case anyway, so the
difference is not revealed (it might upset frontend programs though?).

* Consequence:  It is senseless to check if getenv("PGDATA") or
argv[]-supplied paths can be validated with the cluster encoding, on
Windows.  The thing to do instead would be to check if they can be
converted from ACP to the cluster encoding, and then store the
converted version for display as SHOW data_directory, but keep the ACP
versions we need for filesystem APIs, and probably likewise for lots
of other things, and then plug all the bugs and edge cases that fall
out of that, for the rest of time... or adopt wchar_t interfaces
everywhere perhaps via wrappers that use database encoding... or other
variations which all sound completely out of scope...

* What I'm wondering is whether we can instead achieve coherence along
the lines of the Apple UTF-8 case I described above, but with an extra
step: if you want to use non-ASCII paths *you have to make your ACP
match the database and cluster encoding*.  So either you go all-in on
your favourite 80s encoding like WIN1252 that matches your ACP
(impossible for 932 AKA SJIS), or you switch your system's ACP to
UTF-8.  Alternatively, I believe postgres.exe could even be built in a
way that makes its ACP always UTF-8[3] (I guess the loader has to help
with that as it affects the way it sets up environ[] and argv[] before
main() runs).  I don't know all the consequences though.  And I don't
know what exact rules would be best, but something like that would be
in keeping with the general philosophy of this project: just figure
out how to block the combinations that don't work correctly.
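
To make the "convert from ACP, keep the ACP bytes for the filesystem"
idea from the Windows bullets concrete, here is a minimal sketch.  It is
not PostgreSQL code: check_path() and CheckedPath are hypothetical names,
and Python codec names such as "cp1252" and "utf-8" merely stand in for
the ACP and the cluster encoding.  It only shows which combinations
would be blocked.

```python
from typing import NamedTuple


class CheckedPath(NamedTuple):
    """Hypothetical pair of representations for one pathname."""
    acp_bytes: bytes      # what would still be handed to filesystem APIs
    display_bytes: bytes  # what something like SHOW data_directory would print


def check_path(acp_bytes: bytes, acp: str, cluster_encoding: str) -> CheckedPath:
    """Convert a path from the ACP to the cluster encoding, or raise ValueError.

    Assumption: a path is acceptable only if every character survives the
    ACP -> cluster encoding conversion.
    """
    try:
        text = acp_bytes.decode(acp)              # interpret the bytes as ACP text
        display = text.encode(cluster_encoding)   # re-encode for display
    except UnicodeError as exc:
        raise ValueError(
            f"path cannot be converted from {acp} to {cluster_encoding}"
        ) from exc
    return CheckedPath(acp_bytes, display)


# "café" typed on a WIN1252 system, cluster encoding UTF8: accepted.
ok = check_path(b"C:\\pgdata\\caf\xe9", "cp1252", "utf-8")
# ok.display_bytes == b"C:\\pgdata\\caf\xc3\xa9"; ok.acp_bytes is unchanged
```

With the same input and a cluster encoding of "ascii", check_path()
raises ValueError, which corresponds to blocking that combination.  A
path containing Japanese characters under ACP "cp932" would likewise be
rejected for a "latin-1" cluster encoding.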

> Changing the catalog encoding would also have to re-verify the
> suitability of the paths.

Yeah, my current development version does pick that up, but ...

postgres=# alter system set cluster encoding to ascii;
ERROR:  configuration parameter "hba_file" has invalid value
"/home/tmunro/projects/postgresql/build/café/pg_hba.conf"
DETAIL:  Configuration parameter values defined at scopes affecting
all sessions must be valid in CLUSTER ENCODING

... I should probably check it explicitly so I can make a better error
message.  That one is arbitrarily picking on a computed GUC it
happened to hit first in hash table order, and not data_directory
itself.  (HBA content is also an interesting topic.)

> Of course this isn't 100% bulletproof
> since someone could rename those directories later.  But I think
> that's in "if you break it you get to keep both pieces" territory.

If you rename the data directory, my in-development patch still
notices though, and warns at startup (ERROR seems unsuitable here).
One fun aspect is that you have to finish recovery first, to have a
consistent CLUSTER ENCODING value before validation.

[424912] WARNING:  configuration parameter "data_directory" has
invalid value "/home/tmunro/projects/postgresql/build/café"
[424912] DETAIL:  Configuration parameter values defined at scopes
affecting all sessions must be valid in CLUSTER ENCODING
[424912] LOG:  database system is ready to accept connections
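
For readers who want to see what "valid in CLUSTER ENCODING" amounts to
for a path like the one in that warning, here is a rough sketch.  It is
not the patch's code: is_valid_in() is a hypothetical helper, and the
Python codecs "ascii", "utf-8" and "latin-1" are only stand-ins for the
corresponding server encodings.

```python
def is_valid_in(encoding: str, raw: bytes) -> bool:
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The path bytes as a UTF-8 file system would store them.
path = "/build/café".encode("utf-8")

ascii_ok = is_valid_in("ascii", path)     # False: bytes above 0x7f are rejected
utf8_ok = is_valid_in("utf-8", path)      # True: the bytes are well-formed UTF-8
latin1_ok = is_valid_in("latin-1", path)  # True: every byte is a LATIN1 character

# The LATIN1 "success" is the bogus case: the text comes out distorted.
distorted = path.decode("latin-1")        # '/build/cafÃ©'
```

The ASCII case is the one that produces the message above; the LATIN1
case passes verification even though the result is wrong, since a
single-byte encoding has no invalid sequences to detect.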

The GUC validation is strongly enforced with ERROR in other contexts.
For plain SET, I'm trying out a scheme for marking the GUCs that are
shareable like application_name, and being liberal with the rest for
interactive values that can't escape from the database.

[1] https://github.com/macdice/hello-windows/blob/env/test.c
[2] https://cirrus-ci.com/task/5463497838952448
[3] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page


