From: Thomas Munro
Subject: Re: Giving the shared catalogues a defined encoding
Msg-id: CA+hUKGL=F0pSLLf3nLpA_-sBwYsLg7s=FD6YFo_PDvS84FE_hw@mail.gmail.com
In response to: Re: Giving the shared catalogues a defined encoding (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Sat, Dec 7, 2024 at 7:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Over in the discussion of bug #18735, I've come to the realization
> that these problems apply equally to the filesystem path names that
> the server deals with: not only the data directory path, but the
> path to the installation files [1].  Can we apply the same sort of
> restrictions to those?  I'm envisioning that initdb would check
> either encoding-validity or all-ASCII-ness of those path names
> depending on which mode it's setting the server up in.

Here are some things I have learned about pathname encoding:

* Some systems enforce an encoding: macOS always requires UTF-8, ext4
does too if you turn on case insensitivity, zfs has a utf8only option,
and a few less interesting-to-us ones have relevant mount options.  On
the first three at least: open("cafe\xe9", ...) -> EILSEQ, independent
of user space notions like locales.

* Implications of such a system with a non-ASCII data directory:

  * there is only one valid configuration at initdb time:
    --encoding=UTF8 --cluster-encoding=DATABASE, which is probably the
    default anyway, so it doesn't need to be spelled out
  * --cluster-encoding=ASCII would fail with the attached patch
  * --encoding=EUCXXX/MULE --cluster-encoding=DATABASE might fail if
    those encodings have picky verification
  * --encoding=LATIN1 --cluster-encoding=DATABASE would be wrong but
    would bogusly pass verification (LATIN1 and other single-byte
    encodings have no invalid sequences), and SHOW data_directory
    would expose the familiar UTF8-through-LATIN1 glasses distortion
    "café" → "cafÃ©"

All of that is perfectly reasonable I think, I just want to highlight
the cascading effect of the new constraint: with this design, Apple's
file system restricts your *database* encoding, unless you stick to
plain ASCII pathnames.  It is an interesting case to compare with when
untangling the Windows mess, see below...

* Traditional Unix filesystems, eg ext4/xfs/ufs/..., just don't care:
beyond '/' being special, the encoding is in the eye of the beholder.
(According to POSIX 2024 you shouldn't have <newline> in a path
component either; see Austin Group issue #251 and others about
attempts to screw things down a bit and document EILSEQ in various
interfaces.)  That's cool: just make sure --encoding matches what
you're actually using, accept the default --cluster-encoding=DATABASE,
and everything should be OK.

* Windows has a completely different model.  Pathnames are really
UTF-16 in the kernel and on disk.  All char strings exchanged with the
system have a defined encoding, but it was non-obvious to this humble
Unix hacker what it is in each case.  I don't have Windows, so I spent
the afternoon firing test code at CI [1][2] to figure some of it out.
Some cases relevant to initdb: environ[] and argv[] are in ACP
encoding, even if the parent process used a different encoding by
calling setlocale() before putenv() or system() etc, or used the
UTF-16 variants _wputenv() or _wsystem().  (You can also access them
as UTF-16 if you want.)  So that's how the data directory pathname
arrives in initdb/postgres.  Then, to use it, the filesystem functions
seem to fall into two classes: the POSIXoid ones in the C runtime with
lowercase names like mkdir() are affected by setlocale() and use its
encoding, while the NT native ones with CamelCase names like
CreateFile() don't care about locales and keep using the ACP.  That
sounded like a problem, because we mix them: our open() is really a
wrapper around CreateFile(), and yet elsewhere we also use unlink()
etc.  Fortunately we keep the server locked in the "C" locale, and the
lowercase ones appear to use the ACP in that case anyway, so the
difference is not revealed (it might upset frontend programs though?).
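In case it helps to see the shape of the experiment, here is a
hypothetical sketch along the lines of the probes I was firing at CI
(not the actual test at [1]; the file names and output are invented):

#include <windows.h>
#include <direct.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
	/* The system ANSI code page: 1252, 932, ..., or 65001 for UTF-8. */
	printf("ACP: %u\n", GetACP());

	/* Adopt the user's locale in the C runtime; per the observations
	 * above this affects lowercase CRT calls like _mkdir(), while
	 * CamelCase calls like CreateFileA() stick to the ACP. */
	const char *loc = setlocale(LC_ALL, "");
	printf("locale: %s\n", loc ? loc : "(failed)");

	/* 0xe9 is "é" in LATIN1/WIN1252; feed the same bytes to both
	 * layers and see what each one thinks they mean. */
	if (_mkdir("caf\xe9-crt") != 0)
		perror("_mkdir");

	HANDLE h = CreateFileA("caf\xe9-w32.txt", GENERIC_WRITE, 0, NULL,
						   CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE)
		printf("CreateFileA: error %lu\n", GetLastError());
	else
		CloseHandle(h);

	return 0;
}

Listing the resulting names through the UTF-16 interfaces (or just
looking in Explorer) reveals which code page each layer used to
convert that 0xe9 byte to UTF-16.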
* Consequence: on Windows it is senseless to check whether
getenv("PGDATA") or argv[]-supplied paths can be validated with the
cluster encoding.  The thing to do instead would be to check whether
they can be converted from the ACP to the cluster encoding, and then
store the converted version for display as SHOW data_directory, but
keep the ACP versions we need for filesystem APIs, and probably
likewise for lots of other things, and then plug all the bugs and edge
cases that fall out of that, for the rest of time... or adopt wchar_t
interfaces everywhere, perhaps via wrappers that use the database
encoding... or other variations, which all sound completely out of
scope...

* What I'm wondering is whether we can instead achieve coherence along
the lines of the Apple UTF-8 case I described above, but with an extra
step: if you want to use non-ASCII paths, *you have to make your ACP
match the database and cluster encoding*.  So either you go all-in on
your favourite 80s encoding like WIN1252 that matches your ACP
(impossible for 932 AKA SJIS), or you switch your system's ACP to
UTF-8.  Alternatively, I believe postgres.exe could even be built in a
way that makes its ACP always UTF-8 [3] (I guess the loader has to
help with that, as it affects the way it sets up environ[] and argv[]
before main() runs).  I don't know all the consequences though.  And I
don't know what exact rules would be best, but something like that
would be in keeping with the general philosophy of this project: just
figure out how to block the combinations that don't work correctly.

> Changing the catalog encoding would also have to re-verify the
> suitability of the paths.

Yeah, my current development version does pick that up, but...

postgres=# alter system set cluster encoding to ascii;
ERROR:  configuration parameter "hba_file" has invalid value "/home/tmunro/projects/postgresql/build/café/pg_hba.conf"
DETAIL:  Configuration parameter values defined at scopes affecting all sessions must be valid in CLUSTER ENCODING

... I should probably check it explicitly so I can produce a better
error message.  That one is arbitrarily picking on a computed GUC that
it happened to hit first in hash table order, rather than
data_directory itself.  (HBA content is also an interesting topic.)

> Of course this isn't 100% bulletproof
> since someone could rename those directories later.  But I think
> that's in "if you break it you get to keep both pieces" territory.

If you rename the data directory, my in-development patch still
notices, though, and warns at startup (ERROR seems unsuitable here).
One fun aspect is that you have to finish recovery first, to have a
consistent CLUSTER ENCODING value before validation.

[424912] WARNING:  configuration parameter "data_directory" has invalid value "/home/tmunro/projects/postgresql/build/café"
[424912] DETAIL:  Configuration parameter values defined at scopes affecting all sessions must be valid in CLUSTER ENCODING
[424912] LOG:  database system is ready to accept connections

The GUC validation is strongly enforced with ERROR in other contexts.
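To show the shape I mean, here's a hypothetical sketch of such a check
(not code from the patch: the helper name and signature are invented;
I'm only assuming the existing pg_verify_mbstr() in its no-error mode):

#include "postgres.h"

#include "mb/pg_wchar.h"

/*
 * Hypothetical helper: complain if a cluster-wide GUC value contains
 * byte sequences that are invalid in the cluster encoding.
 */
static void
check_value_in_cluster_encoding(int elevel, const char *name,
								const char *value, int cluster_encoding)
{
	/* Validate without throwing, so we control the report ourselves. */
	if (!pg_verify_mbstr(cluster_encoding, value, strlen(value), true))
		ereport(elevel,
				(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
				 errmsg("configuration parameter \"%s\" has invalid value \"%s\"",
						name, value),
				 errdetail("Configuration parameter values defined at scopes affecting all sessions must be valid in CLUSTER ENCODING.")));
}

The elevel argument is where the WARNING-at-startup versus
ERROR-elsewhere policy described above would plug in.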
For plain SET, I'm trying out a scheme for marking the GUCs that are
shareable, like application_name, and being liberal with the rest:
interactive values that can't escape from the database.

[1] https://github.com/macdice/hello-windows/blob/env/test.c
[2] https://cirrus-ci.com/task/5463497838952448
[3] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page