Thread: Remaining dependency on setlocale()
After some previous work here: https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com we are less dependent on setlocale(), but it's still not completely gone. setlocale() counts as thread-unsafe, so it would be nice to eliminate it completely. The obvious answer is uselocale(), which sets the locale only for the calling thread, and takes precedence over whatever is set with setlocale(). But there are a couple problems: 1. I don't think it's supported on Windows. 2. I don't see a good way to canonicalize a locale name, like in check_locale(), which uses the result of setlocale(). Thoughts? Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > But there are a couple problems: > 1. I don't think it's supported on Windows. Can't help with that, but surely Windows has some thread-safe way. > 2. I don't see a good way to canonicalize a locale name, like in > check_locale(), which uses the result of setlocale(). What I can tell you about that is that check_locale's expectation that setlocale does any useful canonicalization is mostly wishful thinking [1]. On a lot of platforms you just get the input string back again. If that's the only thing keeping us on setlocale, I think we could drop it. (Perhaps we should do some canonicalization of our own instead?) regards, tom lane [1] https://www.postgresql.org/message-id/14856.1348497531@sss.pgh.pa.us
On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > But there are a couple problems: > > > 1. I don't think it's supported on Windows. > > Can't help with that, but surely Windows has some thread-safe way. It does. It's not exactly the same, instead there is a thing you can call that puts setlocale() itself into a thread-local mode, but last time I checked that mode was missing on MinGW so that's a bit of an obstacle. How far can we get by using more _l() functions? For example, [1] shows a use of strftime() that I think can be converted to strftime_l() so that it doesn't depend on setlocale(). Since POSIX doesn't specify every obvious _l function, we might need to provide any missing wrappers that save/restore thread-locally with uselocale(). Windows doesn't have uselocale(), but it generally doesn't need such wrappers because it does have most of the obvious _l() functions. > > 2. I don't see a good way to canonicalize a locale name, like in > > check_locale(), which uses the result of setlocale(). > > What I can tell you about that is that check_locale's expectation > that setlocale does any useful canonicalization is mostly wishful > thinking [1]. On a lot of platforms you just get the input string > back again. If that's the only thing keeping us on setlocale, > I think we could drop it. (Perhaps we should do some canonicalization > of our own instead?) +1 I know it does something on Windows (we know the EDB installer gives it strings like "Language,Country" and it converts them to "Language_Country.Encoding", see various threads about it all going wrong), but I'm not sure it does anything we actually want to encourage. I'm hoping we can gradually screw it down so that we only have sane BCP 47 in the system on that OS, and I don't see why we wouldn't just use them verbatim. [1] https://www.postgresql.org/message-id/CA%2BhUKGJ%3Dca39Cg%3Dy%3DS89EaCYvvCF8NrZRO%3Duog-cnz0VzC6Kfg%40mail.gmail.com
On 8/7/24 03:07, Thomas Munro wrote: > How far can we get by using more _l() functions? For example, [1] > shows a use of strftime() that I think can be converted to > strftime_l() so that it doesn't depend on setlocale(). Since POSIX > doesn't specify every obvious _l function, we might need to provide > any missing wrappers that save/restore thread-locally with > uselocale(). Windows doesn't have uselocale(), but it generally > doesn't need such wrappers because it does have most of the obvious > _l() functions. Most of the strtoX functions have an _l variant, but one to watch is atoi, which is defined with a hardcoded call to strtol, at least with glibc: 8<---------- /* Convert a string to an int. */ int atoi (const char *nptr) { return (int) strtol (nptr, (char **) NULL, 10); } 8<---------- I guess in many/most places we use atoi we don't care, but maybe it matters for some? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Aug 7, 2024 at 9:42 AM Joe Conway <mail@joeconway.com> wrote: > I guess in many/most places we use atoi we don't care, but maybe it > matters for some? I think we should move in the direction of replacing atoi() calls with strtol() and actually checking for errors. In many places where use atoi(), it's unlikely that the string would be anything but an integer, so error checks are arguably unnecessary. A backup label file isn't likely to say "START TIMELINE: potaytoes". On the other hand, if it did say that, I'd prefer to get an error about potaytoes than have it be treated as if it said "START TIMELINE: 0". And I've definitely found missing error-checks over the years. For example, on pg14, "pg_basebackup -Ft -Zmaximum -Dx" works as if you specified "-Z0" because atoi("maximum") == 0. If we make a practice of checking integer conversions for errors everywhere, we might avoid some such silliness. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote: > How far can we get by using more _l() functions? There are a ton of calls to, for example, isspace(), used mostly for parsing. I wouldn't expect a lot of differences in behavior from locale to locale, like might be the case with iswspace(), but behavior can be different at least in theory. So I guess we're stuck with setlocale()/uselocale() for a while, unless we're able to move most of those call sites over to an ascii-only variant. Regards, Jeff Davis
On 8/7/24 13:16, Jeff Davis wrote: > On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote: >> How far can we get by using more _l() functions? > > There are a ton of calls to, for example, isspace(), used mostly for > parsing. > > I wouldn't expect a lot of differences in behavior from locale to > locale, like might be the case with iswspace(), but behavior can be > different at least in theory. > > So I guess we're stuck with setlocale()/uselocale() for a while, unless > we're able to move most of those call sites over to an ascii-only > variant. FWIW I see all of these in glibc: isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, isxdigit_l -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote: > FWIW I see all of these in glibc: > > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, > isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, > isxdigit_l On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote: > There are a ton of calls to, for example, isspace(), used mostly for > parsing. > > I wouldn't expect a lot of differences in behavior from locale to > locale, like might be the case with iswspace(), but behavior can be > different at least in theory. > > So I guess we're stuck with setlocale()/uselocale() for a while, unless > we're able to move most of those call sites over to an ascii-only > variant. We do know of a few isspace() calls that are already questionable[1] (should be scanner_isspace(), or something like that). It's not only weird that SELECT ROW('libertà!') is displayed with or without double quote depending (in theory) on your locale, it's also undefined behaviour because we feed individual bytes of a multi-byte sequence to isspace(), so OSes disagree, and in practice we know that macOS and Windows think that the byte 0xa inside 'à' is a space while glibc and FreeBSD don't. Looking at the languages with many sequences containing 0xa0, I guess you'd probably need to be processing CJK text and cross-platform for the difference to become obvious (that was the case for the problem report I analysed): for i in range(1, 0xffff): if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'): print("%04x: %s" % (i, chr(i))) [1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com
On Thu, Aug 8, 2024 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote: > > FWIW I see all of these in glibc: > > > > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, > > isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, > > isxdigit_l > > On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l. Those (except isascii_l) are from POSIX 2008[1]. They were absorbed from "Extended API Set Part 4"[2], along with locale_t (that's why there is a header <xlocale.h> on a couple of systems even though after absorption they are supposed to be in <locale.h>). We already decided that all computers have that stuff (commit 8d9a9f03), but the reality is a little messier than that... NetBSD hasn't implemented uselocale() yet[3], though it has a good set of _l functions. As discussed in [3], ECPG code is therefore currently broken in multithreaded clients because it's falling back to a setlocale() path, and I think Windows+MinGW must be too (it lacks HAVE__CONFIGTHREADLOCALE), but those both have a good set of _l functions. In that thread I tried to figure out how to use _l functions to fix that problem, but ... The issue there is that we have our own snprintf.c, that implicitly requires LC_NUMERIC to be "C" (it is documented as always printing floats a certain way ignoring locale and that's what the callers there want in frontend and backend code, but in reality it punts to system snprintf for floats, assuming that LC_NUMERIC is "C", which we configure early in backend startup, but frontend code has to do it for itself!). So we could use snprintf_l or strtod_l instead, but POSIX hasn't got those yet. Or we could use own own Ryu code (fairly specific), but integrating Ryu into our snprintf.c (and correctly implementing all the %... stuff?) sounds like quite a hard, devil-in-the-details kind of an undertaking to me. Or maybe it's easy, I dunno. As for the _l functions, you could probably get away with "every computer has either uselocale() or snprintf_() (or strtod_()?)" and have two code paths in our snprintf.c. But then we'd also need a place to track a locale_t for a long-lived newlocale("C"), which was too messy in my latest attempt... [1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/isspace.html [2] https://pubs.opengroup.org/onlinepubs/9699939499/toc.pdf [3] https://www.postgresql.org/message-id/flat/CWZBBRR6YA8D.8EHMDRGLCKCD%40neon.tech
On Wed, 2024-08-07 at 13:28 -0400, Joe Conway wrote: > FWIW I see all of these in glibc: > > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, > isgraph_l, islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, > isxdigit_l My point was just that there are a lot of those call sites (especially for isspace()) in various parsers. It feels like a lot of code churn to change all of them, when a lot of them seem to be intended for ascii anyway. And where do we get the locale_t structure from? We can create our own at database connection time, and supply it to each of those call sites, but I'm not sure that's a huge advantage over just using uselocale(). Regards, Jeff Davis
On 8/8/24 12:45 AM, Jeff Davis wrote: > My point was just that there are a lot of those call sites (especially > for isspace()) in various parsers. It feels like a lot of code churn to > change all of them, when a lot of them seem to be intended for ascii > anyway. > > And where do we get the locale_t structure from? We can create our own > at database connection time, and supply it to each of those call sites, > but I'm not sure that's a huge advantage over just using uselocale(). I am leaning towards that we should write our own pure ascii functions for this. Since we do not support any non-ascii compatible encodings anyway I do not see the point in having locale support in most of these call-sites. Andewas
On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote: > I am leaning towards that we should write our own pure ascii > functions > for this. That makes sense for a lot of call sites, but it could cause breakage if we aren't careful. > Since we do not support any non-ascii compatible encodings > anyway I do not see the point in having locale support in most of > these > call-sites. An ascii-compatible encoding just means that the code points in the ascii range are represented as ascii. I'm not clear on whether code points in the ascii range can return different results for things like isspace(), but it sounds plausible -- toupper() can return different results for 'i' in tr_TR. Also, what about the values outside 128-255, which are still valid input to isspace()? Regards, Jeff Davis
On Tue Aug 6, 2024 at 5:00 PM CDT, Jeff Davis wrote: > After some previous work here: > > https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com > > we are less dependent on setlocale(), but it's still not completely > gone. > > setlocale() counts as thread-unsafe, so it would be nice to eliminate > it completely. > > The obvious answer is uselocale(), which sets the locale only for the > calling thread, and takes precedence over whatever is set with > setlocale(). > > But there are a couple problems: > > 1. I don't think it's supported on Windows. > > 2. I don't see a good way to canonicalize a locale name, like in > check_locale(), which uses the result of setlocale(). > > Thoughts? Hey Jeff, See this thread[0] for some work that I had previously done. Feel free to take it over, or we could collaborate. [0]: https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech -- Tristan Partin Neon (https://neon.tech)
Hi, On Fri, 2024-08-09 at 15:16 -0500, Tristan Partin wrote: > Hey Jeff, > > See this thread[0] for some work that I had previously done. Feel > free > to take it over, or we could collaborate. > > [0]: > https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech Sounds good, sorry I missed that. Can you please rebase and we can discuss in that thread? Regards, Jeff Davis
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > > > But there are a couple problems: > > > > > 1. I don't think it's supported on Windows. > > > > Can't help with that, but surely Windows has some thread-safe way. > > It does. It's not exactly the same, instead there is a thing you can > call that puts setlocale() itself into a thread-local mode, but last > time I checked that mode was missing on MinGW so that's a bit of an > obstacle. Actually the MinGW situation might be better than that these days. I know of three environments where we currently have to keep code working on MinGW: build farm animal fairywren (msys2 compiler toochain), CI's optional "Windows - Server 2019, MinGW64 - Meson" task, and CI's "CompilerWarnings" task, in the "mingw_cross_warning" step (which actually runs on Linux, and uses configure rather than meson). All three environments show that they have _configthreadlocale. So could we could simply require it on Windows? Then it might be possible to write a replacement implementation of uselocale() that does a two-step dance with _configthreadlocale() and setlocale(), restoring both afterwards if they changed. That's what ECPG open-codes already. The NetBSD situation is more vexing. I was trying to find out if someone is working on it and unfortunately it looks like there is a principled stand against adding it: https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html They're right that we really just want to use "C" in some places, and their LC_C_LOCALE is a very useful system-provided value to be able to pass into _l functions. It's a shame it's non-standard, because without it you have to allocate a locale_t for "C" and keep it somewhere to feed to _l functions...
I've posted a new attempt at ripping those ECPG setlocales() out on the other thread that had the earlier version and discussion: https://www.postgresql.org/message-id/CA%2BhUKG%2BYv%2Bps%3DnS2T8SS1UDU%3DiySHSr4sGHYiYGkPTpZx6Ooww%40mail.gmail.com
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote: > > How far can we get by using more _l() functions? > > There are a ton of calls to, for example, isspace(), used mostly for > parsing. > > I wouldn't expect a lot of differences in behavior from locale to > locale, like might be the case with iswspace(), but behavior can be > different at least in theory. > > So I guess we're stuck with setlocale()/uselocale() for a while, unless > we're able to move most of those call sites over to an ascii-only > variant. Here are two more cases that I don't think I've seen discussed. 1. The nl_langinfo() call in pg_get_encoding_from_locale(), can probably be changed to nl_langinfo_l() (it is everywhere we currently care about except Windows, which has a different already-thread-safe alternative; AIX seems to lack the _l version, but someone writing a patch to re-add support for that OS could supply the configure goo for a uselocale() safe/restore implementation). One problem is that it has callers that pass it NULL meaning the backend default, but we'd perhaps use LC_C_GLOBAL for now and have to think about where we get the database default locale_t in the future. 2. localeconv() is *doubly* non-thread-safe: it depends on the current locale, and it also returns an object whose storage might be clobbered by any other call to localeconv(), setlocale, or even, according to POSIX, uselocale() (!!!). I think that effectively closes off that escape hatch. On some OSes (macOS, BSDs) you find localeconv_l() and then I think they give you a more workable lifetime: as long as the locale_t lives, which makes perfect sense. I am surprised that no one has invented localeconv_r() where you supply the output storage, and you could wrap that in uselocale() save/restore to deal with the other problem, or localeconv_r_l() or something. I can't understand why this is so bad. The glibc documentation calls it "a masterpiece of poor design". Ahh, so it seems like we need to delete our use of localeconf() completely, because we should be able to get all the information we need from nl_langinfo_l() instead: https://www.gnu.org/software/libc/manual/html_node/Locale-Information.html
On Mon, Aug 12, 2024 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > 1. The nl_langinfo() call in pg_get_encoding_from_locale(), can > probably be changed to nl_langinfo_l() (it is everywhere we currently > care about except Windows, which has a different already-thread-safe > alternative ... ... though if we wanted to replace all use of localeconv and struct lconv with nl_langinfo_l() calls, it's not totally obvious how to do that on Windows. Its closest thing is GetLocaleInfoEx(), but that has complications: it takes wchar_t locale names, which we don't even have and can't access when we only have a locale_t, and it must look them up in some data structure every time, and it copies data out to the caller as wchar_t so now you have two conversion problems and a storage problem. If I understand correctly, the whole point of nl_langinfo_l(item, loc) is that it is supposed to be fast, it's really just an array lookup, and item is just an index, and the result is supposed to be stable as long as loc hasn't been freed (and the thread hasn't exited). So you can use it without putting your own caching in front of it. One idea I came up with which I haven't tried and it might turn out to be terrible, is that we could change our definition of locale_t on Windows. Currently it's a typedef to Windows' _locale_t, and we use it with a bunch of _XXX functions that we access by macro to remove the underscore. Instead, we could make locale_t a pointer to a struct of our own design in WIN32 builds, holding the native _locale_t and also an array full of all the values that nl_langinfo_l() can return. We'd provide the standard enums, indexes into that array, in a fake POSIX-oid header <langinfo.h>. Then nl_langinfo_l(item, loc) could be implemented as loc->private_langinfo[item], and strcoll_l(.., loc) could be a static inline function that does _strcol_l(..., loc->private_windows_locale_t). These structs would be allocated and freed with standard-looking newlocale() and freelocale(), so we could finally stop using #ifdef WIN32-wrapped _create_locale() directly. Then everything would look more POSIX-y, nl_langinfo_l() could be used directly wherever we need fast access to that info, and we could, I think, banish the awkward localeconv, right? I don't know if this all makes total sense and haven't tried it, just spitballing here...
On Mon, Aug 12, 2024 at 4:53 PM Thomas Munro <thomas.munro@gmail.com> wrote: > ... though if we wanted to replace all use of localeconv and struct > lconv with nl_langinfo_l() calls, Gah. I realised while trying the above that you can't really replace localeconv() with nl_langinfo_l() as the GNU documentation recommends, because some of the lconv fields we're using are missing from langinfo.h in POSIX (only GNU added them all, that was a good idea). :-( Next idea: Windows: its localeconv() returns pointer to thread-local storage, good, but it still needs setlocale(). So the options are: make our own lconv-populator function for Windows, using GetLocaleInfoEx(), or do that _configthreadlocale() dance (possibly excluding some MinGW configurations from working) Systems that have localeconv_l(): use that POSIX: use uselocale() and also put a big global lock around localeconv() call + accessing result (optionally skipping that on an OS-by-OS basis after confirming that its implementation doesn't really need it) The reason the uselocale() + localeconv() seems to require a Big Lock (by default at least) is that the uselocale() deals only with the "current locale" aspect, not the output buffer aspect. Clearly the standard allows for it to be thread-local storage (that's why since 2008 it says that after thread-exit you can't access the result, and I guess that's how it works on real systems (?)), but it also seems to allow for a single static buffer (that's why it says that it's not re-entrant, and any call to localeconv() might clobber it). That might be OK in practice because we tend to cache that stuff, eg when assigning GUC lc_monetary (that cache would presumably become thread-local in the first phase of the multithreading plan), so the locking shouldn't really hurt. The reason we'd have to have three ways, and not just two, is again that NetBSD declined to implement uselocale(). I'll try this in a bit unless someone else has better ideas or plans for this part... sorry for the drip-feeding.
On Tue, Aug 13, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I'll try this in a bit unless someone else has better ideas or plans > for this part... sorry for the drip-feeding. And done, see commitfest entry #5170.
On Sat, 2024-08-10 at 09:42 +1200, Thomas Munro wrote: > The NetBSD situation is more vexing. I was trying to find out if > someone is working on it and unfortunately it looks like there is a > principled stand against adding it: > > https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html > https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html The objection seems to be very general: that uselocale() modifies the thread state and affects calls a long distance from uselocale(). I don't disagree with the general sentiment. But in effect, that just prevents people migrating away from setlocale(), to which the same argument applies, and is additionally thread-unsafe. The only alternative is to essentially ban the use of non-_l variants, which is fine I suppose, but causes a fair amount of code churn. > They're right that we really just want to use "C" in some places, and > their LC_C_LOCALE is a very useful system-provided value to be able > to > pass into _l functions. It's a shame it's non-standard, because > without it you have to allocate a locale_t for "C" and keep it > somewhere to feed to _l functions... If we're going to do that, why not just have ascii-only variants of our own? pg_ascii_isspace(...) is at least as readable as isspace_l(..., LC_C_LOCALE). Regards, Jeff Davis
On Wed, Aug 14, 2024 at 1:05 PM Jeff Davis <pgsql@j-davis.com> wrote: > The only alternative is to essentially ban the use of non-_l variants, > which is fine I suppose, but causes a fair amount of code churn. Let's zoom out a bit and consider some ways we could set up the process, threads and individual calls: 1. The process global locale is always "C". If you ever call uselocale(), it can only be for short stretches, and you have to restore it straight after; perhaps it is only ever used in replacement _l() functions for systems that lack them. You need to use _l() functions for all non-"C" locales. The current database default needs to be available as a variable (in future: thread-local variable, or reachable from one), so you can use it in _l() functions. The "C" locale can be accessed implicitly with non-l() functions, or you could ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) for "C". Or a name like PG_C_LOCALE, which, in backend code could be just LC_GLOBAL_LOCALE, while in frontend/library code it could be the singleton mechanism I showed in CF#5166. XXX Note that nailing LC_ALL to "C" in the backend would extend our existing policy for LC_NUMERIC to all categories. That's why we can use strtod() in the backend and expect the radix character to be '.'. It's interesting to contemplate the strtod() calls in CF#5166: they are code copied-and-pasted from backend and frontend; in the backend we can use strtod() currently but in the frontend code I showed a change to strtod_l(..., PG_C_LOCALE), in order to be able to delete some ugly deeply nested uselocale()/setlocale() stuff of the exact sort that NetBSD hackers (and I) hate. It's obviously a bit of a code smell that it's copied-and-pasted in the first place, and should really share code. Supposing some of that stuff finishes up in src/common, then I think you'd want a strtod_l(..., PG_C_LOCALE) that could be allowed to take advantage of the knowledge that the global locale is "C" in the backend. Just thoughts... 2. The process global locale is always "C". Each backend process (in future: thread) calls uselocale() to set the thread-local locale to the database default, so it can keep using the non-_l() functions as a way to access the database default, and otherwise uses _l() functions if it wants something else (as we do already). The "C" locale can be accessed with foo_l(..., LC_GLOBAL_LOCALE) or PG_C_LOCALE etc. XXX This option is blocked by NetBSD's rejection of uselocale(). I guess if you really wanted to work around NetBSD's policy you could make your own wrapper for all affected functions, translating foo() to foo_l(..., pg_thread_current_locale), so you could write uselocale(), which is pretty much what every other libc does... But eughhh 3. The process global locale is inherited from the system or can be set by the user however they want for the benefit of extensions, but we never change it after startup or refer to it. Then we do the same as 1 or 2, except if we ever want "C" we'll need a locale_t for that, again perhaps using the PC_C_LOCALE mechanism. Non-_l() functions are effectively useless except in cases where you really want to use the system's settings inherited from startup, eg for messages, so they'd mostly be banned. What else? > > They're right that we really just want to use "C" in some places, and > > their LC_C_LOCALE is a very useful system-provided value to be able > > to > > pass into _l functions. It's a shame it's non-standard, because > > without it you have to allocate a locale_t for "C" and keep it > > somewhere to feed to _l functions... > > If we're going to do that, why not just have ascii-only variants of our > own? pg_ascii_isspace(...) is at least as readable as isspace_l(..., > LC_C_LOCALE). Yeah, I agree there are some easy things we should do that way. In fact we've already established that scanner_isspace() needs to be used in lots more places for that, even aside from thread-safety.
On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote: > 1. The process global locale is always "C". If you ever call > uselocale(), it can only be for short stretches, and you have to > restore it straight after; perhaps it is only ever used in > replacement > _l() functions for systems that lack them. You need to use _l() > functions for all non-"C" locales. The current database default > needs > to be available as a variable (in future: thread-local variable, or > reachable from one), so you can use it in _l() functions. The "C" > locale can be accessed implicitly with non-l() functions, or you > could > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) > for > "C". Or a name like PG_C_LOCALE, which, in backend code could be > just > LC_GLOBAL_LOCALE, while in frontend/library code it could be the > singleton mechanism I showed in CF#5166. +1 to this approach. It makes things more consistent across platforms and avoids surprising dependencies on the global setting. We'll have to be careful that each call site is either OK with C, or that it gets changed to an _l() variant. We also have to be careful about extensions. Regards, Jeff Davis
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > > > 2. I don't see a good way to canonicalize a locale name, like in > > > check_locale(), which uses the result of setlocale(). > > > > What I can tell you about that is that check_locale's expectation > > that setlocale does any useful canonicalization is mostly wishful > > thinking [1]. On a lot of platforms you just get the input string > > back again. If that's the only thing keeping us on setlocale, > > I think we could drop it. (Perhaps we should do some canonicalization > > of our own instead?) > > +1 > > I know it does something on Windows (we know the EDB installer gives > it strings like "Language,Country" and it converts them to > "Language_Country.Encoding", see various threads about it all going > wrong), but I'm not sure it does anything we actually want to > encourage. I'm hoping we can gradually screw it down so that we only > have sane BCP 47 in the system on that OS, and I don't see why we > wouldn't just use them verbatim. Some more thoughts on check_locale() and canonicalisation: I doubt the canonicalisation does anything useful on any Unix system, as they're basically just file names. In the case of glibc, the encoding part is munged before opening the file so it tolerates .utf8 or .UTF-8 or .u---T----f------8 on input, but it still returns whatever you gave it so the return value isn't cleaning the input or anything. "" is a problem however... the special value for "native environment" is returned as a real locale name, which we probably still need in places. We could change that to newlocale("") + query instead, but there is a portability pipeline problem getting the name out of it: 1. POSIX only just added getlocalename_l() in 2024[1][2]. 2. Glibc has non-standard nl_langinfo_l(NL_LOCALE_NAME(category), loc). 3. The <xlocale.h> systems (macOS/*BSD) have non-standard querylocale(mask, loc). 4. AFAIK there is no way to do it on pure POSIX 2008 systems. 5. For Windows, there is a completely different thing to get the user's default locale, see CF#3772. The systems in category 4 would in practice be Solaris and (if it comes back) AIX. Given that, we probably just can't go that way soon. So I think the solution could perhaps be something like: in some early startup phase before there are any threads, we nail down all the locale categories to "C" (or whatever we decide on for the permanent global locale), and also query the "" categories and make a copy of them in case anyone wants them later, and then never call setlocale() again. [1] https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html [2] https://www.austingroupbugs.net/view.php?id=1220
On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote: > So I think the solution could perhaps be something like: in some > early > startup phase before there are any threads, we nail down all the > locale categories to "C" (or whatever we decide on for the permanent > global locale), and also query the "" categories and make a copy of > them in case anyone wants them later, and then never call setlocale() > again. +1. Regards, Jeff Davis
On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote: > > So I think the solution could perhaps be something like: in some > > early > > startup phase before there are any threads, we nail down all the > > locale categories to "C" (or whatever we decide on for the permanent > > global locale), and also query the "" categories and make a copy of > > them in case anyone wants them later, and then never call setlocale() > > again. > > +1. We currently nail down these categories: /* We keep these set to "C" always. See pg_locale.c for explanation. */ init_locale("LC_MONETARY", LC_MONETARY, "C"); init_locale("LC_NUMERIC", LC_NUMERIC, "C"); init_locale("LC_TIME", LC_TIME, "C"); CF #5170 has patches to make it so that we stop changing them even transiently, using locale_t interfaces to feed our caches of stuff needed to work with those categories, so they really stay truly nailed down. It sounds like someone needs to investigate doing the same thing for these two, from CheckMyDatabase(): if (pg_perm_setlocale(LC_COLLATE, collate) == NULL) ereport(FATAL, (errmsg("database locale is incompatible with operating system"), errdetail("The database was initialized with LC_COLLATE \"%s\", " " which is not recognized by setlocale().", collate), errhint("Recreate the database with another locale or install the missing locale."))); if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL) ereport(FATAL, (errmsg("database locale is incompatible with operating system"), errdetail("The database was initialized with LC_CTYPE \"%s\", " " which is not recognized by setlocale().", ctype), errhint("Recreate the database with another locale or install the missing locale."))); How should that work? Maybe we could imagine something like MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories set appropriately. Or should it be a pg_locale_t instead (if your database default provider is ICU, then you don't even need a locale_t, right?). Then I think there is one quite gnarly category, from assign_locale_messages() (a GUC assignment function): (void) pg_perm_setlocale(LC_MESSAGES, newval); I have never really studied gettext(), but I know it was just standardised in POSIX 2024, and the standardised interface has _l() variants of all functions. Current implementations don't have them yet. Clearly we absolutely couldn't call pg_perm_setlocale() after early startup -- but if gettext() is relying on the current locale to affect far away code, then maybe this is one place where we'd just have to use uselocale(). Perhaps we could plan some transitional strategy where NetBSD users lose the ability to change the GUC without restarting the server and it has to be the same for all sessions, or something like that, until they produce either gettext_l() or uselocale(), but I haven't thought hard about this part at all yet...
On 15.08.24 00:43, Thomas Munro wrote: > "" is a problem however... the special value for "native environment" > is returned as a real locale name, which we probably still need in > places. We could change that to newlocale("") + query instead, but Where do we need that in the server? It should just be initdb doing that and then initializing the server with concrete values based on that. I guess technically some of these GUC settings default to the environment? But I think we could consider getting rid of that.
On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote: > On 15.08.24 00:43, Thomas Munro wrote: > > "" is a problem however... the special value for "native environment" > > is returned as a real locale name, which we probably still need in > > places. We could change that to newlocale("") + query instead, but > > Where do we need that in the server? Hmm. Yeah, right, the only way I've found so far to even reach that code and that captures that result is: create database db2 locale = ''; Thats puts 'en_NZ.UTF-8' or whatever in pg_database. In contrast, create collation will accept '' but just store it verbatim, and the GUCs for changing time, monetary, numeric accept it too and keep it verbatim. We could simply ban '' in all user commands. I doubt they're documented as acceptable values, once you get past initdb and have a running system. Looking into that... > It should just be initdb doing that and then initializing the server > with concrete values based on that. Right. > I guess technically some of these GUC settings default to the > environment? But I think we could consider getting rid of that. Yeah.
On Fri, Aug 16, 2024 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote: > > It should just be initdb doing that and then initializing the server > > with concrete values based on that. > > Right. > > > I guess technically some of these GUC settings default to the > > environment? But I think we could consider getting rid of that. > > Yeah. Seems to make a lot of sense. I tried that out over in CF #5170. (In case it's not clear why I'm splitting discussion between threads: I was thinking of this thread as the one where we're discussing what needs to be done, with other threads being spun off to become CF entry with concrete patches. I realised re-reading some discussion that that might not be obvious...)
On 8/9/24 8:24 PM, Jeff Davis wrote: > On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote: >> I am leaning towards that we should write our own pure ascii >> functions >> for this. > > That makes sense for a lot of call sites, but it could cause breakage > if we aren't careful. > >> Since we do not support any non-ascii compatible encodings >> anyway I do not see the point in having locale support in most of >> these >> call-sites. > > An ascii-compatible encoding just means that the code points in the > ascii range are represented as ascii. I'm not clear on whether code > points in the ascii range can return different results for things like > isspace(), but it sounds plausible -- toupper() can return different > results for 'i' in tr_TR. > > Also, what about the values outside 128-255, which are still valid > input to isspace()? My idea was that in a lot of those cases we only try to parse e.g. 0-9 as digits and always only . as the decimal separator so we should make just make that obvious by either using locale C or writing our own ascii only functions. These strings are meant to be read by machines, not humans, primarily. Andreas
On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote: > On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote: > > 1. The process global locale is always "C". If you ever call > > uselocale(), it can only be for short stretches, and you have to > > restore it straight after; perhaps it is only ever used in > > replacement > > _l() functions for systems that lack them. You need to use _l() > > functions for all non-"C" locales. The current database default > > needs > > to be available as a variable (in future: thread-local variable, or > > reachable from one), so you can use it in _l() functions. The "C" > > locale can be accessed implicitly with non-l() functions, or you > > could > > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) > > for > > "C". Or a name like PG_C_LOCALE, which, in backend code could be > > just > > LC_GLOBAL_LOCALE, while in frontend/library code it could be the > > singleton mechanism I showed in CF#5166. > > +1 to this approach. It makes things more consistent across platforms > and avoids surprising dependencies on the global setting. > > We'll have to be careful that each call site is either OK with C, or > that it gets changed to an _l() variant. We also have to be careful > about extensions. Did we reach a conclusion here? Any thoughts on moving in this direction, and whether 18 is the right time to do it? Regards, Jeff Davis
On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote: > > On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote: > > > 1. The process global locale is always "C". If you ever call > > > uselocale(), it can only be for short stretches, and you have to > > > restore it straight after; perhaps it is only ever used in > > > replacement > > > _l() functions for systems that lack them. You need to use _l() > > > functions for all non-"C" locales. The current database default > > > needs > > > to be available as a variable (in future: thread-local variable, or > > > reachable from one), so you can use it in _l() functions. The "C" > > > locale can be accessed implicitly with non-l() functions, or you > > > could > > > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) > > > for > > > "C". Or a name like PG_C_LOCALE, which, in backend code could be > > > just > > > LC_GLOBAL_LOCALE, while in frontend/library code it could be the > > > singleton mechanism I showed in CF#5166. > > > > +1 to this approach. It makes things more consistent across platforms > > and avoids surprising dependencies on the global setting. > > > > We'll have to be careful that each call site is either OK with C, or > > that it gets changed to an _l() variant. We also have to be careful > > about extensions. > > Did we reach a conclusion here? Any thoughts on moving in this > direction, and whether 18 is the right time to do it? I think this is the best way, and I haven't seen anyone supporting any other idea. (I'm working on those setlocale()-removal patches I mentioned, more very soon...)
On 13.12.24 10:44, Thomas Munro wrote: > On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote: >> On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote: >>> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote: >>>> 1. The process global locale is always "C". If you ever call >>>> uselocale(), it can only be for short stretches, and you have to >>>> restore it straight after; perhaps it is only ever used in >>>> replacement >>>> _l() functions for systems that lack them. You need to use _l() >>>> functions for all non-"C" locales. The current database default >>>> needs >>>> to be available as a variable (in future: thread-local variable, or >>>> reachable from one), so you can use it in _l() functions. The "C" >>>> locale can be accessed implicitly with non-l() functions, or you >>>> could >>>> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) >>>> for >>>> "C". Or a name like PG_C_LOCALE, which, in backend code could be >>>> just >>>> LC_GLOBAL_LOCALE, while in frontend/library code it could be the >>>> singleton mechanism I showed in CF#5166. >>> >>> +1 to this approach. It makes things more consistent across platforms >>> and avoids surprising dependencies on the global setting. >>> >>> We'll have to be careful that each call site is either OK with C, or >>> that it gets changed to an _l() variant. We also have to be careful >>> about extensions. >> >> Did we reach a conclusion here? Any thoughts on moving in this >> direction, and whether 18 is the right time to do it? > > I think this is the best way, and I haven't seen anyone supporting any > other idea. (I'm working on those setlocale()-removal patches I > mentioned, more very soon...) I also think this is the right direction, and we'll get closer with the remaining patches that Thomas has lined up. I think at this point, we could already remove all locale settings related to LC_COLLATE. Nothing uses that anymore. I think we will need to keep the global LC_CTYPE setting set to something useful, for example so that system error messages come out in the right encoding. But I'm concerned about the the Perl_setlocale() dance in plperl.c. Perl apparently does a setlocale(LC_ALL, "") during startup, and that code is a workaround to reset everything back afterwards. We need to be careful not to break that. (Perl has fixed that in 5.19, but the fix requires that you set another environment variable before launching Perl, which you can't do in a threaded system, so we'd probably need another fix eventually. See <https://github.com/Perl/perl5/issues/8274>.)
On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote: > I think we will need to keep the global LC_CTYPE setting set to > something useful, for example so that system error messages come out > in > the right encoding. Do we need to rely on the global LC_CTYPE setting? We already use bind_textdomain_codeset(). > But I'm concerned about the the Perl_setlocale() dance in plperl.c. > Perl apparently does a setlocale(LC_ALL, "") during startup, and that > code is a workaround to reset everything back afterwards. We need to > be > careful not to break that. > > (Perl has fixed that in 5.19, but the fix requires that you set > another > environment variable before launching Perl, which you can't do in a > threaded system, so we'd probably need another fix eventually. See > <https://github.com/Perl/perl5/issues/8274>.) I don't fully understand that issue, but I would think the direction we are going (keeping the global LC_CTYPE more consistent and relying on it less) would make the problem better. Regards, Jeff Davis
On 17.12.24 19:10, Jeff Davis wrote: > On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote: >> I think we will need to keep the global LC_CTYPE setting set to >> something useful, for example so that system error messages come out >> in >> the right encoding. > > Do we need to rely on the global LC_CTYPE setting? We already use > bind_textdomain_codeset(). I don't think that would cover messages from the C library (strerror, dlerror, etc.). >> But I'm concerned about the the Perl_setlocale() dance in plperl.c. >> Perl apparently does a setlocale(LC_ALL, "") during startup, and that >> code is a workaround to reset everything back afterwards. We need to >> be >> careful not to break that. >> >> (Perl has fixed that in 5.19, but the fix requires that you set >> another >> environment variable before launching Perl, which you can't do in a >> threaded system, so we'd probably need another fix eventually. See >> <https://github.com/Perl/perl5/issues/8274>.) > > I don't fully understand that issue, but I would think the direction we > are going (keeping the global LC_CTYPE more consistent and relying on > it less) would make the problem better. Yes, I think it's the right direction, but we need to figure this issue out eventually.