Thread: Remaining dependency on setlocale()

Remaining dependency on setlocale()

From
Jeff Davis
Date:
After some previous work here:

https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com

we are less dependent on setlocale(), but it's still not completely
gone.

setlocale() counts as thread-unsafe, so it would be nice to eliminate
it completely.

The obvious answer is uselocale(), which sets the locale only for the
calling thread, and takes precedence over whatever is set with
setlocale().

But there are a couple problems:

1. I don't think it's supported on Windows.

2. I don't see a good way to canonicalize a locale name, like in
check_locale(), which uses the result of setlocale().

Thoughts?

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> But there are a couple problems:

> 1. I don't think it's supported on Windows.

Can't help with that, but surely Windows has some thread-safe way.

> 2. I don't see a good way to canonicalize a locale name, like in
> check_locale(), which uses the result of setlocale().

What I can tell you about that is that check_locale's expectation
that setlocale does any useful canonicalization is mostly wishful
thinking [1].  On a lot of platforms you just get the input string
back again.  If that's the only thing keeping us on setlocale,
I think we could drop it.  (Perhaps we should do some canonicalization
of our own instead?)

            regards, tom lane

[1] https://www.postgresql.org/message-id/14856.1348497531@sss.pgh.pa.us



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > But there are a couple problems:
>
> > 1. I don't think it's supported on Windows.
>
> Can't help with that, but surely Windows has some thread-safe way.

It does.  It's not exactly the same, instead there is a thing you can
call that puts setlocale() itself into a thread-local mode, but last
time I checked that mode was missing on MinGW so that's a bit of an
obstacle.

How far can we get by using more _l() functions?  For example, [1]
shows a use of strftime() that I think can be converted to
strftime_l() so that it doesn't depend on setlocale().  Since POSIX
doesn't specify every obvious _l function, we might need to provide
any missing wrappers that save/restore thread-locally with
uselocale().  Windows doesn't have uselocale(), but it generally
doesn't need such wrappers because it does have most of the obvious
_l() functions.

> > 2. I don't see a good way to canonicalize a locale name, like in
> > check_locale(), which uses the result of setlocale().
>
> What I can tell you about that is that check_locale's expectation
> that setlocale does any useful canonicalization is mostly wishful
> thinking [1].  On a lot of platforms you just get the input string
> back again.  If that's the only thing keeping us on setlocale,
> I think we could drop it.  (Perhaps we should do some canonicalization
> of our own instead?)

+1

I know it does something on Windows (we know the EDB installer gives
it strings like "Language,Country" and it converts them to
"Language_Country.Encoding", see various threads about it all going
wrong), but I'm not sure it does anything we actually want to
encourage.  I'm hoping we can gradually screw it down so that we only
have sane BCP 47 in the system on that OS, and I don't see why we
wouldn't just use them verbatim.

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJ%3Dca39Cg%3Dy%3DS89EaCYvvCF8NrZRO%3Duog-cnz0VzC6Kfg%40mail.gmail.com



Re: Remaining dependency on setlocale()

From
Joe Conway
Date:
On 8/7/24 03:07, Thomas Munro wrote:
> How far can we get by using more _l() functions?  For example, [1]
> shows a use of strftime() that I think can be converted to
> strftime_l() so that it doesn't depend on setlocale().  Since POSIX
> doesn't specify every obvious _l function, we might need to provide
> any missing wrappers that save/restore thread-locally with
> uselocale().  Windows doesn't have uselocale(), but it generally
> doesn't need such wrappers because it does have most of the obvious
> _l() functions.

Most of the strtoX functions have an _l variant, but one to watch is 
atoi, which is defined with a hardcoded call to strtol, at least with glibc:

8<----------
/* Convert a string to an int.  */
int
atoi (const char *nptr)
{
   return (int) strtol (nptr, (char **) NULL, 10);
}
8<----------

I guess in many/most places we use atoi we don't care, but maybe it 
matters for some?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Remaining dependency on setlocale()

From
Robert Haas
Date:
On Wed, Aug 7, 2024 at 9:42 AM Joe Conway <mail@joeconway.com> wrote:
> I guess in many/most places we use atoi we don't care, but maybe it
> matters for some?

I think we should move in the direction of replacing atoi() calls with
strtol() and actually checking for errors. In many places where use
atoi(), it's unlikely that the string would be anything but an
integer, so error checks are arguably unnecessary. A backup label file
isn't likely to say "START TIMELINE: potaytoes". On the other hand, if
it did say that, I'd prefer to get an error about potaytoes than have
it be treated as if it said "START TIMELINE: 0". And I've definitely
found missing error-checks over the years. For example, on pg14,
"pg_basebackup -Ft -Zmaximum -Dx" works as if you specified "-Z0"
because atoi("maximum") == 0. If we make a practice of checking
integer conversions for errors everywhere, we might avoid some such
silliness.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
> How far can we get by using more _l() functions?

There are a ton of calls to, for example, isspace(), used mostly for
parsing.

I wouldn't expect a lot of differences in behavior from locale to
locale, like might be the case with iswspace(), but behavior can be
different at least in theory.

So I guess we're stuck with setlocale()/uselocale() for a while, unless
we're able to move most of those call sites over to an ascii-only
variant.

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Joe Conway
Date:
On 8/7/24 13:16, Jeff Davis wrote:
> On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
>> How far can we get by using more _l() functions?
> 
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
> 
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
> 
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

FWIW I see all of these in glibc:

isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, 
isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, 
isxdigit_l


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Remaining dependency on setlocale()

From
Robert Haas
Date:
On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
> FWIW I see all of these in glibc:
>
> isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> isxdigit_l

On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

We do know of a few isspace() calls that are already questionable[1]
(should be scanner_isspace(), or something like that).  It's not only
weird that SELECT ROW('libertà!') is displayed with or without double
quote depending (in theory) on your locale, it's also undefined
behaviour because we feed individual bytes of a multi-byte sequence to
isspace(), so OSes disagree, and in practice we know that macOS and
Windows think that the byte 0xa inside 'à' is a space while glibc and
FreeBSD don't.  Looking at the languages with many sequences
containing 0xa0, I guess you'd probably need to be processing CJK text
and cross-platform for the difference to become obvious (that was the
case for the problem report I analysed):

for i in range(1, 0xffff):
  if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
    print("%04x: %s" % (i, chr(i)))

[1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Thu, Aug 8, 2024 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
> > FWIW I see all of these in glibc:
> >
> > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> > isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> > isxdigit_l
>
> On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.

Those (except isascii_l) are from POSIX 2008[1].  They were absorbed
from "Extended API Set Part 4"[2], along with locale_t (that's why
there is a header <xlocale.h> on a couple of systems even though after
absorption they are supposed to be in <locale.h>).  We already
decided that all computers have that stuff (commit 8d9a9f03), but the
reality is a little messier than that... NetBSD hasn't implemented
uselocale() yet[3], though it has a good set of _l functions.  As
discussed in [3], ECPG code is therefore currently broken in
multithreaded clients because it's falling back to a setlocale() path,
and I think Windows+MinGW must be too (it lacks
HAVE__CONFIGTHREADLOCALE), but those both have a good set of _l
functions.  In that thread I tried to figure out how to use _l
functions to fix that problem, but ...

The issue there is that we have our own snprintf.c, that implicitly
requires LC_NUMERIC to be "C" (it is documented as always printing
floats a certain way ignoring locale and that's what the callers there
want in frontend and backend code, but in reality it punts to system
snprintf for floats, assuming that LC_NUMERIC is "C", which we
configure early in backend startup, but frontend code has to do it for
itself!).  So we could use snprintf_l or strtod_l instead, but POSIX
hasn't got those yet.  Or we could use own own Ryu code (fairly
specific), but integrating Ryu into our snprintf.c (and correctly
implementing all the %... stuff?) sounds like quite a hard,
devil-in-the-details kind of an undertaking to me.  Or maybe it's
easy, I dunno.  As for the _l functions, you could probably get away
with "every computer has either uselocale() or snprintf_() (or
strtod_()?)" and have two code paths in our snprintf.c.  But then we'd
also need a place to track a locale_t for a long-lived newlocale("C"),
which was too messy in my latest attempt...

[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/isspace.html
[2] https://pubs.opengroup.org/onlinepubs/9699939499/toc.pdf
[3] https://www.postgresql.org/message-id/flat/CWZBBRR6YA8D.8EHMDRGLCKCD%40neon.tech



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-08-07 at 13:28 -0400, Joe Conway wrote:
> FWIW I see all of these in glibc:
>
> isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
> isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
> isxdigit_l

My point was just that there are a lot of those call sites (especially
for isspace()) in various parsers. It feels like a lot of code churn to
change all of them, when a lot of them seem to be intended for ascii
anyway.

And where do we get the locale_t structure from? We can create our own
at database connection time, and supply it to each of those call sites,
but I'm not sure that's a huge advantage over just using uselocale().

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Andreas Karlsson
Date:
On 8/8/24 12:45 AM, Jeff Davis wrote:
> My point was just that there are a lot of those call sites (especially
> for isspace()) in various parsers. It feels like a lot of code churn to
> change all of them, when a lot of them seem to be intended for ascii
> anyway.
> 
> And where do we get the locale_t structure from? We can create our own
> at database connection time, and supply it to each of those call sites,
> but I'm not sure that's a huge advantage over just using uselocale().

I am leaning towards that we should write our own pure ascii functions 
for this. Since we do not support any non-ascii compatible encodings 
anyway I do not see the point in having locale support in most of these 
call-sites.

Andewas



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
> I am leaning towards that we should write our own pure ascii
> functions
> for this.

That makes sense for a lot of call sites, but it could cause breakage
if we aren't careful.

>  Since we do not support any non-ascii compatible encodings
> anyway I do not see the point in having locale support in most of
> these
> call-sites.

An ascii-compatible encoding just means that the code points in the
ascii range are represented as ascii. I'm not clear on whether code
points in the ascii range can return different results for things like
isspace(), but it sounds plausible -- toupper() can return different
results for 'i' in tr_TR.

Also, what about the values outside 128-255, which are still valid
input to isspace()?

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
"Tristan Partin"
Date:
On Tue Aug 6, 2024 at 5:00 PM CDT, Jeff Davis wrote:
> After some previous work here:
>
> https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com
>
> we are less dependent on setlocale(), but it's still not completely
> gone.
>
> setlocale() counts as thread-unsafe, so it would be nice to eliminate
> it completely.
>
> The obvious answer is uselocale(), which sets the locale only for the
> calling thread, and takes precedence over whatever is set with
> setlocale().
>
> But there are a couple problems:
>
> 1. I don't think it's supported on Windows.
>
> 2. I don't see a good way to canonicalize a locale name, like in
> check_locale(), which uses the result of setlocale().
>
> Thoughts?

Hey Jeff,

See this thread[0] for some work that I had previously done. Feel free
to take it over, or we could collaborate.

[0]: https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech

--
Tristan Partin
Neon (https://neon.tech)



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
Hi,

On Fri, 2024-08-09 at 15:16 -0500, Tristan Partin wrote:
> Hey Jeff,
>
> See this thread[0] for some work that I had previously done. Feel
> free
> to take it over, or we could collaborate.
>
> [0]:
> https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech

Sounds good, sorry I missed that.

Can you please rebase and we can discuss in that thread?

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > But there are a couple problems:
> >
> > > 1. I don't think it's supported on Windows.
> >
> > Can't help with that, but surely Windows has some thread-safe way.
>
> It does.  It's not exactly the same, instead there is a thing you can
> call that puts setlocale() itself into a thread-local mode, but last
> time I checked that mode was missing on MinGW so that's a bit of an
> obstacle.

Actually the MinGW situation might be better than that these days.  I
know of three environments where we currently have to keep code
working on MinGW: build farm animal fairywren (msys2 compiler
toochain), CI's optional "Windows - Server 2019, MinGW64 - Meson"
task, and CI's "CompilerWarnings" task, in the "mingw_cross_warning"
step (which actually runs on Linux, and uses configure rather than
meson).  All three environments show that they have
_configthreadlocale.  So could we could simply require it on Windows?
Then it might be possible to write a replacement implementation of
uselocale() that does a two-step dance with _configthreadlocale() and
setlocale(), restoring both afterwards if they changed.  That's what
ECPG open-codes already.

The NetBSD situation is more vexing.  I was trying to find out if
someone is working on it and unfortunately it looks like there is a
principled stand against adding it:

https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html

They're right that we really just want to use "C" in some places, and
their LC_C_LOCALE is a very useful system-provided value to be able to
pass into _l functions.  It's a shame it's non-standard, because
without it you have to allocate a locale_t for "C" and keep it
somewhere to feed to _l functions...



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
I've posted a new attempt at ripping those ECPG setlocales() out on
the other thread that had the earlier version and discussion:

https://www.postgresql.org/message-id/CA%2BhUKG%2BYv%2Bps%3DnS2T8SS1UDU%3DiySHSr4sGHYiYGkPTpZx6Ooww%40mail.gmail.com



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
> > How far can we get by using more _l() functions?
>
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

Here are two more cases that I don't think I've seen discussed.

1.  The nl_langinfo() call in pg_get_encoding_from_locale(), can
probably be changed to nl_langinfo_l() (it is everywhere we currently
care about except Windows, which has a different already-thread-safe
alternative; AIX seems to lack the _l version, but someone writing a
patch to re-add support for that OS could supply the configure goo for
a uselocale() safe/restore implementation).  One problem is that it
has callers that pass it NULL meaning the backend default, but we'd
perhaps use LC_C_GLOBAL for now and have to think about where we get
the database default locale_t in the future.

2.  localeconv() is *doubly* non-thread-safe: it depends on the
current locale, and it also returns an object whose storage might be
clobbered by any other call to localeconv(), setlocale, or even,
according to POSIX, uselocale() (!!!).  I think that effectively
closes off that escape hatch.  On some OSes (macOS, BSDs) you find
localeconv_l() and then I think they give you a more workable
lifetime: as long as the locale_t lives, which makes perfect sense.  I
am surprised that no one has invented localeconv_r() where you supply
the output storage, and you could wrap that in uselocale()
save/restore to deal with the other problem, or localeconv_r_l() or
something.  I can't understand why this is so bad.  The glibc
documentation calls it "a masterpiece of poor design".  Ahh, so it
seems like we need to delete our use of localeconf() completely,
because we should be able to get all the information we need from
nl_langinfo_l() instead:

https://www.gnu.org/software/libc/manual/html_node/Locale-Information.html



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Mon, Aug 12, 2024 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 1.  The nl_langinfo() call in pg_get_encoding_from_locale(), can
> probably be changed to nl_langinfo_l() (it is everywhere we currently
> care about except Windows, which has a different already-thread-safe
> alternative ...

... though if we wanted to replace all use of localeconv and struct
lconv with nl_langinfo_l() calls, it's not totally obvious how to do
that on Windows.  Its closest thing is GetLocaleInfoEx(), but that has
complications: it takes wchar_t locale names, which we don't even have
and can't access when we only have a locale_t, and it must look them
up in some data structure every time, and it copies data out to the
caller as wchar_t so now you have two conversion problems and a
storage problem.  If I understand correctly, the whole point of
nl_langinfo_l(item, loc) is that it is supposed to be fast, it's
really just an array lookup, and item is just an index, and the result
is supposed to be stable as long as loc hasn't been freed (and the
thread hasn't exited).  So you can use it without putting your own
caching in front of it.  One idea I came up with which I haven't tried
and it might turn out to be terrible, is that we could change our
definition of locale_t on Windows.  Currently it's a typedef to
Windows' _locale_t, and we use it with a bunch of _XXX functions that
we access by macro to remove the underscore.  Instead, we could make
locale_t a pointer to a struct of our own design in WIN32 builds,
holding the native _locale_t and also an array full of all the values
that nl_langinfo_l() can return.  We'd provide the standard enums,
indexes into that array, in a fake POSIX-oid header <langinfo.h>.
Then nl_langinfo_l(item, loc) could be implemented as
loc->private_langinfo[item], and strcoll_l(.., loc) could be a static
inline function that does _strcol_l(...,
loc->private_windows_locale_t).  These structs would be allocated and
freed with standard-looking newlocale() and freelocale(), so we could
finally stop using #ifdef WIN32-wrapped _create_locale() directly.
Then everything would look more POSIX-y, nl_langinfo_l() could be used
directly wherever we need fast access to that info, and we could, I
think, banish the awkward localeconv, right?  I don't know if this all
makes total sense and haven't tried it, just spitballing here...



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Mon, Aug 12, 2024 at 4:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... though if we wanted to replace all use of localeconv and struct
> lconv with nl_langinfo_l() calls,

Gah.  I realised while trying the above that you can't really replace
localeconv() with nl_langinfo_l() as the GNU documentation recommends,
because some of the lconv fields we're using are missing from
langinfo.h in POSIX (only GNU added them all, that was a good idea).
:-(

Next idea:

Windows: its localeconv() returns pointer to thread-local storage,
good, but it still needs setlocale().  So the options are: make our
own lconv-populator function for Windows, using GetLocaleInfoEx(), or
do that _configthreadlocale() dance (possibly excluding some MinGW
configurations from working)
Systems that have localeconv_l(): use that
POSIX: use uselocale() and also put a big global lock around
localeconv() call + accessing result (optionally skipping that on an
OS-by-OS basis after confirming that its implementation doesn't really
need it)

The reason the uselocale() + localeconv() seems to require a Big Lock
(by default at least) is that the uselocale() deals only with the
"current locale" aspect, not the output buffer aspect.  Clearly the
standard allows for it to be thread-local storage (that's why since
2008 it says that after thread-exit you can't access the result, and I
guess that's how it works on real systems (?)), but it also seems to
allow for a single static buffer (that's why it says that it's not
re-entrant, and any call to localeconv() might clobber it).  That
might be OK in practice because we tend to cache that stuff, eg when
assigning GUC lc_monetary (that cache would presumably become
thread-local in the first phase of the multithreading plan), so the
locking shouldn't really hurt.

The reason we'd have to have three ways, and not just two, is again
that NetBSD declined to implement uselocale().

I'll try this in a bit unless someone else has better ideas or plans
for this part... sorry for the drip-feeding.



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Tue, Aug 13, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'll try this in a bit unless someone else has better ideas or plans
> for this part... sorry for the drip-feeding.

And done, see commitfest entry #5170.



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Sat, 2024-08-10 at 09:42 +1200, Thomas Munro wrote:
> The NetBSD situation is more vexing.  I was trying to find out if
> someone is working on it and unfortunately it looks like there is a
> principled stand against adding it:
>
> https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
> https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html

The objection seems to be very general: that uselocale() modifies the
thread state and affects calls a long distance from uselocale(). I
don't disagree with the general sentiment. But in effect, that just
prevents people migrating away from setlocale(), to which the same
argument applies, and is additionally thread-unsafe.

The only alternative is to essentially ban the use of non-_l variants,
which is fine I suppose, but causes a fair amount of code churn.

> They're right that we really just want to use "C" in some places, and
> their LC_C_LOCALE is a very useful system-provided value to be able
> to
> pass into _l functions.  It's a shame it's non-standard, because
> without it you have to allocate a locale_t for "C" and keep it
> somewhere to feed to _l functions...

If we're going to do that, why not just have ascii-only variants of our
own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
LC_C_LOCALE).

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Wed, Aug 14, 2024 at 1:05 PM Jeff Davis <pgsql@j-davis.com> wrote:
> The only alternative is to essentially ban the use of non-_l variants,
> which is fine I suppose, but causes a fair amount of code churn.

Let's zoom out a bit and consider some ways we could set up the
process, threads and individual calls:

1.  The process global locale is always "C".  If you ever call
uselocale(), it can only be for short stretches, and you have to
restore it straight after; perhaps it is only ever used in replacement
_l() functions for systems that lack them.  You need to use _l()
functions for all non-"C" locales.  The current database default needs
to be available as a variable (in future: thread-local variable, or
reachable from one), so you can use it in _l() functions.  The "C"
locale can be accessed implicitly with non-l() functions, or you could
ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) for
"C".  Or a name like PG_C_LOCALE, which, in backend code could be just
LC_GLOBAL_LOCALE, while in frontend/library code it could be the
singleton mechanism I showed in CF#5166.

XXX Note that nailing LC_ALL to "C" in the backend would extend our
existing policy for LC_NUMERIC to all categories.  That's why we can
use strtod() in the backend and expect the radix character to be '.'.
It's interesting to contemplate the strtod() calls in CF#5166: they
are code copied-and-pasted from backend and frontend; in the backend
we can use strtod() currently but in the frontend code I showed a
change to strtod_l(..., PG_C_LOCALE), in order to be able to delete
some ugly deeply nested uselocale()/setlocale() stuff of the exact
sort that NetBSD hackers (and I) hate.  It's obviously a bit of a code
smell that it's copied-and-pasted in the first place, and should
really share code.  Supposing some of that stuff finishes up in
src/common, then I think you'd want a strtod_l(..., PG_C_LOCALE) that
could be allowed to take advantage of the knowledge that the global
locale is "C" in the backend.  Just thoughts...

2.  The process global locale is always "C".  Each backend process (in
future: thread) calls uselocale() to set the thread-local locale to
the database default, so it can keep using the non-_l() functions as a
way to access the database default, and otherwise uses _l() functions
if it wants something else (as we do already).  The "C" locale can be
accessed with foo_l(..., LC_GLOBAL_LOCALE) or PG_C_LOCALE etc.

XXX This option is blocked by NetBSD's rejection of uselocale().  I
guess if you really wanted to work around NetBSD's policy you could
make your own wrapper for all affected functions, translating foo() to
foo_l(..., pg_thread_current_locale), so you could write uselocale(),
which is pretty much what every other libc does...  But eughhh

3.  The process global locale is inherited from the system or can be
set by the user however they want for the benefit of extensions, but
we never change it after startup or refer to it.  Then we do the same
as 1 or 2, except if we ever want "C" we'll need a locale_t for that,
again perhaps using the PC_C_LOCALE mechanism.  Non-_l() functions are
effectively useless except in cases where you really want to use the
system's settings inherited from startup, eg for messages, so they'd
mostly be banned.

What else?

> > They're right that we really just want to use "C" in some places, and
> > their LC_C_LOCALE is a very useful system-provided value to be able
> > to
> > pass into _l functions.  It's a shame it's non-standard, because
> > without it you have to allocate a locale_t for "C" and keep it
> > somewhere to feed to _l functions...
>
> If we're going to do that, why not just have ascii-only variants of our
> own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
> LC_C_LOCALE).

Yeah, I agree there are some easy things we should do that way.  In
fact we've already established that scanner_isspace() needs to be used
in lots more places for that, even aside from thread-safety.



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> 1.  The process global locale is always "C".  If you ever call
> uselocale(), it can only be for short stretches, and you have to
> restore it straight after; perhaps it is only ever used in
> replacement
> _l() functions for systems that lack them.  You need to use _l()
> functions for all non-"C" locales.  The current database default
> needs
> to be available as a variable (in future: thread-local variable, or
> reachable from one), so you can use it in _l() functions.  The "C"
> locale can be accessed implicitly with non-l() functions, or you
> could
> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> for
> "C".  Or a name like PG_C_LOCALE, which, in backend code could be
> just
> LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> singleton mechanism I showed in CF#5166.

+1 to this approach. It makes things more consistent across platforms
and avoids surprising dependencies on the global setting.

We'll have to be careful that each call site is either OK with C, or
that it gets changed to an _l() variant. We also have to be careful
about extensions.

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > 2. I don't see a good way to canonicalize a locale name, like in
> > > check_locale(), which uses the result of setlocale().
> >
> > What I can tell you about that is that check_locale's expectation
> > that setlocale does any useful canonicalization is mostly wishful
> > thinking [1].  On a lot of platforms you just get the input string
> > back again.  If that's the only thing keeping us on setlocale,
> > I think we could drop it.  (Perhaps we should do some canonicalization
> > of our own instead?)
>
> +1
>
> I know it does something on Windows (we know the EDB installer gives
> it strings like "Language,Country" and it converts them to
> "Language_Country.Encoding", see various threads about it all going
> wrong), but I'm not sure it does anything we actually want to
> encourage.  I'm hoping we can gradually screw it down so that we only
> have sane BCP 47 in the system on that OS, and I don't see why we
> wouldn't just use them verbatim.

Some more thoughts on check_locale() and canonicalisation:

I doubt the canonicalisation does anything useful on any Unix system,
as they're basically just file names.  In the case of glibc, the
encoding part is munged before opening the file so it tolerates .utf8
or .UTF-8 or .u---T----f------8 on input, but it still returns
whatever you gave it so the return value isn't cleaning the input or
anything.

"" is a problem however... the special value for "native environment"
is returned as a real locale name, which we probably still need in
places.  We could change that to newlocale("") + query instead, but
there is a portability pipeline problem getting the name out of it:

1. POSIX only just added getlocalename_l() in 2024[1][2].
2. Glibc has non-standard nl_langinfo_l(NL_LOCALE_NAME(category), loc).
3. The <xlocale.h> systems (macOS/*BSD) have non-standard
querylocale(mask, loc).
4. AFAIK there is no way to do it on pure POSIX 2008 systems.
5. For Windows, there is a completely different thing to get the
user's default locale, see CF#3772.

The systems in category 4 would in practice be Solaris and (if it
comes back) AIX.  Given that, we probably just can't go that way soon.

So I think the solution could perhaps be something like: in some early
startup phase before there are any threads, we nail down all the
locale categories to "C" (or whatever we decide on for the permanent
global locale), and also query the "" categories and make a copy of
them in case anyone wants them later, and then never call setlocale()
again.

[1] https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html
[2] https://www.austingroupbugs.net/view.php?id=1220



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> So I think the solution could perhaps be something like: in some
> early
> startup phase before there are any threads, we nail down all the
> locale categories to "C" (or whatever we decide on for the permanent
> global locale), and also query the "" categories and make a copy of
> them in case anyone wants them later, and then never call setlocale()
> again.

+1.

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> > So I think the solution could perhaps be something like: in some
> > early
> > startup phase before there are any threads, we nail down all the
> > locale categories to "C" (or whatever we decide on for the permanent
> > global locale), and also query the "" categories and make a copy of
> > them in case anyone wants them later, and then never call setlocale()
> > again.
>
> +1.

We currently nail down these categories:

    /* We keep these set to "C" always.  See pg_locale.c for explanation. */
    init_locale("LC_MONETARY", LC_MONETARY, "C");
    init_locale("LC_NUMERIC", LC_NUMERIC, "C");
    init_locale("LC_TIME", LC_TIME, "C");

CF #5170 has patches to make it so that we stop changing them even
transiently, using locale_t interfaces to feed our caches of stuff
needed to work with those categories, so they really stay truly nailed
down.

It sounds like someone needs to investigate doing the same thing for
these two, from CheckMyDatabase():

    if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with
operating system"),
                 errdetail("The database was initialized with
LC_COLLATE \"%s\", "
                           " which is not recognized by setlocale().", collate),
                 errhint("Recreate the database with another locale or
install the missing locale.")));

    if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with
operating system"),
                 errdetail("The database was initialized with LC_CTYPE \"%s\", "
                           " which is not recognized by setlocale().", ctype),
                 errhint("Recreate the database with another locale or
install the missing locale.")));

How should that work?  Maybe we could imagine something like
MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories
set appropriately.  Or should it be a pg_locale_t instead (if your
database default provider is ICU, then you don't even need a locale_t,
right?).

Then I think there is one quite gnarly category, from
assign_locale_messages() (a GUC assignment function):

    (void) pg_perm_setlocale(LC_MESSAGES, newval);

I have never really studied gettext(), but I know it was just
standardised in POSIX 2024, and the standardised interface has _l()
variants of all functions.  Current implementations don't have them
yet.  Clearly we absolutely couldn't call pg_perm_setlocale() after
early startup --  but if gettext() is relying on the current locale to
affect far away code, then maybe this is one place where we'd just
have to use uselocale().  Perhaps we could plan some transitional
strategy where NetBSD users lose the ability to change the GUC without
restarting the server and it has to be the same for all sessions, or
something like that, until they produce either gettext_l() or
uselocale(), but I haven't thought hard about this part at all yet...



Re: Remaining dependency on setlocale()

From
Peter Eisentraut
Date:
On 15.08.24 00:43, Thomas Munro wrote:
> "" is a problem however... the special value for "native environment"
> is returned as a real locale name, which we probably still need in
> places.  We could change that to newlocale("") + query instead, but

Where do we need that in the server?

It should just be initdb doing that and then initializing the server 
with concrete values based on that.

I guess technically some of these GUC settings default to the 
environment?  But I think we could consider getting rid of that.



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 15.08.24 00:43, Thomas Munro wrote:
> > "" is a problem however... the special value for "native environment"
> > is returned as a real locale name, which we probably still need in
> > places.  We could change that to newlocale("") + query instead, but
>
> Where do we need that in the server?

Hmm.  Yeah, right, the only way I've found so far to even reach that
code and that captures that result is:

create database db2 locale = '';

Thats puts 'en_NZ.UTF-8' or whatever in pg_database.  In contrast,
create collation will accept '' but just store it verbatim, and the
GUCs for changing time, monetary, numeric accept it too and keep it
verbatim.  We could simply ban '' in all user commands.  I doubt
they're documented as acceptable values, once you get past initdb and
have a running system.  Looking into that...

> It should just be initdb doing that and then initializing the server
> with concrete values based on that.

Right.

> I guess technically some of these GUC settings default to the
> environment?  But I think we could consider getting rid of that.

Yeah.



Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Fri, Aug 16, 2024 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > It should just be initdb doing that and then initializing the server
> > with concrete values based on that.
>
> Right.
>
> > I guess technically some of these GUC settings default to the
> > environment?  But I think we could consider getting rid of that.
>
> Yeah.

Seems to make a lot of sense.  I tried that out over in CF #5170.

(In case it's not clear why I'm splitting discussion between threads:
I was thinking of this thread as the one where we're discussing what
needs to be done, with other threads being spun off to become CF entry
with concrete patches.  I realised re-reading some discussion that
that might not be obvious...)



Re: Remaining dependency on setlocale()

From
Andreas Karlsson
Date:
On 8/9/24 8:24 PM, Jeff Davis wrote:
> On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
>> I am leaning towards that we should write our own pure ascii
>> functions
>> for this.
> 
> That makes sense for a lot of call sites, but it could cause breakage
> if we aren't careful.
> 
>>   Since we do not support any non-ascii compatible encodings
>> anyway I do not see the point in having locale support in most of
>> these
>> call-sites.
> 
> An ascii-compatible encoding just means that the code points in the
> ascii range are represented as ascii. I'm not clear on whether code
> points in the ascii range can return different results for things like
> isspace(), but it sounds plausible -- toupper() can return different
> results for 'i' in tr_TR.
> 
> Also, what about the values outside 128-255, which are still valid
> input to isspace()?

My idea was that in a lot of those cases we only try to parse e.g. 0-9 
as digits and always only . as the decimal separator so we should make 
just make that obvious by either using locale C or writing our own ascii 
only functions. These strings are meant to be read by machines, not 
humans, primarily.

Andreas



Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> > 1.  The process global locale is always "C".  If you ever call
> > uselocale(), it can only be for short stretches, and you have to
> > restore it straight after; perhaps it is only ever used in
> > replacement
> > _l() functions for systems that lack them.  You need to use _l()
> > functions for all non-"C" locales.  The current database default
> > needs
> > to be available as a variable (in future: thread-local variable, or
> > reachable from one), so you can use it in _l() functions.  The "C"
> > locale can be accessed implicitly with non-l() functions, or you
> > could
> > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> > for
> > "C".  Or a name like PG_C_LOCALE, which, in backend code could be
> > just
> > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> > singleton mechanism I showed in CF#5166.
>
> +1 to this approach. It makes things more consistent across platforms
> and avoids surprising dependencies on the global setting.
>
> We'll have to be careful that each call site is either OK with C, or
> that it gets changed to an _l() variant. We also have to be careful
> about extensions.

Did we reach a conclusion here? Any thoughts on moving in this
direction, and whether 18 is the right time to do it?

Regards,
    Jeff Davis





Re: Remaining dependency on setlocale()

From
Thomas Munro
Date:
On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
> > On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
> > > 1.  The process global locale is always "C".  If you ever call
> > > uselocale(), it can only be for short stretches, and you have to
> > > restore it straight after; perhaps it is only ever used in
> > > replacement
> > > _l() functions for systems that lack them.  You need to use _l()
> > > functions for all non-"C" locales.  The current database default
> > > needs
> > > to be available as a variable (in future: thread-local variable, or
> > > reachable from one), so you can use it in _l() functions.  The "C"
> > > locale can be accessed implicitly with non-l() functions, or you
> > > could
> > > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
> > > for
> > > "C".  Or a name like PG_C_LOCALE, which, in backend code could be
> > > just
> > > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
> > > singleton mechanism I showed in CF#5166.
> >
> > +1 to this approach. It makes things more consistent across platforms
> > and avoids surprising dependencies on the global setting.
> >
> > We'll have to be careful that each call site is either OK with C, or
> > that it gets changed to an _l() variant. We also have to be careful
> > about extensions.
>
> Did we reach a conclusion here? Any thoughts on moving in this
> direction, and whether 18 is the right time to do it?

I think this is the best way, and I haven't seen anyone supporting any
other idea.  (I'm working on those setlocale()-removal patches I
mentioned, more very soon...)



Re: Remaining dependency on setlocale()

From
Peter Eisentraut
Date:
On 13.12.24 10:44, Thomas Munro wrote:
> On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
>> On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
>>> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
>>>> 1.  The process global locale is always "C".  If you ever call
>>>> uselocale(), it can only be for short stretches, and you have to
>>>> restore it straight after; perhaps it is only ever used in
>>>> replacement
>>>> _l() functions for systems that lack them.  You need to use _l()
>>>> functions for all non-"C" locales.  The current database default
>>>> needs
>>>> to be available as a variable (in future: thread-local variable, or
>>>> reachable from one), so you can use it in _l() functions.  The "C"
>>>> locale can be accessed implicitly with non-l() functions, or you
>>>> could
>>>> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
>>>> for
>>>> "C".  Or a name like PG_C_LOCALE, which, in backend code could be
>>>> just
>>>> LC_GLOBAL_LOCALE, while in frontend/library code it could be the
>>>> singleton mechanism I showed in CF#5166.
>>>
>>> +1 to this approach. It makes things more consistent across platforms
>>> and avoids surprising dependencies on the global setting.
>>>
>>> We'll have to be careful that each call site is either OK with C, or
>>> that it gets changed to an _l() variant. We also have to be careful
>>> about extensions.
>>
>> Did we reach a conclusion here? Any thoughts on moving in this
>> direction, and whether 18 is the right time to do it?
> 
> I think this is the best way, and I haven't seen anyone supporting any
> other idea.  (I'm working on those setlocale()-removal patches I
> mentioned, more very soon...)

I also think this is the right direction, and we'll get closer with the 
remaining patches that Thomas has lined up.

I think at this point, we could already remove all locale settings 
related to LC_COLLATE.  Nothing uses that anymore.

I think we will need to keep the global LC_CTYPE setting set to 
something useful, for example so that system error messages come out in 
the right encoding.

But I'm concerned about the the Perl_setlocale() dance in plperl.c. 
Perl apparently does a setlocale(LC_ALL, "") during startup, and that 
code is a workaround to reset everything back afterwards.  We need to be 
careful not to break that.

(Perl has fixed that in 5.19, but the fix requires that you set another 
environment variable before launching Perl, which you can't do in a 
threaded system, so we'd probably need another fix eventually.  See 
<https://github.com/Perl/perl5/issues/8274>.)




Re: Remaining dependency on setlocale()

From
Jeff Davis
Date:
On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
> I think we will need to keep the global LC_CTYPE setting set to
> something useful, for example so that system error messages come out
> in
> the right encoding.

Do we need to rely on the global LC_CTYPE setting? We already use
bind_textdomain_codeset().

> But I'm concerned about the the Perl_setlocale() dance in plperl.c.
> Perl apparently does a setlocale(LC_ALL, "") during startup, and that
> code is a workaround to reset everything back afterwards.  We need to
> be
> careful not to break that.
>
> (Perl has fixed that in 5.19, but the fix requires that you set
> another
> environment variable before launching Perl, which you can't do in a
> threaded system, so we'd probably need another fix eventually.  See
> <https://github.com/Perl/perl5/issues/8274>.)

I don't fully understand that issue, but I would think the direction we
are going (keeping the global LC_CTYPE more consistent and relying on
it less) would make the problem better.

Regards,
    Jeff Davis




Re: Remaining dependency on setlocale()

From
Peter Eisentraut
Date:
On 17.12.24 19:10, Jeff Davis wrote:
> On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
>> I think we will need to keep the global LC_CTYPE setting set to
>> something useful, for example so that system error messages come out
>> in
>> the right encoding.
> 
> Do we need to rely on the global LC_CTYPE setting? We already use
> bind_textdomain_codeset().

I don't think that would cover messages from the C library (strerror, 
dlerror, etc.).

>> But I'm concerned about the the Perl_setlocale() dance in plperl.c.
>> Perl apparently does a setlocale(LC_ALL, "") during startup, and that
>> code is a workaround to reset everything back afterwards.  We need to
>> be
>> careful not to break that.
>>
>> (Perl has fixed that in 5.19, but the fix requires that you set
>> another
>> environment variable before launching Perl, which you can't do in a
>> threaded system, so we'd probably need another fix eventually.  See
>> <https://github.com/Perl/perl5/issues/8274>.)
> 
> I don't fully understand that issue, but I would think the direction we
> are going (keeping the global LC_CTYPE more consistent and relying on
> it less) would make the problem better.

Yes, I think it's the right direction, but we need to figure this issue 
out eventually.