Re: Remaining dependency on setlocale() - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Remaining dependency on setlocale()
Date
Msg-id CA+hUKGJUPPZZjZMGR047w=OrZgemZYoRrVPkvCdSO9iA56M0QA@mail.gmail.com
In response to Re: Remaining dependency on setlocale()  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

We do know of a few isspace() calls that are already questionable[1]
(should be scanner_isspace(), or something like that).  It's not only
weird that SELECT ROW('libertà!') is displayed with or without double
quotes depending (in theory) on your locale, it's also undefined
behaviour because we feed individual bytes of a multi-byte sequence to
isspace(), so OSes disagree.  In practice we know that macOS and
Windows think that the byte 0xa0 inside 'à' is a space while glibc and
FreeBSD don't.  Looking at the languages with many sequences
containing 0xa0, I guess you'd probably need to be processing CJK text
and cross-platform for the difference to become obvious (that was the
case for the problem report I analysed):

# Print BMP code points (skipping surrogates) whose UTF-8 encoding
# contains the byte 0xa0.
for i in range(1, 0xffff):
  if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
    print("%04x: %s" % (i, chr(i)))
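To illustrate the ASCII-only idea, here is a minimal sketch of a
locale-independent whitespace test applied to the bytes of 'libertà!'.
The name ascii_isspace() and the exact set of bytes it accepts are
assumptions made for this example; this is not the actual
scanner_isspace() code, just the same kind of check:

```python
# Hypothetical helper for illustration; not PostgreSQL's actual code.
# The set of bytes treated as whitespace here is an assumption.
ASCII_SPACE_BYTES = frozenset(b" \t\n\r\v\f")


def ascii_isspace(byte: int) -> bool:
    """Return True only for ASCII whitespace bytes, regardless of locale."""
    return byte in ASCII_SPACE_BYTES


encoded = "libertà!".encode("UTF-8")

# 'à' (U+00E0) encodes as the two bytes 0xc3 0xa0.
assert 0xa0 in encoded

# Classify every byte; none of the bytes of the multi-byte sequence
# is treated as a space, whatever the platform or locale.
flags = [ascii_isspace(b) for b in encoded]
print(any(flags))
```

Because the answer depends only on the byte value, a check like this
gives the same result on every OS, which the locale-dependent isspace()
does not for 0xa0.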

[1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com


