Re: C11: should we use char32_t for unicode code points? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: C11: should we use char32_t for unicode code points?
Date
Msg-id CA+hUKGLXQUYK7Cq5KbLGgTWo7pORs7yhBWO1AEnZt7xTYbLRhg@mail.gmail.com
In response to Re: C11: should we use char32_t for unicode code points?  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Mon, Oct 27, 2025 at 8:43 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What would be the problem if it were larger than 32 bits?

Hmm, OK, fair question: I can't think of any.  I was just working
through the standard and thinking myopically about the exact
definition, but I think it's actually already covered by other things
we assume/require (ie the existence of uint32_t forces the size of
char32_t if you follow the chain of definitions backwards), and as you
say it probably doesn't even matter.  I suppose you could also skip
the __STDC_UTF_32__ assertion given that we already make a larger
assumption about wchar_t encoding, and it seems to be exhaustively
established that no implementation fails to conform to C23 for
char32_t (see earlier link to Meneide's blog).  I don't personally
understand what C11 was smoking when it left that unspecified for
another 12 years.

> > I wonder if the XXX_libc_mb() functions that contain our hard-coded
> > assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> > use your to_char32_t() too (probably with a longer name
> > pg_wchar_to_char32_t() if it's in a header for wider use).
>
> I don't think those functions do depend on UTF-32. iswalpha(), etc.,
> take a wint_t, which is just a wchar_t that can also be WEOF.

I was noticing that toupper_libc_mb() directly tests if a pg_wchar
value is in the ASCII range, which only makes sense given knowledge of
pg_wchar's encoding, so perhaps that should trigger this new coding
rule.  But I agree that's pretty obscure...  feel free to ignore that
suggestion.

Hmm, the comment at the top explains that we apply that special ASCII
treatment for default locales and not non-default locales, but it
doesn't explain *why* we make that distinction.  Do you know?

> One thing I never understood about this is that it's our code that
> converts from the server encoding to pg_wchar (e.g.
> pg_latin12wchar_with_len()), so we must understand the representation
> of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
> understand the encoding of wchar_t, too, right?

Right, we do know the encoding of pg_wchar in every case (assuming
that all pg_wchar values come from our transcoding routines).  We just
don't know if that encoding is also the one used by libc's
locale-sensitive functions that deal in wchar_t, except when the
locale is one that uses UTF-8 for char encoding, in which case we
assume that every libc must surely use Unicode codepoints in wchar_t.
That probably covers the vast majority of real world databases in the
UTF-8 age, and no known system fails to meet this expectation.  Of
course the encoding used by every libc for non-UTF-8 locales is
theoretically knowable too, but since they vary and in some cases are
not even documented, it would be too painful to contemplate any
dependency on that.

Let me try to work through this in more detail...  corrections
welcome, but this is what I have managed to understand about this
module so far, in my quest to grok PostgreSQL's overall character
encoding model (and holes therein):

For locales that use UTF-8 for char, we expect libc to understand
pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16.  The
expected source of these pg_wchar values is our various regexp code
paths that will use our mbutils pg_wchar conversion to UTF-32, with a
reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows,
and I think otherwise only AIX in 32-bit builds, if it comes back).
If any libc didn't use Unicode codepoints in its locale-sensitive
wchar_t functions for UTF-8 locales we'd get garbage results, but we
don't know of any such system.  It's a bit of a shame that C11 didn't
introduce the obvious isualpha(char32_t) variants for a
standard-supported version of that realpolitik we depend on, but
perhaps one day...

There is one minor quirk here that it might be nice to document in top
comment section 2: on Windows we also expect wchar_t to be understood
by system wctype functions as UTF-16 for locales that *don't* use
UTF-8 for char (an assumption that definitely doesn't hold on many
Unixen).  That is important because on Windows we allow non-UTF-8
locales to be used in UTF-8 databases for historical reasons.

For single-byte encodings: pg_latin12wchar_with_len() just
zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c
functions truncate them and call 8-bit ctype stuff, e.g. isalpha_l(),
completes a perfect round trip inside our code.  (BTW
pg_latin12wchar_with_len() has the same definition as
pg_ascii2wchar_with_len(), and is used for many single-byte encodings
other than LATIN1 which makes me wonder why we don't just have a
single function pg_char2wchar_with_len() that is used by all "simple
widening" cases.)  We never know or care which encoding libc would
itself use for these locales' wchar_t, as we don't ever pass it a
wchar_t.  Assuming I understood that correctly, I think it would be
nice if the "100% correct for LATINn" comment stated the reason for
that certainty explicitly, ie that it closes an information-preserving
round-trip beginning with the coercion in pg_latin12wchar_with_len()
and that libc never receives a wchar_t/wint_t that we fabricated.

A bit of a digression, which I *think* is out-of-scope for this
module, but just while I'm working through all the implications:  This
could produce unspecified results if a wchar_t from another source
ever arrived into these functions eg wchar_t made by libc or
L"literal" made by the compiler, both unspecified.  In practice, a
wchar_t of non-PostgreSQL origin that is truncated to 8 bits would
probably still give a sensible result for codepoints 0-127 (= 7 bit
subset of Unicode, and we require all server encodings to be supersets
of ASCII), and 0-255 for LATIN1 (= 8 bit subset of Unicode), because:
the two main approaches to single-byte char -> wchar_t conversion in
libc implementations seem to be conversion to Unicode (Windows,
glibc?), and simply casting char to wchar_t (I think this is probably
what *BSD and Solaris do for single-byte non-UTF-8 locales leading to
the complaint that wchar_t encoding is locale-dependent on those
systems, though I haven't checked in detail, and that's of course also
exactly what our own conversion does).  So I think that means values
128-255 would give nonsense results for non-LATIN1 single-byte
encodings on Windows or glibc (?) but perhaps not on other Unixen.  For
example, take ISO 8859-7, the legacy single byte encoding for Greek:
it encodes α as 0xe1, and Windows and glibc (?) would presumably
encode that as (wchar_t) 0x03b1 (the Unicode codepoint), and then
wc_isalpha_libc_sb() would truncate that to 0xb1 which is ± in ISO
8859-7, so isalpha_l() would return false, despite α being the OG
alpha (not tested, just a thought experiment looking at tables).  But
since handling pg_wchar of non-PostgreSQL origin doesn't seem to be
one of our goals, there is no problem to fix here, it might just be
worthy of a note in that commentary: we don't try to deal with wchar_t
values not made by PostgreSQL, except where noted (non-escaping uses
of char2wchar() in controlled scopes).

For multi-byte encodings other than UTF-8, pg_locale_libc.c is
basically giving up almost completely, but could probably be tightened
up.  I can't imagine we'll ever add another multibyte encoding, and I
believe we can ignore MULE internal, as no libc supports it (so you
could only get here with the C locale where you'll get the garbage
results you asked for...  in fact I wonder why we need MULE internal at
all... it seems to be a sort of double-encoding for multiplexing other
encodings, so we can't exactly say it's not blessed by a standard,
it's indirectly defined by "all the standards" in a sense, but it's
also entirely obsoleted by Unicode's unification so I don't know what
problem it solves for anyone, or if anyone ever needed it in any
reasonable pg_upgrade window of history...).  Of server-supported
encodings, that leaves only EUC_* to think about.

The EUC family has direct encoding of 7-bit ASCII and then 3
selectable character sets represented by sequences with the high bit
set, with details varying between the Chinese (simplified Chinese),
Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
variants.  I don't know if the pg_wchar encoding we're producing in
pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
the description of the standard "fixed" representation on the
Wikipedia page for Extended Unix Code (it's too wide for starters,
looking at the shift distances).  The main thing seems to be that we
simply zero-extend the ASCII range into a pg_wchar directly, so when
we cast it down to call 8-bit ctype functions, I expect we produce
correct results for ASCII characters... and then I don't know what
happens for 128-255, but I guess nothing good, and then surely hot garbage for
everything else, cycling through the 0-255 answers repeatedly as we
climb the pg_wchar value range.  The key point being that it's *not* a
perfect information-preserving round-trip, as we achieve for
single-byte encodings.  Some ideas for improvements:

1.  Cheap but incomplete: use a different ctype method table that
short-circuits the results (false for isalpha et al, pass-through for
upper/lower) for pg_wchar >= 128 and uses the existing 8-bit ctype
functions for ASCII.

2.  More expensive but complete: handle ASCII range with existing
8-bit ctype functions, and otherwise convert our pg_wchar back to MB
char format and then use libc's mbstowcs_l() to make a wchar_t that
libc's wchar_t-based functions should understand.  To avoid doing hard
work for nothing (ideogram-based languages generally don't care about
ctype stuff so that'd be the vast majority of characters appearing in
Chinese/Japanese/Korean text) at the cost of having to do a bunch of
research, we could short-circuit the core CJK character ranges,
and do the extra CPU cycles for the rest, to catch the Latin +
accents, Greek, Cyrillic characters that are also supported in these
encodings for foreign names, variables in scientific language etc.  I
guess that implies a classifier that would be associated with ... the
encoding?  That would of course break if wchar_t values of
non-PostgreSQL origin arrive here, but see above note about nailing
down a contract that formally excludes that outside narrow
non-escaping sites.

3.  I assume there are some good reasons we don't do this but... if we
used char2wchar() in the first place (= libc native wchar_t) for the
regexp stuff that calls this stuff (as we do already inside
whole-string upper/lower, just not character upper/lower or character
classification), then we could simply call the wchar_t libc functions
directly and unconditionally in the libc provider for all cases,
instead of the 8-bit variants with broken edge cases for non-UTF-8
databases.  I didn't try to find the historical discussions, but I can
imagine already that we might not have done that because it has to
copy to cope with non-NUL-terminated strings, might perhaps have
weird incompatibilities with our own multibyte sequence detection,
might be slower (and/or might have been unusably broken ancient
libcs?), and it would only be appropriate for libc locales anyway and
yet now we have other locale providers that certainly don't want some
unspecified wchar_t encoding or libc involved.  It's also likely that
non-UTF-8 systems are of dwindling interest to anyone outside perhaps
client encodings (hence my attempt to ram home some simplifying
assumptions about that in that project to nail down some rules where
the encoding is fuzzy that I mentioned in a thread from a few months
ago).  So I'm not seriously suggesting this, just thinking out loud
about the corner we've painted ourselves into where idea #2's multiple
transcoding steps would be necessary to get the "right" answer for any
character in these encodings.  Hnngh.

In passing, I wonder why _libc.c has that comment about ICU in
parentheses.  Not relevant here.  I haven't thought much about whether
it's relevant in the ICU provider code (it may come back to that
do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
also applies to Windows and probably glibc in the libc provider and I
don't immediately see any problem (assuming no-we-don't! answer).


