Re: C11: should we use char32_t for unicode code points? - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: C11: should we use char32_t for unicode code points? |
| Date | |
| Msg-id | CA+hUKGLXQUYK7Cq5KbLGgTWo7pORs7yhBWO1AEnZt7xTYbLRhg@mail.gmail.com |
| In response to | Re: C11: should we use char32_t for unicode code points? (Jeff Davis <pgsql@j-davis.com>) |
| Responses | Re: C11: should we use char32_t for unicode code points?; Re: C11: should we use char32_t for unicode code points? |
| List | pgsql-hackers |
On Mon, Oct 27, 2025 at 8:43 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What would be the problem if it were larger than 32 bits?

Hmm, OK, fair question. I can't think of any; I was just working through the standard and thinking myopically about the exact definition, but I think it's actually already covered by other things we assume/require (ie the existence of uint32_t forces the size of char32_t if you follow the chain of definitions backwards), and as you say it probably doesn't even matter. I suppose you could also skip the __STDC_UTF_32__ assertion given that we already make a larger assumption about wchar_t encoding, and it seems to be exhaustively established that no implementation fails to conform to C23 for char32_t (see earlier link to Meneide's blog). I don't personally understand what C11 was smoking when it left that unspecified for another 12 years.

> > I wonder if the XXX_libc_mb() functions that contain our hard-coded
> > assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> > use your to_char32_t() too (probably with a longer name
> > pg_wchar_to_char32_t() if it's in a header for wider use).
>
> I don't think those functions do depend on UTF-32. iswalpha(), etc.,
> take a wint_t, which is just a wchar_t that can also be WEOF.

I was noticing that toupper_libc_mb() directly tests whether a pg_wchar value is in the ASCII range, which only makes sense given knowledge of pg_wchar's encoding, so perhaps that should trigger this new coding rule. But I agree that's pretty obscure... feel free to ignore that suggestion. Hmm, the comment at the top explains that we apply that special ASCII treatment for default locales and not non-default locales, but it doesn't explain *why* we make that distinction. Do you know?

> One thing I never understood about this is that it's our code that
> converts from the server encoding to pg_wchar (e.g.
> pg_latin12wchar_with_len()), so we must understand the representation
> of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
> understand the encoding of wchar_t, too, right?

Right, we do know the encoding of pg_wchar in every case (assuming that all pg_wchar values come from our transcoding routines). We just don't know whether that encoding is also the one used by libc's locale-sensitive functions that deal in wchar_t, except when the locale is one that uses UTF-8 for char encoding, in which case we assume that every libc must surely use Unicode code points in wchar_t. That probably covers the vast majority of real-world databases in the UTF-8 age, and no known system fails to meet this expectation. Of course the encoding used by every libc for non-UTF-8 locales is theoretically knowable too, but since they vary, and in some cases are not even documented, it would be too painful to contemplate any dependency on that.

Let me try to work through this in more detail... corrections welcome, but this is what I have managed to understand about this module so far, in my quest to grok PostgreSQL's overall character encoding model (and the holes therein):

For locales that use UTF-8 for char, we expect libc to understand pg_wchar/wchar_t/wint_t values as UTF-32, or at a stretch UTF-16. The expected source of these pg_wchar values is our various regexp code paths, which use our mbutils pg_wchar conversion to UTF-32, with a reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows, and I think otherwise only AIX in 32-bit builds, if it comes back).
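To spell out the shape of that dependency, here's a sketch only, not the actual pg_locale_libc.c code: the locale name is made up, iswalpha_l()/newlocale() are POSIX.1-2008, and it ignores the sizeof(wchar_t) == 2 complication mentioned above.

```c
#include <locale.h>
#include <stdint.h>
#include <stdio.h>
#include <wctype.h>

typedef uint32_t pg_wchar;      /* stand-in for PostgreSQL's pg_wchar */

int
main(void)
{
    /* Locale name is an assumption; it varies by system. */
    locale_t    loc = newlocale(LC_ALL_MASK, "en_US.UTF-8", (locale_t) 0);

    if (loc == (locale_t) 0)
        return 1;               /* locale not installed; purely illustrative */

    /* U+00E9 'é', as our own UTF-8 -> pg_wchar conversion would produce it */
    pg_wchar    cp = 0x00E9;

    /* The realpolitik: assume libc's wchar_t is also Unicode code points. */
    printf("iswalpha_l(U+00E9) = %d\n", iswalpha_l((wint_t) cp, loc) != 0);

    freelocale(loc);
    return 0;
}
```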
If any libc didn't use Unicode code points in its locale-sensitive wchar_t functions for UTF-8 locales we'd get garbage results, but we don't know of any such system. It's a bit of a shame that C11 didn't introduce the obvious isualpha(char32_t) variants for a standard-supported version of that realpolitik we depend on, but perhaps one day...

There is one minor quirk here that it might be nice to document in top comment section 2: on Windows we also expect wchar_t to be understood by the system wctype functions as UTF-16 for locales that *don't* use UTF-8 for char (an assumption that definitely doesn't hold on many Unixen). That is important because on Windows we allow non-UTF-8 locales to be used in UTF-8 databases for historical reasons.

For single-byte encodings: pg_latin12wchar_with_len() just zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c functions truncate them and call the 8-bit ctype stuff, eg isalpha_l(), it completes a perfect round trip inside our code. (BTW pg_latin12wchar_with_len() has the same definition as pg_ascii2wchar_with_len(), and is used for many single-byte encodings other than LATIN1, which makes me wonder why we don't just have a single function pg_char2wchar_with_len() that is used by all "simple widening" cases.) We never know or care which encoding libc would itself use for these locales' wchar_t, as we don't ever pass it a wchar_t.

Assuming I understood that correctly, I think it would be nice if the "100% correct for LATINn" comment stated the reason for that certainty explicitly, ie that it closes an information-preserving round trip beginning with the coercion in pg_latin12wchar_with_len(), and that libc never receives a wchar_t/wint_t that we fabricated.

A bit of a digression, which I *think* is out of scope for this module, but just while I'm working through all the implications: this could produce unspecified results if a wchar_t from another source ever arrived in these functions, eg a wchar_t made by libc or an L"literal" made by the compiler, both unspecified. In practice, a wchar_t of non-PostgreSQL origin that is truncated to 8 bits would probably still give a sensible result for code points 0-127 (= the 7-bit subset of Unicode, and we require all server encodings to be supersets of ASCII), and 0-255 for LATIN1 (= the 8-bit subset of Unicode), because the two main approaches to single-byte char -> wchar_t conversion in libc implementations seem to be conversion to Unicode (Windows, glibc?) and simply casting char to wchar_t (I think this is probably what *BSD and Solaris do for single-byte non-UTF-8 locales, leading to the complaint that wchar_t encoding is locale-dependent on those systems, though I haven't checked in detail, and that's of course also exactly what our own conversion does). So I think that means 128-255 would give nonsense results for non-LATIN1 single-byte encodings on Windows or glibc (?), but perhaps not on other Unixen. For example, take ISO 8859-7, the legacy single-byte encoding for Greek: it encodes α as 0xe1, and Windows and glibc (?) would presumably encode that as (wchar_t) 0x03b1 (the Unicode code point), and then wc_isalpha_libc_sb() would truncate that to 0xb1, which is ± in ISO 8859-7, so isalpha_l() would return false, despite α being the OG alpha (not tested, just a thought experiment looking at tables).
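If someone wanted to actually run that thought experiment, it'd look roughly like this; the locale name is a guess and varies by system, and the wchar_t value models what a Unicode-based libc would hand us, not anything PostgreSQL currently does:

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int
main(void)
{
    /* Locale name is an assumption; spelling differs across systems. */
    locale_t    greek = newlocale(LC_ALL_MASK, "el_GR.ISO8859-7", (locale_t) 0);

    if (greek == (locale_t) 0)
        return 1;               /* locale not installed; purely illustrative */

    /* U+03B1 GREEK SMALL LETTER ALPHA, as a Unicode-based libc might make it */
    wchar_t     alpha = 0x03B1;

    /* What the single-byte truncation path would effectively do: */
    unsigned char truncated = (unsigned char) alpha;    /* 0xB1 = ± in 8859-7 */
    printf("isalpha_l(0x%02X) = %d\n", truncated,
           isalpha_l(truncated, greek) != 0);

    /* ...whereas 0xE1 is the byte that actually encodes α in ISO 8859-7: */
    printf("isalpha_l(0xE1) = %d\n", isalpha_l(0xe1, greek) != 0);

    freelocale(greek);
    return 0;
}
```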
But since handling pg_wchar of non-PostgreSQL origin doesn't seem to be one of our goals, there is no problem to fix here; it might just be worthy of a note in that commentary: we don't try to deal with wchar_t values not made by PostgreSQL, except where noted (non-escaping uses of char2wchar() in controlled scopes).

For multi-byte encodings other than UTF-8, pg_locale_libc.c is basically giving up almost completely, but could probably be tightened up. I can't imagine we'll ever add another multibyte encoding, and I believe we can ignore MULE internal, as no libc supports it (so you could only get here with the C locale, where you'll get the garbage results you asked for... in fact I wonder why we need MULE internal at all... it seems to be a sort of double-encoding for multiplexing other encodings, so we can't exactly say it's not blessed by a standard (it's indirectly defined by "all the standards" in a sense), but it's also entirely obsoleted by Unicode's unification, so I don't know what problem it solves for anyone, or whether anyone ever needed it in any reasonable pg_upgrade window of history...). Of the server-supported encodings, that leaves only EUC_* to think about.

The EUC family has direct encoding of 7-bit ASCII and then 3 selectable character sets represented by sequences with the high bit set, with details varying between the Chinese (simplified Chinese), Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean variants. I don't know if the pg_wchar encoding we're producing in pg_euc*2wchar_with_len() has a name, but it doesn't appear to match the description of the standard "fixed" representation on the Wikipedia page for Extended Unix Code (it's too wide, for starters, looking at the shift distances). The main thing seems to be that we simply zero-extend the ASCII range into a pg_wchar directly, so when we cast it down to call 8-bit ctype functions, I expect we produce correct results for ASCII characters... and then I don't know what, but I guess nothing good, for 128-255, and then surely hot garbage for everything else, cycling through the 0-255 answers repeatedly as we climb the pg_wchar value range. The key point being that it's *not* a perfect information-preserving round trip, as we achieve for single-byte encodings.
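To make that contrast concrete before getting to possible improvements, a small thought experiment in code; the packed pg_wchar value is an assumption I haven't verified against pg_euc*2wchar_with_len(), it's only meant to show the truncation problem:

```c
#include <stdint.h>
#include <stdio.h>

typedef uint32_t pg_wchar;      /* stand-in for PostgreSQL's pg_wchar */

int
main(void)
{
    /*
     * Assumed (not verified) packed form of EUC-JP HIRAGANA LETTER A,
     * whose byte sequence is 0xA4 0xA2.
     */
    pg_wchar    wc = 0xA4A2;

    /*
     * Truncating to 8 bits, as the single-byte-style path would: the low
     * byte has no meaningful relationship to the original character, and
     * every pg_wchar sharing that low byte gets the same ctype answers.
     */
    printf("truncated byte: 0x%02X\n", (unsigned char) wc);    /* 0xA2 */

    /*
     * Contrast: for LATIN1 the pg_wchar *is* the zero-extended original
     * byte, so the same truncation is a perfect round trip.
     */
    pg_wchar    latin1 = 0xE9;  /* 'é' in LATIN1 */
    printf("round-tripped LATIN1 byte: 0x%02X\n", (unsigned char) latin1);
    return 0;
}
```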
Some ideas for improvements:

1. Cheap but incomplete: use a different ctype method table that short-circuits the results (false for isalpha et al., pass-through for upper/lower) for pg_wchar >= 128 and uses the existing 8-bit ctype functions for ASCII (see the sketch at the end of this mail).

2. More expensive but complete: handle the ASCII range with the existing 8-bit ctype functions, and otherwise convert our pg_wchar back to multibyte char format and then use libc's mbstowcs_l() to make a wchar_t that libc's wchar_t-based functions should understand. To avoid doing hard work for nothing (ideogram-based languages generally don't care about ctype stuff, so that'd be the vast majority of characters appearing in Chinese/Japanese/Korean text), at the cost of having to do a bunch of research, we could short-circuit the core CJK character ranges and spend the extra CPU cycles only on the rest, to catch the Latin + accents, Greek and Cyrillic characters that are also supported in these encodings for foreign names, variables in scientific language, etc. I guess that implies a classifier that would be associated with... the encoding? That would of course break if wchar_t values of non-PostgreSQL origin arrive here, but see the above note about nailing down a contract that formally excludes that outside narrow non-escaping sites.

3. I assume there are some good reasons we don't do this, but... if we used char2wchar() in the first place (= libc-native wchar_t) for the regexp stuff that calls this stuff (as we do already inside whole-string upper/lower, just not character upper/lower or character classification), then we could simply call the wchar_t libc functions directly and unconditionally in the libc provider for all cases, instead of the 8-bit variants with broken edge cases for non-UTF-8 databases. I didn't try to find the historical discussions, but I can imagine already that we might not have done that because it has to copy to cope with non-NULL-terminated strings, might perhaps have weird incompatibilities with our own multibyte sequence detection, might be slower (and/or might have been unusably broken on ancient libcs?), and it would only be appropriate for libc locales anyway, and yet now we have other locale providers that certainly don't want some unspecified wchar_t encoding or libc involved. It's also likely that non-UTF-8 systems are of dwindling interest to anyone outside perhaps client encodings (hence my attempt to ram home some simplifying assumptions about that in the project to nail down some rules where the encoding is fuzzy that I mentioned in a thread a few months ago). So I'm not seriously suggesting this, just thinking out loud about the corner we've painted ourselves into, where idea #2's multiple transcoding steps would be necessary to get the "right" answer for any character in these encodings. Hnngh.

In passing, I wonder why _libc.c has that comment about ICU in parentheses. Not relevant here. I haven't thought much about whether it's relevant in the ICU provider code (it may come back to that do-we-accept-pg_wchar-we-didn't-make? question), but if it is, then it also applies to Windows and probably glibc in the libc provider, and I don't immediately see any problem (assuming the answer is no-we-don't!).
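And since I promised it above, here's the rough shape of idea #1's short-circuiting methods; names are hypothetical, not existing PostgreSQL functions, and isalpha_l()/toupper_l()/newlocale() are POSIX.1-2008. The point is just "answer from the 8-bit functions where pg_wchar and char coincide (ASCII), refuse to guess elsewhere":

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t pg_wchar;      /* stand-in for PostgreSQL's pg_wchar */

static bool
wc_isalpha_libc_mb_ascii_only(pg_wchar wc, locale_t loc)
{
    if (wc < 128)
        return isalpha_l((unsigned char) wc, loc) != 0;
    return false;               /* short-circuit: don't guess beyond ASCII */
}

static pg_wchar
wc_toupper_libc_mb_ascii_only(pg_wchar wc, locale_t loc)
{
    if (wc < 128)
        return (pg_wchar) toupper_l((unsigned char) wc, loc);
    return wc;                  /* pass-through for non-ASCII */
}

int
main(void)
{
    locale_t    loc = newlocale(LC_ALL_MASK, "C", (locale_t) 0);

    if (loc == (locale_t) 0)
        return 1;

    printf("%d %d\n",
           wc_isalpha_libc_mb_ascii_only('a', loc),     /* 1 */
           wc_isalpha_libc_mb_ascii_only(0xA4A2, loc)); /* 0, not garbage */
    printf("%c\n", (int) wc_toupper_libc_mb_ascii_only('q', loc));  /* Q */
    freelocale(loc);
    return 0;
}
```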