Re: C11: should we use char32_t for unicode code points? - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: C11: should we use char32_t for unicode code points? |
| Date | |
| Msg-id | CA+hUKGJ5Xh0KxLYXDZuPvw1_fHX=yuzb4xxtam1Cr6TPZZ1o+w@mail.gmail.com |
| In response to | Re: C11: should we use char32_t for unicode code points? (Jeff Davis <pgsql@j-davis.com>) |
| List | pgsql-hackers |
On Sat, Oct 25, 2025 at 4:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> > Unless char32_t is solely used for the Unicode code point data, I
> > think it would be better to define something like "pg_unicode" and
> > use it instead of directly using char32_t because it would be
> > cleaner for code readers.
>
> That was my original idea, but then I saw that apparently char32_t is
> intended for Unicode code points:
>
> https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

It's definitely a code point, but C11 only promised UTF-32 encoding if
__STDC_UTF_32__ is defined to 1; otherwise the encoding is unknown.
The C23 standard resolved that insanity and required UTF-32, and there
are no known systems[1] that didn't already conform, but I guess you
could static_assert(__STDC_UTF_32__, "char32_t must use UTF-32
encoding").  It's also defined as at least, not exactly, 32 bits, but
we already require the machine to have uint32_t, so it must be exactly
32 bits for us, and we could static_assert(sizeof(char32_t) == 4) for
good measure (both checks are sketched below).  So, all up, the
standard type matches our existing assumptions about pg_wchar *if* the
database encoding is UTF8.

IIUC you're proposing that all the stuff that only works when the
database encoding is UTF8 should be flipped over to the new type, and
that seems like a really good idea to me: remaining uses of pg_wchar
would be warnings that the encoding is only conditionally known.  It'd
be documentation without new type safety, though: for example, I think
you missed a spot, the return type of the definition of
utf8_to_unicode() (I didn't search exhaustively).  Only in C++ is it a
distinct type that would catch that and a few other mistakes.  Do you
consider explicit casts between e.g. pg_wchar and char32_t to be
useful documentation for humans, when coercion should just work?  I
kinda thought we were trying to cut down on useless casts; they might
signal something, but they can also hide bugs.  Should the few places
that deal in surrogates be using char16_t instead?

I wonder if the XXX_libc_mb() functions that contain our hard-coded
assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
use your to_char32_t() too (probably with a longer name
pg_wchar_to_char32_t() if it's in a header for wider use; also
sketched below).  That'd highlight the exact points at which we make
that assumption and centralise the assertion about the database
encoding, and then the code that compares values against various known
cut-offs would be clearly in the char32_t world.

> But I am also OK with a new type if others find it more readable.

Adding yet another name to this soup doesn't immediately sound like it
would make anything more readable to me.  ISO has standardised this
for the industry, so I'd vote for adopting it without indirection that
makes the reader work harder to understand what it is.  The churn
doesn't seem excessive either: it's fairly well-contained stuff that
is already moving around a lot with all your recent and ongoing
revamping work.

There is one small practical problem, though: Apple hasn't got around
to supplying <uchar.h> in its C SDK yet.  It's there for C++ only, and
isn't needed for the type in C++ anyway.  I don't think that alone
warrants a new name wart: the standard tells us char32_t must match
uint_least32_t, so we can just define it ourselves if
!defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
around to that.
Since it confused me briefly: Apple does provide <unicode/uchar.h>,
but that's a coincidentally named ICU header.  On that subject, I see
that ICU hasn't adopted these types yet, though there are some hints
that they're thinking about it; meanwhile, their C++ interfaces have
begun to document that the types are acceptable in a few template
functions.  All other target systems have <uchar.h> AFAICS (Windows:
tested by CI; MinGW: found discussion; *BSD, Solaris, illumos: found
man pages).

As for the conversion functions in <uchar.h>, they're of course
missing on macOS, but they also depend on the current locale, so it's
almost like C, POSIX and NetBSD have conspired to make them as useless
to us as possible.  They solve the "size and encoding of wchar_t is
undefined" problem, but there are no _l() variants and we can't depend
on uselocale() being available (see the last sketch below).  They
probably wouldn't be much use to us anyway, considering our more
complex and general transcoding requirements.  I only thought about
them while contemplating hypothetical pre-C23 systems that don't use
UTF-32 for char32_t, specifically what would break if such a system
existed: probably nothing, as long as you don't use these functions.
I guess the other way you could tell would be if you used the fancy
new U-prefixed character/string literal syntax, but I can't see much
need for that.

In passing, we seem to have a couple of mentions of "pg_wchar_t"
(bogus _t) in existing comments.

[1] https://thephd.dev/c-the-improvements-june-september-virtual-c-meeting
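A minimal sketch of the compile-time checks and the macOS fallback
discussed above, under the assumptions stated in the mail. The
static_assert calls are spelled as proposed there; HAVE_UCHAR_H is a
hypothetical configure-style probe, and real PostgreSQL code would
presumably use StaticAssertDecl() rather than bare C11 static_assert:

```c
#include <assert.h>				/* C11 static_assert */
#include <stdint.h>				/* uint_least32_t */

#ifdef HAVE_UCHAR_H				/* hypothetical configure probe */
#include <uchar.h>				/* char32_t and conversion functions */
#elif !defined(__cplusplus)
/*
 * Apple's C SDK has no <uchar.h> yet.  The standard requires char32_t
 * to be the same type as uint_least32_t, so we can define it
 * ourselves.  (C++ needs no header: char32_t is a built-in type there.)
 */
typedef uint_least32_t char32_t;
#endif

/* C11 allowed non-UTF-32 encodings; insist on UTF-32, as C23 does */
static_assert(__STDC_UTF_32__, "char32_t must use UTF-32 encoding");

/* char32_t is only "at least" 32 bits; we assume exactly 32 */
static_assert(sizeof(char32_t) == 4, "char32_t must be exactly 32 bits");
```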
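And a sketch of the conversion helper floated above, using the longer
pg_wchar_to_char32_t() name suggested in the mail. Assert(),
GetDatabaseEncoding() and PG_UTF8 are the existing backend spellings;
where the function would actually live is left open:

```c
#include "postgres.h"
#include "mb/pg_wchar.h"		/* pg_wchar, GetDatabaseEncoding(), PG_UTF8 */

/*
 * Convert a pg_wchar to a char32_t Unicode code point.
 *
 * pg_wchar values are only known to be Unicode code points when the
 * database encoding is UTF-8, so centralise that assumption (and the
 * assertion enforcing it) in one place; callers then operate purely
 * in the char32_t world.
 */
static inline char32_t
pg_wchar_to_char32_t(pg_wchar c)
{
	Assert(GetDatabaseEncoding() == PG_UTF8);
	return (char32_t) c;
}
```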
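Finally, a standalone illustration of why the <uchar.h> conversion
functions are awkward here: mbrtoc32() interprets its multibyte input
in the global LC_CTYPE locale, with no _l() variant to pass a locale
explicitly. The locale name below is system-dependent, and the program
won't build on macOS at all, <uchar.h> being absent there:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>				/* mbrtoc32(); the header macOS lacks */

int
main(void)
{
	/* mbrtoc32() uses the current LC_CTYPE locale: global state only */
	if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)	/* name varies by system */
		return 1;

	const char *s = "\xC3\xA9";	/* "é" encoded in UTF-8 */
	char32_t	c32;
	mbstate_t	state;
	size_t		consumed;

	memset(&state, 0, sizeof(state));
	consumed = mbrtoc32(&c32, s, strlen(s), &state);
	if (consumed == (size_t) -1 || consumed == (size_t) -2)
		return 1;				/* invalid or incomplete sequence */

	/*
	 * A U-prefixed literal has type char32_t; only under
	 * __STDC_UTF_32__ is it guaranteed that U'\u00E9' == 0xE9.
	 */
	printf("U+%04X, %s\n", (unsigned) c32,
		   c32 == U'\u00E9' ? "UTF-32 as expected" : "some other encoding");
	return 0;
}
```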