Thread: [18] clarify the difference between pg_wchar, wchar_t, and Unicode code points
[18] clarify the difference between pg_wchar, wchar_t, and Unicode code points
From: Jeff Davis
I'm not sure I understand all of the history behind pg_wchar, but it seems to be some blend of:

  (a) Postgres's own internal representation of a decoded character
  (b) libc's wchar_t
  (c) Unicode code point

For example, Postgres has its own encoding/decoding routines, so (a) is the most obvious definition. When the server encoding is UTF-8, the internal representation is a Unicode code point, which is convenient for the builtin and ICU providers, as well as some (most? all?) libc implementations. Other encodings have different representations which seem to favor the libc provider.

pg_wchar is also passed directly to libc routines like iswalpha_l() (see pg_wc_isalpha()), which depends on definition (b). We guard it with:

    if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)

to ensure that the pg_wchar is representable in the libc's wchar_t type. As far as I can tell this is still no guarantee of correctness; it's just a sanity check. I didn't find an obviously better way of doing it, however.

When using ICU, we also pass a pg_wchar directly to ICU routines, which depends on definition (c), and can lead to problems like:

  https://www.postgresql.org/message-id/e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com

The comment at the top of pg_regc_locale.c explains some of the above, but not all. I'd like to organize this a bit better:

  * a new typedef for a Unicode code point ("codepoint"? "uchar"?)
  * a no-op conversion routine from pg_wchar to a codepoint that would assert that the server encoding is UTF-8 (#ifndef FRONTEND, of course)
  * a no-op conversion routine from pg_wchar to wchar_t that would be a good place for a comment describing that it's a "best effort" and may not be correct in all cases

We could even go so far as to make the pg_wchar type not implicitly-castable, so that callers would be forced to convert it to either a wchar_t or a code point.

Tom also suggested here:

  https://www.postgresql.org/message-id/360857.1701302164%40sss.pgh.pa.us

that we don't necessarily need to use libc at all, and I like that idea. Perhaps the suggestions above are a step in that direction, or perhaps we can skip ahead?

I intend to submit a patch for the July CF. Thoughts?

Regards,
	Jeff Davis
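
For readers following along, the three proposed pieces (a code point typedef and two no-op conversion routines) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the patch itself: the names pg_codepoint, pg_wchar_to_codepoint(), pg_wchar_to_wchar_t(), and pg_wchar_fits_wchar_t() are hypothetical, and the server_encoding_is_utf8 flag is a stand-in for a real check such as GetDatabaseEncoding() == PG_UTF8 so that the code compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>

/* Matches the PostgreSQL typedef; repeated so the sketch is standalone. */
typedef unsigned int pg_wchar;

/* Hypothetical name; the thread floats "codepoint" or "uchar". */
typedef uint32_t pg_codepoint;

/*
 * Hypothetical stand-in for the server encoding check. In the backend
 * this would be something like GetDatabaseEncoding() == PG_UTF8.
 */
static bool server_encoding_is_utf8 = true;

/*
 * No-op conversion from pg_wchar to a Unicode code point. Only valid when
 * the server encoding is UTF-8, so assert that in backend code.
 */
static inline pg_codepoint
pg_wchar_to_codepoint(pg_wchar c)
{
#ifndef FRONTEND
	assert(server_encoding_is_utf8);
#endif
	return (pg_codepoint) c;
}

/* The sanity check quoted in the message, factored into a helper. */
static inline bool
pg_wchar_fits_wchar_t(pg_wchar c)
{
	return sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF;
}

/*
 * No-op conversion from pg_wchar to libc's wchar_t. This is "best effort":
 * the value is known to fit, but libc may still interpret it differently,
 * so the result may not be correct in all cases.
 */
static inline wchar_t
pg_wchar_to_wchar_t(pg_wchar c)
{
	assert(pg_wchar_fits_wchar_t(c));
	return (wchar_t) c;
}
```

Plain typedefs like these still convert implicitly in C, so they document intent rather than enforce it. Making pg_wchar not implicitly castable, as suggested above, would need something stronger, for example wrapping the value in a struct.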
Re: [18] clarify the difference between pg_wchar, wchar_t, and Unicode code points
From: Peter Eisentraut
On 16.04.24 01:40, Jeff Davis wrote:
> I'm not sure I understand all of the history behind pg_wchar, but it
> seems to be some blend of:
>
> (a) Postgres's own internal representation of a decoded character
> (b) libc's wchar_t
> (c) Unicode code point
>
> For example, Postgres has its own encoding/decoding routines, so (a) is
> the most obvious definition.

(a) is the correct definition, I think. The other ones are just occasional conveniences, and occasionally wrong.

> When using ICU, we also pass a pg_wchar directly to ICU routines, which
> depends on definition (c), and can lead to problems like:
>
> https://www.postgresql.org/message-id/e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com

That's just a plain bug, I think. It's missing the encoding check that for example pg_strncoll_icu() does.

> The comment at the top of pg_regc_locale.c explains some of the above,
> but not all. I'd like to organize this a bit better:
>
> * a new typedef for a Unicode code point ("codepoint"? "uchar"?)
> * a no-op conversion routine from pg_wchar to a codepoint that would
> assert that the server encoding is UTF-8 (#ifndef FRONTEND, of course)
> * a no-op conversion routine from pg_wchar to wchar_t that would be a
> good place for a comment describing that it's a "best effort" and may
> not be correct in all cases

I guess sometimes you really want to just store an array of Unicode code points. But I'm not sure how this would actually address coding mistakes like the one above. You still need to check the server encoding and do encoding conversion when necessary.
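
To make the last point concrete, here is a minimal sketch of what an encoding check at the call site might look like. Everything in it is hypothetical: sketch_encoding, sketch_codepoint_isalpha() (an ASCII-only stand-in for a Unicode-aware library routine), and sketch_wc_isalpha() are invented for illustration and are not PostgreSQL or ICU APIs. The non-UTF-8 branch also assumes that ASCII characters have the same pg_wchar value in every server encoding; real code would have to do an encoding conversion there instead of giving up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Matches the PostgreSQL typedef; repeated so the sketch is standalone. */
typedef unsigned int pg_wchar;

/* Hypothetical, reduced stand-in for the server encoding identifier. */
typedef enum
{
	SKETCH_ENC_UTF8,
	SKETCH_ENC_OTHER
} sketch_encoding;

/*
 * Hypothetical stand-in for a routine that expects a Unicode code point.
 * Only ASCII letters are recognized, to keep the sketch self-contained.
 */
static bool
sketch_codepoint_isalpha(uint32_t cp)
{
	return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
}

/*
 * Classify a pg_wchar, checking the encoding before treating the value as
 * a code point.
 */
static bool
sketch_wc_isalpha(pg_wchar c, sketch_encoding enc)
{
	/* With UTF-8, the pg_wchar value is the code point itself. */
	if (enc == SKETCH_ENC_UTF8)
		return sketch_codepoint_isalpha((uint32_t) c);

	/*
	 * Otherwise the value is not a code point. Assumption: the ASCII range
	 * is still safe to pass through unchanged.
	 */
	if (c <= 0x7F)
		return sketch_codepoint_isalpha((uint32_t) c);

	/* Conservative answer; real code would convert the encoding first. */
	return false;
}
```

A distinct code point type does not remove the need for this branch. It would only make it more visible where a pg_wchar is being reinterpreted without one.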