On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
> I wonder if the logic to select the member/semantics could be turned
> into an enum in the encoding table, to make it even clearer, and then
> that could be used as an index into a table of ctype methods objects
> in _libc.c.
As long as we're able to isolate that logic in the libc provider,
that's reasonable. The other providers don't need that complexity; they
just need to decode straight to UTF-32.
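To make the idea concrete, here is a rough, untested sketch of an enum
stored in the encoding table and used as an index into a table of method
objects. Every name in it (LibcCtypeKind, EncodingEntry, CtypeMethods,
ctype_methods_for) is hypothetical and not taken from any patch, and a
real method object would carry many more members than one isalpha
callback:

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical: which libc semantics apply to a decoded character. */
typedef enum LibcCtypeKind
{
	LIBC_CTYPE_CHAR,			/* single-byte encoding: isalpha() family */
	LIBC_CTYPE_WCHAR,			/* multibyte encoding: iswalpha() family */
	LIBC_CTYPE_NKINDS
} LibcCtypeKind;

/* Hypothetical, cut-down encoding table entry. */
typedef struct EncodingEntry
{
	const char *name;
	LibcCtypeKind ctype_kind;
} EncodingEntry;

/* Hypothetical method object, reduced to a single callback. */
typedef struct CtypeMethods
{
	bool		(*wc_isalpha) (uint32_t wc);
} CtypeMethods;

static bool
wc_isalpha_char(uint32_t wc)
{
	/* Values that cannot be a single byte are never alphabetic here. */
	return wc <= UCHAR_MAX && isalpha((unsigned char) wc) != 0;
}

static bool
wc_isalpha_wchar(uint32_t wc)
{
	return iswalpha((wint_t) wc) != 0;
}

/* One method object per enum value, indexed directly by the enum. */
static const CtypeMethods ctype_methods[LIBC_CTYPE_NKINDS] = {
	[LIBC_CTYPE_CHAR] = {.wc_isalpha = wc_isalpha_char},
	[LIBC_CTYPE_WCHAR] = {.wc_isalpha = wc_isalpha_wchar},
};

/* The only place that has to know about the enum. */
static const CtypeMethods *
ctype_methods_for(const EncodingEntry *enc)
{
	assert(enc->ctype_kind < LIBC_CTYPE_NKINDS);
	return &ctype_methods[enc->ctype_kind];
}
```

The point of the lookup function is that the selection logic stays in
one file; callers elsewhere only ever see a CtypeMethods pointer.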
> You showed char16_t for Windows, but we don't ever get char16_t out of
> wchar.c, it's always char32_t for UTF-8 input. It's just that _libc.c
> truncates to UTF-16 or short-circuits to avoid overflow on that
> platform (and in the past AIX 32-bit and maybe more), so it wouldn't
> belong in a hypothetical union or enum.
Oh, I see.
> Perhaps we could at least put the conversion in a new encoding table
> function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
> place to put that sort of optimisation in
That sounds like a good step forward. And maybe one to convert to
UTF-32 for ICU, also?
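Roughly what I have in mind, as an untested sketch. The struct layout,
the member names custom_to_wchar_t and custom_to_utf32, and the helper
functions are all made up for illustration; only the idea of a
per-encoding function pointer comes from the discussion above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical signatures: convert one decoded character. */
typedef bool (*custom_to_wchar_t_fn) (uint32_t c, wchar_t *out);
typedef bool (*custom_to_utf32_fn) (uint32_t c, uint32_t *out);

/* Hypothetical, cut-down encoding table entry. */
typedef struct EncodingEntry
{
	const char *name;
	custom_to_wchar_t_fn custom_to_wchar_t; /* for libc; NULL if unsupported */
	custom_to_utf32_fn custom_to_utf32;		/* for ICU; NULL if unsupported */
} EncodingEntry;

static bool
utf32_to_wchar_t(uint32_t c, wchar_t *out)
{
	/* Short-circuit when the value cannot fit, e.g. with a 16-bit wchar_t. */
	if (c > (uint32_t) WCHAR_MAX)
		return false;
	*out = (wchar_t) c;
	return true;
}

static bool
utf32_identity(uint32_t c, uint32_t *out)
{
	/* For UTF-8 input the decoded value is already UTF-32. */
	*out = c;
	return true;
}

/* Assumed entry for UTF-8, where the decoded value is a code point. */
static const EncodingEntry encoding_utf8 = {
	.name = "UTF8",
	.custom_to_wchar_t = utf32_to_wchar_t,
	.custom_to_utf32 = utf32_identity,
};
```

Returning a bool lets the caller fall back to "no classification" when a
character does not fit, rather than silently truncating it.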
> If we do develop this idea though, one issue to contemplate is that
> EUC code points might generate more than one wchar_t, looking at
> EUC_JIS_2004[1].
Wow, that's unfortunate.
Regards,
Jeff Davis