On Wed, Oct 29, 2025 at 2:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'm picturing something like PG_WCHAR_CHAR
> (directly usable with ctype.h), PG_WCHAR_UTF32 (self-explanatory, also
> assumed to be compatible with UTF-8 locales' wchar_t), PG_WCHAR_CUSTOM
> (we only know that ASCII range is sane as Ishii-san explained, and for
> anything else you'd need to re-encode via libc or give up, but
> preferably not go nuts and return junk). The enum would create a new
> central place to document the cross-module semantics.
Here are some sketch-quality patches to try out some of these ideas,
for discussion. I gave them .txt endings so as not to hijack your
thread's CI.
* Fixing a different but related bug spotted in passing: we truncate
codepoints passed to Windows' iswalpha_l() et al, instead of detecting
overflow like some other places do. Not tested on Windows, but it
seemed pretty obviously wrong?
* Classifying all pg_wchar encodings as producing PG_WCHAR_CHAR,
PG_WCHAR_UTF32 or PG_WCHAR_CUSTOM, and dispatching to libc ctype
methods based on that.
* Easy EUC change: filtering out non-ASCII for _CUSTOM. I can't seem
to convince SQL-level regexes to expose bogus results on master
though... maybe the pg_wchar encoding actively avoids that by shifting
values up, so you often or always cast to a harmless value? Still
better to formalise that I think, if we don't move ahead with the more
ambitious plan...
* More ambitious re-encoding strategy, replacing previous change, with
apparently plausible results.
* Various refactorings with helper macros to avoid making mistakes in
all that repetitive wrapper stuff.
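To make the first three bullets concrete, here is a minimal standalone
sketch. It is not taken from the attached patches: the enum type name
wchar_kind_sketch and the function sketch_iswalpha() are made up for
illustration, and only the three PG_WCHAR_* names come from the proposal
above. It shows dispatch on the classification, an overflow check
instead of truncation for the UTF-32 case, and the ASCII-only filter for
_CUSTOM.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical classification, using the names floated in this thread. */
typedef enum
{
    PG_WCHAR_CHAR,              /* single-byte values, usable with <ctype.h> */
    PG_WCHAR_UTF32,             /* Unicode code points */
    PG_WCHAR_CUSTOM             /* encoding-specific; only ASCII is known sane */
} wchar_kind_sketch;

/* Hypothetical wrapper: is "c" alphabetic, given how it was encoded? */
static bool
sketch_iswalpha(wchar_kind_sketch kind, uint32_t c)
{
    switch (kind)
    {
        case PG_WCHAR_CHAR:
            /* <ctype.h> functions only accept unsigned char values (or EOF). */
            return c <= UCHAR_MAX && isalpha((int) c) != 0;

        case PG_WCHAR_UTF32:
            /* Detect overflow instead of truncating, e.g. with 16-bit wchar_t. */
            if (c > (uint32_t) WCHAR_MAX)
                return false;
            return iswalpha((wint_t) c) != 0;

        case PG_WCHAR_CUSTOM:
            /* Filter out everything but ASCII rather than return junk. */
            return c < 128 && isalpha((int) c) != 0;
    }
    assert(false);              /* not reached: every enum value is handled */
    return false;
}
```

With no setlocale() call the process stays in the "C" locale, so
sketch_iswalpha(PG_WCHAR_CUSTOM, 'q') is true while any value of 128 or
above is rejected for _CUSTOM before libc ever sees it.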
Here's what my ja_JP.eucJP database shows, on FreeBSD. BTW in my
earlier emails I was confused and thought that kanji would not be in
class [[:alpha:]], but that's wrong: Unicode calls it "other letter",
and it looks like that makes all modern libcs return true for
iswalpha():
postgres=# select regexp_replace('1234 Постгрес 5678', '[[:alpha:]]+', '象');
regexp_replace
----------------
1234 象 5678
(1 row)
postgres=# select regexp_replace('1234 ポスグレ 5678', '[[:alpha:]]+', '象');
regexp_replace
----------------
1234 象 5678
(1 row)
postgres=# select regexp_replace('1234 ポスグレ？ 5678', '[[:punct:]]+', '。');
regexp_replace
----------------------
1234 ポスグレ。 5678
(1 row)
(That's not an ASCII question mark, it's one of the kanji-box sized
punctuation characters.)
I had to hack regc_pg_locale.c slightly to teach it that just because
I set max_chr to 127, it doesn't mean I want it to turn locale support
off. I haven't looked into that code to figure out what it should do
instead, but it definitely shouldn't be allowed to probe made-up
pg_wchar values, because EUC's pg_wchar encoding is sparse and
transcoding can error out.
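A toy illustration of why probing arbitrary values is a problem. This
is a simplified model, not PostgreSQL's conversion code: it assumes, for
illustration only, that a two-byte EUC character is packed as
(lead << 8) | trail with both bytes in 0xA1..0xFE, and it ignores the
single-shift (SS2/SS3) forms entirely. Both function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed packing, for illustration only: lead byte in bits 8-15,
 * trail byte in bits 0-7.
 */
static uint32_t
sketch_pack_euc2(unsigned char b1, unsigned char b2)
{
    assert(b1 >= 0xA1 && b1 <= 0xFE && b2 >= 0xA1 && b2 <= 0xFE);
    return ((uint32_t) b1 << 8) | b2;
}

/*
 * True if "c" is ASCII or unpacks to a plausible two-byte sequence
 * under the assumed packing.  Single-shift and three-byte forms are
 * outside this sketch and are reported as not plausible.
 */
static bool
sketch_is_plausible(uint32_t c)
{
    unsigned char b1 = (c >> 8) & 0xFF;
    unsigned char b2 = c & 0xFF;

    if (c < 0x80)
        return true;            /* plain ASCII */
    if (c > 0xFFFF)
        return false;           /* not a two-byte form */
    return b1 >= 0xA1 && b1 <= 0xFE && b2 >= 0xA1 && b2 <= 0xFE;
}
```

Under that model, a loop that walks 0, 1, 2, ... up to some max_chr hits
mostly values that fail sketch_is_plausible(), for example 0x100, and
those are exactly the ones that cannot be turned back into a multibyte
sequence.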
A mystery that blocked me for too long: regexp_match('café', 'CAFÉ',
'i') and regexp_match('Αθήνα', 'ΑΘΉΝΑ', 'i') match with Apple's
ja_JP.eucJP, as do the examples above, but mysteriously didn't on
FreeBSD, where this code started. That could be a bug in its
ja_JP.eucJP locale affecting toupper/tolower... Wish I could get that
time back.
I imagine that for the ICU + non-UTF-8 locale bug you mentioned, we
might need a very similar set of re-encoding wrappers: something like
pg_wchar -> mb -> UTF-8 -> UTF-32. All this re-encoding sounds
pretty bad, but I can't see any way around the re-encoding with these
edge-case configurations, and we're still supposed to spit out correct
answers...
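For a feel of the cost, here is a standalone sketch of just the last
hop of that chain, UTF-8 -> UTF-32, for a single character. The earlier
hops (pg_wchar -> mb -> UTF-8) depend on the server's own conversion
routines and are left out. The function name sketch_utf8_to_utf32() is
hypothetical and this is not code from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from "s" (with "len" bytes available) into
 * *out.  Returns the number of bytes consumed, or 0 if the input is
 * truncated, overlong, a surrogate, or otherwise invalid.
 */
static size_t
sketch_utf8_to_utf32(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t    cp;
    uint32_t    min;            /* smallest code point valid at this length */
    size_t      n;
    size_t      i;

    assert(s != NULL && out != NULL);
    if (len == 0)
        return 0;

    if (s[0] < 0x80)
    {
        cp = s[0]; n = 1; min = 0;
    }
    else if ((s[0] & 0xE0) == 0xC0)
    {
        cp = s[0] & 0x1F; n = 2; min = 0x80;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        cp = s[0] & 0x0F; n = 3; min = 0x800;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        cp = s[0] & 0x07; n = 4; min = 0x10000;
    }
    else
        return 0;               /* stray continuation or invalid lead byte */

    if (len < n)
        return 0;               /* truncated sequence */

    for (i = 1; i < n; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;               /* overlong, out of range, or surrogate */

    *out = cp;
    return n;
}
```

Every character classified this way pays for the earlier conversions
plus a decode like this one, which is why the chain looks expensive even
though each step is simple.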