Re: encoding affects ICU regex character classification - Mailing list pgsql-hackers

From Tom Lane
Subject Re: encoding affects ICU regex character classification
Date
Msg-id 360857.1701302164@sss.pgh.pa.us
In response to encoding affects ICU regex character classification  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: encoding affects ICU regex character classification
List pgsql-hackers
Jeff Davis <pgsql@j-davis.com> writes:
> The problem seems to be confusion between pg_wchar and a unicode code
> point in pg_wc_isalpha() and related functions.

Yeah, that's an ancient sore spot: we don't really know what the
representation of wchar is.  We assume it's Unicode code points
for UTF8 locales, but libc isn't required to do that AFAIK.  See
comment block starting about line 20 in regc_pg_locale.c.
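
To make the assumption concrete, here is a minimal standalone sketch; it is
not code from the PostgreSQL tree, and the names utf8_to_codepoint,
wchar_claims_unicode and codepoint_isalpha_via_libc are hypothetical.  It
decodes a UTF-8 sequence to a code point and hands that value to
iswalpha() only when the C library defines __STDC_ISO_10646__, the
standard macro by which an implementation promises that wchar_t values
are ISO 10646 code points.  Where the macro is absent, the sketch
declines to answer instead of guessing.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

#define INVALID_CODEPOINT ((uint32_t) 0xFFFFFFFF)

/*
 * Decode the first UTF-8 sequence in a NUL-terminated string.
 * Returns INVALID_CODEPOINT for a malformed or truncated sequence.
 * (Sketch only: overlong forms and surrogates are not rejected.)
 */
static uint32_t
utf8_to_codepoint(const unsigned char *s)
{
    assert(s != NULL);

    if (s[0] < 0x80)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0 &&
        (s[1] & 0xC0) == 0x80)
        return ((uint32_t) (s[0] & 0x1F) << 6) |
               (uint32_t) (s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80)
        return ((uint32_t) (s[0] & 0x0F) << 12) |
               ((uint32_t) (s[1] & 0x3F) << 6) |
               (uint32_t) (s[2] & 0x3F);
    if ((s[0] & 0xF8) == 0xF0 &&
        (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 &&
        (s[3] & 0xC0) == 0x80)
        return ((uint32_t) (s[0] & 0x07) << 18) |
               ((uint32_t) (s[1] & 0x3F) << 12) |
               ((uint32_t) (s[2] & 0x3F) << 6) |
               (uint32_t) (s[3] & 0x3F);
    return INVALID_CODEPOINT;
}

/* 1 if the C library promises that wchar_t holds ISO 10646 code points. */
static int
wchar_claims_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/*
 * Classify a code point with <wctype.h>, but only when the wchar_t
 * representation is known to be Unicode and the value fits in wchar_t.
 * Returns 1 or 0 from iswalpha(), or -1 for "cannot tell".
 * The iswalpha() answer still depends on the current LC_CTYPE locale.
 */
static int
codepoint_isalpha_via_libc(uint32_t cp)
{
    if (!wchar_claims_unicode() || cp > (uint32_t) WCHAR_MAX)
        return -1;
    return iswalpha((wint_t) cp) != 0;
}
```

Even when the macro is defined, the result of iswalpha() varies with the
LC_CTYPE setting in effect, so this only narrows the problem; it does
not make the <wctype.h> functions a reliable source of Unicode
properties.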

I doubt that ICU has much to do with this directly.

We'd have to find an alternate source of knowledge to replace the
<wctype.h> functions if we wanted to fix it fully ... can ICU do that?

            regards, tom lane


