Re: Assert failure with ICU support - Mailing list pgsql-bugs

From	Jeff Davis
Subject	Re: Assert failure with ICU support
Date	April 19, 2023 20:31:17
Msg-id	ece303a3068026c6d6f10cab2a7f3f0836f4adb7.camel@j-davis.com
In response to	Assert failure with ICU support (Richard Guo <guofenglinux@gmail.com>)
Responses	Re: Assert failure with ICU support
List	pgsql-bugs

On Wed, 2023-04-19 at 18:30 +0800, Richard Guo wrote:
> I happened to run into an assert failure by chance with ICU support.
> Here is the query:
>
>     SELECT '1' SIMILAR TO '\൧';
>
> The failure happens in lexescape(),
>
>         default:
>             assert(iscalpha(c));
>             FAILW(REG_EESCAPE); /* unknown alphabetic escape */
>             break;
>
> Without ICU support, the same query just gives an error.
>
> # SELECT '1' SIMILAR TO '\൧';
> ERROR:  invalid regular expression: invalid escape \ sequence
>
> FWIW, I googled a bit and '൧' seems to be number 1 in Malayalam.

Thank you for the report and analysis! The problem goes all the way
back to older releases if you specify an ICU collation explicitly:

  SELECT '1' COLLATE "en-US-x-icu" SIMILAR TO '\൧';

The root cause (which you allude to) is that the code makes the
assumption that digits only include 0-9, but u_isdigit('൧') == true,
which violates that assumption.

On Linux[1] specifically, the assumption appears to hold for
iswdigit(). But according to POSIX[2], the behavior of iswdigit()
depends on the locale, so I don't think it's correct to rely on that
assumption in general.

I did some experimentation on ICU and I found (pseudocode -- the real
code needs to create a UChar32 from an encoded string first):

  char name: MALAYALAM DIGIT ONE
  u_isalnum('൧'): true
  u_isalpha('൧'): false
  u_isdigit('൧'): true
  u_charType('൧') == U_DECIMAL_DIGIT_NUMBER: true
  u_hasBinaryProperty('൧', UCHAR_POSIX_XDIGIT): true
  u_hasBinaryProperty('൧', UCHAR_POSIX_ALNUM): true
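
As a rough illustration of the "create a UChar32 from an encoded string"
step mentioned above, here is a minimal hand-rolled UTF-8 decoder in plain
C. It is only a sketch: decode_utf8() is a hypothetical helper, not
PostgreSQL or ICU code, and real code using ICU would more likely rely on
ICU's own macros such as U8_NEXT. The result can be stored in an int32_t,
which is what UChar32 is.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the single UTF-8 sequence that starts at s, with len bytes
 * available. Returns the code point, or -1 if the input is truncated
 * or structurally malformed.
 *
 * Hypothetical helper for illustration only. It does not reject
 * overlong encodings or surrogate code points.
 */
static int32_t
decode_utf8(const unsigned char *s, size_t len)
{
    int32_t cp;
    size_t  need;   /* number of continuation bytes expected */
    size_t  i;

    if (len == 0)
        return -1;

    if (s[0] < 0x80)
        return s[0];                    /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0)
    {
        cp = s[0] & 0x1F;
        need = 1;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        cp = s[0] & 0x0F;
        need = 2;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        cp = s[0] & 0x07;
        need = 3;
    }
    else
        return -1;                      /* stray continuation or invalid lead byte */

    if (len < need + 1)
        return -1;                      /* truncated sequence */

    for (i = 1; i <= need; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return -1;                  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```

For MALAYALAM DIGIT ONE, the UTF-8 bytes E0 B5 A7 decode to U+0D67, which
is the value that would then be passed to u_isdigit() and friends.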

The docs[3] for ICU say:

  "There are also functions that provide easy migration from C/POSIX
functions like isblank(). Their use is generally discouraged because
the C/POSIX standards do not define their semantics beyond the ASCII
range, which means that different implementations exhibit very
different behavior. Instead, Unicode properties should be used
directly."

We should probably just check that it's plain ASCII.
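
A minimal sketch of what a plain-ASCII check could look like, in C. The
function names here are hypothetical and this is not the actual fix, just
one way to test a code point against fixed ASCII ranges so that the answer
cannot vary with the locale or the collation provider. It assumes an
ASCII-compatible execution character set.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Locale-independent classification of a code point.
 * Hypothetical helpers for illustration only.
 */
static bool
ascii_isdigit(int32_t c)
{
    /* only '0'..'9' count, regardless of Unicode digit properties */
    return c >= '0' && c <= '9';
}

static bool
ascii_isalpha(int32_t c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool
ascii_isalnum(int32_t c)
{
    return ascii_isdigit(c) || ascii_isalpha(c);
}
```

With checks like these, U+0D67 (MALAYALAM DIGIT ONE) is neither a digit
nor alphanumeric, so it would not be treated as the start of an
alphanumeric escape in the first place.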

Unfortunately I would not be surprised if there are more bugs similar
to this one.

Regards,
    Jeff Davis

[1] https://man7.org/linux/man-pages/man3/iswdigit.3.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswdigit.html
[3] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details