BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters
Date
Msg-id 15476-4314f480acf0f114@postgresql.org
Responses Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15476
Logged by:          Kenji Uno
Email address:      h8mastre@gmail.com
PostgreSQL version: 9.6.2
Operating system:   Windows Server 2012 Japanese
Description:

# Problem on show_trgm with 4 byte UTF-8 characters

On a database with Encoding=UTF-8, try:

SELECT show_trgm('123');
→ OK

SELECT show_trgm('日本語');
→ probably OK.

SELECT show_trgm('🔍');
→ ERROR!

ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.
SQL state: 22021


I have reviewed some of the source code and found a suspicious
spot.

Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35

The 4th parameter of char2wchar should take the number of input bytes.
However, these functions appear to pass a character count.

int clen = pg_mblen(ptr);
...
char2wchar(character, 2, ptr, clen, mylocale);


Sorry to bother you, but could you look into this?

