Thread: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      15476
Logged by:          Kenji Uno
Email address:      h8mastre@gmail.com
PostgreSQL version: 9.6.2
Operating system:   Windows Server 2012 Japanese
Description:

# Problem on show_trgm with 4 byte UTF-8 characters

On Encoding=UTF-8 database, try:

SELECT show_trgm('123');
→ OK

SELECT show_trgm('日本語');
→ probably OK.

SELECT show_trgm('🔍');
→ ERROR!

ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.
SQL state: 22021


I reviewed some of the source code and found a suspicious spot.

Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35

char2wchar 4th parameter should take number of input bytes. However they
pass character count.

int clen = pg_mblen(ptr);
...
char2wchar(character, 2, ptr, clen, mylocale);


Could you please look into this?


Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

From: Tom Lane
PG Bug reporting form <noreply@postgresql.org> writes:
> On Encoding=UTF-8 database, try:
> SELECT show_trgm('123');
> → OK
> SELECT show_trgm('日本語');
> → probably OK.
> SELECT show_trgm('🔍');
> ERROR:  invalid multibyte character for locale
> HINT:  The server's LC_CTYPE locale is probably incompatible with the
> database encoding.
> SQL state: 22021

I failed to reproduce this on a Linux machine.  It looks to me like the
problem is that Windows' MultiByteToWideChar doesn't think that UTF8
character is valid.

> Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
> https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35
> char2wchar 4th parameter should take number of input bytes. However they
> pass character count.
> int clen = pg_mblen(ptr);
> ...
> char2wchar(character, 2, ptr, clen, mylocale);

Huh?  pg_mblen returns the number of bytes in a multibyte character,
so this looks fine to me.

            regards, tom lane
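
To illustrate the byte-count point in the reply above: in UTF-8, the length of a character in bytes can be read from its lead byte, and that byte length is the kind of value passed as clen. The sketch below is a minimal illustration only. utf8_seq_len is a hypothetical helper written for this note, not PostgreSQL's pg_mblen, and it does no validation of continuation bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not PostgreSQL code): return the number of bytes
 * in the UTF-8 sequence that starts with the given lead byte.
 * Invalid lead bytes are treated as a 1-byte sequence.
 */
static int
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx: e.g. the kanji in the report */
    if ((lead & 0xF8) == 0xF0)
        return 4;               /* 11110xxx: code points above U+FFFF */
    return 1;                   /* continuation or invalid byte */
}
```

For the inputs in the report, '1' gives 1, the first byte of '日' (0xE6) gives 3, and the first byte of '🔍' (0xF0) gives 4. All three are byte counts, not character counts.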


Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

From: Tom Lane
kenji uno <h8mastre@gmail.com> writes:
>> I failed to reproduce this on a Linux machine.  It looks to me like the
>> problem is that Windows' MultiByteToWideChar doesn't think that UTF8
>> character is valid.

> I'm just wondering why my issue occurs only on Windows.
> But I knew why: char2wchar's tolen requires +1 output buffer size, due to
> null-termination.

Oooh ... the problem, effectively, is that the ts_locale.c functions are
expecting to get back UTF32 but what they'll actually get on Windows is
UTF16.  So if the given character is outside the BMP range, char2wchar
needs to produce a surrogate pair, which there's not room for given that
the output buffer can only hold 1 wchar_t plus trailing null.

Then the other problem is that the Windows-Unicode code path in char2wchar
just fails for an undersized output buffer, which you would not expect
from its documentation.  And it fails with a misleading error message,
too.

I'll see what I can do about this --- thanks for the report!

            regards, tom lane
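
The surrogate-pair arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the fix that went into PostgreSQL: utf16_units and utf16_encode are hypothetical names, and uint16_t stands in for a 16-bit wchar_t as found on Windows.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: UTF-16 code units needed for one code point. */
static size_t
utf16_units(uint32_t cp)
{
    return (cp >= 0x10000) ? 2 : 1;     /* outside the BMP needs a pair */
}

/*
 * Hypothetical helper: encode one code point as UTF-16.
 * "out" must have room for 2 units; returns the number written.
 * No terminating null is written here, so a caller that wants one
 * needs a buffer of utf16_units(cp) + 1 elements.
 */
static size_t
utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000)
    {
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;                                  /* 20 bits remain */
    out[0] = (uint16_t) (0xD800 + (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));    /* low surrogate */
    return 2;
}
```

For '🔍' (U+1F50D) this yields the pair 0xD83D 0xDD0D, so the result plus a trailing null needs three 16-bit elements, one more than a two-element buffer provides. A character inside the BMP such as '日' (U+65E5) needs only one unit, which is consistent with the report that the kanji example worked.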