Thread: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters
The following bug has been logged on the website:

Bug reference:      15476
Logged by:          Kenji Uno
Email address:      h8mastre@gmail.com
PostgreSQL version: 9.6.2
Operating system:   Windows Server 2012 Japanese
Description:

# Problem on show_trgm with 4 byte UTF-8 characters

On an Encoding=UTF-8 database, try:

SELECT show_trgm('123');
→ OK
SELECT show_trgm('日本語');
→ probably OK.
SELECT show_trgm('🔍');
→ ERROR!

ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.
SQL state: 22021

I have reviewed some of your source code and found a suspect point.
Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.

https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35

The 4th parameter of char2wchar should take the number of input bytes. However, these functions pass a character count.

int clen = pg_mblen(ptr);
...
char2wchar(character, 2, ptr, clen, mylocale);

Could you please look into this?
PG Bug reporting form <noreply@postgresql.org> writes:
> On Encoding=UTF-8 database, try:
> SELECT show_trgm('123');
> → OK
> SELECT show_trgm('日本語');
> → probably OK.
> SELECT show_trgm('🔍');
> ERROR: invalid multibyte character for locale
> HINT: The server's LC_CTYPE locale is probably incompatible with the
> database encoding.
> SQL state: 22021

I failed to reproduce this on a Linux machine. It looks to me like the
problem is that Windows' MultiByteToWideChar doesn't think that UTF8
character is valid.

> Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
> https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35
> char2wchar 4th parameter should take number of input bytes. However they
> pass character count.
> int clen = pg_mblen(ptr);
> ...
> char2wchar(character, 2, ptr, clen, mylocale);

Huh? pg_mblen returns the number of bytes in a multibyte character,
so this looks fine to me.

regards, tom lane
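To illustrate the point about byte counts, here is a minimal sketch of how a byte length can be derived from a UTF-8 lead byte. The function name utf8_seq_len is hypothetical and this is not PostgreSQL's pg_mblen implementation; it only shows that the single character '🔍' (U+1F50D) occupies 4 bytes in UTF-8, which is the kind of value a byte-length function would report for it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not PostgreSQL's pg_mblen): return the length in
 * bytes of the UTF-8 sequence that starts at s, judged by its lead byte.
 * An invalid lead byte is treated as a 1-byte sequence.
 */
int
utf8_seq_len(const unsigned char *s)
{
    assert(s != NULL);

    if ((*s & 0x80) == 0x00)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((*s & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((*s & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((*s & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 1;                   /* continuation or invalid byte */
}
```

With this sketch, the first byte of '1' yields 1, the first byte of '日' (E6 97 A5) yields 3, and the first byte of '🔍' (F0 9F 94 8D) yields 4: one character in each case, but different byte counts.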
kenji uno <h8mastre@gmail.com> writes:
>> I failed to reproduce this on a Linux machine. It looks to me like the
>> problem is that Windows' MultiByteToWideChar doesn't think that UTF8
>> character is valid.

> I'm just wondering why my issue occurs only on Windows.
> But I knew why: char2wchar's tolen requires +1 output buffer size, due to
> null-termination.

Oooh ... the problem, effectively, is that the ts_locale.c functions
are expecting to get back UTF32 but what they'll actually get on Windows
is UTF16. So if the given character is outside the BMP range, char2wchar
needs to produce a surrogate pair, which there's not room for given that
the output buffer can only hold 1 wchar_t plus trailing null.

Then the other problem is that the Windows-Unicode code path in char2wchar
just fails for an undersized output buffer, which you would not expect
from its documentation. And it fails with a misleading error message, too.

I'll see what I can do about this --- thanks for the report!

regards, tom lane
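The buffer-size problem described above can be sketched in isolation. The following is an illustrative example only, not code from char2wchar or ts_locale.c: encode_utf16 is a hypothetical helper that encodes one code point as null-terminated UTF-16 and reports failure when the output buffer is too small. It assumes 16-bit output units, as with wchar_t on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a PostgreSQL function): encode one Unicode code
 * point as null-terminated UTF-16 into out[], which has room for outlen
 * 16-bit units. Returns the number of units written before the terminator,
 * or -1 if the buffer is too small or the code point is invalid.
 */
int
encode_utf16(uint32_t cp, uint16_t *out, size_t outlen)
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;              /* not a Unicode scalar value */

    if (cp < 0x10000)
    {
        /* Inside the BMP: one unit plus the terminator */
        if (outlen < 2)
            return -1;
        out[0] = (uint16_t) cp;
        out[1] = 0;
        return 1;
    }

    /* Outside the BMP: a surrogate pair plus the terminator */
    if (outlen < 3)
        return -1;
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
    out[2] = 0;
    return 2;
}
```

With a 2-unit buffer, '日' (U+65E5) fits as a single unit, but '🔍' (U+1F50D) does not: it needs the surrogate pair D83D DD0D plus the terminator, so three units. That matches the failure seen only on Windows, where the 2-element output buffer is sized for one UTF-32 character and a trailing null.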