Thread: Mac OS: invalid byte sequence for encoding "UTF8"
Hello. When a user try to create a text search dictionary for the russian language on Mac OS then called the following error message: CREATE EXTENSION hunspell_ru_ru; + ERROR: invalid byte sequence for encoding "UTF8": 0xd1 + CONTEXT: line 341 of configuration file "/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix": "SFX Y хаться шутся хаться Russian dictionary was downloaded from http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian Affix and dictionary files was extracted from the archive and converted to UTF-8. Also a converted dictionary can be downloaded from https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru This behavior occurs on: - Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan. - latest PostgreSQL version from git and PostgreSQL 9.5 (probably also on 9.4.5). There is also the test to reproduce this bug in the attachment. Did you meet this bug? Do you have a solution or a workaround? Thanks in advance. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Wed, Jan 27, 2016 at 10:59 AM, Artur Zakirov <a.zakirov@postgrespro.ru> wrote:
Hello.
When a user try to create a text search dictionary for the russian language on Mac OS then called the following error message:
CREATE EXTENSION hunspell_ru_ru;
+ ERROR: invalid byte sequence for encoding "UTF8": 0xd1
+ CONTEXT: line 341 of configuration file "/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix": "SFX Y хаться шутся хаться
Russian dictionary was downloaded from http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian
Affix and dictionary files was extracted from the archive and converted to UTF-8. Also a converted dictionary can be downloaded from https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru
Not sure why the file uses "SET KOI8-R" directive then?
This behavior occurs on:
- Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan.
- latest PostgreSQL version from git and PostgreSQL 9.5 (probably also on 9.4.5).
There is also the test to reproduce this bug in the attachment.
What error message do you get with this test program? (I don't get any, but I'm not on Mac OS.)
--
Alex
On 27.01.2016 13:46, Shulgin, Oleksandr wrote: > > Not sure why the file uses "SET KOI8-R" directive then? > This directive is used only by Hunspell program. PostgreSQL ignores this directive and assumes that input affix and dictionary files in the UTF-8 encoding. > > > What error message do you get with this test program? (I don't get any, > but I'm not on Mac OS.) > -- > Alex > > With this program you will get wrong output. A error message is not called. You can execute the following commands: > cc test.c -o test> ./test You will get the output: SFX/Y/?/аться/шутся Although the output should be: SFX/Y/хаться/шутся/хаться -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Hi. I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene languagefilter? =) That can be reproduced with simpler test Stas > On 27 Jan 2016, at 13:59, Artur Zakirov <a.zakirov@postgrespro.ru> wrote: > > On 27.01.2016 13:46, Shulgin, Oleksandr wrote: >> >> Not sure why the file uses "SET KOI8-R" directive then? >> > > This directive is used only by Hunspell program. PostgreSQL ignores this directive and assumes that input affix and dictionaryfiles in the UTF-8 encoding. > >> >> >> What error message do you get with this test program? (I don't get any, >> but I'm not on Mac OS.) >> -- >> Alex >> >> > > With this program you will get wrong output. A error message is not called. You can execute the following commands: > > > cc test.c -o test > > ./test > > You will get the output: > > SFX/Y/?/аться/шутся > > Although the output should be: > > SFX/Y/хаться/шутся/хаться > > -- > Artur Zakirov > Postgres Professional: http://www.postgrespro.com > Russian Postgres Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 27.01.2016 14:14, Stas Kelvich wrote: > Hi. > > I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene languagefilter? =) > > That can be reproduced with simpler test > > Stas > > The test program was corrected. Now it uses wchar_t type. And it works correctly and gives right output. I think the NIImportOOAffixes() in spell.c should be corrected to avoid this bug. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On 27.01.2016 15:28, Artur Zakirov wrote: > On 27.01.2016 14:14, Stas Kelvich wrote: >> Hi. >> >> I tried that and confirm strange behaviour. It seems that problem with >> small cyrillic letter ‘х’. (simplest obscene language filter? =) >> >> That can be reproduced with simpler test >> >> Stas >> >> > > The test program was corrected. Now it uses wchar_t type. And it works > correctly and gives right output. > > I think the NIImportOOAffixes() in spell.c should be corrected to avoid > this bug. > I have attached a patch. It adds new functions parse_ooaffentry() and get_nextentry() and fixes a couple comments. Now russian and other supported dictionaries can be used for text search in Mac OS. parse_ooaffentry() parses an affix file entry instead of sscanf(). It has a similar algorithm to the parse_affentry() function. Should I create a new patch to fix this bug (as I did) or this patch should go with the patch http://www.postgresql.org/message-id/56AA02EE.6090004@postgrespro.ru ? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On 28.01.2016 17:42, Artur Zakirov wrote: > On 27.01.2016 15:28, Artur Zakirov wrote: >> On 27.01.2016 14:14, Stas Kelvich wrote: >>> Hi. >>> >>> I tried that and confirm strange behaviour. It seems that problem with >>> small cyrillic letter ‘х’. (simplest obscene language filter? =) >>> >>> That can be reproduced with simpler test >>> >>> Stas >>> >>> >> >> The test program was corrected. Now it uses wchar_t type. And it works >> correctly and gives right output. >> >> I think the NIImportOOAffixes() in spell.c should be corrected to avoid >> this bug. >> > > I have attached a patch. It adds new functions parse_ooaffentry() and > get_nextentry() and fixes a couple comments. > > Now russian and other supported dictionaries can be used for text search > in Mac OS. > > parse_ooaffentry() parses an affix file entry instead of sscanf(). It > has a similar algorithm to the parse_affentry() function. > > Should I create a new patch to fix this bug (as I did) or this patch > should go with the patch > http://www.postgresql.org/message-id/56AA02EE.6090004@postgrespro.ru ? > I have created a new entry in the commitfest for this patch https://commitfest.postgresql.org/9/496/ -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Artur Zakirov <a.zakirov@postgrespro.ru> writes: >> I think the NIImportOOAffixes() in spell.c should be corrected to avoid >> this bug. > I have attached a patch. It adds new functions parse_ooaffentry() and > get_nextentry() and fixes a couple comments. I do not like this patch much. It is basically "let's stop using sscanf() because it seems to have a bug on one platform". There are at least two things wrong with that approach: 1. By my count there are about 80 uses of *scanf() in our code. Are we going to replace every one of them with hand-rolled code? If not, why is only this instance vulnerable? How can we know whether future uses will have a problem? 2. We're not being very good citizens of the software universe if we just install a hack in Postgres rather than nagging Apple to fix the bug at its true source. I think the appropriate next step to take is to dig into the OS X sources (see http://www.opensource.apple.com, I think probably the relevant code is in the Libc package) and identify exactly what is causing the misbehavior. That would both allow an informed answer to point #1 and greatly increase the odds of getting action on a bug report to Apple. Even if we end up applying this patch verbatim, I think we need that information first. regards, tom lane
On 09.02.2016 20:13, Tom Lane wrote: > I do not like this patch much. It is basically "let's stop using sscanf() > because it seems to have a bug on one platform". There are at least two > things wrong with that approach: > > 1. By my count there are about 80 uses of *scanf() in our code. Are we > going to replace every one of them with hand-rolled code? If not, why > is only this instance vulnerable? How can we know whether future uses > will have a problem? It seems that *scanf() with %s format occures only here: - check.c - get_bin_version() - server.c - get_major_server_version() - filemap.c - isRelDataFile() - pg_backup_directory.c - _LoadBlobs() - xlog.c - do_pg_stop_backup() - mac.c - macaddr_in() I think here sscanf() do not works with the UTF-8 characters. And probably this is only spell.c issue. I agree that previous patch is wrong. Instead of using new parse_ooaffentry() function maybe better to use sscanf() with %ls format. The %ls format is used to read a wide character string. > > 2. We're not being very good citizens of the software universe if we > just install a hack in Postgres rather than nagging Apple to fix the > bug at its true source. > > I think the appropriate next step to take is to dig into the OS X > sources (see http://www.opensource.apple.com, I think probably the > relevant code is in the Libc package) and identify exactly what is > causing the misbehavior. That would both allow an informed answer > to point #1 and greatly increase the odds of getting action on a > bug report to Apple. Even if we end up applying this patch verbatim, > I think we need that information first. > > regards, tom lane > I think this is not a bug. It is a normal behavior. In Mac OS sscanf() with the %s format reads the string one character at a time. The size of letter 'х' is 2. And sscanf() separate it into two wrong characters. In conclusion, I think in spell.c should be used sscanf() with %ls format. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
> It seems that *scanf() with %s format occures only here: > - check.c - get_bin_version() > - server.c - get_major_server_version() > - filemap.c - isRelDataFile() > - pg_backup_directory.c - _LoadBlobs() > - xlog.c - do_pg_stop_backup() > - mac.c - macaddr_in() > I think here sscanf() do not works with the UTF-8 characters. And probably this > is only spell.c issue. Hmm. Here src/backend/access/transam/xlog.c read_tablespace_map() using %s in scanf looks suspisious. I don't fully understand but it looks like it tries to read oid as string. So, it should be safe in usial case Next, _LoadBlobs() reads filename (fname) with a help of sscanf. Could file name be in UTF-8 encoding here? > > I agree that previous patch is wrong. Instead of using new parse_ooaffentry() > function maybe better to use sscanf() with %ls format. The %ls format is used to > read a wide character string. Does %ls modifier exist everewhere? Apple docs says (https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/sscanf.3.html): s ... If an l qualifier is present, the next pointer must be a pointer to wchar_t, into which the input will be placedafter conversion by mbrtowc Actually, it means that wchar2char() call should be used, but it uses wcstombs[_l] which could do not present on some platforms.Does it mean that l modifier of string presents too or not? What do we need to do if %l exists but wcstombs[_l] not? I'm a bit crazy with locale problems and it seems to me that Artur's patch is good idea. Actually, I don't remember exactly, but, seems, commit 7ac8a4be8946c11d5a6bf91bb971b9750c1c60e5 introduced parse_affentry() instead of corresponding sscanf to avoid problems with encoding and scanf. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Artur Zakirov <a.zakirov@postgrespro.ru> writes: > I agree that previous patch is wrong. Instead of using new > parse_ooaffentry() function maybe better to use sscanf() with %ls > format. The %ls format is used to read a wide character string. No, that way is going to give you worse portability problems than what we have now. Older implementations won't have %ls, and even if they do, they might not have wcstombs() which is the only way you'd get from libc's idea of wide characters to an encoding we recognize. > I think this is not a bug. It is a normal behavior. In Mac OS sscanf() > with the %s format reads the string one character at a time. The size of > letter 'х' is 2. And sscanf() separate it into two wrong characters. That argument might be convincing if OSX behaved that way for all multibyte characters, but it doesn't seem to be doing that. Why is only 'х' affected? regards, tom lane
On 10.02.2016 18:51, Teodor Sigaev wrote: > Hmm. Here > src/backend/access/transam/xlog.c read_tablespace_map() > using %s in scanf looks suspisious. I don't fully understand but it > looks like it tries to read oid as string. So, it should be safe in > usial case > > Next, _LoadBlobs() reads filename (fname) with a help of sscanf. Could > file name be in UTF-8 encoding here? This function reads the "blobs.toc" file. It lines have the following format: <uid> <filename> Where <filename> is blob_<uid>.dat. Therefore it should be safe too. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
I wrote: > Artur Zakirov <a.zakirov@postgrespro.ru> writes: >> I think this is not a bug. It is a normal behavior. In Mac OS sscanf() >> with the %s format reads the string one character at a time. The size of >> letter 'х' is 2. And sscanf() separate it into two wrong characters. > That argument might be convincing if OSX behaved that way for all > multibyte characters, but it doesn't seem to be doing that. Why is > only 'х' affected? I looked into the OS X sources, and found that indeed you are right: *scanf processes the input a byte at a time, and applies isspace() to each byte separately, even when the locale is such that that's a clearly insane thing to do. Since this code was derived from FreeBSD, FreeBSD has or once had the same issue. (A look at the freebsd project on github says it still does, assuming that's the authoritative repo.) Not sure about other BSDen. I also verified that in UTF8-based locales, isspace() thinks that 0x85 and 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why it thinks that, but that explains why 'х' fails when adjacent code points don't. So apparently the coding rule we have to adopt is "don't use *scanf() on data that might contain multibyte characters". (There might be corner cases where it'd work all right for conversion specifiers other than %s, but probably you might as well just use strtol and friends in such cases.) Ugh. regards, tom lane
On 2016-02-10 16:19, Tom Lane wrote: > I wrote: >> Artur Zakirov <a.zakirov@postgrespro.ru> writes: >>> I think this is not a bug. It is a normal behavior. In Mac OS >>> sscanf() >>> with the %s format reads the string one character at a time. The size >>> of >>> letter 'х' is 2. And sscanf() separate it into two wrong characters. > >> That argument might be convincing if OSX behaved that way for all >> multibyte characters, but it doesn't seem to be doing that. Why is >> only 'х' affected? > > I looked into the OS X sources, and found that indeed you are right: > *scanf processes the input a byte at a time, and applies isspace() to > each byte separately, even when the locale is such that that's a > clearly > insane thing to do. Since this code was derived from FreeBSD, FreeBSD > has or once had the same issue. (A look at the freebsd project on > github > says it still does, assuming that's the authoritative repo.) Not sure > about other BSDen. > > I also verified that in UTF8-based locales, isspace() thinks that 0x85 > and > 0xA0, and no other high-bit-set values, are spaces. Not sure exactly > why > it thinks that, but that explains why 'х' fails when adjacent code > points > don't. > > So apparently the coding rule we have to adopt is "don't use *scanf() > on data that might contain multibyte characters". (There might be > corner > cases where it'd work all right for conversion specifiers other than > %s, > but probably you might as well just use strtol and friends in such > cases.) > Ugh. > > regards, tom lane Definitive FreeBSD Sources: https://svnweb.freebsd.org/base/ -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961
Larry Rosenman <ler@lerctr.org> writes: > On 2016-02-10 16:19, Tom Lane wrote: >> I looked into the OS X sources, and found that indeed you are right: >> *scanf processes the input a byte at a time, and applies isspace() to >> each byte separately, even when the locale is such that that's a >> clearly insane thing to do. Since this code was derived from FreeBSD, >> FreeBSD has or once had the same issue. (A look at the freebsd project >> on github says it still does, assuming that's the authoritative repo.) >> Not sure about other BSDen. > Definitive FreeBSD Sources: > https://svnweb.freebsd.org/base/ Ah, thanks for the link. I'm not totally sure which branch is most current, but at least on this one, it's still clearly wrong: https://svnweb.freebsd.org/base/stable/10/lib/libc/stdio/vfscanf.c?revision=291336&view=markup convert_string(), which handles %s, applies isspace() to individual bytes regardless of locale. convert_wstring(), which handles %ls, does it more intelligently ... but as I said upthread, relying on %ls would just give us a different set of portability problems. It looks like Artur's patch is indeed what we need to do, along with looking around for other *scanf() uses that are vulnerable. regards, tom lane
On 2016-02-10 17:00, Tom Lane wrote: > Larry Rosenman <ler@lerctr.org> writes: >> On 2016-02-10 16:19, Tom Lane wrote: >>> I looked into the OS X sources, and found that indeed you are right: >>> *scanf processes the input a byte at a time, and applies isspace() to >>> each byte separately, even when the locale is such that that's a >>> clearly insane thing to do. Since this code was derived from >>> FreeBSD, >>> FreeBSD has or once had the same issue. (A look at the freebsd >>> project >>> on github says it still does, assuming that's the authoritative >>> repo.) >>> Not sure about other BSDen. > >> Definitive FreeBSD Sources: >> https://svnweb.freebsd.org/base/ > > Ah, thanks for the link. I'm not totally sure which branch is most > current, but at least on this one, it's still clearly wrong: > https://svnweb.freebsd.org/base/stable/10/lib/libc/stdio/vfscanf.c?revision=291336&view=markup > convert_string(), which handles %s, applies isspace() to individual > bytes > regardless of locale. convert_wstring(), which handles %ls, does it > more > intelligently ... but as I said upthread, relying on %ls would just > give > us a different set of portability problems. > > It looks like Artur's patch is indeed what we need to do, along with > looking around for other *scanf() uses that are vulnerable. > > regards, tom lane that would be the current 10.x tree, production, and getting ready for 10.3 which is in code slush. If you want, file a bug at https://bugs.freebsd.org/bugzilla -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961
Larry Rosenman <ler@lerctr.org> writes: > If you want, file a bug at https://bugs.freebsd.org/bugzilla Probably not much point; the commit log shows pretty clearly that they have been thinking about the code's behavior with multibyte characters, so I assume they've intentionally decided to keep it like this. regards, tom lane
Artur Zakirov <a.zakirov@postgrespro.ru> writes: > [ tsearch_aff_parse_v1.patch ] I've pushed this with some corrections --- notably, I did not like the lack of buffer-overrun prevention, and it did the wrong thing if a line had more than one trailing space character. We still need to look at other uses of *scanf(), but I think that any other changes that might be needed should be separate patches anyhow. regards, tom lane
On 02/10/16 17:19, Tom Lane wrote: > I also verified that in UTF8-based locales, isspace() thinks that 0x85 and > 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively. http://unicode.org/standard/reports/tr13/tr13-5.html http://unicode.org/charts/PDF/U0080.pdf -Chap
Chapman Flack <chap@anastigmatix.net> writes: > On 02/10/16 17:19, Tom Lane wrote: >> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and >> 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why > Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively. Yeah, I got that --- what seems squishier is that none of the other C1 control characters are considered whitespace? regards, tom lane
On 02/10/16 23:55, Tom Lane wrote: > Yeah, I got that --- what seems squishier is that none of the other C1 > control characters are considered whitespace? That seems to be exactly the case: http://www.unicode.org/Public/5.2.0/ucd/PropList.txt 09..0D, 20, 85, and A0 are the only whitespace chars whose codepoints fit in a byte. -Chap
On 11.02.2016 01:19, Tom Lane wrote: > I wrote: >> Artur Zakirov <a.zakirov@postgrespro.ru> writes: >>> I think this is not a bug. It is a normal behavior. In Mac OS sscanf() >>> with the %s format reads the string one character at a time. The size of >>> letter 'Ñ…' is 2. And sscanf() separate it into two wrong characters. > >> That argument might be convincing if OSX behaved that way for all >> multibyte characters, but it doesn't seem to be doing that. Why is >> only 'Ñ…' affected? > > I looked into the OS X sources, and found that indeed you are right: > *scanf processes the input a byte at a time, and applies isspace() to > each byte separately, even when the locale is such that that's a clearly > insane thing to do. Since this code was derived from FreeBSD, FreeBSD > has or once had the same issue. (A look at the freebsd project on github > says it still does, assuming that's the authoritative repo.) Not sure > about other BSDen. > > I also verified that in UTF8-based locales, isspace() thinks that 0x85 and > 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why > it thinks that, but that explains why 'Ñ…' fails when adjacent code points > don't. > > So apparently the coding rule we have to adopt is "don't use *scanf() > on data that might contain multibyte characters". (There might be corner > cases where it'd work all right for conversion specifiers other than %s, > but probably you might as well just use strtol and friends in such cases.) > Ugh. > > regards, tom lane > Yes, I meant this. The second byte divides the word into two wrong pieces. Sorry for my unclear explanation. I should to explain more clearly. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 11.02.2016 03:33, Tom Lane wrote: > Artur Zakirov <a.zakirov@postgrespro.ru> writes: >> [ tsearch_aff_parse_v1.patch ] > > I've pushed this with some corrections --- notably, I did not like the > lack of buffer-overrun prevention, and it did the wrong thing if a line > had more than one trailing space character. > > We still need to look at other uses of *scanf(), but I think that any > other changes that might be needed should be separate patches anyhow. > > regards, tom lane > Thank you! -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company