Thread: Mac OS: invalid byte sequence for encoding "UTF8"

Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

27 January 2016, 09:59:49

Hello.

When a user try to create a text search dictionary for the russian
language on Mac OS then called the following error message:

   CREATE EXTENSION hunspell_ru_ru;
+ ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
+ CONTEXT:  line 341 of configuration file
"/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix":
"SFX Y   хаться шутся        хаться

Russian dictionary was downloaded from
http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian
Affix and dictionary files was extracted from the archive and converted
to UTF-8. Also a converted dictionary can be downloaded from
https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru

This behavior occurs on:
- Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan.
- latest PostgreSQL version from git and PostgreSQL 9.5 (probably also
on 9.4.5).

There is also the test to reproduce this bug in the attachment.

Did you meet this bug? Do you have a solution or a workaround?

Thanks in advance.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

test.c

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

"Shulgin, Oleksandr"

Date:

27 January 2016, 10:46:56

On Wed, Jan 27, 2016 at 10:59 AM, Artur Zakirov <a.zakirov@postgrespro.ru> wrote:

Hello.

When a user try to create a text search dictionary for the russian language on Mac OS then called the following error message:

CREATE EXTENSION hunspell_ru_ru;
+ ERROR: invalid byte sequence for encoding "UTF8": 0xd1
+ CONTEXT: line 341 of configuration file "/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix": "SFX Y хаться шутся хаться

Russian dictionary was downloaded from http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian
Affix and dictionary files was extracted from the archive and converted to UTF-8. Also a converted dictionary can be downloaded from https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru

Not sure why the file uses "SET KOI8-R" directive then?

This behavior occurs on:
- Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan.
- latest PostgreSQL version from git and PostgreSQL 9.5 (probably also on 9.4.5).

There is also the test to reproduce this bug in the attachment.

What error message do you get with this test program? (I don't get any, but I'm not on Mac OS.)

Alex

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

27 January 2016, 10:59:43

On 27.01.2016 13:46, Shulgin, Oleksandr wrote:
>
> Not sure why the file uses "SET KOI8-R" directive then?
>

This directive is used only by Hunspell program. PostgreSQL ignores this 
directive and assumes that input affix and dictionary files in the UTF-8 
encoding.

>
>
> What error message do you get with this test program?  (I don't get any,
> but I'm not on Mac OS.)
> --
> Alex
>
>

With this program you will get wrong output. A error message is not 
called. You can execute the following commands:
> cc test.c -o test> ./test

You will get the output:

SFX/Y/?/аться/шутся

Although the output should be:

SFX/Y/хаться/шутся/хаться

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Stas Kelvich

Date:

27 January 2016, 11:14:46

Hi.

I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene
languagefilter? =) 

That can be reproduced with simpler test

Stas



> On 27 Jan 2016, at 13:59, Artur Zakirov <a.zakirov@postgrespro.ru> wrote:
>
> On 27.01.2016 13:46, Shulgin, Oleksandr wrote:
>>
>> Not sure why the file uses "SET KOI8-R" directive then?
>>
>
> This directive is used only by Hunspell program. PostgreSQL ignores this directive and assumes that input affix and
dictionaryfiles in the UTF-8 encoding. 
>
>>
>>
>> What error message do you get with this test program?  (I don't get any,
>> but I'm not on Mac OS.)
>> --
>> Alex
>>
>>
>
> With this program you will get wrong output. A error message is not called. You can execute the following commands:
>
> > cc test.c -o test
> > ./test
>
> You will get the output:
>
> SFX/Y/?/аться/шутся
>
> Although the output should be:
>
> SFX/Y/хаться/шутся/хаться
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

test.c

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

27 January 2016, 12:28:58

On 27.01.2016 14:14, Stas Kelvich wrote:
> Hi.
>
> I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene
languagefilter? =) 
>
> That can be reproduced with simpler test
>
> Stas
>
>

The test program was corrected. Now it uses wchar_t type. And it works
correctly and gives right output.

I think the NIImportOOAffixes() in spell.c should be corrected to avoid
this bug.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

test.c

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

28 January 2016, 14:42:12

On 27.01.2016 15:28, Artur Zakirov wrote:
> On 27.01.2016 14:14, Stas Kelvich wrote:
>> Hi.
>>
>> I tried that and confirm strange behaviour. It seems that problem with
>> small cyrillic letter ‘х’. (simplest obscene language filter? =)
>>
>> That can be reproduced with simpler test
>>
>> Stas
>>
>>
>
> The test program was corrected. Now it uses wchar_t type. And it works
> correctly and gives right output.
>
> I think the NIImportOOAffixes() in spell.c should be corrected to avoid
> this bug.
>

I have attached a patch. It adds new functions parse_ooaffentry() and
get_nextentry() and fixes a couple comments.

Now russian and other supported dictionaries can be used for text search
in Mac OS.

parse_ooaffentry() parses an affix file entry instead of sscanf(). It
has a similar algorithm to the parse_affentry() function.

Should I create a new patch to fix this bug (as I did) or this patch
should go with the patch
http://www.postgresql.org/message-id/56AA02EE.6090004@postgrespro.ru ?

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

tsearch_aff_parse_v1.patch

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

29 January 2016, 12:42:51

On 28.01.2016 17:42, Artur Zakirov wrote:
> On 27.01.2016 15:28, Artur Zakirov wrote:
>> On 27.01.2016 14:14, Stas Kelvich wrote:
>>> Hi.
>>>
>>> I tried that and confirm strange behaviour. It seems that problem with
>>> small cyrillic letter ‘х’. (simplest obscene language filter? =)
>>>
>>> That can be reproduced with simpler test
>>>
>>> Stas
>>>
>>>
>>
>> The test program was corrected. Now it uses wchar_t type. And it works
>> correctly and gives right output.
>>
>> I think the NIImportOOAffixes() in spell.c should be corrected to avoid
>> this bug.
>>
>
> I have attached a patch. It adds new functions parse_ooaffentry() and
> get_nextentry() and fixes a couple comments.
>
> Now russian and other supported dictionaries can be used for text search
> in Mac OS.
>
> parse_ooaffentry() parses an affix file entry instead of sscanf(). It
> has a similar algorithm to the parse_affentry() function.
>
> Should I create a new patch to fix this bug (as I did) or this patch
> should go with the patch
> http://www.postgresql.org/message-id/56AA02EE.6090004@postgrespro.ru ?
>

I have created a new entry in the commitfest for this patch 
https://commitfest.postgresql.org/9/496/

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Tom Lane

Date:

09 February 2016, 17:13:13

Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>> I think the NIImportOOAffixes() in spell.c should be corrected to avoid
>> this bug.

> I have attached a patch. It adds new functions parse_ooaffentry() and 
> get_nextentry() and fixes a couple comments.

I do not like this patch much.  It is basically "let's stop using sscanf()
because it seems to have a bug on one platform".  There are at least two
things wrong with that approach:

1. By my count there are about 80 uses of *scanf() in our code.  Are we
going to replace every one of them with hand-rolled code?  If not, why
is only this instance vulnerable?  How can we know whether future uses
will have a problem?

2. We're not being very good citizens of the software universe if we
just install a hack in Postgres rather than nagging Apple to fix the
bug at its true source.

I think the appropriate next step to take is to dig into the OS X
sources (see http://www.opensource.apple.com, I think probably the
relevant code is in the Libc package) and identify exactly what is
causing the misbehavior.  That would both allow an informed answer
to point #1 and greatly increase the odds of getting action on a
bug report to Apple.  Even if we end up applying this patch verbatim,
I think we need that information first.
        regards, tom lane

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

10 February 2016, 13:39:42

On 09.02.2016 20:13, Tom Lane wrote:
> I do not like this patch much.  It is basically "let's stop using sscanf()
> because it seems to have a bug on one platform".  There are at least two
> things wrong with that approach:
>
> 1. By my count there are about 80 uses of *scanf() in our code.  Are we
> going to replace every one of them with hand-rolled code?  If not, why
> is only this instance vulnerable?  How can we know whether future uses
> will have a problem?

It seems that *scanf() with %s format occures only here:
- check.c - get_bin_version()
- server.c - get_major_server_version()
- filemap.c - isRelDataFile()
- pg_backup_directory.c - _LoadBlobs()
- xlog.c - do_pg_stop_backup()
- mac.c - macaddr_in()
I think here sscanf() do not works with the UTF-8 characters. And 
probably this is only spell.c issue.

I agree that previous patch is wrong. Instead of using new 
parse_ooaffentry() function maybe better to use sscanf() with %ls 
format. The %ls format is used to read a wide character string.

>
> 2. We're not being very good citizens of the software universe if we
> just install a hack in Postgres rather than nagging Apple to fix the
> bug at its true source.
>
> I think the appropriate next step to take is to dig into the OS X
> sources (see http://www.opensource.apple.com, I think probably the
> relevant code is in the Libc package) and identify exactly what is
> causing the misbehavior.  That would both allow an informed answer
> to point #1 and greatly increase the odds of getting action on a
> bug report to Apple.  Even if we end up applying this patch verbatim,
> I think we need that information first.
>
>             regards, tom lane
>

I think this is not a bug. It is a normal behavior. In Mac OS sscanf() 
with the %s format reads the string one character at a time. The size of 
letter 'х' is 2. And sscanf() separate it into two wrong characters.

In conclusion, I think in spell.c should be used sscanf() with %ls format.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Teodor Sigaev

Date:

10 February 2016, 15:51:37

> It seems that *scanf() with %s format occures only here:
> - check.c - get_bin_version()
> - server.c - get_major_server_version()
> - filemap.c - isRelDataFile()
> - pg_backup_directory.c - _LoadBlobs()
> - xlog.c - do_pg_stop_backup()
> - mac.c - macaddr_in()
> I think here sscanf() do not works with the UTF-8 characters. And probably this
> is only spell.c issue.

Hmm. Here
src/backend/access/transam/xlog.c read_tablespace_map()
using %s in scanf looks suspisious. I don't fully understand but it looks like 
it tries to read oid as string. So, it should be safe in usial case

Next, _LoadBlobs() reads filename (fname) with a help of sscanf. Could file name 
be in UTF-8 encoding here?

>
> I agree that previous patch is wrong. Instead of using new parse_ooaffentry()
> function maybe better to use sscanf() with %ls format. The %ls format is used to
> read a wide character string.
Does %ls modifier exist everewhere?
Apple docs says 
(https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/sscanf.3.html):
s ...  If an l qualifier is present, the next pointer must be a pointer to wchar_t,  into which the input will be
placedafter conversion by mbrtowc
 

Actually, it means that wchar2char() call should be used, but it uses  wcstombs[_l] which could do not present on some
platforms.Does it mean that 
 
l modifier of string presents too or not? What do we need to do if %l exists but 
wcstombs[_l] not?

I'm a bit crazy with locale problems and it seems to me that Artur's patch is 
good idea. Actually, I don't remember exactly, but, seems, commit 
7ac8a4be8946c11d5a6bf91bb971b9750c1c60e5 introduced parse_affentry() instead of 
corresponding sscanf to avoid problems with encoding and scanf.




-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Tom Lane

Date:

10 February 2016, 16:58:09

Artur Zakirov <a.zakirov@postgrespro.ru> writes:
> I agree that previous patch is wrong. Instead of using new 
> parse_ooaffentry() function maybe better to use sscanf() with %ls 
> format. The %ls format is used to read a wide character string.

No, that way is going to give you worse portability problems than what
we have now.  Older implementations won't have %ls, and even if they
do, they might not have wcstombs() which is the only way you'd get from
libc's idea of wide characters to an encoding we recognize.

> I think this is not a bug. It is a normal behavior. In Mac OS sscanf() 
> with the %s format reads the string one character at a time. The size of 
> letter 'х' is 2. And sscanf() separate it into two wrong characters.

That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that.  Why is
only 'х' affected?
        regards, tom lane

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

10 February 2016, 17:06:30

On 10.02.2016 18:51, Teodor Sigaev wrote:
> Hmm. Here
> src/backend/access/transam/xlog.c read_tablespace_map()
> using %s in scanf looks suspisious. I don't fully understand but it
> looks like it tries to read oid as string. So, it should be safe in
> usial case
>
> Next, _LoadBlobs() reads filename (fname) with a help of sscanf. Could
> file name be in UTF-8 encoding here?

This function reads the "blobs.toc" file. It lines have the following 
format:
<uid> <filename>

Where <filename> is blob_<uid>.dat.

Therefore it should be safe too.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Tom Lane

Date:

10 February 2016, 22:19:58

I wrote:
> Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>> I think this is not a bug. It is a normal behavior. In Mac OS sscanf() 
>> with the %s format reads the string one character at a time. The size of 
>> letter 'х' is 2. And sscanf() separate it into two wrong characters.

> That argument might be convincing if OSX behaved that way for all
> multibyte characters, but it doesn't seem to be doing that.  Why is
> only 'х' affected?

I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a clearly
insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
has or once had the same issue.  (A look at the freebsd project on github
says it still does, assuming that's the authoritative repo.)  Not sure
about other BSDen.

I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why
it thinks that, but that explains why 'х' fails when adjacent code points
don't.

So apparently the coding rule we have to adopt is "don't use *scanf()
on data that might contain multibyte characters".  (There might be corner
cases where it'd work all right for conversion specifiers other than %s,
but probably you might as well just use strtol and friends in such cases.)
Ugh.
        regards, tom lane

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Larry Rosenman

Date:

10 February 2016, 22:39:23

On 2016-02-10 16:19, Tom Lane wrote:
> I wrote:
>> Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>>> I think this is not a bug. It is a normal behavior. In Mac OS 
>>> sscanf()
>>> with the %s format reads the string one character at a time. The size 
>>> of
>>> letter 'х' is 2. And sscanf() separate it into two wrong characters.
> 
>> That argument might be convincing if OSX behaved that way for all
>> multibyte characters, but it doesn't seem to be doing that.  Why is
>> only 'х' affected?
> 
> I looked into the OS X sources, and found that indeed you are right:
> *scanf processes the input a byte at a time, and applies isspace() to
> each byte separately, even when the locale is such that that's a 
> clearly
> insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
> has or once had the same issue.  (A look at the freebsd project on 
> github
> says it still does, assuming that's the authoritative repo.)  Not sure
> about other BSDen.
> 
> I also verified that in UTF8-based locales, isspace() thinks that 0x85 
> and
> 0xA0, and no other high-bit-set values, are spaces.  Not sure exactly 
> why
> it thinks that, but that explains why 'х' fails when adjacent code 
> points
> don't.
> 
> So apparently the coding rule we have to adopt is "don't use *scanf()
> on data that might contain multibyte characters".  (There might be 
> corner
> cases where it'd work all right for conversion specifiers other than 
> %s,
> but probably you might as well just use strtol and friends in such 
> cases.)
> Ugh.
> 
>             regards, tom lane
Definitive FreeBSD Sources:

https://svnweb.freebsd.org/base/


-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Tom Lane

Date:

10 February 2016, 23:01:04

Larry Rosenman <ler@lerctr.org> writes:
> On 2016-02-10 16:19, Tom Lane wrote:
>> I looked into the OS X sources, and found that indeed you are right:
>> *scanf processes the input a byte at a time, and applies isspace() to
>> each byte separately, even when the locale is such that that's a
>> clearly insane thing to do.  Since this code was derived from FreeBSD,
>> FreeBSD has or once had the same issue.  (A look at the freebsd project
>> on github says it still does, assuming that's the authoritative repo.)
>> Not sure about other BSDen.

> Definitive FreeBSD Sources:
> https://svnweb.freebsd.org/base/

Ah, thanks for the link.  I'm not totally sure which branch is most
current, but at least on this one, it's still clearly wrong:
https://svnweb.freebsd.org/base/stable/10/lib/libc/stdio/vfscanf.c?revision=291336&view=markup
convert_string(), which handles %s, applies isspace() to individual bytes
regardless of locale.  convert_wstring(), which handles %ls, does it more
intelligently ... but as I said upthread, relying on %ls would just give
us a different set of portability problems.

It looks like Artur's patch is indeed what we need to do, along with
looking around for other *scanf() uses that are vulnerable.
        regards, tom lane

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Larry Rosenman

Date:

10 February 2016, 23:15:21

On 2016-02-10 17:00, Tom Lane wrote:
> Larry Rosenman <ler@lerctr.org> writes:
>> On 2016-02-10 16:19, Tom Lane wrote:
>>> I looked into the OS X sources, and found that indeed you are right:
>>> *scanf processes the input a byte at a time, and applies isspace() to
>>> each byte separately, even when the locale is such that that's a
>>> clearly insane thing to do.  Since this code was derived from 
>>> FreeBSD,
>>> FreeBSD has or once had the same issue.  (A look at the freebsd 
>>> project
>>> on github says it still does, assuming that's the authoritative 
>>> repo.)
>>> Not sure about other BSDen.
> 
>> Definitive FreeBSD Sources:
>> https://svnweb.freebsd.org/base/
> 
> Ah, thanks for the link.  I'm not totally sure which branch is most
> current, but at least on this one, it's still clearly wrong:
> https://svnweb.freebsd.org/base/stable/10/lib/libc/stdio/vfscanf.c?revision=291336&view=markup
> convert_string(), which handles %s, applies isspace() to individual 
> bytes
> regardless of locale.  convert_wstring(), which handles %ls, does it 
> more
> intelligently ... but as I said upthread, relying on %ls would just 
> give
> us a different set of portability problems.
> 
> It looks like Artur's patch is indeed what we need to do, along with
> looking around for other *scanf() uses that are vulnerable.
> 
>             regards, tom lane

that would be the current 10.x tree, production, and getting ready for 
10.3 which is in code slush.

If you want, file a bug at https://bugs.freebsd.org/bugzilla


-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Tom Lane

Date:

10 February 2016, 23:59:24

Larry Rosenman <ler@lerctr.org> writes:
> If you want, file a bug at https://bugs.freebsd.org/bugzilla

Probably not much point; the commit log shows pretty clearly that they
have been thinking about the code's behavior with multibyte characters,
so I assume they've intentionally decided to keep it like this.
        regards, tom lane

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Tom Lane

Date:

11 February 2016, 00:35:38

Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>  [ tsearch_aff_parse_v1.patch ]

I've pushed this with some corrections --- notably, I did not like the
lack of buffer-overrun prevention, and it did the wrong thing if a line
had more than one trailing space character.

We still need to look at other uses of *scanf(), but I think that any
other changes that might be needed should be separate patches anyhow.
        regards, tom lane

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Chapman Flack

Date:

11 February 2016, 04:51:36

On 02/10/16 17:19, Tom Lane wrote:

> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
> 0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why

Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively.

http://unicode.org/standard/reports/tr13/tr13-5.html
http://unicode.org/charts/PDF/U0080.pdf

-Chap

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Tom Lane

Date:

11 February 2016, 04:56:15

Chapman Flack <chap@anastigmatix.net> writes:
> On 02/10/16 17:19, Tom Lane wrote:
>> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
>> 0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why

> Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively.

Yeah, I got that --- what seems squishier is that none of the other C1
control characters are considered whitespace?
        regards, tom lane

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Chapman Flack

Date:

11 February 2016, 05:20:24

On 02/10/16 23:55, Tom Lane wrote:

> Yeah, I got that --- what seems squishier is that none of the other C1
> control characters are considered whitespace?

That seems to be exactly the case:

http://www.unicode.org/Public/5.2.0/ucd/PropList.txt

09..0D, 20, 85, and A0 are the only whitespace chars whose codepoints
fit in a byte.

-Chap

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

11 February 2016, 08:14:58

On 11.02.2016 01:19, Tom Lane wrote:
> I wrote:
>> Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>>> I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
>>> with the %s format reads the string one character at a time. The size of
>>> letter 'Ñ…' is 2. And sscanf() separate it into two wrong characters.
>
>> That argument might be convincing if OSX behaved that way for all
>> multibyte characters, but it doesn't seem to be doing that.  Why is
>> only 'Ñ…' affected?
>
> I looked into the OS X sources, and found that indeed you are right:
> *scanf processes the input a byte at a time, and applies isspace() to
> each byte separately, even when the locale is such that that's a clearly
> insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
> has or once had the same issue.  (A look at the freebsd project on github
> says it still does, assuming that's the authoritative repo.)  Not sure
> about other BSDen.
>
> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
> 0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why
> it thinks that, but that explains why 'Ñ…' fails when adjacent code points
> don't.
>
> So apparently the coding rule we have to adopt is "don't use *scanf()
> on data that might contain multibyte characters".  (There might be corner
> cases where it'd work all right for conversion specifiers other than %s,
> but probably you might as well just use strtol and friends in such cases.)
> Ugh.
>
>             regards, tom lane
>

Yes, I meant this. The second byte divides the word into two wrong pieces.

Sorry for my unclear explanation. I should to explain more clearly.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Mac OS: invalid byte sequence for encoding "UTF8"

From

Artur Zakirov

Date:

11 February 2016, 08:15:18

On 11.02.2016 03:33, Tom Lane wrote:
> Artur Zakirov <a.zakirov@postgrespro.ru> writes:
>>   [ tsearch_aff_parse_v1.patch ]
>
> I've pushed this with some corrections --- notably, I did not like the
> lack of buffer-overrun prevention, and it did the wrong thing if a line
> had more than one trailing space character.
>
> We still need to look at other uses of *scanf(), but I think that any
> other changes that might be needed should be separate patches anyhow.
>
>             regards, tom lane
>

Thank you!

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company