Thread: tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3

hi,

i have an issue with tseach2, i just installed postgresql 8.3.1 on windows
using UTF8 server encoding / client encoding and LOCALE Canada / French.

CREATE DATABASE mydbWITH OWNER = me ENCODING = 'UTF8';
CREATE TABLE product
(
  product_id       SERIAL         NOT NULL,
  name             VARCHAR(50)    NOT NULL,
  description      VARCHAR(1000)  NOT NULL,
  CONSTRAINT pk_product PRIMARY KEY (product_id)
);
ALTER TABLE product ADD COLUMN search_vector tsvector;
CREATE INDEX idx_search_vector ON product USING gist(search_vector);

UPDATE product SET search_vector = setweight(to_tsvector(name), 'A') ||
to_tsvector(description);

ERROR: invalid byte sequence for encoding "UTF8": 0xc3
SQL state: 22021
Hint: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".

any idea why it's happening?
pat

patrick wrote:
> hi,
>
> i have an issue with tseach2, i just installed postgresql 8.3.1 on
> windows using UTF8 server encoding / client encoding and LOCALE Canada /
> French.

> UPDATE product SET search_vector = setweight(to_tsvector(name), 'A') ||
> to_tsvector(description);
>
> ERROR: invalid byte sequence for encoding "UTF8": 0xc3
> SQL state: 22021
> Hint: This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".

Can you identify which row(s) are causing this problem? If we have the
value that's causing this, someone can reproduce it.

--
   Richard Huxton
   Archonet Ltd

> Can you identify which row(s) are causing this problem? If we have the
> value that's causing this, someone can reproduce it.

i have only 1 row:
46; "the product name"; "the description";

i don't see any specials chars or accents.

knowing that some of my clients are french, should i use LATIN9 as database
encoding / client encoding? or maybe it's because of the LOCALE
(French/Canada)?

here's the code again:
ALTER TABLE product ADD COLUMN search_vector tsvector;
CREATE INDEX idx_search_vector ON product USING gist(search_vector);
UPDATE product SET search_vector = setweight(to_tsvector(name), 'A') ||
to_tsvector(description);

pat


patrick wrote:
>> Can you identify which row(s) are causing this problem? If we have the
>> value that's causing this, someone can reproduce it.
>
> i have only 1 row:
> 46; "the product name"; "the description";
>
> i don't see any specials chars or accents.

I think I've reproduced it here, and it's not your data.

> knowing that some of my clients are french, should i use LATIN9 as
> database encoding / client encoding? or maybe it's because of the LOCALE
> (French/Canada)?

> UPDATE product SET search_vector = setweight(to_tsvector(name), 'A') ||
> to_tsvector(description);

It's your script. It's got a character that's not UTF8 and you've told
the system that's what your client_encoding is. I think it's "|".

See if you can reproduce it with: SELECT 'abc'::text || 'def::text;

I got the error by editing the script at a command-prompt. If you use
notepad or some other MS-Windows (TM) based editor it should let you
choose ANSI as the format to save in.

--
   Richard Huxton
   Archonet Ltd

patrick wrote:
> SELECT 'abc'::text || 'def'::text;
> it's working fine (no need to convert the query ASCII to UTF8 or such

OK, now try each of these in turn:
UPDATE product SET search_vector = to_tsvector(name);

UPDATE product SET search_vector = setweight(to_tsvector(name), 'A');

UPDATE product SET search_vector = setweight(to_tsvector(name), 'A') ||
to_tsvector(description);

If they work when you're typing them in pgadmin but not in the script
file then it's the script file. I copied your code and got a similar
error, except my table was empty. No data. So I knew it wasn't
complaining about the data itself.

> i am using pgadmin (1.8.2) to pass the query:
> show client_encoding = UNICODE.
>
> in postgresql.conf i have:
> client_encoding; Value = UTF8, Current value = UNICODE;
> i tried to restart postgresql for reloading his configuration, but it's
> still UNICODE...

The issue is what characters were in your script file.

> reinstalling postgresql 8.3.1 on windows and leaving the server encoding
> and client encoding to default (cannot remember what was it - something
> like WIN1252 or such) then i have no more problem, but i am worry about
> the impact of transfering the database to linux (eventually i would like
> to transfer my database to my linux server)...

There are three things to remember:
1. Is my database encoding supported by all my backends (and there you
are fine).
2. Can PG translate from my client-encoding to the database encoding
(European-based stuff should be fine).
3. Am I sending data in the client-encoding I've told PG I'm using.

I still think it's #3 that's causing you problems.

--
   Richard Huxton
   Archonet Ltd

Richard Huxton <dev@archonet.com> writes:
> The issue is what characters were in your script file.

I'm wondering about non-UTF8 characters in the dictionary file(s) used
by the text search configuration.  Failure to load a configuration
file would explain why it only shows up in tsearch-related queries.

            regards, tom lane

Missed the mailing list on the last reply

Richard Huxton wrote:
> patrick wrote:
>> hi richard,
>>
>> thanks for your help! i found something... but first let me answer
>> your question:
>>
>>> UPDATE product SET search_vector = to_tsvector(name);
>>> UPDATE product SET search_vector = setweight(to_tsvector(name), 'A');
>>> UPDATE product SET search_vector = setweight(to_tsvector(name), 'A')
>>> || to_tsvector(description);
>>
>> thoses queries are not working, same message:
>> ERROR: invalid byte sequence for encoding "UTF8": 0xc3
>
> Hmm. OK. Can reproduce that here, but only...
>
>> what i found is in postgresql.conf if i change:
>> default_text_search_config from pg_catalog.french to
>> pg_catalog.english then the query is working fine.
>
> with a "french" configuration. Not only english, but also italian,
> german etc. all seem to work here on Windows 8.3.1.
>
> However, "french" works fine with 8.3.0 compiled from source on Linux.
>
> Comparing the two french.stop lists of stopwords (look in
> .../share/tsearch_data) they are identical.
>
> That leaves the snowball stemming library itself. There seem to be two
> source files for these in src/backend/snowball/libstemmer, one for
> ISO8859-1 and one for UTF-8. These files seem identical between 8.3.0
> and 8.3.1 (assuming I'm working anoncvs.postgresql.org properly).
>
> Possibly a build problem on Windows? I'll test against 8.3.1 on Linux if
> I get a chance.

No changes (from diff -r) between the source on 8.3.0 and 8.3.1 for the
backend/snowball directories. Looks like someone with a Windows build
environment would be useful.

--
   Richard Huxton
   Archonet Ltd

Richard Huxton <dev@archonet.com> writes:
> Missed the mailing list on the last reply
>> patrick wrote:
>>> thoses queries are not working, same message:
>>> ERROR: invalid byte sequence for encoding "UTF8": 0xc3
>>>
>>> what i found is in postgresql.conf if i change:
>>> default_text_search_config from pg_catalog.french to
>>> pg_catalog.english then the query is working fine.

I am just about convinced the problem is with french.stop.

There is more to that error message than meets the eye: 0xc3 is a valid
first byte for a two-byte UTF8 character, so the only way that the
message would look just like that is if 0xc3 is the last byte in the
presented string.  Looking at french.stop, the only plausible place for
this to happen is the line

à

(that's \303\240 or 0xc3 0xa0).  I am thinking that something decided
the \240 was junk and removed it.

I wonder whether the dictionaries ought not be reading their data files
in binary mode.  They appear to all be using AllocateFile(filename, "r")
which means that we're at the mercy of whatever text-mode conversion
Windows feels like doing.

            regards, tom lane

SELECT 'abc'::text || 'def'::text;
it's working fine (no need to convert the query ASCII to UTF8 or such

i am using pgadmin (1.8.2) to pass the query:
show client_encoding = UNICODE.

in postgresql.conf i have:
client_encoding; Value = UTF8, Current value = UNICODE;
i tried to restart postgresql for reloading his configuration, but it's
still UNICODE...

reinstalling postgresql 8.3.1 on windows and leaving the server encoding and
client encoding to default (cannot remember what was it - something like
WIN1252 or such) then i have no more problem, but i am worry about the
impact of transfering the database to linux (eventually i would like to
transfer my database to my linux server)...

pat

Re: tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3

From
Martijn van Oosterhout
Date:
On Wed, Mar 19, 2008 at 07:55:40PM -0400, Tom Lane wrote:
> (that's \303\240 or 0xc3 0xa0).  I am thinking that something decided
> the \240 was junk and removed it.

Hmm, it is coincidently the space character +0x80, which is defined as
a non-breaking space in many Latin encodings. Perhaps ctype decided it
was a space, or sscanf didn't read it...

Have  anice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Wed, Mar 19, 2008 at 07:55:40PM -0400, Tom Lane wrote:
>> (that's \303\240 or 0xc3 0xa0).  I am thinking that something decided
>> the \240 was junk and removed it.

> Hmm, it is coincidently the space character +0x80, which is defined as
> a non-breaking space in many Latin encodings.

Yeah, that's what I'm thinking about.  I poked around in Microsoft's
documentation and couldn't find any suggestion that fgets() would
remove such a character, however.

Another possible theory is that the french.stop file got edited using
something that had the wrong idea about the file's encoding, and
proceeded to throw away the nbsp.

            regards, tom lane