Thread: 0xc3 error Text Search Windows French

0xc3 error Text Search Windows French

From

Andrew

Date:

25 June 2008, 15:21:57

I have a feeling that an issue I'm running into is related to this:
http://archives.postgresql.org/pgsql-bugs/2008-06/msg00113.php

On Windows XP, running PgAdmin III 1.8.4 against either a PostgreSQL 8.3.0
or 8.3.3 database, when I run:

select * from ts_debug('french', 'catalogue');

I get the following error:

ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  SQL function "ts_debug" statement 1
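
For readers unfamiliar with the byte in the message: 0xc3 is the lead byte
of a two-byte UTF-8 sequence, and in French text it typically begins an
accented letter such as é (0xc3 0xa9).  The Python sketch below is an
illustration added for context, not something from the original report; the
sample word and the helper name is_valid_utf8 are assumptions.  It shows that
such a byte is valid within its pair and only becomes an "invalid byte
sequence" when it is read on its own.

```python
# Illustration only: how a lone 0xc3 byte arises from valid UTF-8 text.
word = "été"  # sample French word with accented letters (an assumption)
encoded = word.encode("utf-8")
print(encoded.hex(" "))  # c3 a9 74 c3 a9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


whole_ok = is_valid_utf8(encoded)          # complete characters
lone_lead_ok = is_valid_utf8(encoded[:1])  # 0xc3 cut off from 0xa9
```

Here whole_ok is True and lone_lead_ok is False, which is consistent with an
error that names 0xc3 alone.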

I have replaced the french.stop file with the one from the Snowball web
site (http://snowball.tartarus.org/algorithms/french/stemmer.html) to
see if that would make any difference, but the issue remains.  I have also
attempted to load the French Hunspell dictionary from the OpenOffice
web site (http://wiki.services.openoffice.org/wiki/Dictionaries), using
the following command:

CREATE TEXT SEARCH DICTIONARY public.fr_ispell (
    TEMPLATE = pg_catalog.ispell,
    DictFile = fr_FR,
    AffFile = fr_FR,
    StopWords = french
);

But I get the same error.  I have successfully loaded the English and
Arabic dictionaries and an Arabic stop file I sourced from elsewhere,
and they work fine with the various text search function calls, so the
problem appears to be specifically related to a French character occurring
in the stop file and the dictionaries.  To use the French OO dictionaries,
I had to convert them from ISO-8859-15 to UTF-8.  As converting them on
Windows gave the same result as the packaged stop file, I downloaded them
again and converted the encoding on a Linux machine before copying them
across to Windows to see if that would help, but it didn't.
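
The conversion step described above can be sketched in Python.  This is only
an illustration of re-encoding ISO-8859-15 data as UTF-8, not the tool that
was actually used; the function names and the file names in the usage note
are hypothetical.

```python
def latin9_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-15 (Latin-9) bytes as UTF-8."""
    return data.decode("iso8859_15").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read an ISO-8859-15 file and write a UTF-8 copy (no BOM is added)."""
    # Binary mode avoids any platform-specific newline translation.
    with open(src_path, "rb") as src:
        converted = latin9_to_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)


# In ISO-8859-15, é is the single byte 0xe9; in UTF-8 it becomes 0xc3 0xa9.
sample = latin9_to_utf8(b"\xe9t\xe9")
print(sample.hex(" "))  # c3 a9 74 c3 a9
```

A call such as convert_file("fr_FR.orig", "fr_FR.utf8") would write the
converted copy; both file names are placeholders.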

However, if I run ts_debug('french', 'catalogue') against a Linux
build of PostgreSQL 8.3.1, it works fine.  I have not tried version
8.3.1 on Windows.  While there are a lot more combinations to exhaust
before I can make a categorical statement, at this stage it appears to
point towards an issue with UTF-8 handling in PostgreSQL on Windows.

Is this an outstanding defect, or is there something that I'm doing
wrong in my environment?  I have searched the Internet for anything
related, but other than the report linked above, I have not found
anything, which surprises me given what I imagine to be the size of the
French user base.  Hence, I'm thinking that perhaps something in my
environment is causing the issue.  If others could also reproduce the
error on their XP machines, that would indicate that the issue is not
specific to me.

At this stage, it is not that important to me, as I'm just playing
around with text search for my own curiosity and French was just a
language I have randomly picked, along with Arabic (for which I'm
lacking a snowball stemmer).  I don't actually read, much less speak
those languages.  However, it would still be nice to have them working.

An additional related topic.  For some languages, OO has thesaurus
files which are not in the format supported by Pg Full Text
Search.  Are there any plans to support the OO thesaurus file formats?
OO also has hyphenation files.  Are there any plans to extend the
current dictionary files to include hyphenation rules as captured in the
OO hyphenation files?  I'm not sure how, if at all, hyphenation rules
would improve indexing and searches, but as the files exist, I thought
I would pose the question.

Thanks,

Andy

Re: 0xc3 error Text Search Windows French

From

Andrew

Date:

25 June 2008, 15:49:52

Sorry, one last detail.

All of my databases are encoded in UTF-8.  My Windows XP locale is en_AU
and defaults to the ISO-8859-1 character set.  My postgresql.conf leaves
client_encoding at its default, which should then fall back to
the database encoding, UTF-8.

Re: 0xc3 error Text Search Windows French

From

Andrew

Date:

25 June 2008, 16:29:13

One additional aspect.  I just ran the create text search dictionary
command without the StopWords declaration using the OO dictionaries, and
it worked fine: select ts_lexize('public.fr_ispell', 'catalogue');
executes with no problems.  However, after creating an associated
configuration based on a copy of the pg_catalog.french configuration,
calls to ts_debug against my custom French config result in the 0xc3
error.  So it looks like the problem is restricted to the parsing of
the stop file.

I ran through the other stemmers supplied out of the box, which I have
not touched in any way, and the error also occurs with the Portuguese
configuration.

Cheers

Andy

Re: 0xc3 error Text Search Windows French

From

Alvaro Herrera

Date:

25 June 2008, 16:38:29

Andrew wrote:
> I have a feeling that an issue I'm running into is related to this:
> http://archives.postgresql.org/pgsql-bugs/2008-06/msg00113.php
>
> On Windows XP running PgAdmin III 1.8.4 against either PostgreSQL 8.3.0
> or 8.3.3 DB, when attempting to do a:
>
> select * from ts_debug('french', 'catalogue');
>
> getting the following error:
>
> ERROR:  invalid byte sequence for encoding "UTF8": 0xc3

This is probably a bug fixed after 8.3.3:

http://git.postgresql.org/?p=postgresql.git;a=commitdiff;h=6bcf5a85233f7a039648990d1037119efb61146d
(This is the HEAD version of the patch; 8.3 should be identically
patched.)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: 0xc3 error Text Search Windows French

From

Alvaro Herrera

Date:

26 June 2008, 09:50:41

Andrew wrote:
> Thanks Alvaro.

Please don't forget to CC the list.

> Reading the source for the patch, I can see how that should address the
> issue.  Though I don't really understand how it is working in Linux but
> not on Windows.  I assume that Linux OS is passing the UTF-8 character
> and Windows is passing a localised character, despite the format of the
> file being read, hence the need for pg_mblen, rather than just
> incrementing the pointer to the next character.  Not that it matters.
> If it was a Linux bug, I would just download the trunk and rebuild.

Windows is weird in its multibyte handling, and if the server encoding on
the Linux server is set to a single-byte encoding (or maybe with multibyte
encodings too), then it wouldn't surprise me that one worked and the other
didn't.
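
The difference discussed above, between incrementing a pointer one byte at a
time and advancing by the character's byte length (which is what pg_mblen
reports inside the server), can be sketched in Python.  This illustrates the
general technique only and is not the code from the patch; utf8_char_len is
a hypothetical stand-in for pg_mblen under a UTF-8 server encoding.

```python
from typing import List


def utf8_char_len(lead: int) -> int:
    """Byte length of a UTF-8 character, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#04x}")


def split_bytewise(buf: bytes) -> List[bytes]:
    """Advance one byte per step, which splits multibyte characters apart."""
    return [buf[i:i + 1] for i in range(len(buf))]


def split_charwise(buf: bytes) -> List[bytes]:
    """Advance by each character's length, keeping its bytes together."""
    chunks = []
    i = 0
    while i < len(buf):
        n = utf8_char_len(buf[i])
        chunks.append(buf[i:i + n])
        i += n
    return chunks


word = "été".encode("utf-8")          # bytes c3 a9 74 c3 a9
bytewise = split_bytewise(word)       # first chunk is a lone 0xc3
charwise = split_charwise(word)       # each chunk is a whole character
```

Any per-character check applied to the bytewise chunks sees 0xc3 on its own,
whereas every charwise chunk decodes cleanly.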

> However, I don't have a Windows C++ build environment and am not that
> interested in creating one at the moment.  So I will continue playing
> with text search, and download the next Windows version once it is
> available.  Any indication of when the next release is scheduled?

No idea -- it will be released when "enough bugs are fixed" (or very
quickly if a security problem is found), so you could be waiting for a
while.

--
Alvaro Herrera                  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"Amanece.                                               (Ignacio Reyes)
 El Cerro San Cristóbal me mira, cínicamente, con ojos de virgen"