Re: 0xc3 error Text Search Windows French - Mailing list pgsql-general
From | Andrew |
---|---|
Subject | Re: 0xc3 error Text Search Windows French |
Date | |
Msg-id | 48629340.2030704@pacific.net.au |
In response to | 0xc3 error Text Search Windows French (Andrew <archa@pacific.net.au>) |
List | pgsql-general |
Sorry, one last detail. All of my databases are in UTF-8 format. My
Windows XP is en_AU and defaults to the ISO-8859-1 character set. My
postgresql.conf uses the default for the client_encoding setting, which
should then default to the database's UTF-8 encoding.

Andrew wrote:
> One additional aspect. I just ran the create text search dictionary
> command without the stop file declaration, using the OO dictionaries,
> and it worked fine: the command
>
>     select ts_lexize('public.fr_ispell', 'catalogue');
>
> executed with no problems. However, after creating an associated
> catalogue based on a copy of the pg_catalog.french catalogue, calls to
> ts_debug against my custom French config result in the 0xc3 error. So
> it is looking like the problem is restricted to the parsing of the
> stop file.
>
> I ran through the other stemmers supplied out of the box, which I have
> not touched in any way, and it is also occurring with the portuguese
> catalogue.
>
> Cheers
>
> Andy
>
> Andrew wrote:
>> I have a feeling that an issue I'm running into is related to this:
>> http://archives.postgresql.org/pgsql-bugs/2008-06/msg00113.php
>>
>> On Windows XP, running PgAdmin III 1.8.4 against either a PostgreSQL
>> 8.3.0 or 8.3.3 DB, when attempting to do a:
>>
>>     select * from ts_debug('french', 'catalogue');
>>
>> I get the following error:
>>
>>     ERROR: invalid byte sequence for encoding "UTF8": 0xc3
>>     HINT: This error can also happen if the byte sequence does not
>>     match the encoding expected by the server, which is controlled by
>>     "client_encoding".
>>     CONTEXT: SQL function "ts_debug" statement 1
>>
>> I have replaced the french.stop file with the one from the snowball
>> web site
>> (http://snowball.tartarus.org/algorithms/french/stemmer.html) to see
>> if that would make any difference, but the issue is the same. I have
>> also attempted to load the French Hunspell dictionary from the Open
>> Office web site
>> (http://wiki.services.openoffice.org/wiki/Dictionaries), using the
>> following command:
>>
>>     CREATE TEXT SEARCH DICTIONARY public.fr_ispell (
>>         TEMPLATE = pg_catalog.ispell,
>>         DictFile = fr_FR,
>>         AffFile = fr_FR,
>>         StopWords = french
>>     );
>>
>> But I get the same error. I have successfully loaded the English and
>> Arabic dictionaries, and an Arabic stop file I sourced from
>> elsewhere, and they work fine with the various text search function
>> calls, so it appears to be specifically related to a French character
>> occurring in the stop file and the dictionaries. To use the French OO
>> dictionaries, I had to convert them from ISO-8859-15 character set
>> encoding to UTF-8. As converting on Windows still gave the same
>> result as the packaged stop file, I downloaded them and converted the
>> encoding on a Linux machine before copying them across to Windows to
>> see if that would help, but it didn't.
>>
>> However, if I run ts_debug('french', 'catalogue'); against a Linux
>> version of PostgreSQL 8.3.1, it works fine. I have not tried version
>> 8.3.1 on Windows. While there are a lot more combinations to exhaust
>> before I can make a categorical statement, at this stage it appears
>> to be pointing towards an issue with the UTF-8 parser of PostgreSQL
>> on Windows.
>>
>> Is this an outstanding defect, or is there something that I'm doing
>> wrong in my environment? I have attempted to find anything related on
>> the Internet, but other than the introductory reference, I have not
>> found anything, which surprises me given what I would imagine to be
>> the size of the French user base. Hence, I'm thinking that perhaps it
>> may be something in my environment causing the issue. If others could
>> also reproduce the error on their XP machines, that would indicate
>> that the issue is not something specific just to me.
>>
>> At this stage, it is not that important to me, as I'm just playing
>> around with text search for my own curiosity, and French was just a
>> language I picked at random, along with Arabic (for which I'm lacking
>> a snowball stemmer). I don't actually read, much less speak, those
>> languages. However, it would still be nice to have them working.
>>
>> An additional related topic. For some languages, OO has thesaurus
>> files which are not in the format supported by Pg Full Text Search.
>> Are there any plans to support the OO thesaurus file formats? They
>> also have hyphenation files. Are there any plans to extend the
>> current dictionary files to include hyphenation rules as captured in
>> the OO hyphenation files? I'm not sure how, if at all, hyphenation
>> rules would improve indexing and searches, but as the files exist, I
>> thought I would pose the question.
>>
>> Thanks,
>>
>> Andy
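For anyone repeating the ISO-8859-15 to UTF-8 conversion described above, here is a minimal Python sketch of one way to re-encode a dictionary or stop file and then confirm that the result really is valid UTF-8. The function names are hypothetical and are not part of PostgreSQL or any of its tools. The sketch only checks the files themselves; it says nothing about how the server reads them, so a file that passes here may still trigger the error if the problem lies elsewhere. As background, 0xc3 is the lead byte of a two-byte UTF-8 sequence (for example, é is 0xC3 0xA9), so the error message suggests a lead byte that was seen without its continuation byte.

```python
from typing import Optional


def convert_latin9_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-15 (Latin-9) bytes as UTF-8."""
    return data.decode("iso-8859-15").encode("utf-8")


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the whole buffer is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def convert_file(src_path: str, dst_path: str) -> None:
    """Read an ISO-8859-15 file and write a UTF-8 copy (paths are examples)."""
    with open(src_path, "rb") as src:
        converted = convert_latin9_to_utf8(src.read())
    # Sanity check: the output must decode cleanly as UTF-8.
    assert first_invalid_utf8_offset(converted) is None
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

Running first_invalid_utf8_offset over the bytes of french.stop or the converted fr_FR files would at least rule out a bad conversion as the cause.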