Re: Tsearch2 and Unicode? - Mailing list pgsql-general
From | Oleg Bartunov |
---|---|
Subject | Re: Tsearch2 and Unicode? |
Date | |
Msg-id | Pine.GSO.4.61.0411221645540.24069@ra.sai.msu.su |
In response to | Re: Tsearch2 and Unicode? ("Markus Wollny" <Markus.Wollny@computec.de>) |
List | pgsql-general |
Markus,

it'd be nice if you (or somebody) wrote a note about Unicode, so it could be
added to the tsearch2 documentation. It will help people and save time and hair :)

Oleg

On Mon, 22 Nov 2004, Markus Wollny wrote:

> Hi!
>
> I dug through my list archives - I actually used to have the very same
> problem that you described: special chars being swallowed by tsearch2
> functions. The source of the problem was that I had INITDB'ed my cluster
> with DE@euro as locale, whereas my databases used Unicode encoding. This
> does not work correctly. I had to dump, initdb with the correct UTF-8
> locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work
> correctly. You may find the original discussion here:
> http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
>
> If you wish to find out which locale was used during INITDB for your
> cluster, you may use the pg_controldata program that's supplied with
> PostgreSQL.
>
> Kind regards
>
> Markus
>
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org
>> [mailto:pgsql-general-owner@postgresql.org] On behalf of
>> Dawid Kuroczko
>> Sent: Wednesday, 17 November 2004 17:17
>> To: Pgsql General
>> Subject: [GENERAL] Tsearch2 and Unicode?
>>
>> I'm trying to use tsearch2 with a database which is in
>> 'UNICODE' encoding.
>> It works fine for English text, but as I intend to search
>> Polish texts I did:
>>
>> insert into pg_ts_cfg values ('default_polish', 'default',
>> 'pl_PL.UTF-8'); (and I updated the other pg_ts_* tables as
>> written in the manual).
>>
>> However, Polish-specific chars are being eaten alive, it seems.
>> I.e. doing select to_tsvector('default_polish', body) from
>> messages; results in a list of words but with national chars stripped...
>>
>> I wonder, am I doing something wrong, or does tsearch2 just
>> not grok Unicode, despite the locale setting? This is also
>> a good question regarding ispell_dict and its feelings
>> regarding Unicode, but that's another story.
>>
>> Assuming Unicode is unsupported means I should perhaps... oh,
>> convert the data to iso8859 prior to feeding it to to_tsvector()...
>> interesting idea, but so far I have failed to actually do
>> it. Maybe store the data as 'bytea' and add a column with
>> encoding information (assuming I don't want to recreate the whole
>> database with a new encoding, and that I want to use Unicode
>> for some columns, so I don't have to keep the encoding with every
>> text everywhere...).
>>
>> And while we are at it, how do you feel -- an extra column
>> with tsvector and its index -- would it be OK to keep it away
>> from my data (so I can safely get rid of them if need be)?
>> [ I intend to keep an index of around 2 000 000 records, a few KBs
>> of text each ]...
>>
>> Regards,
>> Dawid Kuroczko
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 5: Have you checked our extensive FAQ?
>>
>> http://www.postgresql.org/docs/faqs/FAQ.html
>>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
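
The pg_controldata check that Markus mentions can be sketched as a small script. This is only a minimal Python sketch, not part of tsearch2 or PostgreSQL: the helper names `parse_cluster_locale` and `locale_is_utf8` are hypothetical, and it assumes that pg_controldata prints `LC_COLLATE:` and `LC_CTYPE:` lines in `name: value` form, as releases of that era did. The sample text is made up for illustration and is not output from any real cluster.

```python
import re


def parse_cluster_locale(controldata_output: str) -> dict:
    """Extract LC_COLLATE and LC_CTYPE from pg_controldata-style text.

    Assumption: each setting appears on its own line as "NAME:   value".
    """
    locale = {}
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("LC_COLLATE", "LC_CTYPE"):
            locale[key.strip()] = value.strip()
    return locale


def locale_is_utf8(locale_name: str) -> bool:
    """Return True if the locale name names a UTF-8 codeset (e.g. de_DE.UTF-8)."""
    # The codeset sits between '.' and an optional '@modifier'.
    match = re.search(r"\.([^@]+)", locale_name)
    if not match:
        return False  # e.g. "de_DE@euro" or "C": no UTF-8 codeset named
    codeset = match.group(1).replace("-", "").lower()
    return codeset == "utf8"


# Hypothetical sample input, shaped like the assumed output format.
SAMPLE = """\
LC_COLLATE:                           de_DE@euro
LC_CTYPE:                             de_DE@euro
"""

if __name__ == "__main__":
    for name, value in parse_cluster_locale(SAMPLE).items():
        status = "UTF-8" if locale_is_utf8(value) else "not UTF-8"
        print(f"{name}: {value} ({status})")
```

If the cluster reports a non-UTF-8 locale while the database uses UNICODE encoding, the remedy described in the thread applies: dump, initdb with a UTF-8 locale, and reload.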