Re: integrated tsearch doesn't work with non utf8 database - Mailing list pgsql-hackers
From: Oleg Bartunov
Subject: Re: integrated tsearch doesn't work with non utf8 database
Date:
Msg-id: Pine.LNX.4.64.0709081017260.2767@sn.sai.msu.ru
In response to: Re: integrated tsearch doesn't work with non utf8 database ("Heikki Linnakangas" <heikki@enterprisedb.com>)
List: pgsql-hackers
On Fri, 7 Sep 2007, Heikki Linnakangas wrote:

> Pavel Stehule wrote:
>> postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
>> ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
>> CONTEXT:  SQL function "ts_debug" statement 1
>
> I can reproduce that. In fact, you don't need the custom config or
> dictionary at all:
>
> postgres=# CREATE DATABASE latin2 encoding='latin2';
> CREATE DATABASE
> postgres=# \c latin2
> You are now connected to database "latin2".
> latin2=# select ts_debug('simple','foo');
> ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT:  SQL function "ts_debug" statement 1
>
> It fails trying to lexize the string using the danish snowball stemmer,
> because the danish stopword file contains the character 'å', which doesn't
> have an equivalent in LATIN2.
>
> Now what the heck is it doing with the danish stemmer, you might ask.
> ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
> behind it, I get this plan:
>
> latin2=# \i foo.sql
>                                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=2.80..1134.45 rows=80 width=100)
>    Hash Cond: (parse.tokid = tt.tokid)
>    InitPlan
>      ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
>            Filter: (oid = 3748::oid)
>      ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
>            Filter: (oid = 3748::oid)
>    ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
>    ->  Hash  (cost=0.20..0.20 rows=16 width=68)
>          ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
>    SubPlan
>      ->  Limit  (cost=7.33..7.36 rows=1 width=36)
>            ->  Subquery Scan dl  (cost=7.33..7.36 rows=1 width=36)
>                  ->  Sort  (cost=7.33..7.34 rows=1 width=8)
>                        Sort Key: m.mapseqno
>                        ->  Seq Scan on pg_ts_config_map m  (cost=0.00..7.32 rows=1 width=8)
>                              Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
>      ->  Sort  (cost=6.57..6.57 rows=1 width=8)
>            Sort Key: m.mapseqno
>            ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
>                  Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
> (21 rows)
>
> Note the Seq Scan on pg_ts_config_map, with a filter on ts_lexize(mapdict,
> $1). That means that it will call ts_lexize on every dictionary, which
> will try to load every dictionary. And loading the danish_stem dictionary
> fails in latin2 encoding, because of the problem with the stopword file.
>
> We could rewrite ts_debug as a C function, so that it doesn't try to
> access any unnecessary dictionaries.

ts_debug currently doesn't work well with the thesaurus dictionary either,
so it certainly needs to be rewritten in C. We left rewriting it for the
future.

> It seems wrong to install dictionaries in databases where they won't work
> in the first place, but I don't see an easy fix for that. Any comments or
> better ideas?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
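[Editor's sketch, not part of the original mail: the SubPlan Heikki points at can be reproduced in isolation. This is an approximation of the per-token dictionary lookup behind ts_debug; the configuration name 'english' and token type 1 are illustrative assumptions, and the explicit ::regdictionary cast is added for clarity.]

```sql
-- For one token, pick the first dictionary in the configuration's map
-- that returns a lexeme. The ts_lexize() call in the WHERE clause is
-- what forces *every* mapped dictionary to be loaded -- including
-- danish_stem and its stopword file, which is where the encoding
-- error is raised.
SELECT m.mapdict::regdictionary
FROM pg_ts_config_map AS m
WHERE m.mapcfg = 'english'::regconfig::oid   -- hypothetical config
  AND m.maptokentype = 1                     -- token type from ts_parse
  AND ts_lexize(m.mapdict::regdictionary, 'foo') IS NOT NULL
ORDER BY m.mapseqno
LIMIT 1;
```

A C implementation could instead fetch only the rows matching the token type and call the lexize method of each mapped dictionary directly, so unrelated dictionaries are never loaded.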