Re: integrated tsearch doesn't work with non utf8 database - Mailing list pgsql-hackers
| From | Oleg Bartunov |
|---|---|
| Subject | Re: integrated tsearch doesn't work with non utf8 database |
| Date | |
| Msg-id | Pine.LNX.4.64.0709081017260.2767@sn.sai.msu.ru |
| In response to | Re: integrated tsearch doesn't work with non utf8 database ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
| List | pgsql-hackers |
On Fri, 7 Sep 2007, Heikki Linnakangas wrote:
> Pavel Stehule wrote:
>> postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
>> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
>> CONTEXT: SQL function "ts_debug" statement 1
>
> I can reproduce that. In fact, you don't need the custom config or
> dictionary at all:
>
> postgres=# CREATE DATABASE latin2 encoding='latin2';
> CREATE DATABASE
> postgres=# \c latin2
> You are now connected to database "latin2".
> latin2=# select ts_debug('simple','foo');
> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT: SQL function "ts_debug" statement 1
>
> It fails trying to lexize the string using the danish snowball stemmer,
> because the danish stopword file contains the character 'å', which doesn't
> have an equivalent in LATIN2.
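The byte pair in the error message confirms this. A quick check (plain Python, using its standard codecs as a stand-in for the server's encoding conversion) shows that 0xc3a5 is the UTF-8 encoding of 'å', and that ISO 8859-2 (LATIN2) has no code point for it:

```python
# 0xc3 0xa5 from the error message is the UTF-8 encoding of U+00E5 ('å'),
# a letter used in the danish stopword file.
ch = b"\xc3\xa5".decode("utf-8")
print(ch)  # å

# LATIN2 (ISO 8859-2) targets Central European languages and has no
# slot for 'å', so any conversion of the stopword file must fail.
try:
    ch.encode("iso8859-2")
except UnicodeEncodeError as e:
    print("no equivalent in LATIN2:", e.reason)
```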
>
> Now what the heck is it doing with the danish stemmer, you might ask.
> ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
> behind it, I get this plan:
>
> latin2=# \i foo.sql
>                                                          QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=2.80..1134.45 rows=80 width=100)
>    Hash Cond: (parse.tokid = tt.tokid)
>    InitPlan
>      ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
>            Filter: (oid = 3748::oid)
>      ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
>            Filter: (oid = 3748::oid)
>    ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
>    ->  Hash  (cost=0.20..0.20 rows=16 width=68)
>          ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
>    SubPlan
>      ->  Limit  (cost=7.33..7.36 rows=1 width=36)
>            ->  Subquery Scan dl  (cost=7.33..7.36 rows=1 width=36)
>                  ->  Sort  (cost=7.33..7.34 rows=1 width=8)
>                        Sort Key: m.mapseqno
>                        ->  Seq Scan on pg_ts_config_map m  (cost=0.00..7.32 rows=1 width=8)
>                              Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
>      ->  Sort  (cost=6.57..6.57 rows=1 width=8)
>            Sort Key: m.mapseqno
>            ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
>                  Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
> (21 rows)
>
> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
> $1). That means that it will call ts_lexize on every dictionary, which
> will try to load every dictionary. And loading danish_stem dictionary
> fails in latin2 encoding, because of the problem with the stopword file.
>
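The effect of that filter can be illustrated with a small Python stand-in (the rows and the loader below are hypothetical, not PostgreSQL code): because ts_lexize(mapdict, $1) appears in the per-row filter, it runs against every row of pg_ts_config_map, so every dictionary gets loaded, including ones that belong to other configurations entirely.

```python
loaded = set()

def ts_lexize(dictionary, token):
    # Stand-in for the real ts_lexize: loading a dictionary can fail,
    # e.g. when its stopword file has characters the database encoding
    # cannot represent.
    loaded.add(dictionary)
    return token  # pretend every dictionary accepts the token

rows = [  # hypothetical pg_ts_config_map contents
    {"mapcfg": 3765, "mapdict": "simple"},
    {"mapcfg": 3765, "mapdict": "english_stem"},
    {"mapcfg": 9999, "mapdict": "danish_stem"},
]

# Mirror the plan's filter: ts_lexize is evaluated for every row, so
# danish_stem is loaded even though its mapcfg never matches.
hits = [r for r in rows
        if ts_lexize(r["mapdict"], "foo") is not None
        and r["mapcfg"] == 3765]
print(sorted(loaded))  # ['danish_stem', 'english_stem', 'simple']
```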
> We could rewrite ts_debug as a C-function, so that it doesn't try to
> access any unnecessary dictionaries. It seems wrong to install
> dictionaries in databases where they won't work in the first place, but
> I don't see an easy fix for that. Any comments or better ideas?

ts_debug currently doesn't work well with the thesaurus dictionary either, so it
certainly needs to be rewritten in C. We have left that rewrite for the future.
Regards, Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83