Re: integrated tsearch doesn't work with non utf8 database - Mailing list pgsql-hackers
| From | Oleg Bartunov |
|---|---|
| Subject | Re: integrated tsearch doesn't work with non utf8 database |
| Date | |
| Msg-id | Pine.LNX.4.64.0709081017260.2767@sn.sai.msu.ru |
| In response to | Re: integrated tsearch doesn't work with non utf8 database ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
| List | pgsql-hackers |
On Fri, 7 Sep 2007, Heikki Linnakangas wrote:
> Pavel Stehule wrote:
>> postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
>> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
>> CONTEXT: SQL function "ts_debug" statement 1
>
> I can reproduce that. In fact, you don't need the custom config or
> dictionary at all:
>
> postgres=# CREATE DATABASE latin2 encoding='latin2';
> CREATE DATABASE
> postgres=# \c latin2
> You are now connected to database "latin2".
> latin2=# select ts_debug('simple','foo');
> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT: SQL function "ts_debug" statement 1
>
> It fails trying to lexize the string using the danish snowball stemmer,
> because the danish stopword file contains the character 'å', which doesn't
> have an equivalent in LATIN2.
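The byte pair in the error message confirms this. A quick check (plain Python, using its standard codecs as a stand-in for the server's encoding conversion) shows that 0xc3a5 is the UTF-8 encoding of 'å', and that ISO 8859-2 (LATIN2) has no code point for it:

```python
# 0xc3 0xa5 from the error message is the UTF-8 encoding of U+00E5 ('å'),
# a letter used in the danish stopword file.
ch = b"\xc3\xa5".decode("utf-8")
print(ch)  # å

# LATIN2 (ISO 8859-2) targets Central European languages and has no
# slot for 'å', so any conversion of the stopword file must fail.
try:
    ch.encode("iso8859-2")
except UnicodeEncodeError as e:
    print("no equivalent in LATIN2:", e.reason)
```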
>
> Now what the heck is it doing with the danish stemmer, you might ask.
> ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
> behind it, I get this plan:
>
> latin2=# \i foo.sql
>                                                          QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=2.80..1134.45 rows=80 width=100)
>    Hash Cond: (parse.tokid = tt.tokid)
>    InitPlan
>      ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
>            Filter: (oid = 3748::oid)
>      ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
>            Filter: (oid = 3748::oid)
>    ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
>    ->  Hash  (cost=0.20..0.20 rows=16 width=68)
>          ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
>    SubPlan
>      ->  Limit  (cost=7.33..7.36 rows=1 width=36)
>            ->  Subquery Scan dl  (cost=7.33..7.36 rows=1 width=36)
>                  ->  Sort  (cost=7.33..7.34 rows=1 width=8)
>                        Sort Key: m.mapseqno
>                        ->  Seq Scan on pg_ts_config_map m  (cost=0.00..7.32 rows=1 width=8)
>                              Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
>      ->  Sort  (cost=6.57..6.57 rows=1 width=8)
>            Sort Key: m.mapseqno
>            ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
>                  Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
> (21 rows)
>
> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
> $1). That means that it will call ts_lexize on every dictionary, which
> will try to load every dictionary. And loading danish_stem dictionary
> fails in latin2 encoding, because of the problem with the stopword file.
>
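The effect of that filter can be illustrated with a small Python stand-in (the rows and the loader below are hypothetical, not PostgreSQL code): because ts_lexize(mapdict, $1) appears in the per-row filter, it runs against every row of pg_ts_config_map, so every dictionary gets loaded, including ones that belong to other configurations entirely.

```python
loaded = set()

def ts_lexize(dictionary, token):
    # Stand-in for the real ts_lexize: loading a dictionary can fail,
    # e.g. when its stopword file has characters the database encoding
    # cannot represent.
    loaded.add(dictionary)
    return token  # pretend every dictionary accepts the token

rows = [  # hypothetical pg_ts_config_map contents
    {"mapcfg": 3765, "mapdict": "simple"},
    {"mapcfg": 3765, "mapdict": "english_stem"},
    {"mapcfg": 9999, "mapdict": "danish_stem"},
]

# Mirror the plan's filter: ts_lexize is evaluated for every row, so
# danish_stem is loaded even though its mapcfg never matches.
hits = [r for r in rows
        if ts_lexize(r["mapdict"], "foo") is not None
        and r["mapcfg"] == 3765]
print(sorted(loaded))  # ['danish_stem', 'english_stem', 'simple']
```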
> We could rewrite ts_debug as a C-function, so that it doesn't try to
> access any unnecessary dictionaries. It seems wrong to install
> dictionaries in databases where they won't work in the first place, but
> I don't see an easy fix for that. Any comments or better ideas?

ts_debug currently doesn't work well with the thesaurus dictionary either, so it
certainly needs to be rewritten in C. We have left that rewrite for the future.
Regards, Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83