Thread: integrated tsearch doesn't work with non utf8 database
Hello,

Last time I checked a utf8 database; now I have checked a latin2-encoded database, using the dictionaries from the last test.

 client_encoding | utf8
 lc_collate      | cs_CZ.iso-8859-2
 lc_ctype        | cs_CZ.iso-8859-2
 lc_messages     | cs_CZ.iso-8859-2
 lc_monetary     | cs_CZ.iso-8859-2
 lc_numeric      | cs_CZ.iso-8859-2
 lc_time         | cs_CZ.iso-8859-2

postgres=# create text search dictionary cspell2 (template=ispell, afffile=czech, dictfile=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# alter text search configuration cs alter mapping for word, lword with cspell2, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
CONTEXT:  SQL function "ts_debug" statement 1

The database was initialised with:

    /usr/local/pgsql/bin/initdb --encoding=latin2 --locale=cs_CZ.iso-8859-2 -D /usr/local/pgsql/data/

The locales are correct:

postgres=# set client_encoding to utf8;
SET
postgres=# select upper('Příliš žluťoučký kůň se napil žluté vody');
                  upper
------------------------------------------
 PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ SE NAPIL ŽLUTÉ VODY
(1 row)

Regards
Pavel Stehule
Pavel Stehule wrote:
> postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
> ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT:  SQL function "ts_debug" statement 1

I can reproduce that. In fact, you don't need the custom config or dictionary at all:

postgres=# CREATE DATABASE latin2 encoding='latin2';
CREATE DATABASE
postgres=# \c latin2
You are now connected to database "latin2".
latin2=# select ts_debug('simple','foo');
ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
CONTEXT:  SQL function "ts_debug" statement 1

It fails trying to lexize the string using the danish snowball stemmer, because the danish stopword file contains the character 'å', which doesn't have an equivalent in LATIN2.

Now what the heck is it doing with the danish stemmer, you might ask. ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT behind it, I get this plan:

latin2=# \i foo.sql
                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2.80..1134.45 rows=80 width=100)
   Hash Cond: (parse.tokid = tt.tokid)
   InitPlan
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
   ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
   ->  Hash  (cost=0.20..0.20 rows=16 width=68)
         ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
   SubPlan
     ->  Limit  (cost=7.33..7.36 rows=1 width=36)
           ->  Subquery Scan dl  (cost=7.33..7.36 rows=1 width=36)
                 ->  Sort  (cost=7.33..7.34 rows=1 width=8)
                       Sort Key: m.mapseqno
                       ->  Seq Scan on pg_ts_config_map m  (cost=0.00..7.32 rows=1 width=8)
                             Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
     ->  Sort  (cost=6.57..6.57 rows=1 width=8)
           Sort Key: m.mapseqno
           ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                 Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
(21 rows)

Note the Seq Scan on pg_ts_config_map, with a filter on ts_lexize(mapdict, $1). That means it will call ts_lexize on every dictionary, which will try to load every dictionary. And loading the danish_stem dictionary fails in latin2 encoding, because of the problem with the stopword file.

We could rewrite ts_debug as a C function, so that it doesn't try to access any unnecessary dictionaries. It seems wrong to install dictionaries in databases where they won't work in the first place, but I don't see an easy fix for that. Any comments or better ideas?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
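The failure mode can also be triggered outside ts_debug; a minimal sketch, assuming the stock dictionary names shipped with the integrated tsearch (e.g. danish_stem):

```sql
-- In a latin2-encoded database, lexizing with the danish stemmer directly
-- forces the dictionary to load; loading reads its UTF-8 stopword file,
-- and transcoding 'å' to LATIN2 fails with the same error:
SELECT ts_lexize('danish_stem', 'foo');

-- The plan's scan filter effectively does the same for every mapping row,
-- loading each referenced dictionary as a side effect:
SELECT mapdict, ts_lexize(mapdict, 'foo')
FROM pg_catalog.pg_ts_config_map;
```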
On Fri, 7 Sep 2007, Heikki Linnakangas wrote:

> Pavel Stehule wrote:
>> postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
>> ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
>> CONTEXT:  SQL function "ts_debug" statement 1
>
> I can reproduce that. In fact, you don't need the custom config or
> dictionary at all:
>
> postgres=# CREATE DATABASE latin2 encoding='latin2';
> CREATE DATABASE
> postgres=# \c latin2
> You are now connected to database "latin2".
> latin2=# select ts_debug('simple','foo');
> ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT:  SQL function "ts_debug" statement 1
>
> It fails trying to lexize the string using the danish snowball stemmer,
> because the danish stopword file contains character 'å' which doesn't
> have an equivalent in LATIN2.
>
> Now what the heck is it doing with the danish stemmer, you might ask.
> ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
> behind it, I get this plan:
>
> [plan snipped]
>
> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
> $1). That means that it will call ts_lexize on every dictionary, which
> will try to load every dictionary. And loading danish_stem dictionary
> fails in latin2 encoding, because of the problem with the stopword file.
>
> We could rewrite ts_debug as a C-function, so that it doesn't try to

ts_debug currently doesn't work well with the thesaurus dictionary, so it certainly needs to be rewritten in C. We left rewriting it for the future.

> access any unnecessary dictionaries. It seems wrong to install
> dictionaries in databases where they won't work in the first place, but
> I don't see an easy fix for that. Any comments or better ideas?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
> $1). That means that it will call ts_lexize on every dictionary, which
> will try to load every dictionary. And loading danish_stem dictionary
> fails in latin2 encoding, because of the problem with the stopword file.

Attached patch should fix it, I hope. New plan:

 Hash Join  (cost=2.80..1073.85 rows=80 width=100)
   Hash Cond: (parse.tokid = tt.tokid)
   InitPlan
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 11308::oid)
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 11308::oid)
   ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
   ->  Hash  (cost=0.20..0.20 rows=16 width=68)
         ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
   SubPlan
     ->  Limit  (cost=6.57..6.60 rows=1 width=36)
           ->  Subquery Scan dl  (cost=6.57..6.60 rows=1 width=36)
                 ->  Sort  (cost=6.57..6.58 rows=1 width=8)
                       Sort Key: ((ts_lexize(m.mapdict, $1) IS NULL)), m.mapseqno
                       ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                             Filter: ((mapcfg = 11308::oid) AND (maptokentype = $0))
     ->  Sort  (cost=6.57..6.57 rows=1 width=8)
           Sort Key: m.mapseqno
           ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                 Filter: ((mapcfg = 11308::oid) AND (maptokentype = $0))

At least, it checks only the needed dictionaries.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

*** ./src/backend/catalog/system_views.sql.orig	Mon Sep 10 15:51:27 2007
--- ./src/backend/catalog/system_views.sql	Mon Sep 10 16:09:52 2007
***************
*** 415,422 ****
          ( SELECT mapdict, pg_catalog.ts_lexize(mapdict, parse.token) AS lex
            FROM pg_catalog.pg_ts_config_map AS m
            WHERE m.mapcfg = $1 AND m.maptokentype = parse.tokid
!           ORDER BY m.mapseqno ) dl
!     WHERE dl.lex IS NOT NULL
      LIMIT 1
      ) AS "Lexized token"
  FROM pg_catalog.ts_parse(
--- 415,421 ----
          ( SELECT mapdict, pg_catalog.ts_lexize(mapdict, parse.token) AS lex
            FROM pg_catalog.pg_ts_config_map AS m
            WHERE m.mapcfg = $1 AND m.maptokentype = parse.tokid
!           ORDER BY pg_catalog.ts_lexize(mapdict, parse.token) IS NULL, m.mapseqno ) dl
      LIMIT 1
      ) AS "Lexized token"
  FROM pg_catalog.ts_parse(
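The new ORDER BY relies on booleans sorting with false before true: mappings whose ts_lexize() result is non-NULL sort ahead of those returning NULL, and the LIMIT 1 then picks the first successful dictionary in mapseqno order. A standalone sketch of the same idiom (hypothetical sample values):

```sql
-- 'val IS NULL' evaluates to false for non-NULL rows, and false sorts
-- before true, so the non-NULL values come first; LIMIT 1 then returns
-- the smallest non-NULL value if any exists
SELECT val
FROM (VALUES (NULL::int), (3), (1)) AS v(val)
ORDER BY val IS NULL, val
LIMIT 1;
-- returns 1
```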
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
>> $1). That means that it will call ts_lexize on every dictionary, which
>> will try to load every dictionary. And loading danish_stem dictionary
>> fails in latin2 encoding, because of the problem with the stopword file.

> Attached patch should fix it, I hope.

Uh, how will that help? AFAICS it still has to call ts_lexize with every dictionary.

			regards, tom lane
Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>>> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
>>> $1). That means that it will call ts_lexize on every dictionary, which
>>> will try to load every dictionary. And loading danish_stem dictionary
>>> fails in latin2 encoding, because of the problem with the stopword file.
>
>> Attached patch should fix it, I hope.
>
> Uh, how will that help? AFAICS it still has to call ts_lexize with
> every dictionary.

No, ts_lexize is no longer in the seq scan filter, but in the sort key, which is calculated only for those rows that match the filter 'mapcfg=? AND maptokentype=?'.

It is pretty kludgey, though. The planner could choose another plan that fails if the statistics were different. Rewriting the function in C would be a more robust fix.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Uh, how will that help? AFAICS it still has to call ts_lexize with >> every dictionary. > No, ts_lexize is no longer in the seq scan filter, but in the sort key > that's calculated only for those rows that match the filter 'mapcfg=? > AND maptokentype=?'. It is pretty kludgey, though. Oh, I see: in the original formulation the planner can push the "WHERE dl.lex IS NOT NULL" clause down into the sorted subquery, and what with one thing and another that clause ends up getting evaluated first in the scan filter condition. We could prevent that by increasing ts_lexize's procost from 1 to 2 (or so), which might be a a good thing anyway since I suppose it's not especially cheap. It's still a klugy solution though. I think Teodor's solution is wrong as it stands, because if the subquery finds matches for mapcfg and maptokentype, but none of those rows produce a non-null ts_lexize result, it will instead emit one row with a null result, which is not what should happen. regards, tom lane
> I think Teodor's solution is wrong as it stands, because if the subquery
> finds matches for mapcfg and maptokentype, but none of those rows
> produce a non-null ts_lexize result, it will instead emit one row with a
> null result, which is not what should happen.

But concatenation with NULL yields NULL, so the "Lexized token" column will be NULL, as intended.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
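The NULL propagation being relied on here is easy to check directly (a trivial sketch with a made-up string):

```sql
-- String concatenation with NULL yields NULL, so a NULL lex result
-- makes the whole "Lexized token" expression NULL:
SELECT 'dict: ' || NULL AS lexized;
-- result is NULL
```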
Teodor Sigaev <teodor@sigaev.ru> writes:
>> I think Teodor's solution is wrong as it stands, because if the subquery
>> finds matches for mapcfg and maptokentype, but none of those rows
>> produce a non-null ts_lexize result, it will instead emit one row with a
>> null result, which is not what should happen.

> But concatenation with NULL will have result NULL, so "Lexized token" column
> will have NULL as supposed.

Ah, you're right --- objection withdrawn.

			regards, tom lane