Thread: BUG #13766: weird ts_headline/ts_vector/ts_query behaviour

BUG #13766: weird ts_headline/ts_vector/ts_query behaviour

From
aslesha.akella@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      13766
Logged by:          Calendar 42
Email address:      aslesha.akella@gmail.com
PostgreSQL version: 9.2.4
Operating system:   gentoo
Description:

We are trying to run a text search for the words "goede" and "goed". We got
the following results with different languages.

select ts_headline(replace(strip(to_tsvector('dutch', 'Goede vrijdag'))::text,'''',''),
                   plainto_tsquery('dutch', 'goede'));
 ts_headline
--------------
 goed vrijdag


select ts_headline(replace(strip(to_tsvector('dutch', 'Goede vrijdag'))::text,'''',''),
                   plainto_tsquery('dutch', 'goed'));
 ts_headline
--------------
 goed vrijdag

(NOTE: this works)
select ts_headline(replace(strip(to_tsvector('english', 'Goede vrijdag'))::text,'''',''),
                   to_tsquery('english', 'goed'));
      ts_headline
---------------------
 <b>goed</b> vrijdag


select ts_headline(replace(strip(to_tsvector('english', 'Goede vrijdag'))::text,'''',''),
                   to_tsquery('english', 'goede'));
 ts_headline
--------------
 goed vrijdag


This works too, but we did not understand how, because the stem for the word
"Goede" in 'simple' is 'goede'. Yet 'simple' works for 'goed' and not for
'goede':

select ts_headline(replace(strip(to_tsvector('simple', 'Goede vrijdag'))::text,'''',''),
                   to_tsquery('simple', 'goed'));
     ts_headline
----------------------
 <b>goede</b> vrijdag

select ts_headline(replace(strip(to_tsvector('simple', 'Goede vrijdag'))::text,'''',''),
                   to_tsquery('simple', 'goede'));
  ts_headline
---------------
 goede vrijdag

Re: BUG #13766: weird ts_headline/ts_vector/ts_query behaviour

From
Artur Zakirov
Date:
On 10.11.2015 16:53, aslesha.akella@gmail.com wrote:
>
> We are trying to make text search for a word "goede" and "goed". It got the
> following results with different languages.
>

Hi

Do you use the predefined text search configurations "english" and "dutch"?

If so, and you have not changed them, then the "english" and "dutch"
configurations use the English and Dutch stemming algorithms
(https://en.wikipedia.org/wiki/Stemming) and check for stop words. In
the following examples you can see how words are converted to lexemes:

 > select to_tsvector('dutch', 'Goede vrijdag');
      to_tsvector
----------------------
  'goed':1 'vrijdag':2
(1 row)

 > select to_tsvector('dutch', 'Goed vrijdag');
      to_tsvector
----------------------
  'goed':1 'vrijdag':2
(1 row)

 > select to_tsvector('english', 'Goed vrijdag');
     to_tsvector
--------------------
  'go':1 'vrijdag':2
(1 row)

 > select to_tsvector('english', 'Goede vrijdag');
      to_tsvector
----------------------
  'goed':1 'vrijdag':2
(1 row)
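
The same stemming is applied on the query side. As a sketch, assuming the
predefined configurations are unchanged, the queries below show the lexeme
each search word is reduced to (the results in the comments follow from the
to_tsvector output above). Note also that ts_headline takes an optional
configuration as its first argument and falls back to
default_text_search_config when it is omitted, as in your examples. The last
statement passes the configuration explicitly and works on the original text
rather than on the stripped tsvector.

```sql
-- Lexemes produced on the query side (same dictionaries as to_tsvector).
select to_tsquery('dutch', 'goede');    -- 'goed'
select to_tsquery('dutch', 'goed');     -- 'goed'
select to_tsquery('english', 'goede');  -- 'goed'
select to_tsquery('english', 'goed');   -- 'go'

-- Which configuration ts_headline uses when none is given.
show default_text_search_config;

-- ts_headline with an explicit configuration, applied to the original text.
select ts_headline('dutch', 'Goede vrijdag', plainto_tsquery('dutch', 'goede'));
```

If the configuration used by ts_headline differs from the one used to build
the query, the document words and the query words can be stemmed differently,
so a match may not be highlighted.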

The simple configuration does not use stemming algorithms. It only converts
input words to lower-case lexemes (the simple dictionary template can also
exclude stop words, but only if a stop-word file is configured for it).
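
For example, a minimal check (assuming the predefined "simple" configuration
and "simple" dictionary are unchanged) shows that the word is kept as typed
apart from lower-casing, which matches the 'goede' lexeme you reported:

```sql
-- No stemming: the trailing "e" is preserved.
select to_tsvector('simple', 'Goede vrijdag');
-- 'goede':1 'vrijdag':2

-- The dictionary on its own only lower-cases the token.
select ts_lexize('simple', 'Goede');
-- {goede}
```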

You can also create an Ispell dictionary and use it. More information is in
the documentation:
http://www.postgresql.org/docs/devel/static/textsearch-dictionaries.html
and in these good articles:
http://shisaa.jp/postset/postgresql-full-text-search-part-1.html
http://shisaa.jp/postset/postgresql-full-text-search-part-2.html
http://shisaa.jp/postset/postgresql-full-text-search-part-3.html
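
A rough sketch of such a setup follows. The names dutch_ispell and
dutch_ispell_cfg are made up for illustration, and the sketch assumes that
Dutch dictionary files dutch.dict and dutch.affix have already been installed
in the tsearch_data directory (they are not shipped with PostgreSQL).

```sql
-- Hypothetical dictionary name; requires dutch.dict and dutch.affix
-- in $SHAREDIR/tsearch_data (an assumption, not shipped by default).
create text search dictionary dutch_ispell (
    template  = ispell,
    dictfile  = dutch,
    afffile   = dutch,
    stopwords = dutch
);

-- Hypothetical configuration, copied from the predefined "dutch" one.
create text search configuration dutch_ispell_cfg (copy = dutch);

-- Try the Ispell dictionary first, then fall back to the Snowball stemmer.
alter text search configuration dutch_ispell_cfg
    alter mapping for asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    with dutch_ispell, dutch_stem;

-- Inspect the lexemes the new configuration produces.
select to_tsvector('dutch_ispell_cfg', 'Goede vrijdag');
```

Listing dutch_stem after the Ispell dictionary means that words the Ispell
dictionary does not recognize are still stemmed instead of being dropped.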

But I am not sure that I understood your question correctly.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company