Thread: Full text search bug ('russian' regconfig)

Full text search bug ('russian' regconfig)


Text search doesn't work correctly when the text and the query are the SAME string ('russian' configuration).
As the example shows, the tsvector ends up with different lexemes than the tsquery for identical text:

tsv = 'дан':1 'магазин':2 'нужн':3 'посеща':4 'точн':5
tsq = 'нужн' & 'точн' & 'дан' & 'посещаем' & 'магазин'

SELECT
        (web_query_and @@ ts_title)::INTEGER AS full_title_entries, -- 0, expected 1
        (web_query_and @@ 'зачем нужны точные данные о посещаемости магазинов?')::INTEGER AS full_title_entries2,
        *
FROM
        (SELECT
                to_tsvector('russian', STRIP(to_tsvector('russian', 'зачем нужны точные данные о посещаемости магазинов?'))::TEXT) AS ts_title,
                websearch_to_tsquery('russian', REPLACE('зачем нужны точные данные о посещаемости магазинов?', '- ', '')) AS web_query_and
        ) AS main;

Best regards,

Re: Full text search bug ('russian' regconfig)

Artur Zakirov

On 2/19/2020 5:35 PM, egocenter wrote:
> Text search doesn't work correctly when the text and the query are the SAME string ('russian' configuration).
> As the example shows, the tsvector ends up with different lexemes than the tsquery for identical text:
> tsv = 'дан':1 'магазин':2 'нужн':3 'посеща':4 'точн':5
> tsq = 'нужн' & 'точн' & 'дан' & 'посещаем' & 'магазин'

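This happens because you call to_tsvector() twice. 'russian' uses a Snowball
dictionary, which applies a stemming algorithm to strip word endings, so
stemming an already stemmed lexeme can shorten it again (here 'посещаем'
becomes 'посеща'). Your query works if to_tsvector() isn't applied twice
to the same text: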

SELECT
   web_query_and @@ ts_title,
   web_query_and @@ 'зачем нужны точные данные о посещаемости магазинов',
   *
FROM
   (SELECT
     to_tsvector('russian', 'зачем нужны точные данные о посещаемости магазинов') AS ts_title,
     websearch_to_tsquery('russian', 'зачем нужны точные данные о посещаемости магазинов?') AS web_query_and
   ) AS main;

It gives 'true' for the first column.
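
The double-stemming effect can be checked in isolation. The following is only a sketch, assuming a default PostgreSQL installation where the 'russian' configuration maps Russian words to the 'russian_stem' Snowball dictionary. Judging by the lexemes reported at the top of the thread, the first pass should yield 'посещаем' and the second pass 'посеща'.

```sql
-- First pass: the original word form is stemmed once.
SELECT to_tsvector('russian', 'посещаемости') AS first_pass;

-- Second pass: feeding the resulting stem back in stems it again.
SELECT to_tsvector('russian', 'посещаем') AS second_pass;

-- The Snowball dictionary can also be queried directly
-- ('russian_stem' is assumed to be the dictionary behind 'russian').
SELECT ts_lexize('russian_stem', 'посещаемости') AS stemmed_once,
       ts_lexize('russian_stem', 'посещаем')     AS stemmed_twice;
```

The other four words in the example appear with the same lexemes in both the tsvector and the tsquery, which would explain why only this one word breaks the match.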


Re: Full text search bug ('russian' regconfig)

Hello, Artur!

Thanks for the answer.
OK, though it's strange that only one word is affected this way (as if two lexemes exist for one word)...

I call to_tsvector twice to eliminate duplicate words.
In the example below, ts_title = 'histori':2 'watcom':1,3,
and ts_rank_cd counts 2 occurrences for the query 'город - watcom'.

I need to count UNIQUE word occurrences, but the standard functionality doesn't seem to support that.
I see two options: a custom ts_rank function, or editing the tsvector to keep only the first position for 'watcom':
ts_title = 'histori':2 'watcom':1.

If you have any ideas about this, I would highly appreciate them! Thanks in advance.

SELECT
        round((ts_rank_cd(ts_title, web_query_or)/0.1)::NUMERIC, 0) AS title_entries_count, -- 2, but should be 1
        *
FROM
   (SELECT
     to_tsvector('russian', 'watcom history | watcom') AS ts_title,
     -- the "- " is removed so the dash is not converted into negation (!)
     websearch_to_tsquery('russian', REPLACE('город - watcom', '- ' , '')) AS web_query_and,
     REPLACE(websearch_to_tsquery(:reg_config, REPLACE('город - watcom', '- ' , ''))::TEXT, '&', '|')::tsquery AS web_query_or
   ) AS main;
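
A minimal sketch of the second option mentioned above (keeping only the first position of each lexeme), offered as one possible approach rather than a confirmed solution from this thread. It assumes PostgreSQL 9.6 or later, where unnest(tsvector) is available. It drops weights, and the column alias dedup_title is hypothetical.

```sql
-- Rebuild the tsvector so that every lexeme keeps only its first position.
-- unnest(tsvector) returns one row per lexeme: (lexeme, positions, weights).
SELECT string_agg(quote_literal(lexeme) || ':' || positions[1], ' ')::tsvector
           AS dedup_title  -- hypothetical alias
FROM unnest(to_tsvector('russian', 'watcom history | watcom'));
```

Because the text is passed through to_tsvector() only once, the lexemes are not stemmed a second time. The resulting value could then be used as ts_title in the ts_rank_cd() call.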

