Phrase search vs. multi-lexeme tokens - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	Phrase search vs. multi-lexeme tokens
Date	November 12, 2020 16:09:51
Msg-id	CAPpHfdv0EzVhf6CWfB1_TTZqXV_2Sn-jSY3zSd7ePH=-+1V2DQ@mail.gmail.com
Responses	Re: Phrase search vs. multi-lexeme tokens Re: Phrase search vs. multi-lexeme tokens
List	pgsql-hackers

Hackers,

I'm investigating the bug report [1] about the behavior of
websearch_to_tsquery() with quotes and multi-lexeme tokens.  See the
example below.

# select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"');
 ?column?
----------
 f

So the tsvector doesn't match the tsquery, even though exactly the same
text was passed to to_tsvector() and placed inside the quotes of
websearch_to_tsquery().  That looks wrong to me.  Let's examine the
output of to_tsvector() and websearch_to_tsquery().

# select to_tsvector('pg_class foo');
       to_tsvector
--------------------------
 'class':2 'foo':3 'pg':1

# select websearch_to_tsquery('"pg_class foo"');
     websearch_to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'
(1 row)

So, the 'pg_class' token was split into two lexemes, 'pg' and 'class'.
But the output of websearch_to_tsquery() connects 'pg' and 'class' with
the & operator.  That tsquery expects both 'pg' and 'class' to
immediately precede 'foo', so 'pg' and 'class' would have to share the
same position, which isn't true for this tsvector.  Let's see how
phraseto_tsquery() handles that.

# select to_tsvector('pg_class foo') @@ phraseto_tsquery('pg_class foo');
 ?column?
----------
 t

# select phraseto_tsquery('pg_class foo');
      phraseto_tsquery
----------------------------
 'pg' <-> 'class' <-> 'foo'

phraseto_tsquery() connects all the lexemes with phrase operators and
everything works OK.
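
To make the difference concrete, here is a rough toy model in Python of
how positions propagate through the two query shapes.  It is only a
sketch under my own assumptions, not the actual tsquery executor: the
helper names (lex, and_, phrase, positions, matches) are made up for
illustration, and a tsvector is reduced to a map from lexeme to
positions.

```python
# Toy model of phrase matching; NOT PostgreSQL's implementation.
# A tsvector is modeled as {lexeme: set of positions}.

def lex(word):
    return ("lex", word)

def and_(left, right):
    return ("and", left, right)

def phrase(left, right, distance=1):
    # Models the <-> operator (distance 1 by default).
    return ("phrase", left, right, distance)

def positions(node, vector):
    """Positions at which `node` matches when it is a phrase operand."""
    kind = node[0]
    if kind == "lex":
        return set(vector.get(node[1], set()))
    if kind == "and":
        # Inside a phrase, both operands must match at the same position.
        return positions(node[1], vector) & positions(node[2], vector)
    if kind == "phrase":
        _, left, right, distance = node
        left_pos = positions(left, vector)
        # Keep right-hand positions that sit `distance` after a left match.
        return {p for p in positions(right, vector) if p - distance in left_pos}
    raise ValueError(f"unknown node kind: {kind}")

def matches(node, vector):
    if node[0] == "and":
        # A top-level AND only needs both sides to match somewhere.
        return matches(node[1], vector) and matches(node[2], vector)
    return bool(positions(node, vector))

# to_tsvector('pg_class foo')  =>  'class':2 'foo':3 'pg':1
vector = {"pg": {1}, "class": {2}, "foo": {3}}

# ( 'pg' & 'class' ) <-> 'foo'
websearch_query = phrase(and_(lex("pg"), lex("class")), lex("foo"))
# 'pg' <-> 'class' <-> 'foo'
phraseto_query = phrase(phrase(lex("pg"), lex("class")), lex("foo"))

websearch_matches = matches(websearch_query, vector)
phraseto_matches = matches(phraseto_query, vector)
print(websearch_matches)  # False
print(phraseto_matches)   # True
```

In this model the & operand yields an empty position set, because 'pg'
and 'class' never occupy the same position, so nothing can precede
'foo'.  The chained phrase operators carry position 2 forward from
'class' and the match succeeds.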

To me it's obvious that phraseto_tsquery() and websearch_to_tsquery()
with quotes should work the same way.  Notably, the current behavior of
websearch_to_tsquery() is recorded in the regression tests.  So it
might look as if this behavior is intended, but it's too ridiculous,
and I think the regression tests contain an oversight as well.

I've prepared a fix that doesn't break the fts parser abstractions
too much (see the attached patch), but I've run into another, similar
issue in to_tsquery().

# select to_tsvector('pg_class foo') @@ to_tsquery('pg_class <-> foo');
 ?column?
----------
 f

# select to_tsquery('pg_class <-> foo');
          to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'

I think if a user writes 'pg_class <-> foo', they expect it to
match 'pg_class foo' regardless of which lexemes 'pg_class' is
split into.

This issue looks like a much more complex design bug in phrase
search.  Fixing it would require some kind of readahead or multipass
processing, because we don't know in advance how to process
'pg_class'.

Is this really a design bug that has existed in phrase search from the
beginning?  Or am I missing something?

Links
1. https://www.postgresql.org/message-id/16592-70b110ff9731c07d%40postgresql.org

------
Regards,
Alexander Korotkov

Attachment

websearch_fix_p2.patch
