Phrase search vs. multi-lexeme tokens - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Phrase search vs. multi-lexeme tokens
Msg-id CAPpHfdv0EzVhf6CWfB1_TTZqXV_2Sn-jSY3zSd7ePH=-+1V2DQ@mail.gmail.com
Responses Re: Phrase search vs. multi-lexeme tokens  (Alexander Korotkov <aekorotkov@gmail.com>)
Re: Phrase search vs. multi-lexeme tokens  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hackers,

I'm investigating the bug report [1] about the behavior of
websearch_to_tsquery() with quotes and multi-lexeme tokens.  See the
example below.

# select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"');
 ?column?
----------
 f

So the tsvector doesn't match the tsquery, even though exactly the same
text was passed to to_tsvector() and placed inside the quotes of
websearch_to_tsquery().  That looks wrong to me.  Let's examine the
output of to_tsvector() and websearch_to_tsquery().

# select to_tsvector('pg_class foo');
       to_tsvector
--------------------------
 'class':2 'foo':3 'pg':1

# select websearch_to_tsquery('"pg_class foo"');
     websearch_to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'
(1 row)
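
To see where the two lexemes come from, ts_debug() can show how the
parser and dictionaries break up the same input.  This is only a
diagnostic sketch; it uses the default text search configuration, as in
the queries above, and the output is omitted because it depends on that
configuration.

```sql
-- Show each token the parser finds in the text, its token type,
-- and the lexemes the dictionaries produce for it.
select alias, token, lexemes
from ts_debug('pg_class foo');
```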

So, the 'pg_class' token was split into two lexemes, 'pg' and 'class'.
But the output of websearch_to_tsquery() connects 'pg' and 'class' with
the & operator.  The tsquery therefore expects both 'pg' and 'class' to
be immediate neighbors of 'foo', i.e. to share the same position, and
that isn't true for the tsvector, where they sit at positions 1 and 2.
Let's see how phraseto_tsquery() handles that.

# select to_tsvector('pg_class foo') @@ phraseto_tsquery('pg_class foo');
 ?column?
----------
 t

# select phraseto_tsquery('pg_class foo');
      phraseto_tsquery
----------------------------
 'pg' <-> 'class' <-> 'foo'

phraseto_tsquery() connects all the lexemes with phrase operators and
everything works OK.
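
The reading of & given above can be checked against hand-built
tsvectors.  This is a sketch rather than part of the patch: the tsvector
literals are made up for illustration, and the casts bypass the parser
and dictionaries, so only position matching is exercised.  If the
reading is right, only the first query should return true.

```sql
-- 'pg' and 'class' share position 1, with 'foo' right after them.
select 'pg:1 class:1 foo:2'::tsvector @@ '( pg & class ) <-> foo'::tsquery;

-- Positions as actually produced by to_tsvector('pg_class foo').
select 'pg:1 class:2 foo:3'::tsvector @@ '( pg & class ) <-> foo'::tsquery;
```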

To me it's obvious that phraseto_tsquery() and websearch_to_tsquery()
with quotes should work the same way.  Notably, the current behavior of
websearch_to_tsquery() is recorded in the regression tests.  So it
might look as if this behavior is intended, but it's too ridiculous to
be intended, and I think the regression tests contain an oversight as
well.

I've prepared a fix (attached patch) which doesn't break the fts parser
abstractions too much, but I've run into another, similar issue in
to_tsquery().

# select to_tsvector('pg_class foo') @@ to_tsquery('pg_class <-> foo');
 ?column?
----------
 f

# select to_tsquery('pg_class <-> foo');
          to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'

I think if a user writes 'pg_class <-> foo', then it's expected to
match 'pg_class foo' regardless of which lexemes 'pg_class' is split
into.
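
In the meantime, a user who wants that match can build the phrase for
the multi-lexeme token separately and join it with the tsquery <->
operator.  This is a workaround sketch, not what the attached patch
does; it relies on phraseto_tsquery() producing 'pg' <-> 'class' for
'pg_class', consistent with its output shown earlier.

```sql
-- Build the phrase for the multi-lexeme token with phraseto_tsquery(),
-- then attach the next word with the tsquery <-> tsquery operator.
select phraseto_tsquery('pg_class') <-> to_tsquery('foo');

-- Check the combined query against the original document text.
select to_tsvector('pg_class foo')
       @@ (phraseto_tsquery('pg_class') <-> to_tsquery('foo'));
```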

This issue looks like a much more complex design bug in phrase
search.  Fixing it would require some kind of readahead or multipass
processing, because we don't know in advance how to process 'pg_class'
(whether its lexemes should be joined with & or with <-> depends on the
operators around it).

Is this really a design bug that has existed in phrase search from the
beginning?  Or am I missing something?

Links
1. https://www.postgresql.org/message-id/16592-70b110ff9731c07d%40postgresql.org

------
Regards,
Alexander Korotkov
