Thread: Phrase search vs. multi-lexeme tokens

Phrase search vs. multi-lexeme tokens

From
Alexander Korotkov
Date:
Hackers,

I'm investigating the bug report [1] about the behavior of
websearch_to_tsquery() with quotes and multi-lexeme tokens.  See the
example below.

# select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class
foo"');
 ?column?
----------
 f

So, tsvector doesn't match tsquery, when absolutely the same text was
put to the to_tsvector() and to the quotes of websearch_to_tsquery().
Looks wrong to me.  Let's examine output of to_tsvector() and
websearch_to_tsquery().

# select to_tsvector('pg_class foo');
       to_tsvector
--------------------------
 'class':2 'foo':3 'pg':1

# select websearch_to_tsquery('"pg_class foo"');
     websearch_to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'
(1 row)

So, 'pg_class' token was split into two lexemes 'pg' and 'class'.  But
the output websearch_to_tsquery() connects 'pg' and 'class' with &
operator.  tsquery expects 'pg' and 'class' to be both neighbors of
'foo'.  So, 'pg' and 'class' are expected to share the same position,
and that isn't true for tsvector.  Let's see how phraseto_tsquery()
handles that.

# select to_tsvector('pg_class foo') @@ phraseto_tsquery('pg_class foo');
 ?column?
----------
 t

# select phraseto_tsquery('pg_class foo');
      phraseto_tsquery
----------------------------
 'pg' <-> 'class' <-> 'foo'

phraseto_tsquery() connects all the lexemes with phrase operators and
everything works OK.

For me it's obvious that phraseto_tsquery() and websearch_to_tsquery()
with quotes should work the same way.  Noticeably, current behavior of
websearch_to_tsquery() is recorded in the regression tests.  So, it
might look that this behavior is intended, but it's too ridiculous and
I think the regression tests contain oversight as well.

I've prepared a fix, which doesn't break the fts parser abstractions
too much (attached patch), but I've faced another similar issue in
to_tsquery().

# select to_tsvector('pg_class foo') @@ to_tsquery('pg_class <-> foo');
 ?column?
----------
 f

# select to_tsquery('pg_class <-> foo');
          to_tsquery
------------------------------
 ( 'pg' & 'class' ) <-> 'foo'

I think if a user writes 'pg_class <-> foo', then it's expected to
match 'pg_class foo' independently on which lexemes 'pg_class' is
split into.

This issue looks like the much more complex design bug in phrase
search.  Fixing this would require some kind of readahead or multipass
processing, because we don't know how to process 'pg_class' in
advance.

Is this really a design bug existing in phrase search from the
beginning.  Or am I missing something?

Links
1. https://www.postgresql.org/message-id/16592-70b110ff9731c07d%40postgresql.org

------
Regards,
Alexander Korotkov

Attachment

Re: Phrase search vs. multi-lexeme tokens

From
Alexander Korotkov
Date:
On Thu, Nov 12, 2020 at 4:09 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> This issue looks like the much more complex design bug in phrase
> search.  Fixing this would require some kind of readahead or multipass
> processing, because we don't know how to process 'pg_class' in
> advance.
>
> Is this really a design bug existing in phrase search from the
> beginning.  Or am I missing something?

No feedback yet.  I've added this to the commitfest to don't lose track of this.
https://commitfest.postgresql.org/31/2854/

------
Regards,
Alexander Korotkov



Re: Phrase search vs. multi-lexeme tokens

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> # select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"');
>  ?column?
> ----------
>  f

Yeah, surely this is wrong.

> # select to_tsquery('pg_class <-> foo');
>           to_tsquery
> ------------------------------
>  ( 'pg' & 'class' ) <-> 'foo'

> I think if a user writes 'pg_class <-> foo', then it's expected to
> match 'pg_class foo' independently on which lexemes 'pg_class' is
> split into.

Indeed.  It seems to me that this:

regression=# select to_tsquery('pg_class');
   to_tsquery
----------------
 'pg' & 'class'
(1 row)

is wrong all by itself.  Now that we have phrase search, a much
saner translation would be "'pg' <-> 'class'".  If we fixed that
then it seems like the more complex case would just work.

I read your patch over quickly and it seems like a reasonable
approach (but sadly underdocumented).  Can we extend the idea
to fix the to_tsquery case?

            regards, tom lane



Re: Phrase search vs. multi-lexeme tokens

From
Alexander Korotkov
Date:
Hi!

On Wed, Jan 6, 2021 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > # select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"');
> >  ?column?
> > ----------
> >  f
>
> Yeah, surely this is wrong.

Thank you for confirming my thoughts.  I also felt that is wrong but
doubted such a basic bug could exist for so long.

> > # select to_tsquery('pg_class <-> foo');
> >           to_tsquery
> > ------------------------------
> >  ( 'pg' & 'class' ) <-> 'foo'
>
> > I think if a user writes 'pg_class <-> foo', then it's expected to
> > match 'pg_class foo' independently on which lexemes 'pg_class' is
> > split into.
>
> Indeed.  It seems to me that this:
>
> regression=# select to_tsquery('pg_class');
>    to_tsquery
> ----------------
>  'pg' & 'class'
> (1 row)
>
> is wrong all by itself.  Now that we have phrase search, a much
> saner translation would be "'pg' <-> 'class'".  If we fixed that
> then it seems like the more complex case would just work.

Nice idea!  Fixing this way should be much easier than fixing only the
case when we have the phrase operator on the upper level.

> I read your patch over quickly and it seems like a reasonable
> approach (but sadly underdocumented).  Can we extend the idea
> to fix the to_tsquery case?

Sure, I'll provide a revised patch.

------
Regards,
Alexander Korotkov



Re: Phrase search vs. multi-lexeme tokens

From
Alexander Korotkov
Date:
On Thu, Jan 7, 2021 at 6:36 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > I read your patch over quickly and it seems like a reasonable
> > approach (but sadly underdocumented).  Can we extend the idea
> > to fix the to_tsquery case?
>
> Sure, I'll provide a revised patch.

The next version of the patch is attached.  Now, it just makes
to_tsquery() and websearch_to_tsquery() use phrase operator to connect
multiple lexemes of the same tsquery token.  I leave plainto_tsquery()
aside because it considers the whole argument as a single token.
Changing it would make it an equivalent of phraseto_tsquery().

------
Regards,
Alexander Korotkov

Attachment

Re: Phrase search vs. multi-lexeme tokens

From
Neil Chen
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Greetings,

Although I am not an expert in this field, I carefully read the full-text search section in the document. I think the
changeis surprising, but yes, it is correct.
 
I found that your patch didn't modify the regress/excepted/tsearch.out. So I updated it and carried out the regression
test.It passed. Also, I manually executed some test cases, all of which were OK. 

Re: Phrase search vs. multi-lexeme tokens

From
Alexander Korotkov
Date:
Hi, Neil!

On Mon, Jan 25, 2021 at 11:45 AM Neil Chen <carpenter.nail.cz@gmail.com> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> Greetings,
>
> Although I am not an expert in this field, I carefully read the full-text search section in the document. I think the
changeis surprising, but yes, it is correct.
 
> I found that your patch didn't modify the regress/excepted/tsearch.out. So I updated it and carried out the
regressiontest. It passed. Also, I manually executed some test cases, all of which were OK.
 

Thank you for looking into this.  Yes, I've adjusted tsearch.sql
regression tests to provide reasonable exercises for the new logic,
but forgot to add tsearch.out to the patch.

BTW, you mentioned you read the documentation.  Do you think it needs
to be adjusted accordingly to the patch?

------
Regards,
Alexander Korotkov



Re: Phrase search vs. multi-lexeme tokens

From
Neil Chen
Date:
Hi Alexander,


On Mon, Jan 25, 2021 at 11:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

BTW, you mentioned you read the documentation.  Do you think it needs
to be adjusted accordingly to the patch?

 
Yes, I checked section 8.11, section 9.13 and Chapter 12 of the document. The change of this patch did not conflict with the document, because it was not mentioned in the document at all. We can simply not modify it, or we can supplement these situations.

--
There is no royal road to learning.
HighGo Software Co.

Re: Phrase search vs. multi-lexeme tokens

From
Alexander Korotkov
Date:
On Tue, Jan 26, 2021 at 4:31 AM Neil Chen <carpenter.nail.cz@gmail.com> wrote:
> On Mon, Jan 25, 2021 at 11:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>
>>
>> BTW, you mentioned you read the documentation.  Do you think it needs
>> to be adjusted accordingly to the patch?
>>
>
> Yes, I checked section 8.11, section 9.13 and Chapter 12 of the document. The change of this patch did not conflict
withthe document, because it was not mentioned in the document at all. We can simply not modify it, or we can
supplementthese situations. 

I've checked the docs myself and I think you're right (despite that's
surprising for me).  It seems that this patch just changes
undocumented aspects of full-text search to be more consistent and
intuitive.

The revised patch is attached.  This revision adds just comment and
commit message.  I'm going to push this if no objections.

------
Regards,
Alexander Korotkov

Attachment