Thread: Phrase search vs. multi-lexeme tokens
Hackers, I'm investigating the bug report [1] about the behavior of websearch_to_tsquery() with quotes and multi-lexeme tokens. See the example below. # select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"'); ?column? ---------- f So, tsvector doesn't match tsquery, when absolutely the same text was put to the to_tsvector() and to the quotes of websearch_to_tsquery(). Looks wrong to me. Let's examine output of to_tsvector() and websearch_to_tsquery(). # select to_tsvector('pg_class foo'); to_tsvector -------------------------- 'class':2 'foo':3 'pg':1 # select websearch_to_tsquery('"pg_class foo"'); websearch_to_tsquery ------------------------------ ( 'pg' & 'class' ) <-> 'foo' (1 row) So, 'pg_class' token was split into two lexemes 'pg' and 'class'. But the output websearch_to_tsquery() connects 'pg' and 'class' with & operator. tsquery expects 'pg' and 'class' to be both neighbors of 'foo'. So, 'pg' and 'class' are expected to share the same position, and that isn't true for tsvector. Let's see how phraseto_tsquery() handles that. # select to_tsvector('pg_class foo') @@ phraseto_tsquery('pg_class foo'); ?column? ---------- t # select phraseto_tsquery('pg_class foo'); phraseto_tsquery ---------------------------- 'pg' <-> 'class' <-> 'foo' phraseto_tsquery() connects all the lexemes with phrase operators and everything works OK. For me it's obvious that phraseto_tsquery() and websearch_to_tsquery() with quotes should work the same way. Noticeably, current behavior of websearch_to_tsquery() is recorded in the regression tests. So, it might look that this behavior is intended, but it's too ridiculous and I think the regression tests contain oversight as well. I've prepared a fix, which doesn't break the fts parser abstractions too much (attached patch), but I've faced another similar issue in to_tsquery(). # select to_tsvector('pg_class foo') @@ to_tsquery('pg_class <-> foo'); ?column? ---------- f # select to_tsquery('pg_class <-> foo'); to_tsquery ------------------------------ ( 'pg' & 'class' ) <-> 'foo' I think if a user writes 'pg_class <-> foo', then it's expected to match 'pg_class foo' independently on which lexemes 'pg_class' is split into. This issue looks like the much more complex design bug in phrase search. Fixing this would require some kind of readahead or multipass processing, because we don't know how to process 'pg_class' in advance. Is this really a design bug existing in phrase search from the beginning. Or am I missing something? Links 1. https://www.postgresql.org/message-id/16592-70b110ff9731c07d%40postgresql.org ------ Regards, Alexander Korotkov
Attachment
On Thu, Nov 12, 2020 at 4:09 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > This issue looks like the much more complex design bug in phrase > search. Fixing this would require some kind of readahead or multipass > processing, because we don't know how to process 'pg_class' in > advance. > > Is this really a design bug existing in phrase search from the > beginning. Or am I missing something? No feedback yet. I've added this to the commitfest to don't lose track of this. https://commitfest.postgresql.org/31/2854/ ------ Regards, Alexander Korotkov
Alexander Korotkov <aekorotkov@gmail.com> writes: > # select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"'); > ?column? > ---------- > f Yeah, surely this is wrong. > # select to_tsquery('pg_class <-> foo'); > to_tsquery > ------------------------------ > ( 'pg' & 'class' ) <-> 'foo' > I think if a user writes 'pg_class <-> foo', then it's expected to > match 'pg_class foo' independently on which lexemes 'pg_class' is > split into. Indeed. It seems to me that this: regression=# select to_tsquery('pg_class'); to_tsquery ---------------- 'pg' & 'class' (1 row) is wrong all by itself. Now that we have phrase search, a much saner translation would be "'pg' <-> 'class'". If we fixed that then it seems like the more complex case would just work. I read your patch over quickly and it seems like a reasonable approach (but sadly underdocumented). Can we extend the idea to fix the to_tsquery case? regards, tom lane
Hi! On Wed, Jan 6, 2021 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alexander Korotkov <aekorotkov@gmail.com> writes: > > # select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"'); > > ?column? > > ---------- > > f > > Yeah, surely this is wrong. Thank you for confirming my thoughts. I also felt that is wrong but doubted such a basic bug could exist for so long. > > # select to_tsquery('pg_class <-> foo'); > > to_tsquery > > ------------------------------ > > ( 'pg' & 'class' ) <-> 'foo' > > > I think if a user writes 'pg_class <-> foo', then it's expected to > > match 'pg_class foo' independently on which lexemes 'pg_class' is > > split into. > > Indeed. It seems to me that this: > > regression=# select to_tsquery('pg_class'); > to_tsquery > ---------------- > 'pg' & 'class' > (1 row) > > is wrong all by itself. Now that we have phrase search, a much > saner translation would be "'pg' <-> 'class'". If we fixed that > then it seems like the more complex case would just work. Nice idea! Fixing this way should be much easier than fixing only the case when we have the phrase operator on the upper level. > I read your patch over quickly and it seems like a reasonable > approach (but sadly underdocumented). Can we extend the idea > to fix the to_tsquery case? Sure, I'll provide a revised patch. ------ Regards, Alexander Korotkov
On Thu, Jan 7, 2021 at 6:36 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > I read your patch over quickly and it seems like a reasonable > > approach (but sadly underdocumented). Can we extend the idea > > to fix the to_tsquery case? > > Sure, I'll provide a revised patch. The next version of the patch is attached. Now, it just makes to_tsquery() and websearch_to_tsquery() use phrase operator to connect multiple lexemes of the same tsquery token. I leave plainto_tsquery() aside because it considers the whole argument as a single token. Changing it would make it an equivalent of phraseto_tsquery(). ------ Regards, Alexander Korotkov
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested Greetings, Although I am not an expert in this field, I carefully read the full-text search section in the document. I think the changeis surprising, but yes, it is correct. I found that your patch didn't modify the regress/excepted/tsearch.out. So I updated it and carried out the regression test.It passed. Also, I manually executed some test cases, all of which were OK.
Hi, Neil! On Mon, Jan 25, 2021 at 11:45 AM Neil Chen <carpenter.nail.cz@gmail.com> wrote: > > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation: not tested > > Greetings, > > Although I am not an expert in this field, I carefully read the full-text search section in the document. I think the changeis surprising, but yes, it is correct. > I found that your patch didn't modify the regress/excepted/tsearch.out. So I updated it and carried out the regressiontest. It passed. Also, I manually executed some test cases, all of which were OK. Thank you for looking into this. Yes, I've adjusted tsearch.sql regression tests to provide reasonable exercises for the new logic, but forgot to add tsearch.out to the patch. BTW, you mentioned you read the documentation. Do you think it needs to be adjusted accordingly to the patch? ------ Regards, Alexander Korotkov
Hi Alexander,
On Mon, Jan 25, 2021 at 11:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
BTW, you mentioned you read the documentation. Do you think it needs
to be adjusted accordingly to the patch?
Yes, I checked section 8.11, section 9.13 and Chapter 12 of the document. The change of this patch did not conflict with the document, because it was not mentioned in the document at all. We can simply not modify it, or we can supplement these situations.
There is no royal road to learning.
HighGo Software Co.
On Tue, Jan 26, 2021 at 4:31 AM Neil Chen <carpenter.nail.cz@gmail.com> wrote: > On Mon, Jan 25, 2021 at 11:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: >> >> >> BTW, you mentioned you read the documentation. Do you think it needs >> to be adjusted accordingly to the patch? >> > > Yes, I checked section 8.11, section 9.13 and Chapter 12 of the document. The change of this patch did not conflict withthe document, because it was not mentioned in the document at all. We can simply not modify it, or we can supplementthese situations. I've checked the docs myself and I think you're right (despite that's surprising for me). It seems that this patch just changes undocumented aspects of full-text search to be more consistent and intuitive. The revised patch is attached. This revision adds just comment and commit message. I'm going to push this if no objections. ------ Regards, Alexander Korotkov