Thread: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses

The following bug has been logged on the website:

Bug reference:      18479
Logged by:          Manos Emmanouilidis
Email address:      esemmano@gmail.com
PostgreSQL version: 15.4
Operating system:   macOS Sonoma 14.4.1 x86
Description:

Although the docs
https://www.postgresql.org/docs/current/textsearch-controls.html say nothing
about websearch_to_tsquery supporting parentheses in queries, I noticed some
inconsistent behaviour when using multiple 'or' keywords with parentheses in
postgres 15.4

In tests 1-3 below the behaviour matches what I expect, but in the 4th
test the second 'or' is treated as an ordinary search term rather than
as an operator when parentheses appear in the middle of the query.

select websearch_to_tsquery('german', 'foo or baz bar');
 websearch_to_tsquery
-----------------------
 'foo' | 'baz' & 'bar'

select websearch_to_tsquery('german', 'foo or baz bar or ding dong');
          websearch_to_tsquery
-----------------------------------------
 'foo' | 'baz' & 'bar' | 'ding' & 'dong'

select websearch_to_tsquery('german', 'foo or baz bar or (ding dong)');
          websearch_to_tsquery
-----------------------------------------
 'foo' | 'baz' & 'bar' | 'ding' & 'dong'

select websearch_to_tsquery('german', 'foo or (baz bar) or (ding dong)');
              websearch_to_tsquery
------------------------------------------------
 'foo' | 'baz' & 'bar' & 'or' & 'ding' & 'dong'

I do not mean to say that this is necessarily a bug that needs fixing, but
maybe either the docs should call out that using parens leads to undefined
behaviour, or the function's logic could be updated to be consistent in the
presence of parens: either ignore them completely or always honor them,
rather than doing one or the other depending on the input.


PG Bug reporting form <noreply@postgresql.org> writes:
> Although the docs
> https://www.postgresql.org/docs/current/textsearch-controls.html say nothing
> about websearch_to_tsquery supporting parentheses in queries, I noticed some
> inconsistent behaviour when using multiple 'or' keywords with parentheses in
> postgres 15.4

The definition of websearch_to_tsquery says pretty plainly that
"Other punctuation is ignored".  So I'd expect parens to do nothing.
That makes this problematic:

> select websearch_to_tsquery('german', 'foo or baz bar or (ding dong)');
>           websearch_to_tsquery
> -----------------------------------------
>  'foo' | 'baz' & 'bar' | 'ding' & 'dong'

> select websearch_to_tsquery('german', 'foo or (baz bar) or (ding dong)');
>               websearch_to_tsquery
> ------------------------------------------------
>  'foo' | 'baz' & 'bar' & 'or' & 'ding' & 'dong'


I found what seems to be the issue in gettoken_query_websearch: it
ignores ISOPERATOR chars (including parens) in WAITOPERAND state,
but not in WAITOPERATOR state.  There, a paren is taken as the start
of a new operand, so we emit an implicit AND and switch back to
WAITOPERAND state, which then consumes the "or" as a regular word.
So a minimal fix could look like the attached.
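
To make the state-machine explanation above concrete, here is a toy
model in C.  It is only a sketch, not the code in tsquery.c:
toy_websearch(), toy_isoperator(), toy_at_or() and the operator
character set "!&|()" are assumptions made for illustration, and
quoting, '-' negation and the real "or" lookahead are left out.  The
"fixed" flag turns on an extra ISOPERATOR branch in WAITOPERATOR state.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of the two tokenizer states discussed above. */
typedef enum { WAITOPERAND, WAITOPERATOR } ToyState;

/* Assumed stand-in for ISOPERATOR(); the real macro may differ. */
static bool toy_isoperator(char c)
{
    return c != '\0' && strchr("!&|()", c) != NULL;
}

/* Simplified: "or" (any case) followed by whitespace is the OR keyword. */
static bool toy_at_or(const char *p)
{
    return tolower((unsigned char) p[0]) == 'o' &&
           tolower((unsigned char) p[1]) == 'r' &&
           isspace((unsigned char) p[2]);
}

static void toy_append(char *out, size_t outsz, const char *s, size_t n)
{
    size_t used = strlen(out);

    assert(used + n < outsz);   /* sketch: caller supplies a big enough buffer */
    memcpy(out + used, s, n);
    out[used + n] = '\0';
}

/*
 * Write a flat rendering of the query into out.
 * fixed = false models the reported behaviour,
 * fixed = true models skipping operator chars in WAITOPERATOR too.
 */
static void toy_websearch(const char *in, bool fixed, char *out, size_t outsz)
{
    ToyState    st = WAITOPERAND;
    const char *op = NULL;      /* operator waiting for its right operand */
    const char *p = in;

    assert(outsz > 0);
    out[0] = '\0';

    while (*p != '\0')
    {
        if (isspace((unsigned char) *p))
            p++;
        else if (st == WAITOPERAND)
        {
            if (toy_isoperator(*p))
                p++;            /* punctuation is ignored in this state */
            else
            {
                const char *start = p;

                while (*p != '\0' && !isspace((unsigned char) *p) &&
                       !toy_isoperator(*p))
                    p++;
                if (op != NULL && out[0] != '\0')
                    toy_append(out, outsz, op, strlen(op));
                toy_append(out, outsz, start, (size_t) (p - start));
                op = NULL;
                st = WAITOPERATOR;
            }
        }
        else if (toy_at_or(p))
        {
            p += 2;
            op = " | ";
            st = WAITOPERAND;
        }
        else if (fixed && toy_isoperator(*p))
            p++;                /* the fix: skip it, stay in WAITOPERATOR */
        else
        {
            op = " & ";         /* implicit AND; the char is not consumed */
            st = WAITOPERAND;
        }
    }
}
```

Calling toy_websearch() on 'foo or (baz bar) or (ding dong)' with
fixed = false gives "foo | baz & bar & or & ding & dong", the same
shape as the reported output; with fixed = true it gives
"foo | baz & bar | ding & dong".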

It's fairly confusing that this code manages to ignore not-ISOPERATOR
punctuation.  It seems like that gets eaten by gettoken_tsvector()
and then later we decide there's not really a word there.

I'm also confused how come the same thing doesn't happen in the
english tsconfig.  Not sure it's worth poking at more, though.

            regards, tom lane

diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 690a80d774..eb08e912ea 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -492,6 +492,12 @@ gettoken_query_websearch(TSQueryParserState state, int8 *operator,
                     *operator = OP_OR;
                     return PT_OPR;
                 }
+                else if (ISOPERATOR(state->buf))
+                {
+                    /* ignore other operators here too */
+                    state->buf++;
+                    continue;
+                }
                 else if (*state->buf == '\0')
                 {
                     return PT_END;

[ couldn't let go of this ... ]

I wrote:
> It's fairly confusing that this code manages to ignore not-ISOPERATOR
> punctuation.  It seems like that gets eaten by gettoken_tsvector()
> and then later we decide there's not really a word there.

Yeah, further investigation shows that such cases effectively act
like stopwords: they are passed back to makepol() as VAL strings,
but then lexize processing rejects them as not words.

> I'm also confused how come the same thing doesn't happen in the
> english tsconfig.  Not sure it's worth poking at more, though.

D'oh: "or" is a stopword in the english config.  The english case
is still wrong of course, just differently:

regression=# select websearch_to_tsquery('english', 'foo or (baz bar) or (ding dong)');
          websearch_to_tsquery
-----------------------------------------
 'foo' | 'baz' & 'bar' & 'ding' & 'dong'
(1 row)

            regards, tom lane