Re: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses
Msg-id 2130969.1718316260@sss.pgh.pa.us
In response to BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> Although the docs
> https://www.postgresql.org/docs/current/textsearch-controls.html say nothing
> about websearch_to_tsquery supporting parentheses in queries, I noticed some
> inconsistent behaviour when using multiple 'or' keywords with parentheses in
> postgres 15.4

The definition of websearch_to_tsquery says pretty plainly that
"Other punctuation is ignored".  So I'd expect parens to do nothing.
That makes this problematic:

> select websearch_to_tsquery('german', 'foo or baz bar or (ding dong)');
>           websearch_to_tsquery
> -----------------------------------------
>  'foo' | 'baz' & 'bar' | 'ding' & 'dong'

> select websearch_to_tsquery('german', 'foo or (baz bar) or (ding dong)');
>               websearch_to_tsquery
> ------------------------------------------------
>  'foo' | 'baz' & 'bar' & 'or' & 'ding' & 'dong'


I found what seems to be the issue in gettoken_query_websearch: it
ignores ISOPERATOR chars (including parens) in WAITOPERAND state,
but not in WAITOPERATOR state.  There, a paren is taken as the start
of a new operand, so we emit an implicit AND and switch back to
WAITOPERAND state, which then consumes the "or" as a regular word.
So a minimal fix could look like the attached.
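
To make the state switching concrete, here is a toy model of the two
states in C.  It is only a sketch, not the PostgreSQL code: the names
toy_websearch(), toy_is_operator() and toy_is_or() are made up for
illustration, the operator set is assumed to be "!&|()", and quoting,
negation, multibyte handling and dictionary normalization are all left
out.  The "fix" flag stands in for the extra branch in the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for ISOPERATOR(); the character set is an assumption. */
static bool toy_is_operator(char c)
{
    return c != '\0' && strchr("!&|()", c) != NULL;
}

/* Toy "or" detection: the letters "or" followed by whitespace. */
static bool toy_is_or(const char *p)
{
    return tolower((unsigned char) p[0]) == 'o' &&
           tolower((unsigned char) p[1]) == 'r' &&
           isspace((unsigned char) p[2]);
}

/*
 * Render a simplified tsquery for "in" into "out".  With fix = false,
 * operator characters are skipped only in WAITOPERAND; with fix = true,
 * they are skipped in WAITOPERATOR as well.
 */
static void toy_websearch(const char *in, bool fix, char *out, size_t outsz)
{
    enum { WAITOPERAND, WAITOPERATOR } state = WAITOPERAND;
    const char *pending = NULL; /* operator waiting for its right operand */
    size_t      len = 0;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';

    while (*in != '\0')
    {
        unsigned char c = (unsigned char) *in;

        if (state == WAITOPERAND)
        {
            if (isspace(c) || toy_is_operator(*in))
            {
                in++;           /* ignored before an operand */
                continue;
            }

            /* read a word up to whitespace or an operator character */
            const char *start = in;

            while (*in != '\0' && !isspace((unsigned char) *in) &&
                   !toy_is_operator(*in))
                in++;

            int n = snprintf(out + len, outsz - len, "%s'%.*s'",
                             pending ? pending : "",
                             (int) (in - start), start);

            if (n < 0 || (size_t) n >= outsz - len)
                return;         /* truncated; good enough for a sketch */
            len += (size_t) n;
            pending = NULL;
            state = WAITOPERATOR;
        }
        else                    /* WAITOPERATOR */
        {
            if (toy_is_or(in))
            {
                pending = " | ";
                in += 2;
                state = WAITOPERAND;
            }
            else if (fix && toy_is_operator(*in))
                in++;           /* the patched behavior: skip it here too */
            else if (isspace(c))
                in++;
            else
            {
                /* anything else starts a new operand: implicit AND */
                pending = " & ";
                state = WAITOPERAND;
            }
        }
    }
}
```

Calling toy_websearch("foo or (baz bar) or (ding dong)", false, buf,
sizeof buf) leaves the second "or" as an operand, as in the report;
with fix = true it is taken as an OR operator instead.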

It's fairly confusing that this code manages to ignore non-ISOPERATOR
punctuation.  It seems like that gets eaten by gettoken_tsvector()
and then later we decide there's not really a word there.

I'm also confused about why the same thing doesn't happen with the
english tsconfig.  Not sure it's worth poking at more, though.

            regards, tom lane

diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 690a80d774..eb08e912ea 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -492,6 +492,12 @@ gettoken_query_websearch(TSQueryParserState state, int8 *operator,
                     *operator = OP_OR;
                     return PT_OPR;
                 }
+                else if (ISOPERATOR(state->buf))
+                {
+                    /* ignore other operators here too */
+                    state->buf++;
+                    continue;
+                }
                 else if (*state->buf == '\0')
                 {
                     return PT_END;
