Thread: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses
BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18479
Logged by:          Manos Emmanouilidis
Email address:      esemmano@gmail.com
PostgreSQL version: 15.4
Operating system:   macOS Sonoma 14.4.1 x86
Description:

Although the docs
https://www.postgresql.org/docs/current/textsearch-controls.html say nothing
about websearch_to_tsquery supporting parentheses in queries, I noticed some
inconsistent behaviour when using multiple 'or' keywords with parentheses in
postgres 15.4.

In the following tests 1-3 the behaviour matches what I expect, but in the 4th
test the 'or' keyword is used verbatim when parentheses are present in the
middle of the query.

select websearch_to_tsquery('german', 'foo or baz bar');
 websearch_to_tsquery
-----------------------
 'foo' | 'baz' & 'bar'

select websearch_to_tsquery('german', 'foo or baz bar or ding dong');
 websearch_to_tsquery
-----------------------------------------
 'foo' | 'baz' & 'bar' | 'ding' & 'dong'

select websearch_to_tsquery('german', 'foo or baz bar or (ding dong)');
 websearch_to_tsquery
-----------------------------------------
 'foo' | 'baz' & 'bar' | 'ding' & 'dong'

select websearch_to_tsquery('german', 'foo or (baz bar) or (ding dong)');
 websearch_to_tsquery
------------------------------------------------
 'foo' | 'baz' & 'bar' & 'or' & 'ding' & 'dong'

I do not mean to say that this is necessarily a bug that needs fixing, but
maybe either the docs should call out that using parens leads to undefined
behaviour, or the function's logic could be updated to be consistent in the
presence of parens - either ignore them completely or always consider them,
but not both depending on the input.
Re: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> Although the docs
> https://www.postgresql.org/docs/current/textsearch-controls.html say nothing
> about websearch_to_tsquery supporting parentheses in queries, I noticed some
> inconsistent behaviour when using multiple 'or' keywords with parentheses in
> postgres 15.4

The definition of websearch_to_tsquery says pretty plainly that
"Other punctuation is ignored".  So I'd expect parens to do nothing.
That makes this problematic:

> select websearch_to_tsquery('german', 'foo or baz bar or (ding dong)');
>  websearch_to_tsquery
> -----------------------------------------
>  'foo' | 'baz' & 'bar' | 'ding' & 'dong'

> select websearch_to_tsquery('german', 'foo or (baz bar) or (ding dong)');
>  websearch_to_tsquery
> ------------------------------------------------
>  'foo' | 'baz' & 'bar' & 'or' & 'ding' & 'dong'

I found what seems to be the issue in gettoken_query_websearch:
it ignores ISOPERATOR chars (including parens) in WAITOPERAND state,
but not in WAITOPERATOR state.  That results in switching back to
WAITOPERAND state which will consume the "or" as a regular word.
So a minimal fix could look like the attached.

It's fairly confusing that this code manages to ignore not-ISOPERATOR
punctuation.  It seems like that gets eaten by gettoken_tsvector()
and then later we decide there's not really a word there.

I'm also confused how come the same thing doesn't happen in the
english tsconfig.  Not sure it's worth poking at more, though.

regards, tom lane

diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 690a80d774..eb08e912ea 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -492,6 +492,12 @@ gettoken_query_websearch(TSQueryParserState state, int8 *operator,
 					*operator = OP_OR;
 					return PT_OPR;
 				}
+				else if (ISOPERATOR(state->buf))
+				{
+					/* ignore other operators here too */
+					state->buf++;
+					continue;
+				}
 				else if (*state->buf == '\0')
 				{
 					return PT_END;
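
For readers who want to see the state-machine problem in isolation, here is a
rough toy model in Python. It is not the tsquery.c code: the function name
tokenize, the flag skip_operators_in_waitoperator, and the simplified "or"
detection are all assumptions made for illustration. It only mimics the two
states named above, with the flag standing in for the proposed extra branch.

```python
# Toy model of the two parser states discussed above; NOT the real tsquery.c logic.
# OPERATOR_CHARS is an assumption about which characters ISOPERATOR() matches.
OPERATOR_CHARS = set("!&|()")


def is_or_keyword(query, i):
    """Simplified check: 'or' followed by a character that cannot continue a word."""
    if query[i:i + 2].lower() != "or" or i + 2 >= len(query):
        return False
    nxt = query[i + 2]
    return not (nxt.isalnum() or nxt in "-_")


def tokenize(query, skip_operators_in_waitoperator):
    """Return the token stream as a string, e.g. "foo | baz & bar"."""
    tokens, i, n = [], 0, len(query)
    state = "WAITOPERAND"
    while i < n:
        ch = query[i]
        if state == "WAITOPERAND":
            if ch.isspace() or ch in OPERATOR_CHARS:
                i += 1  # operator characters are ignored in this state
                continue
            j = i
            while j < n and not query[j].isspace() and query[j] not in OPERATOR_CHARS:
                j += 1
            tokens.append(query[i:j])  # note: "or" is just a word here
            i, state = j, "WAITOPERATOR"
        else:  # WAITOPERATOR
            if ch.isspace():
                i += 1
            elif is_or_keyword(query, i):
                tokens.append("|")
                i, state = i + 2, "WAITOPERAND"
            elif skip_operators_in_waitoperator and ch in OPERATOR_CHARS:
                i += 1  # models the added branch: stay in WAITOPERATOR
            else:
                tokens.append("&")  # implicit AND, input is not consumed
                state = "WAITOPERAND"
    while tokens and tokens[-1] in ("&", "|"):
        tokens.pop()  # drop a dangling operator left by a trailing paren
    return " ".join(tokens)


query = "foo or (baz bar) or (ding dong)"
print(tokenize(query, False))  # foo | baz & bar & or & ding & dong
print(tokenize(query, True))   # foo | baz & bar | ding & dong
```

In the unpatched mode the ")" after "bar" triggers an implicit AND and a switch
to WAITOPERAND, so the following "or" is read as an ordinary word. With the
flag set, the ")" is skipped while the model is still waiting for an operator,
so "or" is recognized as OR.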
Re: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses
From
Tom Lane
Date:
[ couldn't let go of this ... ]

I wrote:
> It's fairly confusing that this code manages to ignore not-ISOPERATOR
> punctuation.  It seems like that gets eaten by gettoken_tsvector()
> and then later we decide there's not really a word there.

Yeah, further investigation shows that such cases effectively
act like stopwords: they are passed back to makepol() as VAL
strings, but then lexize processing rejects them as not words.

> I'm also confused how come the same thing doesn't happen in the
> english tsconfig.  Not sure it's worth poking at more, though.

D'oh: "or" is a stopword in the english config.  The english
case is still wrong of course, just differently:

regression=# select websearch_to_tsquery('english', 'foo or (baz bar) or (ding dong)');
          websearch_to_tsquery
-----------------------------------------
 'foo' | 'baz' & 'bar' & 'ding' & 'dong'
(1 row)

regards, tom lane
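
The difference between the two configurations can be sketched with another toy
Python model. Everything here is an assumption for illustration: the function
drop_stopwords, the flat token list, and the tiny stopword sets (small subsets,
not the full lists shipped with the configs). The real code removes stopwords
from a query tree, not from a flat list.

```python
# Toy illustration of why the misparsed "or" is visible in german but not english.
# The stopword sets are small assumed subsets chosen for this example only.
ENGLISH_STOP = {"or", "and", "the"}
GERMAN_STOP = {"oder", "und", "der"}


def drop_stopwords(tokens, stopwords):
    """Remove stopword operands together with the operator that preceded them."""
    kept = []
    for tok in tokens:
        if tok in ("&", "|"):
            kept.append(tok)
        elif tok.lower() in stopwords:
            if kept and kept[-1] in ("&", "|"):
                kept.pop()  # the operator that joined the stopword goes too
        else:
            kept.append(tok)
    while kept and kept[0] in ("&", "|"):
        kept.pop(0)  # a leading stopword leaves a leading operator behind
    return kept


# Token stream the unpatched parser produces for
# 'foo or (baz bar) or (ding dong)': the second "or" arrives as a word.
misparsed = ["foo", "|", "baz", "&", "bar", "&", "or", "&", "ding", "&", "dong"]

print(" ".join(drop_stopwords(misparsed, GERMAN_STOP)))
# foo | baz & bar & or & ding & dong
print(" ".join(drop_stopwords(misparsed, ENGLISH_STOP)))
# foo | baz & bar & ding & dong
```

Under these assumptions both configurations receive the same misparsed token
stream. The english one merely hides the stray "or" by discarding it as a
stopword, and the intended OR between the two parenthesized groups is lost
either way.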