BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term |
Date | |
Msg-id | 16865-4ffdc3e682e6d75b@postgresql.org Whole thread Raw |
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 16865 Logged by: Dimitri Nüscheler Email address: dimitri.nuescheler@gmail.com PostgreSQL version: 13.1 Operating system: Debian Description: Hello I'm writing a small search engine application that also supports negation (exclude results that contain terms starting with string). After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery when matched to a tsvector using the @@ operator no longer works properly, so I started bisecting the commit history and tracked it down to this commit (see the query and expected and actual result further below): > commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad) > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: Fri Jul 24 15:26:51 2020 -0400 > > Fix assorted bugs by changing TS_execute's callback API to ternary logic. > ... I'm still working on creating a reproducible test-case without having to share company data. I'm also trying to understand the code as a fun exercise. I can at least share some of the queries and result data: DDL: CREATE TABLE IF NOT EXISTS sherlock_catalog ( uri varchar, description varchar NOT NULL, metadata varchar NOT NULL, textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english', sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED, last_seen timestamptz, PRIMARY KEY (uri) ); CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar AS $$ BEGIN RETURN lower(regexp_replace(regexp_replace(regexp_replace(str, '[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1 \2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g')); END; $$ LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT; Query: SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE '%Discontinued%' LIMIT 2; Result: 2 rows Expected result: 0 rows, because the textsearch vector contains: 'discontinu':86 Plan: Limit (cost=130.61..12789.24 rows=1 width=136) -> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1 width=136) Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery) Filter: ((metadata)::text ~~ '%Discontinued%'::text) -> Bitmap Index Scan on sherlock_catalog_textsearch (cost=0.00..130.61 rows=3548 width=0) Index Cond: (textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery) The generated plan is structurally the same for PostgreSQL 12.5 respectively versions before that commit, but if I alter the plan using (SET enable_*scan = off) so that the planner comes up with a sequential scan, I will get the expected results. Some other results (count): Count of the same query: user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE '%Discontinu%'; count ------- 4962 (1 Zeile) Negation, but not a prefix search: user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE '%Discontinu%'; count ------- 0 (1 Zeile) Without negation: user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE '%Discontinu%'; count ------- 13127 (1 Zeile) Without the "discontinu" term at all user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%'; count ------- 13127 (1 Zeile) So it seems like the negated query manages to filter out some data, but not all - as if it failed to recheck and definitely determine the TS_YES respectively TS_NO answer from an in-precise TS_MAYBE answer from an unprecise index-based answer (without position information?)? if I even understand this remotely correctly, I'm new to this. I'll try to find out more and to prepare shareable data that reproduces the problem, but I also wonder if I manage to dive into the code and understand something about it :-) Kind Regards Dimitri Nüscheler
pgsql-bugs by date: