BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows - Mailing list pgsql-bugs
From:     PG Bug reporting form
Subject:  BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Msg-id:   16792-b1913b6b4e098331@postgresql.org
Responses: Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
List:     pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16792
Logged by:          pawel kudzia
Email address:      kudzia@gmail.com
PostgreSQL version: 11.10
Operating system:   Debian Buster x86_64

Description:

I'd like to ask for advice on how to pinpoint the cause of the silent corruption of a GIN index that I'm facing.

Relevant part of the table structure:

CREATE TABLE public.entity (
    entity_id bigint NOT NULL,
    attribute_value_ids integer[] NOT NULL,
    attribute_name_ids integer[] NOT NULL,
    [..]
);

ALTER TABLE ONLY public.entity
    ADD CONSTRAINT entity_pkey PRIMARY KEY (entity_id);

CREATE INDEX entity_attribute_name_ids_gin ON public.entity
    USING gin (attribute_name_ids public.gin__int_ops);

CREATE INDEX entity_attribute_value_ids_gin ON public.entity
    USING gin (attribute_value_ids public.gin__int_ops);

How does the issue manifest?

Queries that use the GIN index on an integer[] column occasionally return too many rows, including rows that do not actually match the criteria expressed in the WHERE clause. Queries like the one below should never return any rows, yet occasionally they do:

data=> SELECT entity_id FROM entity
       WHERE ( attribute_name_ids && '{4980}' )
         AND NOT ( (attribute_name_ids||0) && '{4980}' );
 entity_id
-----------
  31213924
  31195117
  31209184
(3 rows)

Query plan:

                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entity  (cost=50.39..3068.11 rows=26923 width=22)
   Recheck Cond: (attribute_name_ids && '{4980}'::integer[])
   Filter: (NOT ((attribute_name_ids || 0) && '{4980}'::integer[]))
   ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..50.39 rows=27058 width=0)
         Index Cond: (attribute_name_ids && '{4980}'::integer[])
(5 rows)

The query was crafted intentionally: PostgreSQL answers the left side of the AND from the GIN index entity_attribute_name_ids_gin, while the right side forces an inspection of the actual array values, filtering out rows that do not really match. Any row that survives both conditions is therefore a row where the index and the heap disagree.

What have I done so far?
Originally the problem was discovered when running PostgreSQL 11.7 from Debian's repository. As a first step we upgraded to PostgreSQL 11.9, also from Debian's repository, and ran:

    reindex (verbose) table entity;
    vacuum (verbose, full, analyze) entity;

After a few days of updates the problem returned. We then upgraded to PostgreSQL 11.10 from the postgresql.org repository for Debian and reindexed / vacuumed again. After a few weeks of updates the problem returned once more.

Other information worth noting:

* the table in question has 38M rows and is the only table created after the PostgreSQL installation
* the server in question is the master of a streaming replication setup; the issue also occurs on the slave servers, which suggests corruption of the on-disk GIN index data
* simply rewriting the rows falsely returned by the SELECT above - UPDATE entity SET attribute_name_ids = '{....}' WHERE entity_id = 123 - fixes the issue
* we've set up rudimentary consistency checks that catch some of the corruptions within 1h of their occurrence; so far corruption has only affected rows that were updated since the previous check
* the server in question handles heavy read/write traffic
* the PostgreSQL instance in question runs in an LXC container with Debian Buster, on top of Debian Buster running on a bare-metal server
* it's highly unlikely that a hardware malfunction is to blame: the same physical server also handles a few MySQL instances with hundreds of GB of data, with heavy consistency checks cross-checking the content of MySQL against PostgreSQL, unpacking gzipped blobs stored in the MySQL databases and de-serializing the objects stored there. If there were memory bit-rot or bit-flips that went undetected or uncorrected by ECC, these high-level consistency checks would have picked them up.
* despite numerous attempts I cannot reproduce the issue in a test environment, possibly due to the much lower level of write traffic; it takes days to weeks for the problem to reoccur on the production server

Thank you in advance for your suggestions on how to tackle this.

best regards,
Pawel
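[A sketch of how the "self-contradicting query" probe above could be extended into a periodic consistency check, e.g. for the second GIN index. This is not from the original report; the value id 123 is a placeholder, and the enable_* settings are standard PostgreSQL planner GUCs used here to force a heap-based rerun of the same predicate:]

```sql
-- Canary query against entity_attribute_value_ids_gin: appending a harmless
-- element (|| 0) prevents index use for the second overlap test, so any row
-- returned means the GIN index and the heap disagree. '{123}' is a placeholder.
SELECT entity_id
FROM entity
WHERE (attribute_value_ids && '{123}')
  AND NOT ((attribute_value_ids || 0) && '{123}');

-- Cross-check: rerun the plain predicate with index-driven plans disabled
-- for this transaction only, then diff the result set against the
-- index-driven run of the same query.
BEGIN;
SET LOCAL enable_bitmapscan = off;
SET LOCAL enable_indexscan  = off;
SELECT entity_id FROM entity WHERE attribute_value_ids && '{123}';
ROLLBACK;
```

Any entity_id present in the index-driven result but absent from the sequential-scan result would point at a corrupted index entry for that row.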