Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows - Mailing list pgsql-bugs

From Pawel Kudzia
Subject Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Msg-id CAJYBUS8gs4XMWHSL1H=VpDBxkad2yV2ND0H9XVxY4mtbmyz2gg@mail.gmail.com
In response to Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Pawel Kudzia <kudzia@gmail.com>)
List pgsql-bugs
On Mon, Aug 2, 2021 at 10:07 AM Pawel Kudzia <kudzia@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Pawel Kudzia <kudzia@gmail.com> writes:
> > > with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> > > or less makes the problem go away:
> >
> > Hmm.  So that suggests that the index itself is *not* corrupt,
> > but the problem is associated with a bug in the indexscan
> > algorithms.
> >
> > Have you experimented with different index opclasses?  Your
> > original report said you were using gin__int_ops, but that's
> > not the default, and we sort of deprecate intarray these days.
>
> Hello,
>
> Three weeks ago we changed from
>
> CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids public.gin__int_ops);
>
> into
>
> CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids);
>
> Sadly, the problems are back.
>
> I have:
> * queries that return incorrect responses:
> SET enable_seqscan = OFF;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1485}' ) AND NOT ( (attribute_name_ids||0) && '{1485}') LIMIT 10;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{144}' ) AND NOT ( (attribute_name_ids||0) && '{144}') LIMIT 10;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10;
>
> all three return:
>  entity_id |      primname
> -----------+--------------------
>   99311962 | msci_674591ltepsgt
>   99312880 | msci_674667mgu
>
> data=# explain analyze SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10;
>                                                                       QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=30.59..31.71 rows=10 width=8) (actual time=149.988..200.749 rows=2 loops=1)
>    ->  Bitmap Heap Scan on entity  (cost=30.59..3798.81 rows=33644 width=8) (actual time=149.985..200.744 rows=2 loops=1)
>          Recheck Cond: (attribute_name_ids && '{2652}'::integer[])
>          Rows Removed by Index Recheck: 79741
>          Filter: (NOT ((attribute_name_ids || 0) && '{2652}'::integer[]))
>          Rows Removed by Filter: 16723
>          Heap Blocks: exact=8277 lossy=9989
>          ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..30.59 rows=33813 width=0) (actual time=9.712..9.713 rows=37367 loops=1)
>                Index Cond: (attribute_name_ids && '{2652}'::integer[])
>  Planning Time: 0.472 ms
>  Execution Time: 200.977 ms
> (11 rows)
>
>
> * the times when the hourly consistency check first reported the
> inconsistency [ 2021-08-01 19:04:15 for entity_id=99312880,
> 2021-08-01 23:04:15 for entity_id=99311962 ]
> * the timestamps when the affected rows were updated
> [ entity_id=99312880: 2021-08-01 16:51:59, 2021-08-01 19:04:06
> [ most likely the update that led to the inconsistency ],
> 2021-08-01 23:04:14, 2021-08-02 01:08:18, 2021-08-02 05:12:15;
> entity_id=99311962: 2021-08-01 16:51:59, 2021-08-01 19:04:05,
> 2021-08-01 23:04:13 [ likely culprit ], 2021-08-02 01:08:16 ]
> * WAL files from the **master server** from the time when the
> inconsistency started to occur [ from 2021-08-01 14:50:37 till
> 2021-08-02 04:31:00; sadly I don't have enough to cover the whole
> period between the snapshots mentioned below ]
> * file-level snapshots of the database taken from the streaming
> **replication slave** before the problem started to occur
> [ 2021-08-01 12:00:01 ] and after it occurred [ 2021-08-02 00:00:01 ]
>
> I'll be happy to run any diagnostics needed or to provide access to
> any particular WAL file.
>

Hello,

We continue to run into this issue and we still do not have any
reliable method to reproduce it. Is there any diagnostic information
that we could provide to make it easier to resolve?
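
To make the failure mode easier to see at a glance, the hourly check that
flags the inconsistent rows boils down to the query below (a simplified
sketch; 1485 stands in for whichever attribute id is being verified):

SET enable_seqscan = OFF;
-- the first && predicate is answered from the GIN index, while appending 0
-- to the array keeps the second && from using the index, so the NOT clause
-- is re-evaluated against the heap tuple
SELECT entity_id
FROM entity
WHERE ( attribute_name_ids && '{1485}' )
  AND NOT ( (attribute_name_ids || 0) && '{1485}' )
LIMIT 10;

With a healthy index this returns no rows; every row it does return is one
for which the index reports the attribute as present while the heap tuple
does not actually contain it.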

Thank you!


-- 
regards,
Pawel Kudzia


