Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows - Mailing list pgsql-bugs

From Alexander Korotkov
Subject Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Date
Msg-id CAPpHfdu41xL4z6LXt6-0USA8cZ0Nm8YtEA7B8ZZwrBWcc+mA5w@mail.gmail.com
Whole thread Raw
In response to IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-bugs
(On Thu, Jun 24, 2021 at 3:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 17/06/2021 22:49, Tom Lane wrote:
> > I wrote:
> >> Pawel Kudzia <kudzia@gmail.com> writes:
> >>> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> >>> or less makes the problem go away:
> >
> >> Hmm.  So that suggests that the index itself is *not* corrupt,
> >> but the problem is associated with a bug in the indexscan
> >> algorithms.
> >
> > After staring at the code I think there is at least one bug, and
> > possibly two, in shimTriConsistentFn.  That's likely to be implicated
> > here because intarray's GIN opclass only provides a bool consistent
> > function.  I'm not very clear on the circumstances that lead to the scan
> > code inventing GIN_MAYBE inputs, so I haven't been able to construct a
> > test case.
> >
> > The definite bug is triggered because intarray relies on the API
> > specification that says that if the search mode is
> > GIN_SEARCH_MODE_DEFAULT, then the consistentFn will only be called
> > when there's at least one TRUE key:
> >
> >          case RTOverlapStrategyNumber:
> >              /* result is not lossy */
> >              *recheck = false;
> >              /* at least one element in check[] is true, so result = true */
> >              res = true;
> >              break;
> >
> > shimTriConsistentFn ignores this and calls it on all-FALSE inputs,
> > for which it'll incorrectly get a TRUE result, as it will also for
> > all the following checks.  The upshot is that shimTriConsistentFn
> > will convert any case with a MAYBE input to a hard TRUE result,
> > with no recheck requirement.  This'd easily explain the reported
> > misbehavior.
>
> That's subtle. The documentation says:
>
> >        If <literal>*searchMode</literal> is set to
> >        <literal>GIN_SEARCH_MODE_DEFAULT</literal> (which is the value it is
> >        initialized to before call), only items that match at least one of
> >        the returned keys are considered candidate matches.
>
> I guess you can read that as "consistentFn will only be called when
> there is at least one matching item", but that didn't occur to me when I
> read that at first. But I agree we need to fix shimTriConsistentFn to
> respect that.
>
> > The other thing that I'm unsure about, because the code is horribly
> > underdocumented, is that it's not very clear whether
> > shimTriConsistentFn is accurately converting between the bool and
> > the tristate APIs.  That's because it's not very clear what the
> > tristate API actually *is*.  In particular, is the end state of
> > key->recheckCurItem meaningful in the tristate case?  If it's not,
> > then the short-circuit case for no MAYBE inputs is broken, because
> > it'll return TRUE when the bool consistentFn is trying to tell us
> > to recheck.  But if it is meaningful, then the multiway case is broken,
> > because it will return the recheckCurItem result set by the last call to
> > the bool consistentfn; which might be false even though other calls said
> > true.  (So in that case I think we'd need "key->recheckCurItem = recheck"
> > at the end.)  I also wonder how any of that works correctly for real
> > triconsistent functions, which don't have access to the recheckCurItem
> > flag.
> >
> > Anyway, I'm punting this to the code authors.  I'd like to see
> > some comments clarifying what the API really is, not just a
> > quick-n-dirty code fix.
>
> I've been trying to create a test case for this, but no luck so far. I
> cannot find a codepath that would hit these bugs with the kind of query
> that Pawel used. The search condition is very simple: "column &&
> '{constant}'", with only one key and one constant to search for. There
> are three calls to triConsistentFn:
>
> 1. In startScanKey, but that's only reached if (key->nentries > 1), so
> not reachable with Pawel's query.
>
> 2. In keyGetItem, when haveLossyEntry==true. But AFAICS, haveLossyEntry
> is never true with Pawel's query. That would require curItem to be a
> lossy pointer, and that in turn can only happen when matchBitmap is
> used, and that's only used with partial match queries and with queries
> that need a full-index scan.
>
> 3. The second call in keyGetItem is reached, but it is only passed
> GIN_MAYBE when curItem is lossy. Which isn't possible with this query,
> see point 2.
>
> AFAICS the recheckCurItem bug should also not cause wrong results with
> that query. I must be missing something.
>
> I could probably construct a test case with a different query, one
> involving multiple elements in the search key, but I'd like to have a
> solid explanation for the original report.

+1,
I see query provided by Pawel is too simple.  Bugs, which Tom digged
in shimTriConsistentFn() don't seem to explain the wrong query results
provided by Pawel.

> Pawel, is this a production system, or would it be possible for you to
> build from sources with a patch with some extra debugging output?

Do you think it also worth checking whether bug persists when set
fastupdate = off.  Then we could localize whether bug is related to
pending lists.

------
Regards,
Alexander Korotkov



pgsql-bugs by date:

Previous
From: Alexander Lakhin
Date:
Subject: Re: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
Next
From: Alexander Korotkov
Date:
Subject: Re: BUG #17066: Cache lookup failed when null (iso-8859-1) is passed as anycompatiblemultirange