Re: GIN data corruption bug(s) in 9.6devel - Mailing list pgsql-hackers

From Noah Misch
Subject Re: GIN data corruption bug(s) in 9.6devel
Date
Msg-id 20160422060046.GC2042217@tornado.leadboat.com
In response to Re: GIN data corruption bug(s) in 9.6devel  (Teodor Sigaev <teodor@sigaev.ru>)
Responses Re: GIN data corruption bug(s) in 9.6devel  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
> >>Added, see attached patch (based on v3.1)
> >
> >With this applied, I am getting a couple errors I have not seen before
> >after extensive crash recovery testing:
> >ERROR:  attempted to delete invisible tuple
> >ERROR:  unexpected chunk number 1 (expected 2) for toast value
> >100338365 in pg_toast_16425
> Huh, seems, it's not related to GIN at all... Indexes don't play with toast
> machinery. The single place where this error can occur is a heap_delete() -
> deleting already deleted tuple.

Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
I get the impression Jeff has a test case that he had run in many iterations
against the unpatched baseline.  I also get the impression that a similar or
smaller number of its iterations against gin_alone_cleanup-4.patch triggered
these two errors (once apiece, or multiple times?).  Jeff, is that right?  If
so, until we determine the cause, we should assume the cause arrived in
gin_alone_cleanup-4.patch.  An error in pointer arithmetic or locking might
corrupt an unrelated buffer, leading to this symptom.

> >I've restarted the test harness with intentional crashes turned off,
> >to see if the problems are related to crash recovery or are more
> >generic than that.
> >
> >I've never seen these particular problems before, so don't have much
> >insight into what might be going on or how to debug it.

Could you describe the test case in sufficient detail for Teodor to reproduce
your results?

> Check my reasoning: In version 4 I added remembering of the tail of the
> pending list in the blknoFinish variable. When we read the page that was the
> tail at cleanup start, we set the cleanupFinish variable, and after cleaning
> that page we stop further cleanup. Any insert made during cleanup will be
> placed after blknoFinish (corner case: in that page), so vacuum should not
> miss tuples marked as deleted.

Would any hacker volunteer to review Teodor's reasoning here?
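
For anyone picking up the review, here is a toy model of the stopping rule as
I read Teodor's description. It is only a sketch, not code from
gin_alone_cleanup-4.patch: blknoFinish and cleanupFinish are the names from
his message, while ToyPendingList, toy_append, toy_cleanup, and the array
representation of the pending list are hypothetical and exist only for
illustration. Locking, WAL, and the "insert lands in the tail page itself"
corner case are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PAGES 16

/* Toy stand-in for the pending list: pages are kept in list order. */
typedef struct
{
    int  blkno[MAX_PAGES];    /* block number of each pending-list page */
    bool cleaned[MAX_PAGES];  /* true once cleanup has processed the page */
    int  npages;
} ToyPendingList;

/* Append a page at the tail, as a concurrent inserter would. */
static void
toy_append(ToyPendingList *pl, int blkno)
{
    if (pl->npages < MAX_PAGES)
    {
        pl->blkno[pl->npages] = blkno;
        pl->cleaned[pl->npages] = false;
        pl->npages++;
    }
}

/*
 * Walk the list from the head, stopping after the page that was the tail
 * when cleanup started.  After each processed page, simulate one concurrent
 * insert (while any remain) that appends a new page at the tail.
 * Returns the number of pages cleaned.
 */
static int
toy_cleanup(ToyPendingList *pl, int concurrent_inserts)
{
    int  ncleaned = 0;
    bool cleanupFinish = false;
    int  blknoFinish;

    if (pl->npages == 0)
        return 0;

    /* Remember the tail as it was at cleanup start. */
    blknoFinish = pl->blkno[pl->npages - 1];

    for (int i = 0; i < pl->npages && !cleanupFinish; i++)
    {
        /* Reached the original tail: clean it, then stop. */
        if (pl->blkno[i] == blknoFinish)
            cleanupFinish = true;

        pl->cleaned[i] = true;
        ncleaned++;

        /* Hypothetical concurrent insert landing after the old tail. */
        if (concurrent_inserts > 0)
        {
            toy_append(pl, 1000 + concurrent_inserts);
            concurrent_inserts--;
        }
    }
    return ncleaned;
}
```

In this model, pages appended during the walk are never visited, and every
page present at cleanup start is visited exactly once. Whether the real patch
preserves that property under its actual locking is the part that needs
review.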

Thanks,
nm


