Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
Date
Msg-id 20160510211556.rumt74jrhqhsaxqx@alap3.anarazel.de
In response to Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On 2016-05-10 13:17:52 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> >> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > The easy way to trigger this problem would be to have an oid wraparound
> >> > - but the WAL shows that that's not the case here.  I've not figured
> >> > that one out entirely (and won't tonight). But I do see WAL records
> >> > like:
> >> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> >> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> 
> Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
> records, between those two records you show?

Yes, check http://www.postgresql.org/message-id/20160510210013.2akn4iee7gl4ycen@alap3.anarazel.de

I think the explanation about how the bug is occurring there makes sense.
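
For anyone wanting to answer the "what sits between those two records" question mechanically, here is a minimal sketch that filters pg_xlogdump-style text output. It assumes each record is on one line in the format quoted above; the helper names (lsn_to_int, interesting_records) are made up for illustration and are not part of any PostgreSQL tool.

```python
import re

# Pulls the resource manager, LSN and description out of one dump line.
RECORD_RE = re.compile(
    r"rmgr:\s+(?P<rmgr>\S+).*?"
    r"lsn:\s+(?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+).*?"
    r"desc:\s+(?P<desc>.*)$"
)


def lsn_to_int(lsn):
    """Convert an 'X/Y' LSN (two hex halves) into a single 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def interesting_records(lines, start_lsn, end_lsn):
    """Return (lsn, desc) for NEXTOID and CHECKPOINT_SHUTDOWN records
    located strictly between start_lsn and end_lsn."""
    lo, hi = lsn_to_int(start_lsn), lsn_to_int(end_lsn)
    found = []
    for line in lines:
        m = RECORD_RE.search(line)
        if not m or m.group("rmgr") != "XLOG":
            continue  # not a parsable line, or not an XLOG rmgr record
        desc = m.group("desc").strip()
        if not desc.startswith(("NEXTOID", "CHECKPOINT_SHUTDOWN")):
            continue
        if lo < lsn_to_int(m.group("lsn")) < hi:
            found.append((m.group("lsn"), desc))
    return found
```

Feeding it the dump lines together with the two LSNs above (2/12004018 and 2/1327EA08) lists whatever NEXTOID or shutdown-checkpoint records fall in between.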


> My current test harness updates the scalar count field on every
> iteration, but changes the (probably toasted) text_array field with a
> probability of only 1% each time.  Perhaps making that more likely (by
> changing line 186 of count.pl) would make it easier to trigger the
> bug.  I'll try that in my next iteration of tests.
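
To make the workload shape described above concrete, here is a rough sketch of the per-iteration decision. It is not the actual count.pl; the function names and the probability parameter are assumptions for illustration, and only the column names (count, text_array) and the 1% figure come from the description.

```python
import random


def plan_update(rng, text_array_prob=0.01):
    """Return the list of columns one iteration would update."""
    columns = ["count"]  # the scalar field changes on every iteration
    if rng.random() < text_array_prob:
        # occasionally rewrite the (probably toasted) text_array value
        columns.append("text_array")
    return columns


def toast_update_fraction(iterations, text_array_prob, seed=0):
    """Fraction of iterations that touch text_array, for a given probability."""
    rng = random.Random(seed)
    hits = sum(
        "text_array" in plan_update(rng, text_array_prob)
        for _ in range(iterations)
    )
    return hits / iterations
```

Raising text_array_prob makes toast-table churn, and with it new toast OID assignment, proportionally more frequent, which is the knob being proposed.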

So my current theory about why the whole thing is kinda hard to
reproduce is that "luck" determines how aggressively the toast table is
vacuumed, and how often it actually succeeds in being vacuumed. You also
need a good bit of bad luck for the hint bits set by GetNewOidWithIndex() to
not survive, given that shared_buffers is pretty small *and* checksums
are enabled.

I guess testing with a bigger shared memory and without checksums will
make it easier to hit the bug.

Regards,

Andres


