atomic pin/unpin causing errors - Mailing list pgsql-hackers

From Jeff Janes
Subject atomic pin/unpin causing errors
Date
Msg-id CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com
Responses Re: atomic pin/unpin causing errors  (Andres Freund <andres@anarazel.de>)
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
I've bisected the errors I was seeing, discussed in
http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com

It looks like they first appear in:

commit 48354581a49c30f5757c203415aa8412d85b0f70
Author: Andres Freund <andres@anarazel.de>
Date:   Sun Apr 10 20:12:32 2016 -0700
   Allow Pin/UnpinBuffer to operate in a lockfree manner.


I get the errors:

ERROR:  attempted to delete invisible tuple
STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2

And also:

ERROR:  unexpected chunk number 1 (expected 2) for toast value 85223889 in pg_toast_16424
STATEMENT:  update foo set count=count+1 where text_array @> $1

Once these errors start occurring, they happen often.  Usually the
"attempted to delete invisible tuple" happens first.

These errors show up after about 9 hours of run time.  The timing is
predictable enough that I don't think it is a purely stochastic race
condition.  It seems like some counter variable is overflowing.  But
it is not the ShmemVariableCache->nextXid counter, as I previously
speculated.  This test does not advance that fast enough for it to
wrap around within 9 hours of run time.  But I am at a loss as to what
other variable it might be. Since the system goes through a crash and
recovery every few seconds, any backend-local counters or
shared-memory counters would get reset upon recovery.  Right?
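
As a rough sanity check on the overflow hypothesis, the sketch below
computes the sustained increment rate a counter would need in order to
wrap in about 9 hours.  The counter widths are assumptions chosen for
illustration; nothing here identifies which counter, if any, is
actually involved.

```python
# Back-of-envelope sketch (hypothetical helper, not PostgreSQL code):
# what increment rate wraps an unsigned counter of a given width
# within the ~9 hours of run time observed above?

SECONDS_TO_FAILURE = 9 * 3600  # roughly 9 hours, taken from the report


def rate_to_wrap(bits: int, seconds: int = SECONDS_TO_FAILURE) -> float:
    """Increments per second needed to overflow a `bits`-wide unsigned counter."""
    return (2 ** bits) / seconds


# Widths below are illustrative assumptions only.
for bits in (16, 31, 32):
    print(f"{bits}-bit counter: ~{rate_to_wrap(bits):,.0f} increments/s")
```

A 32-bit counter would need on the order of 130,000 increments per
second, sustained across every crash and recovery, to wrap in that
time.  Any candidate counter can be compared against that rate.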

I think the invisible tuple referred to might be a tuple in the toast
table, not in the parent table.

I don't see the problem with a cassert-enabled build, probably because it
is just too slow to ever reach the point where the problem occurs.

Any suggestions about where or how to look?  I don't know if the
"attempted to delete invisible tuple" is the bug itself, or is just
tripping over corruption left behind by someone else.

(This was all run using Teodor's test-enabling patch
gin_alone_cleanup-4.patch, so as not to change horses in midstream.
Now that a version of that patch has been committed, I will try to
repeat this on HEAD.)

Cheers,

Jeff


