I've bisected the errors I was seeing, discussed in
http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
It looks like they first appear in:
commit 48354581a49c30f5757c203415aa8412d85b0f70
Author: Andres Freund <andres@anarazel.de>
Date: Sun Apr 10 20:12:32 2016 -0700
Allow Pin/UnpinBuffer to operate in a lockfree manner.
I get the errors:
ERROR: attempted to delete invisible tuple
STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
And also:
ERROR: unexpected chunk number 1 (expected 2) for toast value
85223889 in pg_toast_16424
STATEMENT: update foo set count=count+1 where text_array @> $1
Once these errors start occurring, they happen often. Usually the
"attempted to delete invisible tuple" happens first.
These errors show up after about 9 hours of run time. The timing is
predictable enough that I don't think it is a purely stochastic race
condition. It seems like some counter variable is overflowing. But
it is not the ShmemVariableCache->nextXid counter, as I previously
speculated. This test does not advance that fast enough for it to
wrap around within 9 hours of run time. But I am at a loss as to what
other variable it might be. Since the system goes through a crash and
recovery every few seconds, any backend-local counters or
shared-memory counters would get reset upon recovery. Right?
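
As a rough sanity check on the overflow idea, the following Python sketch
computes how fast a counter would have to advance to wrap within the 9-hour
window. It is not part of the test harness. The function name and the
candidate counter widths are assumptions chosen for illustration, and it
assumes an unsigned counter that starts at zero and is never reset. A 32-bit
counter would need on the order of 130,000 increments per second.

```python
def wrap_rate_per_sec(bits: int, hours: float) -> float:
    """Increments per second needed for an unsigned counter of the
    given width (hypothetical) to wrap around within `hours`."""
    return (2 ** bits) / (hours * 3600.0)


if __name__ == "__main__":
    # Candidate widths are guesses; the actual counter, if any, is unknown.
    for bits in (16, 31, 32):
        rate = wrap_rate_per_sec(bits, 9.0)
        print(f"{bits}-bit counter: {rate:,.1f} increments/sec to wrap in 9 hours")
```

Comparing these rates against the observed rate of any suspect counter would
show whether it could plausibly wrap in that time.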
I think the invisible tuple referred to might be a tuple in the toast
table, not in the parent table.
I don't see the problem with a cassert-enabled build, probably because it
is just too slow to ever reach the point where the problem occurs.
Any suggestions about where or how to look? I don't know if the
"attempted to delete invisible tuple" is the bug itself, or is just
tripping over corruption left behind by someone else.
(This was all run using Teodor's test-enabling patch
gin_alone_cleanup-4.patch, so as not to change horses in midstream.
Now that a version of that patch has been committed, I will try to
repeat this in HEAD)
Cheers,
Jeff