Re: atomic pin/unpin causing errors - Mailing list pgsql-hackers

From Andres Freund
Subject Re: atomic pin/unpin causing errors
Date
Msg-id 20160430001055.nc2rdgw3uqkckd4j@alap3.anarazel.de
In response to atomic pin/unpin causing errors  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
> 
> It looks like they first appear in:
> 
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
> 
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
> 
> 
> I get the errors:
> 
> ERROR:  attempted to delete invisible tuple
> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
> 
> And also:
> 
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT:  update foo set count=count+1 where text_array @> $1
> 
> Once these errors start occurring, they happen often.  Usually the
> "attempted to delete invisible tuple" happens first.

That seems to implicate clog or vacuuming, or something along those
lines.


> These errors show up after about 9 hours of run time.  The timing is
> predictable enough that I don't think it is a purely stochastic race
> condition.

Hm. I have a hard time believing that timing like that could be
caused by the above patch. How confident are you that it's that patch,
and not just changed timing due to performance changes?  And can you
definitely only reproduce the problem with the regular crash cycles?


> It seems like some counter variable is overflowing.  But
> it is not the ShmemVariableCache->nextXid counter, as I previously
> speculated.  This test does not advance that fast enough for it to
> wrap around within 9 hours of run time.  But I am at a loss as to what
> other variable it might be. Since the system goes through a crash and
> recovery every few seconds, any backend-local counters or
> shared-memory counters would get reset upon recovery.  Right?

A lot of those counters will be reset based on WAL contents. So if
they're corrupted once, several of them are prone to stay corrupted
across recoveries.
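
As a rough sanity check on the wraparound argument quoted above, here is
a back-of-the-envelope sketch. It assumes a 32-bit transaction ID counter
and takes the 9-hour figure from the report; the function name is
hypothetical, and the actual transaction rate of the test is not given in
this thread.

```python
# Rough check: how fast would a fixed-width counter have to advance
# to wrap around within a given run time?

COUNTER_BITS = 32   # assumption: the xid counter is 32 bits wide
RUN_HOURS = 9       # from the report: errors appear after about 9 hours


def required_rate(bits: int, hours: float) -> float:
    """Increments per second needed to exhaust a `bits`-wide counter in `hours`."""
    return (2 ** bits) / (hours * 3600)


if __name__ == "__main__":
    full = required_rate(COUNTER_BITS, RUN_HOURS)
    half = required_rate(COUNTER_BITS - 1, RUN_HOURS)
    print(f"full 2^32 wrap:    {full:,.0f} increments/s")
    print(f"half-space (2^31): {half:,.0f} increments/s")
```

A workload would have to sustain well over a hundred thousand
transactions per second for the full counter to wrap in that time, which
supports ruling out a plain nextXid wraparound for this test.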

Greetings,

Andres Freund
