Re: atomic pin/unpin causing errors - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: atomic pin/unpin causing errors
Date	April 30, 2016 03:11:02
Msg-id	20160430001055.nc2rdgw3uqkckd4j@alap3.anarazel.de
In response to	atomic pin/unpin causing errors (Jeff Janes <jeff.janes@gmail.com>)
List	pgsql-hackers

Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
> 
> It looks like they first appear in:
> 
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
> 
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
> 
> 
> I get the errors:
> 
> ERROR:  attempted to delete invisible tuple
> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
> 
> And also:
> 
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT:  update foo set count=count+1 where text_array @> $1
> 
> Once these errors start occurring, they happen often.  Usually the
> "attempted to delete invisible tuple" happens first.

That kind of seems to implicate clog/vacuuming or something along
those lines.


> These errors show up after about 9 hours of run time.  The timing is
> predictable enough that I don't think it is a purely stochastic race
> condition.

Hm. I have a bit of a hard time believing that such timing could be
caused by the above patch. How confident are you that it's that patch,
and not just changed timing due to performance changes?  And you
definitely can only reproduce the problem with the regular crash cycles?


> It seems like some counter variable is overflowing.  But
> it is not the ShmemVariableCache->nextXid counter, as I previously
> speculated.  This test does not advance that fast enough for it to
> wrap around within 9 hours of run time.  But I am at a loss as to what
> other variable it might be. Since the system goes through a crash and
> recovery every few seconds, any backend-local counters or
> shared-memory counters would get reset upon recovery.  Right?

A lot of those counters will be re-set based on WAL contents. So if
they're corrupted once, several of them are prone to continue to be
corrupted.
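
To make that concrete, here is a toy model of the effect. It is only a
sketch, not PostgreSQL code: LogRecord, recover_counter() and simulate()
are hypothetical names, and the "log" is a single struct standing in for
whatever WAL record carries the counter. The assumption is that each
recovery re-seeds the in-memory counter from the last logged value, so a
single bad value survives every later crash.

```c
#include <stdint.h>

/* Hypothetical stand-in for a WAL record that carries a counter value. */
typedef struct
{
    uint32_t next_counter;
} LogRecord;

/*
 * Hypothetical recovery step: the in-memory counter is rebuilt from the
 * last logged value rather than starting again from zero.
 */
uint32_t
recover_counter(const LogRecord *last)
{
    return last->next_counter;
}

/*
 * Run `cycles` crash/recovery rounds. In round `corrupt_at` the counter is
 * overwritten once with `garbage` (pass -1 to never corrupt it). Returns
 * the counter value after the last round.
 */
uint32_t
simulate(int cycles, int corrupt_at, uint32_t garbage)
{
    LogRecord log = {0};
    uint32_t  counter = 0;

    for (int i = 0; i < cycles; i++)
    {
        /* Crash: the in-memory value is lost and re-seeded from the log. */
        counter = recover_counter(&log);

        /* Normal work between crashes advances the counter. */
        counter += 10;

        /* A single corruption event. */
        if (i == corrupt_at)
            counter = garbage;

        /* The current value is logged before the next crash. */
        log.next_counter = counter;
    }
    return counter;
}
```

With five rounds and no corruption, simulate(5, -1, 0) returns 50. With
one corruption in the second round, simulate(5, 1, 4000000000u) returns
4000000030: the crashes that follow do not clear the bad value, they just
keep building on it.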

Greetings,

Andres Freund
