Re: txid failed epoch increment, again, aka 6291 - Mailing list pgsql-hackers

From Daniel Farina
Subject Re: txid failed epoch increment, again, aka 6291
Date
Msg-id CAAZKuFbDRuvL7i5_wheWYud7yFf69Nmnq+0XTBfTCFyR0B_gAw@mail.gmail.com
In response to Re: txid failed epoch increment, again, aka 6291  (Noah Misch <noah@leadboat.com>)
Responses Re: txid failed epoch increment, again, aka 6291  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> I might try to find the segments leading up to the overflow point and
>> try xlogdumping them to see what we can see.
>
> That would be helpful to see.
>
> Just to grasp at yet-flimsier straws, could you post (URL preferred, else
> private mail) the output of "objdump -dS" on your "postgres" executable?

https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Sure.  It's 9.0.6 with the pg_cancel_backend by-same-role change
backported, plus the standard Debian changes, so nothing interesting
should be going on beyond what compilers on this platform normally
do.  I am also starting to grovel through this assembly myself,
although I don't have much experience finding problems this way.

To save you a tiny bit of time aligning the assembly with the C, this line:

  c797f:    e8 7c c9 17 00           callq  244300 <LWLockAcquire>

seems to be the beginning of:

  LWLockAcquire(XidGenLock, LW_SHARED);
  checkPoint.nextXid = ShmemVariableCache->nextXid;
  checkPoint.oldestXid = ShmemVariableCache->oldestXid;
  checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
  LWLockRelease(XidGenLock);
 


>> If there's anything to note about the workload, I'd say that it does
>> tend to make fairly pervasive use of long running transactions which
>> can span probably more than one checkpoint, and the txid reporting
>> functions, and a concurrency level of about 300 or so backends ... but
>> per my reading of the mechanism so far, it doesn't seem like any of
>> this should matter.
>
> Thanks for the details; I agree none of that sounds suspicious.
>
> After some further pondering and testing, this remains a mystery to me.  These
> symptoms imply a proper update of ControlFile->checkPointCopy.nextXid without
> having properly updated ControlFile->checkPointCopy.nextXidEpoch.  After
> recovery, only CreateCheckPoint() updates ControlFile->checkPointCopy at all.
> Its logic for doing so looks simple and correct.

Yeah.  I'm pretty flabbergasted that so much seems to be going right
while this goes wrong.
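
For anyone following along, here is a minimal standalone sketch of the
checkpoint-time epoch logic being discussed, as I read it: the epoch is
carried over from the previous checkpoint copy and bumped when nextXid
appears to have gone backwards.  The names SketchCheckPoint,
sketch_next_checkpoint and sketch_txid64 are made up for illustration
and are not PostgreSQL identifiers, and the 64-bit composition is a
simplification of what the txid functions do.

```c
#include <stdint.h>

typedef uint32_t TransactionId;     /* XIDs are 32 bits wide */

/* Hypothetical stand-in for the two checkPointCopy fields of interest. */
typedef struct SketchCheckPoint
{
    uint32_t      nextXidEpoch;     /* number of XID wraparounds seen */
    TransactionId nextXid;          /* next XID to be assigned */
} SketchCheckPoint;

/*
 * Build the new checkpoint record from the previous one and the current
 * next XID.  A smaller nextXid than last time is taken to mean the
 * counter wrapped, so the epoch is incremented.
 */
static SketchCheckPoint
sketch_next_checkpoint(SketchCheckPoint prev, TransactionId cur_next_xid)
{
    SketchCheckPoint cp;

    cp.nextXid = cur_next_xid;
    cp.nextXidEpoch = prev.nextXidEpoch;
    if (cp.nextXid < prev.nextXid)
        cp.nextXidEpoch++;
    return cp;
}

/* Simplified 64-bit txid: epoch in the high word, XID in the low word. */
static uint64_t
sketch_txid64(SketchCheckPoint cp)
{
    return ((uint64_t) cp.nextXidEpoch << 32) | cp.nextXid;
}
```

Under these assumptions, the symptom in this thread corresponds to a
record whose nextXid has wrapped to a small value while nextXidEpoch
still holds the old value, which makes the composed 64-bit txid jump
backwards by roughly 2^32.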

-- 
fdr