Thread: txid failed epoch increment, again, aka 6291
It seems like this has reproduced once more. And once again, there doesn't
appear to be any funny business in pg_control (but the structs are pasted
here for your re-check), and there are successful sensical updates to it.
The primary is running 9.0.6.

However, we do have a new piece of data: there was a very brief period
where txid_snapshot did report an xmin greater than 2^33, our next
epoch boundary, by a few thousand transactions. That could be because
the reporting function GetNextXidAndEpoch does its own epoch
calculation: before the faulty checkpoint it adjusts the epoch on the
fly, but after a checkpoint that forgets to increment the epoch there
is no longer any post-facto adjustment for it to apply.

I've been reviewing the code in CreateCheckPoint on and off for a
couple of days, but so far I haven't come up with a convincing
mechanism for the failure. However, given that historically this bug
seems more likely to surface than *not* surface on this system,
perhaps we can try for a sometimes-reproducing test case. I'm still
struggling for a hint of a solution, though, so toss your thoughts
here.

$2 = {Insert = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0}, Flush = {
        xlogid = 0, xrecoff = 0}}, PrevRecord = {xlogid = 0, xrecoff = 0},
    curridx = 0, currpage = 0x7ff4ed04a000, currpos = 0x0, RedoRecPtr = {
      xlogid = 18751, xrecoff = 1200832888}, forcePageWrites = 0 '\000'},
  LogwrtRqst = {Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0,
      xrecoff = 0}}, LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
    Flush = {xlogid = 0, xrecoff = 0}}, ckptXidEpoch = 1, ckptXid = 9904084,
  asyncXactLSN = {xlogid = 0, xrecoff = 0}, lastRemovedLog = 0,
  lastRemovedSeg = 0, Write = {LogwrtResult = {Write = {xlogid = 0,
        xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}}, curridx = 0,
    lastSegSwitchTime = 0}, pages = 0x7ff4ed04a000 "",
  xlblocks = 0x7ff4ed0471d8, XLogCacheBlck = 1023, ThisTimeLineID = 0,
  RecoveryTargetTLI = 6, archiveCleanupCommand = '\000' <repeats 1023 times>,
  SharedRecoveryInProgress = 1 '\001', lastCheckPointRecPtr = {
    xlogid = 18751, xrecoff = 1671519088}, lastCheckPoint = {redo = {
      xlogid = 18751, xrecoff = 1200832888}, ThisTimeLineID = 6,
    nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524,
    nextMulti = 1119, nextMultiOffset = 3115, oldestXid = 4115479553,
    oldestXidDB = 1, time = 1346746796, oldestActiveXid = 9776547},
  replayEndRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastXTime = 400062234671833, info_lck = 0 '\000'}
(gdb) p ControlFile
$3 = (ControlFileData *) 0x7ff4ed046bf8
(gdb) p *ControlFile
$4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
  catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY,
  time = 1346746898, checkPoint = {xlogid = 18751, xrecoff = 1072693824},
  prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
  checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536},
    ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
    nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
    oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
    oldestActiveXid = 9558248}, minRecoveryPoint = {xlogid = 18751,
    xrecoff = 1748623656}, backupStartPoint = {xlogid = 0, xrecoff = 0},
  wal_level = 2, MaxConnections = 500, max_prepared_xacts = 500,
  max_locks_per_xact = 64, maxAlign = 8, floatFormat = 1234567,
  blcksz = 8192, relseg_size = 131072, xlog_blcksz = 8192,
  xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32,
  toast_max_chunk_size = 1996, enableIntTimes = 1 '\001',
  float4ByVal = 1 '\001', float8ByVal = 1 '\001', crc = 3725972657}

-- fdr
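For readers following along, the on-the-fly epoch inference Daniel describes
works roughly as follows. The sketch below is a minimal standalone model,
not the PostgreSQL source; the names (CkptXidInfo, report_txid) and most
values are hypothetical, with 9904084 borrowed from the pg_control dump
above. It shows why the reported xmin can briefly exceed 2^33 and then fall
back once a checkpoint stores the wrapped nextXid without bumping the epoch.

#include <stdint.h>
#include <stdio.h>

/*
 * Minimal standalone model (not PostgreSQL source): the reporting path keeps
 * the (epoch, nextXid) pair saved by the last checkpoint and bumps the epoch
 * on the fly when the live nextXid has wrapped past the saved one.
 */
typedef struct
{
    uint32_t ckptXidEpoch;   /* epoch recorded by the last checkpoint */
    uint32_t ckptXid;        /* nextXid recorded by the last checkpoint */
} CkptXidInfo;

static uint64_t
report_txid(CkptXidInfo ckpt, uint32_t nextXid)
{
    uint32_t epoch = ckpt.ckptXidEpoch;

    /* nextXid is logically >= ckptXid; numerically smaller means it wrapped */
    if (nextXid < ckpt.ckptXid)
        epoch++;

    return ((uint64_t) epoch << 32) | nextXid;
}

int
main(void)
{
    /* Before the post-wrap checkpoint: the saved nextXid is still pre-wrap. */
    CkptXidInfo before = {1, 4294000000u};
    /* After a checkpoint that saved the wrapped nextXid but left epoch at 1. */
    CkptXidInfo after = {1, 9904084u};

    printf("before: %llu\n",
           (unsigned long long) report_txid(before, 5000u));    /* a few thousand past 2^33 */
    printf("after:  %llu\n",
           (unsigned long long) report_txid(after, 9905000u));  /* back below 2^33 */
    return 0;
}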
On Tue, Sep 04, 2012 at 02:07:30AM -0700, Daniel Farina wrote:
> It seems like this has reproduced once more. And once again, there
> doesn't appear to be any funny business in pg_control (but the structs
> are pasted here for your re-check), and there are successful sensical
> updates to it. The primary is running 9.0.6.

What version is the standby running?

> However, we do have a new piece of data: there was a very brief period
> where txid_snapshot did report an xmin greater than 2^33, our next
> epoch boundary, by a few thousand transactions. That could be because
> the reporting function GetNextXidAndEpoch does its own epoch
> calculation: before the faulty checkpoint it adjusts the epoch on the
> fly, but after a checkpoint that forgets to increment the epoch there
> is no longer any post-facto adjustment for it to apply.

That makes sense.

> I've been reviewing the code in CreateCheckPoint on and off for a
> couple of days, but so far I haven't come up with a convincing
> mechanism for the failure. However, given that historically this bug
> seems more likely to surface than *not* surface on this system,
> perhaps we can try for a sometimes-reproducing test case. I'm still
> struggling for a hint of a solution, though, so toss your thoughts
> here.

The cause is not apparent to me, either, and simple tests here do not
reproduce the problem.

I am suspicious of xlog_redo() updating ControlFile->checkPointCopy without
acquiring ControlFileLock. If a restartpoint is finishing at about the same
time, ControlFile->checkPointCopy.nextXid might come from the just-read
checkpoint record while ControlFile->checkPointCopy.nextXidEpoch bears the
value from the older checkpoint just adopted as a restartpoint. The
resulting inconsistency would, however, vanish at the next
ControlFile->checkPointCopy update. This does not explain the symptoms you
have reported, and it cannot explain much of anything on a primary.

> (gdb) p *ControlFile
> $4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
>   catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY,
>   time = 1346746898, checkPoint = {xlogid = 18751, xrecoff = 1072693824},

This capture, it seems, is from a standby.

>   prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
>   checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536},
>     ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
>     nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
>     oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
>     oldestActiveXid = 9558248},

You expected checkPointCopy = { ... nextXidEpoch = 2, ... }, correct? (Well,
nextXidEpoch = 3 if you count the previous missed increment.) Does
pg_control on the primary also bear epoch 1 where epoch 2 is expected?

Your last restartpoint was exactly five minutes before your last checkpoint,
so there's no evidence of a dearth of safe restartpoint opportunities.

nm
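To make the suspected interleaving concrete, here is a deterministic,
single-threaded replay of the field-level mix Noah describes. It is a toy
sketch under assumed values, not PostgreSQL code; the struct, the update
order, and the numbers are all assumptions, chosen only to show how an
unlocked two-field update can end up half old, half new.

#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for the two checkPointCopy fields of interest. */
typedef struct
{
    uint32_t nextXidEpoch;
    uint32_t nextXid;
} ToyCheckPoint;

int
main(void)
{
    ToyCheckPoint control = {1, 4294000000u};   /* current pg_control copy */
    ToyCheckPoint older   = {1, 4294000000u};   /* checkpoint adopted as the restartpoint */
    ToyCheckPoint newer   = {2, 5000u};         /* checkpoint record just read by redo */

    control.nextXidEpoch = newer.nextXidEpoch;  /* redo writes the newer epoch ...          */
    control = older;                            /* ... restartpoint installs the older copy */
    control.nextXid = newer.nextXid;            /* ... redo finishes with the newer nextXid  */

    /* Old epoch paired with the new nextXid, until the next full update. */
    printf("nextXidEpoch = %u, nextXid = %u\n",
           control.nextXidEpoch, control.nextXid);
    return 0;
}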
On Tue, Sep 4, 2012 at 8:04 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Sep 04, 2012 at 02:07:30AM -0700, Daniel Farina wrote:
>> It seems like this has reproduced once more. And once again, there
>> doesn't appear to be any funny business in pg_control (but the structs
>> are pasted here for your re-check), and there are successful sensical
>> updates to it. The primary is running 9.0.6.
>
> What version is the standby running?

We have several, all of which so far agree on the same txid, presumably
syndicated from the primary: one is 9.0.7, another is 9.0.8. Incidentally,
I am also currently preparing a 9.0.9 one that I was going to use for some
forensics.

>> However, we do have a new piece of data: there was a very brief period
>> where txid_snapshot did report an xmin greater than 2^33, our next
>> epoch boundary, by a few thousand transactions. That could be because
>> the reporting function GetNextXidAndEpoch does its own epoch
>> calculation: before the faulty checkpoint it adjusts the epoch on the
>> fly, but after a checkpoint that forgets to increment the epoch there
>> is no longer any post-facto adjustment for it to apply.
>
> That makes sense.
>
>> I've been reviewing the code in CreateCheckPoint on and off for a
>> couple of days, but so far I haven't come up with a convincing
>> mechanism for the failure. However, given that historically this bug
>> seems more likely to surface than *not* surface on this system,
>> perhaps we can try for a sometimes-reproducing test case. I'm still
>> struggling for a hint of a solution, though, so toss your thoughts
>> here.
>
> The cause is not apparent to me, either, and simple tests here do not
> reproduce the problem.
>
> I am suspicious of xlog_redo() updating ControlFile->checkPointCopy
> without acquiring ControlFileLock. If a restartpoint is finishing at
> about the same time, ControlFile->checkPointCopy.nextXid might come
> from the just-read checkpoint record while
> ControlFile->checkPointCopy.nextXidEpoch bears the value from the older
> checkpoint just adopted as a restartpoint. The resulting inconsistency
> would, however, vanish at the next ControlFile->checkPointCopy update.
> This does not explain the symptoms you have reported, and it cannot
> explain much of anything on a primary.
>
>> (gdb) p *ControlFile
>> $4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
>>   catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY,
>>   time = 1346746898, checkPoint = {xlogid = 18751, xrecoff = 1072693824},
>
> This capture, it seems, is from a standby.

Yeah. I can probably also get one from the primary.

>>   prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
>>   checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536},
>>     ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
>>     nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
>>     oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
>>     oldestActiveXid = 9558248},
>
> You expected checkPointCopy = { ... nextXidEpoch = 2, ... }, correct?
> (Well, nextXidEpoch = 3 if you count the previous missed increment.)
> Does pg_control on the primary also bear epoch 1 where epoch 2 is
> expected?

Well, in the aggregate life of this thing as we know it we probably expect
more like 4, or even 5. It bears 1 where 2 is expected, though, to answer
the original question.

> Your last restartpoint was exactly five minutes before your last
> checkpoint, so there's no evidence of a dearth of safe restartpoint
> opportunities.
>
> nm

I might try to find the segments leading up to the overflow point and try
xlogdumping them to see what we can see.

If there's anything to note about the workload, I'd say that it tends to
make fairly pervasive use of long-running transactions that can probably
span more than one checkpoint, and of the txid reporting functions, with a
concurrency level of about 300 backends ... but per my reading of the
mechanism so far, it doesn't seem like any of this should matter.

-- fdr
On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
> I might try to find the segments leading up to the overflow point and
> try xlogdumping them to see what we can see.

That would be helpful to see.

Just to grasp at yet-flimsier straws, could you post (URL preferred, else
private mail) the output of "objdump -dS" on your "postgres" executable?

> If there's anything to note about the workload, I'd say that it tends to
> make fairly pervasive use of long-running transactions that can probably
> span more than one checkpoint, and of the txid reporting functions, with
> a concurrency level of about 300 backends ... but per my reading of the
> mechanism so far, it doesn't seem like any of this should matter.

Thanks for the details; I agree none of that sounds suspicious.

After some further pondering and testing, this remains a mystery to me.
These symptoms imply a proper update of ControlFile->checkPointCopy.nextXid
without having properly updated ControlFile->checkPointCopy.nextXidEpoch.
After recovery, only CreateCheckPoint() updates ControlFile->checkPointCopy
at all. Its logic for doing so looks simple and correct.
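For context, the increment Noah refers to hinges on a single numeric
comparison between the new checkpoint's nextXid and the copy already stored
in pg_control. The sketch below is a toy simulation of that rule as I read
it, not the real CreateCheckPoint() code, and the xid sequence is
hypothetical; it shows that the epoch bumps exactly once at the wrap, and
that one missed bump leaves every later checkpoint an epoch behind.

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t nextXidEpoch;
    uint32_t nextXid;
} ToyCheckPoint;

/*
 * Toy checkpoint rule: inherit the stored epoch and bump it only if the new
 * nextXid is numerically below the previously stored one, i.e. only if a
 * wraparound happened since the last checkpoint.
 */
static ToyCheckPoint
toy_checkpoint(ToyCheckPoint stored, uint32_t newNextXid)
{
    ToyCheckPoint cp;

    cp.nextXidEpoch = stored.nextXidEpoch;
    if (newNextXid < stored.nextXid)
        cp.nextXidEpoch++;          /* first checkpoint after a wrap */
    cp.nextXid = newNextXid;
    return cp;
}

int
main(void)
{
    /* Hypothetical nextXid values at successive checkpoints, across a wrap. */
    uint32_t xids[] = {4293000000u, 4294900000u, 5000u, 9904084u, 20000000u};
    ToyCheckPoint stored = {1, 4291000000u};

    for (int i = 0; i < 5; i++)
    {
        stored = toy_checkpoint(stored, xids[i]);
        printf("checkpoint %d: epoch = %u, nextXid = %u\n",
               i + 1, stored.nextXidEpoch, stored.nextXid);
    }

    /*
     * If the stored pair were ever overwritten with a post-wrap nextXid
     * while still carrying the old epoch, no later checkpoint would ever
     * make up the missing increment.
     */
    return 0;
}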
On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> I might try to find the segments leading up to the overflow point and
>> try xlogdumping them to see what we can see.
>
> That would be helpful to see.
>
> Just to grasp at yet-flimsier straws, could you post (URL preferred, else
> private mail) the output of "objdump -dS" on your "postgres" executable?

https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Sure. It's a 9.0.6 with by-same-role pg_cancel_backend backported, along
with the standard Debian changes, so nothing all that interesting should be
going on beyond what normally goes on with compilers on this platform.

I am also starting to grovel through this assembly, although I don't have a
ton of experience finding problems this way. To save you a tiny bit of time
aligning the assembly with the C, this line

   c797f:       e8 7c c9 17 00          callq  244300 <LWLockAcquire>

seems to be the beginning of:

    LWLockAcquire(XidGenLock, LW_SHARED);
    checkPoint.nextXid = ShmemVariableCache->nextXid;
    checkPoint.oldestXid = ShmemVariableCache->oldestXid;
    checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
    LWLockRelease(XidGenLock);

>> If there's anything to note about the workload, I'd say that it tends to
>> make fairly pervasive use of long-running transactions that can probably
>> span more than one checkpoint, and of the txid reporting functions, with
>> a concurrency level of about 300 backends ... but per my reading of the
>> mechanism so far, it doesn't seem like any of this should matter.
>
> Thanks for the details; I agree none of that sounds suspicious.
>
> After some further pondering and testing, this remains a mystery to me.
> These symptoms imply a proper update of
> ControlFile->checkPointCopy.nextXid without having properly updated
> ControlFile->checkPointCopy.nextXidEpoch. After recovery, only
> CreateCheckPoint() updates ControlFile->checkPointCopy at all. Its logic
> for doing so looks simple and correct.

Yeah. I'm pretty flabbergasted that so much seems to be going right while
this goes wrong.

-- fdr
On Fri, Sep 07, 2012 at 01:37:57AM -0700, Daniel Farina wrote:
> On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
> > > I might try to find the segments leading up to the overflow point and
> > > try xlogdumping them to see what we can see.
> >
> > That would be helpful to see.
> >
> > Just to grasp at yet-flimsier straws, could you post (URL preferred,
> > else private mail) the output of "objdump -dS" on your "postgres"
> > executable?
>
> https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Thanks. Nothing looks amiss there.

I've attached the test harness I used to try reproducing this. It worked
through over 500 epoch increments without a hitch; clearly, it fails to
reproduce an essential aspect of your system. Could you attempt to modify
it in the direction of better-resembling your production workload until it
reproduces the problem?

nm
[Attachment: test harness]
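Noah's actual harness is in the attachment and is not reproduced here. As a
rough illustration of the kind of observation such a harness (or a
production monitor) can make, here is a hedged libpq sketch that simply
polls txid_current() and reports any backwards jump, the user-visible
symptom of a missed epoch increment. The connection string, file name, and
one-second interval are placeholders.

/* Build with something like: cc txid_watch.c -lpq */
#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    PGconn             *conn = PQconnectdb("dbname=postgres");  /* placeholder DSN */
    unsigned long long  last = 0;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        /* Note: txid_current() assigns a real xid, so polling burns xids. */
        PGresult   *res = PQexec(conn, "SELECT txid_current()");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            unsigned long long cur = strtoull(PQgetvalue(res, 0, 0), NULL, 10);

            if (last != 0 && cur < last)
                fprintf(stderr, "txid went backwards: %llu -> %llu\n",
                        last, cur);
            last = cur;
        }
        PQclear(res);
        sleep(1);
    }
}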
On Fri, Sep 7, 2012 at 5:49 AM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Sep 07, 2012 at 01:37:57AM -0700, Daniel Farina wrote:
>> On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
>> > On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> >> I might try to find the segments leading up to the overflow point and
>> >> try xlogdumping them to see what we can see.
>> >
>> > That would be helpful to see.
>> >
>> > Just to grasp at yet-flimsier straws, could you post (URL preferred,
>> > else private mail) the output of "objdump -dS" on your "postgres"
>> > executable?
>>
>> https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz
>
> Thanks. Nothing looks amiss there.
>
> I've attached the test harness I used to try reproducing this. It worked
> through over 500 epoch increments without a hitch; clearly, it fails to
> reproduce an essential aspect of your system. Could you attempt to modify
> it in the direction of better-resembling your production workload until
> it reproduces the problem?

Sure, I can mess around with it on our exact environment as well
(compilers, Xen, et al). We have not seen consistent reproduction either --
most epochs seem to fail to increment (sample size: few, but more than
three), yet epoch incrementing has happened more than zero times for sure.

I wonder if we can rope in this guy, who made the only other report of this
I've seen:

http://lists.pgfoundry.org/pipermail/skytools-users/2012-March/001601.html

So I'm CCing him.... He seems to have reproduced it on 9.1, but I haven't
seen his operating system information in my very brief skim of that thread.

-- fdr