Thread: txid failed epoch increment, again, aka 6291
It seems like this has reproduced once more. And once again, there doesn't
appear to be any funny business in pg_control (but the structs are pasted
here for your re-check), and there are successful sensical updates to it.
The primary is running 9.0.6.

However, we do have a new piece of data: there was a very brief period
where txid_snapshot did report an xmin greater than 2^33, our next
epoch boundary, by a few thousand transactions. That could be because
the reporting function GetNextXidAndEpoch does its own epoch
calculation: before the faulty checkpoint it adjusts the epoch on the
fly, but after a checkpoint that forgets to increment the epoch there
is no longer any post-facto adjustment for it to apply.

I've been reviewing the code in CreateCheckPoint on and off for a
couple of days, but so far I haven't come up with a convincing
mechanism for the failure. However, given that historically this bug
seems more likely to surface than *not* surface on this system,
perhaps we can try for a sometimes-reproducing test case. I'm still
struggling for a hint of a solution, though, so toss your thoughts
here.

$2 = {Insert = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0}, Flush = {
        xlogid = 0, xrecoff = 0}}, PrevRecord = {xlogid = 0, xrecoff = 0},
    curridx = 0, currpage = 0x7ff4ed04a000, currpos = 0x0, RedoRecPtr = {
      xlogid = 18751, xrecoff = 1200832888}, forcePageWrites = 0 '\000'},
  LogwrtRqst = {Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0,
      xrecoff = 0}}, LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
    Flush = {xlogid = 0, xrecoff = 0}}, ckptXidEpoch = 1, ckptXid = 9904084,
  asyncXactLSN = {xlogid = 0, xrecoff = 0}, lastRemovedLog = 0,
  lastRemovedSeg = 0, Write = {LogwrtResult = {Write = {xlogid = 0,
        xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}}, curridx = 0,
    lastSegSwitchTime = 0}, pages = 0x7ff4ed04a000 "",
  xlblocks = 0x7ff4ed0471d8, XLogCacheBlck = 1023, ThisTimeLineID = 0,
  RecoveryTargetTLI = 6, archiveCleanupCommand = '\000' <repeats 1023 times>,
  SharedRecoveryInProgress = 1 '\001', lastCheckPointRecPtr = {
    xlogid = 18751, xrecoff = 1671519088}, lastCheckPoint = {redo = {
      xlogid = 18751, xrecoff = 1200832888}, ThisTimeLineID = 6,
    nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524,
    nextMulti = 1119, nextMultiOffset = 3115, oldestXid = 4115479553,
    oldestXidDB = 1, time = 1346746796, oldestActiveXid = 9776547},
  replayEndRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastXTime = 400062234671833, info_lck = 0 '\000'}
(gdb) p ControlFile
$3 = (ControlFileData *) 0x7ff4ed046bf8
(gdb) p *ControlFile
$4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
  catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY,
  time = 1346746898, checkPoint = {xlogid = 18751, xrecoff = 1072693824},
  prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
  checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536},
    ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
    nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
    oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
    oldestActiveXid = 9558248}, minRecoveryPoint = {xlogid = 18751,
    xrecoff = 1748623656}, backupStartPoint = {xlogid = 0, xrecoff = 0},
  wal_level = 2, MaxConnections = 500, max_prepared_xacts = 500,
  max_locks_per_xact = 64, maxAlign = 8, floatFormat = 1234567,
  blcksz = 8192, relseg_size = 131072, xlog_blcksz = 8192,
  xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32,
  toast_max_chunk_size = 1996, enableIntTimes = 1 '\001',
  float4ByVal = 1 '\001', float8ByVal = 1 '\001', crc = 3725972657}

-- fdr
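For readers following along, the on-the-fly epoch inference Daniel describes
works roughly as follows. The sketch below is a minimal standalone model,
not the PostgreSQL source; the names (CkptXidInfo, report_txid) and most
values are hypothetical, with 9904084 borrowed from the pg_control dump
above. It shows why the reported xmin can briefly exceed 2^33 and then fall
back once a checkpoint stores the wrapped nextXid without bumping the epoch.

#include <stdint.h>
#include <stdio.h>

/*
 * Minimal standalone model (not PostgreSQL source): the reporting path keeps
 * the (epoch, nextXid) pair saved by the last checkpoint and bumps the epoch
 * on the fly when the live nextXid has wrapped past the saved one.
 */
typedef struct
{
    uint32_t ckptXidEpoch;   /* epoch recorded by the last checkpoint */
    uint32_t ckptXid;        /* nextXid recorded by the last checkpoint */
} CkptXidInfo;

static uint64_t
report_txid(CkptXidInfo ckpt, uint32_t nextXid)
{
    uint32_t epoch = ckpt.ckptXidEpoch;

    /* nextXid is logically >= ckptXid; numerically smaller means it wrapped */
    if (nextXid < ckpt.ckptXid)
        epoch++;

    return ((uint64_t) epoch << 32) | nextXid;
}

int
main(void)
{
    /* Before the post-wrap checkpoint: the saved nextXid is still pre-wrap. */
    CkptXidInfo before = {1, 4294000000u};
    /* After a checkpoint that saved the wrapped nextXid but left epoch at 1. */
    CkptXidInfo after = {1, 9904084u};

    printf("before: %llu\n",
           (unsigned long long) report_txid(before, 5000u));    /* a few thousand past 2^33 */
    printf("after:  %llu\n",
           (unsigned long long) report_txid(after, 9905000u));  /* back below 2^33 */
    return 0;
}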
On Tue, Sep 04, 2012 at 02:07:30AM -0700, Daniel Farina wrote:
> It seems like this has reproduced once more. And once again, there
> doesn't appear to be any funny business in pg_control (but the structs
> are pasted here for your re-check), and there are successful sensical
> updates to it. The primary is running 9.0.6.

What version is the standby running?

> However, we do have a new piece of data: there was a very brief period
> where txid_snapshot did report an xmin greater than 2^33, our next
> epoch boundary, by a few thousand transactions. That could be because
> the reporting function GetNextXidAndEpoch does its own epoch
> calculation: before the faulty checkpoint it adjusts the epoch on the
> fly, but after a checkpoint that forgets to increment the epoch there
> is no longer any post-facto adjustment for it to apply.

That makes sense.

> I've been reviewing the code in CreateCheckPoint on and off for a
> couple of days, but so far I haven't come up with a convincing
> mechanism for the failure. However, given that historically this bug
> seems more likely to surface than *not* surface on this system,
> perhaps we can try for a sometimes-reproducing test case. I'm still
> struggling for a hint of a solution, though, so toss your thoughts
> here.

The cause is not apparent to me, either, and simple tests here do not
reproduce the problem.

I am suspicious of xlog_redo() updating ControlFile->checkPointCopy without
acquiring ControlFileLock. If a restartpoint is finishing at about the same
time, ControlFile->checkPointCopy.nextXid might come from the just-read
checkpoint record while ControlFile->checkPointCopy.nextXidEpoch bears the
value from the older checkpoint just adopted as a restartpoint. The
resulting inconsistency would, however, vanish at the next
ControlFile->checkPointCopy update. This does not explain the symptoms you
have reported, and it cannot explain much of anything on a primary.

> (gdb) p *ControlFile
> $4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
>   catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY,
>   time = 1346746898, checkPoint = {xlogid = 18751, xrecoff = 1072693824},

This capture, it seems, is from a standby.

>   prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
>   checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536},
>     ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
>     nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
>     oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
>     oldestActiveXid = 9558248},

You expected checkPointCopy = { ... nextXidEpoch = 2, ... }, correct? (Well,
nextXidEpoch = 3 if you count the previous missed increment.) Does
pg_control on the primary also bear epoch 1 where epoch 2 is expected?

Your last restartpoint was exactly five minutes before your last checkpoint,
so there's no evidence of a dearth of safe restartpoint opportunities.

nm
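To make the suspected interleaving concrete, here is a deterministic,
single-threaded replay of the field-level mix Noah describes. It is a toy
sketch under assumed values, not PostgreSQL code; the struct, the update
order, and the numbers are all assumptions, chosen only to show how an
unlocked two-field update can end up half old, half new.

#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for the two checkPointCopy fields of interest. */
typedef struct
{
    uint32_t nextXidEpoch;
    uint32_t nextXid;
} ToyCheckPoint;

int
main(void)
{
    ToyCheckPoint control = {1, 4294000000u};   /* current pg_control copy */
    ToyCheckPoint older   = {1, 4294000000u};   /* checkpoint adopted as the restartpoint */
    ToyCheckPoint newer   = {2, 5000u};         /* checkpoint record just read by redo */

    control.nextXidEpoch = newer.nextXidEpoch;  /* redo writes the newer epoch ...          */
    control = older;                            /* ... restartpoint installs the older copy */
    control.nextXid = newer.nextXid;            /* ... redo finishes with the newer nextXid  */

    /* Old epoch paired with the new nextXid, until the next full update. */
    printf("nextXidEpoch = %u, nextXid = %u\n",
           control.nextXidEpoch, control.nextXid);
    return 0;
}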
On Tue, Sep 4, 2012 at 8:04 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Sep 04, 2012 at 02:07:30AM -0700, Daniel Farina wrote:
>> It seems like this has reproduced once more. And once again, there
>> doesn't appear to be any funny business in pg_control (but the structs
>> are pasted here for your re-check), and there are successful sensical
>> updates to it. The primary is running 9.0.6.
>
> What version is the standby running?

We have several, all of which so far agree on the same txid, presumably
syndicated from the primary: one is 9.0.7, another is 9.0.8. Incidentally,
I am also currently preparing a 9.0.9 one that I was going to use for some
forensics.

>> However, we do have a new piece of data: there was a very brief period
>> where txid_snapshot did report an xmin greater than 2^33, our next
>> epoch boundary, by a few thousand transactions. That could be because
>> the reporting function GetNextXidAndEpoch does its own epoch
>> calculation: before the faulty checkpoint it adjusts the epoch on the
>> fly, but after a checkpoint that forgets to increment the epoch there
>> is no longer any post-facto adjustment for it to apply.
>
> That makes sense.
>
>> I've been reviewing the code in CreateCheckPoint on and off for a
>> couple of days, but so far I haven't come up with a convincing
>> mechanism for the failure. However, given that historically this bug
>> seems more likely to surface than *not* surface on this system,
>> perhaps we can try for a sometimes-reproducing test case. I'm still
>> struggling for a hint of a solution, though, so toss your thoughts
>> here.
>
> The cause is not apparent to me, either, and simple tests here do not
> reproduce the problem.
>
> I am suspicious of xlog_redo() updating ControlFile->checkPointCopy
> without acquiring ControlFileLock. If a restartpoint is finishing at
> about the same time, ControlFile->checkPointCopy.nextXid might come
> from the just-read checkpoint record while
> ControlFile->checkPointCopy.nextXidEpoch bears the value from the older
> checkpoint just adopted as a restartpoint. The resulting inconsistency
> would, however, vanish at the next ControlFile->checkPointCopy update.
> This does not explain the symptoms you have reported, and it cannot
> explain much of anything on a primary.
>
>> (gdb) p *ControlFile
>> $4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
>>   catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY,
>>   time = 1346746898, checkPoint = {xlogid = 18751, xrecoff = 1072693824},
>
> This capture, it seems, is from a standby.

Yeah. I can probably also get one from the primary.

>>   prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
>>   checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536},
>>     ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
>>     nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
>>     oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
>>     oldestActiveXid = 9558248},
>
> You expected checkPointCopy = { ... nextXidEpoch = 2, ... }, correct?
> (Well, nextXidEpoch = 3 if you count the previous missed increment.)
> Does pg_control on the primary also bear epoch 1 where epoch 2 is
> expected?

Well, in the aggregate life of this thing as we know it we probably expect
more like 4, or even 5. It bears 1 where 2 is expected, though, to answer
the original question.

> Your last restartpoint was exactly five minutes before your last
> checkpoint, so there's no evidence of a dearth of safe restartpoint
> opportunities.
>
> nm

I might try to find the segments leading up to the overflow point and try
xlogdumping them to see what we can see.

If there's anything to note about the workload, I'd say that it tends to
make fairly pervasive use of long-running transactions that can probably
span more than one checkpoint, and of the txid reporting functions, with a
concurrency level of about 300 backends ... but per my reading of the
mechanism so far, it doesn't seem like any of this should matter.

-- fdr
On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
> I might try to find the segments leading up to the overflow point and
> try xlogdumping them to see what we can see.

That would be helpful to see.

Just to grasp at yet-flimsier straws, could you post (URL preferred, else
private mail) the output of "objdump -dS" on your "postgres" executable?

> If there's anything to note about the workload, I'd say that it tends to
> make fairly pervasive use of long-running transactions that can probably
> span more than one checkpoint, and of the txid reporting functions, with
> a concurrency level of about 300 backends ... but per my reading of the
> mechanism so far, it doesn't seem like any of this should matter.

Thanks for the details; I agree none of that sounds suspicious.

After some further pondering and testing, this remains a mystery to me.
These symptoms imply a proper update of ControlFile->checkPointCopy.nextXid
without having properly updated ControlFile->checkPointCopy.nextXidEpoch.
After recovery, only CreateCheckPoint() updates ControlFile->checkPointCopy
at all. Its logic for doing so looks simple and correct.
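For context, the increment Noah refers to hinges on a single numeric
comparison between the new checkpoint's nextXid and the copy already stored
in pg_control. The sketch below is a toy simulation of that rule as I read
it, not the real CreateCheckPoint() code, and the xid sequence is
hypothetical; it shows that the epoch bumps exactly once at the wrap, and
that one missed bump leaves every later checkpoint an epoch behind.

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t nextXidEpoch;
    uint32_t nextXid;
} ToyCheckPoint;

/*
 * Toy checkpoint rule: inherit the stored epoch and bump it only if the new
 * nextXid is numerically below the previously stored one, i.e. only if a
 * wraparound happened since the last checkpoint.
 */
static ToyCheckPoint
toy_checkpoint(ToyCheckPoint stored, uint32_t newNextXid)
{
    ToyCheckPoint cp;

    cp.nextXidEpoch = stored.nextXidEpoch;
    if (newNextXid < stored.nextXid)
        cp.nextXidEpoch++;          /* first checkpoint after a wrap */
    cp.nextXid = newNextXid;
    return cp;
}

int
main(void)
{
    /* Hypothetical nextXid values at successive checkpoints, across a wrap. */
    uint32_t xids[] = {4293000000u, 4294900000u, 5000u, 9904084u, 20000000u};
    ToyCheckPoint stored = {1, 4291000000u};

    for (int i = 0; i < 5; i++)
    {
        stored = toy_checkpoint(stored, xids[i]);
        printf("checkpoint %d: epoch = %u, nextXid = %u\n",
               i + 1, stored.nextXidEpoch, stored.nextXid);
    }

    /*
     * If the stored pair were ever overwritten with a post-wrap nextXid
     * while still carrying the old epoch, no later checkpoint would ever
     * make up the missing increment.
     */
    return 0;
}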
On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> I might try to find the segments leading up to the overflow point and
>> try xlogdumping them to see what we can see.
>
> That would be helpful to see.
>
> Just to grasp at yet-flimsier straws, could you post (URL preferred, else
> private mail) the output of "objdump -dS" on your "postgres" executable?

https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Sure. It's a 9.0.6 with by-same-role pg_cancel_backend backported, along
with the standard Debian changes, so nothing all that interesting should be
going on beyond what normally goes on with compilers on this platform.

I am also starting to grovel through this assembly, although I don't have a
ton of experience finding problems this way. To save you a tiny bit of time
aligning the assembly with the C, this line

   c797f:       e8 7c c9 17 00          callq  244300 <LWLockAcquire>

seems to be the beginning of:

    LWLockAcquire(XidGenLock, LW_SHARED);
    checkPoint.nextXid = ShmemVariableCache->nextXid;
    checkPoint.oldestXid = ShmemVariableCache->oldestXid;
    checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
    LWLockRelease(XidGenLock);

>> If there's anything to note about the workload, I'd say that it tends to
>> make fairly pervasive use of long-running transactions that can probably
>> span more than one checkpoint, and of the txid reporting functions, with
>> a concurrency level of about 300 backends ... but per my reading of the
>> mechanism so far, it doesn't seem like any of this should matter.
>
> Thanks for the details; I agree none of that sounds suspicious.
>
> After some further pondering and testing, this remains a mystery to me.
> These symptoms imply a proper update of
> ControlFile->checkPointCopy.nextXid without having properly updated
> ControlFile->checkPointCopy.nextXidEpoch. After recovery, only
> CreateCheckPoint() updates ControlFile->checkPointCopy at all. Its logic
> for doing so looks simple and correct.

Yeah. I'm pretty flabbergasted that so much seems to be going right while
this goes wrong.

-- fdr
On Fri, Sep 07, 2012 at 01:37:57AM -0700, Daniel Farina wrote:
> On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
> > > I might try to find the segments leading up to the overflow point and
> > > try xlogdumping them to see what we can see.
> >
> > That would be helpful to see.
> >
> > Just to grasp at yet-flimsier straws, could you post (URL preferred,
> > else private mail) the output of "objdump -dS" on your "postgres"
> > executable?
>
> https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Thanks. Nothing looks amiss there.

I've attached the test harness I used to try reproducing this. It worked
through over 500 epoch increments without a hitch; clearly, it fails to
reproduce an essential aspect of your system. Could you attempt to modify
it in the direction of better-resembling your production workload until it
reproduces the problem?

nm
[Attachment: test harness]
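Noah's actual harness is in the attachment and is not reproduced here. As a
rough illustration of the kind of observation such a harness (or a
production monitor) can make, here is a hedged libpq sketch that simply
polls txid_current() and reports any backwards jump, the user-visible
symptom of a missed epoch increment. The connection string, file name, and
one-second interval are placeholders.

/* Build with something like: cc txid_watch.c -lpq */
#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    PGconn             *conn = PQconnectdb("dbname=postgres");  /* placeholder DSN */
    unsigned long long  last = 0;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        /* Note: txid_current() assigns a real xid, so polling burns xids. */
        PGresult   *res = PQexec(conn, "SELECT txid_current()");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            unsigned long long cur = strtoull(PQgetvalue(res, 0, 0), NULL, 10);

            if (last != 0 && cur < last)
                fprintf(stderr, "txid went backwards: %llu -> %llu\n",
                        last, cur);
            last = cur;
        }
        PQclear(res);
        sleep(1);
    }
}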
On Fri, Sep 7, 2012 at 5:49 AM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Sep 07, 2012 at 01:37:57AM -0700, Daniel Farina wrote:
>> On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
>> > On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> >> I might try to find the segments leading up to the overflow point and
>> >> try xlogdumping them to see what we can see.
>> >
>> > That would be helpful to see.
>> >
>> > Just to grasp at yet-flimsier straws, could you post (URL preferred,
>> > else private mail) the output of "objdump -dS" on your "postgres"
>> > executable?
>>
>> https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz
>
> Thanks. Nothing looks amiss there.
>
> I've attached the test harness I used to try reproducing this. It worked
> through over 500 epoch increments without a hitch; clearly, it fails to
> reproduce an essential aspect of your system. Could you attempt to modify
> it in the direction of better-resembling your production workload until
> it reproduces the problem?

Sure, I can mess around with it on our exact environment as well
(compilers, Xen, et al). We have not seen consistent reproduction either --
most epochs seem to fail to increment (sample size: few, but more than
three), yet epoch incrementing has happened more than zero times for sure.

I wonder if we can rope in this guy, who made the only other report of this
I've seen:

http://lists.pgfoundry.org/pipermail/skytools-users/2012-March/001601.html

So I'm CCing him.... He seems to have reproduced it on 9.1, but I haven't
seen his operating system information in my very brief skim of that thread.

-- fdr