txid failed epoch increment, again, aka 6291 - Mailing list pgsql-hackers
From | Daniel Farina
Subject | txid failed epoch increment, again, aka 6291
Date |
Msg-id | CAAZKuFbnPYje8Xax2mMpCVfC4Mh8CZ3aT2Hd9z3W5GOLkce+_Q@mail.gmail.com
Responses | Re: txid failed epoch increment, again, aka 6291
List | pgsql-hackers
It seems this bug has reproduced once more. And once again, there doesn't appear to be any funny business in pg_control (the structs are pasted below for your re-check), and updates to it are succeeding and look sensible. The primary is running 9.0.6.

However, we do have a new piece of data: for a very brief period, txid_snapshot reported an xmin a few thousand transactions greater than 2^33, our next epoch boundary. That could be because the reporting function, GetNextXidAndEpoch, does its own epoch calculation: before a checkpoint it can see that nextXid has wrapped numerically below the last checkpoint's nextXid and adds one to the stored epoch, but once a checkpoint that forgets to increment the epoch has run, that post-facto adjustment no longer applies and the reported value falls back below the boundary.

I've been reviewing the mechanism in CreateCheckPoint for this on and off for a couple of days, but so far I haven't come up with a convincing explanation. However, given that historically this bug has been more likely to surface than *not* surface on this system, perhaps we can aim for a sometimes-reproducing test case. I'm still struggling for even a hint of a solution, though, so toss your thoughts here.
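To make the suspected sequence concrete, here is a small stand-alone model of the arithmetic. This is my paraphrase of what the GetNextXidAndEpoch-style reporting path does, not the actual xlog.c code; the pre-wrap checkpoint xid is made up, while the post-checkpoint values are taken from the dumps below.

/*
 * Stand-alone sketch (not PostgreSQL source) of the suspected sequence.
 * The reporting side starts from the last checkpoint's epoch and xid and
 * adds one to the epoch if nextXid has numerically wrapped below the
 * checkpoint's xid; the 64-bit txid is then (epoch << 32) | xid.
 */
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

static uint64_t
report_txid(uint32_t ckptXidEpoch, uint32_t ckptXid, uint32_t nextXid)
{
    uint32_t epoch = ckptXidEpoch;

    if (nextXid < ckptXid)
        epoch++;                /* post-facto wraparound adjustment */
    return ((uint64_t) epoch << 32) | nextXid;
}

int
main(void)
{
    /* Before the wrap: epoch 1, last checkpoint's xid near the top. */
    uint32_t ckptXidEpoch = 1;
    uint32_t ckptXid = 4294000000u;     /* hypothetical pre-wrap value */

    /*
     * 1. nextXid wraps past 2^32 before any checkpoint runs: the
     *    adjustment fires and reported txids briefly exceed 2^33
     *    (8589934592), as txid_snapshot showed.
     */
    printf("after wrap, before checkpoint: %" PRIu64 "\n",
           report_txid(ckptXidEpoch, ckptXid, 5000));

    /*
     * 2. A checkpoint records the post-wrap nextXid but forgets to
     *    increment the epoch (the suspected bug): ckptXid becomes small
     *    again -- 9904084 is the value seen in the dumps below -- so the
     *    adjustment no longer fires and the reported value falls back
     *    below 2^33, with the epoch stuck at 1.
     */
    ckptXid = 9904084;
    printf("after bad checkpoint:          %" PRIu64 "\n",
           report_txid(ckptXidEpoch, ckptXid, 9905000));
    return 0;
}

The first call prints 8589939592 (just above 2^33) and the second prints 4304872296, i.e. back under the boundary even though the system is logically past it, which matches what we briefly saw from txid_snapshot.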
$2 = {Insert = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
      Flush = {xlogid = 0, xrecoff = 0}}, PrevRecord = {xlogid = 0,
      xrecoff = 0}, curridx = 0, currpage = 0x7ff4ed04a000, currpos = 0x0,
      RedoRecPtr = {xlogid = 18751, xrecoff = 1200832888},
      forcePageWrites = 0 '\000'},
  LogwrtRqst = {Write = {xlogid = 0, xrecoff = 0},
      Flush = {xlogid = 0, xrecoff = 0}},
  LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
      Flush = {xlogid = 0, xrecoff = 0}},
  ckptXidEpoch = 1, ckptXid = 9904084,
  asyncXactLSN = {xlogid = 0, xrecoff = 0},
  lastRemovedLog = 0, lastRemovedSeg = 0,
  Write = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
      Flush = {xlogid = 0, xrecoff = 0}}, curridx = 0,
      lastSegSwitchTime = 0},
  pages = 0x7ff4ed04a000 "", xlblocks = 0x7ff4ed0471d8,
  XLogCacheBlck = 1023, ThisTimeLineID = 0, RecoveryTargetTLI = 6,
  archiveCleanupCommand = '\000' <repeats 1023 times>,
  SharedRecoveryInProgress = 1 '\001',
  lastCheckPointRecPtr = {xlogid = 18751, xrecoff = 1671519088},
  lastCheckPoint = {redo = {xlogid = 18751, xrecoff = 1200832888},
      ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
      nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
      oldestXid = 4115479553, oldestXidDB = 1, time = 1346746796,
      oldestActiveXid = 9776547},
  replayEndRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastXTime = 400062234671833, info_lck = 0 '\000'}

(gdb) p ControlFile
$3 = (ControlFileData *) 0x7ff4ed046bf8

(gdb) p *ControlFile
$4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
  catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY,
  time = 1346746898, checkPoint = {xlogid = 18751, xrecoff = 1072693824},
  prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
  checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536},
      ThisTimeLineID = 6, nextXidEpoch = 1, nextXid = 9904084,
      nextOid = 2047524, nextMulti = 1119, nextMultiOffset = 3115,
      oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
      oldestActiveXid = 9558248},
  minRecoveryPoint = {xlogid = 18751, xrecoff = 1748623656},
  backupStartPoint = {xlogid = 0, xrecoff = 0}, wal_level = 2,
  MaxConnections = 500, max_prepared_xacts = 500, max_locks_per_xact = 64,
  maxAlign = 8, floatFormat = 1234567, blcksz = 8192,
  relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216,
  nameDataLen = 64, indexMaxKeys = 32, toast_max_chunk_size = 1996,
  enableIntTimes = 1 '\001', float4ByVal = 1 '\001',
  float8ByVal = 1 '\001', crc = 3725972657}

--
fdr