txid failed epoch increment, again, aka 6291 - Mailing list pgsql-hackers
| From | Daniel Farina |
|---|---|
| Subject | txid failed epoch increment, again, aka 6291 |
| Date | |
| Msg-id | CAAZKuFbnPYje8Xax2mMpCVfC4Mh8CZ3aT2Hd9z3W5GOLkce+_Q@mail.gmail.com |
| Responses | Re: txid failed epoch increment, again, aka 6291 |
| List | pgsql-hackers |
It seems this has reproduced once more. And once again, there
doesn't appear to be any funny business in pg_control (the structs
are pasted below for your re-check), and the updates to it are
successful and sensible. The primary is running 9.0.6.
However, we do have a new piece of data: there was a very brief period
where txid_snapshot did report an xmin greater than 2^33, our next
epoch boundary, by a few thousand transactions. That could be because
the reporting function GetNextXidAndEpoch does its own epoch
calculation: before the checkpoint it notices that nextXid has wrapped
past the last checkpoint's xid and bumps the epoch on the fly, but
after a checkpoint that forgets to increment the epoch (while still
advancing the saved xid past the wrap), that post-facto adjustment no
longer applies and the reported value falls back.
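For reference, that calculation looks roughly like this, paraphrased
from memory of 9.0's xlog.c (not a verbatim quote; check the real
source for details):

void
GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
{
    uint32          ckptXidEpoch;
    TransactionId   ckptXid;
    TransactionId   nextXid;

    /* snapshot the epoch/xid pair saved by the last checkpoint */
    SpinLockAcquire(&XLogCtl->info_lck);
    ckptXidEpoch = XLogCtl->ckptXidEpoch;
    ckptXid = XLogCtl->ckptXid;
    SpinLockRelease(&XLogCtl->info_lck);

    /* then read the live counter */
    nextXid = ReadNewTransactionId();

    /*
     * nextXid is logically later than ckptXid, so if it compares
     * numerically smaller we must have wrapped into the next epoch
     * since that checkpoint: report the saved epoch plus one.
     */
    if (nextXid < ckptXid)
        ckptXidEpoch++;

    *xid = nextXid;
    *epoch = ckptXidEpoch;
}

So the reported epoch only stays correct as long as the checkpoint
remembers to bump nextXidEpoch when it advances the saved xid past the
wrap.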
I've been reviewing the relevant code in CreateCheckPoint for this on
and off for a couple of days, but so far I haven't come up with a
convincing mechanism. However, given that historically this bug seems
more likely to surface than *not* surface on this system, perhaps we
can try for a sometimes-reproducing test case. I'm still struggling
for even a hint of a solution, though, so toss your thoughts here.
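For what it's worth, here's a tiny standalone sketch (ordinary C; the
numbers are hypothetical, only borrowed from the dumps below to set the
scale) of the failure shape I think we're seeing: the on-the-fly
adjustment reports epoch 2 while the saved checkpoint xid still
predates the wrap, and then a checkpoint that copies the post-wrap
nextXid without bumping the epoch makes the reported value fall back
below 2^33:

#include <stdint.h>
#include <stdio.h>

/*
 * The same comparison the reporting path makes: if the live xid is
 * numerically smaller than the xid saved by the last checkpoint,
 * assume we wrapped since then and add one to the epoch.
 */
static uint64_t
report_txid(uint32_t ckpt_epoch, uint32_t ckpt_xid, uint32_t next_xid)
{
    uint32_t epoch = ckpt_epoch;

    if (next_xid < ckpt_xid)
        epoch++;                /* wrapped since the last checkpoint */
    return ((uint64_t) epoch << 32) | next_xid;
}

int
main(void)
{
    uint32_t ckpt_epoch = 1;
    uint32_t ckpt_xid   = 4115479553u; /* saved before the wrap (hypothetical) */
    uint32_t next_xid   = 9904084u;    /* already wrapped (hypothetical) */

    /* before the bad checkpoint: the adjustment fires, txid > 2^33 */
    printf("before: %llu\n",
           (unsigned long long) report_txid(ckpt_epoch, ckpt_xid, next_xid));

    /* a checkpoint copies next_xid but forgets the epoch increment */
    ckpt_xid = next_xid;               /* ckpt_epoch left at 1 */

    /* afterwards the adjustment no longer fires: txid drops back below 2^33 */
    printf("after:  %llu\n",
           (unsigned long long) report_txid(ckpt_epoch, ckpt_xid, next_xid));
    return 0;
}

That prints a value above 2^33 and then one below it, which matches the
brief excursion we observed.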
$2 = {Insert = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}},
    PrevRecord = {xlogid = 0, xrecoff = 0}, curridx = 0, currpage = 0x7ff4ed04a000, currpos = 0x0,
    RedoRecPtr = {xlogid = 18751, xrecoff = 1200832888}, forcePageWrites = 0 '\000'},
  LogwrtRqst = {Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}},
  LogwrtResult = {Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}},
  ckptXidEpoch = 1, ckptXid = 9904084, asyncXactLSN = {xlogid = 0, xrecoff = 0},
  lastRemovedLog = 0, lastRemovedSeg = 0,
  Write = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}},
    curridx = 0, lastSegSwitchTime = 0},
  pages = 0x7ff4ed04a000 "", xlblocks = 0x7ff4ed0471d8, XLogCacheBlck = 1023,
  ThisTimeLineID = 0, RecoveryTargetTLI = 6,
  archiveCleanupCommand = '\000' <repeats 1023 times>, SharedRecoveryInProgress = 1 '\001',
  lastCheckPointRecPtr = {xlogid = 18751, xrecoff = 1671519088},
  lastCheckPoint = {redo = {xlogid = 18751, xrecoff = 1200832888}, ThisTimeLineID = 6,
    nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524, nextMulti = 1119,
    nextMultiOffset = 3115, oldestXid = 4115479553, oldestXidDB = 1, time = 1346746796,
    oldestActiveXid = 9776547},
  replayEndRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastXTime = 400062234671833, info_lck = 0 '\000'}
(gdb) p ControlFile
$3 = (ControlFileData *) 0x7ff4ed046bf8
(gdb) p *ControlFile
$4 = {system_identifier = 5613733157253676693, pg_control_version = 903,
  catalog_version_no = 201008051, state = DB_IN_ARCHIVE_RECOVERY, time = 1346746898,
  checkPoint = {xlogid = 18751, xrecoff = 1072693824},
  prevCheckPoint = {xlogid = 18751, xrecoff = 1072693824},
  checkPointCopy = {redo = {xlogid = 18751, xrecoff = 602482536}, ThisTimeLineID = 6,
    nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524, nextMulti = 1119,
    nextMultiOffset = 3115, oldestXid = 4115479553, oldestXidDB = 1, time = 1346746496,
    oldestActiveXid = 9558248},
  minRecoveryPoint = {xlogid = 18751, xrecoff = 1748623656},
  backupStartPoint = {xlogid = 0, xrecoff = 0}, wal_level = 2,
  MaxConnections = 500, max_prepared_xacts = 500, max_locks_per_xact = 64,
  maxAlign = 8, floatFormat = 1234567, blcksz = 8192, relseg_size = 131072,
  xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32,
  toast_max_chunk_size = 1996, enableIntTimes = 1 '\001', float4ByVal = 1 '\001',
  float8ByVal = 1 '\001', crc = 3725972657}
--
fdr