Re: BUG #6661: out-of-order XID insertion in KnownAssignedXids - Mailing list pgsql-bugs
From | Andres Freund |
---|---|
Subject | Re: BUG #6661: out-of-order XID insertion in KnownAssignedXids |
Date | |
Msg-id | 201206071558.24877.andres@2ndquadrant.com Whole thread Raw |
In response to | Re: BUG #6661: out-of-order XID insertion in KnownAssignedXids (Valentine Gogichashvili <valgog@gmail.com>) |
Responses |
Re: BUG #6661: out-of-order XID insertion in KnownAssignedXids
|
List | pgsql-bugs |
Hi, On Thursday, June 07, 2012 12:44:08 PM Valentine Gogichashvili wrote: > I have the situation again, one of 3 slaves was slow to play all the WAL > files and being about 10GB late it crashed with the same error again. > > I collected DEBUG4 output in this time: > https://docs.google.com/open?id=0B2NMMrfiBQcLZjNDbU0xQ3lvWms Ok, I stared at this some time and I think I see what the problem is. Some log excerpts that lead my reasoning: 2012-06-06 15:35:51.954 "recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 3730302193 (now 3730301766)",,,,,"xlog redo running xacts: nextXid 3730302194 latestCompletedXid 3730302179 oldestRunningXid 3730301766; 68 xacts: ... subxid ovf",,,,"" So we found a overflowed snapshot after we already are in STANDBY_SNAPSHOT_PENDING. This implies we have seen an overflowed snapshot before. In STANDBY_SNAPSHOT_PENDING we start recording known assigned xids to build our snapshot incrementally. 2012-06-06 15:35:51.954 record known xact 3730301766 latestObservedXid 3730302197 2012-06-06 15:37:49.580 record known xact 3730316333 latestObservedXid 3730316332 3730316517 Ok, we started filling the KnownAssignedXid machinery. That means procArray- >numKnownAssignedXids != 0. 2012-06-06 15:37:54.401 FATAL,XX000,"out-of-order XID insertion in KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3730316518 latestCompletedXid 3730316517 oldestRunningXid 3730316333; 4 xacts: 3730316429 3730316508 3730316333 3730316430",,,,"" the "xlog redo running xacts" bits tells us we have got a snapshot that is *not* overflowed. That means in ProcArrayAppylRecoveryInfo in the following part: if (standbyState == STANDBY_SNAPSHOT_PENDING) { /* * If the snapshot isn't overflowed or if its empty we can * reset our pending state and use this snapshot instead. */ if (!running->subxid_overflow || running->xcnt == 0) { standbyState = STANDBY_INITIALIZED; } else { ... } } we reset our state back to STANDBY_INITIALIZED. And fall back to computing our state at once in one piece instead of incrementally as we started doing for SNAPSHOT_PENDING. The problem is is know that that code assumes we run without any previous recorded xids. This is even formalized in an assert which we don't hit because were running without Assert: ... Assert(standbyState == STANDBY_INITIALIZED); ... Assert(procArray->numKnownAssignedXids == 0); ... after that we start adding all currently running xids from the snapshot to the KnownAssigned machinery. They are already recorded though, so we fail in KnownAssignedXidsAdd with the OPs error. The simplest fix for that seems to be to simply reset the KnownAssignedXids state in the above branch. Any arguments against that? This seems to have been broken in commit 10b7c686e52a6d1bb10194ebf9331ef06f044d46 Andres
pgsql-bugs by date: