[HACKERS] Phantom segment upon promotion causing troubles. - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | [HACKERS] Phantom segment upon promotion causing troubles. |
Date | |
Msg-id | 20170619073026.zcwpe6mydsaz5ygd@alap3.anarazel.de Whole thread Raw |
Responses |
Re: [HACKERS] Phantom segment upon promotion causing troubles.
Re: [HACKERS] Phantom segment upon promotion causing troubles. |
List | pgsql-hackers |
Hi, Greg Burek from Heroku (CCed) reported a weird issue on IM, that was weird enough to be interesting. What he'd observed was that he promoted some PITR standby, and early clones of that node work, but later clones did not, failing to read some segment. The problems turns out to be the following: When a node is promoted at a segment boundary, just after an XLOG_SWITCH record we'll haveEndOfLog = EndRecPtr; pointing to the *beginning* of the next segment, as XLOG_SWITCH records are treated as using the whole segment. After creating the END_OF_RECOVERY record (or checkpoint), we'll do: if (ArchiveRecoveryRequested){ /* * We switched to a new timeline. Clean up segments on the old * timeline. * * If there are any higher-numbered segments on the old timeline, * remove them. They might contain valid WAL,but they might also be * pre-allocated files containing garbage. In any case, they are not * part of the newtimeline's history so we don't need them. */ RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID); note that this uses EndOfLog, pointing to ab/cd000000 (i.e. the beginning of a record). RemoveNonParentXlogFiles calls RemoveNonParentXlogFiles() which in turn uses RemoveXlogFile() to remove superflous files. That's where the fun begins. static void RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr) {XLogSegNo endlogSegNo;XLogSegNo recycleSegNo; ... #define XLByteToPrevSeg(xlrp, logSegNo) \logSegNo = ((xlrp) - 1) / XLogSegSize ...XLByteToPrevSeg(endptr, endlogSegNo);if (PriorRedoPtr == InvalidXLogRecPtr) recycleSegNo = endlogSegNo + 10;else recycleSegNo = XLOGfileslop(PriorRedoPtr); ... InstallXLogFileSegment(&endlogSegNo, path, true, recycleSegNo, true)) ... So what happens here is that we're calling InstallXLogFileSegment() to remove superflous xlog files (e.g. because they're before the recovery target, because restore command ran before the trigger file was detected or because walsender received them). But because endptr = ab/cd000000, the use of XLByteToPrevSeg() means InstallXLogFileSegment() will be called with the *previous* segment's segment number. That in turn will lead to InstallXLogFileSegment() installing the to-be-removed segment into the current timeline, but into a segment from one *before* the creation of new timeline, for the purpose of recycling the segment. I'll call this the "phantom" segment, which has no meaningful content and lives on a timeline which does not yet exist. As there's no .ready file created for that segment, and we'll never actually write to it, it'll initially just sit around. Not visible for archiving, and normally unused by wal streaming. But that changes at later checkpoints, because, via RemoveOldXlogFiles()'s XLogArchiveCheckDone() checks we: /** XLogArchiveCheckDone* ...* If <XLOG>.done exists, then return true; else if <XLOG>.ready exists,* then return false; else create <XLOG>.ready andreturn false.** The reason we do things this way is so that if the original attempt to* create <XLOG>.ready fails, we'llretry during subsequent checkpoints. So we'll at some later point create a .ready for the above created phantom segment. Which then will get archived. At that point we're in trouble. If any standbys of that promoted node catch up after that fact (or new ones are created from older base backups), after the phantom segment has been archived, and restore_command is set, recovery will fail. The reason for that is that one commonly will have recovery_target_timeline = latest (or the new timeline) set. And XLogFileReadAnyTLI() is pretty simplistic. When restoring a segment it'll simply probe all timelines, starting from the newest. Which means that, once archived, our phantom segment will "hide" the actual segment from the source timeline. Because it's not parseable (it's at a different segment, thus parsing decide it's unusable), recovery will hang at that point. Which means quick standbys catch up, slow ones are "dead". It's "fixable" by creating a restore_command which filters that phantom segment, or deleting the segment from the archive. The minimal fix here is presumably not to use XLByteToPrevSeg() in RemoveXlogFile(), but XLByteToSeg(). I don't quite see what purpose it serves here - I don't think it's ever needed. Normally it's harmless because InstallXLogFileSegment() checks where it could install the file to, but that doesn't work around timeline bumps, triggering the problem at hand. This seems to be very longstanding behaviour, I'm not sure where it's originating from (hard to track due to code movement). There seems to be a larger question ehre though: Why does XLogFileReadAnyTLI() probe all timelines even if they weren't a parent at that period? That seems like a bad idea, especially in more complicated scenarios where some precursor timeline might live for longer than it was a parent? ISTM XLogFileReadAnyTLI() should check which timeline a segment ought to come from, based on the historY? Comments? Greetings, Andres Freund
pgsql-hackers by date: