Thread: [BUG] Archive recovery failure on 9.3+.
Hello, we happened to see a server crash on archive recovery under
certain conditions.

After the TLI has been incremented, it can happen that the WAL file
for the older timeline is archived, but the file for the same segment
id on the newer timeline is not. Archive recovery fails in that case
with a PANIC error like the following,

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
and not on 9.2.6 and 9.1.11. The latter two search pg_xlog for the
current TLI before trying the archive for older TLIs.

This occurs while fetching the checkpoint redo record during archive
recovery.

> if (checkPoint.redo < RecPtr)
> {
>     /* back up to find the record */
>     record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

And it is caused by the segment file for the older timeline in the
archive directory being preferred to the one for the newer timeline
in pg_xlog.

Looking into pg_xlog before trying the older TLIs in the archive, as
9.2 and earlier do, fixes this issue. The attached patch is one
possible solution for 9.4dev.

Attached files are:

- recvtest.sh: Replay script. Stages 1 and 2 create the condition,
  and stage 3 triggers the issue.

- archrecvfix_20131212.patch: The patch that fixes the issue. With
  it, archive recovery reads pg_xlog before trying older TLIs in the
  archive, as 9.1 and earlier do.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


recvtest.sh:

#!/bin/bash
ROOT=`pwd`
PGDATA=$ROOT/test/data
ARCHDIR=$ROOT/test/arc

if [ ! -d $ARCHDIR -o ! -d $PGDATA ]; then
    echo "$PGDATA and/or $ARCHDIR not found"
    exit
fi

echo "### EMPTY ARCHIVE DIRECTORY ###"
if [ -d $ARCHDIR ]; then rm -f $ARCHDIR/*; fi
echo "### EMPTY PGDATA DIRECTORY ###"
if [ -d $PGDATA ]; then rm -r $PGDATA/*; fi
echo "### DO INITDB ###"
initdb -D $PGDATA > /dev/null
echo "### set up postgresql.conf ###"
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = archive
archive_mode = on
archive_command = '/bin/cp %p $ARCHDIR/%f'
# log_min_messages = debug5
EOF

echo "### STAGE 1/3 -- PUT XLOG ..001...001 AND ..002.HISTORY INTO ARCHIVE ###"
echo "### STAGE 1/3: 1/2 START SERVER ###"
pg_ctl start -D $PGDATA -w
echo "### STAGE 1/3: 2/2 STOP SERVER ###"
pg_ctl stop -D $PGDATA

echo "### STAGE 2/3 -- PUT XLOG ..002...001 INTO ONLY pg_xlog ###"
echo "### STAGE 2/3: 1/3 PREPARE recovery.conf ###"
cat > $PGDATA/recovery.conf <<EOF
restore_command = '/bin/cp $ARCHDIR/%f %p'
EOF
echo "### STAGE 2/3: 2/3 START SERVER IN ARCHIVE RECOVERY MODE ###"
pg_ctl start -D $PGDATA -w
echo "### STAGE 2/3: 3/3 STOP SERVER IMMEDIATELY ###"
pg_ctl stop -m i -D $PGDATA

echo "### ls $ARCHDIR"
ls $ARCHDIR
echo "### ls $PGDATA/pg_xlog"
ls $PGDATA/pg_xlog

echo "### STAGE 3/3 - START SERVER IN ARCHIVE RECOVERY MODE AGAIN ###"
echo "### STAGE 3/3: 1/2 RESTORE recovery.conf ###"
mv $PGDATA/recovery.done $PGDATA/recovery.conf
echo "### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###"
pg_ctl start -D $PGDATA -w -t 2
if [ $? -ne 0 ]; then
    echo "### SERVER CRASHED ###"
    exit
fi
echo "### SERVER SEEMS SUCCESSFULLY UP. STOP IT. ###"
pg_ctl stop -D $PGDATA -w
echo "### SERVER DID NOT CRASH ###"


archrecvfix_20131212.patch:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6fa5479..75be478 100755
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10935,10 +10935,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					curFileTLI = 0;
 
 					/*
-					 * Try to restore the file from archive, or read an existing
-					 * file from pg_xlog.
+					 * When XLOG_FROM_ARCHIVE, read the xlog file with the largest
+					 * TLI, preferring the archive to pg_xlog. Or when
+					 * XLOG_FROM_PG_XLOG, search only pg_xlog.
 					 */
-					readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, currentSource);
+					readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
+												  currentSource == XLOG_FROM_ARCHIVE ?
+												  XLOG_FROM_ANY : currentSource);
 					if (readFile >= 0)
 						return true;	/* success! */
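To visualize the condition the script creates (a sketch; the segment
names assume a fresh initdb, so the exact lists may vary):

# After stage 2, the archive holds the partial segment for TLI 1 plus
# the history file, while the segment for TLI 2 exists only in pg_xlog:
ls test/arc
#   000000010000000000000001  000000010000000000000002  00000002.history
ls test/data/pg_xlog
#   ... 000000020000000000000002 ...
# Stage 3 then restores 000000010000000000000002 from the archive and
# recovers from it, instead of 000000020000000000000002 in pg_xlog.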
Hi,
Is anybody reading this thread? I would like to see this resolved in the next release.
I think his patch is a reasonable fix for this problem.
Please check this again.
regards,
--------------------------
Tomonari Katsumata
2013/12/12 Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
> Hello, we happened to see a server crash on archive recovery under
> certain conditions.
>
> After the TLI has been incremented, it can happen that the WAL file
> for the older timeline is archived, but the file for the same segment
> id on the newer timeline is not. Archive recovery fails in that case
> with a PANIC error like the following,
>
> | PANIC: record with zero length at 0/1820D40
>
> A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
> and not on 9.2.6 and 9.1.11. The latter two search pg_xlog for the
> current TLI before trying the archive for older TLIs.
>
> This occurs while fetching the checkpoint redo record during archive
> recovery.
>
>> if (checkPoint.redo < RecPtr)
>> {
>>     /* back up to find the record */
>>     record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
>
> And it is caused by the segment file for the older timeline in the
> archive directory being preferred to the one for the newer timeline
> in pg_xlog.
>
> Looking into pg_xlog before trying the older TLIs in the archive, as
> 9.2 and earlier do, fixes this issue. The attached patch is one
> possible solution for 9.4dev.
>
> Attached files are:
>
> - recvtest.sh: Replay script. Stages 1 and 2 create the condition,
>   and stage 3 triggers the issue.
>
> - archrecvfix_20131212.patch: The patch that fixes the issue. With
>   it, archive recovery reads pg_xlog before trying older TLIs in the
>   archive, as 9.1 and earlier do.
>
> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
On Thu, Dec 12, 2013 at 11:00 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, we happened to see a server crash on archive recovery under
> certain conditions.
> [...]
> Looking into pg_xlog before trying the older TLIs in the archive, as
> 9.2 and earlier do, fixes this issue. The attached patch is one
> possible solution for 9.4dev.

Horiguchi-san,

Wonder if the following commit in the 9.2 branch was to address a
similar/same problem?

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4aed94f1660fb55bc825bf7f3135379dab28eb55

--
Amit Langote
On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:
> Hello, we happened to see a server crash on archive recovery under
> certain conditions.
> [...]
> This occurs while fetching the checkpoint redo record during archive
> recovery.
>
>> if (checkPoint.redo < RecPtr)
>> {
>>     /* back up to find the record */
>>     record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

Hmm. After running the script, pg_controldata says:

Latest checkpoint location:           0/2000090
Prior checkpoint location:            0/2000090
Latest checkpoint's REDO location:    0/2000090

The PANIC is actually coming from here:

> /*
>  * Re-fetch the last valid or last applied record, so we can identify the
>  * exact endpoint of what we consider the valid portion of WAL.
>  */
> record = ReadRecord(xlogreader, LastRec, PANIC, false);

If recovery started from an online checkpoint, then I think you'd get a
similar PANIC from the code you quoted.

> And it is caused by the segment file for the older timeline in the
> archive directory being preferred to the one for the newer timeline
> in pg_xlog.

Yep.

> Looking into pg_xlog before trying the older TLIs in the archive, as
> 9.2 and earlier do, fixes this issue. The attached patch is one
> possible solution for 9.4dev.

Hmm, that seems reasonable at a quick glance. I think it also needs a
change to the state transition code earlier in the loop, to not move
from XLOG_FROM_ARCHIVE to XLOG_FROM_PG_XLOG if we've already tried
reading pg_xlog.

Another way to look at this is that it's wrong that we immediately PANIC
if we successfully restore a file from the archive, but then fail to
read the record we're looking for. Instead, we should advance to the
next state in the state machine, ie. try reading the same record from
pg_xlog, and only give up if the read fails from all sources.

Yet another way to look at this is that we shouldn't even try to read
the file with TLI 1 from the archive when we know the checkpoint record
is on timeline 2. It can't possibly work. I've been thinking for some
time that we should stop doing the scan of all possible TLIs, and only
try to read the exact file that contains the record we're reading. We
have that information now, in the timeline history file, and we already
do that during streaming replication. I was afraid of changing the
behavior of archive recovery in 9.4, but maybe that's the way to go in
the long term.

- Heikki
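For reference, those checkpoint locations come from the cluster's
control file and can be re-checked at any point with pg_controldata
(here run against the test script's data directory):

    pg_controldata test/data | grep -i checkpoint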
On 01/09/2014 08:18 PM, Heikki Linnakangas wrote:
> On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:
>> Hello, we happened to see a server crash on archive recovery under
>> certain conditions.
>> [...]
>> And it is caused by the segment file for the older timeline in the
>> archive directory being preferred to the one for the newer timeline
>> in pg_xlog.
>
> Yep.

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived, more
complete version.

This is the code that does that:

> /*
>  * If we are establishing a new timeline, we have to copy data from the
>  * last WAL segment of the old timeline to create a starting WAL segment
>  * for the new timeline.
>  *
>  * Notify the archiver that the last WAL segment of the old timeline is
>  * ready to copy to archival storage. Otherwise, it is not archived for a
>  * while.
>  */
> if (endTLI != ThisTimeLineID)
> {
>     XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
>
>     if (XLogArchivingActive())
>     {
>         XLogFileName(xlogpath, endTLI, endLogSegNo);
>         XLogArchiveNotify(xlogpath);
>     }
> }

So, the rationale is that otherwise it would take a long time until that
segment is archived. To be precise, I don't think the segment with the
old TLI would ever be archived without the above, but the same segment
on the new timeline would, after it fills up.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> Actually, why is the partially-filled 000000010000000000000002 file
> archived in the first place? ...

> So, the rationale is that otherwise it would take a long time until that
> segment is archived. To be precise, I don't think the segment with the
> old TLI would ever be archived without the above, but the same segment
> on the new timeline would, after it fills up.

> Wouldn't it be better to not archive the old segment, and instead switch
> to a new segment after writing the end-of-recovery checkpoint, so that
> the segment on the new timeline is archived sooner?

Don't we want to archive both? If you want to recover to the end of the
old timeline, you're going to need that file too, no?

			regards, tom lane
On 01/09/2014 10:16 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> Actually, why is the partially-filled 000000010000000000000002 file
>> archived in the first place? ...
>> [...]
>> Wouldn't it be better to not archive the old segment, and instead switch
>> to a new segment after writing the end-of-recovery checkpoint, so that
>> the segment on the new timeline is archived sooner?
>
> Don't we want to archive both? If you want to recover to the end of the
> old timeline, you're going to need that file too, no?

Hmm. It should be the responsibility of the original server to archive
the segment on the old timeline. Mind you, partial segments are never
archived, except for this one case, so how did the old segment find its
way to the new server? A few possibilities come to mind: the DBA
manually copied it from the old server to pg_xlog, it was streamed by
streaming replication, or it was included in a base backup. The OP's
test script resembles the base backup case.

In all of those cases, I don't think it's the new server's
responsibility to archive it. If it was copied to pg_xlog manually, the
administrator may also copy it to the archive if he feels like it. If it
was streamed from a live server, the original server should take care of
it. If it was included in a backup, well, it's included in the backup so
it doesn't necessarily need to be archived.

- Heikki
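The base-backup path, spelled out (a sketch with placeholder paths; any
file-system-level base backup that includes pg_xlog behaves this way):

    # Copying the data directory wholesale carries the current, partial
    # segment into the new server's pg_xlog without it ever passing
    # through the archive:
    rsync -a olddata/ newdata/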
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 01/09/2014 10:16 PM, Tom Lane wrote:
>> Don't we want to archive both? If you want to recover to the end of the
>> old timeline, you're going to need that file too, no?

> Hmm. It should be the responsibility of the original server to archive
> the segment on the old timeline.

Oh ... I was thinking about the case of a timeline switch within one
server, say as a result of PITR recovery. If you're sure this same case
doesn't come up that way, then nevermind.

			regards, tom lane
On 01/09/2014 10:36 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> Hmm. It should be the responsibility of the original server to archive
>> the segment on the old timeline.
>
> Oh ... I was thinking about the case of a timeline switch within one
> server, say as a result of PITR recovery. If you're sure this same case
> doesn't come up that way, then nevermind.

I'm not sure what you mean, but it does apply to same-server PITR
recovery too. In that case, again, if you copy the partial segment to
pg_xlog when doing the PITR recovery, clearly you already somehow have
access to the partial segment, and if you want to do another PITR to the
same point, you should just copy it again.

- Heikki
On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:
> Actually, why is the partially-filled 000000010000000000000002 file
> archived in the first place? Looking at the code, it's been like that
> forever, but it seems like a bad idea. If the original server is still
> up and running, and writing more data to that file, what will happen is
> that when the original server later tries to archive it, it will fail
> because the partial version of the file is already in the archive. Or
> worse, the partial version overwrites a previously archived, more
> complete version.

Oh! This explains some transient errors I've seen.

> Wouldn't it be better to not archive the old segment, and instead switch
> to a new segment after writing the end-of-recovery checkpoint, so that
> the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
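For reference, the manual analogue of such a segment switch (a sketch;
on 9.x the function is pg_switch_xlog, renamed pg_switch_wal in
PostgreSQL 10):

    # Close out the current WAL segment so the next record goes to a
    # fresh one; the finished segment becomes eligible for archiving:
    psql -c "SELECT pg_switch_xlog();"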
On 01/09/2014 10:55 PM, Josh Berkus wrote:
> On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:
>> Actually, why is the partially-filled 000000010000000000000002 file
>> archived in the first place? [...]
>
> Oh! This explains some transient errors I've seen.
>
>> Wouldn't it be better to not archive the old segment, and instead switch
>> to a new segment after writing the end-of-recovery checkpoint, so that
>> the segment on the new timeline is archived sooner?
>
> It would be better to zero-fill and switch segments, yes. We should
> NEVER be in a position of archiving two different versions of the same
> segment.

Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply the
same fix to all stable branches, as the current behavior is silly and
always has been. On the other hand, we haven't heard any complaints
about it, so we probably shouldn't fix what ain't broken. Perhaps we
should apply it to 9.3, as that's where we have the acute problem the OP
reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not
archive the partial segment from the previous timeline. Older branches
will keep the current behavior.

- Heikki
Hi Heikki,

I need a PostgreSQL 9.3 release with this problem fixed. It didn't
happen in PostgreSQL 9.2, so I agree with your proposal to make the
changes against REL9_3_STABLE and master.

Can you fix this in the next release (9.3.3)?

------------
Tomonari Katsumata

(2014/01/13 20:16), Heikki Linnakangas wrote:
> On 01/09/2014 10:55 PM, Josh Berkus wrote:
>> [...]
>> It would be better to zero-fill and switch segments, yes. We should
>> NEVER be in a position of archiving two different versions of the same
>> segment.
>
> Ok, I think we're in agreement that that's the way to go for master.
>
> Now, what to do about back-branches? On one hand, I'd like to apply the
> same fix to all stable branches, as the current behavior is silly and
> always has been. On the other hand, we haven't heard any complaints
> about it, so we probably shouldn't fix what ain't broken. Perhaps we
> should apply it to 9.3, as that's where we have the acute problem the
> OP reported. Thoughts?
>
> In summary, I propose that we change master and REL9_3_STABLE to not
> archive the partial segment from the previous timeline. Older branches
> will keep the current behavior.
>
> - Heikki
Re: Heikki Linnakangas 2014-01-13 <52D3CAFF.3010807@vmware.com>
> [...]
> In summary, I propose that we change master and REL9_3_STABLE to not
> archive the partial segment from the previous timeline. Older branches
> will keep the current behavior.

I've seen the "can't archive file from the old timeline" problem on
9.2 and 9.3 slaves after promotion. The problem is in conjunction with
the proposed archive_command in the default postgresql.conf comments:

# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the "test ! -f" part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)

I've always blamed 9.2+'s cascading replication for this, but haven't
investigated deeper.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
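Spelling out the failure mode (a sketch; the paths and segment name
mirror the 9.2 log further down the thread):

    # After promotion, the archiver effectively runs this for the
    # partial old-timeline segment, which is already in the archive:
    test ! -f /tmp/pg-9.2/000000010000000000000002 \
        && cp pg_xlog/000000010000000000000002 /tmp/pg-9.2/000000010000000000000002
    # The test fails, the command exits non-zero, and the archiver is
    # stuck retrying this segment forever.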
Hello,

> I need a PostgreSQL 9.3 release with this problem fixed.
>
> It didn't happen in PostgreSQL 9.2, so I agree with your proposal to
> make the changes against REL9_3_STABLE and master.
>
> Can you fix this in the next release (9.3.3)?

I was going to push to move this along a bit, but...

>>>> Wouldn't it be better to not archive the old segment, and instead
>>>> switch to a new segment after writing the end-of-recovery
>>>> checkpoint, so that the segment on the new timeline is archived
>>>> sooner?
>>>
>>> It would be better to zero-fill and switch segments, yes. We should
>>> NEVER be in a position of archiving two different versions of the same
>>> segment.
>>
>> Ok, I think we're in agreement that that's the way to go for master.

I was almost inclined to agree, but on further thought about the idea,
considering recovery up to a target timeline, the old segment turns out
to be necessary for that case. Without the old segment, we would be
obliged to seek to the first segment of the *next* timeline (is there
any (simple) means to predict where that is?) to complete the task. Is
it the right way that we kick the older one out of the archive?

>> Now, what to do about back-branches? On one hand, I'd like to apply
>> the same fix to all stable branches, as the current behavior is silly
>> and always has been. On the other hand, we haven't heard any
>> complaints about it, so we probably shouldn't fix what ain't
>> broken. Perhaps we should apply it to 9.3, as that's where we have the
>> acute problem the OP reported. Thoughts?
>>
>> In summary, I propose that we change master and REL9_3_STABLE to not
>> archive the partial segment from the previous timeline. Older branches
>> will keep the current behavior.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:
> I was almost inclined to agree, but on further thought about the idea,
> considering recovery up to a target timeline, the old segment turns out
> to be necessary for that case. Without the old segment, we would be
> obliged to seek to the first segment of the *next* timeline (is there
> any (simple) means to predict where that is?) to complete the task.

How did the server that created the new timeline get the old, partial,
segment? Was it already archived? Or did the DBA copy it into pg_xlog
manually? Or was it streamed by streaming replication? Whatever the
mechanism, the same mechanism ought to make sure the old segment is
available for PITR, too.

Hmm. If you have set up streaming replication and a WAL archive, and
your master dies and you fail over to a standby, what you describe does
happen. The partial old segment is not in the archive, so you cannot
PITR to a point in the old timeline that falls within the partial
segment (ie. just before the failover). However, it's not guaranteed
that all the preceding WAL segments on the old timeline were already
archived, anyway, so even if the partial segment is archived, it's not
guaranteed to work.

The old master is responsible for archiving the WAL on the old timeline,
and the new master is responsible for archiving all the WAL on the new
timeline. That's a straightforward, easy-to-understand rule. It might be
useful to have a mode where the standby also archives all the received
WAL, but that would need to be a separate option.

> Is it the right way that we kick the older one out of the archive?

If it's already in the archive, it's not going to be removed from the
archive.

- Heikki
On 02/12/2014 01:24 PM, Christoph Berg wrote:
> Re: Heikki Linnakangas 2014-01-13 <52D3CAFF.3010807@vmware.com>
> [...]
> I've seen the "can't archive file from the old timeline" problem on
> 9.2 and 9.3 slaves after promotion. The problem is in conjunction with
> the proposed archive_command in the default postgresql.conf comments:
>
> # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
>
> With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
> remove the "test ! -f" part. (An alternative fix would be to declare
> the behavior ok and adjust that example in the config.)

Hmm, the behavior is the same in 9.1 and 9.2. Did you use a different
archive_command in 9.1, without the "test"?

- Heikki
Hello, I might have misunderstood your words.

At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas <hlinnakangas@vmware.com> wrote in <52FC7E2A.9060703@vmware.com>
> On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:
>> [...]

Does this mean that no mechanical solution for this situation will be
provided, and that an operator should remove the older segment for the
same segment id before starting the recovery process?

> How did the server that created the new timeline get the old, partial,
> segment? Was it already archived? Or did the DBA copy it into pg_xlog
> manually? Or was it streamed by streaming replication? Whatever the
> mechanism, the same mechanism ought to make sure the old segment is
> available for PITR, too.

Sure.

> Hmm. If you have set up streaming replication and a WAL archive, and
> your master dies and you fail over to a standby, what you describe does
> happen. The partial old segment is not in the archive, so you cannot
> PITR to a point in the old timeline that falls within the partial
> segment (ie. just before the failover). However, it's not guaranteed
> that all the preceding WAL segments on the old timeline were already
> archived, anyway, so even if the partial segment is archived, it's not
> guaranteed to work.

Yes. Putting aside insane or vanished segments in the archive, I
understand that a pair of master and standby (or standby and cascaded
standby, and so on) can share one WAL archive; otherwise, the archived
WAL segments plus all the WAL segments not yet archived and left in
pg_xlog of the old master should be merged into the WAL archive of the
new master (the promoted old slave) to keep an online backup taken from
the old master usable. Even with a shared WAL archive, segments missing
from the archive should be filled in from pg_xlog. Nevertheless, that
process can be automated.

The test script at the start of this thread is for the shared-archive
case, and I had unconsciously assumed that context.

> The old master is responsible for archiving the WAL on the old
> timeline, and the new master is responsible for archiving all the WAL
> on the new timeline. That's a straightforward, easy-to-understand
> rule.

Yes. I was somewhat confused because of my assumption of a shared
archive, but everything can in fact be converged into a single archive,
and older versions of PostgreSQL could cope with that situation.

> It might be useful to have a mode where the standby also archives all
> the received WAL, but that would need to be a separate option.

Perhaps such a mechanism is not in demand :)

>> Is it the right way that we kick the older one out of the archive?
>
> If it's already in the archive, it's not going to be removed from the
> archive.

I have understood that the conclusion so far is to not archive the older
segment at promotion, but that seems a bit odd, as you suggested. If the
conclusion here is that there will be no aid, as per my new
understanding, would you let me know the reason why recovery was changed
to prefer the archive to pg_xlog? Commit abf5c5c9a4 seems to change the
behavior, but I cannot find the reason for the change.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 02/13/2014 01:37 PM, Kyotaro HORIGUCHI wrote:
> Does this mean that no mechanical solution for this situation will be
> provided, and that an operator should remove the older segment for the
> same segment id before starting the recovery process?

I don't understand. With the proposed change to never archive a partial
segment, there is no segment with the same filename in the archive.
There is nothing to remove. Or did you mean that you have to do that in
the 9.0-9.2 back-branches, if we only back-patch this change to 9.3?

> I have understood that the conclusion so far is to not archive the older
> segment at promotion, but that seems a bit odd, as you suggested. If the
> conclusion here is that there will be no aid, as per my new
> understanding, would you let me know the reason why recovery was changed
> to prefer the archive to pg_xlog? Commit abf5c5c9a4 seems to change the
> behavior, but I cannot find the reason for the change.

Well, it didn't work too well before abf5c5c9a4 either. With your test
script, and commit abf5c5c9a4 reverted [1], you got this:

### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###
waiting for server to start....LOG:  database system was interrupted; last known up at 2014-02-13 14:17:53 EET
LOG:  starting archive recovery
LOG:  restored log file "00000002.history" from archive
/bin/cp: cannot stat '/home/heikki/pgsql.93stable/test/arc/000000020000000000000002': No such file or directory
LOG:  restored log file "000000010000000000000002" from archive
LOG:  record with zero length at 0/2000090
LOG:  invalid primary checkpoint record
/bin/cp: cannot stat '/home/heikki/pgsql.93stable/test/arc/000000020000000000000002': No such file or directory
LOG:  restored log file "000000010000000000000002" from archive
LOG:  using previous checkpoint record at 0/2000028
LOG:  consistent recovery state reached at 0/2000028
LOG:  record with zero length at 0/2000090
LOG:  redo is not required
/bin/cp: cannot stat '/home/heikki/pgsql.93stable/test/arc/000000020000000000000002': No such file or directory
LOG:  restored log file "000000010000000000000002" from archive
/bin/cp: cannot stat '/home/heikki/pgsql.93stable/test/arc/00000003.history': No such file or directory
LOG:  selected new timeline ID: 3
LOG:  restored log file "00000002.history" from archive
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Note the "invalid primary checkpoint record" message. It restored the
partial 000000010000000000000002 file from the archive, and tried to
recover using that instead of the 000000020000000000000002 file from
pg_xlog. The primary checkpoint record is not present in
000000010000000000000002, so it fell back to recovering from the
previous checkpoint.

It was the exact same issue, even though it appeared to work if you
don't look carefully. The second recovery never replayed the WAL
generated on timeline 2; it branched off directly from timeline 1, at
the same point where timeline 2 did. You can see that by looking at the
history files. Also, if you create a table and insert a row to it after
the first recovery, just before killing the server, you'll see that the
row is not there after the second recovery. That's wrong; the data files
might already contain some changes from timeline 2, and you end up with
a corrupt database. Thanks to commit abf5c5c9a4, the situation is at
least detected, and the system refuses to start up.

The behavior where we prefer a segment from the archive with a lower TLI
over a file with a higher TLI in pg_xlog actually changed in commit
a068c391ab0. Arguably changing it wasn't a good idea, but the problem
your test script demonstrates can be fixed by not archiving the partial
segment, with no change to the preference of archive/pg_xlog. As
discussed, archiving a partial segment seems like a bad idea anyway, so
let's just stop doing that.

[1] Just applying commit abf5c5c9a4 in reverse doesn't apply cleanly due
to further changes to the code, but I did this to simulate the old
behavior:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a4a504..352d3f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4366,6 +4366,7 @@ readRecoveryCommandFile(void)
 
 	/* Enable fetching from archive recovery area */
 	ArchiveRecoveryRequested = true;
+	InArchiveRecovery = true;
 
 	/*
 	 * If user specified recovery_target_timeline, validate it or compute the

- Heikki
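Heikki's data-loss check, spelled out (a sketch; the table name is
arbitrary):

    # After the first recovery, just before the immediate stop in stage 2:
    psql -c "CREATE TABLE canary (x int); INSERT INTO canary VALUES (1);"
    # After the second recovery, the timeline-2 WAL was never replayed,
    # so on the broken versions the table and its row are simply gone:
    psql -c "SELECT * FROM canary;"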
Re: Heikki Linnakangas 2014-02-13 <52FC9468.4050602@vmware.com>
>> With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
>> remove the "test ! -f" part. (An alternative fix would be to declare
>> the behavior ok and adjust that example in the config.)
>
> Hmm, the behavior is the same in 9.1 and 9.2. Did you use a
> different archive_command in 9.1, without the "test"?

The behavior is different: 9.1 doesn't try to re-archive the last
segment from the old timeline. (I've seen it always happen in training
workshops, so that's always mostly idle machines. Possibly 9.1 will
re-archive the last segment under load.)

I've just reproduced it here with these settings (in Debian's
/etc/postgresql-common/createcluster.conf syntax):

archive_mode = on
archive_command = 'test ! -f /tmp/pg-%v/%%f && cp %%p /tmp/pg-%v/%%f'
wal_level = hot_standby
max_wal_senders = 3

... pg_basebackup ...

recovery.conf:
restore_command = 'cp /tmp/pg-9.3/%f %p'

9.1 slave log (translated here to English from the original German):

LOG:  consistent recovery state reached at 0/20000D8
cp: cannot stat '/tmp/pg-9.1/000000010000000000000003': No such file or directory
LOG:  could not open file "pg_xlog/000000010000000000000003" (log file 0, segment 3): No such file or directory
LOG:  redo done at 0/20000D8
LOG:  restored log file "000000010000000000000002" from archive
cp: cannot stat '/tmp/pg-9.1/00000002.history': No such file or directory
LOG:  selected new timeline ID: 2
cp: cannot stat '/tmp/pg-9.1/00000001.history': No such file or directory
LOG:  archive recovery complete
LOG:  incomplete startup packet
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
(eof)

ls /tmp/pg-9.1
000000010000000000000001  000000010000000000000002.00000020.backup  000000020000000000000003
000000010000000000000002  000000020000000000000002                  00000002.history

9.2 slave log:

LOG:  consistent recovery state reached at 0/20000E0
cp: cannot stat '/tmp/pg-9.2/000000010000000000000003': No such file or directory
LOG:  could not open file "pg_xlog/000000010000000000000003" (log file 0, segment 3): No such file or directory
LOG:  redo done at 0/20000E0
LOG:  restored log file "000000010000000000000002" from archive
cp: cannot stat '/tmp/pg-9.2/00000002.history': No such file or directory
LOG:  selected new timeline ID: 2
cp: cannot stat '/tmp/pg-9.2/00000001.history': No such file or directory
LOG:  archive recovery complete
LOG:  incomplete startup packet
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: test ! -f /tmp/pg-9.2/000000010000000000000002 && cp pg_xlog/000000010000000000000002 /tmp/pg-9.2/000000010000000000000002
LOG:  archive command failed with exit code 1
...

ls /tmp/pg-9.2
000000010000000000000001  000000010000000000000002  000000010000000000000002.00000020.backup

The 9.3 log/ls looks the same.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
On 02/13/2014 02:42 PM, Christoph Berg wrote:
> Re: Heikki Linnakangas 2014-02-13 <52FC9468.4050602@vmware.com>
>>> With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
>>> remove the "test ! -f" part. (An alternative fix would be to declare
>>> the behavior ok and adjust that example in the config.)
>>
>> Hmm, the behavior is the same in 9.1 and 9.2. Did you use a
>> different archive_command in 9.1, without the "test"?
>
> The behavior is different: 9.1 doesn't try to re-archive the last
> segment from the old timeline. (I've seen it always happen in training
> workshops, so that's always mostly idle machines. Possibly 9.1 will
> re-archive the last segment under load.)
>
> I've just reproduced it here with these settings (in Debian's
> /etc/postgresql-common/createcluster.conf syntax):

Ah, I think I see what's going on. 9.2 changed the archive recovery
behavior so that the restored WAL segments are copied into pg_xlog, so
that you can recover the standby back to the same point after restart,
even if the archive later stops working (also needed for cascading
standbys). In 9.1, we only attempted to archive the last segment from
the previous timeline if it wasn't restored from the archive. In 9.2,
it's always archived.

I was testing this with streaming replication; 9.1 and 9.2 behave the
same in that scenario. But they differ when doing archive recovery.

Is this an argument for back-patching the "don't archive last segment
from old timeline" patch to 9.2 and 9.3, but leaving 9.1 alone? You have
the same problem with 9.1 and streaming replication, but no-one's
complained..

- Heikki
Re: Heikki Linnakangas 2014-02-13 <52FCCA40.3060908@vmware.com>
> I was testing this with streaming replication; 9.1 and 9.2 behave
> the same in that scenario. But they differ when doing archive
> recovery.
>
> Is this an argument for back-patching the "don't archive last
> segment from old timeline" patch to 9.2 and 9.3, but leaving 9.1
> alone? You have the same problem with 9.1 and streaming replication,
> but no-one's complained..

I can't say what would be better from a technical viewpoint. For me,
this is literally an academic problem people run into when they try
replication during the PostgreSQL course, using the archive_command
example from postgresql.conf. I guess if no one else has complained yet,
there's no reason to touch the back branches.

Is removing the "test ! -f" part and hence overwriting files in the
archive safe, i.e. are the files the same?

Christoph
On 02/13/2014 03:53 PM, Christoph Berg wrote:
> Is removing the "test ! -f" part and hence overwriting files in the
> archive safe, i.e. are the files the same?

No. Not in general, anyway. If the old master keeps running, even for a
moment, after the partial file was copied, it will have created more WAL
records which are present in the old master's copy of the segment, but
not in the standby's.

You could perhaps use "diff" as part of archive_command, to check if the
archived version is identical to the one about to be archived.

- Heikki
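A minimal sketch of such a command (the archive path is a placeholder;
note that, by design, it still fails when the two versions differ):

# Succeeds if the file is absent (then copies it), or if an archived
# copy exists and is byte-identical; exits non-zero -- blocking the
# archiver -- if a different version is already there:
archive_command = 'if [ -f /mnt/server/archivedir/%f ]; then diff -q %p /mnt/server/archivedir/%f; else cp %p /mnt/server/archivedir/%f; fi'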
Re: Heikki Linnakangas 2014-02-13 <52FCD02C.3060201@vmware.com>
>> Is removing the "test ! -f" part and hence overwriting files in the
>> archive safe, i.e. are the files the same?
>
> No. Not in general, anyway. If the old master keeps running, even
> for a moment, after the partial file was copied, it will have
> created more WAL records which are present in the old master's copy
> of the segment, but not in the standby's.

So why does PostgreSQL then ever try to archive something from the old
timeline? Isn't overwriting files on the old timeline just a very bad
idea in general? If so, it shouldn't be the responsibility of
archive_command to prevent this, but the server shouldn't be trying
that in the first place.

> You could perhaps use "diff" as part of archive_command, to check if
> the archived version is identical to the one about to be archived.

Well, if that makes it into the default postgresql.conf (and docs)...
Otoh, it's still not nice that this could still be an archive command
that fails, i.e. requires admin intervention before archiving can
proceed.

Christoph
On 02/13/2014 04:07 PM, Christoph Berg wrote:
> Re: Heikki Linnakangas 2014-02-13 <52FCD02C.3060201@vmware.com>
>>> Is removing the "test ! -f" part and hence overwriting files in the
>>> archive safe, i.e. are the files the same?
>>
>> No. Not in general, anyway. If the old master keeps running, even
>> for a moment, after the partial file was copied, it will have
>> created more WAL records which are present in the old master's copy
>> of the segment, but not in the standby's.
>
> So why does PostgreSQL then ever try to archive something from the old
> timeline? Isn't overwriting files on the old timeline just a very bad
> idea in general?

Yes.

> If so, it shouldn't be the responsibility of
> archive_command to prevent this, but the server shouldn't be trying
> that in the first place.

Totally agreed. That's why I'm proposing that we stop doing that :-).
It's been a bad idea from day one, but I'm a bit hesitant to back-patch
until someone actually complains. There have been subtle changes to
archive recovery behavior in each release, so I'm worried that the old
behavior was in fact useful to someone in some corner case that we're
missing.

- Heikki
On 02/13/2014 02:42 PM, Heikki Linnakangas wrote:
> The behavior where we prefer a segment from the archive with a lower TLI
> over a file with a higher TLI in pg_xlog actually changed in commit
> a068c391ab0. Arguably changing it wasn't a good idea, but the problem
> your test script demonstrates can be fixed by not archiving the partial
> segment, with no change to the preference of archive/pg_xlog. As
> discussed, archiving a partial segment seems like a bad idea anyway, so
> let's just stop doing that.

After some further thought, while not archiving the partial segment
fixes your test script, it's not enough to fix all variants of the
problem. Even if archive recovery doesn't archive the last, partial,
segment, if the original master server is still running, it's entirely
possible that it fills the segment and archives it. In that case,
archive recovery will again prefer the archived segment with the lower
TLI over the segment with the newer TLI in pg_xlog. So I agree we should
commit the patch you posted (or something to that effect). The change to
not archive the last segment still seems like a good idea, but perhaps
we should only do that in master.

Even after that patch, you can have a problem in more complicated
scenarios involving both an archive and streaming replication. For
example, imagine a timeline history like this:

TLI
1   ----+--------------------------->
        |
2       +--------------------------->

Now imagine that timeline 1 has been fully archived, and there are WAL
segments much higher than the point where the timeline switch occurred
present in the archive. But none of the WAL segments for timeline 2 have
been archived; they are only present on a master server. You want to
perform recovery to timeline 2, using the archived WAL segments for
timeline 1, and streaming replication to catch up to the tip of
timeline 2.

Whether we prefer files from pg_xlog or the archive makes no difference
in this case, as there are no files in pg_xlog. But recovery will
merrily apply all the WAL for timeline 1 from the archive that it can
find, past the timeline switch point. After that, when it tries to
connect to the server with streaming replication, it will fail.

There's not much we can do about that in 9.2 and below, but in 9.3 the
timeline history file contains the exact timeline switch points, so we
could be more careful and not apply any extra WAL on the old timeline
past the switch point. We could also be more exact about which files we
try to restore from the archive, instead of just polling every future
TLI in the history.

- Heikki
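On that last point, a 9.3 timeline history file records exactly that
switch point. For the history sketched above, 00000002.history would
contain a single line of this form (the LSN here is only an example):

1	0/2000090	no recovery target specified

That is: the parent TLI, the LSN at which the new timeline branched
off, and the reason for the switch.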
On 02/13/2014 06:47 PM, Heikki Linnakangas wrote:
> After some further thought, while not archiving the partial segment
> fixes your test script, it's not enough to fix all variants of the
> problem. Even if archive recovery doesn't archive the last, partial,
> segment, if the original master server is still running, it's entirely
> possible that it fills the segment and archives it. In that case,
> archive recovery will again prefer the archived segment with the lower
> TLI over the segment with the newer TLI in pg_xlog.
>
> So I agree we should commit the patch you posted (or something to that
> effect). The change to not archive the last segment still seems like a
> good idea, but perhaps we should only do that in master.

To draw this to a conclusion, barring any further insights, I'm going to
commit the attached patch to master and REL9_3_STABLE. Please have a
look at the patch, to see if I'm missing something. I modified the state
machine to skip over the XLOG_FROM_XLOG state if reading in
XLOG_FROM_ARCHIVE failed; otherwise you first scan the archive and
pg_xlog together, and then pg_xlog alone, which is pointless.

In master, I'm also going to remove the "archive last segment on old
timeline" code.

- Heikki
Hello,

Before taking up the topic..

At Thu, 13 Feb 2014 19:45:38 +0200, Heikki Linnakangas wrote
> On 02/13/2014 06:47 PM, Heikki Linnakangas wrote:
>> On 02/13/2014 02:42 PM, Heikki Linnakangas wrote:
>>> [...] As discussed, archiving a partial segment seems like a bad
>>> idea anyway, so let's just stop doing that.

It surely makes things simple, and I rather like the idea. But as long
as the final, possibly partial, segment of the lower TLI is actually
created, and the recovery mechanism lets users command recovery
operations that require such segments (recovery_target_timeline does
this), a "perfect archive" - one that can cover all sorts of restore
operations - may necessarily contain such duplicate segments, I believe.

Besides, I suppose that policy makes operations around archive/restore
rather difficult. DBAs would be stuck with the tense work of picking
only the segments actually needed for the recovery at hand out of the
haystack. It sounds somewhat gloomy..

# However, I also doubt the appropriateness of stockpiling archive
# segments spanning so many timelines; two generations are enough to
# cause this issue.

Anyway, returning to the topic,

>> After some further thought, while not archiving the partial segment
>> fixes your test script, it's not enough to fix all variants of the
>> problem. Even if archive recovery doesn't archive the last, partial,
>> segment, if the original master server is still running, it's entirely
>> possible that it fills the segment and archives it. In that case,
>> archive recovery will again prefer the archived segment with the lower
>> TLI over the segment with the newer TLI in pg_xlog.

Yes, that is the generalized description of the case I mentioned
(though I had not reached that thought :).

>> So I agree we should commit the patch you posted (or something to that
>> effect). The change to not archive the last segment still seems like a
>> good idea, but perhaps we should only do that in master.

My opinion on duplicate segments on older timelines is as described
above.

> To draw this to a conclusion, barring any further insights, I'm going
> to commit the attached patch to master and REL9_3_STABLE. Please have a
> look at the patch, to see if I'm missing something. I modified the
> state machine to skip over the XLOG_FROM_XLOG state if reading in
> XLOG_FROM_ARCHIVE failed; otherwise you first scan the archive and
> pg_xlog together, and then pg_xlog alone, which is pointless.
>
> In master, I'm also going to remove the "archive last segment on old
> timeline" code.

Thank you for finishing the patch. I hadn't thought about the behavior
after an XLOG_FROM_ARCHIVE failure. With it, the state machine avoids an
extra trip around the loop. The recovery process becomes able to grab
the segment on the highest (expected) TLI among those with the same
segment id, regardless of their location. I think the recovery process
will now cope with the "perfect" archives described above for all types
of recovery operation.

The state machine loop, with the fallback from archive to pg_xlog taken
into account, now seems somewhat more complicated than needed, but that
does no harm.

Though, this line, which was in my original patch,

> readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
>     currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY : currentSource);

sticks out far past the line-wrapping boundary and seems somewhat
dirty:( And the conditional operator makes the distinction between
XLOG_FROM_ARCHIVE and XLOG_FROM_ANY a bit confusing. But I failed to
unify them either way, so it is left as is..

Finally, the attached patch differs from your last patch only in the
styling mentioned above. It applies to current HEAD, and I confirmed
that it fixes this issue, but I have not exercised the lastSourceFailed
section; simple file removal could not lead there.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 508970a..85a0ce9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11006,17 +11006,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	/*-------
 	 * Standby mode is implemented by a state machine:
 	 *
-	 * 1. Read from archive (XLOG_FROM_ARCHIVE)
-	 * 2. Read from pg_xlog (XLOG_FROM_PG_XLOG)
-	 * 3. Check trigger file
-	 * 4. Read from primary server via walreceiver (XLOG_FROM_STREAM)
-	 * 5. Rescan timelines
-	 * 6. Sleep 5 seconds, and loop back to 1.
+	 * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just
+	 *    pg_xlog (XLOG_FROM_XLOG)
+	 * 2. Check trigger file
+	 * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
+	 * 4. Rescan timelines
+	 * 5. Sleep 5 seconds, and loop back to 1.
 	 *
 	 * Failure to read from the current source advances the state machine to
-	 * the next state. In addition, successfully reading a file from pg_xlog
-	 * moves the state machine from state 2 back to state 1 (we always prefer
-	 * files in the archive over files in pg_xlog).
+	 * the next state.
 	 *
 	 * 'currentSource' indicates the current state. There are no currentSource
 	 * values for "check trigger", "rescan timelines", and "sleep" states,
@@ -11044,9 +11042,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
-					currentSource = XLOG_FROM_PG_XLOG;
-					break;
-
 				case XLOG_FROM_PG_XLOG:
 
 					/*
@@ -11212,7 +11207,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Try to restore the file from archive, or read an existing
 					 * file from pg_xlog.
 					 */
-					readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, currentSource);
+					readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
+												  currentSource == XLOG_FROM_ARCHIVE ?
+												  XLOG_FROM_ANY : currentSource);
 					if (readFile >= 0)
 						return true;	/* success! */
On 02/14/2014 10:38 AM, Kyotaro HORIGUCHI wrote: > Finally, the patch you will find attached is fixed only in > styling mentioned above from your last patch. This patch applies > current HEAD and I confirmed that it fixes this issue but I have > not checked the lastSourceFailed section. Simple file removal > could not lead to there. Ok, applied. Thanks! - Heikki
Thank you for committing. > On 02/14/2014 10:38 AM, Kyotaro HORIGUCHI wrote: > > Finally, the patch you will find attached is fixed only in > > styling mentioned above from your last patch. This patch applies > > current HEAD and I confirmed that it fixes this issue but I have > > not checked the lastSourceFailed section. Simple file removal > > could not lead to there. > > Ok, applied. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center