Thread: pg_standby -l might destory the archived file
Hi, pg_standby can use ln command to restore an archived file, which might destroy the archived file as follows. 1) pg_standby creates the symlink to the archived file '102' 2) '102' is applied 3) the next file '103' doesn't exist and the trigger file is created 4) '102' is re-fetched 5) at the end of recovery, the symlink to '102' is rename to '202', but it still points '102' 6) after recovery, '202' is recycled (rename to '208', which still points '102') 7) '208' is written new xlog records over --> the archived file '102' comes down! One simple solution to fix this problem is copying the content of the symlink (ie. the archived file itself) and deleting it instead of renaming it at the end of recovery. Thought? -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > pg_standby can use ln command to restore an archived file, > which might destroy the archived file as follows. Does it matter? pg_standby's source area wouldn't normally be an "archive" in the real sense of the word, it's just a temporary staging area between master and slave. (If it were being used as a real archive, keeping it on the same disk as the live slave seems pretty foolish anyway, so the case wouldn't arise.) regards, tom lane
Tom Lane wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> pg_standby can use ln command to restore an archived file, >> which might destroy the archived file as follows. > > Does it matter? pg_standby's source area wouldn't normally be an > "archive" in the real sense of the word, it's just a temporary staging > area between master and slave. (If it were being used as a real > archive, keeping it on the same disk as the live slave seems pretty > foolish anyway, so the case wouldn't arise.) It seems perfectly sane to source pg_standby directly from the archive to me. And we're talking about symbolic linking, so the archive directory might well be on an NFS mount. (I haven't looked at the issue yet) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [090601 10:56]: > Tom Lane wrote: >> Fujii Masao <masao.fujii@gmail.com> writes: >>> pg_standby can use ln command to restore an archived file, >>> which might destroy the archived file as follows. >> >> Does it matter? pg_standby's source area wouldn't normally be an >> "archive" in the real sense of the word, it's just a temporary staging >> area between master and slave. (If it were being used as a real >> archive, keeping it on the same disk as the live slave seems pretty >> foolish anyway, so the case wouldn't arise.) > > It seems perfectly sane to source pg_standby directly from the archive > to me. And we're talking about symbolic linking, so the archive > directory might well be on an NFS mount. I would expect that any archive directly available would at least be RO to the postgres slave... But.... Something like this would stop the "symlink" being renamed... Not portable, but probably portable across platforms that have symlinks...diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.cindex1b575e2..cba3f7a 100644--- a/src/backend/access/transam/xlog.c+++ b/src/backend/access/transam/xlog.c@@-3042,13 +3042,23 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr) { if (XLogArchiveCheckDone(xlde->d_name)) {+ struct stat stat_buf; snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name); /* * Before deleting the file, see if it can be recycled as a * future logsegment.+ * If it's a symlink, we don't recycle it */- if (InstallXLogFileSegment(&endlogId, &endlogSeg, path,+ if ( (stat(path, &stat_buf)== 0) && S_ISLNK(stat_buf.st_mode))+ {+ ereport(DEBUG2,+ (errmsg("removing transaction log symlink \"%s\"",+ xlde->d_name)));+ unlink(path);+ CheckpointStats.ckpt_segs_removed++;+ }+ else if (InstallXLogFileSegment(&endlogId, &endlogSeg, path, true, &max_advance, true)) { You can make a smaller patch if your not interested in the DEBUG2 message saying that it deleted a symlink... -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Aidan Van Dyk wrote: > * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [090601 10:56]: >> Tom Lane wrote: >>> Fujii Masao <masao.fujii@gmail.com> writes: >>>> pg_standby can use ln command to restore an archived file, >>>> which might destroy the archived file as follows. >>> Does it matter? pg_standby's source area wouldn't normally be an >>> "archive" in the real sense of the word, it's just a temporary staging >>> area between master and slave. (If it were being used as a real >>> archive, keeping it on the same disk as the live slave seems pretty >>> foolish anyway, so the case wouldn't arise.) >> It seems perfectly sane to source pg_standby directly from the archive >> to me. And we're talking about symbolic linking, so the archive >> directory might well be on an NFS mount. > > I would expect that any archive directly available would at least be RO > to the postgres slave... But.... Me too. I wonder if we should just remove the symlink option from pg_standby. Does anyone use it? Is there a meaningful performance difference? > Something like this would stop the "symlink" being renamed... Not portable, but probably portable > across platforms that have symlinks... > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c That seems reasonable as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > I wonder if we should just remove the symlink option from pg_standby. I was considering suggesting that too... regards, tom lane
Hi, On Mon, Jun 1, 2009 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> pg_standby can use ln command to restore an archived file, >> which might destroy the archived file as follows. > > Does it matter? pg_standby's source area wouldn't normally be an > "archive" in the real sense of the word, it's just a temporary staging > area between master and slave. If so, it might be deleted after triggering the warm-standby. But, we cannot remove it because the latest xlog file which is required for normal recovery might exist in it. This is another undesirable scenario. Is this problem? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > If so, it might be deleted after triggering the warm-standby. But, we cannot > remove it because the latest xlog file which is required for normal recovery > might exist in it. This is another undesirable scenario. Is this problem? What recovery? In the problem case you're positing, the slave server has executed a checkpoint and come up live. It's never going to be interested in the old xlog again. regards, tom lane
Hi, On Tue, Jun 2, 2009 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> If so, it might be deleted after triggering the warm-standby. But, we cannot >> remove it because the latest xlog file which is required for normal recovery >> might exist in it. This is another undesirable scenario. Is this problem? > > What recovery? In the problem case you're positing, the slave server > has executed a checkpoint and come up live. It's never going to be > interested in the old xlog again. Yes, the old xlog itself is not used again. But, the *old file* might be recycled and used later. The case that I'm looking at is that the symlink to a temporary area is recycled. Am I missing something? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > Yes, the old xlog itself is not used again. But, the *old file* might > be recycled and used later. The case that I'm looking at is that the > symlink to a temporary area is recycled. Am I missing something? Actually, I think the right fix for that would be to add defenses to xlog.c to not try to "recycle" a file that is a symlink. No good could possibly come of that. regards, tom lane
Hi, On Tue, Jun 2, 2009 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> Yes, the old xlog itself is not used again. But, the *old file* might >> be recycled and used later. The case that I'm looking at is that the >> symlink to a temporary area is recycled. Am I missing something? > > Actually, I think the right fix for that would be to add defenses to > xlog.c to not try to "recycle" a file that is a symlink. OK, I tweaked Aidan's patch. Thanks Aidan! http://archives.postgresql.org/message-id/20090601152736.GL15213@yugib.highrise.ca Changes are: - use lstat instead of stat - add #if HAVE_WORKING_LINK and #endif code Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Fujii Masao wrote: > On Tue, Jun 2, 2009 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Fujii Masao <masao.fujii@gmail.com> writes: >>> Yes, the old xlog itself is not used again. But, the *old file* might >>> be recycled and used later. The case that I'm looking at is that the >>> symlink to a temporary area is recycled. Am I missing something? >> Actually, I think the right fix for that would be to add defenses to >> xlog.c to not try to "recycle" a file that is a symlink. > > OK, I tweaked Aidan's patch. Thanks Aidan! > http://archives.postgresql.org/message-id/20090601152736.GL15213@yugib.highrise.ca > > Changes are: > - use lstat instead of stat > - add #if HAVE_WORKING_LINK and #endif code Committed. I left out the "#ifdef HAVE_WORKING_LINK" and used S_ISREG() instead of S_ISLNK. We use lstat + S_ISREG elsewhere too, so there should be no portability issues. I backpatched to 8.3, since that's when pg_standby was added. Arguably earlier versions should've been changed too, as pg_standby works with earlier versions, but I decided to not rock the boat as this only affects the pg_standby -l mode. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, On Tue, Jun 2, 2009 at 3:40 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Fujii Masao wrote: >> >> On Tue, Jun 2, 2009 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>> Fujii Masao <masao.fujii@gmail.com> writes: >>>> >>>> Yes, the old xlog itself is not used again. But, the *old file* might >>>> be recycled and used later. The case that I'm looking at is that the >>>> symlink to a temporary area is recycled. Am I missing something? >>> >>> Actually, I think the right fix for that would be to add defenses to >>> xlog.c to not try to "recycle" a file that is a symlink. >> >> OK, I tweaked Aidan's patch. Thanks Aidan! >> >> http://archives.postgresql.org/message-id/20090601152736.GL15213@yugib.highrise.ca >> >> Changes are: >> - use lstat instead of stat >> - add #if HAVE_WORKING_LINK and #endif code > > Committed. I left out the "#ifdef HAVE_WORKING_LINK" and used S_ISREG() > instead of S_ISLNK. We use lstat + S_ISREG elsewhere too, so there should be > no portability issues. Thanks a lot! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, 2009-06-01 at 14:47 +0900, Fujii Masao wrote: > pg_standby can use ln command to restore an archived file, > which might destroy the archived file as follows. > > 1) pg_standby creates the symlink to the archived file '102' > 2) '102' is applied > 3) the next file '103' doesn't exist and the trigger file is created > 4) '102' is re-fetched > 5) at the end of recovery, the symlink to '102' is rename to '202', > but it still points '102' > 6) after recovery, '202' is recycled (rename to '208', which still > points '102') > 7) '208' is written new xlog records over > --> the archived file '102' comes down! > > One simple solution to fix this problem... err...I don't see *any* problem at all, since pg_standby does not do step (1) in the way you say and therefore never does step (5). Any links created are explicitly deleted in all cases at the end of recovery. General comment on thread: What's going on with all these "fixes"? Anybody reading the commit log and/or weekly news is going to get fairly worried for no reason at all. For that reason I ask for longer consideration and wider discussion before committing something - it would certainly avoid lengthy post commit discussion as has occurred twice recently. I see no reason for such haste on these "fixes". If there's a need for haste, ship them to your customers directly, please don't scare other people's. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > err...I don't see *any* problem at all, since pg_standby does not do > step (1) in the way you say and therefore never does step (5). Any links > created are explicitly deleted in all cases at the end of recovery. That's a good point; don't we recover files under names like RECOVERYXLOG, not under names that could possibly conflict with regular WAL files? regards, tom lane
Simon Riggs wrote: > On Mon, 2009-06-01 at 14:47 +0900, Fujii Masao wrote: > >> pg_standby can use ln command to restore an archived file, >> which might destroy the archived file as follows. >> >> 1) pg_standby creates the symlink to the archived file '102' >> 2) '102' is applied >> 3) the next file '103' doesn't exist and the trigger file is created >> 4) '102' is re-fetched >> 5) at the end of recovery, the symlink to '102' is rename to '202', >> but it still points '102' >> 6) after recovery, '202' is recycled (rename to '208', which still >> points '102') >> 7) '208' is written new xlog records over >> --> the archived file '102' comes down! >> >> One simple solution to fix this problem... > > err...I don't see *any* problem at all, since pg_standby does not do > step (1) in the way you say and therefore never does step (5). Any links > created are explicitly deleted in all cases at the end of recovery. I don't know how you came to that conclusion, but Fujii-sans description seems accurate to me, and I can reproduce the behavior on my laptop. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> err...I don't see *any* problem at all, since pg_standby does not do >> step (1) in the way you say and therefore never does step (5). Any links >> created are explicitly deleted in all cases at the end of recovery. > > That's a good point; don't we recover files under names like > RECOVERYXLOG, not under names that could possibly conflict with regular > WAL files? Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar at the end of recovery, in exitArchiveRecovery(). Thinking about this some more, I think we should've changed exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more robust if exitArchiveRecovery() always copied the last WAL file rather than just renamed it. It doesn't seem safe to rely on the file the symlink points to to be valid after recovery is finished, and we might write to it before it's recycled, so the current fix isn't complete. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> That's a good point; don't we recover files under names like >> RECOVERYXLOG, not under names that could possibly conflict with regular >> WAL files? > Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar > at the end of recovery, in exitArchiveRecovery(). > Thinking about this some more, I think we should've changed > exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more > robust if exitArchiveRecovery() always copied the last WAL file rather > than just renamed it. It doesn't seem safe to rely on the file the > symlink points to to be valid after recovery is finished, and we might > write to it before it's recycled, so the current fix isn't complete. Hmm. I think really the reason it's coded that way is that we assumed the recovery command would be physically copying the file from someplace else. pg_standby is violating the backend's expectations by using a symlink. And I really doubt that the technique is saving anything, since the data has to be read in from the archive location anyway. I'm leaning back to the position that pg_standby's -l option is simply a bad idea and should be removed. regards, tom lane
Hi, On Wed, Jun 3, 2009 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tom Lane wrote: >>> That's a good point; don't we recover files under names like >>> RECOVERYXLOG, not under names that could possibly conflict with regular >>> WAL files? > >> Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar >> at the end of recovery, in exitArchiveRecovery(). > >> Thinking about this some more, I think we should've changed >> exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more >> robust if exitArchiveRecovery() always copied the last WAL file rather >> than just renamed it. It doesn't seem safe to rely on the file the >> symlink points to to be valid after recovery is finished, and we might >> write to it before it's recycled, so the current fix isn't complete. > > Hmm. I think really the reason it's coded that way is that we assumed > the recovery command would be physically copying the file from someplace > else. pg_standby is violating the backend's expectations by using a > symlink. And I really doubt that the technique is saving anything, since > the data has to be read in from the archive location anyway. > > I'm leaning back to the position that pg_standby's -l option is simply a > bad idea and should be removed. I'm OK with this. And, we should document the assumption for restore_command? Otherwise, some users might wrongly use 'ln' command to restore archived files. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, 2009-06-02 at 14:54 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > Tom Lane wrote: > >> That's a good point; don't we recover files under names like > >> RECOVERYXLOG, not under names that could possibly conflict with regular > >> WAL files? > > > Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar > > at the end of recovery, in exitArchiveRecovery(). > > > Thinking about this some more, I think we should've changed > > exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more > > robust if exitArchiveRecovery() always copied the last WAL file rather > > than just renamed it. It doesn't seem safe to rely on the file the > > symlink points to to be valid after recovery is finished, and we might > > write to it before it's recycled, so the current fix isn't complete. > > Hmm. I think really the reason it's coded that way is that we assumed > the recovery command would be physically copying the file from someplace > else. pg_standby is violating the backend's expectations by using a > symlink. And I really doubt that the technique is saving anything, since > the data has to be read in from the archive location anyway. > > I'm leaning back to the position that pg_standby's -l option is simply a > bad idea and should be removed. ISTM we didn't clearly state what the recovery_command should do either way. Even if you remove the pg_standby option that will not fix the problem for people who have written their own script, or existing users of pg_standby. The safe way is to do as Heikki suggests and copy the final file into place and I would add that we must then fsync it also. That should be back-patched possibly as far as 8.0. Documenting a change is not nearly enough. Removing -l is a separate discussion. If there was a consensus against it, I would suggest that we deprecate the option, so that it does nothing. As an aside, I would be also much more comfortable if there was an option to not recycle the WAL files at all, as a safe mode for error checking at least. The question is: why do we need to zero fill the file first anyway? We could save the 8 bytes for the prev pointer on every record if we added enough zeroes onto every WAL write to zap any pre-existing header data, causing recovery to fail. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, 2009-06-02 at 19:43 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Mon, 2009-06-01 at 14:47 +0900, Fujii Masao wrote: > > > >> pg_standby can use ln command to restore an archived file, > >> which might destroy the archived file as follows. > >> > >> 1) pg_standby creates the symlink to the archived file '102' > >> 2) '102' is applied > >> 3) the next file '103' doesn't exist and the trigger file is created > >> 4) '102' is re-fetched > >> 5) at the end of recovery, the symlink to '102' is rename to '202', > >> but it still points '102' > >> 6) after recovery, '202' is recycled (rename to '208', which still > >> points '102') > >> 7) '208' is written new xlog records over > >> --> the archived file '102' comes down! > >> > >> One simple solution to fix this problem... > > > > err...I don't see *any* problem at all, since pg_standby does not do > > step (1) in the way you say and therefore never does step (5). Any links > > created are explicitly deleted in all cases at the end of recovery. > > I don't know how you came to that conclusion, but Fujii-sans description > seems accurate to me, and I can reproduce the behavior on my laptop. I accept there is a bug, though the initial description did not appear to reflect the way the code works. Thank you for not making further changes while we wait to discuss. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support