Thread: pg_standby -l might destory the archived file

pg_standby -l might destory the archived file

From
Fujii Masao
Date:
Hi,

pg_standby can use ln command to restore an archived file,
which might destroy the archived file as follows.

1) pg_standby creates the symlink to the archived file '102'
2) '102' is applied
3) the next file '103' doesn't exist and the trigger file is created
4) '102' is re-fetched
5) at the end of recovery, the symlink to '102' is rename to '202',   but it still points '102'
6) after recovery, '202' is recycled (rename to '208', which still   points '102')
7) '208' is written new xlog records over   --> the archived file '102' comes down!

One simple solution to fix this problem is copying the content
of the symlink (ie. the archived file itself) and deleting it instead
of renaming it at the end of recovery.

Thought?

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_standby -l might destory the archived file

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> pg_standby can use ln command to restore an archived file,
> which might destroy the archived file as follows.

Does it matter?  pg_standby's source area wouldn't normally be an
"archive" in the real sense of the word, it's just a temporary staging
area between master and slave.  (If it were being used as a real
archive, keeping it on the same disk as the live slave seems pretty
foolish anyway, so the case wouldn't arise.)
        regards, tom lane


Re: pg_standby -l might destory the archived file

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> pg_standby can use ln command to restore an archived file,
>> which might destroy the archived file as follows.
> 
> Does it matter?  pg_standby's source area wouldn't normally be an
> "archive" in the real sense of the word, it's just a temporary staging
> area between master and slave.  (If it were being used as a real
> archive, keeping it on the same disk as the live slave seems pretty
> foolish anyway, so the case wouldn't arise.)

It seems perfectly sane to source pg_standby directly from the archive 
to me. And we're talking about symbolic linking, so the archive 
directory might well be on an NFS mount.

(I haven't looked at the issue yet)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_standby -l might destory the archived file

From
Aidan Van Dyk
Date:
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [090601 10:56]:
> Tom Lane wrote:
>> Fujii Masao <masao.fujii@gmail.com> writes:
>>> pg_standby can use ln command to restore an archived file,
>>> which might destroy the archived file as follows.
>>
>> Does it matter?  pg_standby's source area wouldn't normally be an
>> "archive" in the real sense of the word, it's just a temporary staging
>> area between master and slave.  (If it were being used as a real
>> archive, keeping it on the same disk as the live slave seems pretty
>> foolish anyway, so the case wouldn't arise.)
>
> It seems perfectly sane to source pg_standby directly from the archive  
> to me. And we're talking about symbolic linking, so the archive  
> directory might well be on an NFS mount.

I would expect that any archive directly available would at least be RO
to the postgres slave... But....

Something like this would stop the "symlink" being renamed... Not portable, but probably portable
across platforms that have symlinks...diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.cindex1b575e2..cba3f7a 100644--- a/src/backend/access/transam/xlog.c+++
b/src/backend/access/transam/xlog.c@@-3042,13 +3042,23 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
      {            if (XLogArchiveCheckDone(xlde->d_name))            {+                               struct stat
stat_buf;               snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
 
                /*                 * Before deleting the file, see if it can be recycled as a                 * future
logsegment.+                                * If it's a symlink, we don't recycle it                 */-
              if (InstallXLogFileSegment(&endlogId, &endlogSeg, path,+                               if ( (stat(path,
&stat_buf)== 0) && S_ISLNK(stat_buf.st_mode))+                               {+
ereport(DEBUG2,+                                                      (errmsg("removing transaction log symlink
\"%s\"",+                                                                      xlde->d_name)));+
              unlink(path);+                                       CheckpointStats.ckpt_segs_removed++;+
              }+                               else if (InstallXLogFileSegment(&endlogId, &endlogSeg, path,
                             true, &max_advance,                                           true))                {
 

You can make a smaller patch if your not interested in the DEBUG2 message
saying that it deleted a symlink...


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: pg_standby -l might destory the archived file

From
Heikki Linnakangas
Date:
Aidan Van Dyk wrote:
> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [090601 10:56]:
>> Tom Lane wrote:
>>> Fujii Masao <masao.fujii@gmail.com> writes:
>>>> pg_standby can use ln command to restore an archived file,
>>>> which might destroy the archived file as follows.
>>> Does it matter?  pg_standby's source area wouldn't normally be an
>>> "archive" in the real sense of the word, it's just a temporary staging
>>> area between master and slave.  (If it were being used as a real
>>> archive, keeping it on the same disk as the live slave seems pretty
>>> foolish anyway, so the case wouldn't arise.)
>> It seems perfectly sane to source pg_standby directly from the archive  
>> to me. And we're talking about symbolic linking, so the archive  
>> directory might well be on an NFS mount.
> 
> I would expect that any archive directly available would at least be RO
> to the postgres slave... But....

Me too.

I wonder if we should just remove the symlink option from pg_standby. 
Does anyone use it? Is there a meaningful performance difference?

> Something like this would stop the "symlink" being renamed... Not portable, but probably portable
> across platforms that have symlinks...
>     diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

That seems reasonable as well.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_standby -l might destory the archived file

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I wonder if we should just remove the symlink option from pg_standby. 

I was considering suggesting that too...
        regards, tom lane


Re: pg_standby -l might destory the archived file

From
Fujii Masao
Date:
Hi,

On Mon, Jun 1, 2009 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> pg_standby can use ln command to restore an archived file,
>> which might destroy the archived file as follows.
>
> Does it matter?  pg_standby's source area wouldn't normally be an
> "archive" in the real sense of the word, it's just a temporary staging
> area between master and slave.

If so, it might be deleted after triggering the warm-standby. But, we cannot
remove it because the latest xlog file which is required for normal recovery
might exist in it. This is another undesirable scenario. Is this problem?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_standby -l might destory the archived file

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> If so, it might be deleted after triggering the warm-standby. But, we cannot
> remove it because the latest xlog file which is required for normal recovery
> might exist in it. This is another undesirable scenario. Is this problem?

What recovery?  In the problem case you're positing, the slave server
has executed a checkpoint and come up live.  It's never going to be
interested in the old xlog again.
        regards, tom lane


Re: pg_standby -l might destory the archived file

From
Fujii Masao
Date:
Hi,

On Tue, Jun 2, 2009 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> If so, it might be deleted after triggering the warm-standby. But, we cannot
>> remove it because the latest xlog file which is required for normal recovery
>> might exist in it. This is another undesirable scenario. Is this problem?
>
> What recovery?  In the problem case you're positing, the slave server
> has executed a checkpoint and come up live.  It's never going to be
> interested in the old xlog again.

Yes, the old xlog itself is not used again. But, the *old file* might
be recycled
and used later. The case that I'm looking at is that the symlink to a temporary
area is recycled. Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_standby -l might destory the archived file

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> Yes, the old xlog itself is not used again. But, the *old file* might
> be recycled and used later. The case that I'm looking at is that the
> symlink to a temporary area is recycled. Am I missing something?

Actually, I think the right fix for that would be to add defenses to
xlog.c to not try to "recycle" a file that is a symlink.  No good could
possibly come of that.
        regards, tom lane


Re: pg_standby -l might destory the archived file

From
Fujii Masao
Date:
Hi,

On Tue, Jun 2, 2009 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> Yes, the old xlog itself is not used again. But, the *old file* might
>> be recycled and used later. The case that I'm looking at is that the
>> symlink to a temporary area is recycled. Am I missing something?
>
> Actually, I think the right fix for that would be to add defenses to
> xlog.c to not try to "recycle" a file that is a symlink.

OK, I tweaked Aidan's patch. Thanks Aidan!
http://archives.postgresql.org/message-id/20090601152736.GL15213@yugib.highrise.ca

Changes are:
- use lstat instead of stat
- add #if HAVE_WORKING_LINK and #endif code

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: pg_standby -l might destory the archived file

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On Tue, Jun 2, 2009 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Fujii Masao <masao.fujii@gmail.com> writes:
>>> Yes, the old xlog itself is not used again. But, the *old file* might
>>> be recycled and used later. The case that I'm looking at is that the
>>> symlink to a temporary area is recycled. Am I missing something?
>> Actually, I think the right fix for that would be to add defenses to
>> xlog.c to not try to "recycle" a file that is a symlink.
> 
> OK, I tweaked Aidan's patch. Thanks Aidan!
> http://archives.postgresql.org/message-id/20090601152736.GL15213@yugib.highrise.ca
> 
> Changes are:
> - use lstat instead of stat
> - add #if HAVE_WORKING_LINK and #endif code

Committed. I left out the "#ifdef HAVE_WORKING_LINK" and used S_ISREG() 
instead of S_ISLNK. We use lstat + S_ISREG elsewhere too, so there 
should be no portability issues.

I backpatched to 8.3, since that's when pg_standby was added. Arguably 
earlier versions should've been changed too, as pg_standby works with 
earlier versions, but I decided to not rock the boat as this only 
affects the pg_standby -l mode.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_standby -l might destory the archived file

From
Fujii Masao
Date:
Hi,

On Tue, Jun 2, 2009 at 3:40 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>>
>> On Tue, Jun 2, 2009 at 10:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>
>>> Fujii Masao <masao.fujii@gmail.com> writes:
>>>>
>>>> Yes, the old xlog itself is not used again. But, the *old file* might
>>>> be recycled and used later. The case that I'm looking at is that the
>>>> symlink to a temporary area is recycled. Am I missing something?
>>>
>>> Actually, I think the right fix for that would be to add defenses to
>>> xlog.c to not try to "recycle" a file that is a symlink.
>>
>> OK, I tweaked Aidan's patch. Thanks Aidan!
>>
>> http://archives.postgresql.org/message-id/20090601152736.GL15213@yugib.highrise.ca
>>
>> Changes are:
>> - use lstat instead of stat
>> - add #if HAVE_WORKING_LINK and #endif code
>
> Committed. I left out the "#ifdef HAVE_WORKING_LINK" and used S_ISREG()
> instead of S_ISLNK. We use lstat + S_ISREG elsewhere too, so there should be
> no portability issues.

Thanks a lot!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_standby -l might destory the archived file

From
Simon Riggs
Date:
On Mon, 2009-06-01 at 14:47 +0900, Fujii Masao wrote:

> pg_standby can use ln command to restore an archived file,
> which might destroy the archived file as follows.
> 
> 1) pg_standby creates the symlink to the archived file '102'
> 2) '102' is applied
> 3) the next file '103' doesn't exist and the trigger file is created
> 4) '102' is re-fetched
> 5) at the end of recovery, the symlink to '102' is rename to '202',
>     but it still points '102'
> 6) after recovery, '202' is recycled (rename to '208', which still
>     points '102')
> 7) '208' is written new xlog records over
>     --> the archived file '102' comes down!
> 
> One simple solution to fix this problem...

err...I don't see *any* problem at all, since pg_standby does not do
step (1) in the way you say and therefore never does step (5). Any links
created are explicitly deleted in all cases at the end of recovery.

General comment on thread: What's going on with all these "fixes"?
Anybody reading the commit log and/or weekly news is going to get fairly
worried for no reason at all. For that reason I ask for longer
consideration and wider discussion before committing something - it
would certainly avoid lengthy post commit discussion as has occurred
twice recently. I see no reason for such haste on these "fixes". If
there's a need for haste, ship them to your customers directly, please
don't scare other people's. 

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_standby -l might destory the archived file

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> err...I don't see *any* problem at all, since pg_standby does not do
> step (1) in the way you say and therefore never does step (5). Any links
> created are explicitly deleted in all cases at the end of recovery.

That's a good point; don't we recover files under names like
RECOVERYXLOG, not under names that could possibly conflict with regular
WAL files?
        regards, tom lane


Re: pg_standby -l might destory the archived file

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2009-06-01 at 14:47 +0900, Fujii Masao wrote:
> 
>> pg_standby can use ln command to restore an archived file,
>> which might destroy the archived file as follows.
>>
>> 1) pg_standby creates the symlink to the archived file '102'
>> 2) '102' is applied
>> 3) the next file '103' doesn't exist and the trigger file is created
>> 4) '102' is re-fetched
>> 5) at the end of recovery, the symlink to '102' is rename to '202',
>>     but it still points '102'
>> 6) after recovery, '202' is recycled (rename to '208', which still
>>     points '102')
>> 7) '208' is written new xlog records over
>>     --> the archived file '102' comes down!
>>
>> One simple solution to fix this problem...
> 
> err...I don't see *any* problem at all, since pg_standby does not do
> step (1) in the way you say and therefore never does step (5). Any links
> created are explicitly deleted in all cases at the end of recovery.

I don't know how you came to that conclusion, but Fujii-sans description 
seems accurate to me, and I can reproduce the behavior on my laptop.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_standby -l might destory the archived file

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> err...I don't see *any* problem at all, since pg_standby does not do
>> step (1) in the way you say and therefore never does step (5). Any links
>> created are explicitly deleted in all cases at the end of recovery.
> 
> That's a good point; don't we recover files under names like
> RECOVERYXLOG, not under names that could possibly conflict with regular
> WAL files?

Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar 
at the end of recovery, in exitArchiveRecovery().

Thinking about this some more, I think we should've changed 
exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more 
robust if exitArchiveRecovery() always copied the last WAL file rather 
than just renamed it. It doesn't seem safe to rely on the file the 
symlink points to to be valid after recovery is finished, and we might 
write to it before it's recycled, so the current fix isn't complete.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_standby -l might destory the archived file

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> That's a good point; don't we recover files under names like
>> RECOVERYXLOG, not under names that could possibly conflict with regular
>> WAL files?

> Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar 
> at the end of recovery, in exitArchiveRecovery().

> Thinking about this some more, I think we should've changed 
> exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more 
> robust if exitArchiveRecovery() always copied the last WAL file rather 
> than just renamed it. It doesn't seem safe to rely on the file the 
> symlink points to to be valid after recovery is finished, and we might 
> write to it before it's recycled, so the current fix isn't complete.

Hmm.  I think really the reason it's coded that way is that we assumed
the recovery command would be physically copying the file from someplace
else.  pg_standby is violating the backend's expectations by using a
symlink.  And I really doubt that the technique is saving anything, since
the data has to be read in from the archive location anyway.

I'm leaning back to the position that pg_standby's -l option is simply a
bad idea and should be removed.
        regards, tom lane


Re: pg_standby -l might destory the archived file

From
Fujii Masao
Date:
Hi,

On Wed, Jun 3, 2009 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> That's a good point; don't we recover files under names like
>>> RECOVERYXLOG, not under names that could possibly conflict with regular
>>> WAL files?
>
>> Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar
>> at the end of recovery, in exitArchiveRecovery().
>
>> Thinking about this some more, I think we should've changed
>> exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more
>> robust if exitArchiveRecovery() always copied the last WAL file rather
>> than just renamed it. It doesn't seem safe to rely on the file the
>> symlink points to to be valid after recovery is finished, and we might
>> write to it before it's recycled, so the current fix isn't complete.
>
> Hmm.  I think really the reason it's coded that way is that we assumed
> the recovery command would be physically copying the file from someplace
> else.  pg_standby is violating the backend's expectations by using a
> symlink.  And I really doubt that the technique is saving anything, since
> the data has to be read in from the archive location anyway.
>
> I'm leaning back to the position that pg_standby's -l option is simply a
> bad idea and should be removed.

I'm OK with this. And, we should document the assumption for
restore_command? Otherwise, some users might wrongly use
'ln' command to restore archived files.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_standby -l might destory the archived file

From
Simon Riggs
Date:
On Tue, 2009-06-02 at 14:54 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > Tom Lane wrote:
> >> That's a good point; don't we recover files under names like
> >> RECOVERYXLOG, not under names that could possibly conflict with regular
> >> WAL files?
> 
> > Yes. But we rename RECOVERYXLOG to 000000010000000000000057 or similar 
> > at the end of recovery, in exitArchiveRecovery().
> 
> > Thinking about this some more, I think we should've changed 
> > exitArchiveRecovery() rather than RemoveOldXlogFiles(): it would be more 
> > robust if exitArchiveRecovery() always copied the last WAL file rather 
> > than just renamed it. It doesn't seem safe to rely on the file the 
> > symlink points to to be valid after recovery is finished, and we might 
> > write to it before it's recycled, so the current fix isn't complete.
> 
> Hmm.  I think really the reason it's coded that way is that we assumed
> the recovery command would be physically copying the file from someplace
> else.  pg_standby is violating the backend's expectations by using a
> symlink.  And I really doubt that the technique is saving anything, since
> the data has to be read in from the archive location anyway.
> 
> I'm leaning back to the position that pg_standby's -l option is simply a
> bad idea and should be removed.

ISTM we didn't clearly state what the recovery_command should do either
way. Even if you remove the pg_standby option that will not fix the
problem for people who have written their own script, or existing users
of pg_standby. The safe way is to do as Heikki suggests and copy the
final file into place and I would add that we must then fsync it also.
That should be back-patched possibly as far as 8.0. Documenting a change
is not nearly enough.

Removing -l is a separate discussion. If there was a consensus against
it, I would suggest that we deprecate the option, so that it does
nothing.

As an aside, I would be also much more comfortable if there was an
option to not recycle the WAL files at all, as a safe mode for error
checking at least. The question is: why do we need to zero fill the file
first anyway? We could save the 8 bytes for the prev pointer on every
record if we added enough zeroes onto every WAL write to zap any
pre-existing header data, causing recovery to fail.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_standby -l might destory the archived file

From
Simon Riggs
Date:
On Tue, 2009-06-02 at 19:43 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Mon, 2009-06-01 at 14:47 +0900, Fujii Masao wrote:
> > 
> >> pg_standby can use ln command to restore an archived file,
> >> which might destroy the archived file as follows.
> >>
> >> 1) pg_standby creates the symlink to the archived file '102'
> >> 2) '102' is applied
> >> 3) the next file '103' doesn't exist and the trigger file is created
> >> 4) '102' is re-fetched
> >> 5) at the end of recovery, the symlink to '102' is rename to '202',
> >>     but it still points '102'
> >> 6) after recovery, '202' is recycled (rename to '208', which still
> >>     points '102')
> >> 7) '208' is written new xlog records over
> >>     --> the archived file '102' comes down!
> >>
> >> One simple solution to fix this problem...
> > 
> > err...I don't see *any* problem at all, since pg_standby does not do
> > step (1) in the way you say and therefore never does step (5). Any links
> > created are explicitly deleted in all cases at the end of recovery.
> 
> I don't know how you came to that conclusion, but Fujii-sans description 
> seems accurate to me, and I can reproduce the behavior on my laptop.

I accept there is a bug, though the initial description did not appear
to reflect the way the code works.

Thank you for not making further changes while we wait to discuss.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support