Thread: Unnecessary WAL archiving after failover

Unnecessary WAL archiving after failover

From
Fujii Masao
Date:
Hi,

In streaming replication, after failover, the new master might have
lots of un-applied WAL files with the old timeline ID. They are the WAL
files which were recycled as future ones while the server was running
as a standby. Since they will never be used later, they don't need to
be archived after failover. But since they have neither a .ready nor a
.done file in archive_status, checkpoints after failover newly create
.ready files for them, and they are eventually archived. This might
cause a disk I/O spike on both the WAL and archive storage.

To avoid the above problem, I think that un-applied WAL files with the
old timeline ID should be marked as already-archived and recycled
immediately at the end of recovery. Thoughts?
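
As a rough sketch of the marking step only (Python for illustration, not
PostgreSQL's actual C implementation): the function name
mark_stale_segments_done and its parameters are hypothetical, and the sketch
assumes the caller already knows the new timeline ID and the segment in which
recovery ended. It relies on the 24-hex-digit WAL file naming, in which the
first 8 digits are the timeline ID. Recycling itself is left out.

```python
import os
import re
import tempfile

# WAL segment file names are 24 hex digits: 8 for the timeline ID,
# followed by 16 identifying the segment position.
SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def mark_stale_segments_done(xlog_dir, new_tli, end_segment):
    """Create .done for old-timeline segments located past end_segment.

    new_tli is the timeline chosen at the end of recovery; end_segment is
    the name of the segment in which recovery ended. Both are assumed to
    be known by the caller (hypothetical inputs for this sketch).
    """
    status_dir = os.path.join(xlog_dir, "archive_status")
    os.makedirs(status_dir, exist_ok=True)
    marked = []
    for name in sorted(os.listdir(xlog_dir)):
        if not SEGMENT_RE.match(name):
            continue
        tli = int(name[:8], 16)
        # Compare only the segment position, ignoring the timeline part.
        is_future = name[8:] > end_segment[8:]
        ready = os.path.join(status_dir, name + ".ready")
        done = os.path.join(status_dir, name + ".done")
        has_status = os.path.exists(ready) or os.path.exists(done)
        if tli < new_tli and is_future and not has_status:
            # An empty .done file tells the archiver to skip this segment.
            open(done, "w").close()
            marked.append(name)
    return marked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for seg in ("000000010000000000000005", "000000010000000000000007"):
            open(os.path.join(d, seg), "w").close()
        # Recovery ended in segment 5 and the server switched to timeline 2.
        print(mark_stale_segments_done(d, 2, "000000010000000000000005"))
```

Segments that already have a .ready or .done file are left alone, so running
the function twice marks nothing new.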

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Unnecessary WAL archiving after failover

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 5:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Hi,
>
> In streaming replication, after failover, the new master might have
> lots of un-applied WAL files with the old timeline ID. They are the WAL
> files which were recycled as future ones while the server was running
> as a standby. Since they will never be used later, they don't need to
> be archived after failover. But since they have neither a .ready nor a
> .done file in archive_status, checkpoints after failover newly create
> .ready files for them, and they are eventually archived. This might
> cause a disk I/O spike on both the WAL and archive storage.
>
> To avoid the above problem, I think that un-applied WAL files with the
> old timeline ID should be marked as already-archived and recycled
> immediately at the end of recovery. Thoughts?

I'm not an expert on this, but that makes sense to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Unnecessary WAL archiving after failover

From
Fujii Masao
Date:
On Thu, Mar 22, 2012 at 12:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 29, 2012 at 5:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Hi,
>>
>> In streaming replication, after failover, the new master might have
>> lots of un-applied WAL files with the old timeline ID. They are the WAL
>> files which were recycled as future ones while the server was running
>> as a standby. Since they will never be used later, they don't need to
>> be archived after failover. But since they have neither a .ready nor a
>> .done file in archive_status, checkpoints after failover newly create
>> .ready files for them, and they are eventually archived. This might
>> cause a disk I/O spike on both the WAL and archive storage.
>>
>> To avoid the above problem, I think that un-applied WAL files with the
>> old timeline ID should be marked as already-archived and recycled
>> immediately at the end of recovery. Thoughts?
>
> I'm not an expert on this, but that makes sense to me.

Thanks for agreeing with my idea.

On second thought, I found other issues about WAL archiving after
failover. So let me clarify the issues again.

Just after failover, there can be three kinds of WAL files in the new
master's pg_xlog directory:

(1) WAL files which were recycled by a restartpoint

I've already explained upthread the issue which these WAL files cause
after failover.


(2) WAL files which were restored from the archive

In 9.1 or before, the restored WAL files don't remain after failover
because they are always restored onto the temporary filename
"RECOVERYXLOG". So the issue which I explain from now doesn't exist
in 9.1 or before.

In 9.2dev, as the result of supporting cascade replication,
an archived WAL file is restored onto correct file name so that
cascading walsender can send it to another standby. This restored
WAL file has neither .ready nor .done archive status file. After
failover, checkpoint checks the archive status file of the restored
WAL file to attempt to recycle it, finds that it has neither .ready
nor .done, and creates .ready. Because of the existence of .ready,
it will be archived again even though it obviously already exists in
the archival storage :(

To prevent a restored WAL file from being archived again, I think
that .done should be created whenever a WAL file is successfully
restored (of course this should happen only when archive_mode is
enabled). Thoughts?
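
A minimal sketch of that idea, in Python for illustration rather than the
server's C code: restore_segment and its parameters are hypothetical names,
and a plain file copy stands in for whatever restore_command actually does.

```python
import os
import shutil
import tempfile


def restore_segment(archive_dir, xlog_dir, segment, archive_mode=True):
    """Copy a segment from the archive into xlog_dir under its real name.

    Returns True if the segment was restored. When archive_mode is on, a
    .done status file is created so the segment is not archived again.
    """
    src = os.path.join(archive_dir, segment)
    if not os.path.exists(src):
        return False
    shutil.copyfile(src, os.path.join(xlog_dir, segment))
    if archive_mode:
        status_dir = os.path.join(xlog_dir, "archive_status")
        os.makedirs(status_dir, exist_ok=True)
        # Empty marker file: the segment is already in the archive.
        open(os.path.join(status_dir, segment + ".done"), "w").close()
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as archive, \
            tempfile.TemporaryDirectory() as xlog:
        seg = "000000010000000000000003"
        with open(os.path.join(archive, seg), "wb") as f:
            f.write(b"wal data")
        print(restore_segment(archive, xlog, seg))
        print(sorted(os.listdir(os.path.join(xlog, "archive_status"))))
```

With the .done file in place, a later checkpoint would find an archive status
for the restored segment and would have no reason to create .ready for it.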

Since this is the oversight of cascade replication, I'm thinking to
implement the patch for 9.2dev.


(3) WAL files which were streamed from the master

These WAL files also don't have any archive status, so checkpoint
creates .ready for them after failover. And then, all or many of
them will be archived at a time, which would cause I/O spike on
both WAL and archival storage.

To avoid this problem, I think that we should change walreceiver
so that it creates .ready as soon as it completes the WAL file. Also
we should change the archiver process so that it starts up even in
standby mode and archives the WAL files.

If each server has its own archival storage, the above solution would
work fine. But if all servers share the archival storage, multiple archiver
processes in those servers might archive the same WAL file to
the shared area at the same time. Is this OK? If not, to avoid this,
we might need to separate archive_mode into two: one for normal mode
(i.e., master), another for standby mode. If the archive is shared,
we can ensure that only one archiver in the master copies the WAL file
at the same time by disabling WAL archiving in standby mode but
enabling it in normal mode. Thoughts?

Invoking the archiver process in standby mode is a new feature,
not a bug fix. It's too late to propose a new feature for 9.2. So I'll
propose this for 9.3.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Unnecessary WAL archiving after failover

From
Robert Haas
Date:
On Fri, Mar 23, 2012 at 10:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On second thought, I found other issues about WAL archiving after
> failover. So let me clarify the issues again.
>
> Just after failover, there can be three kinds of WAL files in the new
> master's pg_xlog directory:
>
> (1) WAL files which were recycled by a restartpoint
>
> I've already explained upthread the issue which these WAL files cause
> after failover.

Check.

> (2) WAL files which were restored from the archive
>
> In 9.1 or before, the restored WAL files don't remain after failover
> because they are always restored onto the temporary filename
> "RECOVERYXLOG". So the issue which I explain from now doesn't exist
> in 9.1 or before.
>
> In 9.2dev, as the result of supporting cascade replication,
> an archived WAL file is restored onto correct file name so that
> cascading walsender can send it to another standby. This restored
> WAL file has neither .ready nor .done archive status file. After
> failover, checkpoint checks the archive status file of the restored
> WAL file to attempt to recycle it, finds that it has neither .ready
> nor .done, and creates .ready. Because of the existence of .ready,
> it will be archived again even though it obviously already exists in
> the archival storage :(
>
> To prevent a restored WAL file from being archived again, I think
> that .done should be created whenever a WAL file is successfully
> restored (of course this should happen only when archive_mode is
> enabled). Thoughts?
>
> Since this is the oversight of cascade replication, I'm thinking to
> implement the patch for 9.2dev.

Yes, I think we had better fix this in 9.2.  As you say, it's a loose
end from streaming replication.  Do you have a patch?

> (3) WAL files which were streamed from the master
>
> These WAL files also don't have any archive status, so checkpoint
> creates .ready for them after failover. And then, all or many of
> them will be archived at a time, which would cause I/O spike on
> both WAL and archival storage.
>
> To avoid this problem, I think that we should change walreceiver
> so that it creates .ready as soon as it completes the WAL file. Also
> we should change the archiver process so that it starts up even in
> standby mode and archives the WAL files.
>
> If each server has its own archival storage, the above solution would
> work fine. But if all servers share the archival storage, multiple archiver
> processes in those servers might archive the same WAL file to
> the shared area at the same time. Is this OK? If not, to avoid this,
> we might need to separate archive_mode into two: one for normal mode
> (i.e., master), another for standby mode. If the archive is shared,
> we can ensure that only one archiver in the master copies the WAL file
> at the same time by disabling WAL archiving in standby mode but
> enabling it in normal mode. Thoughts?

Another option would be to run the archiver in both modes and somehow
pass a flag indicating whether it's running in standby mode or normal
running.

> Invoking the archiver process in standby mode is a new feature,
> not a bug fix. It's too late to propose a new feature for 9.2. So I'll
> propose this for 9.3.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Unnecessary WAL archiving after failover

From
Noah Misch
Date:
On Fri, Mar 23, 2012 at 11:03:27PM +0900, Fujii Masao wrote:
> > On Wed, Feb 29, 2012 at 5:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >> In streaming replication, after failover, the new master might have
> >> lots of un-applied WAL files with the old timeline ID. They are the WAL
> >> files which were recycled as future ones while the server was running
> >> as a standby. Since they will never be used later, they don't need to
> >> be archived after failover. But since they have neither a .ready nor a
> >> .done file in archive_status, checkpoints after failover newly create
> >> .ready files for them, and they are eventually archived. This might
> >> cause a disk I/O spike on both the WAL and archive storage.

If the old master archived later WAL that the new master never restored, won't
this attempt to archive a file under a name that already exists in the
archive?  The documentation says this:
    The archive command should generally be designed to refuse to
    overwrite any pre-existing archive file. This is an important safety
    feature to preserve the integrity of your archive in case of
    administrator error (such as sending the output of two different
    servers to the same archive directory).

    It is advisable to test your proposed archive command to ensure that
    it indeed does not overwrite an existing file, and that it returns
    nonzero status in this case.

Archiving on the new master would halt until the operator intervenes.
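
For illustration, a minimal sketch of an archive step that follows the
documentation's recommendation, written in Python; archive_segment is a
hypothetical helper, not a PostgreSQL API. It refuses to overwrite an existing
file and reports that case with a nonzero return value.

```python
import os
import shutil
import tempfile


def archive_segment(src_path, archive_dir):
    """Copy src_path into archive_dir; return 0 on success, 1 if it exists."""
    dest = os.path.join(archive_dir, os.path.basename(src_path))
    try:
        # O_EXCL makes creation fail if the destination already exists,
        # so an existing archive file is never overwritten.
        fd = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return 1
    with os.fdopen(fd, "wb") as out, open(src_path, "rb") as src:
        shutil.copyfileobj(src, out)
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        archive = os.path.join(d, "archive")
        os.mkdir(archive)
        seg = os.path.join(d, "000000010000000000000001")
        with open(seg, "wb") as f:
            f.write(b"wal data")
        print(archive_segment(seg, archive))  # first attempt
        print(archive_segment(seg, archive))  # same name again
```

The second call is the situation described above: the name is already taken in
the archive, the step fails, and it would keep failing on every retry. A real
command would also need to fsync the copy and clean up after a partial write,
which this sketch omits.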

> >> To avoid the above problem, I think that un-applied WAL files with the
> >> old timeline ID should be marked as already-archived and recycled
> >> immediately at the end of recovery. Thoughts?

A small hazard comes to mind.  If the administrator manually copied
post-timeline-divergence segments from the failed master to the new master's
pg_xlog, the current implementation loads them into the archive for you.  The
new master could never apply those files locally, but they might be useful for
alternate recoveries down the previous timeline.  Nonetheless, we can just as
reasonably specify that it's not a role of the new master to provide this
service.  Call the fact that it did so in previous releases an implementation
artifact.

What about instead creating an archive status file at recycle time and
deleting it as we begin to populate the file?  That distinguishes copied-in,
unarchived segments from recycled ones.

Incidentally, RemoveOldXlogFiles() has this comment:
    /*
     * We ignore the timeline part of the XLOG segment identifiers in
     * deciding whether a segment is still needed.  This ensures that we
     * won't prematurely remove a segment from a parent timeline. We could
     * probably be a little more proactive about removing segments of
     * non-parent timelines, but that would be a whole lot more
     * complicated.

Should both instances of "parent" be "child" or "descendant"?

> Just after failover, there can be three kinds of WAL files in the new
> master's pg_xlog directory:
>
> (1) WAL files which were recycled by a restartpoint
> 
> I've already explained upthread the issue which these WAL files cause
> after failover.
> 
> 
> (2) WAL files which were restored from the archive
> 
> In 9.1 or before, the restored WAL files don't remain after failover
> because they are always restored onto the temporary filename
> "RECOVERYXLOG". So the issue which I explain from now doesn't exist
> in 9.1 or before.
> 
> In 9.2dev, as the result of supporting cascade replication,
> an archived WAL file is restored onto correct file name so that
> cascading walsender can send it to another standby. This restored

The documentation still says this:
    WAL segments that cannot be found in the archive will be sought in
    pg_xlog/; this allows use of recent un-archived segments. However,
    segments that are available from the archive will be used in
    preference to files in pg_xlog/. The system will not overwrite the
    existing contents of pg_xlog/ when retrieving archived files.

I gather the last sentence is now false?

> WAL file has neither .ready nor .done archive status file. After
> failover, checkpoint checks the archive status file of the restored
> WAL file to attempt to recycle it, finds that it has neither .ready
> nor .done, and creates .ready. Because of the existence of .ready,
> it will be archived again even though it obviously already exists in
> the archival storage :(
>
> To prevent a restored WAL file from being archived again, I think
> that .done should be created whenever a WAL file is successfully
> restored (of course this should happen only when archive_mode is
> enabled). Thoughts?

Your proposed fix makes sense, and I cannot think of any disadvantage.
Concerning only doing it when archive_mode=on, would there ever be a case
where a segment is restored under archive_mode=off, then the server restarted
with archive_mode=on and an archival attempted on that segment?

> (3) WAL files which were streamed from the master
> 
> These WAL files also don't have any archive status, so checkpoint
> creates .ready for them after failover. And then, all or many of
> them will be archived at a time, which would cause I/O spike on
> both WAL and archival storage.
> 
> To avoid this problem, I think that we should change walreceiver
> so that it creates .ready as soon as it completes the WAL file. Also
> we should change the archiver process so that it starts up even in
> standby mode and archives the WAL files.
> 
> If each server has its own archival storage, the above solution would
> work fine. But if all servers share the archival storage, multiple archiver
> processes in those servers might archive the same WAL file to
> the shared area at the same time. Is this OK? If not, to avoid this,
> we might need to separate archive_mode into two: one for normal mode
> (i.e., master), another for standby mode. If the archive is shared,
> we can ensure that only one archiver in the master copies the WAL file
> at the same time by disabling WAL archiving in standby mode but
> enabling it in normal mode. Thoughts?

I don't think we should remove the recommendation to make archive_command fail
when the archive already has the file.  However, the new master is likely to
have at least one segment not appearing in the archive along with some
already-archived segments.  There's certainly a use case for completing the
shared archive with local-only segments.  I think this also ties into the
prerequisites for letting former peers of the new master begin to follow the
new master without fresh base backups.

More thought is needed here.

Thanks,
nm


Re: Unnecessary WAL archiving after failover

From
Simon Riggs
Date:
On 23 March 2012 14:03, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Mar 22, 2012 at 12:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 29, 2012 at 5:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> Hi,
>>>
>>> In streaming replication, after failover, the new master might have
>>> lots of un-applied WAL files with the old timeline ID. They are the WAL
>>> files which were recycled as future ones while the server was running
>>> as a standby. Since they will never be used later, they don't need to
>>> be archived after failover. But since they have neither a .ready nor a
>>> .done file in archive_status, checkpoints after failover newly create
>>> .ready files for them, and they are eventually archived. This might
>>> cause a disk I/O spike on both the WAL and archive storage.
>>>
>>> To avoid the above problem, I think that un-applied WAL files with the
>>> old timeline ID should be marked as already-archived and recycled
>>> immediately at the end of recovery. Thoughts?
>>
>> I'm not an expert on this, but that makes sense to me.
>
> Thanks for agreeing with my idea.
>
> On second thought, I found other issues about WAL archiving after
> failover. So let me clarify the issues again.
>
> Just after failover, there can be three kinds of WAL files in the new
> master's pg_xlog directory:
>
> (1) WAL files which were recycled by a restartpoint
>
> I've already explained upthread the issue which these WAL files cause
> after failover.

This might be a problem, or it might be archiving important data and
you have a corrupt WAL file/CRC. I'd rather take the hit than to
delete potentially useful data. And it avoids having a bug that
deletes useful segments also.


> (2) WAL files which were restored from the archive
>
> In 9.1 or before, the restored WAL files don't remain after failover
> because they are always restored onto the temporary filename
> "RECOVERYXLOG". So the issue which I explain from now doesn't exist
> in 9.1 or before.
>
> In 9.2dev, as the result of supporting cascade replication,
> an archived WAL file is restored onto correct file name so that
> cascading walsender can send it to another standby. This restored
> WAL file has neither .ready nor .done archive status file. After
> failover, checkpoint checks the archive status file of the restored
> WAL file to attempt to recycle it, finds that it has neither .ready
> nor .done, and creates .ready. Because of the existence of .ready,
> it will be archived again even though it obviously already exists in
> the archival storage :(
>
> To prevent a restored WAL file from being archived again, I think
> that .done should be created whenever a WAL file is successfully
> restored (of course this should happen only when archive_mode is
> enabled). Thoughts?

Agreed

> Since this is the oversight of cascade replication, I'm thinking to
> implement the patch for 9.2dev.

Very much so.

> (3) WAL files which were streamed from the master
>
> These WAL files also don't have any archive status, so checkpoint
> creates .ready for them after failover. And then, all or many of
> them will be archived at a time, which would cause I/O spike on
> both WAL and archival storage.
>
> To avoid this problem, I think that we should change walreceiver
> so that it creates .ready as soon as it completes the WAL file. Also
> we should change the archiver process so that it starts up even in
> standby mode and archives the WAL files.
>
> If each server has its own archival storage, the above solution would
> work fine. But if all servers share the archival storage, multiple archiver
> processes in those servers might archive the same WAL file to
> the shared area at the same time. Is this OK? If not, to avoid this,
> we might need to separate archive_mode into two: one for normal mode
> (i.e., master), another for standby mode. If the archive is shared,
> we can ensure that only one archiver in the master copies the WAL file
> at the same time by disabling WAL archiving in standby mode but
> enabling it in normal mode. Thoughts?

Use %s as an option to be passed to the archive command.

> Invoking the archiver process in standby mode is a new feature,
> not a bug fix. It's too late to propose a new feature for 9.2. So I'll
> propose this for 9.3.

Yep, good idea.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services