Thread: [9.3 bug] disk space in pg_xlog increases during archive recovery

[9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

27 July 2013, 22:57:57

Hello,

I'm sorry I've been touching several things recently before fixing any of 
them.

I've noticed undesirable disk space increase while performing archive 
recovery with PostgreSQL 9.3.  This happens with 9.2, too.

I just performed archived recovery with the following parameters in 
recovery.conf.  I'm not using replication.

restore_command = 'cp ...'
recovery_target_timeline = 'latest'

As the archive recovery progresses, the disk space used by $PGDATA/pg_xlog 
increases.  It seems that restored archive WAL files are accumulated there. 
This is considerable amount depending on the number of archived WAL files. 
In my case, the recovery failed because of the shortage of disk space.  This 
did not happen with 9.1

The cause appears to be KeepFileRestoredFromArchive().  This function saves 
restored archive WAL files in pg_xlog/.  I guess this is for cascading 
replication, a new feature added in 9.2.
So, I think it is a bug that the disk space increases if not using cascading 
replication.  Those who migrated from 9.1 and do not use 9.2 features would 
be surprised like me.


Do you think this should be fixed?  How should it be fixed?  If possible, 
could you fix it in the next minor release?  If you all are busy, I'll try 
to fix it, but give me advice how to do that.

Regards
MauMau

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

29 July 2013, 17:56:12

On Sun, Jul 28, 2013 at 7:59 AM, MauMau <maumau307@gmail.com> wrote:
> Hello,
>
> I'm sorry I've been touching several things recently before fixing any of
> them.
>
> I've noticed undesirable disk space increase while performing archive
> recovery with PostgreSQL 9.3.  This happens with 9.2, too.
>
> I just performed archived recovery with the following parameters in
> recovery.conf.  I'm not using replication.
>
> restore_command = 'cp ...'
> recovery_target_timeline = 'latest'
>
> As the archive recovery progresses, the disk space used by $PGDATA/pg_xlog
> increases.  It seems that restored archive WAL files are accumulated there.
> This is considerable amount depending on the number of archived WAL files.
> In my case, the recovery failed because of the shortage of disk space.  This
> did not happen with 9.1
>
> The cause appears to be KeepFileRestoredFromArchive().  This function saves
> restored archive WAL files in pg_xlog/.  I guess this is for cascading
> replication, a new feature added in 9.2.

Yes. Those accumulated WAL files will be removed basically when
restartpoint ends.
But if restartpoint is slow compared to the WAL reply, the problem you described
would happen.

> So, I think it is a bug that the disk space increases if not using cascading
> replication.  Those who migrated from 9.1 and do not use 9.2 features would
> be surprised like me.
>
>
> Do you think this should be fixed?

I think so.

> How should it be fixed?

What about removing the restored archived file as soon as it's replayed
if cascading replication is not enabled (i.e., max_wal_senders = 0 or
hot_standby = off)? This doesn't seem to break the existing behavior
in 9.2.

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

30 July 2013, 22:41:52

From: "Fujii Masao" <masao.fujii@gmail.com>
>> So, I think it is a bug that the disk space increases if not using 
>> cascading
>> replication.  Those who migrated from 9.1 and do not use 9.2 features 
>> would
>> be surprised like me.
>>
>>
>> Do you think this should be fixed?
>
> I think so.

Thanks for your agreement.  I'll try to submit the patch as soon as possible 
so that the fix will be included in the next minor release and 9.3.  If it 
seems unexpectedly difficult, I'd like to consult you.


>> How should it be fixed?
>
> What about removing the restored archived file as soon as it's replayed
> if cascading replication is not enabled (i.e., max_wal_senders = 0 or
> hot_standby = off)? This doesn't seem to break the existing behavior
> in 9.2.

I'll consider this condition, but I wonder if 9.1 users are setting 
max_wal_senders>0 and hot_standby=on even when just performing archive 
recovery (e.g. to use the same recovery.conf for all cases).  What do you 
think?  Is there any other good condition to judge if cascading replication 
is used?

Regards
MauMau

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

31 July 2013, 13:42:49

Hello, Fujii san, all,

From: "Fujii Masao" <masao.fujii@gmail.com>
> On Sun, Jul 28, 2013 at 7:59 AM, MauMau <maumau307@gmail.com> wrote:
> Do you think this should be fixed?
>
> I think so.
>
>> How should it be fixed?
>
> What about removing the restored archived file as soon as it's replayed
> if cascading replication is not enabled (i.e., max_wal_senders = 0 or
> hot_standby = off)? This doesn't seem to break the existing behavior
> in 9.2.

Please find attached the patch to fix the problem.  I changed to keep
restored WAL files in pg_xlog/ only on a cascading standby server.  Could
you review and commit this?

BTW, KeepFileRestoredFromArchive() is also called to keep timeline history
files in pg_xlog/.  What is this for?  Is this necessary for recovery other
than cascading standbys?  This seems to accumulate timeline history files
forever in pg_xlog/.

Regards
MauMau

Attachment

skip_wal_save.patch

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

31 July 2013, 14:31:43

On Wed, Jul 31, 2013 at 10:43 PM, MauMau <maumau307@gmail.com> wrote:
> Hello, Fujii san, all,
>
> From: "Fujii Masao" <masao.fujii@gmail.com>
>>
>> On Sun, Jul 28, 2013 at 7:59 AM, MauMau <maumau307@gmail.com> wrote:
>> Do you think this should be fixed?
>>
>> I think so.
>>
>>> How should it be fixed?
>>
>>
>> What about removing the restored archived file as soon as it's replayed
>> if cascading replication is not enabled (i.e., max_wal_senders = 0 or
>> hot_standby = off)? This doesn't seem to break the existing behavior
>> in 9.2.
>
>
> Please find attached the patch to fix the problem.  I changed to keep
> restored WAL files in pg_xlog/ only on a cascading standby server.  Could
> you review and commit this?

-    if (source == XLOG_FROM_ARCHIVE)
+    if (source == XLOG_FROM_ARCHIVE &&
+        StandbyModeRequested && AllowCascadeReplication())

I think that the condition of StandbyModeRequested should be removed
because someone might want to set up the cascade standby from the standby
of warm-standby configuration.

> BTW, KeepFileRestoredFromArchive() is also called to keep timeline history
> files in pg_xlog/.  What is this for?  Is this necessary for recovery other
> than cascading standbys?

Yes. Please see 60df192aea0e6458f20301546e11f7673c102101

> This seems to accumulate timeline history files
> forever in pg_xlog/.

Yes, because basically there is no way to delete the timeline history files
from pg_xlog.

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

31 July 2013, 22:36:51

From: "Fujii Masao" <masao.fujii@gmail.com>
> - if (source == XLOG_FROM_ARCHIVE)
> + if (source == XLOG_FROM_ARCHIVE &&
> + StandbyModeRequested && AllowCascadeReplication())
>
> I think that the condition of StandbyModeRequested should be removed
> because someone might want to set up the cascade standby from the standby
> of warm-standby configuration.

Fixed and attached the revised patch.

However, isn't StandbyRequested true (= standby_mode set to on) to enable
warm standby?  I'm afraid people set max_wal_senders>0 and hot_standby=on
even on the primary server to make the contents of postgresql.conf identical
on both the primary and the standby for easier configuration.  If so, normal
archive recovery (PITR, not the standby recovery) would face the original
problem -- unnecessary WAL accumulation in pg_xlog/.  So I'm wonder if
AllowCascadeReplication() is enough.

Please take either this patch or the previous one.

Regards
MauMau

Attachment

skip_wal_save_v2.patch

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

01 August 2013, 03:37:54

On Thu, Aug 1, 2013 at 7:37 AM, MauMau <maumau307@gmail.com> wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>
>> - if (source == XLOG_FROM_ARCHIVE)
>> + if (source == XLOG_FROM_ARCHIVE &&
>> + StandbyModeRequested && AllowCascadeReplication())
>>
>> I think that the condition of StandbyModeRequested should be removed
>> because someone might want to set up the cascade standby from the standby
>> of warm-standby configuration.
>
>
> Fixed and attached the revised patch.
>
> However, isn't StandbyRequested true (= standby_mode set to on) to enable
> warm standby?

We can set up warm-standby by using pg_standby even if standby_mode = off.

>  I'm afraid people set max_wal_senders>0 and hot_standby=on
> even on the primary server to make the contents of postgresql.conf identical
> on both the primary and the standby for easier configuration.  If so, normal
> archive recovery (PITR, not the standby recovery) would face the original
> problem -- unnecessary WAL accumulation in pg_xlog/.  So I'm wonder if
> AllowCascadeReplication() is enough.

One idea is to add new GUC parameter like enable_cascade_replication
and then prevent WAL file from being kept in pg_xlog if that parameter is off.
But we cannot apply such change into the already-released version 9.2.

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

01 August 2013, 15:23:12

From: "Fujii Masao" <masao.fujii@gmail.com>
>> However, isn't StandbyRequested true (= standby_mode set to on) to enable
>> warm standby?
>
> We can set up warm-standby by using pg_standby even if standby_mode = off.

I see.  However, I understand that pg_standby is a legacy feature, and the 
current way to setup a warm standby is to set standby=on and restore_command 
in recovery.conf.  So I think it might not be necessary to support cascading 
replication with pg_standby, if supporting it would prevent better 
implementation of new features.


>>  I'm afraid people set max_wal_senders>0 and hot_standby=on
>> even on the primary server to make the contents of postgresql.conf 
>> identical
>> on both the primary and the standby for easier configuration.  If so, 
>> normal
>> archive recovery (PITR, not the standby recovery) would face the original
>> problem -- unnecessary WAL accumulation in pg_xlog/.  So I'm wonder if
>> AllowCascadeReplication() is enough.
>
> One idea is to add new GUC parameter like enable_cascade_replication
> and then prevent WAL file from being kept in pg_xlog if that parameter is 
> off.
> But we cannot apply such change into the already-released version 9.2.

Yes, I thought the same, too.  Adding a new GUC to enable cascading 
replication now would be a usability degradation.  But I have no idea of any 
better way.  It seems we need to take either v1 or v2 of the patch I sent. 
If we consider that we don't have to support cascading replication for a 
legacy pg_standby, v1 patch is definitely better because the standby server 
would not keep restored archive WAL in pg_xlog when cascading replication is 
not used.  Otherwise, we have to take v2 patch.

Could you choose either and commit it?

Regards
MauMau

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Michael Paquier

Date:

02 August 2013, 01:52:43

On Fri, Aug 2, 2013 at 12:24 AM, MauMau <maumau307@gmail.com> wrote:

From: "Fujii Masao" <masao.fujii@gmail.com>

However, isn't StandbyRequested true (= standby_mode set to on) to enable
warm standby?

We can set up warm-standby by using pg_standby even if standby_mode = off.

I see. However, I understand that pg_standby is a legacy feature, and the current way to setup a warm standby is to set standby=on and restore_command in recovery.conf. So I think it might not be necessary to support cascading replication with pg_standby, if supporting it would prevent better implementation of new features.

You are right about that, you should stick with the core features as much as you can and limit the use of wrapper utilities. Since 9.1 and the apparition of a built-in standby mode inside Postgres (with the apparition of the GUC parameter hot_standby), it seems better to use that instead of pg_standby, which would likely (personal opinion, feel free to scream at me) be removed in a future release.

I'm afraid people set max_wal_senders>0 and hot_standby=on
even on the primary server to make the contents of postgresql.conf identical
on both the primary and the standby for easier configuration. If so, normal
archive recovery (PITR, not the standby recovery) would face the original
problem -- unnecessary WAL accumulation in pg_xlog/. So I'm wonder if
AllowCascadeReplication() is enough.

One idea is to add new GUC parameter like enable_cascade_replication
and then prevent WAL file from being kept in pg_xlog if that parameter is off.
But we cannot apply such change into the already-released version 9.2.

Yes, I thought the same, too. Adding a new GUC to enable cascading replication now would be a usability degradation. But I have no idea of any better way. It seems we need to take either v1 or v2 of the patch I sent. If we consider that we don't have to support cascading replication for a legacy pg_standby, v1 patch is definitely better because the standby server would not keep restored archive WAL in pg_xlog when cascading replication is not used. Otherwise, we have to take v2 patch.

By reading this thread, -1 for the addition of a new GUC parameter related to cascading, it looks like an overkill for the possible gain. And +1 for the removal of WAL file once it is replayed in archive recovery if cascading replication is not allowed. However, what about wal_keep_segments? Doesn't the server keep segments even if replication is not allowed? If yes, the immediate WAL file drop after replay needs to take care of that...

--
Michael

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Dimitri Fontaine

Date:

02 August 2013, 09:01:43

Michael Paquier <michael.paquier@gmail.com> writes:
> By reading this thread, -1 for the addition of a new GUC parameter related
> to cascading, it looks like an overkill for the possible gain. And +1 for
> the removal of WAL file once it is replayed in archive recovery if
> cascading replication is not allowed. However, what about

Well, we still don't register standby servers, so I vote against early
removal of files that you have no possible way to know if they are still
useful or not. We need something smarter here. Please.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

07 August 2013, 13:46:36

On Fri, Aug 2, 2013 at 10:52 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
>
> On Fri, Aug 2, 2013 at 12:24 AM, MauMau <maumau307@gmail.com> wrote:
>>
>> From: "Fujii Masao" <masao.fujii@gmail.com>
>>
>>>> However, isn't StandbyRequested true (= standby_mode set to on) to
>>>> enable
>>>> warm standby?
>>>
>>>
>>> We can set up warm-standby by using pg_standby even if standby_mode =
>>> off.
>>
>>
>> I see.  However, I understand that pg_standby is a legacy feature, and the
>> current way to setup a warm standby is to set standby=on and restore_command
>> in recovery.conf.  So I think it might not be necessary to support cascading
>> replication with pg_standby, if supporting it would prevent better
>> implementation of new features.
>
> You are right about that, you should stick with the core features as much as
> you can and limit the use of wrapper utilities. Since 9.1 and the apparition
> of a built-in standby mode inside Postgres (with the apparition of the GUC
> parameter hot_standby), it seems better to use that instead of pg_standby,
> which would likely (personal opinion, feel free to scream at me) be removed
> in a future release.

I'm sure that there are some users who use pg_standby for warm-standby
for some reasons, for example, the document (*1) explains such a configuration,
a user can specify the file-check-interval (by using -s option), and so on.
I agree that using pg_standby + cascade replication would be very rare.
But I'm not confident that *no one* uses pg_standby + cascade replication.

(*1) http://www.postgresql.org/docs/devel/static/log-shipping-alternative.html

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

07 August 2013, 14:03:56

On Fri, Aug 2, 2013 at 12:24 AM, MauMau <maumau307@gmail.com> wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>
>>> However, isn't StandbyRequested true (= standby_mode set to on) to enable
>>> warm standby?
>>
>>
>> We can set up warm-standby by using pg_standby even if standby_mode = off.
>
>
> I see.  However, I understand that pg_standby is a legacy feature, and the
> current way to setup a warm standby is to set standby=on and restore_command
> in recovery.conf.  So I think it might not be necessary to support cascading
> replication with pg_standby, if supporting it would prevent better
> implementation of new features.
>
>
>
>>>  I'm afraid people set max_wal_senders>0 and hot_standby=on
>>> even on the primary server to make the contents of postgresql.conf
>>> identical
>>> on both the primary and the standby for easier configuration.  If so,
>>> normal
>>> archive recovery (PITR, not the standby recovery) would face the original
>>> problem -- unnecessary WAL accumulation in pg_xlog/.  So I'm wonder if
>>> AllowCascadeReplication() is enough.
>>
>>
>> One idea is to add new GUC parameter like enable_cascade_replication
>> and then prevent WAL file from being kept in pg_xlog if that parameter is
>> off.
>> But we cannot apply such change into the already-released version 9.2.
>
>
> Yes, I thought the same, too.  Adding a new GUC to enable cascading
> replication now would be a usability degradation.  But I have no idea of any
> better way.  It seems we need to take either v1 or v2 of the patch I sent.
> If we consider that we don't have to support cascading replication for a
> legacy pg_standby, v1 patch is definitely better because the standby server
> would not keep restored archive WAL in pg_xlog when cascading replication is
> not used.  Otherwise, we have to take v2 patch.
>
> Could you choose either and commit it?

On second thought, probably we cannot remove the restored WAL files early
because they might be required for fast promotion which is new feature in 9.3.
In fast promotion, an end-of-recovery checkpoint is not executed. After the end
of recovery, normal online checkpoint starts. If the server crashes before such
an online checkpoint completes, the server needs to replay again all the WAL
files which it replayed before the promotion. Since this is the crash recovery,
all those WAL files need to exist in pg_xlog directory. So if we remove the
restored WAL file from pg_xlog early, such a crash recovery might fail.

So even if cascade replication is disabled, if standby_mode = on, i.e., fast
promotion can be performed, we cannot remove the restored WAL files early.

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Jeff Janes

Date:

07 August 2013, 16:44:34

On Wed, Aug 7, 2013 at 7:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Aug 2, 2013 at 12:24 AM, MauMau <maumau307@gmail.com> wrote:
>> From: "Fujii Masao" <masao.fujii@gmail.com>
>>
>>>> However, isn't StandbyRequested true (= standby_mode set to on) to enable
>>>> warm standby?
>>>
>>>
>>> We can set up warm-standby by using pg_standby even if standby_mode = off.
>>
>>
>> I see.  However, I understand that pg_standby is a legacy feature, and the
>> current way to setup a warm standby is to set standby=on and restore_command
>> in recovery.conf.  So I think it might not be necessary to support cascading
>> replication with pg_standby, if supporting it would prevent better
>> implementation of new features.
>>
>>
>>
>>>>  I'm afraid people set max_wal_senders>0 and hot_standby=on
>>>> even on the primary server to make the contents of postgresql.conf
>>>> identical
>>>> on both the primary and the standby for easier configuration.  If so,
>>>> normal
>>>> archive recovery (PITR, not the standby recovery) would face the original
>>>> problem -- unnecessary WAL accumulation in pg_xlog/.  So I'm wonder if
>>>> AllowCascadeReplication() is enough.
>>>
>>>
>>> One idea is to add new GUC parameter like enable_cascade_replication
>>> and then prevent WAL file from being kept in pg_xlog if that parameter is
>>> off.
>>> But we cannot apply such change into the already-released version 9.2.
>>
>>
>> Yes, I thought the same, too.  Adding a new GUC to enable cascading
>> replication now would be a usability degradation.  But I have no idea of any
>> better way.  It seems we need to take either v1 or v2 of the patch I sent.
>> If we consider that we don't have to support cascading replication for a
>> legacy pg_standby, v1 patch is definitely better because the standby server
>> would not keep restored archive WAL in pg_xlog when cascading replication is
>> not used.  Otherwise, we have to take v2 patch.
>>
>> Could you choose either and commit it?
>
> On second thought, probably we cannot remove the restored WAL files early
> because they might be required for fast promotion which is new feature in 9.3.
> In fast promotion, an end-of-recovery checkpoint is not executed. After the end
> of recovery, normal online checkpoint starts. If the server crashes before such
> an online checkpoint completes, the server needs to replay again all the WAL
> files which it replayed before the promotion. Since this is the crash recovery,
> all those WAL files need to exist in pg_xlog directory. So if we remove the
> restored WAL file from pg_xlog early, such a crash recovery might fail.
>
> So even if cascade replication is disabled, if standby_mode = on, i.e., fast
> promotion can be performed, we cannot remove the restored WAL files early.

But we should still be able to remove files older than the latest
restart point, right?

Cheers,

Jeff

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

16 December 2013, 12:21:54

Hi, Fujii san,

> On Wed, Aug 7, 2013 at 7:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On second thought, probably we cannot remove the restored WAL files early
>> because they might be required for fast promotion which is new feature in
>> 9.3.
>> In fast promotion, an end-of-recovery checkpoint is not executed. After
>> the end
>> of recovery, normal online checkpoint starts. If the server crashes
>> before such
>> an online checkpoint completes, the server needs to replay again all the
>> WAL
>> files which it replayed before the promotion. Since this is the crash
>> recovery,
>> all those WAL files need to exist in pg_xlog directory. So if we remove
>> the
>> restored WAL file from pg_xlog early, such a crash recovery might fail.
>>
>> So even if cascade replication is disabled, if standby_mode = on, i.e.,
>> fast
>> promotion can be performed, we cannot remove the restored WAL files
>> early.

Following Fujii-san's advice, I've made the attached patch.  This prevents
unnecessary WAL accumulation in pg_xlog/ during archive recovery (not
standby recovery).  To do this, I had to revive some code in
exitArchiveRecovery() from 9.1.

Could you commit this to 9.2 and later?


Regards
MauMau

Attachment

wal_increase_in_pitr.patch

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

19 December 2013, 06:26:34

On Mon, Dec 16, 2013 at 9:22 PM, MauMau <maumau307@gmail.com> wrote:
> Hi, Fujii san,
>
>> On Wed, Aug 7, 2013 at 7:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>> On second thought, probably we cannot remove the restored WAL files early
>>> because they might be required for fast promotion which is new feature in
>>> 9.3.
>>> In fast promotion, an end-of-recovery checkpoint is not executed. After
>>> the end
>>> of recovery, normal online checkpoint starts. If the server crashes
>>> before such
>>> an online checkpoint completes, the server needs to replay again all the
>>> WAL
>>> files which it replayed before the promotion. Since this is the crash
>>> recovery,
>>> all those WAL files need to exist in pg_xlog directory. So if we remove
>>> the
>>> restored WAL file from pg_xlog early, such a crash recovery might fail.
>>>
>>> So even if cascade replication is disabled, if standby_mode = on, i.e.,
>>> fast
>>> promotion can be performed, we cannot remove the restored WAL files
>>> early.
>
>
> Following Fujii-san's advice, I've made the attached patch.

Thanks for the patch!

!     if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)

Even when standby_mode is not enabled, we can use cascade replication and
it needs the accumulated WAL files. So I think that AllowCascadeReplication()
should be added into this condition.

!       snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
!       XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
!
!       if (restoredFromArchive)

Don't we need to check !StandbyModeRequested and !AllowCascadeReplication()
here?

!               /*
!                * If the latest segment is not archival, but there's still a
!                * RECOVERYXLOG laying about, get rid of it.
!                */
!               unlink(recoveryPath);   /* ignore any error */

The similar line exists in the lower side of exitArchiveRecovery(), so ISTM that
you can refactor that.

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

20 December 2013, 12:21:47

From: "Fujii Masao" <masao.fujii@gmail.com>
> !     if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)
>
> Even when standby_mode is not enabled, we can use cascade replication and
> it needs the accumulated WAL files. So I think that
> AllowCascadeReplication()
> should be added into this condition.
>
> !       snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
> !       XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
> !
> !       if (restoredFromArchive)
>
> Don't we need to check !StandbyModeRequested and
> !AllowCascadeReplication()
> here?

Oh, you are correct.  Okay, done.


> !               /*
> !                * If the latest segment is not archival, but there's
> still a
> !                * RECOVERYXLOG laying about, get rid of it.
> !                */
> !               unlink(recoveryPath);   /* ignore any error */
>
> The similar line exists in the lower side of exitArchiveRecovery(), so
> ISTM that
> you can refactor that.

That's already done in the previous patch: deletion of RECOVERYXLOG was
moved into else clause, as in:

-  /*
-   * Since there might be a partial WAL segment named RECOVERYXLOG, get rid
-   * of it.
-   */
-  snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
-  unlink(recoveryPath);  /* ignore any error */
-



Regards
MauMau

Attachment

wal_increase_in_pitr_v2.patch

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

21 January 2014, 17:31:51

On Fri, Dec 20, 2013 at 9:21 PM, MauMau <maumau307@gmail.com> wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>
>> !     if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)
>>
>> Even when standby_mode is not enabled, we can use cascade replication and
>> it needs the accumulated WAL files. So I think that
>> AllowCascadeReplication()
>> should be added into this condition.
>>
>> !       snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
>> !       XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
>> !
>> !       if (restoredFromArchive)
>>
>> Don't we need to check !StandbyModeRequested and
>> !AllowCascadeReplication()
>> here?
>
>
> Oh, you are correct.  Okay, done.

Thanks! The patch looks good to me. Attached is the updated version of
the patch. I added the comments.

Did you test whether this patch works properly in several recovery cases?

Regards,

--
Fujii Masao

Attachment

wal_increase_in_pitr_v3.patch

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Heikki Linnakangas

Date:

21 January 2014, 21:37:58

On 01/21/2014 07:31 PM, Fujii Masao wrote:
> On Fri, Dec 20, 2013 at 9:21 PM, MauMau <maumau307@gmail.com> wrote:
>> From: "Fujii Masao" <masao.fujii@gmail.com>
>>
>>> !     if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)
>>>
>>> Even when standby_mode is not enabled, we can use cascade replication and
>>> it needs the accumulated WAL files. So I think that
>>> AllowCascadeReplication()
>>> should be added into this condition.
>>>
>>> !       snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
>>> !       XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
>>> !
>>> !       if (restoredFromArchive)
>>>
>>> Don't we need to check !StandbyModeRequested and
>>> !AllowCascadeReplication()
>>> here?
>>
>> Oh, you are correct.  Okay, done.
>
> Thanks! The patch looks good to me. Attached is the updated version of
> the patch. I added the comments.

Sorry for reacting so slowly, but I'm not sure I like this patch. It's a 
quite useful property that all the WAL files that are needed for 
recovery are copied into pg_xlog, even when restoring from archive, even 
when not doing cascading replication. It guarantees that you can restart 
the standby, even if the connection to the archive is lost for some 
reason. I intentionally changed the behavior for archive recovery too, 
when it was introduced for cascading replication. Also, I think it's 
good that the behavior does not depend on whether cascading replication 
is enabled - it's a quite subtle difference.

So, IMHO this is not a bug, it's a feature.

To solve the original problem of running out of disk space in archive 
recovery, I wonder if we should perform restartpoints more aggressively. 
We intentionally don't trigger restatpoings by checkpoint_segments, only 
checkpoint_timeout, but I wonder if there should be an option for that. 
MauMau, did you try simply reducing checkpoint_timeout, while doing 
recovery?

- Heikki

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

22 January 2014, 02:06:19

On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/21/2014 07:31 PM, Fujii Masao wrote:
>>
>> On Fri, Dec 20, 2013 at 9:21 PM, MauMau <maumau307@gmail.com> wrote:
>>>
>>> From: "Fujii Masao" <masao.fujii@gmail.com>
>>>
>>>> !     if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)
>>>>
>>>> Even when standby_mode is not enabled, we can use cascade replication
>>>> and
>>>> it needs the accumulated WAL files. So I think that
>>>> AllowCascadeReplication()
>>>> should be added into this condition.
>>>>
>>>> !       snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
>>>> !       XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
>>>> !
>>>> !       if (restoredFromArchive)
>>>>
>>>> Don't we need to check !StandbyModeRequested and
>>>> !AllowCascadeReplication()
>>>> here?
>>>
>>>
>>> Oh, you are correct.  Okay, done.
>>
>>
>> Thanks! The patch looks good to me. Attached is the updated version of
>> the patch. I added the comments.
>
>
> Sorry for reacting so slowly, but I'm not sure I like this patch. It's a
> quite useful property that all the WAL files that are needed for recovery
> are copied into pg_xlog, even when restoring from archive, even when not
> doing cascading replication. It guarantees that you can restart the standby,
> even if the connection to the archive is lost for some reason. I
> intentionally changed the behavior for archive recovery too, when it was
> introduced for cascading replication. Also, I think it's good that the
> behavior does not depend on whether cascading replication is enabled - it's
> a quite subtle difference.
>
> So, IMHO this is not a bug, it's a feature.

Yep.

> To solve the original problem of running out of disk space in archive
> recovery, I wonder if we should perform restartpoints more aggressively. We
> intentionally don't trigger restatpoings by checkpoint_segments, only
> checkpoint_timeout, but I wonder if there should be an option for that.

That's an option.

> MauMau, did you try simply reducing checkpoint_timeout, while doing
> recovery?

The problem is, we might not be able to perform restartpoints more aggressively
even if we reduce checkpoint_timeout in the server under the archive recovery.
Because the frequency of occurrence of restartpoints depends on not only that
checkpoint_timeout but also the checkpoints which happened while the server
was running.

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

24 January 2014, 13:31:27

From: "Fujii Masao" <masao.fujii@gmail.com>
> On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas
>>> Thanks! The patch looks good to me. Attached is the updated version of
>>> the patch. I added the comments.

Thank you very much.  Your comment looks great.  I tested some recovery 
situations, and confirmed that WAL segments were kept/removed as intended. 
I'll update the CommitFest entry with this patch.

> <hlinnakangas@vmware.com> wrote:
>> Sorry for reacting so slowly, but I'm not sure I like this patch. It's a
>> quite useful property that all the WAL files that are needed for recovery
>> are copied into pg_xlog, even when restoring from archive, even when not
>> doing cascading replication. It guarantees that you can restart the 
>> standby,
>> even if the connection to the archive is lost for some reason. I
>> intentionally changed the behavior for archive recovery too, when it was
>> introduced for cascading replication. Also, I think it's good that the
>> behavior does not depend on whether cascading replication is enabled - 
>> it's
>> a quite subtle difference.
>>
>> So, IMHO this is not a bug, it's a feature.
>
> Yep.

I understood the benefit for the standby recovery.

>> To solve the original problem of running out of disk space in archive
>> recovery, I wonder if we should perform restartpoints more aggressively. 
>> We
>> intentionally don't trigger restatpoings by checkpoint_segments, only
>> checkpoint_timeout, but I wonder if there should be an option for that.
>
> That's an option.
>
>> MauMau, did you try simply reducing checkpoint_timeout, while doing
>> recovery?
>
> The problem is, we might not be able to perform restartpoints more 
> aggressively
> even if we reduce checkpoint_timeout in the server under the archive 
> recovery.
> Because the frequency of occurrence of restartpoints depends on not only 
> that
> checkpoint_timeout but also the checkpoints which happened while the 
> server
> was running.

I haven't tried reducing checkpoint_timeout.  I think we cannot take that 
approach, because we cannot suggest appropriate checkpoint_timeout to users. 
That is, what checkpoint_timeout setting can we suggest so that WAL doesn't 
accumulate in pg_xlog/ more than 9.1?

In addition, as Fujii-san said, it doesn't seem we can restartpoint 
completely.  Plus, if we can cause restartpoints frequently, the recovery 
would take (much?) longer, because shared buffers are flushed more 
frequently.

So, how about just removing AllowCascadeReplication() condition from the 
patch?  That allows WAL to accumulate in pg_xlog/ during standby recovery 
but not during archive recovery.

Regards
MauMau

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Andres Freund

Date:

01 February 2014, 20:44:22

On 2014-01-21 23:37:43 +0200, Heikki Linnakangas wrote:
> On 01/21/2014 07:31 PM, Fujii Masao wrote:
> >On Fri, Dec 20, 2013 at 9:21 PM, MauMau <maumau307@gmail.com> wrote:
> >>From: "Fujii Masao" <masao.fujii@gmail.com>
> >>
> >>>!     if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)
> >>>
> >>>Even when standby_mode is not enabled, we can use cascade replication and
> >>>it needs the accumulated WAL files. So I think that
> >>>AllowCascadeReplication()
> >>>should be added into this condition.
> >>>
> >>>!       snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
> >>>!       XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
> >>>!
> >>>!       if (restoredFromArchive)
> >>>
> >>>Don't we need to check !StandbyModeRequested and
> >>>!AllowCascadeReplication()
> >>>here?
> >>
> >>Oh, you are correct.  Okay, done.
> >
> >Thanks! The patch looks good to me. Attached is the updated version of
> >the patch. I added the comments.
> 
> Sorry for reacting so slowly, but I'm not sure I like this patch. It's a
> quite useful property that all the WAL files that are needed for recovery
> are copied into pg_xlog, even when restoring from archive, even when not
> doing cascading replication. It guarantees that you can restart the standby,
> even if the connection to the archive is lost for some reason. I
> intentionally changed the behavior for archive recovery too, when it was
> introduced for cascading replication. Also, I think it's good that the
> behavior does not depend on whether cascading replication is enabled - it's
> a quite subtle difference.
> 
> So, IMHO this is not a bug, it's a feature.

Very much seconded. With the old behaviour it's possible to get into the
situation that you cannot get your standby productive anymore if the
archive server died. Since the archive server is often the primary
that's imo unacceptable.

I'd even go as far as saying the previous behaviour is a bug.

> To solve the original problem of running out of disk space in archive
> recovery, I wonder if we should perform restartpoints more aggressively. We
> intentionally don't trigger restatpoings by checkpoint_segments, only
> checkpoint_timeout, but I wonder if there should be an option for that.
> MauMau, did you try simply reducing checkpoint_timeout, while doing
> recovery?

Hm, don't we actually do cause trigger restartpoints based on checkpoint
segments?

static int
XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,            XLogRecPtr targetRecPtr,
char*readBuf, TimeLineID *readTLI)
 
{
...
   if (readFile >= 0 && !XLByteInSeg(targetPagePtr, readSegNo))   {       /*        * Request a restartpoint if we've
replayedtoo much xlog since the        * last one.        */       if (StandbyModeRequested && bgwriterLaunched)
{          if (XLogCheckpointNeeded(readSegNo))           {               (void) GetRedoRecPtr();               if
(XLogCheckpointNeeded(readSegNo))                  RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);           }       }
 
...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Andres Freund

Date:

01 February 2014, 20:49:31

On 2014-01-24 22:31:17 +0900, MauMau wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
> >On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas
> >>>Thanks! The patch looks good to me. Attached is the updated version of
> >>>the patch. I added the comments.
> 
> Thank you very much.  Your comment looks great.  I tested some recovery
> situations, and confirmed that WAL segments were kept/removed as intended.
> I'll update the CommitFest entry with this patch.

You haven't updated the patches status so far, so I've now marked as
returned feedback as several people voiced serious doubts about the
approach. If that's not accurate please speak up.

> >The problem is, we might not be able to perform restartpoints more
> >aggressively
> >even if we reduce checkpoint_timeout in the server under the archive
> >recovery.
> >Because the frequency of occurrence of restartpoints depends on not only
> >that
> >checkpoint_timeout but also the checkpoints which happened while the
> >server
> >was running.
> 
> I haven't tried reducing checkpoint_timeout.

Did you try reducing checkpoint_segments? As I pointed out, at least if
standby_mode is enabled, it will also trigger checkpoints, independently
from checkpoint_timeout.

If the issue is that you're not using standby_mode (if so, why?), then
the fix maybe is to make that apply to a wider range of situations.

> I think we cannot take that
> approach, because we cannot suggest appropriate checkpoint_timeout to users.
> That is, what checkpoint_timeout setting can we suggest so that WAL doesn't
> accumulate in pg_xlog/ more than 9.1?

Well, <= 9.1's behaviour can lead to loss of a consistent database, so I
don't think it's what we need to compare the current state to.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Fujii Masao

Date:

02 February 2014, 14:50:45

On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-01-24 22:31:17 +0900, MauMau wrote:
>> From: "Fujii Masao" <masao.fujii@gmail.com>
>> >On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas
>> >>>Thanks! The patch looks good to me. Attached is the updated version of
>> >>>the patch. I added the comments.
>>
>> Thank you very much.  Your comment looks great.  I tested some recovery
>> situations, and confirmed that WAL segments were kept/removed as intended.
>> I'll update the CommitFest entry with this patch.
>
> You haven't updated the patches status so far, so I've now marked as
> returned feedback as several people voiced serious doubts about the
> approach. If that's not accurate please speak up.
>
>> >The problem is, we might not be able to perform restartpoints more
>> >aggressively
>> >even if we reduce checkpoint_timeout in the server under the archive
>> >recovery.
>> >Because the frequency of occurrence of restartpoints depends on not only
>> >that
>> >checkpoint_timeout but also the checkpoints which happened while the
>> >server
>> >was running.
>>
>> I haven't tried reducing checkpoint_timeout.
>
> Did you try reducing checkpoint_segments? As I pointed out, at least if
> standby_mode is enabled, it will also trigger checkpoints, independently
> from checkpoint_timeout.

Right. If standby_mode is enabled, checkpoint_segment can trigger
the restartpoint. But the problem is that the timing of restartpoint
depends on not only the checkpoint parameters (i.e.,
checkpoint_timeout and checkpoint_segments) that are used during
archive recovery but also the checkpoint WAL that was generated
by the master.

For example, could you imagine the case where the master generated
only one checkpoint WAL since the last backup and it crashed with
database corruption. Then DBA decided to perform normal archive
recovery by using the last backup. In this case, even if DBA reduces
both checkpoint_timeout and checkpoint_segments, only one
restartpoint can occur during recovery. This low frequency of
restartpoint might fill up the disk space with lots of WAL files.

This would be harmless if the server that we are performing recovery
in has enough disk space. But I can imagine that some users want to
recover the database and restart the service temporarily in poor server
with less enough disk space until they can purchase sufficient server.
In this case, accumulating lots of WAL files in pg_xlog might be harmful.

> If the issue is that you're not using standby_mode (if so, why?), then
> the fix maybe is to make that apply to a wider range of situations.

I guess that he is not using standby_mode because, according to
his first email in this thread, he said he would like to prevent WAL
from accumulating in pg_xlog during normal archive recovery (i.e., PITR).

Regards,

-- 
Fujii Masao

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Andres Freund

Date:

02 February 2014, 15:18:17

On 2014-02-02 23:50:40 +0900, Fujii Masao wrote:
> On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-01-24 22:31:17 +0900, MauMau wrote:
> >> I haven't tried reducing checkpoint_timeout.
> >
> > Did you try reducing checkpoint_segments? As I pointed out, at least if
> > standby_mode is enabled, it will also trigger checkpoints, independently
> > from checkpoint_timeout.
>
> Right. If standby_mode is enabled, checkpoint_segment can trigger
> the restartpoint. But the problem is that the timing of restartpoint
> depends on not only the checkpoint parameters (i.e.,
> checkpoint_timeout and checkpoint_segments) that are used during
> archive recovery but also the checkpoint WAL that was generated
> by the master.

Sure. But we really *need* all the WAL since the last checkpoint's redo
location locally to be safe.

> For example, could you imagine the case where the master generated
> only one checkpoint WAL since the last backup and it crashed with
> database corruption. Then DBA decided to perform normal archive
> recovery by using the last backup. In this case, even if DBA reduces
> both checkpoint_timeout and checkpoint_segments, only one
> restartpoint can occur during recovery. This low frequency of
> restartpoint might fill up the disk space with lots of WAL files.

I am not sure I understand the point of this scenario. If the primary
crashed after a checkpoint, there won't be that much WAL since it
happened...

> > If the issue is that you're not using standby_mode (if so, why?), then
> > the fix maybe is to make that apply to a wider range of situations.
> 
> I guess that he is not using standby_mode because, according to
> his first email in this thread, he said he would like to prevent WAL
> from accumulating in pg_xlog during normal archive recovery (i.e., PITR).

Well, that doesn't necessarily prevent you from using
standby_mode... But yes, that might be the case.

I wonder if we shouldn't just always look at checkpoint segments during
!crash recovery.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

"MauMau"

Date:

12 February 2014, 12:24:23

From: "Andres Freund" <andres@2ndquadrant.com>
> On 2014-02-02 23:50:40 +0900, Fujii Masao wrote:
>> Right. If standby_mode is enabled, checkpoint_segment can trigger
>> the restartpoint. But the problem is that the timing of restartpoint
>> depends on not only the checkpoint parameters (i.e.,
>> checkpoint_timeout and checkpoint_segments) that are used during
>> archive recovery but also the checkpoint WAL that was generated
>> by the master.
>
> Sure. But we really *need* all the WAL since the last checkpoint's redo
> location locally to be safe.
>
>> For example, could you imagine the case where the master generated
>> only one checkpoint WAL since the last backup and it crashed with
>> database corruption. Then DBA decided to perform normal archive
>> recovery by using the last backup. In this case, even if DBA reduces
>> both checkpoint_timeout and checkpoint_segments, only one
>> restartpoint can occur during recovery. This low frequency of
>> restartpoint might fill up the disk space with lots of WAL files.
>
> I am not sure I understand the point of this scenario. If the primary
> crashed after a checkpoint, there won't be that much WAL since it
> happened...
>
>> > If the issue is that you're not using standby_mode (if so, why?), then
>> > the fix maybe is to make that apply to a wider range of situations.
>>
>> I guess that he is not using standby_mode because, according to
>> his first email in this thread, he said he would like to prevent WAL
>> from accumulating in pg_xlog during normal archive recovery (i.e., PITR).
>
> Well, that doesn't necessarily prevent you from using
> standby_mode... But yes, that might be the case.
>
> I wonder if we shouldn't just always look at checkpoint segments during
> !crash recovery.

Maybe we could consider in that direction, but there is a problem.  Archive
recovery slows down compared to 9.1, because of repeated restartpoints.
Archive recovery should be as fast as possible, because it typically applies
dozens or hundreds of WAL files, and the DBA desires immediate resumption of
operation.

So, I think we should restore 9.1 behavior for archive recovery.  The
attached patch keeps restored archived WAL in pg_xlog/ only during standby
recovery.  It is based on Fujii-san's revison of the patch, with
AllowCascadeReplication() condition removed from two if statements.

Regards
MauMau

Attachment

wal_increase_in_pitr_v4.patch

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Andres Freund

Date:

12 February 2014, 12:33:37

On 2014-02-12 21:23:54 +0900, MauMau wrote:
> Maybe we could consider in that direction, but there is a problem.  Archive
> recovery slows down compared to 9.1, because of repeated restartpoints.
> Archive recovery should be as fast as possible, because it typically applies
> dozens or hundreds of WAL files, and the DBA desires immediate resumption of
> operation.
> 
> So, I think we should restore 9.1 behavior for archive recovery.

It's easy to be fast, if it's not correct. I don't see how that
behaviour is acceptable, imo the previous behaviour simply was a bug
whose fix was too invasive to backpatch.

I don't think you can convince me (but maybe others) that the old
behaviour is acceptable.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [9.3 bug] disk space in pg_xlog increases during archive recovery

From

Andres Freund

Date:

15 February 2014, 13:46:25

On 2014-02-12 13:33:31 +0100, Andres Freund wrote:
> On 2014-02-12 21:23:54 +0900, MauMau wrote:
> > Maybe we could consider in that direction, but there is a problem.  Archive
> > recovery slows down compared to 9.1, because of repeated restartpoints.
> > Archive recovery should be as fast as possible, because it typically applies
> > dozens or hundreds of WAL files, and the DBA desires immediate resumption of
> > operation.
> > 
> > So, I think we should restore 9.1 behavior for archive recovery.
> 
> It's easy to be fast, if it's not correct. I don't see how that
> behaviour is acceptable, imo the previous behaviour simply was a bug
> whose fix was too invasive to backpatch.
> 
> I don't think you can convince me (but maybe others) that the old
> behaviour is acceptable.

I'm going to mark this patch as "Rejected". Please speak up if that's
not accurate.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services