Thread: [9.3 bug] disk space in pg_xlog increases during archive recovery
Hello,

I'm sorry I've been touching several things recently before fixing any of them.

I've noticed an undesirable disk space increase while performing archive recovery with PostgreSQL 9.3. This happens with 9.2, too.

I just performed archive recovery with the following parameters in recovery.conf. I'm not using replication.

restore_command = 'cp ...'
recovery_target_timeline = 'latest'

As the archive recovery progresses, the disk space used by $PGDATA/pg_xlog increases. It seems that the restored archive WAL files accumulate there. The amount can be considerable, depending on the number of archived WAL files. In my case, the recovery failed because of a shortage of disk space. This did not happen with 9.1.

The cause appears to be KeepFileRestoredFromArchive(). This function saves restored archive WAL files in pg_xlog/. I guess this is for cascading replication, a new feature added in 9.2.

So, I think it is a bug that the disk space increases when cascading replication is not being used. Those who migrated from 9.1 and do not use 9.2 features would be surprised, like me.

Do you think this should be fixed? How should it be fixed? If possible, could you fix it in the next minor release? If you are all busy, I'll try to fix it myself, but please give me advice on how to do that.

Regards
MauMau
On Sun, Jul 28, 2013 at 7:59 AM, MauMau <maumau307@gmail.com> wrote:
> Hello,
>
> I'm sorry I've been touching several things recently before fixing any of them.
>
> I've noticed an undesirable disk space increase while performing archive recovery with PostgreSQL 9.3. This happens with 9.2, too.
>
> I just performed archive recovery with the following parameters in recovery.conf. I'm not using replication.
>
> restore_command = 'cp ...'
> recovery_target_timeline = 'latest'
>
> As the archive recovery progresses, the disk space used by $PGDATA/pg_xlog increases. It seems that the restored archive WAL files accumulate there. The amount can be considerable, depending on the number of archived WAL files. In my case, the recovery failed because of a shortage of disk space. This did not happen with 9.1.
>
> The cause appears to be KeepFileRestoredFromArchive(). This function saves restored archive WAL files in pg_xlog/. I guess this is for cascading replication, a new feature added in 9.2.

Yes. Those accumulated WAL files are basically removed when a restartpoint ends. But if restartpoints are slow compared to the WAL replay, the problem you described would happen.

> So, I think it is a bug that the disk space increases when cascading replication is not being used. Those who migrated from 9.1 and do not use 9.2 features would be surprised, like me.
>
> Do you think this should be fixed?

I think so.

> How should it be fixed?

What about removing the restored archived file as soon as it's replayed if cascading replication is not enabled (i.e., max_wal_senders = 0 or hot_standby = off)? This doesn't seem to break the existing behavior in 9.2.

Regards,

--
Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com>
>> So, I think it is a bug that the disk space increases when cascading replication is not being used. Those who migrated from 9.1 and do not use 9.2 features would be surprised, like me.
>>
>> Do you think this should be fixed?
>
> I think so.

Thanks for your agreement. I'll try to submit a patch as soon as possible so that the fix can be included in the next minor release and in 9.3. If it turns out to be unexpectedly difficult, I'd like to consult you.

>> How should it be fixed?
>
> What about removing the restored archived file as soon as it's replayed if cascading replication is not enabled (i.e., max_wal_senders = 0 or hot_standby = off)? This doesn't seem to break the existing behavior in 9.2.

I'll consider this condition, but I wonder whether 9.1 users might be setting max_wal_senders > 0 and hot_standby = on even when just performing archive recovery (e.g. to use the same recovery.conf for all cases). What do you think? Is there any other good condition for judging whether cascading replication is being used?

Regards
MauMau
Hello, Fujii-san, all,

From: "Fujii Masao" <masao.fujii@gmail.com>
> On Sun, Jul 28, 2013 at 7:59 AM, MauMau <maumau307@gmail.com> wrote:
>> Do you think this should be fixed?
>
> I think so.
>
>> How should it be fixed?
>
> What about removing the restored archived file as soon as it's replayed if cascading replication is not enabled (i.e., max_wal_senders = 0 or hot_standby = off)? This doesn't seem to break the existing behavior in 9.2.

Please find attached the patch to fix the problem. I changed it to keep restored WAL files in pg_xlog/ only on a cascading standby server. Could you review and commit this?

BTW, KeepFileRestoredFromArchive() is also called to keep timeline history files in pg_xlog/. What is this for? Is it necessary for recovery other than on cascading standbys? This seems to accumulate timeline history files in pg_xlog/ forever.

Regards
MauMau
Attachment
On Wed, Jul 31, 2013 at 10:43 PM, MauMau <maumau307@gmail.com> wrote:
> Hello, Fujii-san, all,
>
> From: "Fujii Masao" <masao.fujii@gmail.com>
>> What about removing the restored archived file as soon as it's replayed if cascading replication is not enabled (i.e., max_wal_senders = 0 or hot_standby = off)? This doesn't seem to break the existing behavior in 9.2.
>
> Please find attached the patch to fix the problem. I changed it to keep restored WAL files in pg_xlog/ only on a cascading standby server. Could you review and commit this?

-    if (source == XLOG_FROM_ARCHIVE)
+    if (source == XLOG_FROM_ARCHIVE &&
+        StandbyModeRequested && AllowCascadeReplication())

I think that the StandbyModeRequested condition should be removed, because someone might want to set up a cascading standby from the standby of a warm-standby configuration.

> BTW, KeepFileRestoredFromArchive() is also called to keep timeline history files in pg_xlog/. What is this for? Is it necessary for recovery other than on cascading standbys?

Yes. Please see commit 60df192aea0e6458f20301546e11f7673c102101.

> This seems to accumulate timeline history files in pg_xlog/ forever.

Yes, because basically there is no way to delete the timeline history files from pg_xlog.

Regards,

--
Fujii Masao
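P.S. For anyone following along without the source handy: AllowCascadeReplication() is, if I remember correctly, defined in src/include/access/xlog.h as

    /*
     * Cascading replication is allowed only when hot standby is enabled
     * and at least one WAL sender is configured.
     */
    #define AllowCascadeReplication() (EnableHotStandby && max_wal_senders > 0)

which is the same "max_wal_senders = 0 or hot_standby = off" test mentioned upthread, just inverted. Please double-check against the tree before relying on this.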
From: "Fujii Masao" <masao.fujii@gmail.com>
> -    if (source == XLOG_FROM_ARCHIVE)
> +    if (source == XLOG_FROM_ARCHIVE &&
> +        StandbyModeRequested && AllowCascadeReplication())
>
> I think that the StandbyModeRequested condition should be removed, because someone might want to set up a cascading standby from the standby of a warm-standby configuration.

Fixed and attached the revised patch.

However, isn't StandbyModeRequested true (= standby_mode set to on) when warm standby is enabled? I'm afraid people set max_wal_senders > 0 and hot_standby = on even on the primary server, to make the contents of postgresql.conf identical on both the primary and the standby for easier configuration. If so, normal archive recovery (PITR, not standby recovery) would face the original problem -- unnecessary WAL accumulation in pg_xlog/. So I wonder whether AllowCascadeReplication() alone is enough.

Please take either this patch or the previous one.

Regards
MauMau
Attachment
On Thu, Aug 1, 2013 at 7:37 AM, MauMau <maumau307@gmail.com> wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>> I think that the StandbyModeRequested condition should be removed, because someone might want to set up a cascading standby from the standby of a warm-standby configuration.
>
> Fixed and attached the revised patch.
>
> However, isn't StandbyModeRequested true (= standby_mode set to on) when warm standby is enabled?

We can set up warm standby by using pg_standby even if standby_mode = off.

> I'm afraid people set max_wal_senders > 0 and hot_standby = on even on the primary server, to make the contents of postgresql.conf identical on both the primary and the standby for easier configuration. If so, normal archive recovery (PITR, not standby recovery) would face the original problem -- unnecessary WAL accumulation in pg_xlog/. So I wonder whether AllowCascadeReplication() alone is enough.

One idea is to add a new GUC parameter like enable_cascade_replication and then prevent WAL files from being kept in pg_xlog if that parameter is off. But we cannot apply such a change to the already-released 9.2.

Regards,

--
Fujii Masao
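P.S. Just to make the idea concrete, such a parameter would amount to an ordinary boolean GUC plus a check at the point where we decide to keep the restored segment. A rough sketch only -- this parameter was never actually added, and the parameter and variable names here are made up:

    /* guc.c: hypothetical entry in ConfigureNamesBool[] */
    {
        {"enable_cascade_replication", PGC_POSTMASTER, REPLICATION_SENDING,
            gettext_noop("Allows cascading replication from this server."),
            NULL
        },
        &EnableCascadeReplication,
        false,
        NULL, NULL, NULL
    },

    /* xlog.c: keep restored segments in pg_xlog only if they may be resent */
    if (source == XLOG_FROM_ARCHIVE && EnableCascadeReplication)
        KeepFileRestoredFromArchive(path, xlogfname);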
From: "Fujii Masao" <masao.fujii@gmail.com>
>> However, isn't StandbyModeRequested true (= standby_mode set to on) when warm standby is enabled?
>
> We can set up warm standby by using pg_standby even if standby_mode = off.

I see. However, I understand that pg_standby is a legacy feature, and the current way to set up a warm standby is to set standby_mode = on and restore_command in recovery.conf. So I think it might not be necessary to support cascading replication with pg_standby, if supporting it would prevent a better implementation of new features.

>> I'm afraid people set max_wal_senders > 0 and hot_standby = on even on the primary server, to make the contents of postgresql.conf identical on both the primary and the standby for easier configuration. If so, normal archive recovery (PITR, not standby recovery) would face the original problem -- unnecessary WAL accumulation in pg_xlog/. So I wonder whether AllowCascadeReplication() alone is enough.
>
> One idea is to add a new GUC parameter like enable_cascade_replication and then prevent WAL files from being kept in pg_xlog if that parameter is off. But we cannot apply such a change to the already-released 9.2.

Yes, I thought the same. Adding a new GUC to enable cascading replication now would be a usability regression, but I have no idea of a better way. It seems we need to take either v1 or v2 of the patch I sent. If we decide we don't have to support cascading replication with the legacy pg_standby, the v1 patch is definitely better, because the standby server would not keep restored archive WAL in pg_xlog when cascading replication is not used. Otherwise, we have to take the v2 patch.

Could you choose either and commit it?

Regards
MauMau
On Fri, Aug 2, 2013 at 12:24 AM, MauMau <maumau307@gmail.com> wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>>> However, isn't StandbyModeRequested true (= standby_mode set to on) when warm standby is enabled?
>>
>> We can set up warm standby by using pg_standby even if standby_mode = off.
>
> I see. However, I understand that pg_standby is a legacy feature, and the current way to set up a warm standby is to set standby_mode = on and restore_command in recovery.conf. So I think it might not be necessary to support cascading replication with pg_standby, if supporting it would prevent a better implementation of new features.

You are right about that; you should stick with the core features as much as you can and limit the use of wrapper utilities. Since 9.1 and the appearance of a built-in standby mode inside Postgres (with the appearance of the GUC parameter hot_standby), it seems better to use that instead of pg_standby, which would likely (personal opinion, feel free to scream at me) be removed in a future release.

>>> I'm afraid people set max_wal_senders > 0 and hot_standby = on even on the primary server, to make the contents of postgresql.conf identical on both the primary and the standby for easier configuration. If so, normal archive recovery (PITR, not standby recovery) would face the original problem -- unnecessary WAL accumulation in pg_xlog/. So I wonder whether AllowCascadeReplication() alone is enough.
>>
>> One idea is to add a new GUC parameter like enable_cascade_replication and then prevent WAL files from being kept in pg_xlog if that parameter is off. But we cannot apply such a change to the already-released 9.2.
>
> Yes, I thought the same. Adding a new GUC to enable cascading replication now would be a usability regression, but I have no idea of a better way. It seems we need to take either v1 or v2 of the patch I sent. If we decide we don't have to support cascading replication with the legacy pg_standby, the v1 patch is definitely better, because the standby server would not keep restored archive WAL in pg_xlog when cascading replication is not used. Otherwise, we have to take the v2 patch.

By reading this thread, -1 for the addition of a new GUC parameter related to cascading; it looks like overkill for the possible gain. And +1 for the removal of a WAL file once it is replayed in archive recovery when cascading replication is not allowed. However, what about wal_keep_segments? Doesn't the server keep segments even if replication is not allowed? If yes, the immediate WAL file drop after replay needs to take care of that...
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> By reading this thread, -1 for the addition of a new GUC parameter related to cascading; it looks like overkill for the possible gain. And +1 for the removal of a WAL file once it is replayed in archive recovery when cascading replication is not allowed. However, what about

Well, we still don't register standby servers, so I vote against early removal of files when you have no possible way to know whether they are still useful or not. We need something smarter here. Please.

Regards,
--
Dimitri Fontaine                     http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
On Fri, Aug 2, 2013 at 10:52 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Aug 2, 2013 at 12:24 AM, MauMau <maumau307@gmail.com> wrote:
>> From: "Fujii Masao" <masao.fujii@gmail.com>
>>>> However, isn't StandbyModeRequested true (= standby_mode set to on) when warm standby is enabled?
>>>
>>> We can set up warm standby by using pg_standby even if standby_mode = off.
>>
>> I see. However, I understand that pg_standby is a legacy feature, and the current way to set up a warm standby is to set standby_mode = on and restore_command in recovery.conf. So I think it might not be necessary to support cascading replication with pg_standby, if supporting it would prevent a better implementation of new features.
>
> You are right about that; you should stick with the core features as much as you can and limit the use of wrapper utilities. Since 9.1 and the appearance of a built-in standby mode inside Postgres (with the appearance of the GUC parameter hot_standby), it seems better to use that instead of pg_standby, which would likely (personal opinion, feel free to scream at me) be removed in a future release.

I'm sure that there are some users who use pg_standby for warm standby for various reasons; for example, the documentation (*1) explains such a configuration, a user can specify the file-check interval (using the -s option), and so on. I agree that using pg_standby plus cascading replication would be very rare. But I'm not confident that *no one* uses pg_standby plus cascading replication.

(*1) http://www.postgresql.org/docs/devel/static/log-shipping-alternative.html

Regards,

--
Fujii Masao
On Fri, Aug 2, 2013 at 12:24 AM, MauMau <maumau307@gmail.com> wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>>> However, isn't StandbyModeRequested true (= standby_mode set to on) when warm standby is enabled?
>>
>> We can set up warm standby by using pg_standby even if standby_mode = off.
>
> I see. However, I understand that pg_standby is a legacy feature, and the current way to set up a warm standby is to set standby_mode = on and restore_command in recovery.conf. So I think it might not be necessary to support cascading replication with pg_standby, if supporting it would prevent a better implementation of new features.
>
>>> I'm afraid people set max_wal_senders > 0 and hot_standby = on even on the primary server, to make the contents of postgresql.conf identical on both the primary and the standby for easier configuration. If so, normal archive recovery (PITR, not standby recovery) would face the original problem -- unnecessary WAL accumulation in pg_xlog/. So I wonder whether AllowCascadeReplication() alone is enough.
>>
>> One idea is to add a new GUC parameter like enable_cascade_replication and then prevent WAL files from being kept in pg_xlog if that parameter is off. But we cannot apply such a change to the already-released 9.2.
>
> Yes, I thought the same. Adding a new GUC to enable cascading replication now would be a usability regression, but I have no idea of a better way. It seems we need to take either v1 or v2 of the patch I sent. If we decide we don't have to support cascading replication with the legacy pg_standby, the v1 patch is definitely better, because the standby server would not keep restored archive WAL in pg_xlog when cascading replication is not used. Otherwise, we have to take the v2 patch.
>
> Could you choose either and commit it?

On second thought, we probably cannot remove the restored WAL files early, because they might be required for fast promotion, which is a new feature in 9.3. In fast promotion, an end-of-recovery checkpoint is not executed. After the end of recovery, a normal online checkpoint starts. If the server crashes before such an online checkpoint completes, the server needs to replay again all the WAL files which it replayed before the promotion. Since this is crash recovery, all those WAL files need to exist in the pg_xlog directory. So if we remove a restored WAL file from pg_xlog early, such a crash recovery might fail.

So even if cascading replication is disabled, if standby_mode = on, i.e., fast promotion can be performed, we cannot remove the restored WAL files early.

Regards,

--
Fujii Masao
On Wed, Aug 7, 2013 at 7:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On second thought, we probably cannot remove the restored WAL files early, because they might be required for fast promotion, which is a new feature in 9.3. In fast promotion, an end-of-recovery checkpoint is not executed. After the end of recovery, a normal online checkpoint starts. If the server crashes before such an online checkpoint completes, the server needs to replay again all the WAL files which it replayed before the promotion. Since this is crash recovery, all those WAL files need to exist in the pg_xlog directory. So if we remove a restored WAL file from pg_xlog early, such a crash recovery might fail.
>
> So even if cascading replication is disabled, if standby_mode = on, i.e., fast promotion can be performed, we cannot remove the restored WAL files early.

But we should still be able to remove files older than the latest restartpoint, right?

Cheers,

Jeff
Hi, Fujii-san,

From: "Fujii Masao" <masao.fujii@gmail.com>
> On second thought, we probably cannot remove the restored WAL files early, because they might be required for fast promotion, which is a new feature in 9.3. In fast promotion, an end-of-recovery checkpoint is not executed. After the end of recovery, a normal online checkpoint starts. If the server crashes before such an online checkpoint completes, the server needs to replay again all the WAL files which it replayed before the promotion. Since this is crash recovery, all those WAL files need to exist in the pg_xlog directory. So if we remove a restored WAL file from pg_xlog early, such a crash recovery might fail.
>
> So even if cascading replication is disabled, if standby_mode = on, i.e., fast promotion can be performed, we cannot remove the restored WAL files early.

Following Fujii-san's advice, I've made the attached patch. It prevents unnecessary WAL accumulation in pg_xlog/ during archive recovery (not standby recovery). To do this, I had to revive some code in exitArchiveRecovery() from 9.1. Could you commit this to 9.2 and later?

Regards
MauMau
Attachment
On Mon, Dec 16, 2013 at 9:22 PM, MauMau <maumau307@gmail.com> wrote:
> Following Fujii-san's advice, I've made the attached patch. It prevents unnecessary WAL accumulation in pg_xlog/ during archive recovery (not standby recovery). To do this, I had to revive some code in exitArchiveRecovery() from 9.1. Could you commit this to 9.2 and later?

Thanks for the patch!

!    if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)

Even when standby_mode is not enabled, we can use cascading replication, and it needs the accumulated WAL files. So I think that AllowCascadeReplication() should be added to this condition.

!    snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
!    XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
!
!    if (restoredFromArchive)

Don't we need to check !StandbyModeRequested and !AllowCascadeReplication() here?

!    /*
!     * If the latest segment is not archival, but there's still a
!     * RECOVERYXLOG laying about, get rid of it.
!     */
!    unlink(recoveryPath);    /* ignore any error */

A similar line exists later in exitArchiveRecovery(), so ISTM you could refactor that.

Regards,

--
Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com>
> !    if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)
>
> Even when standby_mode is not enabled, we can use cascading replication, and it needs the accumulated WAL files. So I think that AllowCascadeReplication() should be added to this condition.
>
> !    snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
> !    XLogFilePath(xlogpath, ThisTimeLineID, endLogSegNo);
> !
> !    if (restoredFromArchive)
>
> Don't we need to check !StandbyModeRequested and !AllowCascadeReplication() here?

Oh, you are correct. Okay, done.

> !    /*
> !     * If the latest segment is not archival, but there's still a
> !     * RECOVERYXLOG laying about, get rid of it.
> !     */
> !    unlink(recoveryPath);    /* ignore any error */
>
> A similar line exists later in exitArchiveRecovery(), so ISTM you could refactor that.

That's already done in the previous patch: the deletion of RECOVERYXLOG was moved into the else clause, as in:

-    /*
-     * Since there might be a partial WAL segment named RECOVERYXLOG, get rid
-     * of it.
-     */
-    snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
-    unlink(recoveryPath);    /* ignore any error */
-

Regards
MauMau
Attachment
On Fri, Dec 20, 2013 at 9:21 PM, MauMau <maumau307@gmail.com> wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>> Don't we need to check !StandbyModeRequested and !AllowCascadeReplication() here?
>
> Oh, you are correct. Okay, done.

Thanks! The patch looks good to me. Attached is an updated version of the patch; I added some comments. Did you test whether this patch works properly in several recovery cases?

Regards,

--
Fujii Masao
Attachment
On 01/21/2014 07:31 PM, Fujii Masao wrote:
> On Fri, Dec 20, 2013 at 9:21 PM, MauMau <maumau307@gmail.com> wrote:
>> Oh, you are correct. Okay, done.
>
> Thanks! The patch looks good to me. Attached is an updated version of the patch; I added some comments.

Sorry for reacting so slowly, but I'm not sure I like this patch. It's a quite useful property that all the WAL files needed for recovery are copied into pg_xlog, even when restoring from the archive, and even when not doing cascading replication. It guarantees that you can restart the standby even if the connection to the archive is lost for some reason. I intentionally changed the behavior for archive recovery too, when this was introduced for cascading replication. Also, I think it's good that the behavior does not depend on whether cascading replication is enabled - it's a quite subtle difference.

So, IMHO this is not a bug, it's a feature.

To solve the original problem of running out of disk space in archive recovery, I wonder if we should perform restartpoints more aggressively. We intentionally don't trigger restartpoints by checkpoint_segments, only by checkpoint_timeout, but I wonder if there should be an option for that. MauMau, did you try simply reducing checkpoint_timeout while doing recovery?

- Heikki
On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Sorry for reacting so slowly, but I'm not sure I like this patch. It's a quite useful property that all the WAL files needed for recovery are copied into pg_xlog, even when restoring from the archive, and even when not doing cascading replication. It guarantees that you can restart the standby even if the connection to the archive is lost for some reason. I intentionally changed the behavior for archive recovery too, when this was introduced for cascading replication. Also, I think it's good that the behavior does not depend on whether cascading replication is enabled - it's a quite subtle difference.
>
> So, IMHO this is not a bug, it's a feature.

Yep.

> To solve the original problem of running out of disk space in archive recovery, I wonder if we should perform restartpoints more aggressively. We intentionally don't trigger restartpoints by checkpoint_segments, only by checkpoint_timeout, but I wonder if there should be an option for that.

That's an option.

> MauMau, did you try simply reducing checkpoint_timeout while doing recovery?

The problem is, we might not be able to perform restartpoints more aggressively even if we reduce checkpoint_timeout on the server performing the archive recovery, because the frequency of restartpoints depends not only on checkpoint_timeout but also on the checkpoints which happened while the server was running.

Regards,

--
Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com>
> On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> Thanks! The patch looks good to me. Attached is an updated version of the patch; I added some comments.

Thank you very much. Your comments look great. I tested some recovery situations and confirmed that WAL segments were kept/removed as intended. I'll update the CommitFest entry with this patch.

>> So, IMHO this is not a bug, it's a feature.
>
> Yep.

I understood the benefit for standby recovery.

>> To solve the original problem of running out of disk space in archive recovery, I wonder if we should perform restartpoints more aggressively. We intentionally don't trigger restartpoints by checkpoint_segments, only by checkpoint_timeout, but I wonder if there should be an option for that.
>
> That's an option.
>
>> MauMau, did you try simply reducing checkpoint_timeout while doing recovery?
>
> The problem is, we might not be able to perform restartpoints more aggressively even if we reduce checkpoint_timeout on the server performing the archive recovery, because the frequency of restartpoints depends not only on checkpoint_timeout but also on the checkpoints which happened while the server was running.

I haven't tried reducing checkpoint_timeout. I think we cannot take that approach, because we cannot suggest an appropriate checkpoint_timeout to users. That is, what checkpoint_timeout setting could we suggest so that WAL doesn't accumulate in pg_xlog/ any more than in 9.1? In addition, as Fujii-san said, it doesn't seem we can control restartpoints completely. Plus, if we cause restartpoints frequently, the recovery would take (much?) longer, because shared buffers are flushed more frequently.

So, how about just removing the AllowCascadeReplication() condition from the patch? That allows WAL to accumulate in pg_xlog/ during standby recovery but not during archive recovery.

Regards
MauMau
On 2014-01-21 23:37:43 +0200, Heikki Linnakangas wrote:
> Sorry for reacting so slowly, but I'm not sure I like this patch. It's a quite useful property that all the WAL files needed for recovery are copied into pg_xlog, even when restoring from the archive, and even when not doing cascading replication. It guarantees that you can restart the standby even if the connection to the archive is lost for some reason. I intentionally changed the behavior for archive recovery too, when this was introduced for cascading replication. Also, I think it's good that the behavior does not depend on whether cascading replication is enabled - it's a quite subtle difference.
>
> So, IMHO this is not a bug, it's a feature.

Very much seconded. With the old behaviour it's possible to get into a situation where you cannot get your standby productive anymore if the archive server died. Since the archive server is often the primary, that's imo unacceptable. I'd even go as far as saying the previous behaviour was a bug.

> To solve the original problem of running out of disk space in archive recovery, I wonder if we should perform restartpoints more aggressively. We intentionally don't trigger restartpoints by checkpoint_segments, only by checkpoint_timeout, but I wonder if there should be an option for that. MauMau, did you try simply reducing checkpoint_timeout while doing recovery?

Hm, don't we actually trigger restartpoints based on checkpoint_segments?

static int
XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
             XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
{
...
    if (readFile >= 0 && !XLByteInSeg(targetPagePtr, readSegNo))
    {
        /*
         * Request a restartpoint if we've replayed too much xlog since the
         * last one.
         */
        if (StandbyModeRequested && bgwriterLaunched)
        {
            if (XLogCheckpointNeeded(readSegNo))
            {
                (void) GetRedoRecPtr();
                if (XLogCheckpointNeeded(readSegNo))
                    RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
            }
        }
...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
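P.S. For reference, XLogCheckpointNeeded() just checks whether roughly checkpoint_segments worth of WAL has been consumed since the last checkpoint's redo pointer. From memory (please verify against xlog.c), it is approximately:

static bool
XLogCheckpointNeeded(XLogSegNo new_segno)
{
    XLogSegNo    old_segno;

    /* segment containing the last checkpoint's redo location */
    XLByteToSeg(ControlFile->checkPointCopy.redo, old_segno);

    /* true once we've consumed ~checkpoint_segments segments since then */
    if (new_segno >= old_segno + (uint64) (CheckPointSegments - 1))
        return true;
    return false;
}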
On 2014-01-24 22:31:17 +0900, MauMau wrote:
> From: "Fujii Masao" <masao.fujii@gmail.com>
>>> Thanks! The patch looks good to me. Attached is an updated version of the patch; I added some comments.
>
> Thank you very much. Your comments look great. I tested some recovery situations and confirmed that WAL segments were kept/removed as intended. I'll update the CommitFest entry with this patch.

You haven't updated the patch's status so far, so I've now marked it as Returned with Feedback, as several people voiced serious doubts about the approach. If that's not accurate, please speak up.

>> The problem is, we might not be able to perform restartpoints more aggressively even if we reduce checkpoint_timeout on the server performing the archive recovery, because the frequency of restartpoints depends not only on checkpoint_timeout but also on the checkpoints which happened while the server was running.
>
> I haven't tried reducing checkpoint_timeout.

Did you try reducing checkpoint_segments? As I pointed out, at least if standby_mode is enabled, it will also trigger checkpoints, independently from checkpoint_timeout. If the issue is that you're not using standby_mode (if so, why?), then the fix maybe is to make that apply to a wider range of situations.

> I think we cannot take that approach, because we cannot suggest an appropriate checkpoint_timeout to users. That is, what checkpoint_timeout setting could we suggest so that WAL doesn't accumulate in pg_xlog/ any more than in 9.1?

Well, <= 9.1's behaviour can lead to the loss of a consistent database, so I don't think that's what the current state should be compared to.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-01-24 22:31:17 +0900, MauMau wrote:
>> I haven't tried reducing checkpoint_timeout.
>
> Did you try reducing checkpoint_segments? As I pointed out, at least if standby_mode is enabled, it will also trigger checkpoints, independently from checkpoint_timeout.

Right. If standby_mode is enabled, checkpoint_segments can trigger a restartpoint. But the problem is that the timing of restartpoints depends not only on the checkpoint parameters (i.e., checkpoint_timeout and checkpoint_segments) used during archive recovery, but also on the checkpoint WAL that was generated by the master.

For example, imagine the case where the master generated only one checkpoint WAL record since the last backup and then crashed with database corruption, and the DBA decided to perform normal archive recovery using the last backup. In this case, even if the DBA reduces both checkpoint_timeout and checkpoint_segments, only one restartpoint can occur during recovery. This low frequency of restartpoints might fill up the disk space with lots of WAL files.

This would be harmless if the server we are performing recovery on has enough disk space. But I can imagine that some users want to recover the database and restart the service temporarily on a poor server without enough disk space, until they can purchase a sufficient one. In that case, accumulating lots of WAL files in pg_xlog might be harmful.

> If the issue is that you're not using standby_mode (if so, why?), then the fix maybe is to make that apply to a wider range of situations.

I guess that he is not using standby_mode because, according to his first email in this thread, he would like to prevent WAL from accumulating in pg_xlog during normal archive recovery (i.e., PITR).

Regards,

--
Fujii Masao
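P.S. To spell out why the master's checkpoints set a hard limit here: during replay, a restartpoint can only be established where the master wrote a checkpoint record. From memory (the details are in xlog_redo() in xlog.c, so please verify; the two checkpoint record types are actually handled in separate branches), the logic is roughly:

    /* in xlog_redo(), while replaying a checkpoint record */
    if (info == XLOG_CHECKPOINT_SHUTDOWN || info == XLOG_CHECKPOINT_ONLINE)
    {
        CheckPoint  checkPoint;

        memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));

        /* remember this record as a candidate restartpoint location */
        RecoveryRestartPoint(&checkPoint);
    }

So no matter how small checkpoint_timeout or checkpoint_segments is on the recovering server, if the master generated only one checkpoint since the backup, there is only one place a restartpoint can happen.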
On 2014-02-02 23:50:40 +0900, Fujii Masao wrote:
> On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Did you try reducing checkpoint_segments? As I pointed out, at least if standby_mode is enabled, it will also trigger checkpoints, independently from checkpoint_timeout.
>
> Right. If standby_mode is enabled, checkpoint_segments can trigger a restartpoint. But the problem is that the timing of restartpoints depends not only on the checkpoint parameters (i.e., checkpoint_timeout and checkpoint_segments) used during archive recovery, but also on the checkpoint WAL that was generated by the master.

Sure. But we really *need* all the WAL since the last checkpoint's redo location locally to be safe.

> For example, imagine the case where the master generated only one checkpoint WAL record since the last backup and then crashed with database corruption, and the DBA decided to perform normal archive recovery using the last backup. In this case, even if the DBA reduces both checkpoint_timeout and checkpoint_segments, only one restartpoint can occur during recovery. This low frequency of restartpoints might fill up the disk space with lots of WAL files.

I am not sure I understand the point of this scenario. If the primary crashed after a checkpoint, there won't be that much WAL since it happened...

>> If the issue is that you're not using standby_mode (if so, why?), then the fix maybe is to make that apply to a wider range of situations.
>
> I guess that he is not using standby_mode because, according to his first email in this thread, he would like to prevent WAL from accumulating in pg_xlog during normal archive recovery (i.e., PITR).

Well, that doesn't necessarily prevent you from using standby_mode... But yes, that might be the case.

I wonder if we shouldn't just always look at checkpoint_segments during !crash recovery.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: "Andres Freund" <andres@2ndquadrant.com>
> On 2014-02-02 23:50:40 +0900, Fujii Masao wrote:
>> Right. If standby_mode is enabled, checkpoint_segments can trigger a restartpoint. But the problem is that the timing of restartpoints depends not only on the checkpoint parameters (i.e., checkpoint_timeout and checkpoint_segments) used during archive recovery, but also on the checkpoint WAL that was generated by the master.
>
> Sure. But we really *need* all the WAL since the last checkpoint's redo location locally to be safe.
>
> I wonder if we shouldn't just always look at checkpoint_segments during !crash recovery.

Maybe we could consider that direction, but there is a problem: archive recovery slows down compared to 9.1 because of the repeated restartpoints. Archive recovery should be as fast as possible, because it typically applies dozens or hundreds of WAL files, and the DBA desires immediate resumption of operation.

So, I think we should restore the 9.1 behavior for archive recovery. The attached patch keeps restored archive WAL in pg_xlog/ only during standby recovery. It is based on Fujii-san's revision of the patch, with the AllowCascadeReplication() condition removed from the two if statements.

Regards
MauMau
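P.S. Concretely, the keep-in-pg_xlog condition in this version reduces to something like the following (a sketch only -- see the attachment for the real thing; the call site and arguments are abbreviated from memory):

    /* keep the restored segment in pg_xlog only for standby recovery */
    if (source == XLOG_FROM_ARCHIVE && StandbyModeRequested)
        KeepFileRestoredFromArchive(path, xlogfname);

That way, plain archive recovery (PITR) gets the 9.1 behavior back, while standby recovery keeps the files for restart safety and possible cascading.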
Attachment
On 2014-02-12 21:23:54 +0900, MauMau wrote:
> Maybe we could consider that direction, but there is a problem: archive recovery slows down compared to 9.1 because of the repeated restartpoints. Archive recovery should be as fast as possible, because it typically applies dozens or hundreds of WAL files, and the DBA desires immediate resumption of operation.
>
> So, I think we should restore the 9.1 behavior for archive recovery.

It's easy to be fast if it's not correct. I don't see how that behaviour is acceptable; imo the previous behaviour simply was a bug whose fix was too invasive to backpatch.

I don't think you can convince me (but maybe others) that the old behaviour is acceptable.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-02-12 13:33:31 +0100, Andres Freund wrote:
> On 2014-02-12 21:23:54 +0900, MauMau wrote:
>> Maybe we could consider that direction, but there is a problem: archive recovery slows down compared to 9.1 because of the repeated restartpoints. Archive recovery should be as fast as possible, because it typically applies dozens or hundreds of WAL files, and the DBA desires immediate resumption of operation.
>>
>> So, I think we should restore the 9.1 behavior for archive recovery.
>
> It's easy to be fast if it's not correct. I don't see how that behaviour is acceptable; imo the previous behaviour simply was a bug whose fix was too invasive to backpatch.
>
> I don't think you can convince me (but maybe others) that the old behaviour is acceptable.

I'm going to mark this patch as "Rejected". Please speak up if that's not accurate.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services