Thread: [HACKERS] Remove secondary checkpoint
Remove the code that maintained two checkpoints' WAL files and all associated stuff. Try to avoid breaking anything else.

This
* reduces disk space requirements on master
* removes a minor bug in fast failover
* simplifies code

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Simon Riggs <simon@2ndquadrant.com> writes:
> Remove the code that maintained two checkpoints' WAL files and all
> associated stuff.
> Try to avoid breaking anything else
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

Doesn't it also make crash recovery less robust?  The whole point of that mechanism is to be able to cope if the latest checkpoint record is unreadable.  If you want to toss that overboard, I think you need to make the case why we don't need it, not just post a patch removing it.  *Of course* the code is simpler without it.  That's utterly irrelevant.  The code would be even simpler with no crash recovery at all ... but we're not going there.

			regards, tom lane
On 2017-10-24 09:50:12 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Remove the code that maintained two checkpoints' WAL files and all
> > associated stuff.
> >
> > Try to avoid breaking anything else
> >
> > This
> > * reduces disk space requirements on master
> > * removes a minor bug in fast failover
> > * simplifies code
>
> Doesn't it also make crash recovery less robust?

I think it does the contrary. The current mechanism is, in my opinion, downright dangerous:
https://www.postgresql.org/message-id/20160201235854.GO8743@awork2.anarazel.de

Greetings,

Andres Freund
On 24 October 2017 at 09:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.  If you want to toss that overboard, I think
> you need to make the case why we don't need it, not just post a
> patch removing it.  *Of course* the code is simpler without it.
> That's utterly irrelevant.  The code would be even simpler with
> no crash recovery at all ... but we're not going there.

Well, the mechanism has already been partially removed, since we don't maintain two checkpoints on a standby. So all I'm proposing is that we remove the other half.

I've never seen, nor can I find an example online of, a case where reading the latest checkpoint failed yet the secondary checkpoint did not also fail from the same cause.

If that situation does arise, now that we have pg_waldump we can easily search for a different checkpoint and start from there instead, which is a more flexible approach.

If you didn't save your WAL and don't have any other form of backup, relying on the secondary checkpoint is not exactly a safe bet.

--
Simon Riggs
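[Editor's sketch of the pg_waldump approach Simon mentions. The data directory path and segment file name are illustrative, not from the thread; the `--rmgr=XLOG` filter restricts output to XLOG resource-manager records, which is where checkpoint records live.]

```shell
# List checkpoint records in a given WAL segment, so a manual recovery
# could be started from an earlier checkpoint's REDO location.
# Segment name and path are hypothetical examples.
pg_waldump --rmgr=XLOG "$PGDATA/pg_wal/000000010000000000000003" \
    | grep -i checkpoint
```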
On Mon, Oct 23, 2017 at 11:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Remove the code that maintained two checkpoints' WAL files and all
> associated stuff.
>
> Try to avoid breaking anything else
>
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

In short, yeah! I had a close look at the patch and noticed a couple of issues.

+       else
                /*
-                * The last valid checkpoint record required for a streaming
-                * recovery exists in neither standby nor the primary.
+                * We used to attempt to go back to a secondary checkpoint
+                * record here, but only when not in standby_mode. We now
+                * just fail if we can't read the last checkpoint because
+                * this allows us to simplify processing around checkpoints.
                 */
                ereport(PANIC,
                        (errmsg("could not locate a valid checkpoint record")));
-       }
Using brackets in this case is more elegant style IMO. OK, there is one line, but the comment is long so the block becomes confusingly-shaped.

 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION     1003
+#define PG_CONTROL_VERSION     1100
Yes, this had to happen anyway in this release cycle.

backup.sgml describes the following:
    to higher segment numbers.  It's assumed that segment files whose
    contents precede the checkpoint-before-last are no longer of
    interest and can be recycled.
However this is not true anymore with this patch.

The documentation of pg_control_checkpoint() is not updated. func.sgml lists the tuple fields returned for this function.

You forgot to update pg_control_checkpoint in pg_proc.h. prior_lsn is still listed in the list of arguments returned. And you need to update as well the output list of types.

  * whichChkpt identifies the checkpoint (merely for reporting purposes).
- * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
+ * 1 for "primary", 0 for "other" (backup_label)
+ * 2 for "secondary" is no longer supported
  */
I would suggest to just remove the reference to "secondary".

- * Delete old log files (those no longer needed even for previous
- * checkpoint or the standbys in XLOG streaming).
+ * Delete old log files and recycle them
  */
Here that's more "Delete or recycle old log files". Recycling of a WAL segment is its renaming into a newer segment.

You forgot to update this formula in xlog.c:
    distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
    /* add 10% for good measure. */
    distance *= 1.10;
You can guess easily where the mistake is.

-                       checkPointLoc = ControlFile->prevCheckPoint;
+                       /*
+                        * It appears to be a bug that we used to use
+                        * prevCheckpoint here
+                        */
+                       checkPointLoc = ControlFile->checkPoint;
Er, no. This is correct because we expect the prior checkpoint to still be present in the event of a failure detecting the latest checkpoint.

--
Michael
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.

If the latest checkpoint record is unreadable (the WAL segment/block/record is corrupt?), recovery from the previous checkpoint would also stop at the latest checkpoint. And we don't need to replay the WAL records between the previous checkpoint and the latest one, because their changes were already persisted when the latest checkpoint was taken. So, should the user just run pg_resetxlog and start the database server when recovery cannot find the latest checkpoint record and PANICs?

Regards
Takayuki Tsunakawa
On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> If the latest checkpoint record is unreadable (the WAL
> segment/block/record is corrupt?), recovery from the previous
> checkpoint would also stop at the latest checkpoint. And we don't need
> to replay the WAL records between the previous checkpoint and the
> latest one, because their changes were already persisted when the
> latest checkpoint was taken. So, should the user just run pg_resetxlog
> and start the database server when recovery cannot find the latest
> checkpoint record and PANICs?

Not necessarily. If a failure is detected when reading the last checkpoint, as you say recovery would begin at the checkpoint prior to that and would stop when reading the record of the last checkpoint. Still, one could use a recovery.conf with restore_command to fetch segments from a different source, like a static archive, as only the local segment may be corrupted.

--
Michael
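[Editor's note: the restore_command approach Michael describes would look roughly like this in the PostgreSQL 10-era recovery.conf (the archive path is a hypothetical example; in later releases these settings moved into postgresql.conf):]

```ini
# recovery.conf -- fetch WAL segments from an archive instead of the
# possibly-corrupt local copies in pg_wal. %f is the segment file name,
# %p is the destination path; both are expanded by the server.
restore_command = 'cp /mnt/wal_archive/%f "%p"'
```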
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
> On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > If the latest checkpoint record is unreadable (the WAL
> > segment/block/record is corrupt?), recovery from the previous checkpoint
> > would also stop at the latest checkpoint.  And we don't need to replay the
> > WAL records between the previous checkpoint and the latest one, because
> > their changes are already persisted when the latest checkpoint was taken.
> > So, the user should just do pg_resetxlog and start the database server when
> > the recovery cannot find the latest checkpoint record and PANICs?
>
> Not necessarily. If a failure is detected when reading the last checkpoint,
> as you say recovery would begin at the checkpoint prior that and would stop
> when reading the record of last checkpoint, still one could use a
> recovery.conf with restore_command to fetch segments from a different
> source, like a static archive, as only the local segment may be corrupted.

Oh, you are right. "If the crash recovery fails, perform recovery from backup." Anyway, I agree that the secondary checkpoint isn't necessary.

Regards
Takayuki Tsunakawa
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Simon Riggs
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

I welcome this patch. I was wondering why PostgreSQL retains the previous checkpoint. (Although unrelated to this, I've also been wondering why PostgreSQL flushes WAL to disk when writing a page in the shared buffer, because PostgreSQL doesn't use WAL for undo.)

Here are my review comments.

(1)
- * a) we keep WAL for two checkpoint cycles, back to the "prev" checkpoint.
+ * a) we keep WAL for only one checkpoint cycle (prior to PG11 we kept
+ *    WAL for two checkpoint cycles to allow us to recover from the
+ *    secondary checkpoint if the first checkpoint failed, though we
+ *    only did this on the master anyway, not on standby. Keeping just
+ *    one checkpoint simplifies processing and reduces disk space in
+ *    many smaller databases.)

  /*
-  * The last valid checkpoint record required for a streaming
-  * recovery exists in neither standby nor the primary.
+  * We used to attempt to go back to a secondary checkpoint
+  * record here, but only when not in standby_mode. We now
+  * just fail if we can't read the last checkpoint because
+  * this allows us to simplify processing around checkpoints.
   */

  if (fast_promote)
  {
-     checkPointLoc = ControlFile->prevCheckPoint;
+     /*
+      * It appears to be a bug that we used to use prevCheckpoint here
+      */
+     checkPointLoc = ControlFile->checkPoint;

-     XLByteToSeg(PriorRedoPtr, _logSegNo, wal_segment_size);
+     /* Trim from the last checkpoint, not the last - 1 */
+     XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);

I suggest removing references to the secondary checkpoint (the old behavior) from the comments. I'm afraid those could confuse new developers.

(2)
@@ -8110,10 +8100,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
    ereport(LOG,
            (errmsg("invalid primary checkpoint link in control file")));
    break;
@@ -8135,10 +8121,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
    ereport(LOG,
            (errmsg("invalid primary checkpoint record")));
    break;
@@ -8154,10 +8136,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
    ereport(LOG,
            (errmsg("invalid resource manager ID in primary checkpoint record")));
    break;
etc, etc. "primary checkpoint" should be just "checkpoint", now that the primary/secondary concept will disappear.

(3)
Should we change the default value of max_wal_size from 1 GB to a smaller size? I vote for "no" for performance.

Regards
Takayuki Tsunakawa
On Tue, Oct 24, 2017 at 7:24 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> (3)
> Should we change the default value of max_wal_size from 1 GB to a
> smaller size?  I vote for "no" for performance.

The default has just changed in v10, so changing it again could be confusing, so I agree with your position.

--
Michael
Tsunakawa, Takayuki wrote:
> (Although unrelated to this, I've also been wondering why PostgreSQL
> flushes WAL to disk when writing a page in the shared buffer, because
> PostgreSQL doesn't use WAL for undo.)

The reason is that if the system crashes after writing the data page to disk, but before writing the WAL, the data page would be inconsistent with data in pages that weren't flushed, since there is no WAL to update those other pages. Also, if the system crashes after partially writing the page (say it writes the first 4kB), then the page is downright corrupted with no way to fix it.

So there has to be a barrier that ensures that the WAL is flushed up to the last position that modified a page (i.e. that page's LSN) before actually writing that page to disk. And this is why we can't use mmap() for shared buffers -- there is no mechanism to force the WAL down if the operating system has the liberty to flush pages whenever it likes.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org]
> Tsunakawa, Takayuki wrote:
>
> > (Although unrelated to this, I've also been wondering why PostgreSQL
> > flushes WAL to disk when writing a page in the shared buffer, because
> > PostgreSQL doesn't use WAL for undo.)
>
> The reason is that if the system crashes after writing the data page to
> disk, but before writing the WAL, the data page would be inconsistent with
> data in pages that weren't flushed, since there is no WAL to update those
> other pages.  Also, if the system crashes after partially writing the page
> (say it writes the first 4kB) then the page is downright corrupted with
> no way to fix it.
>
> So there has to be a barrier that ensures that the WAL is flushed up to
> the last position that modified a page (i.e. that page's LSN) before actually
> writing that page to disk.  And this is why we can't use mmap() for shared
> buffers -- there is no mechanism to force the WAL down if the operating
> system has the liberty to flush pages whenever it likes.

I see. The latter is a torn page problem, which is solved by a full page image WAL record. I understood that an example of the former problem is the inconsistency between a table page and an index page -- if an index page is flushed to disk without flushing the WAL and the corresponding table page, an index entry would point to a wrong table record after recovery. Thanks, my long-standing question has been solved.

Regards
Takayuki Tsunakawa
On Tue, Oct 24, 2017 at 7:25 PM, Andres Freund <andres@anarazel.de> wrote:
> I think it does the contrary. The current mechanism is, in my opinion,
> downright dangerous:
> https://www.postgresql.org/message-id/20160201235854.GO8743@awork2.anarazel.de

A sort of middle way would be to keep the secondary checkpoint around but never try to replay from that point, or only if a specific flag is provided.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Tue, Oct 24, 2017 at 7:25 PM, Andres Freund <andres@anarazel.de> wrote:
> > I think it does the contrary. The current mechanism is, in my opinion,
> > downright dangerous:
> > https://www.postgresql.org/message-id/20160201235854.GO8743@awork2.anarazel.de
>
> A sort of middle way would be to keep the secondary checkpoint around
> but never try to replay from that point, or only if a specific flag is
> provided.

Why do you want to keep the secondary checkpoint? If there is no way to automatically start a recovery from the prior checkpoint, is it really possible to do the same manually? I think the only advantage of keeping it is that the WAL files are kept around for a little bit longer. But is that useful? Surely for any environment where you really care, you have a WAL archive somewhere, so it doesn't matter if files are removed from the primary's pg_xlog dir.

--
Álvaro Herrera
On Mon, Oct 30, 2017 at 7:04 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Robert Haas wrote:
>> A sort of middle way would be to keep the secondary checkpoint around
>> but never try to replay from that point, or only if a specific flag is
>> provided.
>
> Why do you want to keep the secondary checkpoint?  If there is no way to
> automatically start a recovery from the prior checkpoint, is it really
> possible to do the same manually?  I think the only advantage of keeping
> it is that the WAL files are kept around for a little bit longer.  But
> is that useful?  Surely for any environment where you really care, you
> have a WAL archive somewhere, so it doesn't matter if files are removed
> from the primary's pg_xlog dir.

--
Robert Haas
On Mon, Oct 30, 2017 at 7:04 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> A sort of middle way would be to keep the secondary checkpoint around
>> but never try to replay from that point, or only if a specific flag is
>> provided.
>
> Why do you want to keep the secondary checkpoint?  If there is no way to
> automatically start a recovery from the prior checkpoint, is it really
> possible to do the same manually?  I think the only advantage of keeping
> it is that the WAL files are kept around for a little bit longer.  But
> is that useful?  Surely for any environment where you really care, you
> have a WAL archive somewhere, so it doesn't matter if files are removed
> from the primary's pg_xlog dir.

(apologies for the empty message)

I don't really want anything in particular here, other than for the system to be reliable. If we're confident that there's zero value in the secondary checkpoint then, sure, ditch it.

Even if you have the older WAL files around in an archive, it doesn't mean that you know where the previous checkpoint start location is ... but actually, come to think of it, if you did need to know that, you could just run pg_waldump to find it. That probably wasn't true when this code was originally written, but it is today.

I was mostly just thinking out loud, listing another option rather than advocating for it.

--
Robert Haas
Robert Haas <robertmhaas@gmail.com> writes:
> I was mostly just thinking out loud, listing another option rather
> than advocating for it.

FWIW, I just wanted the question to be debated and resolved properly.

After rereading the thread Andres pointed to, I thought of a hazard that I think Andres alluded to, but didn't spell out explicitly: if we can't read the primary checkpoint, and then back up to a secondary one and replay as much of WAL as we can read, we may well be left with an inconsistent database. This would happen because some changes that post-date the part of WAL we could read may have been flushed to disk, if the system previously thought that WAL up through the primary checkpoint was all safely down to disk.

Therefore, backing up to the secondary checkpoint is *not* a safe automatic recovery choice, and it's dubious that it's even a useful approach for manual recovery. You're really into restore-from-backup territory at that point.

I'm content now that removing the secondary checkpoint is an OK decision. (This isn't a review of Simon's patch, though.)

			regards, tom lane
On 25 October 2017 at 00:17, Michael Paquier <michael.paquier@gmail.com> wrote:
> In short, yeah! I had a close look at the patch and noticed a couple of issues.

Thanks for the detailed review.

> +       else
>                 /*
> -                * The last valid checkpoint record required for a streaming
> -                * recovery exists in neither standby nor the primary.
> +                * We used to attempt to go back to a secondary checkpoint
> +                * record here, but only when not in standby_mode. We now
> +                * just fail if we can't read the last checkpoint because
> +                * this allows us to simplify processing around checkpoints.
>                  */
>                 ereport(PANIC,
>                         (errmsg("could not locate a valid checkpoint record")));
> -       }
> Using brackets in this case is more elegant style IMO. OK, there is
> one line, but the comment is long so the block becomes
> confusingly-shaped.

OK

> /* Version identifier for this pg_control format */
> -#define PG_CONTROL_VERSION     1003
> +#define PG_CONTROL_VERSION     1100
> Yes, this had to happen anyway in this release cycle.
>
> backup.sgml describes the following:
>     to higher segment numbers.  It's assumed that segment files whose
>     contents precede the checkpoint-before-last are no longer of
>     interest and can be recycled.
> However this is not true anymore with this patch.

Thanks, will fix.

> The documentation of pg_control_checkpoint() is not updated. func.sgml
> lists the tuple fields returned for this function.

Ah, OK.

> You forgot to update pg_control_checkpoint in pg_proc.h. prior_lsn is
> still listed in the list of arguments returned. And you need to update
> as well the output list of types.
>
>  * whichChkpt identifies the checkpoint (merely for reporting purposes).
> - * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
> + * 1 for "primary", 0 for "other" (backup_label)
> + * 2 for "secondary" is no longer supported
>   */
> I would suggest to just remove the reference to "secondary".

Yeh, that was there for review.

> - * Delete old log files (those no longer needed even for previous
> - * checkpoint or the standbys in XLOG streaming).
> + * Delete old log files and recycle them
>   */
> Here that's more "Delete or recycle old log files". Recycling of a WAL
> segment is its renaming into a newer segment.

Sometimes files are deleted if there are too many.

> You forgot to update this formula in xlog.c:
>     distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
>     /* add 10% for good measure. */
>     distance *= 1.10;
> You can guess easily where the mistake is.

Doh - fixed that before posting, so I must have sent the wrong patch.

It's the 10%, right? ;-)

> -                       checkPointLoc = ControlFile->prevCheckPoint;
> +                       /*
> +                        * It appears to be a bug that we used to use
> +                        * prevCheckpoint here
> +                        */
> +                       checkPointLoc = ControlFile->checkPoint;
> Er, no. This is correct because we expect the prior checkpoint to
> still be present in the event of a failure detecting the latest
> checkpoint.

I'm removing "prevCheckPoint", so not sure what you mean.

--
Simon Riggs
On Mon, Oct 30, 2017 at 2:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 October 2017 at 00:17, Michael Paquier <michael.paquier@gmail.com> wrote:
>> - * Delete old log files (those no longer needed even for previous
>> - * checkpoint or the standbys in XLOG streaming).
>> + * Delete old log files and recycle them
>>   */
>> Here that's more "Delete or recycle old log files". Recycling of a WAL
>> segment is its renaming into a newer segment.
>
> Sometimes files are deleted if there are too many.

Sure, but one segment is never deleted and then recycled, which is what your new comment implies as I understand it.

>> - checkPointLoc = ControlFile->prevCheckPoint;
>> + /*
>> +  * It appears to be a bug that we used to use prevCheckpoint here
>> +  */
>> + checkPointLoc = ControlFile->checkPoint;
>> Er, no. This is correct because we expect the prior checkpoint to
>> still be present in the event of a failure detecting the latest
>> checkpoint.
>
> I'm removing "prevCheckPoint", so not sure what you mean.

I mean that there is no actual bug here. And changing the code as you do is correct, but the comment is not.

--
Michael
On 30 October 2017 at 15:31, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Oct 30, 2017 at 2:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Sometimes files are deleted if there are too many.
>
> Sure, but one segment is never deleted and then recycled, which is
> what your new comment implies as I understand it.

I guess it depends how you read it. The function performs both deletion AND recycling, yet any one file is only ever deleted OR recycled.

>> I'm removing "prevCheckPoint", so not sure what you mean.
>
> I mean that there is no actual bug here. And changing the code as you
> do is correct, but the comment is not.

Thanks

--
Simon Riggs
On 2017-10-30 10:10:19 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I was mostly just thinking out loud, listing another option rather
> > than advocating for it.
>
> FWIW, I just wanted the question to be debated and resolved properly.
>
> After rereading the thread Andres pointed to, I thought of a hazard
> that I think Andres alluded to, but didn't spell out explicitly:
> if we can't read the primary checkpoint, and then back up to a
> secondary one and replay as much of WAL as we can read, we may well
> be left with an inconsistent database.

Exactly.

> I'm content now that removing the secondary checkpoint is an OK
> decision.  (This isn't a review of Simon's patch, though.)

I wonder if we shouldn't add a pg_resetxlog option that sets the checkpoint to start from to a certain LSN. For the few cases where there's actual data recovery needed, that's a lot more useful than randomly using checkpoint - 1. And it's an explicit expert-only thing, without costing everyone.

Greetings,

Andres Freund
On 30 October 2017 at 15:22, Simon Riggs <simon@2ndquadrant.com> wrote:
>> You forgot to update this formula in xlog.c:
>>     distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
>>     /* add 10% for good measure. */
>>     distance *= 1.10;
>> You can guess easily where the mistake is.
>
> Doh - fixed that before posting, so I must have sent the wrong patch.
>
> It's the 10%, right? ;-)

So, there are 2 locations that mention 2.0 in xlog.c

I had already fixed one, which is why I remembered editing it.

The other one you mention has a detailed comment above it explaining why it should be 2.0 rather than 1.0, so I left it.

Let me know if you still think it should be removed? I can see the argument both ways.

Or maybe we need another patch to account for manual checkpoints.

--
Simon Riggs
On 30 October 2017 at 18:58, Simon Riggs <simon@2ndquadrant.com> wrote:
> So, there are 2 locations that mention 2.0 in xlog.c
>
> I had already fixed one, which is why I remembered editing it.
>
> The other one you mention has a detailed comment above it explaining
> why it should be 2.0 rather than 1.0, so I left it.
>
> Let me know if you still think it should be removed? I can see the
> argument both ways.
>
> Or maybe we need another patch to account for manual checkpoints.

Revised patch to implement this

--
Simon Riggs
On Tue, Oct 31, 2017 at 2:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 30 October 2017 at 18:58, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 30 October 2017 at 15:22, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>>> You forgot to update this formula in xlog.c:
>>>> distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
>>>> /* add 10% for good measure. */
>>>> distance *= 1.10;
>>>> You can guess easily where the mistake is.
>>>
>>> Doh - fixed that before posting, so I must have sent the wrong patch.
>>>
>>> It's the 10%, right? ;-)
>>
>> So, there are 2 locations that mention 2.0 in xlog.c
>>
>> I had already fixed one, which is why I remembered editing it.
>>
>> The other one you mention has a detailed comment above it explaining
>> why it should be 2.0 rather than 1.0, so I left it.
>>
>> Let me know if you still think it should be removed? I can see the
>> argument both ways.
>> Or maybe we need another patch to account for manual checkpoints.
>
> Revised patch to implement this

Here is the comment and the formula:

     * The reason this calculation is done from the prior checkpoint, not the
     * one that just finished, is that this behaves better if some checkpoint
     * cycles are abnormally short, like if you perform a manual checkpoint
     * right after a timed one.  The manual checkpoint will make almost a full
     * cycle's worth of WAL segments available for recycling, because the
     * segments from the prior's prior, fully-sized checkpoint cycle are no
     * longer needed.  However, the next checkpoint will make only few segments
     * available for recycling, the ones generated between the timed
     * checkpoint and the manual one right after that.  If at the manual
     * checkpoint we only retained enough segments to get us to the next timed
     * one, and removed the rest, then at the next checkpoint we would not
     * have enough segments around for recycling, to get us to the checkpoint
     * after that.  Basing the calculations on the distance from the prior redo
     * pointer largely fixes that problem.
     */
    distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;

While the comment says that a manual checkpoint happening right after a
timed one will cause a full range of WAL segments to be recycled, it is
no longer true that the segments of the prior's prior checkpoint are the
ones that become unneeded, because with your patch the segments of the
prior checkpoint are getting recycled. So it seems to me that, based on
that, the formula ought to use 1.0 instead of 2.0...

--
Michael
On 31 October 2017 at 12:01, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Oct 31, 2017 at 2:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 30 October 2017 at 18:58, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 30 October 2017 at 15:22, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>>>> You forgot to update this formula in xlog.c:
>>>>> distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
>>>>> /* add 10% for good measure. */
>>>>> distance *= 1.10;
>>>>> You can guess easily where the mistake is.
>>>>
>>>> Doh - fixed that before posting, so I must have sent the wrong patch.
>>>>
>>>> It's the 10%, right? ;-)
>>>
>>> So, there are 2 locations that mention 2.0 in xlog.c
>>>
>>> I had already fixed one, which is why I remembered editing it.
>>>
>>> The other one you mention has a detailed comment above it explaining
>>> why it should be 2.0 rather than 1.0, so I left it.
>>>
>>> Let me know if you still think it should be removed? I can see the
>>> argument both ways.
>>> Or maybe we need another patch to account for manual checkpoints.
>>
>> Revised patch to implement this
>
> Here is the comment and the formula:
>
>      * The reason this calculation is done from the prior checkpoint, not the
>      * one that just finished, is that this behaves better if some checkpoint
>      * cycles are abnormally short, like if you perform a manual checkpoint
>      * right after a timed one.  The manual checkpoint will make almost a full
>      * cycle's worth of WAL segments available for recycling, because the
>      * segments from the prior's prior, fully-sized checkpoint cycle are no
>      * longer needed.  However, the next checkpoint will make only few segments
>      * available for recycling, the ones generated between the timed
>      * checkpoint and the manual one right after that.  If at the manual
>      * checkpoint we only retained enough segments to get us to the next timed
>      * one, and removed the rest, then at the next checkpoint we would not
>      * have enough segments around for recycling, to get us to the checkpoint
>      * after that.  Basing the calculations on the distance from the prior redo
>      * pointer largely fixes that problem.
>      */
>     distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
>
> While the comment says that a manual checkpoint happening right after a
> timed one will cause a full range of WAL segments to be recycled, it is
> no longer true that the segments of the prior's prior checkpoint are the
> ones that become unneeded, because with your patch the segments of the
> prior checkpoint are getting recycled. So it seems to me that, based on
> that, the formula ought to use 1.0 instead of 2.0...

I think the argument in the comment is right, in that
CheckPointDistanceEstimate is better if we use multiple checkpoint
cycles. But the implementation of that is bogus, and multiplying by 2.0
wouldn't make it better if CheckPointDistanceEstimate is wrong.

So I will change to 1.0 as you say and rewrite the comment.

Thanks for your review.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 8, 2017 at 2:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 31 October 2017 at 12:01, Michael Paquier <michael.paquier@gmail.com> wrote:
>> While the mention about a manual checkpoint happening after a timed
>> one will cause a full range of WAL segments to be recycled, it is not
>> actually true that segments of the prior's prior checkpoint are not
>> needed, because with your patch the segments of the prior checkpoint
>> are getting recycled. So it seems to me that based on that the formula
>> ought to use 1.0 instead of 2.0...
>
> I think the argument in the comment is right, in that
> CheckPointDistanceEstimate is better if we use multiple checkpoint
> cycles.

Yes, the theory behind it is correct. No argument about that.

> But the implementation of that is bogus and multiplying by 2.0
> wouldn't make it better if CheckPointDistanceEstimate is wrong.

Yes, this is wrong. My apologies if my words looked confusing. From
your message I can see that our thoughts are on the same page.

--
Michael