Thread: "using previous checkpoint record at" maybe not the greatest idea?
Hi,

currently if, when not in standby mode, we can't read a checkpoint
record, we automatically fall back to the previous checkpoint, and start
replay from there.

Doing so without user intervention doesn't actually seem like a good
idea. While not super likely, it's entirely possible that doing so can
wreck a cluster that'd otherwise be easily recoverable. Imagine e.g. a
tablespace being dropped - going back to the previous checkpoint very
well could lead to replay not finishing, as the directory to create
files in doesn't even exist.

As there's, afaics, really no "legitimate" reason for needing to go
back to the previous checkpoint, I don't think we should do so in an
automated fashion.

All the cases where I could find logs containing "using previous
checkpoint record at" were ones where something else had already gone
pretty badly wrong. That obviously doesn't have a very large
significance, because the situations where it "just worked" are unlikely
to be reported...

Am I missing a reason for doing this by default?

Greetings,

Andres Freund
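For readers who want to see where this happens: the behaviour in question
boils down to roughly the following. This is a simplified paraphrase of the
checkpoint-selection logic in StartupXLOG(), not the exact xlog.c source;
error-message arguments and surrounding detail are elided.

    /* Normal (non-standby) crash recovery, heavily simplified. */
    checkPointLoc = ControlFile->checkPoint;
    record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, true);
    if (record == NULL)
    {
        /* Latest checkpoint record unreadable: silently try the prior one. */
        checkPointLoc = ControlFile->prevCheckPoint;
        record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2, true);
        if (record != NULL)
            ereport(LOG,
                    (errmsg("using previous checkpoint record at %X/%X", ...)));
        else
            ereport(PANIC,
                    (errmsg("could not locate a valid checkpoint record")));
    }

The complaint above is about the middle branch: replay then starts from the
redo pointer of that older checkpoint, even though WAL and the filesystem may
no longer be consistent with it.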
Re: "using previous checkpoint record at" maybe not the greatest idea?
From
"David G. Johnston"
Date:
Hi,
currently if, when not in standby mode, we can't read a checkpoint
record, we automatically fall back to the previous checkpoint, and start
replay from there.
Doing so without user intervention doesn't actually seem like a good
idea. While not super likely, it's entirely possible that doing so can
wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
tablespace being dropped - going back to the previous checkpoint very
well could lead to replay not finishing, as the directory to create
files in doesn't even exist.
As there's, afaics, really no "legitimate" reasons for needing to go
back to the previous checkpoint I don't think we should do so in an
automated fashion.
All the cases where I could find logs containing "using previous
checkpoint record at" were when something else had already gone pretty
badly wrong. Now that obviously doesn't have a very large significance,
because in the situations where it "just worked" are unlikely to be
reported...
Am I missing a reason for doing this by default?
Learning by reading here...

http://www.postgresql.org/docs/current/static/wal-internals.html
"""
After a checkpoint has been made and the log flushed, the checkpoint's position is saved in the file pg_control. Therefore, at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs the REDO operation by scanning forward from the log position indicated in the checkpoint record. Because the entire content of data pages is saved in the log on the first page modification after a checkpoint (assuming full_page_writes is not disabled), all pages changed since the checkpoint will be restored to a consistent state.
To deal with the case where pg_control is corrupt, we should support the possibility of scanning existing log segments in reverse order — newest to oldest — in order to find the latest checkpoint. This has not been implemented yet. pg_control is small enough (less than one disk page) that it is not subject to partial-write problems, and as of this writing there have been no reports of database failures due solely to the inability to read pg_control itself. So while it is theoretically a weak spot, pg_control does not seem to be a problem in practice.
"""
The above comment appears out-of-date if this post describes what presently happens.
Also, I was under the impression that tablespace commands resulted in checkpoints so that the state of the file system could be presumed current...
I don't know enough internals but it seems like we'd need to distinguish between an interrupted checkpoint (pull the plug during checkpoint) and one that supposedly completed without interruption but then was somehow corrupted (solar flares). The former seems legitimate for auto-skip while the latter does not.
David J.
On 2016-02-01 17:29:39 -0700, David G. Johnston wrote:
> Learning by reading here...
>
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.
> The above comment appears out-of-date if this post describes what
> presently happens.

Where do you see a conflict with what I wrote about? We store both the
last and the previous checkpoint's location in pg_control. Or are you
talking about:

> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.

if so, no, that's not out-of-date, as we simply store two checkpoint
locations:

$ pg_controldata /srv/dev/pgdev-dev/|grep 'checkpoint location'
Latest checkpoint location: B3/2A730028
Prior checkpoint location: B3/2A72FFA0

> Also, I was under the impression that tablespace commands resulted in
> checkpoints so that the state of the file system could be presumed
> current...

That actually doesn't really make it any better - it forces the *latest*
checkpoint, but if we can't read that, we'll start with the previous
one...

> I don't know enough internals but its seems like we'd need to distinguish
> between an interrupted checkpoint (pull the plug during checkpoint) and one
> that supposedly completed without interruption but then was somehow
> corrupted (solar flares). The former seem legitimate for auto-skip while
> the later do not.

I don't think such a distinction is really possible (or necessary). If
pg_control is corrupted we won't even start, and if WAL is corrupted
that badly we won't finish replay...

Greetings,

Andres Freund
Re: "using previous checkpoint record at" maybe not the greatest idea?
From
"David G. Johnston"
Date:
On 2016-02-01 17:29:39 -0700, David G. Johnston wrote:
> Learning by reading here...
>
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.
> The above comment appears out-of-date if this post describes what
> presently happens.
Where do you see a conflict with what I wrote about? We store both the
last and the previous checkpoint's location in pg_control. Or are you
talking about:
Mainly the following...but the word I used was "out-of-date" and not "conflict". The present state seems to do the above, and then some.
> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.
if so, no, that's not a out-of-date, as we simply store two checkpoint
locations:
$ pg_controldata /srv/dev/pgdev-dev/|grep 'checkpoint location'
Latest checkpoint location: B3/2A730028
Prior checkpoint location: B3/2A72FFA0
The quote implies that only a single checkpoint is noted and that no "searching" is performed - whether by scanning or by being told the position of a previous one so that it can jump there immediately without scanning backwards. It isn't strictly the fact that we do not "scan" backwards but the implications that arise in making that statement. Maybe this is being picky but if you cannot trust the value of "Latest checkpoint location" then pg_control is arguably corrupt. Corruption is not strictly limited to "unable to be read" but does include "contains invalid data".
> Also, I was under the impression that tablespace commands resulted in
> checkpoints so that the state of the file system could be presumed
> current...
That actually doesn't really make it any better - it forces the *latest*
checkpoint, but if we can't read that, we'll start with the previous
one...
> I don't know enough internals but its seems like we'd need to distinguish
> between an interrupted checkpoint (pull the plug during checkpoint) and one
> that supposedly completed without interruption but then was somehow
> corrupted (solar flares). The former seem legitimate for auto-skip while
> the later do not.
I don't think such a distinction is really possible (or necessary). If
pg_control is corrupted we won't even start, and if WAL is corrupted
that badly we won't finish replay...
My takeaway from the above is that we should only record what we think is a usable/readable/valid checkpoint location to "Latest checkpoint location"
(LCL) and if the system is not able to use that information to perform a successful recovery it should be allowed to die without using the value in "Previous checkpoint location" - which becomes effectively ignored during master recovery.
David J.
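Concretely, that proposal amounts to collapsing the fallback branch of the
sketch near the top of this thread into the error path, along these lines (a
rough illustration, not a patch; the hint wording here is made up):

    record = ReadCheckpointRecord(xlogreader, ControlFile->checkPoint, 1, true);
    if (record == NULL)
        ereport(PANIC,
                (errmsg("could not locate a valid checkpoint record"),
                 errhint("If you really want recovery to start from the prior "
                         "checkpoint, point the server at it explicitly rather "
                         "than relying on an automatic fallback.")));

The prior checkpoint location could still be recorded and reported; it just
would no longer be used silently.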
On Mon, Feb 1, 2016 at 6:58 PM, Andres Freund <andres@anarazel.de> wrote:
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.
>
> As there's, afaics, really no "legitimate" reasons for needing to go
> back to the previous checkpoint I don't think we should do so in an
> automated fashion.
>
> All the cases where I could find logs containing "using previous
> checkpoint record at" were when something else had already gone pretty
> badly wrong. Now that obviously doesn't have a very large significance,
> because in the situations where it "just worked" are unlikely to be
> reported...
>
> Am I missing a reason for doing this by default?

I agree: this seems like a terrible idea. Would we still have some way
of forcing the older checkpoint record to be used if somebody wants to
try to do that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-02-03 09:28:24 -0500, Robert Haas wrote:
> Would we still have some way of forcing the older checkpoint record to
> be used if somebody wants to try to do that?

I think currently the best way to force an arbitrary checkpoint to be
used is creating a "custom" backup label. Not that nice. Not sure if we
need something nice here, I don't really see a frequent need for this.

We could add another option to pg_resetxlog alternatively :/

Regards,

Andres
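For completeness, the "custom backup label" trick amounts to hand-writing a
backup_label file in the data directory that points recovery at the checkpoint
you want it to use, roughly like this. The LSNs and segment name below are
placeholders (reusing the prior checkpoint location from the pg_controldata
output quoted earlier), and the exact set of lines the server insists on is
version-dependent, so check read_backup_label() before relying on it:

    START WAL LOCATION: B3/2A72FFA0 (file 00000001000000B30000002A)
    CHECKPOINT LOCATION: B3/2A72FFA0

Recovery then reads the checkpoint record named there instead of the one in
pg_control, which is exactly why this works but is "not that nice" as a user
interface.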
David G. Johnston wrote:
> Learning by reading here...
>
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.
>
> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.
> """
>
> The above comment appears out-of-date if this post describes what
> presently happens.

I think you're misinterpreting Andres, or the docs, or both.

What Andres says is that the control file (pg_control) stores two
checkpoint locations: the latest one, and the one before that. When
recovery occurs, it starts by looking up the latest checkpoint record;
if it cannot find that for whatever reason, it falls back to reading the
previous one. (He further claims that falling back to the previous one
is a bad idea.)

What the 2nd para in the documentation is saying is something different:
it is talking about reading all the pg_xlog files (in reverse order),
which is not pg_control, and see what checkpoint records are there, then
figure out which one to use.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "using previous checkpoint record at" maybe not the greatest idea?
From
"David G. Johnston"
Date:
David G. Johnston wrote:
> Learning by reading here...
>
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.
>
> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.
> """
>
> The above comment appears out-of-date if this post describes what
> presently happens.
I think you're misinterpreting Andres, or the docs, or both.
What Andres says is that the control file (pg_control) stores two
checkpoint locations: the latest one, and the one before that. When
recovery occurs, it starts by looking up the latest checkpoint record;
if it cannot find that for whatever reason, it falls back to reading the
previous one. (He further claims that falling back to the previous one
is a bad idea.)
What the 2nd para in the documentation is saying is something different:
it is talking about reading all the pg_xlog files (in reverse order),
which is not pg_control, and see what checkpoint records are there, then
figure out which one to use.
Yes, I inferred something that obviously isn't true - that the system doesn't go hunting for a valid checkpoint to begin recovery from. While it does not do so in the case of a corrupted pg_control file I further assumed it never did. That would be because the documentation doesn't make the point of stating that two checkpoint positions exist and that PostgreSQL will try the second one if the first one proves unusable. Given the topic of this thread, that omission makes the documentation out-of-date. Maybe it's covered elsewhere but since this section addresses locating a starting point I would expect any such description to be here as well.
David J.
On 2/4/16 5:09 PM, David G. Johnston wrote:
>
> What the 2nd para in the documentation is saying is something different:
> it is talking about reading all the pg_xlog files (in reverse order),
> which is not pg_control, and see what checkpoint records are there, then
> figure out which one to use.
>
>
> Yes, I inferred something that obviously isn't true - that the system
> doesn't go hunting for a valid checkpoint to begin recovery from. While
> it does not do so in the case of a corrupted pg_control file I further
> assumed it never did. That would be because the documentation doesn't
> make the point of stating that two checkpoint positions exist and that
> PostgreSQL will try the second one if the first one proves unusable.
> Given the topic of this thread that omission makes the documentation
> out-of-date. Maybe its covered elsewhere but since this section
> addresses locating a starting point I would expect any such description
> to be here as well.

Yeah, I think we should fix the docs. Especially since I imagine that if
you're reading that part of the docs you're probably having a really bad
day, and bad info won't help you...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2/4/16 3:37 PM, Andres Freund wrote:
> On 2016-02-03 09:28:24 -0500, Robert Haas wrote:
>> Would we still have some way of forcing the older checkpoint record to
>> be used if somebody wants to try to do that?
>
> I think currently the best way to force an arbitrary checkpoint to be
> used is creating a "custom" backup label. Not that nice. Not sure if we
> need something nice here, I don't really see a frequent need for this.
>
> We could add another option to pg_resetxlog alternatively :/

I guess you'd have to scan through WAL files by hand to find the next
oldest checkpoint?

I'm guessing that if this is happening in the field there's a decent
chance people aren't noticing it, so maybe the best thing for now is to
turn off the automatic behavior but still have a relatively easy way to
re-enable it. In case this is more common than we think...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
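If it ever does come to scanning WAL by hand, pg_xlogdump already makes that
reasonably painless. Something along these lines (the segment name is just an
example) lists the checkpoint records in a segment:

    $ pg_xlogdump --rmgr=XLOG 00000001000000B30000002A | grep -i checkpoint

Each CHECKPOINT_ONLINE / CHECKPOINT_SHUTDOWN record it prints includes the
redo pointer, which is what you would need in order to restart replay from
that checkpoint.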
On February 5, 2016 2:52:20 AM GMT+03:00, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:
> On 2/4/16 3:37 PM, Andres Freund wrote:
>> On 2016-02-03 09:28:24 -0500, Robert Haas wrote:
>>> Would we still have some way of forcing the older checkpoint record
>>> to be used if somebody wants to try to do that?
>>
>> I think currently the best way to force an arbitrary checkpoint to be
>> used is creating a "custom" backup label. Not that nice. Not sure if
>> we need something nice here, I don't really see a frequent need for
>> this.
>>
>> We could add another option to pg_resetxlog alternatively :/
>
> I guess you'd have to scan through WAL files by hand to find the next
> oldest checkpoint?

Just look at pg_control, it contains the previous location?

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
On Tue, Feb 2, 2016 at 5:28 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.
>
I think there are similar hazards for deletion of relation when
relfilenode gets reused. Basically, it can delete the data
for one of the newer relations which is created after the
last checkpoint.
> As there's, afaics, really no "legitimate" reasons for needing to go
> back to the previous checkpoint I don't think we should do so in an
> automated fashion.
>
I have tried to find out why such a mechanism was introduced in the
first place, and it seems to me that commit
4d14fe0048cf80052a3ba2053560f8aab1bb1b22 introduced it, but the
reason is not apparent. Then I dug through the archives
and found the mail chain which I think led to this commit.
Refer [1][2].
If we want to do something about the fallback-to-previous-checkpoint
mechanism, then I think it is worth considering whether we want to
keep retaining xlog files from two checkpoints, as that also seems to
have been introduced in the same commit.
> All the cases where I could find logs containing "using previous
> checkpoint record at" were when something else had already gone pretty
> badly wrong. Now that obviously doesn't have a very large significance,
> because in the situations where it "just worked" are unlikely to be
> reported...
>
> Am I missing a reason for doing this by default?
>
I am not sure, but maybe such hazards didn't exist at the time the
fallback-to-previous-checkpoint mechanism was introduced.
I think even if we make it non-default, it will be very difficult
for users to decide whether to turn it on or not. Basically, if
such a situation occurs, whatever solution we provide to the user
might not be foolproof, but OTOH we should provide some way to
allow the user to start the database and dump the existing
contents. Some options that come to mind are: provide some way to
get the last checkpoint record from WAL, or provide a way to
compute the max LSN from data pages and use that with the
pg_resetxlog utility to allow the user to start the database.
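The max-LSN idea is simple enough to prototype outside the server, since every
page begins with its pd_lsn. Here is a rough standalone sketch, assuming the
default 8kB BLCKSZ and a machine with the same byte order as the server that
wrote the files; it is purely illustrative, not a proposed tool:

    /* maxlsn.c - print the highest page LSN found in one relation segment. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    int
    main(int argc, char **argv)
    {
        unsigned char page[BLCKSZ];
        uint64_t      maxlsn = 0;
        FILE         *f;

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <relation-segment-file>\n", argv[0]);
            return 1;
        }

        f = fopen(argv[1], "rb");
        if (f == NULL)
        {
            perror("fopen");
            return 1;
        }

        /* The first 8 bytes of every page are pd_lsn: a 4-byte xlogid
         * followed by a 4-byte xrecoff (see PageXLogRecPtr). */
        while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
        {
            uint32_t xlogid;
            uint32_t xrecoff;
            uint64_t lsn;

            memcpy(&xlogid, page, sizeof(xlogid));
            memcpy(&xrecoff, page + sizeof(xlogid), sizeof(xrecoff));
            lsn = ((uint64_t) xlogid << 32) | xrecoff;
            if (lsn > maxlsn)
                maxlsn = lsn;
        }

        fclose(f);
        printf("max page LSN: %X/%X\n",
               (unsigned int) (maxlsn >> 32), (unsigned int) maxlsn);
        return 0;
    }

The point of the suggestion is that pg_resetxlog could then be told to start
new WAL safely beyond whatever such a scan reports.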
On Sun, Feb 7, 2016 at 10:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 2, 2016 at 5:28 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > currently if, when not in standby mode, we can't read a checkpoint
> > record, we automatically fall back to the previous checkpoint, and start
> > replay from there.
> >
> > Doing so without user intervention doesn't actually seem like a good
> > idea. While not super likely, it's entirely possible that doing so can
> > wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> > tablespace being dropped - going back to the previous checkpoint very
> > well could lead to replay not finishing, as the directory to create
> > files in doesn't even exist.
> >
>
> I think there are similar hazards for deletion of relation when
> relfilenode gets reused. Basically, it can delete the data
> for one of the newer relations which is created after the
> last checkpoint.
>
> > As there's, afaics, really no "legitimate" reasons for needing to go
> > back to the previous checkpoint I don't think we should do so in an
> > automated fashion.
> >
>
> I have tried to find out why at the first place such a mechanism has
> been introduced and it seems to me that commit
> 4d14fe0048cf80052a3ba2053560f8aab1bb1b22 has introduced it, but
> the reason is not apparent. Then I digged through the archives
> and found mail chain which I think has lead to this commit.
> Refer [1][2].
>
oops, forgot to provide the links, providing them now.