Re: "using previous checkpoint record at" maybe not the greatest idea? - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: "using previous checkpoint record at" maybe not the greatest idea?
Date
Msg-id CAKFQuwZ+XZK1b3wE9k4U24NsFS2nRnbE0z7HGsci7Y_gBaeA=Q@mail.gmail.com
Whole thread Raw
In response to Re: "using previous checkpoint record at" maybe not the greatest idea?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, Feb 1, 2016 at 5:48 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-02-01 17:29:39 -0700, David G. Johnston wrote:
> ​Learning by reading here...
>
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> ​After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.

> ​The above comment appears out-of-date if this post describes what
> presently happens.

Where do you see a conflict with what I wrote about? We store both the
last and the previous checkpoint's location in pg_control. Or are you
talking about:

​Mainly the following...but the word I used was "out-of-date" and not "conflict".​  The present state seems to do the above, and then some.


> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.

if so, no, that's not a out-of-date, as we simply store two checkpoint
locations:
 $ pg_controldata /srv/dev/pgdev-dev/|grep 'checkpoint location'
Latest checkpoint location:           B3/2A730028
Prior checkpoint location:            B3/2A72FFA0


​The quote implies that only a single checkpoint​ is noted and that no "searching" is performed - whether by scanning or by being told the position of a previous one so that it can jump there immediately without scanning backwards.  It isn't strictly the fact that we do not "scan" backwards but the implications that arise in making that statement.  Maybe this is being picky but if you cannot trust the value of "Latest checkpoint location" then pg_control is arguably corrupt.  Corruption is not strictly limited to "unable to be read" but does include "contains invalid data".

> Also, I was​ under the impression that tablespace commands resulted in
> checkpoints so that the state of the file system could be presumed
> current...

That actually doesn't really make it any better - it forces the *latest*
checkpoint, but if we can't read that, we'll start with the previous
one... 

> I don't know enough internals but its seems like we'd need to distinguish
> between an interrupted checkpoint (pull the plug during checkpoint) and one
> that supposedly completed without interruption but then was somehow
> corrupted (solar flares).  The former seem legitimate for auto-skip while
> the later do not.

I don't think such a distinction is really possible (or necessary). If
pg_control is corrupted we won't even start, and if WAL is corrupted
that badly we won't finish replay...

My takeaway from the above is that we should only record what we think is a usable/readable/valid checkpoint location to "Latest checkpoint location"​
​ (LCL) and if the system is not able to use that information to perform a successful recovery it should be allowed to die without using the value in "Previous checkpoint location" - which becomes effectively ignored during master recovery.

​David J.

pgsql-hackers by date:

Previous
From: Jim Nasby
Date:
Subject: Re: statistics for shared catalogs not updated when autovacuum is off
Next
From: Jim Nasby
Date:
Subject: Re: Add links to commit fests to patch summary page