Re: "using previous checkpoint record at" maybe not the greatest idea? - Mailing list pgsql-hackers

From	David G. Johnston
Subject	Re: "using previous checkpoint record at" maybe not the greatest idea?
Date	February 4, 2016 23:09:55
Msg-id	CAKFQuwasfkwfhXB37hvjWK1G=cv8Aogun3tDCYEj9FbPNZZ8wQ@mail.gmail.com
In response to	Re: "using previous checkpoint record at" maybe not the greatest idea? (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses	Re: "using previous checkpoint record at" maybe not the greatest idea?
List	pgsql-hackers

On Thu, Feb 4, 2016 at 3:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

David G. Johnston wrote:

> Learning by reading here...
>
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.
>
> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.
> """
>
> The above comment appears out-of-date if this post describes what
> presently happens.

I think you're misinterpreting Andres, or the docs, or both.

What Andres says is that the control file (pg_control) stores two
checkpoint locations: the latest one, and the one before that. When
recovery occurs, it starts by looking up the latest checkpoint record;
if it cannot find that for whatever reason, it falls back to reading the
previous one. (He further claims that falling back to the previous one
is a bad idea.)
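
As a rough illustration of the fallback described above, here is a minimal Python sketch. It is a toy model and not PostgreSQL's actual code: `ControlFile`, `read_checkpoint_record`, and `choose_recovery_start` are hypothetical names invented for this example, and the WAL is modeled as a plain dict from position to record.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ControlFile:
    """Toy stand-in for pg_control: it remembers two checkpoint positions."""
    latest_checkpoint: int  # position of the most recent checkpoint record
    prior_checkpoint: int   # position of the checkpoint before that


def read_checkpoint_record(wal: Dict[int, str], position: int) -> Optional[str]:
    """Return the record stored at `position`, or None if it cannot be found."""
    return wal.get(position)


def choose_recovery_start(control: ControlFile, wal: Dict[int, str]) -> Tuple[str, int]:
    """Try the latest checkpoint first, then fall back to the previous one."""
    candidates = (
        ("latest", control.latest_checkpoint),
        ("previous", control.prior_checkpoint),
    )
    for label, position in candidates:
        if read_checkpoint_record(wal, position) is not None:
            return label, position
    # Neither position recorded in the control file led to a usable record.
    raise RuntimeError("no usable checkpoint record found")
```

With a `ControlFile(latest_checkpoint=200, prior_checkpoint=100)` and a WAL dict that only contains position 100, `choose_recovery_start` returns the previous checkpoint, which is the behavior this thread is questioning.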

What the 2nd para in the documentation is saying is something different:
it is talking about reading all the pg_xlog files (in reverse order),
which are not pg_control, seeing what checkpoint records are there, and
then figuring out which one to use.
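
For contrast, the reverse scan that the documentation describes as not yet implemented could be pictured roughly as follows. This is only a hypothetical sketch: `find_latest_checkpoint` is an invented name, and each log segment is modeled as a list of `(position, record_type)` pairs in write order, which is an assumption made for illustration rather than the real on-disk format.

```python
from typing import Optional, Sequence, Tuple

Record = Tuple[int, str]  # (position, record_type), a made-up representation


def find_latest_checkpoint(segments: Sequence[Sequence[Record]]) -> Optional[int]:
    """Scan segments newest to oldest and return the newest checkpoint position.

    `segments` is ordered oldest to newest, and each segment lists its
    records in the order they were written.
    """
    for segment in reversed(segments):
        for position, record_type in reversed(segment):
            if record_type == "CHECKPOINT":
                return position
    # No checkpoint record exists in any of the segments that were scanned.
    return None
```

Unlike the control-file fallback, this search does not depend on pg_control at all, which is why the two mechanisms should not be confused.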

Yes, I inferred something that obviously isn't true: that the system doesn't go hunting for a valid checkpoint to begin recovery from. While it does not do so in the case of a corrupted pg_control file, I further assumed it never did. That is because the documentation doesn't state that two checkpoint positions exist and that PostgreSQL will try the second one if the first one proves unusable. Given the topic of this thread, that omission makes the documentation out-of-date. Maybe it's covered elsewhere, but since this section addresses locating a starting point, I would expect any such description to be here as well.

David J.