From: Andres Freund
Subject: WAL page magic errors (and plenty others) got hard to debug.
Msg-id: 20200405224916.fxt4yn36vgxbftbn@alap3.anarazel.de
List: pgsql-hackers
Hi,

When starting up on a data directory with an older WAL page magic, we
currently make the problem hard to debug. E.g.:

2020-04-05 15:31:04.314 PDT [1896669][:0] LOG:  database system was shut down at 2020-04-05 15:24:56 PDT
2020-04-05 15:31:04.314 PDT [1896669][:0] LOG:  invalid primary checkpoint record
2020-04-05 15:31:04.314 PDT [1896669][:0] PANIC:  could not locate a valid checkpoint record
2020-04-05 15:31:04.315 PDT [1896668][:0] LOG:  startup process (PID 1896669) was terminated by signal 6: Aborted
2020-04-05 15:31:04.315 PDT [1896668][:0] LOG:  aborting startup due to startup process failure
2020-04-05 15:31:04.316 PDT [1896668][:0] LOG:  database system is shut down

As far as I can tell, this is not just the case for a wrong page magic,
but for all page-level validation errors.
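
For context, the page-header validation code does produce a descriptive
error message; it just never reaches the log. Paraphrasing the relevant
check in xlogreader.c's ValidXLogPageHeader() (not the exact source, and
the message wording may vary across versions):

    if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
    {
        /* A useful message is built here, but the caller later clears it. */
        report_invalid_record(state,
                              "invalid magic number %04X in log segment %s, offset %u",
                              hdr->xlp_magic, fname, offset);
        return false;
    }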

I think this largely originates in:

commit 0668719801838aa6a8bda330ff9b3d20097ea844
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2018-05-05 01:34:53 +0300

    Fix scenario where streaming standby gets stuck at a continuation record.

    If a continuation record is split so that its first half has already been
    removed from the master, and is only present in pg_wal, and there is a
    recycled WAL segment in the standby server that looks like it would
    contain the second half, recovery would get stuck. The code in
    XLogPageRead() incorrectly started streaming at the beginning of the
    WAL record, even if we had already read the first page.

    Backpatch to 9.4. In principle, older versions have the same problem, but
    without replication slots, there was no straightforward mechanism to
    prevent the master from recycling old WAL that was still needed by standby.
    Without such a mechanism, I think it's reasonable to assume that there's
    enough slack in how many old segments are kept around to not run into this,
    or you have a WAL archive.

    Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
    some extra comments by me.

    Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com

which added the following to XLogPageRead():

+    /*
+     * Check the page header immediately, so that we can retry immediately if
+     * it's not valid. This may seem unnecessary, because XLogReadRecord()
+     * validates the page header anyway, and would propagate the failure up to
+     * ReadRecord(), which would retry. However, there's a corner case with
+     * continuation records, if a record is split across two pages such that
+     * we would need to read the two pages from different sources. For
+     * example, imagine a scenario where a streaming replica is started up,
+     * and replay reaches a record that's split across two WAL segments. The
+     * first page is only available locally, in pg_wal, because it's already
+     * been recycled in the master. The second page, however, is not present
+     * in pg_wal, and we should stream it from the master. There is a recycled
+     * WAL segment present in pg_wal, with garbage contents, however. We would
+     * read the first page from the local WAL segment, but when reading the
+     * second page, we would read the bogus, recycled, WAL segment. If we
+     * didn't catch that case here, we would never recover, because
+     * ReadRecord() would retry reading the whole record from the beginning.
+     *
+     * Of course, this only catches errors in the page header, which is what
+     * happens in the case of a recycled WAL segment. Other kinds of errors or
+     * corruption still has the same problem. But this at least fixes the
+     * common case, which can happen as part of normal operation.
+     *
+     * Validating the page header is cheap enough that doing it twice
+     * shouldn't be a big deal from a performance point of view.
+     */
+    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+    {
+        /* reset any error XLogReaderValidatePageHeader() might have set */
+        xlogreader->errormsg_buf[0] = '\0';
+        goto next_record_is_invalid;
+    }
+

I really can't follow the logic of intentionally and silently throwing
the error message away here. Isn't this basically hiding *all*
page-level error messages?

And even in scenarios where this is the right thing to do, not even
emitting a debug-level message makes situations in which this is
encountered unnecessarily hard to debug.
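
To make that concrete, here is a minimal sketch (untested, reusing the
hunk quoted above) of what I mean by at least emitting a debugging
message before the error is thrown away:

    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
    {
        /*
         * Sketch: surface the suppressed validation failure at DEBUG
         * level, so that page magic mismatches and other page-level
         * problems remain visible while debugging.
         */
        if (xlogreader->errormsg_buf[0] != '\0')
            elog(DEBUG1, "WAL page header validation failed at %X/%X: %s",
                 (uint32) (targetPagePtr >> 32), (uint32) targetPagePtr,
                 xlogreader->errormsg_buf);

        /* reset any error XLogReaderValidatePageHeader() might have set */
        xlogreader->errormsg_buf[0] = '\0';
        goto next_record_is_invalid;
    }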

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: d25ea01275 and partitionwise join
Next
From: Andres Freund
Date:
Subject: Re: Make MemoryContextMemAllocated() more precise