Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: WIP: WAL prefetch (another approach)
Date:
Msg-id: 20210429031409.quuhyjihk6hqbloe@alap3.anarazel.de
In response to: Re: WIP: WAL prefetch (another approach) (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:
> I can however say that pg_waldump on the standby's pg_wal does also
> fail. The failure as part of the backend is "invalid memory alloc
> request size", whereas in pg_waldump I get the much more helpful:
> pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200
> 
> In frontend code that allocation actually succeeds, because there is no
> size check. But in backend code we run into the size check, and thus
> don't even display a useful error.
> 
> In 13 the header is validated before allocating space for the
> record (except if the header is spread across pages) - it seems
> inadvisable to turn that around?

I was now able to reproduce the problem again, and I'm afraid that the
bug I hit is likely separate from Tom's. The allocation thing above is
the issue in my case:

The walsender connection ended (I restarted the primary), so the
startup process switched to replaying locally. For some reason the end
of the WAL contains non-zero data (I think it's because walreceiver
doesn't zero out pages - that's bad!). Because the allocation happens
before the header is validated, we reproducibly end up in the mcxt.c
ERROR path, failing recovery.
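
For reference, the backend-side check we trip is the plain size check
in mcxt.c's allocation functions, roughly (condensed from memory, so
read it as a sketch rather than the exact source):

    void *
    palloc(Size size)
    {
        /* ... */

        /*
         * This is where a garbage xl_tot_len read from a recycled
         * segment lands, before any record-header validation has had a
         * chance to produce a useful error.
         */
        if (!AllocSizeIsValid(size))
            elog(ERROR, "invalid memory alloc request size %zu", size);

        /* ... */
    }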

To me it looks like a smaller version of the problem is present in < 14,
albeit only when the record header is split across a page boundary. In
that case we don't validate the record header immediately, only once
it's completely read. But we do believe the total size (xl_tot_len),
and try to allocate that.
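
Paraphrasing the relevant bit of XLogReadRecord() in 13's xlogreader.c
(condensed, error reporting trimmed):

    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
    total_len = record->xl_tot_len;

    if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
    {
        /* The whole record header is on this page: validate it now. */
        if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
                                   record, randAccess))
            goto err;
        gotheader = true;
    }
    else
    {
        /* Header split across a page boundary: validation is deferred. */
        gotheader = false;
    }

    /* ... but total_len is already trusted when sizing the buffer: */
    if (total_len > state->readRecordBufSize &&
        !allocate_recordbuf(state, total_len))
        goto err;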

There's a really crufty escape hatch (from 70b4f82a4b) for that:

    /*
     * Note that in very unlucky circumstances, the random data read from a
     * recycled segment can cause this routine to be called with a size
     * causing a hard failure at allocation.  For a standby, this would cause
     * the instance to stop suddenly with a hard failure, preventing it from
     * retrying to fetch WAL from one of its sources, which could allow it to
     * move on with replay without a manual restart.  If the data comes from
     * a past recycled segment and is still valid, then the allocation may
     * succeed but record checks are going to fail, so this would be
     * short-lived.  If the allocation fails because of a memory shortage,
     * then this is not a hard failure either, per the guarantee given by
     * MCXT_ALLOC_NO_OOM.
     */
    if (!AllocSizeIsValid(newSize))
        return false;

but it looks to me like that's pretty much the wrong fix, at least in
the case where we've not yet validated the rest of the header. We don't
need to allocate all that data before we've read the rest of the
*fixed-size* header.
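
In pseudo-code, the reordering I mean is roughly this (hand-wavy
sketch; assemble_split_header() is a made-up helper standing in for
reading the remaining header bytes from the next page):

    if (!gotheader)
    {
        XLogRecord  hdr;

        /*
         * First finish reading the *fixed-size* header that straddles
         * the page boundary - it's only SizeOfXLogRecord bytes.
         */
        assemble_split_header(state, RecPtr, &hdr);     /* made up */

        /* Validate it before believing xl_tot_len at all. */
        if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
                                   &hdr, randAccess))
            goto err;

        total_len = hdr.xl_tot_len;
        gotheader = true;
    }

    /* Only now size readRecordBuf from total_len. */
    if (total_len > state->readRecordBufSize &&
        !allocate_recordbuf(state, total_len))
        goto err;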

It also seems to me that 70b4f82a4b should have changed walreceiver to
pad out the received data to an 8KB boundary?
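
Something along these lines, conceptually (hypothetical helper, not
actual walreceiver code; pg_pwrite(), XLOG_BLCKSZ and wal_segment_size
are real, the function itself is made up, error handling elided):

    /*
     * Hypothetical sketch: after writing the received bytes, zero-fill
     * up to the next XLOG_BLCKSZ (8KB) boundary, so stale bytes from a
     * recycled segment can never directly follow valid WAL.  A later
     * write at the real end-of-WAL offset just overwrites the padding.
     */
    static char zerobuf[XLOG_BLCKSZ];   /* static, hence all zeroes */

    static void
    XLogWalRcvPadToBlockBoundary(int fd, XLogRecPtr endptr)
    {
        int     off = endptr % XLOG_BLCKSZ;

        if (off != 0)
            pg_pwrite(fd, zerobuf, XLOG_BLCKSZ - off,
                      (off_t) (endptr % wal_segment_size));
    }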

Greetings,

Andres Freund


