Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: WIP: WAL prefetch (another approach) |
Date | |
Msg-id | 20210429031409.quuhyjihk6hqbloe@alap3.anarazel.de |
In response to | Re: WIP: WAL prefetch (another approach) (Andres Freund <andres@anarazel.de>) |
Responses |
Re: WIP: WAL prefetch (another approach)
Re: WIP: WAL prefetch (another approach) |
List | pgsql-hackers |
Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:
> I can however say that pg_waldump on the standby's pg_wal does also
> fail. The failure as part of the backend is "invalid memory alloc
> request size", whereas in pg_waldump I get the much more helpful:
> pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200
>
> In frontend code that allocation actually succeeds, because there is no
> size check. But in backend code we run into the size check, and thus
> don't even display a useful error.
>
> In 13 the header is validated before allocating space for the
> record(except if header is spread across pages) - it seems inadvisable
> to turn that around?

I was now able to reproduce the problem again, and I'm afraid that the
bug I hit is likely separate from Tom's. The allocation thing above is
the issue in my case:

The walsender connection ended (I restarted the primary), thus the
startup switches to replaying locally. For some reason the end of the
WAL contains non-zero data (I think it's because walreceiver doesn't
zero out pages - that's bad!). Because the allocation happens before the
header is validated, we reproducibly end up in the mcxt.c ERROR path,
failing recovery.

To me it looks like a smaller version of the problem is present in < 14,
albeit only when the page header is at a record boundary. In that case
we don't validate the page header immediately, only once it's completely
read. But we do believe the total size, and try to allocate
that.

There's a really crufty escape hatch (from 70b4f82a4b) to that:

	/*
	 * Note that in much unlucky circumstances, the random data read from a
	 * recycled segment can cause this routine to be called with a size
	 * causing a hard failure at allocation.  For a standby, this would cause
	 * the instance to stop suddenly with a hard failure, preventing it to
	 * retry fetching WAL from one of its sources which could allow it to move
	 * on with replay without a manual restart. If the data comes from a past
	 * recycled segment and is still valid, then the allocation may succeed
	 * but record checks are going to fail so this would be short-lived.  If
	 * the allocation fails because of a memory shortage, then this is not a
	 * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
	 */
	if (!AllocSizeIsValid(newSize))
		return false;

but it looks to me like that's pretty much the wrong fix, at least in
the case where we've not yet validated the rest of the header. We don't
need to allocate all that data before we've read the rest of the
*fixed-size* header.

It also seems to me that 70b4f82a4b should also have changed walsender
to pad out the received data to an 8KB boundary?

Greetings,

Andres Freund
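
The ordering argued for above (check the fixed-size header first, allocate afterwards) can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: DemoRecordHeader, DEMO_MAX_RECORD_SIZE, demo_header_is_valid() and demo_read_record() are hypothetical names, not PostgreSQL's XLogRecord or the xlogreader code, and plain malloc() stands in for the backend allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified fixed-size record header (assumption for this sketch). */
typedef struct DemoRecordHeader
{
	uint32_t	tot_len;	/* total record length, header included */
	uint64_t	prev;		/* position of the previous record */
} DemoRecordHeader;

/* Arbitrary upper bound chosen for the sketch, not a PostgreSQL constant. */
#define DEMO_MAX_RECORD_SIZE (1024u * 1024u)

/* Cheap sanity checks that need only the fixed-size header. */
static bool
demo_header_is_valid(const DemoRecordHeader *hdr, uint64_t expected_prev)
{
	if (hdr->tot_len < sizeof(DemoRecordHeader))
		return false;			/* shorter than its own header */
	if (hdr->tot_len > DEMO_MAX_RECORD_SIZE)
		return false;			/* implausibly large, likely garbage */
	if (hdr->prev != expected_prev)
		return false;			/* incorrect prev-link */
	return true;
}

/*
 * Copy one record out of buf. Returns a malloc'd copy, or NULL if the
 * header is invalid, the record does not fit in buf, or memory is short.
 * The key point: tot_len is not trusted until the header has been checked.
 */
static void *
demo_read_record(const unsigned char *buf, size_t avail, uint64_t expected_prev)
{
	DemoRecordHeader hdr;
	void	   *rec;

	assert(buf != NULL);

	if (avail < sizeof(hdr))
		return NULL;			/* fixed-size header not fully available yet */
	memcpy(&hdr, buf, sizeof(hdr));

	/* Validate first ... */
	if (!demo_header_is_valid(&hdr, expected_prev))
		return NULL;
	if (hdr.tot_len > avail)
		return NULL;			/* this sketch needs the whole record in buf */

	/* ... and only then size an allocation from tot_len. */
	rec = malloc(hdr.tot_len);
	if (rec == NULL)
		return NULL;
	memcpy(rec, buf, hdr.tot_len);
	return rec;
}
```

With this ordering, a buffer of leftover non-zero bytes is rejected by the header checks and never reaches the allocator, so the caller can report an invalid record instead of an allocation failure.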