Hi,
In https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de
I commented that we have a number of hacky workarounds to deal with the fact
that walreceiver writes partial WAL pages into recycled segments.
The problem with that practice is that within a page we cannot reliably detect
invalid record headers. This is especially true when the record header spans
a page boundary - currently the only check in that case is whether the
record length is smaller than 1GB, and even that is only checked in the
backend. Note that we cannot rely on the CRC checksum here - it can only be
validated once the whole record has been read.
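To illustrate (a toy standalone program, not postgres code - ToyRecordHeader,
TOY_BLCKSZ and TOY_MAXALLOC are made-up stand-ins for XLogRecord, XLOG_BLCKSZ
and MaxAllocSize, and the real validation lives in xlogreader.c):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_BLCKSZ   8192						/* stands in for XLOG_BLCKSZ */
#define TOY_MAXALLOC ((uint32_t) 0x3fffffff)	/* roughly MaxAllocSize, ~1GB */

/* simplified stand-in for the fixed-size part of XLogRecord */
typedef struct ToyRecordHeader
{
	uint32_t	xl_tot_len;		/* total length of the record */
	uint32_t	xl_xid;
	uint64_t	xl_prev;
	uint8_t		xl_info;
	uint8_t		xl_rmid;
	/* crc etc. omitted */
} ToyRecordHeader;

int
main(void)
{
	char		page[TOY_BLCKSZ];
	uint32_t	tot_len;
	size_t		rec_off = TOY_BLCKSZ - 8;	/* header starts 8 bytes before
											 * the page boundary */

	/* pretend the page tail is stale data from a recycled segment */
	memset(page, 0x7f, sizeof(page));

	/*
	 * Only the first 8 bytes of the header are on this page.  xl_prev, the
	 * rmgr id etc. sit on the next page, and the CRC covers the whole
	 * record, so the only sanity check we can apply here is on the length.
	 */
	memcpy(&tot_len, page + rec_off, sizeof(tot_len));

	if (tot_len < sizeof(ToyRecordHeader) || tot_len > TOY_MAXALLOC)
		printf("rejected: implausible length %u\n", (unsigned) tot_len);
	else
		printf("plausible length %u, have to read the next page\n",
			   (unsigned) tot_len);
	return 0;
}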
On a primary we *do* pad partial pages, see AdvanceXLInsertBuffer():
/*
 * Be sure to re-zero the buffer so that bytes beyond what we've
 * written will look like zeroes and not valid XLOG records...
 */
MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
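The equivalent for walreceiver would be, roughly, to zero out the rest of the
current page before it reaches the file. Something along these lines - a
hypothetical helper for illustration only, not the actual patch; the real
write path (XLogWalRcvWrite()) obviously has more bookkeeping:

#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given a buffer that starts at a page boundary and
 * holds 'valid_bytes' of received WAL, zero the remainder of the last,
 * partially filled page so that stale bytes from a recycled segment can
 * never be read back as a record header.  The caller must have allocated
 * the buffer with room for the full page.
 */
static void
zero_partial_page_tail(char *buf, size_t valid_bytes, size_t page_size)
{
	size_t		used_in_last_page = valid_bytes % page_size;

	/* nothing to do if the data ends exactly on a page boundary */
	if (used_in_last_page != 0)
		memset(buf + valid_bytes, 0, page_size - used_in_last_page);
}

The interesting question is not the memset itself but whether the zeroes go
out as part of the same write - which is the pg_pwritev() question further
down.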
The logic in allocate_recordbuf() in particular is scary: in a completely
working setup we'll regularly try to allocate large buffers that we'll never
need - and the record buffer is not freed until the startup process exits. And
we have no corresponding size check in the frontend (which doesn't make any
sense to me). In the case of a record header crossing a page boundary, that
size check will pass for roughly 1/4 of the garbage values it sees!
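To spell out where the 1/4 comes from (my own back-of-the-envelope):
xl_tot_len is a uint32, and a garbage value only has to land below
MaxAllocSize, i.e. below roughly 2^30 of the 2^32 possible values:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * Out of all 2^32 possible xl_tot_len values, roughly the 2^30 below
	 * ~1GB pass the size check.
	 */
	double		passing = (double) (UINT32_C(1) << 30);
	double		total = 4294967296.0;	/* 2^32 */

	printf("%.2f\n", passing / total);	/* prints 0.25 */
	return 0;
}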
As an example of the difference this makes, I ran a primary/standby setup with
continuously running regression tests, and had a psql \watch terminate
walsender every 1.5 s.
Decoding failures without zero-padding:
2021-05-10 16:52:51.448 PDT [2481446][1/0] LOG: record with incorrect prev-link 103FF/73 at 4/C154FD50
2021-05-10 16:52:53.001 PDT [2481446][1/0] LOG: record with incorrect prev-link 0/FFFF at 4/C3531A88
2021-05-10 16:52:57.848 PDT [2481446][1/0] LOG: invalid resource manager ID 32 at 4/C3B67AD8
2021-05-10 16:52:58.773 PDT [2481446][1/0] LOG: record with incorrect prev-link 403FF/12 at 4/C47F35E8
2021-05-10 16:53:03.771 PDT [2481446][1/0] LOG: invalid page at 4/C562E000
2021-05-10 16:53:04.945 PDT [2481446][1/0] LOG: invalid record length at 4/C6E1C1E8: wanted 24, got 0
2021-05-10 16:53:06.176 PDT [2481446][1/0] LOG: invalid page at 4/C7040000
2021-05-10 16:53:07.624 PDT [2481446][1/0] LOG: record with incorrect prev-link 2FF/64 at 4/C7475078
...
With zero-padding:
2021-05-10 16:58:20.186 PDT [2489042][1/0] LOG: invalid record length at 5/7049A40: wanted 24, got 0
2021-05-10 16:58:22.832 PDT [2489042][1/0] LOG: invalid record length at 5/801AD70: wanted 24, got 0
2021-05-10 16:58:27.548 PDT [2489042][1/0] LOG: invalid record length at 5/8319908: wanted 24, got 0
2021-05-10 16:58:30.945 PDT [2489042][1/0] LOG: invalid record length at 5/AFDC770: wanted 24, got 0
...
2021-05-10 16:59:24.546 PDT [2489042][1/0] LOG: invalid page at 5/19284000
The "invalid page" cases are a lot rarer - previously we would hit them
whenever the record header itself passed [minimal] muster, even though it was
just stale data from the recycled segment posing as e.g. a valid record
length. Now we only hit them when the end of WAL actually falls on a page
boundary.
On 14+ we could do a bit better than the current approach and use
pg_pwritev() to append the zeroed data. However, I'm not convinced that is a
good idea - where pg_pwritev() is emulated, the zeroing would always be done
as a separate write, which does seem to increase the likelihood of
encountering such partially written pages a bit. But perhaps it's too
insignificant to matter.
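For concreteness, roughly what I mean by appending the zeroes in the same
call - an untested sketch using plain POSIX pwritev() rather than
pg_pwritev() from port/pg_iovec.h, with the real walreceiver bookkeeping
(segment boundaries, flush tracking) left out; write_with_zero_tail,
TOY_BLCKSZ and zerobuf are made-up names:

#include <sys/types.h>
#include <sys/uio.h>

#define TOY_BLCKSZ 8192			/* stands in for XLOG_BLCKSZ */

static char zerobuf[TOY_BLCKSZ];	/* static, so all zeroes */

/*
 * Write 'nbytes' of received WAL at file offset 'startoff' and, in the same
 * call, extend the write with zeroes up to the next page boundary.  The
 * caller still only counts 'nbytes' as received WAL; the zero tail is simply
 * overwritten once more WAL arrives.
 */
static ssize_t
write_with_zero_tail(int fd, char *buf, size_t nbytes, off_t startoff)
{
	struct iovec iov[2];
	size_t		tail = (size_t) ((startoff + nbytes) % TOY_BLCKSZ);
	int			iovcnt = 1;

	iov[0].iov_base = buf;
	iov[0].iov_len = nbytes;

	if (tail != 0)
	{
		iov[1].iov_base = zerobuf;
		iov[1].iov_len = TOY_BLCKSZ - tail;
		iovcnt = 2;
	}

	return pwritev(fd, iov, iovcnt, startoff);
}

The point being that with a native pwritev() the received data and the zero
tail go out in a single call at a single offset, rather than as two writes.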
Greetings,
Andres Freund