Zeroing partial pages in walreceiver - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Zeroing partial pages in walreceiver |
Date | |
Msg-id | 20210511005520.6uzndcrxxozzidf3@alap3.anarazel.de |
List | pgsql-hackers |
Hi,

In https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de
I commented that we have a number of hacky workarounds to deal with the
fact that walreceiver writes partial WAL pages into recycled segments.

The problem with that practice is that within a page we cannot reliably
detect invalid record headers. This is especially true when the record
header spans a page boundary - currently the only check in that case is
whether the record length is smaller than 1GB, and even that is checked
only in the frontend. Note that we cannot rely on the CRC checksum here -
it can only be validated once the whole record has been read.

On a primary we *do* pad partial pages, see AdvanceXLInsertBuffer():

    /*
     * Be sure to re-zero the buffer so that bytes beyond what we've
     * written will look like zeroes and not valid XLOG records...
     */
    MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

Particularly the logic in allocate_recordbuf() is scary: in a completely
working setup we'll regularly try to allocate large buffers that we'll
never need - and the record buffer is not freed until the startup process
exits. And we have no corresponding size check in the frontend (which
doesn't make any sense to me). In the case of a record header across a
page boundary, this check will pass in roughly 1/4 of the cases!

As an example of the difference this makes, I ran a primary/standby setup
with continuously running regression tests, and had a psql \watch
terminate walsender every 1.5 s.

Decoding failures without zero-padding:

2021-05-10 16:52:51.448 PDT [2481446][1/0] LOG: record with incorrect prev-link 103FF/73 at 4/C154FD50
2021-05-10 16:52:53.001 PDT [2481446][1/0] LOG: record with incorrect prev-link 0/FFFF at 4/C3531A88
2021-05-10 16:52:57.848 PDT [2481446][1/0] LOG: invalid resource manager ID 32 at 4/C3B67AD8
2021-05-10 16:52:58.773 PDT [2481446][1/0] LOG: record with incorrect prev-link 403FF/12 at 4/C47F35E8
2021-05-10 16:53:03.771 PDT [2481446][1/0] LOG: invalid page at 4/C562E000
2021-05-10 16:53:04.945 PDT [2481446][1/0] LOG: invalid record length at 4/C6E1C1E8: wanted 24, got 0
2021-05-10 16:53:06.176 PDT [2481446][1/0] LOG: invalid page at 4/C7040000
2021-05-10 16:53:07.624 PDT [2481446][1/0] LOG: record with incorrect prev-link 2FF/64 at 4/C7475078
...

With zero-padding:

2021-05-10 16:58:20.186 PDT [2489042][1/0] LOG: invalid record length at 5/7049A40: wanted 24, got 0
2021-05-10 16:58:22.832 PDT [2489042][1/0] LOG: invalid record length at 5/801AD70: wanted 24, got 0
2021-05-10 16:58:27.548 PDT [2489042][1/0] LOG: invalid record length at 5/8319908: wanted 24, got 0
2021-05-10 16:58:30.945 PDT [2489042][1/0] LOG: invalid record length at 5/AFDC770: wanted 24, got 0
...
2021-05-10 16:59:24.546 PDT [2489042][1/0] LOG: invalid page at 5/19284000

The "invalid page" cases are a lot rarer - previously we would hit them
whenever the record header itself passed [minimal] muster, even though it
was just padding passing as e.g. a valid record length. Now we only hit
them when the end of WAL actually is at a page boundary.

On 13+ we could do a bit better than the current approach and use
pg_pwritev() to append the zeroed data. However, I'm not convinced that is
a good idea - when pg_pwritev() is emulated, we'd always do the zeroing as
a separate write, which does seem to increase the likelihood of
encountering such partially written pages a bit. But perhaps that's too
insignificant to matter.

Greetings,

Andres Freund
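For illustration, here is a minimal sketch of the kind of padding the mail
argues walreceiver should do, mirroring the AdvanceXLInsertBuffer() snippet
quoted above: before a partially filled page goes to disk, its unused tail
is zeroed so stale bytes from a recycled segment cannot look like a record
header. This is not the actual walreceiver code; zero_pad_partial_page()
and its parameters are invented names, and XLOG_BLCKSZ is assumed to be
the default 8192.

    #include <stddef.h>
    #include <string.h>

    #define XLOG_BLCKSZ 8192        /* assumed: default WAL block size */

    /*
     * Zero the unused remainder of a partially filled WAL page buffer, so
     * that whatever a recycled segment contained at that offset can no
     * longer be mistaken for a valid record header during replay.
     */
    static void
    zero_pad_partial_page(char *page, size_t bytes_used)
    {
        if (bytes_used < XLOG_BLCKSZ)
            memset(page + bytes_used, 0, XLOG_BLCKSZ - bytes_used);
    }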
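And a sketch of the pg_pwritev() idea from the final paragraph, written
here against the plain pwritev() found on Linux/BSD so it stays
self-contained: the partial page and its zero padding are submitted as one
vectored write, so where the call is native the padding never becomes a
separate write - the emulated case is exactly the downside the mail
raises. write_page_with_padding() and its parameters are invented for
illustration, and error/short-write handling is omitted.

    #include <sys/types.h>
    #include <sys/uio.h>

    #define XLOG_BLCKSZ 8192        /* assumed: default WAL block size */

    /*
     * Write "len" bytes of a WAL page at "offset" and, in the same vectored
     * write, zero-fill the rest of the page.  With a native pwritev() this
     * is a single system call; an emulation would have to issue the padding
     * as a second write.
     */
    static ssize_t
    write_page_with_padding(int fd, const char *data, size_t len, off_t offset)
    {
        static const char zeroes[XLOG_BLCKSZ];  /* zero-initialized padding */
        struct iovec iov[2];
        int         iovcnt = 1;

        iov[0].iov_base = (void *) data;        /* iovec is not const-aware */
        iov[0].iov_len = len;

        if (len < XLOG_BLCKSZ)
        {
            iov[1].iov_base = (void *) zeroes;
            iov[1].iov_len = XLOG_BLCKSZ - len;
            iovcnt = 2;
        }

        /* A real caller would check for errors and short writes. */
        return pwritev(fd, iov, iovcnt, offset);
    }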