Hi,
On 2021-05-04 18:08:35 -0700, Andres Freund wrote:
> But the issue that 70b4f82a4b is trying to address seems bigger to
> me. The reason it's so easy to hit the issue is that walreceiver does <
> 8KB writes into recycled WAL segments *without* zero-filling the tail
> end of the page - which will commonly be filled with random older
> contents, because we'll use a recycled segments. I think that
> *drastically* increases the likelihood of finding something that looks
> like a valid record header compared to the situation on a primary where
> the zeroing pages before use makes that pretty unlikely.
I've written an experimental patch to deal with this and, as expected,
it does make the end-of-wal detection a lot more predictable and
reliable. There's only two types of possible errors outside of crashes:
A record length of 0 (the end of WAL is within a page), and the page
header LSN mismatching (the end of WAL is at a page boundary).
This seems like a significant improvement.
However: It's nontrivial to do this nicely and in a backpatchable way in
XLogWalRcvWrite(). Or at least I haven't found a good way:
- We can't extend the input buffer to XLogWalRcvWrite(), it's from
libpq.
- We don't want to copy the the entire buffer (commonly 128KiB) to a new
buffer that we then can extend by 0-BLCKSZ of zeroes to cover the
trailing part of the last page.
- In PG13+ we can do this utilizing pg_writev(), adding another IOV
entry covering the trailing space to be padded.
- It's nicer to avoid increasign the number of write() calls, but it's
not as crucial as the earlier points.
I'm also a bit uncomfortable with another aspect, although I can't
really see a problem: When switch to receiving WAL via walreceiver, we
always start at a segment boundary, even if we had received most of that
segment before. Currently that won't end up with any trailing space that
needs to be zeroed, because the server always will send 128KB chunks,
but there's no formal guarantee for that. It seems a bit odd that we
could end up zeroing trailing space that already contains valid data,
just to overwrite it with valid data again. But it ought to always be
fine.
The least offensive way I could come up with is for XLogWalRcvWrite() to
always write partial pages in a separate pg_pwrite(). When writing a
partial page, and the previous write position was not already on that
same page, copy the buffer into a local XLOG_BLCKSZ sized buffer
(although we'll never use more than XLOG_BLCKSZ-1 I think), and (re)zero
out the trailing part. One thing that does not yet handle is if we were
to get a partial write - we'd not again notice that we need to pad the
end of the page.
Does anybody have a better idea?
I really wish we had a version of pg_p{read,write}[v] that internally
handled partial IOs, retrying as long as they see > 0 bytes written.
Greetings,
Andres Freund