Hi,
xlogreader.c has a pointer overflow bug, as revealed by the
combination of -fsanitize=undefined -m32, the new 039_end_of_wal.pl
test and Robert's incremental backup patch[1]. The bad code tests
whether an object could fit using something like base + size <= end,
which can be converted to something like size <= end - base to avoid
the overflow. See experimental fix patch, attached.
I took a while to follow up because I wanted to understand exactly why
it doesn't break in practice despite being that way since v15. I
think I have it now:
1. In the case of a bad/garbage size at end-of-WAL, the
following-page checks will fail first before anything bad happens as a
result of the overflow.
2. In the case of a real oversized record on current 64 bit
architectures including amd64, aarch64, power and riscv64, the pointer
can't really overflow anyway because the virtual address space is < 64
bit, typically around 48, and record lengths are 32 bit.
3. In the case of the 32 bit kernels I looked at including Linux,
FreeBSD and cousins, Solaris and Windows the top 1GB plus a bit more
of virtual address space is reserved for system use*, so I think a
real oversized record shouldn't be able to overflow the pointer there
either.
A 64 bit kernel running a 32 bit process could run into trouble,
though :-(. Those don't need to block out that high memory segment.
You'd need to have the WAL buffer in that address range and decode
large enough real WAL records and then things could break badly. I
guess 32/64 configurations must be rare these days outside developers
testing 32 bit code, and that is what happened here (ie CI); and with
some minor tweaks to the test it can be reached without Robert's patch
too. There may of course be other more exotic systems that could
break, but I don't know specifically what.
TLDR; this is a horrible bug, but all damage seems to be averted on
"normal" systems. The best thing I can say about all this is that the
new test found a bug, and the fix seems straightforward. I will study
and test this some more, but wanted to share what I have so far.
(*I think the old 32 bit macOS kernels might have been an exception to
this pattern but 32 bit kernels and even processes are history there.)
[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLCuW_CE9nDAbUNV40G2FkpY_kcPZkaORyVBVive8FQHQ%40mail.gmail.com#d0d00ca5cc3f756656466adc9f2ec186