Hi,

On 2021-05-04 09:46:12 -0400, Tom Lane wrote:
> Yeah, I have also spent a fair amount of time trying to reproduce it
> elsewhere, without success so far. Notably, I've been trying on a
> PPC Mac laptop that has a fairly similar CPU to what's in the G4,
> though a far slower disk drive. So that seems to exclude theories
> based on it being PPC-specific.
>
> I suppose that if we're unable to reproduce it on at least one other box,
> we have to write it off as hardware flakiness.

I wonder if there's a chance what we're seeing is an OS memory ordering
bug, or a race between walreceiver writing data and the startup process
reading it.

When the startup process is able to keep up, there often will be a very
small time delta between the walreceiver writing a page and the startup
process reading it. And if the currently read page was the tail page
written to by a 'w' message, it'll often be written to again in short
order - potentially while the startup process is reading it.

It wouldn't terribly surprise me if an old OS version on an old processor
had some issues around that.

Were there any cases of walsender terminating and reconnecting around
the failures?

It looks suspicious that XLogPageRead() does not invalidate the
xlogreader state when retrying. Normally that's xlogreader's
responsibility, but there is that whole XLogReaderValidatePageHeader()
business. But I don't quite see how it'd actually cause problems.

Greetings,
Andres Freund