Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WIP: WAL prefetch (another approach)
Msg-id 20210422013411.tbcaqqq6c23s2pxy@alap3.anarazel.de
In response to Re: WIP: WAL prefetch (another approach)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: WIP: WAL prefetch (another approach)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2021-04-21 21:21:05 -0400, Tom Lane wrote:
> What I'm doing is running the core regression tests with a single
> standby (on the same machine) and wal_consistency_checking = all.

Do you run them over replication, or sequentially by storing data into
an archive? Just curious, because it's so painful to run that scenario in
the replication case due to the tablespaces conflicting between
primary and standby, unless one disables the tablespace tests.


> The other PPC machine (with no known history of trouble) is the one
> that had the CRC failure I showed earlier.  That one does seem to be
> actual bad data in the stored WAL, because the problem was also seen
> by pg_waldump, and trying to restart the standby got the same failure
> again.

It seems like that could also indicate an xlogreader bug that is
reliably hit? Once it gets confused about record lengths or such I'd
expect CRC failures...
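
To illustrate why a confused reader would show up as CRC failures even
when the stored bytes are fine, here is a toy sketch. It is not
PostgreSQL's record format: the 8-byte header layout, the helper names,
and the use of zlib's CRC-32 (rather than the CRC-32C that WAL uses) are
all assumptions made for the example.

```python
import struct
import zlib


def make_record(payload: bytes) -> bytes:
    # Hypothetical layout: 4-byte length, 4-byte CRC of the payload, payload.
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload


def read_record(buf: bytes, offset: int, length_override=None):
    """Return (payload, crc_ok, next_offset) for the record at offset."""
    length, stored_crc = struct.unpack_from("<II", buf, offset)
    if length_override is not None:
        # Simulate a reader that has become confused about the record length.
        length = length_override
    start = offset + 8
    payload = buf[start:start + length]
    return payload, zlib.crc32(payload) == stored_crc, start + length


# Two intact records written back to back.
wal = make_record(b"first record") + make_record(b"second record")

# A correct reader validates both records.
_, first_ok, next_offset = read_record(wal, 0)
_, second_ok, _ = read_record(wal, next_offset)

# A reader with the wrong length checksums the wrong byte range, so the
# CRC check fails although no stored byte has changed.
_, confused_ok, _ = read_record(wal, 0, length_override=10)

print(first_ok, second_ok, confused_ok)
```

In this model the failure is also deterministic: any tool that makes the
same length mistake on the same bytes fails at the same place every time.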

If it were actually wrong WAL contents I don't think any of the
xlogreader / prefetching changes could be responsible...


Have you tried reproducing it on commits before the recent xlogreader
changes?

commit 1d257577e08d3e598011d6850fd1025858de8c8c
Author: Thomas Munro <tmunro@postgresql.org>
Date:   2021-04-08 23:03:43 +1200

    Optionally prefetch referenced data in recovery.

commit f003d9f8721b3249e4aec8a1946034579d40d42c
Author: Thomas Munro <tmunro@postgresql.org>
Date:   2021-04-08 23:03:34 +1200

    Add circular WAL decoding buffer.

    Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com

commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b
Author: Thomas Munro <tmunro@postgresql.org>
Date:   2021-04-08 23:03:23 +1200

    Remove read_page callback from XLogReader.


Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most
interesting bit.


> I've not been able to duplicate the consistency-check failures
> there.  But because that machine is a laptop with a much inferior disk
> drive, the speeds are enough different that it's not real surprising
> if it doesn't hit the same problem.
>
> I've also tried to reproduce on 32-bit and 64-bit Intel, without
> success.  So if this is real, maybe it's related to being big-endian
> hardware?  But it's also quite sensitive to $dunno-what, maybe the
> history of WAL records that have already been replayed.

It might just be disk speed influencing how long the tests take, which
in turn increases the number of checkpoints during the test,
increasing the number of FPIs (full-page images)?
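
A toy model of that effect, as a sketch only: with full_page_writes
enabled, the first modification of a page after a checkpoint logs a
full-page image, so the same workload spread over more checkpoints emits
more of them. The function name, the round-robin page access pattern,
and the numbers below are invented for illustration and do not model the
regression tests.

```python
def count_fpis(n_writes: int, n_pages: int, writes_per_checkpoint: int) -> int:
    """Count full-page images emitted by a simplified workload."""
    fpis = 0
    touched_since_checkpoint = set()
    for i in range(n_writes):
        if i % writes_per_checkpoint == 0:
            # A checkpoint: every page needs a fresh full-page image
            # the next time it is modified.
            touched_since_checkpoint.clear()
        page = i % n_pages  # hypothetical round-robin access pattern
        if page not in touched_since_checkpoint:
            touched_since_checkpoint.add(page)
            fpis += 1
    return fpis


# Same 1000 page modifications over 50 pages in both cases. On the slower
# disk, time-based checkpoints arrive after fewer writes have completed.
fast_disk_fpis = count_fpis(1000, 50, writes_per_checkpoint=1000)
slow_disk_fpis = count_fpis(1000, 50, writes_per_checkpoint=250)

print(fast_disk_fpis, slow_disk_fpis)
```

With one checkpoint interval each of the 50 pages is imaged once; with
four intervals each page is imaged four times, so the WAL stream that
replay sees differs substantially between the two machines.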

Greetings,

Andres Freund


