Re: WIP: WAL prefetch (another approach)

From: Thomas Munro
Subject: Re: WIP: WAL prefetch (another approach)
Msg-id: CA+hUKGJ7OqpdnbSTq5oK=djSeVW2JMnrVPSm8JC-_dbN6Y7bpw@mail.gmail.com
In response to: Re: WIP: WAL prefetch (another approach) (Daniel Gustafsson <daniel@yesql.se>)
List: pgsql-hackers
On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> Could you post an updated version of the patch which is for review?

Sorry for taking so long to come back; I learned some new things that
made me want to restructure this code a bit (see below).  Here is an
updated pair of patches that I'm currently testing.

Old problems:

1.  Last time around, an infinite loop was reported in pg_waldump.  I
believe Horiguchi-san has fixed that[1], but I'm no longer depending
on that patch.  I thought his patch set was a good idea, but it's
complicated and there's enough going on here already... let's consider
that independently.

This version goes back to what I had earlier, though (I hope) it is
better about how "nonblocking" states are communicated.  In this
version, XLogPageRead() has a way to give up partway through a record
if it doesn't have enough data and there are queued-up records that
could be replayed right now.  In that case, we'll go back to the
beginning of the record (and occasionally, back a WAL page) next time
we try.  That's the cost of not maintaining intra-record decoding
state.  (A toy sketch of this retry logic follows, after this list.)

2.  Last time around, we could try to allocate a crazy amount of
memory when reading garbage past the end of the WAL.  Fixed by
validating the record header first, as master does.
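
To make that concrete, here is a minimal, self-contained sketch of
the retry logic from point 1, folding in the validate-before-allocate
fix from point 2.  Every name in it is hypothetical (ToyReader,
give_up_for_now); the real patch expresses this through
XLogPageRead() and the reader's state, but the shape is the same:

/*
 * Toy model only, not the actual patch: all names are hypothetical.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyReader
{
    uint64_t record_start_lsn;  /* LSN where the current record begins */
    size_t   record_length;     /* total bytes needed, taken from a
                                 * header that was validated before any
                                 * allocation (the point 2 fix) */
    size_t   bytes_available;   /* contiguous bytes we actually have */
    int      queued_records;    /* decoded records ready to replay now */
} ToyReader;

/*
 * Returns true if the caller should abandon this record, replay the
 * queued records, and retry later.  Only the record's start LSN is
 * kept, so the next attempt re-reads from the beginning of the record
 * (occasionally re-fetching the previous WAL page too); no
 * intra-record decoding state survives the retry.
 */
static bool
give_up_for_now(const ToyReader *r, uint64_t *restart_lsn)
{
    if (r->bytes_available < r->record_length && r->queued_records > 0)
    {
        *restart_lsn = r->record_start_lsn;
        return true;
    }
    return false;
}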

New work:

Since last time, I went away and worked on a "real" AIO version of
this feature.  That's ongoing experimental work for a future proposal,
but I have a working prototype and I aim to share that soon, when that
branch is rebased to catch up with recent changes.  In that version,
the prefetcher starts actual reads into the buffer pool, and recovery
receives already pinned buffers attached to the stream of records it's
replaying.

That inspired a couple of refactoring changes to this non-AIO version,
to minimise the difference and anticipate the future work better:

1.  The logic for deciding which block to start prefetching next is
moved into a new callback function in a sort of standard form (this is
approximately how most prefetching code looks in the AIO project, i.e.
sequential scans, bitmap heap scan, etc); there's a rough sketch of
the shape after this list.

2.  The logic for controlling how many I/Os are running and deciding
when to call the above is in a separate component (also sketched
below).  In this non-AIO version, it works using a simple ring buffer
of LSNs to estimate the number of in-flight I/Os, just like before.
This part would be thrown away and replaced with the AIO branch's
centralised "streaming read" mechanism, which tracks I/O completions
based on a stream of completion events from the kernel (or I/O worker
processes).

3.  In this version, the prefetcher still doesn't pin buffers, for
simplicity.  That work did force me to study places where WAL streams
need prefetching "barriers", though, so in this patch you can
see that it's now a little more careful than it probably needs to be.
(It doesn't really matter much if you call posix_fadvise() on a
non-existent file region, or the wrong file after OID wraparound and
reuse, but it would matter if you actually read it into a buffer, and
if an intervening record might be trying to drop something you have
pinned.)
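
To make the division of labour concrete, here is a rough,
self-contained sketch of points 1 and 2 above.  All names are
hypothetical (PrefetchNextFn, PrefetchQueue, prefetch_pump); this is
the shape of the thing, not the actual code:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

typedef enum
{
    PREFETCH_STARTED,   /* issued posix_fadvise() for one block */
    PREFETCH_NO_MORE    /* no further block reference available yet */
} PrefetchResult;

/* Point 1: a callback in standard form decides which block to
 * prefetch next, issues the fadvise, and reports what happened. */
typedef PrefetchResult (*PrefetchNextFn)(void *arg, Lsn *started_at);

#define MAX_IN_FLIGHT 64

/* Point 2: a ring buffer of LSNs estimates the current I/O depth. */
typedef struct PrefetchQueue
{
    Lsn    lsns[MAX_IN_FLIGHT]; /* LSN of the record that triggered
                                 * each still-in-flight prefetch */
    size_t head;
    size_t tail;
    size_t inflight;
} PrefetchQueue;

/*
 * Retire entries once replay passes the LSN that triggered them (the
 * corresponding pread() must have happened by then), then start new
 * prefetches while below the depth limit.  In the AIO version, this
 * whole component is replaced by the centralised "streaming read"
 * mechanism, driven by real completion events instead of this
 * LSN-based estimate.
 */
static void
prefetch_pump(PrefetchQueue *q, Lsn replayed_up_to,
              PrefetchNextFn next, void *arg)
{
    while (q->inflight > 0 && q->lsns[q->tail] <= replayed_up_to)
    {
        q->tail = (q->tail + 1) % MAX_IN_FLIGHT;
        q->inflight--;
    }

    while (q->inflight < MAX_IN_FLIGHT)
    {
        Lsn started_at;

        if (next(arg, &started_at) != PREFETCH_STARTED)
            break;
        q->lsns[q->head] = started_at;
        q->head = (q->head + 1) % MAX_IN_FLIGHT;
        q->inflight++;
    }
}

One way to picture point 3's "barriers" is as conditions under which
the callback above declines to start anything new until replay has
caught up to a particular record.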

Some other changes:

1.  I dropped the GUC recovery_prefetch_fpw.  It may be a useful
idea, but it's a niche concern and not worth worrying about for now.

2.  I simplified the stats.  Coming up with a good running average
system seemed like a problem for another day (the numbers before were
hard to interpret).  The new stats are super simple counters and
instantaneous values:

postgres=# select * from pg_stat_prefetch_recovery ;
-[ RECORD 1 ]--+------------------------------
stats_reset    | 2021-11-10 09:02:08.590217+13
prefetch       | 13605674 <- times we called posix_fadvise()
hit            | 24185289 <- times we found pages already cached
skip_init      | 217215   <- times we did nothing because init, not read
skip_new       | 192347   <- times we skipped because relation too small
skip_fpw       | 27429    <- times we skipped because fpw, not read
wal_distance   | 10648    <- how far ahead in WAL bytes
block_distance | 134      <- how far ahead in block references
io_depth       | 50       <- fadvise() calls not yet followed by pread()

I also removed the code to save and restore the stats via the stats
collector, for now.  I figured that persistent stats could be a later
feature, perhaps after the shared memory stats stuff?

3.  I dropped the code that was caching an SMgrRelation pointer to
avoid smgropen() calls that showed up in some profiles.  That cache
probably needed invalidation, which could be done with some more WAL
analysis, but I decided to leave it out completely for now for
simplicity.

4.  I dropped the verbose logging.  I think it might make sense to
integrate with the new "recovery progress" system, but I think that
should be a separate discussion.  If you want to see the counters
after crash recovery finishes, you can look at the stats view.

[1] https://commitfest.postgresql.org/34/2113/
