Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From | Thomas Munro
---|---
Subject | Re: WIP: WAL prefetch (another approach)
Date |
Msg-id | CA+hUKGJ7OqpdnbSTq5oK=djSeVW2JMnrVPSm8JC-_dbN6Y7bpw@mail.gmail.com
In response to | Re: WIP: WAL prefetch (another approach) (Daniel Gustafsson <daniel@yesql.se>)
Responses | Re: WIP: WAL prefetch (another approach); Re: WIP: WAL prefetch (another approach)
List | pgsql-hackers
On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> Could you post an updated version of the patch which is for review?

Sorry for taking so long to come back; I learned some new things that made me want to restructure this code a bit (see below). Here is an updated pair of patches that I'm currently testing.

Old problems:

1. Last time around, an infinite loop was reported in pg_waldump. I believe Horiguchi-san has fixed that[1], but I'm no longer depending on that patch. I thought his patch set was a good idea, but it's complicated and there's enough going on here already... let's consider it independently. This version goes back to what I had earlier, though (I hope) it is better about how "nonblocking" states are communicated. In this version, XLogPageRead() has a way to give up partway through a record if it doesn't have enough data and there are queued-up records that could be replayed right now. In that case, we'll go back to the beginning of the record (and occasionally, back a WAL page) next time we try. That's the cost of not maintaining intra-record decoding state.

2. Last time around, we could try to allocate a crazy amount of memory when reading garbage past the end of the WAL. Fixed, by validating first, as on master.

New work:

Since last time, I went away and worked on a "real" AIO version of this feature. That's ongoing experimental work for a future proposal, but I have a working prototype, and I aim to share it soon, once that branch has been rebased to catch up with recent changes. In that version, the prefetcher starts actual reads into the buffer pool, and recovery receives already-pinned buffers attached to the stream of records it's replaying. That inspired a couple of refactoring changes to this non-AIO version, to minimise the differences and better anticipate the future work:

1. The logic for deciding which block to prefetch next has moved into a new callback function in a sort of standard form (this is approximately how most of the prefetching code looks in the AIO project, i.e. sequential scans, bitmap heap scan, etc).

2. The logic for controlling how many I/Os are running, and for deciding when to call the above, is in a separate component. In this non-AIO version, it works using a simple ring buffer of LSNs to estimate the number of in-flight I/Os, just as before (a toy sketch of that accounting follows this list). This part would be thrown away and replaced with the AIO branch's centralised "streaming read" mechanism, which tracks I/O completions based on a stream of completion events from the kernel (or I/O worker processes).

3. In this version, the prefetcher still doesn't pin buffers, for simplicity. That work did force me to study places where WAL streams need prefetching "barriers", though, so in this patch you can see that it's now a little more careful than it probably needs to be. (It doesn't really matter much if you call posix_fadvise() on a non-existent file region, or on the wrong file after OID wraparound and reuse, but it would matter if you actually read it into a buffer, and an intervening record might be trying to drop something you have pinned.)
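To make that in-flight estimate concrete, here is a toy model of the accounting described in point 2. This is only an illustration for readers following along, under made-up names; it is not the patch's actual code.

/*
 * Toy model of the in-flight I/O estimate: a ring buffer of the LSNs
 * at which posix_fadvise() calls were issued.  An I/O is treated as
 * complete once replay has progressed past its LSN.  All names here
 * are hypothetical, for illustration only.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define MAX_IN_FLIGHT 64

typedef struct PrefetchQueue
{
    XLogRecPtr  lsns[MAX_IN_FLIGHT];    /* LSN recorded per fadvise call */
    int         head;                   /* next slot to fill */
    int         tail;                   /* oldest outstanding entry */
    int         depth;                  /* current in-flight estimate */
} PrefetchQueue;                        /* zero-initialize before use */

/* Remember that an fadvise() was issued while replaying up to 'lsn'. */
static bool
prefetch_queue_push(PrefetchQueue *q, XLogRecPtr lsn)
{
    if (q->depth == MAX_IN_FLIGHT)
        return false;                   /* full: stop prefetching for now */
    q->lsns[q->head] = lsn;
    q->head = (q->head + 1) % MAX_IN_FLIGHT;
    q->depth++;
    return true;
}

/* Retire entries whose LSN replay has passed; return the new depth. */
static int
prefetch_queue_update(PrefetchQueue *q, XLogRecPtr replayed_lsn)
{
    while (q->depth > 0 && q->lsns[q->tail] <= replayed_lsn)
    {
        q->tail = (q->tail + 1) % MAX_IN_FLIGHT;
        q->depth--;
    }
    return q->depth;
}

The value that falls out of prefetch_queue_update() is roughly what the stats view below reports as io_depth.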
Some other changes:

1. I dropped the GUC recovery_prefetch_fpw. I think it was a possibly useful idea, but it's a niche concern and not worth worrying about for now.

2. I simplified the stats. Coming up with a good running-average system seemed like a problem for another day (the numbers before were hard to interpret). The new stats are super simple counters and instantaneous values:

postgres=# select * from pg_stat_prefetch_recovery ;
-[ RECORD 1 ]--+------------------------------
stats_reset    | 2021-11-10 09:02:08.590217+13
prefetch       | 13605674   <- times we called posix_fadvise()
hit            | 24185289   <- times we found pages already cached
skip_init      | 217215     <- times we did nothing because init, not read
skip_new       | 192347     <- times we skipped because relation too small
skip_fpw       | 27429      <- times we skipped because fpw, not read
wal_distance   | 10648      <- how far ahead in WAL bytes
block_distance | 134        <- how far ahead in block references
io_depth       | 50         <- fadvise() calls not yet followed by pread()

(A minimal sketch of one such fadvise hint appears at the end of this message.) I also removed the code to save and restore the stats via the stats collector, for now. I figured that persistent stats could be a later feature, perhaps after the shared memory stats stuff?

3. I dropped the code that was caching an SMgrRelation pointer to avoid the smgropen() calls that showed up in some profiles. It probably lacked invalidation, which could have been added with some more WAL analysis, but I decided to leave it out completely for now, for simplicity.

4. I dropped the verbose logging. I think it might make sense to integrate with the new "recovery progress" system, but that should be a separate discussion. If you want to see the counters after crash recovery finishes, you can look at the stats view.

[1] https://commitfest.postgresql.org/34/2113/
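Since the prefetch counter above is literally a count of posix_fadvise() calls, here is a minimal sketch of what one such hint looks like, for anyone not familiar with it. The names are made up for illustration; this is not the patch's code.

/*
 * Minimal sketch of the kind of hint counted by "prefetch" above:
 * ask the kernel to start reading one block into its page cache in
 * the background.  Names are hypothetical, for illustration only.
 */
#include <fcntl.h>
#include <stdio.h>

#define BLCKSZ 8192                     /* PostgreSQL's default block size */

static int
hint_block(int fd, long blocknum)
{
    /*
     * POSIX_FADV_WILLNEED initiates an asynchronous read and returns
     * immediately; note that posix_fadvise() returns an error number
     * directly rather than setting errno.
     */
    int     rc = posix_fadvise(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                               POSIX_FADV_WILLNEED);

    if (rc != 0)
        fprintf(stderr, "posix_fadvise: error %d\n", rc);
    return rc;
}

Because the hint is advisory and asynchronous, issuing it against a bogus region is harmless, which is the property point 3 under "New work" relies on; an actual read into a shared buffer would not be so forgiving.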