Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: Streaming I/O, vectored I/O (WIP) |
| Date | |
| Msg-id | CA+hUKG+Mv6gwnZpsNyMBb60Fg2UPzsjk_gq=nAy29_F7-mLc7Q@mail.gmail.com |
| In response to | Re: Streaming I/O, vectored I/O (WIP) (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: Streaming I/O, vectored I/O (WIP) |
| List | pgsql-hackers |
On Tue, Feb 27, 2024 at 5:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's the 2 step version. The streaming_read.c API is unchanged, but
> > the bufmgr.c API now has only the following extra functions:
> >
> > bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
> > WaitReadBuffers(ReadBuffersOperation *op)
>
> I wonder if there are semi-standard names that people use for this
> kind of API. Somehow I like "start" and "finish" or "start" and
> "complete" better than "start" and "wait". But I don't really know
> what's best. If there's a usual practice, it'd be good to adhere to it.

I think "complete" has a subtly different meaning, which is why I liked
Heikki's suggestion. One way to look at it is that only one backend can
"complete" a read, but any number of backends can "wait". But I don't
have strong views on this. It feels like this API is relatively
difficult to use directly anyway, so almost all users will go through
the streaming read API.

Here's a new version with two main improvements. (Note that I'm only
talking about 0001-0003 here; the rest are useful for testing but are
just outdated versions of patches that have their own threads.)

1. Melanie discovered small regressions in all-cached simple scans.
Here's a better look-ahead distance control algorithm that addresses
that. First let me state the updated goals of the algorithm:

A) for all-hit scans, pin very few buffers, since that can't help if
we're not doing any I/O
B) for all-miss sequential scans, pin only as many buffers as it takes
to build full-sized I/Os, since fully sequential scans are left to the
OS to optimise for now (ie no "advice")
C) for all-miss random scans, pin as many buffers as it takes to reach
our I/O concurrency level

In all cases, respect the per-backend pin limit as a last-resort limit
(roughly NBuffers / max_connections), but we're now actively trying
*not* to use so many. For patterns in between the A, B, C extremes, do
something in between.

The way the new algorithm attempts to classify the scan adaptively over
time is as follows:

* the look-ahead distance starts out at one block (behaviour A)
* every time we start an I/O, we double the distance until we reach the
max pin limit (behaviour C), or, if we're not issuing "advice" because
sequential access is detected, until we reach MAX_TRANSFER_BUFFERS
(behaviour B)
* every time we get a hit, we decrement the distance by one (we move
slowly back to behaviour A)
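Expressed as a rough sketch in C (illustrative only -- the struct and
function below are invented for this mail, not the patch's actual code),
the adjustment looks something like this:

#include <stdbool.h>

/* Invented, minimal state for this sketch; not the patch's real struct. */
typedef struct
{
    int distance;            /* current look-ahead window, in blocks */
    int max_pinned_buffers;  /* cap from the pin limit / I/O concurrency */
    int full_transfer;       /* full-sized combined read (MAX_TRANSFER_BUFFERS) */
} StreamSketch;

/*
 * Apply the rules above after examining one block: double the distance
 * when an I/O is started (capped according to whether we're issuing
 * advice), decrement it on a hit.
 */
static void
adjust_distance(StreamSketch *s, bool started_io, bool issuing_advice)
{
    if (started_io)
    {
        int cap = issuing_advice ? s->max_pinned_buffers   /* behaviour C */
                                 : s->full_transfer;       /* behaviour B */

        s->distance *= 2;
        if (s->distance > cap)
            s->distance = cap;
    }
    else if (s->distance > 1)
        s->distance--;       /* a hit: drift slowly back towards behaviour A */
}

The doubling escapes behaviour A quickly once misses start, while the
one-at-a-time decay means an occasional cached block doesn't immediately
throw the window away.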
Query to observe a system transitioning A->B->A->B, when doing a full
scan that has ~50 contiguous blocks already in shared buffers somewhere
in the middle:

create extension pg_prewarm;
create extension pg_buffercache;
set max_parallel_workers_per_gather = 0;
create table t (i int);
insert into t select generate_series(1, 100000);
select pg_prewarm('t');
select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('t')
   and (relblocknumber between 0 and 100 or relblocknumber > 150);
select count(*) from t;

pread(31,...) = 8192 (0x2000)      <--- start small (A)
preadv(31,...) = 16384 (0x4000)    <--- ramp up...
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000)  <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 49152 (0xc000)    <--- end of misses, decay to A
pread(31,...) = 8192 (0x2000)      <--- start small again (A)
preadv(31,...) = 16384 (0x4000)
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000)  <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 122880 (0x1e000)  <--- end of relation

Note that this never pins more than 16 buffers, because it's case B,
not case C in the description above. There is no benefit in looking
further ahead if you're relying on the kernel's sequential prefetching.
The fully cached regression Melanie reported now stays entirely in A.
The previous coding would ramp up to a high look-ahead distance for no
good reason and never ramp down, so it was wasteful and slightly slower
than master.

2. Melanie and I noticed, while discussing the pre-existing ad hoc
prefetching in bitmap heap scans, that this thing should respect the
{effective,maintenance}_io_concurrency setting of the containing
tablespace. Note the special cases to avoid problems while
stream-reading pg_tablespace itself and pg_database in backend
initialisation. (A small illustrative sketch of the lookup follows at
the end of this mail.)

There is a third problem that I'm still working on: the behaviour for
very small values of effective_io_concurrency isn't quite right, as
discussed in detail by Tomas and Melanie on the bitmap heapscan thread.
The attached makes 0 do approximately the right thing (though I'm
hoping to make it less special), but other small numbers aren't quite
right yet -- 1 is still issuing a useless fadvise at the wrong time,
and 2 is working in groups of N at a time instead of interleaving as
you might perhaps expect. These are probably side-effects of my
focusing on coalescing large reads and losing sight of the small cases.
I need a little more adaptivity and generality in the algorithm at the
small end, not least because 1 is the default value. I'll share a patch
to improve that very soon.
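Going back to point 2, the lookup itself can just go through the
existing tablespace cache. Here is a rough illustration; only
get_tablespace_io_concurrency(), get_tablespace_maintenance_io_concurrency()
and the two GUC variables are real, while the helper and its invalid-OID
test standing in for the backend-initialisation special case are
invented for this sketch rather than taken from the patch:

#include "postgres.h"
#include "storage/bufmgr.h"   /* effective_io_concurrency, maintenance_io_concurrency */
#include "utils/spccache.h"   /* get_tablespace_io_concurrency() and friends */

/*
 * Illustrative helper only (not from the patch): choose the I/O
 * concurrency level for a stream reading a relation in the given
 * tablespace.
 */
static int
stream_io_concurrency(Oid tablespace_oid, bool is_maintenance)
{
    /*
     * Stand-in for the special case mentioned above: while reading
     * pg_tablespace itself, or pg_database during backend
     * initialisation, the tablespace cache can't be consulted, so fall
     * back to the plain GUC values.
     */
    if (!OidIsValid(tablespace_oid))
        return is_maintenance ? maintenance_io_concurrency
                              : effective_io_concurrency;

    return is_maintenance
        ? get_tablespace_maintenance_io_concurrency(tablespace_oid)
        : get_tablespace_io_concurrency(tablespace_oid);
}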
Attachment
- v6-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v6-0002-Provide-API-for-streaming-reads-of-relations.patch
- v6-0003-Use-streaming-reads-in-pg_prewarm.patch
- v6-0004-WIP-Use-streaming-reads-in-heapam-scans.patch
- v6-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v6-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patch
- v6-0007-WIP-pg_buffercache_invalidate.patch