
From Thomas Munro
Subject Re: Streaming I/O, vectored I/O (WIP)
Date
Msg-id CA+hUKG+Mv6gwnZpsNyMBb60Fg2UPzsjk_gq=nAy29_F7-mLc7Q@mail.gmail.com
In response to Re: Streaming I/O, vectored I/O (WIP)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Feb 27, 2024 at 5:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's the 2 step version.  The streaming_read.c API is unchanged, but
> > the bufmgr.c API now has only the following extra functions:
> >
> >   bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
> >   WaitReadBuffers(ReadBuffersOperation *op)
>
> I wonder if there are semi-standard names that people use for this
> kind of API. Somehow I like "start" and "finish" or "start" and
> "complete" better than "start" and "wait". But I don't really know
> what's best. If there's a usual practice, it'd be good to adhere to
> it.

I think "complete" has a subtly different meaning, which is why I
liked Heikki's suggestion.  One way to look at it is that only one
backend can "complete" a read, but any number of backends can "wait".
But I don't have strong views on this.  It feels like this API is
relatively difficult to use directly anyway, so almost all users will
go through the streaming read API.

Here's a new version with two main improvements.  (Note that I'm only
talking about 0001-0003 here, the rest are useful for testing but are
just outdated versions of patches that have their own threads.)

1.  Melanie discovered small regressions in all-cached simple scans.
Here's a better look-ahead distance control algorithm that addresses
that.  First let me state the updated goals of the algorithm:

 A) for all-hit scans, pin very few buffers, since that can't help if
we're not doing any I/O
 B) for all-miss sequential scans, pin only as many buffers as it
takes to build full-sized I/Os, since fully sequential scans are left
to the OS to optimise for now (ie no "advice")
 C) for all-miss random scans, pin as many buffers as it takes to
reach our I/O concurrency level

In all cases, respect the per-backend pin limit as a last-resort cap
(roughly NBuffers / max_connections, e.g. around 160 buffers with the
default 128MB of shared_buffers and max_connections = 100), but we're
now actively trying *not* to use so many.

For patterns in between the A, B, C extremes, do something in between.
The way the new algorithm attempts to classify the scan adaptively
over time is as follows (a rough standalone sketch of the heuristic
follows the list):

 * look ahead distance starts out at one block (behaviour A)
 * every time we start an I/O, we double the distance until we reach
the max pin limit (behaviour C), or if we're not issuing "advice"
because sequential access is detected, until we reach
MAX_TRANSFER_BUFFERS (behaviour B)
 * every time we get a hit, we decrement the distance by one (we move
slowly back to behaviour A)
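
To make that concrete, here is a rough standalone C sketch of the
distance control just described.  The names, constants, and cap values
are placeholders invented for illustration, not the identifiers used in
the attached patches:

/*
 * Illustrative sketch of the look-ahead distance heuristic: start at one
 * block, double on every I/O up to a cap that depends on whether we're
 * issuing advice, and decay by one block on every hit.  Names and
 * constants are made up for this sketch.
 */
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_MAX_TRANSFER_BLOCKS 16   /* one full-sized 128kB read of 8kB blocks */
#define SKETCH_MAX_PINNED_BLOCKS   256  /* stand-in for ~NBuffers / max_connections */

struct look_ahead
{
    int  distance;          /* current look-ahead distance, in blocks */
    bool issuing_advice;    /* false once sequential access is detected */
};

static void
sketch_on_hit(struct look_ahead *la)
{
    /* drift back towards behaviour A (a single block) */
    if (la->distance > 1)
        la->distance--;
}

static void
sketch_on_io_start(struct look_ahead *la)
{
    /* ramp towards behaviour C (random + advice) or B (sequential) */
    int cap = la->issuing_advice ? SKETCH_MAX_PINNED_BLOCKS
                                 : SKETCH_MAX_TRANSFER_BLOCKS;

    la->distance *= 2;
    if (la->distance > cap)
        la->distance = cap;
}

int
main(void)
{
    struct look_ahead la = {1, false};
    int i;

    for (i = 0; i < 6; i++)     /* all-miss sequential: 2, 4, 8, 16, 16, 16 */
    {
        sketch_on_io_start(&la);
        printf("after miss %d: distance = %d\n", i + 1, la.distance);
    }
    for (i = 0; i < 3; i++)     /* hits: decay towards one block again */
    {
        sketch_on_hit(&la);
        printf("after hit %d: distance = %d\n", i + 1, la.distance);
    }
    return 0;
}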

Query to observe a system transitioning A->B->A->B, when doing a full
scan that has ~50 contiguous blocks already in shared buffers
somewhere in the middle:

create extension pg_prewarm;
create extension pg_buffercache;
set max_parallel_workers_per_gather = 0;

create table t (i int);
insert into t select generate_series(1, 100000);
select pg_prewarm('t');
select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('t')
   and (relblocknumber between 0 and 100 or relblocknumber > 150);

select count(*) from t;

pread(31,...) = 8192 (0x2000)      <--- start small (A)
preadv(31,...) = 16384 (0x4000)    <--- ramp up...
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000)  <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 49152 (0xc000)    <--- end of misses, decay to A
pread(31,...) = 8192 (0x2000)      <--- start small again (A)
preadv(31,...) = 16384 (0x4000)
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000)  <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 122880 (0x1e000)  <-- end of relation

Note that this never pins more than 16 buffers (16 x 8kB blocks =
131072 bytes, matching the full-sized preadv calls above), because
it's case B, not case C in the description above.  There is no benefit
in looking further ahead if you're relying on the kernel's sequential
prefetching.

The fully cached regression Melanie reported now stays entirely in A.
The previous coding would ramp up to high look-ahead distance for no
good reason and never ramp down, so it was wasteful and slightly
slower than master.

2.  Melanie and I noticed while discussing the pre-existing ad hoc
prefetching in bitmap heap scans that this thing should respect the
{effective,maintenance}_io_concurrency setting of the containing
tablespace.  Note the special cases to avoid problems while
stream-reading pg_tablespace itself and pg_database during backend
initialisation.
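
As a rough illustration of what respecting the containing tablespace
means here, the sketch below models the lookup with hypothetical
structs and defaults of my own; it is not the spccache code the
patches actually use:

/*
 * Hypothetical sketch: pick the I/O concurrency target from the
 * containing tablespace's options, falling back to the GUC defaults.
 * The struct, helper, and globals are invented for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

struct spc_options
{
    bool has_effective_override;    /* ALTER TABLESPACE ... SET (effective_io_concurrency = N) */
    int  effective_io_concurrency;
    bool has_maintenance_override;  /* ... SET (maintenance_io_concurrency = N) */
    int  maintenance_io_concurrency;
};

/* stand-ins for the GUC defaults */
static int guc_effective_io_concurrency = 1;
static int guc_maintenance_io_concurrency = 10;

static int
sketch_io_concurrency(const struct spc_options *spc, bool for_maintenance)
{
    if (for_maintenance)
        return spc->has_maintenance_override ? spc->maintenance_io_concurrency
                                             : guc_maintenance_io_concurrency;
    return spc->has_effective_override ? spc->effective_io_concurrency
                                       : guc_effective_io_concurrency;
}

int
main(void)
{
    struct spc_options spc = {true, 32, false, 0};

    printf("regular stream: %d\n", sketch_io_concurrency(&spc, false));    /* 32, from tablespace */
    printf("maintenance stream: %d\n", sketch_io_concurrency(&spc, true)); /* 10, GUC default */
    return 0;
}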

There is a third problem that I'm still working on: the behaviour for
very small values of effective_io_concurrency isn't quite right, as
discussed in detail by Tomas and Melanie on the bitmap heapscan
thread.  The attached makes 0 do approximately the right thing (though
I'm hoping to make it less special), but other small numbers aren't
quite right yet -- 1 is still issuing a useless fadvise at the wrong
time, and 2 is working in groups of N at a time instead of
interleaving as you might perhaps expect.  These are probably
side-effects of my focusing on coalescing large reads and losing sight
of the small cases.  I need a little more adaptivity and generality in
the algorithm at the small end, not least because 1 is the default
value.  I'll share a patch to improve that very soon.

