Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Streaming I/O, vectored I/O (WIP)
Date
Msg-id CA+hUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ@mail.gmail.com
In response to Re: Streaming I/O, vectored I/O (WIP)  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Streaming I/O, vectored I/O (WIP)
List pgsql-hackers
On Fri, Jan 12, 2024 at 12:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Ok. It feels surprising to have three steps. I understand that you need
> > two steps, one to start the I/O and another to wait for them to finish,
> > but why do you need separate Prepare and Start steps? What can you do in
> > between them? (You explained that. I'm just saying that that's my
> > initial reaction when seeing that API. It is surprising.)
[...]
> OK, I'm going to try the two-step version (again) with interfaces
> along the lines you sketched out...  more soon.

Here's the two-step version.  The streaming_read.c API is unchanged,
but the bufmgr.c API now has only the following extra functions:

  bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
  WaitReadBuffers(ReadBuffersOperation *op)

That is, the PrepareReadBuffer() step is gone.

StartReadBuffers() updates *nblocks to the number actually processed,
which is always at least one.  If it returns true, then you must call
WaitReadBuffers().  When it finds a 'hit' (a block that doesn't need
I/O), that one final (or only) buffer is processed, but no more.
StartReadBuffers() always conceptually starts 0 or 1 I/Os.  Example:
if you ask for 16 blocks and it finds two misses followed by a hit,
it'll set *nblocks = 3, smgrprefetch() the two missed blocks, and
smgrreadv() them in WaitReadBuffers().  The caller can't really tell
that the third block was a hit.  The only case it can distinguish is
when the first block is a hit: then it returns false and sets
*nblocks = 1.

This arrangement, where the results include the 'boundary' block that
ends the readable range, avoids the double-lookup problem we discussed
upthread.  I think it should probably also be able to handle multiple
consecutive 'hits' at the start of a sequence, but in this version I
kept it simpler.  It couldn't ever handle more than one hit after an
I/O range, though, because it can't guess whether the block after that
will be a hit or a miss.  If it turned out to be a miss, we don't want
to start a second I/O, so unless we decide that we're happy unpinning
and re-looking-up next time, it's better to give up there.  Hence the
idea of including the hit as a bonus block on the end.

It took me a long time, but I eventually worked my way around to
preferring this way over the three-step version.  streaming_read.c now
has to do a bit more work, including sometimes 'ungetting' a block
(i.e. deferring one that the callback has requested until next time),
to resolve some circularities that come up with flow control.  But I
suspect you'd probably finish up having to deal with 'short' reads
anyway, because in the asynchronous future, in a three-step version,
the StartReadBuffers() (as second step) might also be short when it
fails to get enough BM_IO_IN_PROGRESS flags, so you have to deal with
some version of these problems anyway.  Thoughts?

I am still thinking about how to improve the coding in
streaming_read.c, i.e. to simplify and beautify the main control loop
and improve the flow control logic.  I'm also looking for interesting
test cases that hit various conditions in it and try to break it, and
trying to figure out how this read-coalescing and parallel seq scan's
block allocator might interfere with each other to produce non-ideal
patterns of system calls.

Here are some example strace results generated by a couple of simple
queries.  See CF #4426 for pg_buffercache_invalidate().

=== Sequential scan example ===

create table big as select generate_series(1, 10000000);

select count(*) from big;

pread64(81, ...) = 8192   <-- starts small
pread64(81, ...) = 16384
pread64(81, ...) = 32768
pread64(81, ...) = 65536
pread64(81, ...) = 131072 <-- fully ramped up size reached
preadv(81, ...) = 131072  <-- more Vs seen as buffers fill up/fragments
preadv(81, ...) = 131072
...repeating...
preadv(81, ...) = 131072
preadv(81, ...) = 131072
pread64(81, ...) = 8192   <-- end fragment

create table small as select generate_series(1, 100000);

select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('small')
   and relblocknumber % 3 != 0;  -- <-- kick out every 3rd block

select count(*) from small;

preadv(88, ...) = 16384  <-- just the 2-block fragments we need to load
preadv(88, ...) = 16384
preadv(88, ...) = 16384

=== Bitmap heapscan example ===

create table heap (i int primary key);
insert into heap select generate_series(1, 1000000);

select bool_and(pg_buffercache_invalidate(bufferid))
 from pg_buffercache
where relfilenode = pg_relation_filenode('heap');

select count(i) from heap where i in (10, 1000, 10000, 100000) or i in
(20, 200, 2000, 200000);

pread64(75, ..., 8192, 0) = 8192
fadvise64(75, 32768, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 65536, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 32768) = 8192
pread64(75, ..., 8192, 65536) = 8192
fadvise64(75, 360448, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 3620864, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 7241728, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 360448) = 8192
pread64(75, ..., 8192, 3620864) = 8192
pread64(75, ..., 8192, 7241728) = 8192

Attachment
