Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Streaming I/O, vectored I/O (WIP)
Msg-id: CA+hUKGLDqKhxOODYD2C0Q5YGM697hWMksMoSN-SsUMRGgcH2+A@mail.gmail.com
In response to: Re: Streaming I/O, vectored I/O (WIP) (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Streaming I/O, vectored I/O (WIP) (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
1. I tried out Tomas's suggestion ALTER TABLESPACE ts SET (io_combine_limit = ...). I like it: it's simple and works nicely. Unfortunately we don't have support for units like '128kB' in reloptions.c, so for now it requires a number of blocks. That's not great, and we should probably fix it before merging, so I'm leaving that patch (v14-0004) separate, hopefully for later.

2. I also tried Tomas's suggestion of inventing a way to tell PostgreSQL what the OS readahead window size is. That allows for better modelling of whether the kernel is going to consider an access pattern to be sequential. Again, the ALTER TABLESPACE version would probably need unit support. This is initially just for experimentation, as it came up in discussions of BHS behaviour. I was against this idea originally, as it seemed like more complexity to have to explain and tune, and I'm not sure if it's worth the trouble... but in fact we are already in the business of second-guessing kernel logic, so there doesn't seem to be any good reason not to do it a bit better if we can. Do you think I have correctly understood what Linux is doing? The name I came up with is: effective_io_readahead_window. I wanted to make clear that it's a property of the system we are telling it about, not to be confused with our own look-ahead concept or whatever. Better names very welcome. This is also in a separate patch (v14-0005), left for later.

3. Another question I wondered about while retesting: does this need to be so low? I don't think so, so I've added a patch for that.

src/include/port/pg_iovec.h:#define PG_IOV_MAX Min(IOV_MAX, 32)

Now that I'm not using an array full of arrays of that size, I don't care so much how big we make that 32 (= 256kB @ 8kB), which clamps io_combine_limit. I think 128 (= 1MB @ 8kB) might be a decent arbitrary number. Sometimes we use it to size stack arrays, so I don't want to make it insanely large, but 128 should be fine.
I think it would be good to be able to at least experiment with up to 1MB (I'm not saying it's a good idea to do it, who knows?, just that there isn't a technical reason not to allow it AFAIK). FWIW every system on our target list that has p{read,write}v has IOV_MAX == 1024 (I checked {Free,Net,Open}BSD, macOS, illumos and Linux), so the Min(IOV_MAX, ...) really only clamps the systems where pg_{read,write}v fall back to loop-based emulation (Windows, Solaris), which is fine. PG_IOV_MAX also affects the routine that initialises new WAL files. I don't currently see a downside to doing that in 1MB chunks, as there was nothing sacred about the previous arbitrary number, and the code deals with short writes by retrying as it should.

4. I agree with Heikki's complaints about the BMR interface. It should be made more consistent and faster. I didn't want to make all of those changes touching AMs etc a dependency, though, so I spent some time trying to squeeze out regressions using some of these clues about calling conventions, likely hints, memory access and batching. I'm totally open to later improvements and refactoring of that stuff! Attached is the version with the best results I've managed to get.

My test is GCC -O3, pg_prewarm of a table of 220_000_000 integers = 7.6GB, which sometimes comes out around the same ~250ms on master and streaming pg_prewarm v14 on a random cloud ARM box I'm testing with, but not always; sometimes it's ~5-7ms more. (Unfortunately I don't have access to good benchmarking equipment right now; better numbers welcome.)

Two new ideas:

* give fast path mode a single flag, instead of testing all the conditions for every block
* give fast path mode a little buffer of future block numbers, so it can call the callback in batches

I'd tried that batch-calling thing before, and results were inconclusive, but I think sometimes it helps a bit. Note that it replaces the 'unget' thing from before, and it is possibly a tiny bit nicer anyway.
I'm a bit stumped about how to improve this further -- if anyone has any ideas for further improvements, I'm all ears. Zooming back out of micro-benchmark mode, it must be pretty hard to see in a real workload that actually does something with the buffers, like a sequential scan. Earlier complaints about all-cached sequential scan regressions were resolved many versions ago, AFAIK, by minimising the pin count in that case.

I just tried Melanie's streaming sequential scan patch, with a simple SELECT COUNT(*) WHERE i = -1, with the same all-cached table of 220 million integers. Patched consistently comes out ahead for all-in-kernel-cache, none-in-PG-cache: ~14.7s -> ~14.4s, and all-in-PG-cache: ~13.5s -> ~13.3s (which I don't have an explanation for). I don't claim any of that is particularly scientific; I just wanted to note that single-digit numbers of milliseconds of regression while pinning a million pages is clearly lost in the noise of other effects once you add in real query execution. That's half a dozen nanoseconds per page, if I counted right.

So, I am finally starting to think we should commit this, and decide which user patches are candidates.
Attachment
- v14-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v14-0002-Provide-API-for-streaming-relation-data.patch
- v14-0003-Use-streaming-I-O-in-pg_prewarm.patch
- v14-0004-ALTER-TABLESPACE-.-SET-io_combine_limit.patch
- v14-0005-Add-effective_io_readahead_window-setting.patch
- v14-0006-Increase-PG_IOV_MAX-for-bigger-io_combine_limit.patch