AIO / read stream heuristics adjustments for index prefetching - Mailing list pgsql-hackers

Hi,

The index prefetching patchset [1] contains a few adjustments to the read
stream logic for readahead.  It seemed better to discuss them separately than
in that already very large thread.

The first two patches are also a dependency of the explain read stream
patches [2].


There are two main areas that prefetching of table data during an index scan
is more sensitive to than existing read stream users:

1) Prefetching for index scans is much more sensitive to overly aggressive
   readahead, due to plans that run index scans to partial completion rather
   than full completion. Consider e.g. nestloop antijoins or such, where the
   scan on the inner side will be started but often not completed.  If we
   unnecessarily read ahead too aggressively, a lot of IO could be wasted.

   While it's of course possible to have partially consumed read streams with
   sequential scans or bitmap heap scans, it's not as common / cost sensitive.

   For seqscans it likely mostly happens with a LIMIT above the seqscan, but
   that probably won't be happening many times within a query on a table of
   any size.

   For bitmap heap scans it's not as common because the startup cost, i.e. the
   building of the bitmap, is far from cheap; doing that over and over does
   not make a lot of sense.


2) Prefetching for index scans is much more likely to have complicated mixes
   of hits and misses.

   Whereas a seqscan or a bitmap heap scan accesses each table block exactly
   once, with index scans it's very common to have repeated accesses to some
   table blocks, while still having misses on other blocks.  This means that
   index scans are more sensitive to patterns of hits and misses decreasing
   the readahead distance so much that we no longer read ahead aggressively
   enough to avoid waiting for IO.

   While more pronounced with index prefetching, it was already an issue with
   the existing users, particularly for bitmap heap scans. In fact, a similar
   patch to what's included here was first discussed somewhere around the BHS
   prefetching work.


There's a few sets of changes here:

0001+0002:  Return whether WaitReadBuffers() needed to wait

    The first patch allows pgaio_wref_check_done() to work more reliably with
    io_uring. Until now it was only able to return true if userspace had
    already consumed the kernel's completion event, and returned false
    otherwise.  That's not incorrect, just suboptimal.

    The second patch returns whether WaitReadBuffers() needed to wait for
    IO. This is useful for a) instrumentation like in [2] and b) to provide
    information to the read_stream heuristics to control how aggressive to
    perform read ahead.


0003:  read_stream: Issue IO synchronously while in fast path

    When the read stream is in fast path mode (where it short-circuits the
    read ahead logic, to reduce CPU overhead in s_b resident workloads) and
    encounters a miss, we have so far performed the read asynchronously.

    Unfortunately, with the worker IO method, that can lead to slowdowns,
    because dispatching IOs to workers has a latency impact. When doing "real"
    readahead, that's a price worth paying, because the latency should be
    hidden by issuing the reads early enough. But when just coming out of fast
    path mode, we're not ahead of what's needed, so the dispatch latency can't
    be hidden.

    We already have infrastructure to mark IOs to be executed
    synchronously. So we just need to use that here.


0004:  read_stream: Prevent distance from decaying too quickly

    This quite simple patch reduces issue 2) from above by preventing the
    look-ahead distance from being reduced for a number of blocks (the maximum
    lookahead distance) after each miss.  While this may seem overly
    aggressive, a single effectively synchronous read can take a long time
    compared to the CPU time needed for processing page hits.  On cloud
    storage the IO latency is somewhere between 0.5ms and 4ms. A halfway
    modern CPU can perform heap_hot_search_buffer() on thousands of pages
    within 1 ms.

    While this one is my patch, several others have written variations of it
    before. We should probably have committed one already.


    There are two minor questions here:
    - Should read_stream_pause()/read_stream_resume() restore the "holdoff"
      counter?  I doubt it matters for the prospective user, since it will
      only be used when the lookahead distance is very large.

    - For how long to hold off distance reductions?  Initially I was torn
      between using "max_pinned_buffers" (Min(max_ios * io_combine_limit,
      cap)) and "max_ios" ([maintenance_]effective_io_concurrency). But I
      think the former makes more sense, as we otherwise won't allow for far
      enough readahead when doing IO combining, and it does seem to make sense
      to hold off decay for long enough that the maximum lookahead could not
      theoretically allow us to start an IO.


0005+0006:  Only increase distance when waiting for IO

    Until now we have increased the read ahead distance whenever we needed to
    do IO (doubling the distance on every miss). But that will often be way
    too aggressive, with the IO subsystem being able to keep up at a much
    lower distance.

    The idea here is to use information about whether we needed to wait for IO
    before returning the buffer in read_stream_next_buffer() to control
    whether we should increase the readahead distance.

    This seems to work extremely well for worker.

    Unfortunately, with io_uring the situation is more complicated, because
    io_uring performs reads synchronously during submission if the data is in
    the kernel page cache.  This can reduce performance substantially compared
    to worker, because it prevents parallelizing the copy from the page cache.
    There is an existing heuristic for that in method_io_uring.c that adds a
    flag to the IO submissions forcing the IO to be processed asynchronously,
    allowing for parallelism.  Unfortunately the heuristic is triggered by the
    number of IOs in flight - which will never grow big enough to trigger once
    "needed to wait" is used to control how far to read ahead.

    So 0005 expands the io_uring heuristic to also trigger based on the sizes
    of IOs - but that's decidedly not perfect; e.g. we have some experiments
    showing it regressing some parallel bitmap heap scan cases.  It may be
    better to somehow tweak the logic to only trigger for worker.

    As is, this has another issue: it prevents IO combining in situations
    where it shouldn't, because right now the distance is used to control
    both. See 0008 for an attempt at splitting those concerns.


0007: Make read_stream_reset()/end() not wait for IO

    This is a quite experimental, not really correct as-is, patch to avoid
    unnecessarily waiting for in-flight IO when read_stream_reset() is done
    while there's in-flight IO.  This is useful for things like nestloop
    antijoins with quals on the inner side (without the qual we'd not trigger
    any readahead, as that's deferred in the index prefetching patch).

    As-is this will leave IOs visible in pg_aios for a while, potentially
    until the backends exit. That's not right.


0008: WIP: read stream: Split decision about look ahead for AIO and combining

    Until now read stream has used a single look-ahead distance to control
    lookahead for both IO combining and read-ahead. That's sub-optimal, as we
    want to do IO combining even when we don't need to do any readahead, as
    avoiding the syscall overhead is important to reduce CPU overhead when
    data is in the kernel page cache.

    This is a prototype for what it could look like to split those decisions,
    thereby fixing the regression mentioned in 0006.



One thing that's really annoying around this is that we have no infrastructure
for testing that the heuristics keep working. It's very easy to improve one
thing while breaking something else, without noticing, because everything
keeps working.

I'm wondering about something like a READ_STREAM_DEBUG_INSTRUMENT flag which
would trigger providing information about the IOs and their schedule via the
per-buffer-data mechanism.  That would allow test_aio's
read_stream_for_blocks() to return that information, which in turn could be
used to verify that we are doing IO combining and looking ahead far enough in
some situations.


Greetings,

Andres Freund

[1] https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
[2] https://postgr.es/m/6f541abf-f9e1-4830-93cc-4a849dbf2ecf%40vondra.me

