AIO / read stream heuristics adjustments for index prefetching - Mailing list pgsql-hackers

Hi,

The index prefetching patchset [1] contains a few adjustments to the read
stream logic for readahead.  It seemed better to discuss them separately than
in that already very large thread.

The first two patches are also a dependency of the explain read stream
patches [2].


There are two main areas that prefetching of table data during an index scan
is more sensitive to than existing read stream users:

1) Prefetching for index scans is much more sensitive to overly aggressive
   readahead, due to plans that run index scans to partial completion rather
   than full completion. Consider e.g. nestloop antijoins or such, where the
   scan on the inner side will be started but often not completed.  If we
   unnecessarily read ahead too aggressively, a lot of IO could be wasted.

   While it's of course possible to have partially consumed read streams with
   sequential scans or bitmap heap scans, it's not as common / cost sensitive.

   For seqscans it likely mostly happens with a LIMIT above the seqscan, but
   that probably won't be happening many times within a query on a table of
   any size.

   For bitmap heap scans it's not as common because the startup cost, i.e. the
   building of the bitmap, is far from cheap; doing that over and over does
   not make a lot of sense.


2) Prefetching for index scans is much more likely to have complicated mixes
   of hits and misses.

   Whereas a seqscan or a bitmap heap scan accesses each table block exactly
   once, with index scans it's very common to have repeated accesses to some
   table blocks, while still having misses on other blocks.  This means that
   index scans are more sensitive to patterns of hits and misses decreasing
   the readahead distance so much that we no longer read ahead aggressively
   enough to avoid waiting for IO.

   While more pronounced with index prefetching, it was already an issue with
   the existing users, particularly for bitmap heap scans. In fact, a similar
   patch to what's included here was first discussed somewhere around the BHS
   prefetching work.


There's a few sets of changes here:

0001+0002:  Return whether WaitReadBuffers() needed to wait

    The first patch allows pgaio_wref_check_done() to work more reliably with
    io_uring. Until now it was only able to return true if userspace had
    already consumed the kernel's completion event, and returned false
    otherwise.  That's not incorrect, just suboptimal.

    The second patch returns whether WaitReadBuffers() needed to wait for
    IO. This is useful for a) instrumentation like in [2] and b) to provide
    information to the read_stream heuristics to control how aggressive to
    perform read ahead.


0003:  read_stream: Issue IO synchronously while in fast path

    When the read stream is in fast path mode (where it short-circuits the
    read ahead logic, to reduce CPU overhead in s_b resident workloads) and
    encounters a miss, we have so far performed the read asynchronously.

    Unfortunately, with the worker IO method, that can lead to slowdowns,
    because dispatching IOs to workers has a latency impact. When doing "real"
    readahead, that's a price worth paying, because the latency should be
    hidden by issuing the reads early enough. But when just coming out of fast
    path mode, we're not ahead of what's needed, so the dispatch latency can't
    be hidden.

    We already have infrastructure to mark IOs to be executed
    synchronously. So we just need to use that here.


0004:  read_stream: Prevent distance from decaying too quickly

    This quite simple patch reduces issue 2) from above by preventing the
    look-ahead distance from being reduced for a number of blocks (the maximum
    lookahead distance) after each miss.  While this may seem overly
    aggressive, a single effectively synchronous read can take a long time
    compared to the CPU time needed for processing page hits.  On cloud
    storage the IO latency is somewhere between 0.5ms and 4ms. A halfway
    modern CPU can perform heap_hot_search_buffer() on thousands of pages
    within 1 ms.

    While this one is my patch, several others have written variations of it
    before. We should probably have committed one already.


    There are two minor questions here:
    - Should read_stream_pause()/read_stream_resume() restore the "holdoff"
      counter?  I doubt it matters for the prospective user, since it will
      only be used when the lookahead distance is very large.

    - For how long to hold off distance reductions?  Initially I was torn
      between using "max_pinned_buffers" (Min(max_ios * io_combine_limit,
      cap)) and "max_ios" ([maintenance_]effective_io_concurrency). But I
      think the former makes more sense, as we otherwise won't allow for far
      enough readahead when doing IO combining, and it does seem to make sense
      to hold off decay for long enough that the maximum lookahead could not
      theoretically allow us to start an IO.


0005+0006:  Only increase distance when waiting for IO

    Until now we have increased the read ahead distance whenever we needed to
    do IO (doubling the distance on every miss). But that will often be way
    too aggressive, with the IO subsystem being able to keep up at a much
    lower distance.

    The idea here is to use information about whether we needed to wait for IO
    before returning the buffer in read_stream_next_buffer() to control
    whether we should increase the readahead distance.

    This seems to work extremely well for worker.

    Unfortunately, with io_uring the situation is more complicated, because
    io_uring performs reads synchronously during submission if the data is in
    the kernel page cache.  This can reduce performance substantially compared
    to worker, because it prevents parallelizing the copy from the page cache.
    There is an existing heuristic for that in method_io_uring.c that adds a
    flag to the IO submissions forcing the IO to be processed asynchronously,
    allowing for parallelism.  Unfortunately the heuristic is triggered by the
    number of IOs in flight - which will never grow big enough to trigger once
    "needed to wait" is used to control how far to read ahead.

    So 0005 expands the io_uring heuristic to also trigger based on the sizes
    of IOs - but that's decidedly not perfect; e.g. we have some experiments
    showing it regressing some parallel bitmap heap scan cases.  It may be
    better to somehow tweak the logic to only trigger for worker.

    As is, this has another issue: it prevents IO combining in situations
    where it shouldn't, because right now the distance is used to control
    both. See 0008 for an attempt at splitting those concerns.


0007: Make read_stream_reset()/end() not wait for IO

    This is a quite experimental, not really correct as-is, patch to avoid
    unnecessarily waiting for in-flight IO when read_stream_reset() is done
    while there's in-flight IO.  This is useful for things like nestloop
    antijoins with quals on the inner side (without the qual we'd not trigger
    any readahead, as that's deferred in the index prefetching patch).

    As-is this will leave IOs visible in pg_aios for a while, potentially
    until the backends exit. That's not right.


0008: WIP: read stream: Split decision about look ahead for AIO and combining

    Until now read stream has used a single look-ahead distance to control
    lookahead for both IO combining and read-ahead. That's sub-optimal, as we
    want to do IO combining even when we don't need to do any readahead, as
    avoiding the syscall overhead is important to reduce CPU overhead when
    data is in the kernel page cache.

    This is a prototype for what it could look like to split those decisions,
    thereby fixing the regression mentioned in 0006.



One thing that's really annoying around this is that we have no infrastructure
for testing that the heuristics keep working. It's very easy to improve one
thing while breaking something else, without noticing, because everything
keeps working.

I'm wondering about something like a READ_STREAM_DEBUG_INSTRUMENT flag which
would trigger providing information about the IOs and their schedule via the
per-buffer-data mechanism.  That would allow test_aio's
read_stream_for_blocks() to return that information, which in turn could be
used to verify that we are doing IO combining and looking ahead far enough in
some situations.


Greetings,

Andres Freund

[1] https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
[2] https://postgr.es/m/6f541abf-f9e1-4830-93cc-4a849dbf2ecf%40vondra.me

