
From: Thomas Munro
Subject: Re: WIP: WAL prefetch (another approach)
Msg-id: CA+hUKGJ7_niJzcCHX5JWFo-05pLbzwukQfFgQi4-h46yZh=tqA@mail.gmail.com
In response to: Re: WIP: WAL prefetch (another approach)  (Dmitry Dolgov <9erthalion6@gmail.com>)
Responses: Re: WIP: WAL prefetch (another approach)  (Dmitry Dolgov <9erthalion6@gmail.com>)
List: pgsql-hackers
On Sun, Apr 19, 2020 at 11:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Thanks for working on this patch, it seems like a great feature. I'm
> probably a bit late to the party, but still want to make a couple of
> comments.

Hi Dmitry,

Thanks for your feedback and your interest in this work!

> The patch indeed looks good, I couldn't find any significant issues so
> far, and almost all of the questions I had while reading it were actually
> answered in this thread. I'm still busy with benchmarking, mostly to see
> how prefetching would work with different workload distributions and how
> much the kernel will actually prefetch.

Cool.

One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher-frequency lseek()
calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet.  I'd be interested to know if
you see that.

> In the meantime I have a few questions:
>
> > 1.  It now uses effective_io_concurrency to control how many
> > concurrent prefetches to allow.  It's possible that we should have a
> > different GUC to control "maintenance" users of I/O concurrency as
> > discussed elsewhere[1], but I'm staying out of that for now; if we
> > agree to do that for VACUUM etc, we can change it easily here.  Note
> > that the value is percolated through the ComputeIoConcurrency()
> > function which I think we should discuss, but again that's off topic,
> > I just want to use the standard infrastructure here.
>
> This totally makes sense, I believe the question "how much to prefetch"
> eventually depends equally on the type of workload (which correlates with
> how far ahead in the WAL to read) and on how many resources are available
> for prefetching (which correlates with queue depth). But in the
> documentation it looks like maintenance-io-concurrency is just an
> "unimportant" option, and I'm almost sure it will be overlooked by many
> readers:
>
>     The maximum distance to look ahead in the WAL during recovery, to find
>     blocks to prefetch.  Prefetching blocks that will soon be needed can
>     reduce I/O wait times.  The number of concurrent prefetches is limited
>     by this setting as well as
>     <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
>     might be counterproductive, if it means that data falls out of the
>     kernel cache before it is needed.  If this value is specified without
>     units, it is taken as bytes.  A setting of -1 disables prefetching
>     during recovery.
>
> Maybe it also makes sense to emphasize that maintenance-io-concurrency
> directly affects resource consumption and is a "primary control"?

You're right.  I will add something in the next version to emphasise that.

> > On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:
> >
> > Here's a new version that changes that part just a bit more, after a
> > brief chat with Andres about his async I/O plans.  It seems clear that
> > returning an enum isn't very extensible, so I decided to try making
> > PrefetchBufferResult a struct whose contents can be extended in the
> > future.  In this patch set it's still just used to distinguish 3 cases
> > (hit, miss, no file), but it's now expressed as a buffer and a flag to
> > indicate whether I/O was initiated.  You could imagine that the second
> > thing might be replaced by a pointer to an async I/O handle you can
> > wait on or some other magical thing from the future.
>
> I like the idea of an extensible PrefetchBufferResult. Just one comment:
> if I understand correctly, the way it is used together with
> prefetch_queue assumes one IO operation at a time. This limits potential
> extension of the underlying code, e.g. one can't implement some sort of
> buffering of requests and submit an iovec to a syscall, because then
> prefetch_queue would no longer correctly represent in-flight IO. Also,
> taking into account that "we don't have any awareness of when I/O really
> completes", maybe in the future it makes sense to reconsider having the
> queue in the prefetcher itself and to instead ask for this information
> from the underlying code?

Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls.  Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3] or lio_listio() on a
hypothetical POSIX AIO implementation[4].  (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).

Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.
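
To make that a bit more concrete, here's a minimal stand-alone sketch,
not the patch's actual code: the struct matches what I described above
(a buffer plus an initiated-I/O flag), while PrefetchBufferSubmit() and
the batch structure are invented here purely for illustration.

#include <stdbool.h>

typedef int Buffer;             /* placeholder for the real Buffer type */

typedef struct PrefetchBufferResult
{
    Buffer      recent_buffer;  /* valid buffer if the block was already cached */
    bool        initiated_io;   /* true if an asynchronous read was started */
    /* room to grow: e.g. an AIO handle you could wait on, in the future */
} PrefetchBufferResult;

/* A small batch of deferred prefetch requests, identified by block number. */
#define PREFETCH_BATCH_SIZE 16

typedef struct PrefetchBatch
{
    unsigned    blocks[PREFETCH_BATCH_SIZE];
    int         nqueued;
} PrefetchBatch;

/*
 * Hypothetical PrefetchBufferSubmit(): flush everything queued so far in a
 * single batched call (io_uring_enter(), lio_listio(), ...) instead of one
 * posix_fadvise() per block.  Callers would have to invoke this often enough
 * that requests don't sit in the batch for too long.
 */
static void
PrefetchBufferSubmit(PrefetchBatch *batch)
{
    if (batch->nqueued == 0)
        return;
    /* ... hand batch->blocks[0 .. nqueued - 1] to the AIO layer in one call ... */
    batch->nqueued = 0;
}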

As for how to make the prefetch queue a reusable component, rather
than having a custom thing like that for each part of our system that
wants to support prefetching: that's a really good question.  I didn't
see how to do it, but maybe I didn't try hard enough.  I looked at the
three users I'm aware of, namely this patch, a btree prefetching patch
I haven't shared yet, and the existing bitmap heap scan code, and they
all needed their own custom bookkeeping for this, and I couldn't
figure out how to share more infrastructure.  In the case of this
patch, you currently need to do LSN-based bookkeeping to simulate
"completion", and that doesn't make sense for other users.  Maybe
it'll become clearer when we have support for completion notification?
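
To spell out what I mean by LSN-based bookkeeping, here's a rough
sketch (names and layout invented for illustration, not taken from the
patch): remember the LSN of the record that caused each prefetch, and
treat the I/O as "complete" once replay has advanced past that point,
since posix_fadvise() gives us no real completion events.

#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* placeholder for the real WAL pointer type */

#define PREFETCH_QUEUE_SIZE 64

typedef struct PrefetchQueue
{
    XLogRecPtr  issued_at[PREFETCH_QUEUE_SIZE]; /* LSN that triggered each prefetch */
    int         head;                           /* oldest in-flight entry */
    int         inflight;                       /* simulated queue depth */
} PrefetchQueue;

/* Retire entries whose triggering LSN replay has already passed. */
static void
prefetch_queue_retire(PrefetchQueue *queue, XLogRecPtr replayed_up_to)
{
    while (queue->inflight > 0 &&
           queue->issued_at[queue->head] <= replayed_up_to)
    {
        queue->head = (queue->head + 1) % PREFETCH_QUEUE_SIZE;
        queue->inflight--;
    }
}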

Some related questions are why all these parts of our system that know
how to prefetch are allowed to do so independently without any kind of
shared accounting, and why we don't give each tablespace (= our model
of a device?) its own separate queue.  I think it's OK to put these
questions off a bit longer until we have more infrastructure and
experience.  Our current non-answer is at least consistent with our
lack of an approach to system-wide memory and CPU accounting...  I
personally think that a better XLogReader that can be used for
prefetching AND recovery would be a higher priority than that.

> > On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:
> > > Is there a way we could have a "historical" version of at least some of
> > > these? An average queue depth, or such?
> >
> > Ok, I added simple online averages for distance and queue depth that
> > take a sample every time recovery advances by 256kB.
>
> Maybe it was discussed in the past in other threads. But if I understand
> correctly, this implementation weights all the samples equally. Since at
> the moment it depends directly on replay speed (so a lot of IO is
> involved), couldn't a single outlier at the beginning skew this value
> and make it less useful? Does it make sense to decay old values?

Hmm.

I wondered about reporting one or perhaps three exponential moving
averages (like Unix 1/5/15 minute load averages), but I didn't propose
it because: (1) in crash recovery, you can't query it, you just get
the log message at the end, and an unweighted mean seems OK in that
case, no? (you are not more interested in the I/O saturation at the
end of recovery than at the start, are you?), and (2) on a streaming
replica, if you want to sample the instantaneous depth and compute an
exponential moving average or some more exotic statistical concoction
in your monitoring tool, you're free to do so.  I suppose (2) is an
argument for removing the existing average completely from the stat
view; I put it in there at Andres's suggestion, but I'm not sure I
really believe in it.  Where is our average replication lag, and why
don't we compute the stddev of X, Y or Z?  I think we should provide
primary measurements and let people compute derived statistics from
those.
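
As a side-by-side illustration (again just a sketch, not code from the
patch), here's the difference between the simple online unweighted
mean that the stats currently report and an exponential moving average
that decays old samples, Unix-load-average style; alpha is an
arbitrary illustrative value:

#include <stdio.h>

/* Simple online (unweighted) mean: every sample counts the same forever. */
typedef struct OnlineMean
{
    double      mean;
    long        count;
} OnlineMean;

static void
online_mean_add(OnlineMean *m, double sample)
{
    m->count++;
    m->mean += (sample - m->mean) / m->count;
}

/* Exponential moving average: old samples decay with factor (1 - alpha). */
static void
ema_add(double *ema, double sample, double alpha)
{
    *ema = alpha * sample + (1.0 - alpha) * *ema;
}

int
main(void)
{
    OnlineMean  m = {0.0, 0};
    double      ema = 0.0;
    double      samples[] = {100.0, 4.0, 5.0, 6.0, 5.0, 4.0}; /* early outlier */

    for (int i = 0; i < 6; i++)
    {
        online_mean_add(&m, samples[i]);
        ema_add(&ema, samples[i], 0.3);
    }

    /* The unweighted mean stays dominated by the outlier; the EMA decays it away. */
    printf("unweighted mean = %.2f, EMA = %.2f\n", m.mean, ema);
    return 0;
}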

I suppose the reason for this request was the analogy with Linux
iostat -x's "aqu-sz", which is the primary way that people understand
device queue depth on that OS.  This number is actually computed by
iostat, not the kernel, so by analogy I could argue that a
hypothetical pg_iostat program should compute that for you from raw
ingredients.  AFAIK iostat computes the *unweighted* average queue
depth during the time between output lines, by observing changes in
the "aveq" ("the sum of how long all requests have spent in flight, in
milliseconds") and "use" ("how many milliseconds there has been at
least one IO in flight") fields of /proc/diskstats.  But it's OK that
it's unweighted, because it computes a new value for every line it
outputs (ie every 5 seconds or whatever you asked for).  It's not too
clear how to do something like that here, but all suggestions are
welcome.
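
For reference, the calculation I'm describing boils down to something
like this (my reading of iostat, so treat it as illustrative rather
than authoritative): take two snapshots of the "aveq" field some
interval apart, and divide the delta by the interval.

/*
 * Average queue depth over an interval, iostat-style: "aveq" is the total
 * time all requests have spent in flight, in milliseconds, so dividing its
 * growth by the elapsed milliseconds gives the mean number of requests in
 * flight during that window.  E.g. 500 ms of growth over a 100 ms interval
 * means an average depth of 5.0.
 */
static double
avg_queue_depth(unsigned long long aveq_before,
                unsigned long long aveq_after,
                unsigned long long interval_ms)
{
    if (interval_ms == 0)
        return 0.0;
    return (double) (aveq_after - aveq_before) / (double) interval_ms;
}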

Or maybe future AIO infrastructure work will give us something more
general that makes this specific mechanism irrelevant.

On a more superficial note, one thing I don't like about the last
version of the patch is the difference in the ordering of the words in
the GUC recovery_prefetch_distance and the view
pg_stat_prefetch_recovery.  Hrmph.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BNPZeEdLXAcNr%2Bw0YOZVb0Un0_MwTBpgmmVDh7No2jbg%40mail.gmail.com
[2] https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
[3] https://kernel.dk/io_uring.pdf
[4] https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html


