Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: WIP: WAL prefetch (another approach)
Msg-id: CA+hUKGJ7_niJzcCHX5JWFo-05pLbzwukQfFgQi4-h46yZh=tqA@mail.gmail.com
In response to: Re: WIP: WAL prefetch (another approach) (Dmitry Dolgov <9erthalion6@gmail.com>)
Responses: Re: WIP: WAL prefetch (another approach)
List: pgsql-hackers
On Sun, Apr 19, 2020 at 11:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Thanks for working on this patch, it seems like a great feature. I'm
> probably a bit late to the party, but still want to make couple of
> commentaries.

Hi Dmitry,

Thanks for your feedback and your interest in this work!

> The patch indeed looks good, I couldn't find any significant issues so
> far and almost all my questions I had while reading it were actually
> answered in this thread. I'm still busy with benchmarking, mostly to see
> how prefetching would work with different workload distributions and how
> much the kernel will actually prefetch.

Cool. One report I heard recently said that if you get rid of I/O
stalls, pread() becomes cheap enough that the much higher frequency
lseek() calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet. I'd be interested to know if you
see that.

> In the meantime I have a few questions:
>
> > 1. It now uses effective_io_concurrency to control how many
> > concurrent prefetches to allow. It's possible that we should have a
> > different GUC to control "maintenance" users of concurrency I/O as
> > discussed elsewhere[1], but I'm staying out of that for now; if we
> > agree to do that for VACUUM etc, we can change it easily here. Note
> > that the value is percolated through the ComputeIoConcurrency()
> > function which I think we should discuss, but again that's off topic,
> > I just want to use the standard infrastructure here.
>
> This totally makes sense, I believe the question "how much to prefetch"
> eventually depends equally on a type of workload (correlates with how
> far in WAL to read) and how much resources are available for prefetching
> (correlates with queue depth).
> But in the documentation it looks like
> maintenance-io-concurrency is just an "unimportant" option, and I'm
> almost sure will be overlooked by many readers:
>
>   The maximum distance to look ahead in the WAL during recovery, to find
>   blocks to prefetch. Prefetching blocks that will soon be needed can
>   reduce I/O wait times. The number of concurrent prefetches is limited
>   by this setting as well as
>   <xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
>   might be counterproductive, if it means that data falls out of the
>   kernel cache before it is needed. If this value is specified without
>   units, it is taken as bytes. A setting of -1 disables prefetching
>   during recovery.
>
> Maybe it makes also sense to emphasize that maintenance-io-concurrency
> directly affects resource consumption and it's a "primary control"?

You're right. I will add something in the next version to emphasise that.

> > On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:
> >
> > Here's a new version that changes that part just a bit more, after a
> > brief chat with Andres about his async I/O plans. It seems clear that
> > returning an enum isn't very extensible, so I decided to try making
> > PrefetchBufferResult a struct whose contents can be extended in the
> > future. In this patch set it's still just used to distinguish 3 cases
> > (hit, miss, no file), but it's now expressed as a buffer and a flag to
> > indicate whether I/O was initiated. You could imagine that the second
> > thing might be replaced by a pointer to an async I/O handle you can
> > wait on or some other magical thing from the future.
>
> I like the idea of extensible PrefetchBufferResult. Just one commentary,
> if I understand correctly the way how it is being used together with
> prefetch_queue assumes one IO operation at a time. This limits potential
> extension of the underlying code, e.g.
> one can't implement some sort of
> buffering of requests and submitting an iovec to a sycall, then
> prefetch_queue will no longer correctly represent inflight IO. Also,
> taking into account that "we don't have any awareness of when I/O really
> completes", maybe in the future it makes to reconsider having queue in
> the prefetcher itself and rather ask for this information from the
> underlying code?

Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls. Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3] or lio_listio() on a
hypothetical POSIX AIO implementation[4]. (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle, so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio().)
Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.

As for how to make the prefetch queue a reusable component, rather
than having a custom thing like that for each part of our system that
wants to support prefetching: that's a really good question. I didn't
see how to do it, but maybe I didn't try hard enough. I looked at the
three users I'm aware of, namely this patch, a btree prefetching patch
I haven't shared yet, and the existing bitmap heap scan code, and they
all needed their own custom book-keeping for this, and I couldn't
figure out how to share more infrastructure. In the case of this
patch, you currently need to do LSN-based book-keeping to simulate
"completion", and that doesn't make sense for other users. Maybe it'll
become clearer when we have support for completion notification?
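To make the batching idea concrete, here is a toy sketch (all names
invented for illustration; this is not code from the patch): requests
accumulate in a buffer and a single flush call submits them all,
standing in for one io_uring_enter() or lio_listio() covering many
I/Os. It also shows why one-queue-slot-per-request accounting stops
matching the number of submissions once batching exists.

```python
# Hypothetical illustration only: BatchingPrefetcher, submit_fn and
# batch_size are invented names. The idea is to queue prefetch requests
# and flush them with one vectored submission, instead of making one
# posix_fadvise()-style call per block.
class BatchingPrefetcher:
    def __init__(self, batch_size, submit_fn):
        self.batch_size = batch_size   # flush threshold
        self.submit_fn = submit_fn     # stand-in for the vectored syscall
        self.pending = []              # requests not yet submitted
        self.inflight = 0              # submitted, completion unknown

    def prefetch(self, block):
        self.pending.append(block)
        if len(self.pending) >= self.batch_size:
            self.submit()

    def submit(self):
        # One "syscall" covers the whole batch, so a queue that counts
        # one entry per submission no longer reflects in-flight I/O.
        if self.pending:
            self.inflight += len(self.pending)
            self.submit_fn(list(self.pending))
            self.pending.clear()

batches = []
p = BatchingPrefetcher(batch_size=4, submit_fn=batches.append)
for blk in range(10):
    p.prefetch(blk)
p.submit()        # the PrefetchBufferSubmit()-style explicit flush
print(batches)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Ten single-block requests turn into three submissions here, which is
exactly the accounting mismatch described above.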
Some related questions are why all these parts of our system that know
how to prefetch are allowed to do so independently without any kind of
shared accounting, and why we don't give each tablespace (= our model
of a device?) its own separate queue. I think it's OK to put these
questions off a bit longer until we have more infrastructure and
experience. Our current non-answer is at least consistent with our
lack of an approach to system-wide memory and CPU accounting... I
personally think that a better XLogReader that can be used for
prefetching AND recovery would be a higher priority than that.

> > On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:
> > > Is there a way we could have a "historical" version of at least some of
> > > these? An average queue depth, or such?
> >
> > Ok, I added simple online averages for distance and queue depth that
> > take a sample every time recovery advances by 256kB.
>
> Maybe it was discussed in the past in other threads. But if I understand
> correctly, this implementation weights all the samples. Since at the
> moment it depends directly on replaying speed (so a lot of IO involved),
> couldn't it lead to a single outlier at the beginning skewing this value
> and make it less useful? Does it make sense to decay old values?

Hmm. I wondered about reporting one or perhaps three exponential
moving averages (like Unix 1/5/15 minute load averages), but I didn't
propose it because: (1) in crash recovery, you can't query it, you
just get the log message at the end, and an unweighted mean seems OK
in that case, no? (you are not more interested in the I/O saturation
at the end of recovery than at the start, are you?), and (2) on a
streaming replica, if you want to sample the instantaneous depth and
compute an exponential moving average or some more exotic statistical
concoction in your monitoring tool, you're free to do so.
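To put numbers on the outlier concern, here is a small sketch (not
from the patch) comparing an unweighted running mean with a decayed,
load-average-style exponential moving average; one large sample at the
start keeps pulling the former up long after the latter has forgotten
it:

```python
def unweighted_mean(samples):
    # Roughly what a simple all-samples online average computes.
    return sum(samples) / len(samples)

def ema(samples, alpha=0.2):
    # Exponential moving average: recent samples dominate, old ones
    # decay away geometrically (alpha is illustrative, not from the patch).
    v = samples[0]
    for x in samples[1:]:
        v = alpha * x + (1 - alpha) * v
    return v

# One early queue-depth outlier (100) followed by steady behaviour (2).
depths = [100] + [2] * 99
print(round(unweighted_mean(depths), 2))  # 2.98 -- still skewed by the outlier
print(round(ema(depths), 2))              # 2.0  -- the outlier has decayed away
```

With three different alphas you would get the three 1/5/15-style
horizons mentioned above.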
I suppose (2) is an argument for removing the existing average
completely from the stat view; I put it in there at Andres's
suggestion, but I'm not sure I really believe in it. Where is our
average replication lag, and why don't we compute the stddev of X, Y
or Z? I think we should provide primary measurements and let people
compute derived statistics from those.

I suppose the reason for this request was the analogy with Linux
iostat -x's "aqu-sz", which is the primary way that people understand
device queue depth on that OS. This number is actually computed by
iostat, not the kernel, so by analogy I could argue that a
hypothetical pg_iostat program could compute that for you from raw
ingredients. AFAIK iostat computes the *unweighted* average queue
depth during the time between output lines, by observing changes in
the "aveq" ("the sum of how long all requests have spent in flight,
in milliseconds") and "use" ("how many milliseconds there has been at
least one IO in flight") fields of /proc/diskstats. But it's OK that
it's unweighted, because it computes a new value for every line it
outputs (ie every 5 seconds or whatever you asked for). It's not too
clear how to do something like that here, but all suggestions are
welcome. Or maybe we'll have something more general that makes this
specific thing irrelevant, in future AIO infrastructure work.

On a more superficial note, one thing I don't like about the last
version of the patch is the difference in the ordering of the words
in the GUC recovery_prefetch_distance and the view
pg_stat_prefetch_recovery. Hrmph.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BNPZeEdLXAcNr%2Bw0YOZVb0Un0_MwTBpgmmVDh7No2jbg%40mail.gmail.com
[2] https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
[3] https://kernel.dk/io_uring.pdf
[4] https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html
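For what it's worth, the aqu-sz arithmetic described above boils down
to deltas over those two counters between samples; a rough sketch
(simplified, not a real /proc/diskstats parser):

```python
def queue_stats(aveq_prev, use_prev, aveq_now, use_now, interval_ms):
    # Rough sketch of iostat-style derived stats from two samples of
    # /proc/diskstats taken interval_ms apart. "aveq" is the summed
    # time all requests spent in flight (ms); "use" is the time at
    # least one I/O was in flight (ms).
    aqu_sz = (aveq_now - aveq_prev) / interval_ms  # avg requests in flight
    util = (use_now - use_prev) / interval_ms      # fraction of time busy
    return aqu_sz, util

# Over a 1000 ms window, summed in-flight time grew by 5000 ms and the
# device was busy for 800 ms of it:
aqu, util = queue_stats(0, 0, 5000, 800, 1000)
print(aqu, util)  # 5.0 0.8
```

The "new value per output line" property falls out of always
differencing against the previous sample rather than the start.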