Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From: Dmitry Dolgov
Subject: Re: WIP: WAL prefetch (another approach)
Date:
Msg-id: 20200425191935.4p6qxzzu5bklodwk@localhost
In response to: Re: WIP: WAL prefetch (another approach) (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: WIP: WAL prefetch (another approach)
List: pgsql-hackers
> On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:
>
> One report I heard recently said that if you get rid of I/O stalls,
> pread() becomes cheap enough that the much higher frequency lseek()
> calls I've complained about elsewhere[1] become the main thing
> recovery is doing, at least on some systems, but I haven't pieced
> together the conditions required yet. I'd be interested to know if
> you see that.

At the moment I've performed a couple of replication tests for the case when
almost everything is in memory (mostly by mistake: I was expecting that a
postgres replica inside a badly memory-limited cgroup would cause more IO,
but it looks like the kernel does not evict the pages anyway). Not sure if
that's what you mean by getting rid of IO stalls, but in these tests
profiling shows lseek & pread in a similar number of samples.

If I understand correctly, one can eventually measure the influence of
prefetching by looking at the execution time of the different redo functions
(assuming the data they operate on is already prefetched, they should be
faster). I still have to clarify what the exact reason is, but even in the
situation described above (everything in memory) there is some visible
difference, e.g.

# with prefetch
Function = b'heap2_redo' [8064]
     nsecs               : count     distribution
      4096 -> 8191       : 1213     |                                        |
      8192 -> 16383      : 66639    |****************************************|
     16384 -> 32767      : 27846    |****************                        |
     32768 -> 65535      : 873      |                                        |

# without prefetch
Function = b'heap2_redo' [17980]
     nsecs               : count     distribution
      4096 -> 8191       : 1        |                                        |
      8192 -> 16383      : 66997    |****************************************|
     16384 -> 32767      : 30966    |******************                      |
     32768 -> 65535      : 1602     |                                        |

# with prefetch
Function = b'btree_redo' [8064]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 246      |****************************************|
      8192 -> 16383      : 5        |                                        |
     16384 -> 32767      : 2        |                                        |

# without prefetch
Function = b'btree_redo' [17980]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 82       |********************                    |
      8192 -> 16383      : 19       |****                                    |
     16384 -> 32767      : 160      |****************************************|

Of course this doesn't take into account the time we spend doing the extra
syscalls for prefetching, but it can still give some interesting information.
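For context, the distributions above are bcc-style log2 histograms. A minimal
bcc sketch that produces this kind of per-function latency histogram might
look roughly like the following (the postgres binary path, the traced function
and the pid are placeholders, not necessarily the exact setup used for the
numbers above):

    from time import sleep
    from bcc import BPF

    prog = """
    #include <uapi/linux/ptrace.h>

    BPF_HASH(start, u32, u64);
    BPF_HISTOGRAM(dist);

    // record the entry timestamp per thread
    int trace_entry(struct pt_regs *ctx)
    {
        u32 tid = bpf_get_current_pid_tgid();
        u64 ts = bpf_ktime_get_ns();
        start.update(&tid, &ts);
        return 0;
    }

    // on return, add the elapsed nanoseconds to a log2 histogram
    int trace_return(struct pt_regs *ctx)
    {
        u32 tid = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&tid);
        if (tsp == 0)
            return 0;
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta));
        start.delete(&tid);
        return 0;
    }
    """

    binary = "/path/to/postgres"   # placeholder: the postgres binary
    func = "heap2_redo"            # placeholder: redo function to trace
    pid = 8064                     # placeholder: the startup/recovery process

    b = BPF(text=prog)
    b.attach_uprobe(name=binary, sym=func, fn_name="trace_entry", pid=pid)
    b.attach_uretprobe(name=binary, sym=func, fn_name="trace_return", pid=pid)

    try:
        sleep(3600)                # let recovery replay some WAL
    except KeyboardInterrupt:
        pass

    print("Function = %s [%d]" % (func, pid))
    b["dist"].print_log2_hist("nsecs")

Running the same workload once with and once without prefetching then gives
two distributions to compare, as above.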
> > I like the idea of extensible PrefetchBufferResult. Just one commentary:
> > if I understand correctly, the way it is being used together with
> > prefetch_queue assumes one IO operation at a time. This limits potential
> > extension of the underlying code, e.g. one can't implement some sort of
> > buffering of requests and submit an iovec to a syscall, because then
> > prefetch_queue would no longer correctly represent the inflight IO. Also,
> > taking into account that "we don't have any awareness of when I/O really
> > completes", maybe in the future it makes sense to reconsider having the
> > queue in the prefetcher itself and rather ask the underlying code for
> > this information?
>
> Yeah, you're right that it'd be good to be able to do some kind of
> batching up of these requests to reduce system calls. Of course
> posix_fadvise() doesn't support that, but clearly in the AIO future[2]
> it would indeed make sense to buffer up a few of these and then make a
> single call to io_uring_enter() on Linux[3] or lio_listio() on a
> hypothetical POSIX AIO implementation[4]. (I'm not sure if there is a
> thing like that on Windows; at a glance, ReadFileScatter() is
> asynchronous ("overlapped") but works only on a single handle so it's
> like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).
>
> Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
> call at appropriate times, but you obviously can't call it too
> infrequently.
>
> As for how to make the prefetch queue a reusable component, rather
> than having a custom thing like that for each part of our system that
> wants to support prefetching: that's a really good question. I didn't
> see how to do it, but maybe I didn't try hard enough. I looked at the
> three users I'm aware of, namely this patch, a btree prefetching patch
> I haven't shared yet, and the existing bitmap heap scan code, and they
> all needed to have their own custom book keeping for this, and I
> couldn't figure out how to share more infrastructure. In the case of
> this patch, you currently need to do LSN based book keeping to
> simulate "completion", and that doesn't make sense for other users.
> Maybe it'll become clearer when we have support for completion
> notification?

Yes, definitely.

> Some related questions are why all these parts of our system that know
> how to prefetch are allowed to do so independently without any kind of
> shared accounting, and why we don't give each tablespace (= our model
> of a device?) its own separate queue. I think it's OK to put these
> questions off a bit longer until we have more infrastructure and
> experience. Our current non-answer is at least consistent with our
> lack of an approach to system-wide memory and CPU accounting... I
> personally think that a better XLogReader that can be used for
> prefetching AND recovery would be a higher priority than that.

Sure, this patch is quite valuable as it is, and the questions I've mentioned
are mostly targeting future development.

> > Maybe it was discussed in the past in other threads. But if I understand
> > correctly, this implementation weights all the samples equally. Since at
> > the moment it depends directly on replaying speed (so a lot of IO
> > involved), couldn't a single outlier at the beginning skew this value and
> > make it less useful? Does it make sense to decay old values?
>
> Hmm.
>
> I wondered about reporting one or perhaps three exponential moving
> averages (like the Unix 1/5/15 minute load averages), but I didn't propose
> it because: (1) In crash recovery, you can't query it, you just get
> the log message at the end, and an unweighted mean seems OK in that case,
> no? (you are not more interested in the I/O saturation at the end of
> recovery than at the start of recovery, are you?), and (2) on a
> streaming replica, if you want to sample the instantaneous depth and
> compute an exponential moving average or some more exotic statistical
> concoction in your monitoring tool, you're free to do so. I suppose
> (2) is an argument for removing the existing average completely from
> the stat view; I put it in there at Andres's suggestion, but I'm not
> sure I really believe in it. Where is our average replication lag,
> and why don't we compute the stddev of X, Y or Z? I think we should
> provide primary measurements and let people compute derived statistics
> from those.

For once I disagree, since I believe that this very approach, widely applied,
leads to a somewhat chaotic situation with monitoring. But of course you're
right, it has nothing to do with the patch itself.
I would also be in favour of removing the existing averages, unless Andres
has more arguments for keeping them.
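Just to make the decaying idea concrete from the monitoring side, here is a
small sketch of the kind of thing a monitoring tool could do, assuming the
stat view exposes an instantaneous queue depth (sample_queue_depth below is a
placeholder for whatever query fetches that value): maintain one or several
exponential moving averages from periodic samples, so that a single early
outlier fades out over time instead of skewing the value forever.

    import math
    import random
    import time

    SAMPLE_INTERVAL = 5.0                    # seconds between samples
    WINDOWS = {"1min": 60.0, "5min": 300.0, "15min": 900.0}

    def sample_queue_depth():
        # Placeholder: a real tool would read the instantaneous prefetch
        # queue depth from the recovery prefetch statistics view.
        return random.randint(0, 16)

    ema = {name: 0.0 for name in WINDOWS}

    while True:
        depth = sample_queue_depth()
        for name, tau in WINDOWS.items():
            # Unix load average style decay: samples older than roughly
            # tau seconds contribute exponentially less to the average.
            alpha = 1.0 - math.exp(-SAMPLE_INTERVAL / tau)
            ema[name] += alpha * (depth - ema[name])
        print("  ".join("%s=%.2f" % (name, value)
                        for name, value in ema.items()))
        time.sleep(SAMPLE_INTERVAL)

Whether something like this belongs in the view itself or in the monitoring
tool is exactly the question discussed above; the sketch only shows that the
decay is cheap to compute from the primary measurement.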