patch: improve SLRU replacement algorithm - Mailing list pgsql-hackers
From:      Robert Haas
Subject:   patch: improve SLRU replacement algorithm
Date:
Msg-id:    CA+TgmoZ6cWexkRa4v3nUykZebzJkrkCoS_EDt6uDuP7RTPq7Bg@mail.gmail.com
Responses: Re: patch: improve SLRU replacement algorithm (5 replies)
List:      pgsql-hackers
On Mon, Apr 2, 2012 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This particular example shows the above chunk of code taking >13s to
> execute. Within 3s, every other backend piles up behind that, leading
> to the database getting no work at all done for a good ten seconds.
>
> My guess is that what's happening here is that one backend needs to
> read a page into CLOG, so it calls SlruSelectLRUPage to evict the
> oldest SLRU page, which is dirty. For some reason, that I/O takes a
> long time. Then, one by one, other backends come along and also need
> to read various SLRU pages, but the oldest SLRU page hasn't changed,
> so SlruSelectLRUPage keeps returning the exact same page that it
> returned before, and everybody queues up waiting for that I/O, even
> though there might be other buffers available that aren't even dirty.
>
> I am thinking that SlruSelectLRUPage() should probably do
> SlruRecentlyUsed() on the selected buffer before calling
> SlruInternalWritePage, so that the next backend that comes along
> looking for a buffer doesn't pick the same one. Possibly we should go
> further and try to avoid replacing dirty buffers in the first place,
> but sometimes there may be no choice, so doing SlruRecentlyUsed() is
> still a good idea.
>
> I'll do some testing to try to confirm whether this theory is correct
> and whether the above fix helps.

Having performed this investigation, I've discovered a couple of interesting things. First, SlruRecentlyUsed() is an ineffective way of keeping a page from getting reused, because it's called extremely frequently, and on these high-velocity tests it takes almost no time at all for the most recently used buffer to become the least recently used buffer. Therefore, SlruRecentlyUsed() doesn't prevent the lock pile-up. In the unpatched code, once a long buffer I/O starts, everybody immediately goes into the tank until the I/O finishes. If you patch the code so that the page is marked recently-used before beginning the I/O, everybody's next few CLOG requests hit some other buffer, but eventually the long-I/O-in-progress buffer again becomes least recently used and the next CLOG eviction causes a second backend to begin waiting for that buffer. Lather, rinse, repeat, until literally every backend is once again waiting on that buffer I/O. You still get the same problem; it just takes slightly longer to develop.

On reflection, it seems to me that the right fix here is to make SlruSelectLRUPage() avoid selecting a page on which an I/O is already in progress. In general, those I/Os are all writes. We don't end up waiting for reads, because all the old CLOG pages we might want to read are still in the OS cache, so reads complete quickly on this test. Writes take a long time, because there we have to actually get the data down to disk, and the disk is busy. But there's no reason for a backend doing a replacement to wait for either a read or a write that is in progress: once the read or write completes, we're going to loop around and repeat the buffer selection process, and most likely pick a buffer completely unrelated to the one whose I/O we waited for. We might as well just skip the wait and select that other buffer immediately. The attached patch implements that.

Applying this patch does in fact eliminate the stalls.
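To make the selection policy concrete, here's a toy, self-contained sketch of the idea - the structures and names below are invented for illustration and are not the actual slru.c code or the attached patch. The point is simply: prefer the least-recently-used slot that has no I/O in progress, and only fall back to waiting when every slot is busy.

/*
 * Toy illustration of the replacement preference described above.
 * Invented structures; not the real slru.c definitions.
 */
#include <stdio.h>
#include <stdbool.h>
#include <limits.h>

#define NUM_SLOTS 8

typedef struct
{
    int  lru_count;         /* lower = less recently used */
    bool io_in_progress;    /* a read or write is underway on this slot */
} ToySlot;

/*
 * Pick a victim slot.  Prefer the least recently used slot with no I/O
 * in progress; only if every slot is busy do we return the overall LRU
 * slot, in which case the caller must wait for that I/O and retry.
 */
static int
toy_select_victim(ToySlot *slots, int nslots, bool *must_wait)
{
    int best_idle = -1, best_idle_count = INT_MAX;
    int best_any = -1, best_any_count = INT_MAX;

    for (int i = 0; i < nslots; i++)
    {
        if (slots[i].lru_count < best_any_count)
        {
            best_any = i;
            best_any_count = slots[i].lru_count;
        }
        if (!slots[i].io_in_progress && slots[i].lru_count < best_idle_count)
        {
            best_idle = i;
            best_idle_count = slots[i].lru_count;
        }
    }

    if (best_idle >= 0)
    {
        *must_wait = false;
        return best_idle;       /* evict without waiting on anyone's I/O */
    }
    *must_wait = true;
    return best_any;            /* every slot busy: wait and retry */
}

int
main(void)
{
    ToySlot slots[NUM_SLOTS];
    bool    must_wait;

    /* Slot 0 is the overall LRU but has a slow write in progress. */
    for (int i = 0; i < NUM_SLOTS; i++)
    {
        slots[i].lru_count = i;
        slots[i].io_in_progress = false;
    }
    slots[0].io_in_progress = true;

    int victim = toy_select_victim(slots, NUM_SLOTS, &must_wait);
    printf("victim = %d, must_wait = %s\n", victim, must_wait ? "yes" : "no");
    /* Prints "victim = 1, must_wait = no": the busy LRU slot is skipped. */
    return 0;
}

In the real slru.c the equivalent information lives in the shared SLRU state rather than a toy array, but the preference order - skip anything with an I/O already in progress unless there's no alternative - is the whole point.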
Here are the top ten places where blocking happens without the patch - these are counts of times we waited more than 100ms for an lwlock during a 30-minute, 32-client pgbench run:

     54 slru.c:311 blocked by slru.c:405
     99 xlog.c:2241 blocked by xlog.c:2090
    172 heapam.c:2758 blocked by heapam.c:2758
    635 indexam.c:521 blocked by heapam.c:2758
    663 xlog.c:2090 blocked by xlog.c:2241
    666 varsup.c:65 blocked by varsup.c:65
    682 heapam.c:2758 blocked by indexam.c:521
    803 xlog.c:1502 blocked by xlog.c:2241
   3002 slru.c:311 blocked by slru.c:529
  23978 xlog.c:909 blocked by xlog.c:909

And with the patch:

     72 hio.c:336 blocked by heapam.c:2758
    109 xlog.c:2241 blocked by xlog.c:2090
    129 slru.c:311 blocked by slru.c:405
    210 heapam.c:2758 blocked by heapam.c:2758
    425 heapam.c:2758 blocked by indexam.c:521
    710 indexam.c:521 blocked by heapam.c:2758
    766 xlog.c:2090 blocked by xlog.c:2241
    915 xlog.c:1502 blocked by xlog.c:2241
   1684 varsup.c:65 blocked by varsup.c:65
  27950 xlog.c:909 blocked by xlog.c:909

As you can see, slru.c:311 blocked by slru.c:529 disappears. It's not just no longer in the top ten - it's actually completely gone. Unfortunately, we get more stalls elsewhere as a result, but that's only to be expected - contention moves around as you fix things. The remaining blocking within slru.c is attributable to the line that says "129 slru.c:311 blocked by slru.c:405". I haven't fully verified this, but I believe that blocking happens there when somebody needs to read a page that's already being read - the second guy quite naturally waits for the first guy's I/O to finish. Those waits are not only far fewer in number but also much shorter than the problem this patch fixes, where the whole system piles up waiting on a buffer I/O for no real good reason.
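For context, the numbers above come from instrumentation that logs any lwlock acquisition that blocks for more than 100ms, tagged with the waiter's file:line and the holder's. Just as an illustration of the measurement idea - not the instrumentation patch actually used for these runs, and using a plain pthread mutex rather than lwlocks, and omitting the "blocked by" half, which requires also recording where the current holder took the lock - a minimal timing wrapper might look like this:

/*
 * Rough sketch: time how long a lock acquisition blocks and report
 * waits longer than 100ms along with the waiter's file:line.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static double
elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0 +
           (end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Acquire the mutex, logging the wait if it exceeds 100ms. */
static void
timed_lock(pthread_mutex_t *lock, const char *file, int line)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_mutex_lock(lock);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ms = elapsed_ms(&start, &end);
    if (ms > 100.0)
        fprintf(stderr, "waited %.0f ms for lock at %s:%d\n", ms, file, line);
}

#define TIMED_LOCK(lk) timed_lock((lk), __FILE__, __LINE__)

int
main(void)
{
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    TIMED_LOCK(&lock);          /* uncontended: no message expected */
    pthread_mutex_unlock(&lock);
    return 0;
}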
I ran some 30-minute pgbench tests to try to see how this affected throughput. Here are the results; m = master, s = slru-replacement-fix; the number immediately after that letter is the client count. These numbers are without the instrumentation used to gather the lock-wait data shown above.

m01 tps = 1404.506517 (including connections establishing)
m01 tps = 1378.661239 (including connections establishing)
m01 tps = 1403.880038 (including connections establishing)
s01 tps = 1383.525351 (including connections establishing)
s01 tps = 1365.177884 (including connections establishing)
s01 tps = 1362.275677 (including connections establishing)
m08 tps = 8902.111306 (including connections establishing)
m08 tps = 8864.054359 (including connections establishing)
m08 tps = 8911.271822 (including connections establishing)
s08 tps = 9298.724507 (including connections establishing)
s08 tps = 9473.795213 (including connections establishing)
s08 tps = 9204.661724 (including connections establishing)
m16 tps = 12963.745279 (including connections establishing)
m16 tps = 12654.084955 (including connections establishing)
m16 tps = 13203.021537 (including connections establishing)
s16 tps = 13241.299141 (including connections establishing)
s16 tps = 13280.307298 (including connections establishing)
s16 tps = 12550.540800 (including connections establishing)
m32 tps = 13026.542041 (including connections establishing)
m32 tps = 14362.972612 (including connections establishing)
m32 tps = 14586.169194 (including connections establishing)
s32 tps = 14671.230224 (including connections establishing)
s32 tps = 14487.942766 (including connections establishing)
s32 tps = 14294.915303 (including connections establishing)
m64 tps = 12883.598484 (including connections establishing)
m64 tps = 13927.599008 (including connections establishing)
m64 tps = 13487.756722 (including connections establishing)
s64 tps = 13761.857778 (including connections establishing)
s64 tps = 14001.861641 (including connections establishing)
s64 tps = 14087.215634 (including connections establishing)

There's some apparent regression on the single-client test, but I'm inclined to think that's a testing artifact of some kind and also probably not very important. It would be worth paying a small price in throughput to avoid many-second entire-database stalls, but on this test throughput actually goes up in all cases other than a single client; and it's hard to take the single-client case seriously as a regression anyway, because if there's only one thing running, the only effect of this patch is to slightly increase the amount of CPU effort we spend before replacing the same buffer we would have replaced anyway. There's no way that's enough to cut 3% off performance; I think the explanation must be something like, say, autovacuum runs a bit faster because it doesn't hang as much, but then that triggers a checkpoint sooner; or something's shuffled itself around across cache lines in a way that works out a little worse; or maybe it's just that the patched code was tested second.

A couple of other notes:

1. It's probably also a good idea to hand off fsyncs of CLOG pages to the background writer, as suggested on the other thread. However, even if we do that, this is still a good idea, because I've now found that occasionally the *write* can be slow rather than (or in addition to) the fsync.

2. I think we might want to revisit Simon's idea of background-writing SLRU pages. That didn't work out well before, but now I think I know why: without this patch, any write or fsync that takes a while practically guarantees that the whole system will pile up waiting for the page lock. So anything that, like background writing, promises to increase the total number of writes works out poorly.
But this patch should remove that obstacle, because now doing a write or fsync that ends up being slow will only cause latency problems for the person actually doing it - and it seems clearly preferable for that to be a background task. The devil is in the details, of course, but I think it's worth retesting with this fix applied, at the least.

3. I noticed that the blocking described by "slru.c:311 blocked by slru.c:405" seemed to be clumpy - I would get a bunch of messages about that all at once. This makes me wonder if the SLRU machinery is occasionally making a really bad decision about what page to evict, and then everybody piles up waiting for that page to be read back in. That is sheer hand-waving at this point and might be complete bologna, but I'm hoping to find time to investigate further.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company