Subject: patch: improve SLRU replacement algorithm

On Mon, Apr 2, 2012 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This particular example shows the above chunk of code taking >13s to
> execute.  Within 3s, every other backend piles up behind that, leading
> to the database getting no work at all done for a good ten seconds.
>
> My guess is that what's happening here is that one backend needs to
> read a page into CLOG, so it calls SlruSelectLRUPage to evict the
> oldest SLRU page, which is dirty.  For some reason, that I/O takes a
> long time.  Then, one by one, other backends come along and also need
> to read various SLRU pages, but the oldest SLRU page hasn't changed,
> so SlruSelectLRUPage keeps returning the exact same page that it
> returned before, and everybody queues up waiting for that I/O, even
> though there might be other buffers available that aren't even dirty.
>
> I am thinking that SlruSelectLRUPage() should probably do
> SlruRecentlyUsed() on the selected buffer before calling
> SlruInternalWritePage, so that the next backend that comes along
> looking for a buffer doesn't pick the same one.  Possibly we should go
> further and try to avoid replacing dirty buffers in the first place,
> but sometimes there may be no choice, so doing SlruRecentlyUsed() is
> still a good idea.
>
> I'll do some testing to try to confirm whether this theory is correct
> and whether the above fix helps.

Having performed this investigation, I've discovered a couple of
interesting things.  First, SlruRecentlyUsed() is an ineffective way
of keeping a page from getting reused, because it's called extremely
frequently, and on these high-velocity tests it takes almost no time
at all for the most recently used buffer to become the least recently
used buffer.  Therefore, SlruRecentlyUsed() doesn't prevent the lock
pile-up.  In the unpatched code, once a long buffer I/O starts,
everybody immediately goes into the tank until the I/O finishes.  If
you patch the code so that the page is marked recently-used before
beginning the I/O, everybody's next few CLOG requests hit some other
buffer but eventually the long-I/O-in-progress buffer again becomes
least recently used and the next CLOG eviction causes a second backend
to begin waiting for that buffer.  Lather, rinse, repeat, until
literally every backend is once again waiting on that buffer I/O.  You
still get the same problem; it just takes slightly longer to develop.
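
To make that concrete, here is a toy model of the lru_count
bookkeeping (standalone C, not PostgreSQL code; the slot count and
access pattern are invented for illustration).  With only a handful of
slots and a steady stream of CLOG hits, a buffer that has just been
marked recently-used is back at the bottom of the LRU order as soon as
every other slot has been touched once:

    #include <stdio.h>

    #define NUM_SLOTS 8

    static int cur_lru_count = 0;
    static int lru_count[NUM_SLOTS];

    /* in spirit like SlruRecentlyUsed(): stamp the slot with a fresh count */
    static void recently_used(int slot)
    {
        lru_count[slot] = ++cur_lru_count;
    }

    /* the slot with the smallest (oldest) count is the replacement victim */
    static int lru_victim(void)
    {
        int best = 0;

        for (int i = 1; i < NUM_SLOTS; i++)
            if (lru_count[i] < lru_count[best])
                best = i;
        return best;
    }

    int main(void)
    {
        int accesses = 0;
        int writing;
        int next;

        /* every slot has been used once */
        for (int i = 0; i < NUM_SLOTS; i++)
            recently_used(i);

        writing = lru_victim();     /* the slot about to be written out */
        recently_used(writing);     /* the attempted fix: mark it before the I/O */

        /* other backends keep hitting the remaining slots */
        next = (writing + 1) % NUM_SLOTS;
        while (lru_victim() != writing)
        {
            recently_used(next);
            accesses++;
            next = (next + 1) % NUM_SLOTS;
            if (next == writing)
                next = (next + 1) % NUM_SLOTS;
        }

        printf("slot %d is the LRU victim again after only %d more accesses\n",
               writing, accesses);
        return 0;
    }

In this model it takes just seven touches of other buffers before the
marked slot is the victim again - effectively no time at all next to a
write that can run for ten seconds or more.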

On reflection, it seems to me that the right fix here is to make
SlruSelectLRUPage() avoid selecting a page on which an I/O is
already in progress.  In general, those I/Os are all writes.  We don't
end up waiting for reads because all the old CLOG pages we might want
to read are still in the OS cache.  So reads complete quickly on this
test.  Writes take a long time, because there we have to actually get
the data down to disk, and the disk is busy.  But there's no reason
for a backend doing a replacement to wait for either a read or a write
that is in progress: once the read or write completes, we're going to
loop around and repeat the buffer selection process, and most likely
pick a buffer completely unrelated to the one whose I/O we waited for.
We might as well just skip the wait and select that other buffer
immediately.  The attached patch implements that.
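
For the archives, here is the idea in miniature (again standalone C
rather than the attached patch; the slot states and counts are made
up): victim selection simply passes over any slot whose I/O is in
progress, and only a caller that finds every slot busy would have to
wait:

    #include <stdio.h>

    #define NUM_SLOTS 8

    typedef enum
    {
        SLOT_EMPTY,             /* free slot, usable immediately */
        SLOT_VALID,             /* page present, no I/O in progress */
        SLOT_READ_IN_PROGRESS,  /* someone is reading this page in */
        SLOT_WRITE_IN_PROGRESS  /* someone is writing this page out */
    } SlotStatus;

    static SlotStatus status[NUM_SLOTS];
    static int lru_count[NUM_SLOTS];

    /*
     * Pick a victim: an empty slot if there is one, otherwise the least
     * recently used slot with no I/O in progress.  Return -1 only if
     * every slot is busy, in which case the caller has no choice but to
     * wait for some I/O to finish.
     */
    static int select_victim(void)
    {
        int best = -1;

        for (int i = 0; i < NUM_SLOTS; i++)
        {
            if (status[i] == SLOT_EMPTY)
                return i;
            if (status[i] != SLOT_VALID)
                continue;           /* skip buffers undergoing I/O */
            if (best < 0 || lru_count[i] < lru_count[best])
                best = i;
        }
        return best;
    }

    int main(void)
    {
        /* slot 3's write is taking forever; everything else is idle */
        for (int i = 0; i < NUM_SLOTS; i++)
        {
            status[i] = SLOT_VALID;
            lru_count[i] = 100 + i;
        }
        status[3] = SLOT_WRITE_IN_PROGRESS;
        lru_count[3] = 1;           /* oldest page, but don't wait for it */

        printf("victim slot: %d\n", select_victim());   /* prints 0, not 3 */
        return 0;
    }

A real implementation still has to write the chosen page if it's dirty
and to handle the everything-busy case; the point of the sketch is only
that the scan no longer hands every caller the same in-progress buffer.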

Applying this patch does in fact eliminate the stalls.  Here are the
top ten places where blocking happens without the patch - these are
counts of times we waited more than 100ms for a lwlock during
a 30-minute, 32-client pgbench run:

     54 slru.c:311 blocked by slru.c:405
     99 xlog.c:2241 blocked by xlog.c:2090
    172 heapam.c:2758 blocked by heapam.c:2758
    635 indexam.c:521 blocked by heapam.c:2758
    663 xlog.c:2090 blocked by xlog.c:2241
    666 varsup.c:65 blocked by varsup.c:65
    682 heapam.c:2758 blocked by indexam.c:521
    803 xlog.c:1502 blocked by xlog.c:2241
   3002 slru.c:311 blocked by slru.c:529
  23978 xlog.c:909 blocked by xlog.c:909

And with the patch:

     72 hio.c:336 blocked by heapam.c:2758
    109 xlog.c:2241 blocked by xlog.c:2090
    129 slru.c:311 blocked by slru.c:405
    210 heapam.c:2758 blocked by heapam.c:2758
    425 heapam.c:2758 blocked by indexam.c:521
    710 indexam.c:521 blocked by heapam.c:2758
    766 xlog.c:2090 blocked by xlog.c:2241
    915 xlog.c:1502 blocked by xlog.c:2241
   1684 varsup.c:65 blocked by varsup.c:65
  27950 xlog.c:909 blocked by xlog.c:909

As you can see, slru.c:311 blocked by slru.c:529 disappears.  It's not
just no longer in the top ten - it's actually completely gone.
Unfortunately, we get more stalls elsewhere as a result, but that's
only to be expected - contention moves around as you fix things.  The
remaining blocking within slru.c is attributable to the line that says
"129 slru.c:311 blocked by slru.c:405".  I haven't fully verified
this, but I believe that blocking happens there when somebody needs to
read a page that's already being read - the second guy quite naturally
waits for the first guy's I/O to finish.  Those waits are not only
many fewer in number but also much shorter than the problem this patch
fixes, where the whole system piles up waiting on a buffer I/O for no
real good reason.

I ran some 30-minute pgbench tests to try to see how this affected
throughput.  Here are the results; m = master, s =
slru-replacement-fix; the number immediately after that letter is the
client count.  These numbers are without the instrumentation used to
gather the lock-wait data shown above.

m01 tps = 1404.506517 (including connections establishing)
m01 tps = 1378.661239 (including connections establishing)
m01 tps = 1403.880038 (including connections establishing)
s01 tps = 1383.525351 (including connections establishing)
s01 tps = 1365.177884 (including connections establishing)
s01 tps = 1362.275677 (including connections establishing)
m08 tps = 8902.111306 (including connections establishing)
m08 tps = 8864.054359 (including connections establishing)
m08 tps = 8911.271822 (including connections establishing)
s08 tps = 9298.724507 (including connections establishing)
s08 tps = 9473.795213 (including connections establishing)
s08 tps = 9204.661724 (including connections establishing)
m16 tps = 12963.745279 (including connections establishing)
m16 tps = 12654.084955 (including connections establishing)
m16 tps = 13203.021537 (including connections establishing)
s16 tps = 13241.299141 (including connections establishing)
s16 tps = 13280.307298 (including connections establishing)
s16 tps = 12550.540800 (including connections establishing)
m32 tps = 13026.542041 (including connections establishing)
m32 tps = 14362.972612 (including connections establishing)
m32 tps = 14586.169194 (including connections establishing)
s32 tps = 14671.230224 (including connections establishing)
s32 tps = 14487.942766 (including connections establishing)
s32 tps = 14294.915303 (including connections establishing)
m64 tps = 12883.598484 (including connections establishing)
m64 tps = 13927.599008 (including connections establishing)
m64 tps = 13487.756722 (including connections establishing)
s64 tps = 13761.857778 (including connections establishing)
s64 tps = 14001.861641 (including connections establishing)
s64 tps = 14087.215634 (including connections establishing)

There's some apparent regression on the single-client test, but I'm
inclined to think that's a testing artifact of some kind and also
probably not very important.  It would be worth paying a small price
in throughput to avoid many-second entire-database stalls, but on this
test throughput actually goes up in all cases other than a single
client; and it's hard to take the single client case seriously as a
regression anyway because if there's only one thing running, the only
effect of this patch is to slightly increase the amount of CPU effort
that we spend before replacing the same buffer we would have
replaced anyway.  There's no way that's enough to cut 3% off
performance; I think the explanation must be something like, say,
autovacuum runs a bit faster because it doesn't hang as much, but then
that triggers a checkpoint sooner; or something's shuffled itself
around across cache lines in a way that works out a little worse; or
maybe it's just that the patched code was tested second.

A couple of other notes:

1. It's probably also a good idea to hand off fsyncs of CLOG pages to
the background writer, as suggested on the other thread.  However,
even if we do that, this is still a good idea, because I've now found
that occasionally the *write* can be slow rather than (or in addition
to) the fsync.

2. I think we might want to revisit Simon's idea of background-writing
SLRU pages.  That didn't work out well before, but now I think I know
why: without this patch, any write or fsync that takes a while
practically guarantees that the whole system will pile up waiting for
the page lock.  So anything that, like background writing, promises to
increase the total number of writes works out poorly.  But this patch
should remove that obstacle, because now doing a write or fsync that
ends up being slow will only cause latency problems for the person
actually doing it - and it seems clearly preferable for that to be a
background task.  The devil is in the details, of course, but I think
it's worth retesting with this fix applied, at the least.

3. I noticed that the blocking described by "slru.c:311 blocked by
slru.c:405" seemed to be clumpy - I would get a bunch of messages
about that all at once.  This makes me wonder if the SLRU machinery is
occasionally making a real bad decision about what page to evict, and
then everybody piles up waiting for that page to be read back in.
That is sheer hand-waving at this point and might be complete bologna,
but I'm hoping to find time to investigate further.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
