Re: [HACKERS] Clock with Adaptive Replacement - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Clock with Adaptive Replacement
Date
Msg-id CAH2-Wz=1ejgxyqn1PoJ=A2NvUaL+1h+Y7F7NLSev=YuBTFMHtQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Clock with Adaptive Replacement  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: [HACKERS] Clock with Adaptive Replacement
Re: [HACKERS] Clock with Adaptive Replacement
List pgsql-hackers
On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I think pgbench isn't a very interesting workload though.  I don't
> have time right now but it would be fun to dig further and try to
> construct workloads that generate pathological searches for buffers (I
> believe that such workloads exist, from anecdotal reports).

While pgbench might not be the most interesting workload, I think even
its super-synthetic, uniformly distributed point lookups can benefit
from better caching decisions. The LRU-K paper's example 1.1 explains
why this might be very cogently: Point lookups from TPC-A (or TPC-C)
consist of descending the index, and then accessing the record in the
main table from a TID, which actually implies quite a non-uniform
access pattern for individual pages. Despite accessing table records
uniformly.

There are a negligible number of internal pages involved with pgbench
(and they should always be cached), so they can almost be ignored.
It's really just a matter of what proportion of shared_buffers should
be leaf pages versus heap pages. The pgbench accounts table is
approximately 6x the size of its primary key, so the leaf blocks have
6x the utility of heap blocks for this workload. It actually is pretty
much that simple. Or it would be, if we didn't have the OS cache as a
confounding factor. The optimal strategy for the pgbench workload as
far as maximizing hits in shared_buffers goes is to *only* cache leaf
blocks, at least until you have all leaf blocks cached; we should
evict table blocks *immediately*. This isn't even remotely close to
the behavior of an LRU, of course. Actually, as the LRU-K paper also
points out, an LRU has a tendency to perversely somewhat *favor* the
table blocks, since presumably each point lookup's table block is
accessed after its leaf block is pinned/accessed.

Before some of the big shared_buffers bottlenecks were alleviated
several years ago, it was possible to observe shared_buffers evictions
occurring essentially at random. I have no idea if that's still true,
but it could be. A random replacement strategy is actually not as bad
as it sounds, but it is still pretty bad, especially given that we
could pay a high price in contention in return for such a bad
strategy. This convinces me that we will not be able to assess any
algorithm well in isolation, through simulations, because lock
contention is too important overall, and can actually affect the cache
hit ratio of a clock-based approach. I believe that it's generally
true that our clock sweep algorithm approximates an LRU, but I have
also seen it totally fail to do that, which it sounds like you've
heard about, too. Andrey is probably correct in his suspicion that
we'll end up prototyping a number of approaches.

I'm glad that you're thinking about this, in any case. Seems like a
promising area to work on.

-- 
Peter Geoghegan


pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Corrupted btree index on HEAD because of covering indexes
Next
From: Merlin Moncure
Date:
Subject: Re: Built-in connection pooling