Thread: Clock sweep not caching enough B-Tree leaf pages?

Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
I have some theories about the PostgreSQL buffer manager/clock sweep.
To motivate the reader to get through the material presented here, I
present up-front a benchmark of a proof-of-concept patch of mine:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/

Test Set 4 represents the patch's performance here.

This shows some considerable improvements for a tpc-b workload, with
15 minute runs, where the buffer manager struggles with moderately
intense cache pressure. shared_buffers is 8GiB, with 32GiB of system
memory in total. The scale factor is 5,000 here, so that puts the
primary index of the accounts table at a size that makes it impossible
to cache entirely within shared_buffers, by a margin of a couple of
GiB. pgbench_accounts_pkey is ~10 GB, and pgbench_accounts is ~63 GB.
Obviously the heap is much larger, since for that table heap
tuples are several times the size of index tuples (the ratio here is
probably well below the mean, if I can be permitted to make a vast
generalization).

PostgreSQL implements a clock sweep algorithm, which gets us something
approaching an LRU for the buffer manager in trade-off for less
contention on core structures. Buffers have a usage_count/"popularity"
that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
algorithm only has one bit for what approximates our "usage_count" (so
it's either 0 or 1). I think that at its core CLOCK is an algorithm
that has some very desirable properties that I am sure must be
preserved. Actually, I think it's more accurate to say we use a
variant of clock pro, a refinement of the original CLOCK.
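
For anyone who hasn't looked at freelist.c recently, the victim search
is conceptually something like the following. This is only a simplified
sketch with abbreviated, made-up names; the real StrategyGetBuffer()
also deals with the freelist, the buffer header spinlocks and the
bgwriter:

#define NBUFFERS 16384          /* illustrative only */

typedef struct
{
    int     refcount;           /* pin count */
    int     usage_count;        /* saturates at BM_MAX_USAGE_COUNT */
} BufferDescSketch;

static BufferDescSketch buffers[NBUFFERS];
static int  next_victim = 0;

static int
clock_sweep_victim(void)
{
    for (;;)
    {
        BufferDescSketch *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->refcount != 0)
            continue;           /* pinned: skip, leave usage_count alone */

        if (buf->usage_count > 0)
        {
            buf->usage_count--; /* survive at least one more revolution */
            continue;
        }

        return (int) (buf - buffers);   /* unpinned, usage_count 0: evict */
    }
}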

In the past, various hackers have noted problems they've observed with
this scheme. A common pathology is to see frantic searching for a
victim buffer only to find all buffer usage_count values at 5. It may
take multiple revolutions of the clock hand before a victim buffer is
found, as usage_count is decremented for each and every buffer.  Also,
BufFreelistLock contention is considered a serious bottleneck [1],
which is related.

I think a very real problem may be that approximating an LRU is
bad because an actual LRU is bad, though not due to the problems that
CLOCK/our clock sweep algorithm, to its great credit, ameliorates. I
don't think that we need to trade-off between an LRU and MRU as Atri
once suggested [2]; rather, I think we need a trade-off between an LRU
and a LFU (very loosely speaking). Something like a pure LRU can make
very bad choices for database workloads. This is described extensively
in the literature.

As I recently remarked upon for unrelated reasons [3], B+Trees perform
excellently when partially cached. There is a remarkable asymmetry
between how many pages we must have cached relative to how much I/O is
avoided, particularly the random I/O that is characteristic of fully
uncached B-Trees when scanned. Approximately 99% of pages in our
nbtree structures can be expected to be leaf pages in many common
cases. There is a very wide fan-out of B-Tree structures that makes
this possible. A lot of material on the subject doesn't emphasize this
basic aspect strongly enough in my view. But it also bears mentioning
that B-Tree pages are, in an important sense, far denser than heap
pages.

Let's leave aside inner/root pages though, because they're so
dramatically useful when in a primary index on a tpc-b table that
they'll always be cached by any non-terrible algorithm. It beggars
belief that the still relatively dense (and consequently *popular*)
B+Tree leaf pages get so little credit for being of such long-term
utility (in the view of our clock sweep algorithm as implemented). The
algorithm has what could be loosely described as an excessively
short-term perspective. There is clearly a better balance to be had
here. I don't think the answer is that we have the B-Tree code give
its pages some special treatment that makes them harder to evict,
although I will admit that I briefly entertained the idea.

I am aware of the false start that we had with ARC back in 2005. I
believe that effort tried to address some of these problems. I am
mindful of the pitfalls there. I'm inclined to think that the decision
to create a CLOCK variant in preference to ARC was the right one at
the time, because clock sweep really is at a considerable advantage
with regard to lock contention.

The 1993 paper "The LRU-K Page Replacement Algorithm For Database Disk
Buffering" [4] proposes what is (roughly speaking) an algorithm that
is an adaptive hybrid of LRU and LFU. Consider Example 1.1. from that
paper; it describes a very simple scenario in which it's possible to
have slightly more heap pages cached than B-Tree pages. This scenario
is essentially a "pgbench -S", a scenario comparable to the simplistic
tpc-a; it's nothing more than that. The authors aren't clear on this,
but I assumed a uniform distribution (IIRC, the 2Q paper, which was
published a year or two later, also explicitly assumes this of the
very same earlier LRU-K/LRU-2 example). It is argued within that
example, quite cogently in my opinion, that it's very bad that a
simple LRU cache will see just under 50% of buffers used to cache
B-Tree leaf pages, while over 50% are used for "data" (heap) pages. In
actual fact, it is preferable to almost exclusively cache B-Tree
pages. Using 50% of the cache on B-Tree pages when it is optimal to
cache a number approaching 100% of the cache is a pretty large
disparity, particularly for such a simple workload. This is literally
the furthest possible thing from a contrived corner case.
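
To put rough numbers on the asymmetry (these densities are my own
back-of-the-envelope assumptions, not the paper's exact figures),
consider uniformly random single-row lookups, where each lookup touches
one leaf page and one heap page:

#include <stdio.h>

int
main(void)
{
    double  ntuples = 10 * 1000 * 1000;     /* rows in the table (assumed) */
    double  heap_tuples_per_page = 50;      /* assumed heap density */
    double  leaf_tuples_per_page = 200;     /* assumed leaf density */

    double  heap_pages = ntuples / heap_tuples_per_page;    /* 200,000 */
    double  leaf_pages = ntuples / leaf_tuples_per_page;    /*  50,000 */

    /* each lookup references one leaf page and one heap page at random */
    double  leaf_ref_prob = 1.0 / leaf_pages;
    double  heap_ref_prob = 1.0 / heap_pages;

    printf("heap pages: %.0f, leaf pages: %.0f\n", heap_pages, leaf_pages);
    printf("each leaf page is referenced %.1fx as often as each heap page\n",
           leaf_ref_prob / heap_ref_prob);

    /*
     * An LRU only remembers what was touched most recently; with strictly
     * alternating leaf/heap references it converges on a roughly 50/50
     * split, even though each leaf page is 4x "hotter" here and the
     * frequency-aware choice is to cache (nearly) all of the leaf pages.
     */
    return 0;
}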

I believe that it isn't hard to get clock sweep to do just the same as
predicted in the '93 paper, even with a "usage_count". It is nothing
more than an LRU approximation, or at least that's what the relevant
comments say. I did some experiments here, but it probably isn't worth
sharing the raw data, given what I already have here. The example
surrounding caching B-Tree leaf pages in the paper draws attention to
a particularly acute pathology, but there are probably others. Leaf
pages are in many representative cases far denser, and therefore can
be expected to be accessed far more frequently to service queries. Our
current failure to credit them in a way that weighs the frequency with
which they're accessed is something I suggest thinking long and hard
about.

Has anyone thought about this in the last few years? I know that Tom
examined the LRU-K paper back in 2000 [5], but was discouraged by some
kind of contention or CPU overhead (although he did say he intended to
revisit the question). Obviously a lot has changed in the past 14
years, particularly with regard to CPU characteristics.

Anyway, having gone through the requisite background information, I'll
get back to the subject of the pgbench-tools tpc-b benchmark that I
started out with, and what I've actually done in the attached patch to
improve matters. Let me preface this by saying: this is a rough
prototype. The way I add a field to the buffer descriptor struct
clearly isn't going to fly, since we have every reason to believe that
that is itself performance critical in other scenarios (just look at
Andres' recent work on the alignment of these structures in memory).
Actually, it probably matters a lot in this scenario too, just not
enough to mask the general outcome. I've arrived at this prototype
through plenty of principled theorizing, and a bit of unprincipled
frobbing. I'm a big fan of building simple prototypes to test
theories. In the interest of reproducibility I have not attempted to
clean up what I have here just yet, in case I accidentally invalidate
the benchmark presented.

The prototype patch:

1) Throttles incrementation of usage_count temporally. It becomes
impossible to increment usage_count for any given buffer more
frequently than every 3 seconds, while decrementing usage_count is
totally unaffected. This is thought to give the algorithm a short to
medium term "appreciation" of the frequency of access. It also
prevents very close increments in usage_count due to pinning a buffer
twice in the same query or transaction, just because the code in the
executor happens to be accidentally structured such that that happens
(as when inserting two tuples within a single INSERT DML query), or
whatever. Clearly the tpc-b accounts table is accessed twice in
succession in the same transaction by our tpc-b script, creating a sort of
misrepresentation of buffer popularity that is likely in and of itself
undesirable.

2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not
5 as before. This gives us a more granular representation of the
popularity/usefulness of each buffer, which I believe spans a
dramatically large spectrum (i.e. my guess is that testing will show I
didn't go far enough). This step on its own would be assumed extremely
counter-productive by those in the know, but I believe that other
measures ameliorate the downsides. I could be wrong about how true
that is in other cases, but then the case helped here isn't what you'd
call a narrow benchmark. It's the pgbench default (although I do use
unlogged tables, in part because the I/O subsystem on this server is
quite under-powered, even though it has plenty of memory).

3) Has buffer usage_count start at 3. I also tried 4 (and 1, the
current value within master), but settled on 3 for now. Seemingly this
makes a very large difference compared to only doing 1) and 2), since
it gives each page a fair shot at proving its usefulness. Presumably
the throttling in 1) makes frantic buffer scanning much less
prevalent, since clock sweep has the upper hand, so to speak. A tug of
war between clock sweep and backends pinning buffers can be
disastrous, but that seems much less likely than before. Semi-magical
constant amounts of time are something you see surprisingly often in
the research. Probably most notably, the "five-minute rule" [6] argues
that one should only cache randomly accessed pages that are re-used
every 5 minutes or less (it's actually a bit more complicated these
days, but not much). This rule is a mainstay of database caching
research theory. Anyway, I'm not sure whether or not this delay should
be exposed as a GUC. I lean towards "no".
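
Putting the three changes together, the behaviour amounts to roughly the
following. This is a hand-wavy sketch rather than the patch itself; the
field names, and the idea of storing a timestamp in the buffer
descriptor at all, are placeholders for illustration (see the caveats
above about bloating the descriptors):

#include <time.h>

#define BM_MAX_USAGE_COUNT      10  /* 2): raised from 5 */
#define BM_INITIAL_USAGE_COUNT  3   /* 3): new buffers start part-way up */
#define USAGE_INCREMENT_DELAY   3   /* 1): seconds between increments */

typedef struct
{
    int     refcount;
    int     usage_count;
    time_t  last_increment;     /* placeholder; bloats the descriptor */
} BufferDescSketch;

/* when a buffer is first allocated (BufferAlloc()) */
static void
init_usage(BufferDescSketch *buf)
{
    buf->usage_count = BM_INITIAL_USAGE_COUNT;
    buf->last_increment = time(NULL);
}

/* on every pin (PinBuffer()) */
static void
bump_usage(BufferDescSketch *buf)
{
    time_t  now = time(NULL);

    /* at most one increment per USAGE_INCREMENT_DELAY seconds */
    if (buf->usage_count < BM_MAX_USAGE_COUNT &&
        now - buf->last_increment >= USAGE_INCREMENT_DELAY)
    {
        buf->usage_count++;
        buf->last_increment = now;
    }

    /* decrements by the clock sweep are unchanged and unthrottled */
}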

The amount of evidence it takes to validate something like this is
generally enormous. Validating the just-in-time bgwriter strategy was
the original reason that Greg Smith wrote pgbench-tools. Still, I'm
confident that I've identified a real problem. The big picture here is
that pages have a fair go at proving their worth, while allowing for
certain pages to be recognized as being of dramatically higher
long-term utility. Generally speaking, as a caching algorithm, a pure
LFU is inappropriate for almost all use-cases, because it never
forgets anything until it forgets everything about individual pages.
There is a balance to be had here, in terms of the extent to which we
allow pages to "rest on their laurels". I wouldn't be surprised to
learn I have the balance wrong right now.

pgbench-tools benchmark interpretation
======================================

I also attach a LibreOffice spreadsheet comparing hit rates for each
table (and its indexes) for each run that you see in the pgbench-tools
report (except the final do-over of master baseline). I built this
with the help of a small pgbench-tools hack, resetting pg_statio*
views, measuring hit rates per pgbench invocation. The hit rate shown
would probably be a lot more interesting if I'd used a rate limit, so
the amount of work performed is consistent across test sets (that is,
variants of the patch, and master). Greg Smith is very enthusiastic
about how much further insight can be had while using that pgbench
feature, and I'd say he's probably right to be. Right now, in terms of
hit rate you mostly see a slightly smaller one for indexes, and a
considerably larger one for heap buffers as compared to the master
baseline.

If you run down the individual runs in the pgbench-tools report, and
consider what actually happens and when, you tend to see a degradation
in performance over successive runs for different client counts for
certain cases tested. It's hard to be sure, but I think that might be
because earlier runs have a big advantage due to the index being well
cached (pgbench-tools performs vacuuming at the start of each run,
excluding the first run. So VACUUM reverses the situation initially
seen, since we kill heap tuples last when vacuuming, rather than
creating the index last). In contrast, the impressive performance of
the patch (where all 3 of the above measures are implemented) shows
more consistent throughput even with that noise, which seems to
support my theories about what is going on here. Leaving aside cache
effectiveness as such, I suspect that in general we currently pay a
high price for all too frequently having buffers fall out of
shared_buffers into the OS cache, and fall back in again.

Having arrived at the conclusion that these optimizations were worth
sharing here, I decided to "do over" the original master baseline
(Test Set 1). I ended up with a result (Test set 5) that is actually
quite different from the original result, without it being all that
clear that it was better or worse than the first time I took a
baseline (it was probably worse). This might call into question the
accuracy of my results. However, to see what's really going on here,
it is necessary to drill down to the individual TPS/latency figures
for individual pgbench invocations. That's the real story here.

In general, as I said, the I/O system here is quite under-specified
for this workload. This is dedicated physical hardware, but it happens
to be what was immediately available to me. There are a couple of SATA
7200 rpm disks in RAID 1. The system specifications, as advertised by
my hosting provider, are detailed here:
https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 . As always
with pgbench-tools, you can drill down to see more about each test
(including kernel vm settings, etc).

Inevitably, with this disk setup, which is I imagine particularly bad
with random I/O, and also with so much memory, it's only a matter of
time before things get ugly, even if there is an initial high
performance burst (even on master) before we have to face the music,
unsurprisingly. It seems like the pain is felt at slightly different
times between Test Set 1 and Test Set 5 (i.e. for each test of
master). With either of those 2 test sets, if you drill down to see
the moment-to-moment throughput and latency, you'll find tests from
each test set where both latency and throughput were on the floor for
as long as a couple of minutes at a time, after which there is a
sudden recovery. We're subject to the vagaries of kernel I/O
scheduling to a much greater extent than with the good patched test
sets, it seems. What is seen with master looks very much like sync
phase checkpoint spikes. Once or twice, things are consistently very
poor for almost an entire master invocation of pgbench. In general
things are very inconsistent, although to be fair it's possible that
at its best master has greater throughput than patched for brief
moments (of course, the buffer descriptor bloating which I have yet to
deal with may well be all that is to blame here).

If you look at the test sets that this patch covers (with all the
tricks applied), there are pretty good figures throughout. You can
kind of see the pain towards the end, but there are no dramatic falls
in responsiveness for minutes at a time. There are latency spikes, but
they're *far* shorter, and much better hidden. Without looking at
individual multiple minute spikes, at the macro level (all client
counts for all runs) average latency is about half of what is seen on
master.

Does anyone have any high-end hardware that they could use to test this out?

[1] http://www.postgresql.org/message-id/CA+TgmobJm0GHk58nUPRQHCGwY25n1DCkU4ku9aQeczZEjiz9mQ@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAOeZVic4HikhmzVD=ZP4JY9g8PgpyiQQOXOELWP=kR+=H1Frgg@mail.gmail.com
[3] http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com
[4] http://www.cs.cmu.edu/~christos/courses/721-resources/p297-o_neil.pdf
[5] http://www.postgresql.org/message-id/1601.967421129@sss.pgh.pa.us
[6] http://en.wikipedia.org/wiki/Five-minute_rule

--
Peter Geoghegan

Attachment

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/14/14, 12:11 PM, Peter Geoghegan wrote:
> I have some theories about the PostgreSQL buffer manager/clock sweep.
> To motivate the reader to get through the material presented here, I
> present up-front a benchmark of a proof-of-concept patch of mine:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/
>
> Test Set 4 represents the patches performance here.
>
> This shows some considerable improvements for a tpc-b workload, with
> 15 minute runs, where the buffer manager struggles with moderately
> intense cache pressure. shared_buffers is 8GiB, with 32GiB of system
> memory in total. The scale factor is 5,000 here, so that puts the
> primary index of the accounts table at a size that makes it impossible
> to cache entirely within shared_buffers, by a margin of couple of
> GiBs. pgbench_accounts_pkey is ~"10GB", and pgbench_accounts is ~"63
> GB". Obviously the heap is much larger, since for that table heap
> tuples are several times the size of index tuples (the ratio here is
> probably well below the mean, if I can be permitted to make a vast
> generalization).
>
> PostgreSQL implements a clock sweep algorithm, which gets us something
> approaching an LRU for the buffer manager in trade-off for less
> contention on core structures. Buffers have a usage_count/"popularity"
> that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
> algorithm only has one bit for what approximates our "usage_count" (so
> it's either 0 or 1). I think that at its core CLOCK is an algorithm
> that has some very desirable properties that I am sure must be
> preserved. Actually, I think it's more accurate to say we use a
> variant of clock pro, a refinement of the original CLOCK.

I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of
which has its own clock. IIRC one of the arguments for us supporting a count>1 was we could get the benefits of
multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page
pool in an OS *run at different rates* based on system demands.
 

I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this
difference any time we look at what OSes do.
 

> If you look at the test sets that this patch covers (with all the
> tricks applied), there are pretty good figures throughout. You can
> kind of see the pain towards the end, but there are no dramatic falls
> in responsiveness for minutes at a time. There are latency spikes, but
> they're *far* shorter, and much better hidden. Without looking at
> individual multiple minute spikes, at the macro level (all client
> counts for all runs) average latency is about half of what is seen on
> master.

My guess would be that those latency spikes are caused by a need to run the clock for an extended period. IIRC there's
code floating around that makes it possible to measure that.
 

I suspect it would be very interesting to see what happens if your patch is combined with the work that (Greg?) did to
reduce the odds of individual backends needing to run the clock. (I know part of that work looked at proactively keeping
pages on the free list, but I think there was more to it than that).
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Bruce Momjian
Date:
On Mon, Apr 14, 2014 at 10:11:53AM -0700, Peter Geoghegan wrote:
> Has anyone thought about this in the last few years? I know that Tom
> examined the LRU-K paper back in 2000 [5], but was discouraged by some
> kind of contention or CPU overhead (although he did say he intended to
> revisit the question). Obviously a lot has changed in the past 14
> years, particularly with regard to CPU characteristics.

I am glad you are looking at this.  You are right that it requires a
huge amount of testing, but clearly our code needs improvement in this
area.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:
* Jim Nasby (jim@nasby.net) wrote:
> I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of
> which has its own clock. IIRC one of the arguments for us supporting a count>1 was we could get the benefits of
> multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page
> pool in an OS *run at different rates* based on system demands.
 

They're also maintained in *parallel*, no?  That's something that I've
been talking over with a few folks at various conferences- that we
should consider breaking up shared buffers and then have new backend
processes which work through each pool independently and in parallel.

> I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember
> this difference any time we look at what OSes do.
 

It's my suspicion that the one-big-pool is exactly why we see many cases
where PG performs worse when the pool is more than a few gigs.  Of
course, this is all speculation and proper testing needs to be done..
Thanks,
    Stephen

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I am glad you are looking at this.  You are right that it requires a
> huge amount of testing, but clearly our code needs improvement in this
> area.

Thanks.

Does anyone recall the original justification for the recommendation
that shared_buffers never exceed 8GiB? I'd like to revisit the test
case, if such a thing exists.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Bruce Momjian
Date:
On Mon, Apr 14, 2014 at 05:45:56PM -0700, Peter Geoghegan wrote:
> On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > I am glad you are looking at this.  You are right that it requires a
> > huge amount of testing, but clearly our code needs improvement in this
> > area.
> 
> Thanks.
> 
> Does anyone recall the original justification for the recommendation
> that shared_buffers never exceed 8GiB? I'd like to revisit the test
> case, if such a thing exists.

I have understood it to be that the overhead of managing over 1 million
buffers is too large if you aren't accessing more than 8GB of data in a
five-minute period.  If you are accessing that much, it might be possible to
have a win over 8GB.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/14/14, 7:43 PM, Stephen Frost wrote:
> * Jim Nasby (jim@nasby.net) wrote:
>> I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of
>> which has its own clock. IIRC one of the arguments for us supporting a count>1 was we could get the benefits of
>> multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page
>> pool in an OS *run at different rates* based on system demands.
 
>
> They're also maintained in *parallel*, no?  That's something that I've
> been talking over with a few folks at various conferences- that we
> should consider breaking up shared buffers and then have new backend
> processes which work through each pool independently and in parallel.

I suspect that varies based on the OS, but it certainly happens in a separate process from user processes. The
expectation is that there should always be pages on the free list so requests for memory can happen quickly.
 

http://www.freebsd.org/doc/en/articles/vm-design/freeing-pages.html contains a good overview of what FreeBSD does. See
http://www.freebsd.org/doc/en/articles/vm-design/allen-briggs-qa.html#idp62990256 as well.
 

>> I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember
>> this difference any time we look at what OSes do.
 
>
> It's my suspicion that the one-big-pool is exactly why we see many cases
> where PG performs worse when the pool is more than a few gigs.  Of
> course, this is all speculation and proper testing needs to be done..

I think there some critical take-aways from FreeBSD that apply here (in no particular order):

1: The system is driven by memory pressure. No pressure means no processing.
2: It sounds like the active list is LFU, not LRU. The cache list is LRU.
3: *The use counter is maintained by a clock.* Because the clock only runs so often this means there is no run-away
incrementing like we see in Postgres.
 
4: Once a page is determined to not be active it goes onto a separate list depending on whether it's clean or dirty.
5: Dirty pages are only written to maintain a certain clean/dirty ratio and again, only when there's actual memory
pressure.
6: The system maintains a list of free pages to serve memory requests quickly. In fact, lower level functions (ie:
http://www.leidinger.net/FreeBSD/dox/vm/html/d4/d65/vm__phys_8c_source.html#l00862) simply return NULL if they can't
find pages on the free list.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Mon, Apr 14, 2014 at 7:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> I am glad you are looking at this.  You are right that it requires a
>> huge amount of testing, but clearly our code needs improvement in this
>> area.
>
> Thanks.
>
> Does anyone recall the original justification for the recommendation
> that shared_buffers never exceed 8GiB? I'd like to revisit the test
> case, if such a thing exists.

There are many reports of improvement from lowering shared_buffers.
The problem is that it tends to show up on complex production
workloads and that there is no clear evidence pointing to problems
with the clock sweep; it could be higher up in the partition locks or
something else entirely (like the O/S).  pgbench is also not the
greatest tool for sniffing out these cases: it's too random, and for
large databases optimization is generally an exercise in de-randomizing
i/o patterns.  We really, really need a broader testing suite that
covers more usage patterns.

I was suspicious for a while that spinlock contention inside the
clocksweep was causing stalls and posted a couple of different patches
to try and reduce the chance of that.  I basically gave up when I
couldn't demonstrate that case in simulated testing.

I still think there is no good reason for the clock to pedantically
adjust usage count on contended buffers... better to throw a single
TTAS and bail to the next buffer if either 'T' signals a lock.
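
In other words, something along these lines, with invented names
standing in for the buffer header spinlock (a sketch only, not the
actual bufmgr interfaces; the CAS here stands in for a test-and-set):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    _Atomic int hdr_locked;     /* 0 = free, 1 = held; stand-in for the
                                 * buffer header spinlock */
    int         usage_count;
} BufferDescSketch;

static bool
sweep_try_decrement(BufferDescSketch *buf)
{
    int     expected = 0;

    /* first "T": a plain load; if the lock looks held, bail immediately */
    if (atomic_load_explicit(&buf->hdr_locked, memory_order_relaxed) != 0)
        return false;

    /* second "T" plus the "S": one CAS attempt, no spinning */
    if (!atomic_compare_exchange_strong(&buf->hdr_locked, &expected, 1))
        return false;

    if (buf->usage_count > 0)
        buf->usage_count--;

    atomic_store(&buf->hdr_locked, 0);  /* release */
    return true;
}

The sweep would just move on to the next buffer whenever this returns
false, rather than spinning on a contended header.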

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Tue, Apr 15, 2014 at 9:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> There are many reports of improvement from lowering shared_buffers.
> The problem is that it tends to show up on complex production
> workloads and that there is no clear evidence pointing to problems
> with the clock sweep; it could be higher up in the partition locks or
> something else entirely (like the O/S).  pgbench is also not the
> greatest tool for sniffing out these cases: it's too random and for
> large database optimization is generally an exercise in de-randomizing
> i/o patterns.  We really, really need a broader testing suite that
> covers more usage patterns.

I find it quite dissatisfying that we know so little about this.

I'm finding that my patch helps much less when shared_buffers is sized
large enough to fit the index entirely (although there are still some
localized stalls on master, where there are none with patched).
shared_buffers is still far too small to fit the entire heap. With
shared_buffers=24GB (which still leaves just under 8GB of memory for
the OS to use as cache, since this system has 32GB of main memory),
the numbers are much less impressive relative to master with the same
configuration. Both sets of numbers are still better than what you've
already seen with shared_buffers=8GB, since of course the "no more
than 8GB" recommendation is not an absolute, and as you say its
efficacy seemingly cannot be demonstrated with pgbench.

My guess is that the patch doesn't help because once there is more
than enough room to cache the entire index (slightly over twice as
many buffers as would be required to do so), even on master it becomes
virtually impossible to evict those relatively popular index pages,
since they still have an early advantage. It doesn't matter that
master's clock sweep has what I've called an excessively short-term
perspective, because there is always enough pressure relative to the
number of leaf pages being pinned to prefer to evict heap pages. There
are still enough buffers to fit some moderate proportion of all
heap pages even after buffering the entire index (something like
~13GB).

You might say that with this new shared_buffers setting, clock sweep
doesn't need to have a "good memory", because it can immediately
observe the usefulness of B-Tree leaf pages.

There is no need to limit myself to speculation here, of course. I'll
check it out using pg_buffercache.
-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Ants Aasma
Date:
On Mon, Apr 14, 2014 at 8:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> PostgreSQL implements a clock sweep algorithm, which gets us something
> approaching an LRU for the buffer manager in trade-off for less
> contention on core structures. Buffers have a usage_count/"popularity"
> that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
> algorithm only has one bit for what approximates our "usage_count" (so
> it's either 0 or 1). I think that at its core CLOCK is an algorithm
> that has some very desirable properties that I am sure must be
> preserved. Actually, I think it's more accurate to say we use a
> variant of clock pro, a refinement of the original CLOCK.

PostgreSQL's replacement algorithm is more similar to Generalized CLOCK
or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm
that approximates LIRS[3]. LIRS is what MySQL implements[4] and
CLOCK-Pro is implemented by NetBSD [5] and there has been some work on
trying it on Linux [6]. Both LIRS and CLOCK-Pro work by keeping double
the cache size in metadata entries and detecting pages that have been
recently referenced. Basically they provide an adaptive tradeoff
between LRU and LFU.

> In the past, various hackers have noted problems they've observed with
> this scheme. A common pathology is to see frantic searching for a
> victim buffer only to find all buffer usage_count values at 5. It may
> take multiple revolutions of the clock hand before a victim buffer is
> found, as usage_count is decremented for each and every buffer.  Also,
> BufFreelistLock contention is considered a serious bottleneck [1],
> which is related.

There's a paper on a non-blocking GCLOCK algorithm that does lock-free
clock sweep and buffer pinning[7]. If we decide to stay with
GCLOCK it may be interesting, although I still believe that some
variant of buffer nailing[8] is a better idea; my experience shows
that most of the locking overhead is cache line bouncing, ignoring the
extreme cases where our naive spinlock implementation blows up.


> Let's leave aside inner/root pages though, because they're so
> dramatically useful when in a primary index on a tpb-b table that
> they'll always be cached by any non-terrible algorithm. It beggars
> belief that the still relatively dense (and consequently *popular*)
> B+Tree leaf pages get so little credit for being of such long-term
> utility (in the view of our clock sweep algorithm as implemented). The
> algorithm has what could be loosely described as an excessively
> short-term perspective. There is clearly a better balance to be had
> here. I don't think the answer is that we have the B-Tree code give
> its pages some special treatment that makes them harder to evict,
> although I will admit that I briefly entertained the idea.

There has been some research that indicates that for TPC-A workloads
giving index pages higher weights increases hit rates[1].


I think the hardest hurdle for any changes in this area will be
showing that we don't have any nasty regressions. I think the best way
to do that would be to study separately the performance overhead of
the replacement algorithm and optimality of the replacement choices.
If we capture a bunch of buffer reference traces by instrumenting
PinBuffer, we can pretty accurately simulate the behavior of different
algorithm and tuning choices with different shared buffer sizes.
Obviously full scale tests are still needed due to interactions with
OS, controller and disk caches and other miscellaneous influences. But
even so, simulation would get us much better coverage of various
workloads and at least some confidence that it's a good change
overall. It will be very hard and time consuming to gather equivalent
evidence with full scale tests.
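
To make that concrete, the simulation half could be as small as
something like this: a toy GCLOCK simulator replaying a captured trace
of buffer tags. The trace format and all of the constants here are
assumptions for illustration, not an existing tool:

#include <stdio.h>
#include <stdbool.h>

#define NBUFFERS   1024         /* simulated shared_buffers, in pages */
#define MAX_USAGE  5

typedef struct { long tag; int usage; bool valid; } SimBuf;

static SimBuf  buf[NBUFFERS];
static int     hand;

static int
find(long tag)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buf[i].valid && buf[i].tag == tag)
            return i;
    return -1;
}

static int
evict(void)
{
    for (;;)
    {
        int     i = hand;

        hand = (hand + 1) % NBUFFERS;
        if (!buf[i].valid || buf[i].usage == 0)
            return i;
        buf[i].usage--;
    }
}

int
main(void)
{
    long    tag, hits = 0, total = 0;

    /* one buffer tag per line on stdin, e.g. from an instrumented PinBuffer() */
    while (scanf("%ld", &tag) == 1)
    {
        int     i = find(tag);

        total++;
        if (i >= 0)
        {
            hits++;
            if (buf[i].usage < MAX_USAGE)
                buf[i].usage++;
        }
        else
        {
            i = evict();
            buf[i].valid = true;
            buf[i].tag = tag;
            buf[i].usage = 1;
        }
    }
    printf("references: %ld, hit rate: %.2f%%\n",
           total, total ? 100.0 * hits / total : 0.0);
    return 0;
}

Replaying the same trace against variants (different initial or maximum
usage counts, different increment rules) and different NBUFFERS values
would give hit rate comparisons without any full scale runs.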


[1] http://www.csd.uoc.gr/~hy460/pdf/p35-nicola.pdf
[2] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf
[3] http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang02_LIRS.pdf
[4] http://lists.mysql.com/commits/28601
[5] http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD
[6] http://lwn.net/Articles/147879/
[7] http://derby-nb.googlecode.com/svn-history/r41/trunk/derby-nb/ICDE10_conf_full_409.pdf
[8] http://www.postgresql.org/message-id/CA+TgmoZYPeYHWAUeJVYy9A5aNDoULcF33WTnprfR9SYcw30vAg@mail.gmail.com

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote:
> PostgreSQL replacement algorithm is more similar to Generalized CLOCK
> or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm
> that approximates LIRS[3]. LIRS is what MySQL implements[4] and
> CLOCK-Pro is implemented by NetBSD [5] and there has been some work on
> trying it on Linux [6]. Both LIRS and CLOCK-Pro work by keeping double
> the cache size metadata entries and detect pages that have been
> recently referenced. Basically they provide an adaptive tradeoff
> between LRU and LFU.

That's good to know.

> There's a paper on a non blocking GCLOCK algorithm, that does lock
> free clock sweep and buffer pinning[7]. If we decide to stay with
> GCLOCK it may be interesting, although I still believe that some
> variant of buffer nailing[8] is a better idea, my experience shows
> that most of the locking overhead is cache line bouncing ignoring the
> extreme cases where our naive spinlock implementation blows up.

You might be right about that, but lets handle one problem at a time.
Who knows what the bottleneck will end up being if and when we address
the naivety around frequency? I want to better characterize that
problem first.

> There has been some research that indicates that for TPC-A workloads
> giving index pages higher weights increases hitrates[1].

Frankly, there doesn't need to be any research on this, because it's
just common sense that probabilistically, leaf pages are much more
useful than heap pages in servicing index scan queries if we assume a
uniform distribution. If we don't assume that, then they're still more
useful on average.

> I think the hardest hurdle for any changes in this area will be
> showing that we don't have any nasty regressions. I think the best way
> to do that would be to study separately the performance overhead of
> the replacement algorithm and optimality of the replacement choices.
> If we capture a bunch of buffer reference traces by instrumenting
> PinBuffer, we can pretty accurately simulate the behavior of different
> algorithm and tuning choices with different shared buffer sizes.
> Obviously full scale tests are still needed due to interactions with
> OS, controller and disk caches and other miscellaneous influences. But
> even so, simulation would get us much better coverage of various
> workloads and at least some confidence that it's a good change
> overall. It will be very hard and time consuming to gather equivalent
> evidence with full scale tests.

I think I agree with all of that. The fact that we as a community
don't appear to have too much to say about what workloads to
prioritize somewhat frustrates this. The other problem is that sizing
shared_buffers appropriately involves a surprising amount of deference
to rules of thumb that in practice no one is quite prepared to
rigorously defend - who is to say what apportionment of memory to
Postgres is appropriate here? I too was hopeful that we could evaluate
this work purely in terms of observed improvements to hit rate (at
least initially), but now I doubt even that. It would be great to be
able to say "here are the parameters of this discussion", and have
everyone immediately agree with that, but in this instance that's
legitimately not possible.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Amit Kapila
Date:
On Wed, Apr 16, 2014 at 5:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote:
>> There's a paper on a non blocking GCLOCK algorithm, that does lock
>> free clock sweep and buffer pinning[7]. If we decide to stay with
>> GCLOCK it may be interesting, although I still believe that some
>> variant of buffer nailing[8] is a better idea, my experience shows
>> that most of the locking overhead is cache line bouncing ignoring the
>> extreme cases where our naive spinlock implementation blows up.
>
> You might be right about that, but lets handle one problem at a time.
> Who knows what the bottleneck will end up being if and when we address
> the naivety around frequency? I want to better characterize that
> problem first.

Just to summarize you about the previous discussion and the
improvements that we decided to do in this area based on feedback
are as follows:

1. Bgwriter needs to be improved so that it can help in reducing usage
   count and finding the next victim buffer (run the clock sweep and add
   buffers to the free list).
2. SetLatch for bgwriter (wake up bgwriter) when elements in the freelist
   are low.
3. Split the workdone global lock (BufFreelistLock) in StrategyGetBuffer
   (a spinlock for the freelist, and an lwlock for the clock sweep). Here
   we can try to make it lock free based on atomic ops as well.
4. Bgwriter needs to be more aggressive; the logic by which it calculates
   how many buffers it needs to process needs to be improved.
5. Contention around buffer mapping locks.
6. Cacheline bouncing around the buffer header spinlocks; is there
   anything we can do to reduce this?
7. Choose an optimistically-used buffer in StrategyGetBuffer().
8. Don't bump the usage count every time a buffer is pinned.

I have already addressed some of these improvements in patch[1]
and for other's, I have plan to work on them for 9.5.

I think here you want to address the improvements related to usage
count and see if it can get us a win in some commonly used scenarios,
without affecting any other commonly used scenario.  I feel this is a
good idea to pursue and see if we can get good benefits with it.

In fact, a few days back I ran some tests manually to see the
problems around BufFreeListLock (currently I don't have the script ready),
and more recently Jason Petersen has done some benchmarking
in this area, which you can refer to here[2].

I wonder if we can work together to improve things in this area.

[1]
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com
[2]
https://googledrive.com/host/0Bx33JCTmOADOeTIwaE9KX21yWEk/Concurrency%20Limits%20with%20Large%20Working%20Sets

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> In the past, various hackers have noted problems they've observed with
> this scheme. A common pathology is to see frantic searching for a
> victim buffer only to find all buffer usage_count values at 5. It may
> take multiple revolutions of the clock hand before a victim buffer is
> found, as usage_count is decremented for each and every buffer.  Also,
> BufFreelistLock contention is considered a serious bottleneck [1],
> which is related.

I think that the basic problem here is that usage counts increase when
buffers are referenced, but they decrease when buffers are evicted,
and those two things are not in any necessary way connected to each
other.  In particular, if no eviction is happening, reference counts
will converge to the maximum value.  I've read a few papers about
algorithms that attempt to segregate the list of buffers into "hot"
and "cold" lists, and an important property of such algorithms is that
they mustn't be allowed to make everything hot.  It's easy to be too
simplistic, here: an algorithm that requires that no more than half
the list be hot will fall over badly on a workload where the working
set exceeds the available cache and the really hot portion of the
working set is 60% of the available cache.  So you need a more
sophisticated algorithm than that.  But that core property that not
all buffers can be hot must somehow be preserved, and our algorithm
doesn't.

This isn't a fundamental property of the usage-count idea; it's an
artifact of the fact that usage count decreases are tied to eviction
pressure rather than access pressure.  For example, suppose we made a
rule that if the total usage counts of all buffers exceed 3 *
NBuffers, then every time you bump the usage count of a buffer from N
to N+1, you're required to advance the clock sweep far enough to
decrease the reference count of a buffer by one.  When you want to
reclaim a buffer, you advance a separate clock sweep until you find a
buffer with a zero usage count; if you circle the whole ring without
finding one, then you reclaim the buffer you saw with the lowest usage
count.  There are obvious scalability problems here (everyone fighting
over the right to advance the clock sweep) but ignore that for the
sake of the thought experiment: now you have an algorithm where not
all buffers can be hot.  If some buffers are hotter than others, then
whenever their usage count is decreased it will immediately get pushed
back up again, but some other buffer then has to absorb the decrease.
Only the buffers that are really hot can maintain high usage counts,
because *somebody* has to have a low usage count.
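
To make the control rule explicit, the thought experiment is roughly
this (invented names, and the scalability problems mentioned above are
entirely ignored):

#define NBUFFERS      16384     /* illustrative only */
#define MAX_USAGE     5
#define USAGE_BUDGET  (3 * NBUFFERS)

typedef struct { int usage; } SimBuf;

static SimBuf  buf[NBUFFERS];
static long    total_usage;     /* sum of all usage counts */
static int     cool_hand;       /* sweep used only for "cooling" */

/* called wherever PinBuffer() bumps a usage count today */
static void
bump_usage(int which)
{
    if (buf[which].usage < MAX_USAGE)
    {
        buf[which].usage++;
        total_usage++;
    }

    /*
     * Once the pool as a whole is over budget, every increment is paid
     * for by decrementing some other buffer, so "cooling" is driven by
     * access pressure rather than eviction pressure.
     */
    if (total_usage > USAGE_BUDGET)
    {
        for (int scanned = 0; scanned < NBUFFERS; scanned++)
        {
            int     i = cool_hand;

            cool_hand = (cool_hand + 1) % NBUFFERS;
            if (i != which && buf[i].usage > 0)
            {
                buf[i].usage--;
                total_usage--;
                break;          /* exactly one decrement per increment */
            }
        }
    }
}

Reclaim then uses a separate sweep that looks for a zero usage count, as
described above.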

Even ignoring scalability concerns, this might not be (and probably
isn't) exactly what we want to implement, but I think it illustrates
an important control principle all the same: buffer "cooling" needs to
be driven by the same underlying phenomenon - probably buffer access -
as buffer "heating".  If they're driven by unrelated phenomena, then
the rates may be wildly incomparable, and you'll end up with
everything hot or everything cold.  If that happens, you lose, because
with everything the same, there's no principled way to decide which
things are actually best to evict.

If we come up with some good solution for shared buffers, we should
also consider applying it to SLRU eviction.  I believe that the
current situation around CLOG eviction is none too pretty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Tue, Apr 15, 2014 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> In the past, various hackers have noted problems they've observed with
>> this scheme. A common pathology is to see frantic searching for a
>> victim buffer only to find all buffer usage_count values at 5. It may
>> take multiple revolutions of the clock hand before a victim buffer is
>> found, as usage_count is decremented for each and every buffer.  Also,
>> BufFreelistLock contention is considered a serious bottleneck [1],
>> which is related.
>
> I think that the basic problem here is that usage counts increase when
> buffers are referenced, but they decrease when buffers are evicted,
> and those two things are not in any necessary way connected to each
> other.  In particular, if no eviction is happening, reference counts
> will converge to the maximum value.  I've read a few papers about
> algorithms that attempt to segregate the list of buffers into "hot"
> and "cold" lists, and an important property of such algorithms is that
> they mustn't be allowed to make everything hot.

It's possible that I've misunderstood what you mean here, but do you
really think it's likely that everything will be hot, in the event of
using something like what I've sketched here? I think it's an
important measure against this general problem that buffers really
earn the right to be considered hot, so to speak. With my prototype,
in order for a buffer to become as hard to evict as possible, at a
minimum it must be *continually* pinned for at least 30 seconds.
That's actually a pretty tall order. Although, as I said, I wouldn't
be surprised if it was worth making it possible for buffers to be even
more difficult to evict than that. It should be extremely difficult to
evict a root B-Tree page, and to a lesser extent inner pages even
under a lot of cache pressure, for example. There are lots of
workloads in which that can happen, and I have a hard time believing
that it's worth it to evict given the extraordinary difference in
their utility as compared to a lot of other things. I can imagine a
huge barrier against evicting what is actually a relatively tiny
number of pages being worth it.

I don't want to dismiss what you're saying about heating and cooling
being unrelated, but I don't find the conclusion that not everything
can be hot obvious. Maybe "heat" should be relative rather than
absolute, and maybe that's actually what you meant. There is surely
some workload where buffer access actually is perfectly uniform, and
what do you do there? What "temperature" are those buffers?

It occurs to me that within the prototype patch, even though
usage_count is incremented in a vastly slower fashion (in a wall time
sense), clock sweep doesn't take advantage of that. I should probably
investigate having clock sweep become more aggressive in decrementing
in response to realizing that it won't get some buffer's usage_count
down to zero on the next revolution either. There are certainly
problems with that, but they might be fixable. Within the patch, in
order for it to be possible for the usage_count to be incremented in
the interim, an average of 1.5 seconds must pass, so if clock sweep
were to anticipate another no-set-to-zero revolution, it seems pretty
likely that it would be exactly right, or if not then close enough,
since it can only really fail to correct for some buffers getting
incremented once more in the interim. Conceptually, it would be like
multiple logical revolutions were merged into one actual one,
sufficient to have the next revolution find a victim buffer.
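
As a rough sketch of what I mean, using the smallest usage_count left
behind by the previous revolution as the decrement step for the next one
(whether this estimate can be tracked cheaply, and whether it's safe at
all, are open questions; all names here are invented for illustration):

#include <limits.h>

#define NBUFFERS   16384        /* illustrative only */

typedef struct { int refcount; int usage; } SimBuf;

static SimBuf  buf[NBUFFERS];
static int     hand;
static int     sweep_step = 1;  /* how much to decrement per visit */

static int
sweep_victim(void)
{
    int     scanned = 0;
    int     min_left = INT_MAX; /* smallest usage left behind this revolution */

    for (;;)
    {
        SimBuf *b = &buf[hand];

        hand = (hand + 1) % NBUFFERS;

        if (++scanned > NBUFFERS)
        {
            /*
             * A whole revolution without finding a victim: merge the
             * revolutions that a step of 1 would still need by using the
             * smallest usage_count we just left behind as the new step.
             */
            sweep_step = (min_left == INT_MAX || min_left < 1) ? 1 : min_left;
            min_left = INT_MAX;
            scanned = 1;
        }

        if (b->refcount != 0)
            continue;                   /* pinned: skip */

        if (b->usage == 0)
        {
            sweep_step = 1;             /* back to normal behaviour */
            return (int) (b - buf);
        }

        b->usage -= (b->usage < sweep_step) ? b->usage : sweep_step;
        if (b->usage < min_left)
            min_left = b->usage;
    }
}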

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
Hi,

It's good to see focus on this - some improvements around s_b are sorely
needed.

On 2014-04-14 10:11:53 -0700, Peter Geoghegan wrote:
> 1) Throttles incrementation of usage_count temporally. It becomes
> impossible to increment usage_count for any given buffer more
> frequently than every 3 seconds, while decrementing usage_count is
> totally unaffected.

I think this is unfortunately completely out of the question. For one, a
gettimeofday() for every buffer pin will become a significant performance
problem. Even the computation of the xact/stmt start/stop timestamps
shows up pretty heavily in profiles today - and they are far less
frequent than buffer pins. And that's on x86 linux, where gettimeofday()
is implemented as something more lightweight than a full syscall.

The other significant problem I see with this is that it's not adaptive
to the actual throughput of buffers in s_b. In many cases there's
hundreds of clock cycles through shared buffers in 3 seconds. By only
increasing the usagecount that often you've destroyed the little
semblance of a working LRU there is right now.

It also wouldn't work well for situations with a fast changing
workload >> s_b. If you have frequent queries that take a second or so
and access some data repeatedly (index nodes or whatnot), only increasing
the usagecount once will mean they'll continually fall back to disk access.

> 2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not
> 5 as before. ... . This step on its own would be assumed extremely
> counter-productive by those in the know, but I believe that other
> measures ameliorate the downsides. I could be wrong about how true
> that is in other cases, but then the case helped here isn't what you'd
> call a narrow benchmark.

I don't see which mechanisms you have suggested that counter this?

I think having more granular usagecount is a good idea, but I don't
think it can realistically be implemented with the current method of
choosing victim buffers. The amount of cacheline misses around that is
already a major scalability limit; we surely can't make this even
worse. I think it'd be possible to get back to this if we had a better
bgwriter implementation.

Greetings,

Andres Freund



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I think this is unfortunately completely out of question. For one a
> gettimeofday() for every uffer pin will become a significant performance
> problem. Even the computation of the xact/stm start/stop timestamps
> shows up pretty heavily in profiles today - and they are far less
> frequent than buffer pins. And that's on x86 linux, where gettimeofday()
> is implemented as something more lightweight than a full syscall.

Come on, Andres. Of course exactly what I've done here is completely
out of the question as a patch that we can go and commit right now.
I've numerous caveats about bloating the buffer descriptors, and about
it being a proof of concept. I'm pretty sure we can come up with a
scheme to significantly cut down on the number of gettimeofday() calls
if it comes down to it. In any case, I'm interested in advancing our
understanding of the problem right now. Let's leave the minutiae to
one side for the time being.

> The other significant problem I see with this is that its not adaptive
> to the actual throughput of buffers in s_b. In many cases there's
> hundreds of clock cycles through shared buffers in 3 seconds. By only
> increasing the usagecount that often you've destroyed the little
> semblance to a working LRU there is right now.

If a usage_count can get to BM_MAX_USAGE_COUNT from its initial
allocation within an instant, that's bad. It's that simple. Consider
all the ways in which that can happen almost by accident.

You could probably reasonably argue that the trade-off or lack of
adaption (between an LRU and an LFU) that this particular sketch of
mine represents is inappropriate or sub-optimal, but I don't
understand why you're criticizing the patch for doing what I expressly
set out to do. I wrote "I think a very real problem may be that
approximating an LRU is bad because an actual LRU is bad".

> It also wouldn't work well for situations with a fast changing
> workload >> s_b. If you have frequent queries that take a second or so
> and access some data repeatedly (index nodes or whatnot) only increasing
> the usagecount once will mean they'll continually fall back to disk access.

No, it shouldn't, because there is a notion of buffers getting a fair
chance to prove themselves. Now, it might well be the case that there
are workloads where what I've done to make that happen in this
prototype doesn't work out too well - I've already said so. But should
a buffer get a usage count of 5 just because the user inserted 5
tuples within a single DML command, for example? If so, why?

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-16 01:58:23 -0700, Peter Geoghegan wrote:
> On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think this is unfortunately completely out of question. For one a
> > gettimeofday() for every uffer pin will become a significant performance
> > problem. Even the computation of the xact/stm start/stop timestamps
> > shows up pretty heavily in profiles today - and they are far less
> > frequent than buffer pins. And that's on x86 linux, where gettimeofday()
> > is implemented as something more lightweight than a full syscall.
> 
> Come on, Andres. Of course exactly what I've done here is completely
> out of the question as a patch that we can go and commit right now.
> I've numerous caveats about bloating the buffer descriptors, and about
> it being a proof of concept. I'm pretty sure we can come up with a
> scheme to significantly cut down on the number of gettimeofday() calls
> if it comes down to it. In any case, I'm interested in advancing our
> understanding of the problem right now. Let's leave the minutiae to
> one side for the time being.

*I* don't think any scheme that involves measuring the time around
buffer pins is going to be acceptable. It's better that I say that now
rather than when you've invested significant time into the approach, no?

> > The other significant problem I see with this is that its not adaptive
> > to the actual throughput of buffers in s_b. In many cases there's
> > hundreds of clock cycles through shared buffers in 3 seconds. By only
> > increasing the usagecount that often you've destroyed the little
> > semblance to a working LRU there is right now.
> 
> If a usage_count can get to BM_MAX_USAGE_COUNT from its initial
> allocation within an instant, that's bad. It's that simple. Consider
> all the ways in which that can happen almost by accident.

Yes, I agree that that's a problem. It immediately going down to zero is
a problem as well though. And that's what will happen in many scenarios,
because you have time limits on increasing the usagecount, but not when
decreasing.

> > It also wouldn't work well for situations with a fast changing
> > workload >> s_b. If you have frequent queries that take a second or so
> > and access some data repeatedly (index nodes or whatnot) only increasing
> > the usagecount once will mean they'll continually fall back to disk access.
> 
> No, it shouldn't, because there is a notion of buffers getting a fair
> chance to prove themselves.

If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock
cycles/second, how does *any* buffer have a chance to prove itself?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Wed, Apr 16, 2014 at 2:18 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> *I* don't think any scheme that involves measuring the time around
> buffer pins is going to be acceptable. It's better than I say that now
> rather than when you've invested significant time into the approach, no?

Well, I do think that it will be possible to make something like that
work. LRU-K/LRU-2 involves remembering the last two access times (not
the last one). Researchers considered preeminent authorities on
caching algorithms thought that was a good idea in 1993. There are
plenty of other examples of similar schemes too.
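
To make that concrete, here is a minimal standalone sketch of the LRU-2
bookkeeping (the CacheSlot struct and the logical-tick timestamps are made
up for illustration; this is not tied to our buffer descriptors):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical, simplified cache slot: remembers the last two access times. */
    typedef struct CacheSlot
    {
        uint64_t    last_access;        /* most recent access, as a logical tick */
        uint64_t    penultimate_access; /* access before that; 0 if accessed only once */
    } CacheSlot;

    /* Record an access at logical time "now". */
    static void
    slot_touch(CacheSlot *slot, uint64_t now)
    {
        slot->penultimate_access = slot->last_access;
        slot->last_access = now;
    }

    /* LRU-2 victim choice: evict the slot whose second-most-recent access is oldest. */
    static size_t
    lru2_pick_victim(const CacheSlot *slots, size_t nslots)
    {
        size_t      victim = 0;
        uint64_t    oldest = UINT64_MAX;

        for (size_t i = 0; i < nslots; i++)
        {
            /* Slots touched only once sort as very old on their second access. */
            uint64_t    key = slots[i].penultimate_access;

            if (key < oldest)
            {
                oldest = key;
                victim = i;
            }
        }
        return victim;
    }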

>> No, it shouldn't, because there is a notion of buffers getting a fair
>> chance to prove themselves.
>
> If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock
> cycles/second, how does *any* buffer have a chance to prove itself?

There could be lots of ways. I thought about representing that more
directly. I don't think that it's useful to have a large number of
revolutions in search of a victim under any circumstances.
Fundamentally, you're asking "what if any scheme here leans too
heavily towards frequency?". That could certainly be a problem, as
I've said, and we could think about adaptation over heuristics, as
I've said, but it is very obviously a big problem that clock sweep
doesn't really care about frequency one bit right now.

Why should I be the one with all the answers? Aren't you interested in
the significance of the patch, and the test case?

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
Hi,

On 2014-04-16 02:57:54 -0700, Peter Geoghegan wrote:
> Why should I be the one with all the answers?

Who said you need to be? The only thing I am saying is that I don't
agree with some of your suggestions.

I only responded to the thread now because downthread (in
CAM3SWZQa2OAVUrfPL-df=we1sMozKBR392SW_NoVuKZEPXhu9w@mail.gmail.com) you
further argued for using the timestamp - which I think is a flawed
concept. So I thought it'd be fair to argue against it now, rather than
later.

> Aren't you interested in the significance of the patch, and the test case?

Not particularly in the specifics to be honest. The tradeoffs of the
techniques you used in there seem prohibitive to me. It's easy to make
individual cases faster by sacrificing others.
Sometimes it's useful to prototype solutions while narrowing the scope
for evaluation to get faster feedback, but I don't see these solutions
as being applicable in the general case...

I think it's very important to improve upon the current state. It's imo
one of postgres' biggest issues. But it's also far from trivial,
otherwise it'd be done already.

I *personally* don't think it's very likely that we can improve
significantly upon the current state as long as every process regularly
participates in the clock sweep. ISTM that prevents many more elaborate
techniques from being used (cache misses/bus traffic, locking). But that's
just gut feeling.
I also think there are bigger issues than the actual LRU/whatever
behaviour, namely the scalability issues around shared buffers making
both small and big s_b settings major bottlenecks. But that's just where
I have seen more issues personally.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 16, 2014 at 3:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
> It's possible that I've misunderstood what you mean here, but do you
> really think it's likely that everything will be hot, in the event of
> using something like what I've sketched here? I think it's an
> important measure against this general problem that buffers really
> earn the right to be considered hot, so to speak. With my prototype,
> in order for a buffer to become as hard to evict as possible, at a
> minimum it must be *continually* pinned for at least 30 seconds.
> That's actually a pretty tall order. Although, as I said, I wouldn't
> be surprised if it was worth making it possible for buffers to be even
> more difficult to evict than that. It should be extremely difficult to
> evict a root B-Tree page, and to a lesser extent inner pages even
> under a lot of cache pressure, for example. There are lots of
> workloads in which that can happen, and I have a hard time believing
> that it's ever worth evicting them, given the extraordinary difference in
> their utility as compared to a lot of other things. I can imagine a
> huge barrier against evicting what is actually a relatively tiny
> number of pages being worth it.

I'm making a general statement about a property that I think a buffer
eviction algorithm ought to have.  I actually didn't say anything
about the algorithm you've chosen one way or the other.  Obviously,
you've built in some protections against everything becoming hot, and
that's a good thing as far as it goes.  But you also have a greatly
increased risk of everything becoming cold.  All you need is a rate of
buffer eviction that circles shared_buffers more often than once every
3 seconds, and everything will gradually cool down until you once
again can't distinguish which stuff is hot from which stuff isn't.

> I don't want to dismiss what you're saying about heating and cooling
> being unrelated, but I don't find the conclusion that not everything
> can be hot obvious. Maybe "heat" should be relative rather than
> absolute, and maybe that's actually what you meant. There is surely
> some workload where buffer access actually is perfectly uniform, and
> what do you do there? What "temperature" are those buffers?

Obviously, some value lower than the maximum and higher than the
minimum.  If they're all at max temperature and then a new buffer (a
btree root or vm page, for example) comes along and is much hotter,
there's no room on the scale left to express that.  If they're all at
min temperature and then a new buffer comes along that is just used
once and thrown out, there's no room left on the scale for that buffer
to emerge as a good candidate for eviction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Ants Aasma
Date:
On Wed, Apr 16, 2014 at 7:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that the basic problem here is that usage counts increase when
> buffers are referenced, but they decrease when buffers are evicted,
> and those two things are not in any necessary way connected to each
> other.  In particular, if no eviction is happening, reference counts
> will converge to the maximum value.  I've read a few papers about
> algorithms that attempt to segregate the list of buffers into "hot"
> and "cold" lists, and an important property of such algorithms is that
> they mustn't be allowed to make everything hot.  It's easy to be too
> simplistic, here: an algorithm that requires that no more than half
> the list be hot will fall over badly on a workload where the working
> set exceeds the available cache and the really hot portion of the
> working set is 60% of the available cache.  So you need a more
> sophisticated algorithm than that.  But that core property that not
> all buffers can be hot must somehow be preserved, and our algorithm
> doesn't.

FWIW in CLOCK-Pro segregating buffers between hot and cold is tied to
eviction and the clock sweep, the ratio between hot and cold is
dynamically adapted based on prior experience. The main downside is
that it seems to require an indirection somewhere either in the clock
sweep or buffer lookup. Maybe it's possible to avoid that with some
clever engineering if we think hard enough.

CLOCK-Pro may also have too little memory of hotness, making it too
easy to blow the whole cache away with a burst of activity. It may be
useful to have a (possibly tunable) notion of fairness where one
query/backend can't take over the cache even though it may be an
overall win in terms of total number of I/Os performed. Maybe we need
to invent Generalized CLOCK-Pro with a larger number of levels,
ranging from cold, hot and scalding to infernal. :)

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Tue, Apr 15, 2014 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 16, 2014 at 5:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote:
>>> There's a paper on a non blocking GCLOCK algorithm, that does lock
>>> free clock sweep and buffer pinning[7]. If we decide to stay with
>>> GCLOCK it may be interesting, although I still believe that some
>>> variant of buffer nailing[8] is a better idea, my experience shows
>>> that most of the locking overhead is cache line bouncing ignoring the
>>> extreme cases where our naive spinlock implementation blows up.
>>
>> You might be right about that, but lets handle one problem at a time.
>> Who knows what the bottleneck will end up being if and when we address
>> the naivety around frequency? I want to better characterize that
>> problem first.
>
> Just to summarize you about the previous discussion and the
> improvements that we decided to do in this area based on feedback
> are as follows:
>
> 1. Bgwriter needs to be improved so that it can help in reducing
>     usage count and finding next victim buffer (run the clock sweep
>     and add buffers to the free list).
> 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist
>     are less.
> 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer
>     (a spinlock for the freelist, and an lwlock for the clock sweep).
>     Here we can try to make it lock free based on atomic ops as
>     well.
> 4. Bgwriter needs to be more aggressive, logic based on which it
>     calculates how many buffers it needs to process needs to be
>     improved.
> 5. Contention around buffer mapping locks.
> 6. Cacheline bouncing around the buffer header spinlocks, is there
>     anything we can do to reduce this?
> 7. Choose Optimistically used buffer in StrategyGetBuffer().
> 8. Don't bump the usage count every time buffer is pinned.

What about:  9. Don't wait on locked buffer in the clock sweep.

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote:
> > 1. Bgwriter needs to be improved so that it can help in reducing
> >     usage count and finding next victim buffer (run the clock sweep
> >     and add buffers to the free list).
> > 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist
> >     are less.
> > 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer
> >     (a spinlock for the freelist, and an lwlock for the clock sweep).
> >     Here we can try to make it lock free based on atomic ops as
> >     well.
> > 4. Bgwriter needs to be more aggressive, logic based on which it
> >     calculates how many buffers it needs to process needs to be
> >     improved.
> > 5. Contention around buffer mapping locks.
> > 6. Cacheline bouncing around the buffer header spinlocks, is there
> >     anything we can do to reduce this?
> > 7. Choose Optimistically used buffer in StrategyGetBuffer().
> > 8. Don't bump the usage count every time buffer is pinned.
> 
> What about:  9. Don't wait on locked buffer in the clock sweep.

I don't think we do that? Or are you referring to locked buffer headers?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Tue, Apr 15, 2014 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that the basic problem here is that usage counts increase when
> buffers are referenced, but they decrease when buffers are evicted,
> and those two things are not in any necessary way connected to each
> other.  In particular, if no eviction is happening, reference counts
> will converge to the maximum value.  I've read a few papers about
> algorithms that attempt to segregate the list of buffers into "hot"
> and "cold" lists, and an important property of such algorithms is that
> they mustn't be allowed to make everything hot.  It's easy to be too
> simplistic, here: an algorithm that requires that no more than half
> the list be hot will fall over badly on a workload where the working
> set exceeds the available cache and the really hot portion of the
> working set is 60% of the available cache.  So you need a more
> sophisticated algorithm than that.  But that core property that not
> all buffers can be hot must somehow be preserved, and our algorithm
> doesn't.

A while back you sketched out an idea that did something like that:
hotly accessed buffers became 'perma-pinned' such that they no longer
participated in the clock sweep for eviction and there was a side-line
process that did a two stage eviction (IIRC) from the super hot stack
in order to mitigate locking.  This idea had a couple of nice
properties:

1) very hot buffers no longer get refcounted, reducing spinlock
contention (which has been documented in real world workloads)
2) eviction loop shrinks.  although you still have to check the 'very
hot' flag, that's an unlocked check (again, IIRC) and no further
processing is done.

The downside of this approach was complexity and difficult-to-test-for
edge cases.  I would like to point out though that while i/o
efficiency gains are nice, I think contention issues are the bigger
fish to fry.

On Mon, Apr 14, 2014 at 12:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> 1) Throttles incrementation of usage_count temporally. It becomes
> impossible to increment usage_count for any given buffer more
> frequently than every 3 seconds, while decrementing usage_count is
> totally unaffected.

hm, that's expensive.  how about a heuristic based on the number of
buffer allocations and the size of the buffer pool?

On Wed, Apr 16, 2014 at 8:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote:
>> What about:  9. Don't wait on locked buffer in the clock sweep.
>
> I don't think we do that? Or are you referring to locked buffer headers?

Right -- exactly.  I posted a patch for this a while back. It's quite
trivial: implement a trylock variant of the buffer header lock macro
and further guard the check with a non-locking test (which TAS()
already does generally, but the idea is to avoid the cache line lock
in likely cases of contention).  I believe this to be unambiguously
better: even if it's self healing or unlikely, there is no good reason
to jump into a spinlock fray or even request a contended cache line
while holding a critical lock.
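
To make the shape of that concrete, here is a minimal standalone sketch
using C11 atomics as a stand-in for the spinlock macros (BufHdr and its
fields are hypothetical, not the real buffer header):

    #include <stdbool.h>
    #include <stdatomic.h>

    /* Hypothetical, simplified buffer header; the real one has more fields. */
    typedef struct BufHdr
    {
        atomic_bool locked;         /* stand-in for the header spinlock */
        int         usage_count;
        int         refcount;
    } BufHdr;

    /*
     * Try to lock the header without waiting.  The relaxed load first avoids
     * requesting the cache line in exclusive mode when someone else clearly
     * holds the lock.
     */
    static bool
    buf_hdr_trylock(BufHdr *buf)
    {
        bool    expected = false;

        if (atomic_load_explicit(&buf->locked, memory_order_relaxed))
            return false;           /* non-locking pre-check: looks busy, skip */

        return atomic_compare_exchange_strong(&buf->locked, &expected, true);
    }

    /* In the sweep: skip buffers whose header we can't lock immediately. */
    static bool
    sweep_consider(BufHdr *buf)
    {
        if (!buf_hdr_trylock(buf))
            return false;           /* contended; just advance to the next buffer */

        bool    evictable = (buf->refcount == 0 && buf->usage_count == 0);

        if (!evictable && buf->usage_count > 0)
            buf->usage_count--;

        atomic_store_explicit(&buf->locked, false, memory_order_release);
        return evictable;
    }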

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-16 08:25:23 -0500, Merlin Moncure wrote:
> The downside of this approach was complexity and difficult-to-test-for
> edge cases.  I would like to point out though that while i/o
> efficiency gains are nice, I think contention issues are the bigger
> fish to fry.

That's my feeling as well.

> 
> On Wed, Apr 16, 2014 at 8:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote:
> >> What about:  9. Don't wait on locked buffer in the clock sweep.
> >
> > I don't think we do that? Or are you referring to locked buffer headers?
> 
> Right -- exactly.  I posted a patch for this a while back. It's quite
> trivial: implement a trylock variant of the buffer header lock macro
> and further guard the check with a non-locking test (which TAS()
> already does generally, but the idea is to avoid the cache line lock
> in likely cases of contention).  I believe this to be unambiguously
> better: even if it's self healing or unlikely, there is no good reason
> to jump into a spinlock fray or even request a contended cache line
> while holding a critical lock.

IIRC you had problems proving the benefits of that, right?

I think that's because the locking times of buffer headers are short
enough that it's really unlikely to read a locked buffer header
spinlock. The spinlock acquisition will have made the locker the
exclusive owner of the spinlock in the majority of cases, and as soon as
that happens the cache miss/transfer will take far longer than the lock
is held.

I think this is the wrong level to optimize things. Imo there's two
possible solutions (that don't exclude each other):

* perform the clock sweep in one process so there's a very fast way to
  get to a free buffer. Possibly in a partitioned way.

* Don't take a global exclusive lock while performing the clock
  sweep. Instead increase StrategyControl->nextVictimBuffer in chunks
  under an exclusive lock, and then scan the potential victim buffers in
  those chunks without a global lock held.
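
Roughly like this minimal standalone sketch of the second point (names and
sizes are made up, and a bare atomic stands in for the short
exclusive-lock section):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdatomic.h>

    #define NBUFFERS    16384       /* hypothetical pool size */
    #define SWEEP_CHUNK 16          /* clock positions claimed at a time */

    /* Stand-in for StrategyControl->nextVictimBuffer; here a bare atomic. */
    static _Atomic uint64_t next_victim;

    /* Claim a chunk of clock positions; the only globally shared step. */
    static uint64_t
    claim_sweep_chunk(void)
    {
        return atomic_fetch_add(&next_victim, SWEEP_CHUNK);
    }

    /* Scan the claimed positions privately, with no global lock held. */
    static int
    find_victim(bool (*try_evict)(int bufid))
    {
        for (;;)
        {
            uint64_t    start = claim_sweep_chunk();

            for (int i = 0; i < SWEEP_CHUNK; i++)
            {
                int     bufid = (int) ((start + i) % NBUFFERS);

                if (try_evict(bufid))
                    return bufid;
            }
        }
    }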
 

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 16, 2014 at 9:35 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I think this is the wrong level to optimize things. Imo there's two
> possible solutions (that don't exclude each other):
>
> * perform the clock sweep in one process so there's a very fast way to
>   get to a free buffer. Possibly in a partitioned way.
>
> * Don't take a global exclusive lock while performing the clock
>   sweep. Instead increase StrategyControl->nextVictimBuffer in chunks
>   under an exclusive lock, and then scan the potential victim buffers in
>   those chunks without a global lock held.

I definitely agree with both of these ideas.  But isn't it sort of
off-topic for this thread?  There are two issues here:

1. Improving the rate at which we can evict buffers, which is what
you're talking about here.

2. Improving the choice of which buffers we evict, which is what
Peter's talking about, or at least what I think he's talking about.

Those things are both important, but they're different, and I'm not
sure that working on one precludes working on the other.  There's
certainly the potential for overlap, but not necessarily.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-16 10:29:29 -0400, Robert Haas wrote:
> On Wed, Apr 16, 2014 at 9:35 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think this is the wrong level to optimize things. Imo there's two
> > possible solutions (that don't exclude each other):
> >
> > * perform the clock sweep in one process so there's a very fast way to
> >   get to a free buffer. Possibly in a partitioned way.
> >
> > * Don't take a global exclusive lock while performing the clock
> >   sweep. Instead increase StrategyControl->nextVictimBuffer in chunks
> >   under an exclusive lock, and then scan the potential victim buffers in
> >   those chunks without a global lock held.
> 
> I definitely agree with both of these ideas.  But isn't it sort of
> off-topic for this thread?

Yes, I agree it's somewhat offtopic - I only started on it (I think)
because Merlin commented on it. But I also agree with Merlin's that
comment at the moment that the scalability issues (concurrency and size
of shared buffers). If you can't use a large enough s_b to contain a
significant portion of your workload, you're relying on the OS cache
anyway.

> 1. Improving the rate at which we can evict buffers, which is what
> you're talking about here.
> 
> 2. Improving the choice of which buffers we evict, which is what
> Peter's talking about, or at least what I think he's talking about.
> 
> Those things are both important, but they're different, and I'm not
> sure that working on one precludes working on the other.  There's
> certainly the potential for overlap, but not necessarily.

I don't think that they necessarily preclude each other
either. But my gut feeling tells me that it'll be hard to have
interesting algorithmic improvements on the buffer eviction choice
because any additional complexity around that will have prohibitively
high scalability impacts due to the coarse locking.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
Hi,


Stephen flagged an ENOPARSE:

On 2014-04-16 16:49:55 +0200, Andres Freund wrote:
> But I also agree with Merlin's that comment at the moment that the
> scalability issues (concurrency and size of shared buffers).

That should have been:

But I also agree with Merlin's comment that at the moment the
scalability issues are bigger than the cache eviction choices.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 16, 2014 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> 1. Improving the rate at which we can evict buffers, which is what
>> you're talking about here.
>>
>> 2. Improving the choice of which buffers we evict, which is what
>> Peter's talking about, or at least what I think he's talking about.
>>
>> Those things are both important, but they're different, and I'm not
>> sure that working on one precludes working on the other.  There's
>> certainly the potential for overlap, but not necessarily.
>
> I don't think that they necessarily preclude each other
> either. But my gut feeling tells me that it'll be hard to have
> interesting algorithmic improvements on the buffer eviction choice
> because any additional complexity around that will have prohibitively
> high scalability impacts due to the coarse locking.

Doesn't that amount to giving up?  I mean, I'm not optimistic about
the particular approach Peter's chosen here being practical for the
reasons that you and I already articulated.  But I don't think that
means there *isn't* a viable approach; and I think Peter's test
results demonstrate that the additional complexity of a better
algorithm can more than pay for itself.  That's a pretty important
point to keep in mind.

Also, I think the scalability problems around buffer eviction are
eminently solvable, and in particular I'm hopeful that Amit is going
to succeed in solving them.  Suppose we have a background process
(whether the background writer or some other) that runs the clock
sweep, identifies good candidates for eviction, and pushes them on a
set of, say, 16 free-lists protected by spinlocks.  (The optimal
number of free-lists probably depends on the size of shared_buffers.)
Backends try to reclaim by popping buffers off of one of these
free-lists and double-checking whether the page is still a good
candidate for eviction (i.e. it's still clean and unpinned).  If the
free list is running low, they kick the background process via a latch
to make sure it's awake and working to free up more stuff, and if
necessary, advance the clock sweep themselves.  This can even be done
by multiple processes at once, if we adopt your idea of advancing the
clock sweep hand by N buffers at a time and then scanning them
afterwards without any global lock.  In such a world, it's still not
permissible for reclaim calculations to be super-complex, but you hope
that most of the activity is happening in the background process, so
cycle-shaving becomes less critical.
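
To sketch the backend side of that in standalone form (pthread mutexes
stand in for spinlocks; all names and constants here are invented):

    #include <stdbool.h>
    #include <pthread.h>

    #define NUM_FREELISTS   16
    #define FREELIST_DEPTH  64
    #define LOW_WATERMARK   8

    /* Hypothetical per-partition freelist; a real version would use spinlocks. */
    typedef struct FreeList
    {
        pthread_mutex_t lock;
        int             nitems;
        int             bufids[FREELIST_DEPTH];
    } FreeList;

    static FreeList freelists[NUM_FREELISTS];

    static void
    init_freelists(void)
    {
        for (int i = 0; i < NUM_FREELISTS; i++)
            pthread_mutex_init(&freelists[i].lock, NULL);
    }

    /* Stand-ins for the real re-check and the SetLatch-style wakeup. */
    static bool
    still_good_victim(int bufid)
    {
        (void) bufid;
        return true;                /* really: is it still clean and unpinned? */
    }

    static void
    wake_sweeper(void)
    {
        /* really: set the background sweeper's latch */
    }

    /* Backend side: pop a candidate from "my" partition and re-validate it. */
    static int
    reclaim_buffer(int my_partition)
    {
        FreeList   *fl = &freelists[my_partition % NUM_FREELISTS];
        int         bufid = -1;

        pthread_mutex_lock(&fl->lock);
        while (fl->nitems > 0)
        {
            int     candidate = fl->bufids[--fl->nitems];

            if (still_good_victim(candidate))
            {
                bufid = candidate;
                break;
            }
        }
        if (fl->nitems < LOW_WATERMARK)
            wake_sweeper();         /* background process refills the lists */
        pthread_mutex_unlock(&fl->lock);

        return bufid;               /* -1 means: advance the clock sweep ourselves */
    }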

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-16 11:28:04 -0400, Robert Haas wrote:
> On Wed, Apr 16, 2014 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> 1. Improving the rate at which we can evict buffers, which is what
> >> you're talking about here.
> >>
> >> 2. Improving the choice of which buffers we evict, which is what
> >> Peter's talking about, or at least what I think he's talking about.
> >>
> >> Those things are both important, but they're different, and I'm not
> >> sure that working on one precludes working on the other.  There's
> >> certainly the potential for overlap, but not necessarily.
> >
> > I don't think that that they neccessarily preclude each other
> > either. But my gut feeling tells me that it'll be hard to have
> > interesting algorithmic improvements on the buffer eviction choice
> > because any additional complexity around that will have prohibitively
> > high scalability impacts due to the coarse locking.
> 
> Doesn't that amount to giving up?  I mean, I'm not optimistic about
> the particular approach Peter's chosen here being practical for the
> reasons that you and I already articulated.  But I don't think that
> means there *isn't* a viable approach; and I think Peter's test
> results demonstrate that the additional complexity of a better
> algorithm can more than pay for itself.  That's a pretty important
> point to keep in mind.

Well, I think it could be a very good idea to invest more resources
(cpu, bus, memory) in buffer management - but doing so right *now* where
it's all done under one monolithic lock will have noticeable
consequences for many workloads. Spending more cycles per buffer won't
be very noticeable if it's not done under a gigantic lock - right now it
will be.

> [ reasonable proposal ].  In such a world, it's still not
> permissible for reclaim calculations to be super-complex, but you hope
> that most of the activity is happening in the background process, so
> cycle-shaving becomes less critical.

Yes, agreed.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Claudio Freire
Date:
On Wed, Apr 16, 2014 at 4:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> I don't want to dismiss what you're saying about heating and cooling
> being unrelated, but I don't find the conclusion that not everything
> can be hot obvious. Maybe "heat" should be relative rather than
> absolute, and maybe that's actually what you meant. There is surely
> some workload where buffer access actually is perfectly uniform, and
> what do you do there? What "temperature" are those buffers?

In that case, hotness, or retention priority, should be relative to
re-population cost.

IE: whether it's likely to still be in the OS cache or not, whether
it's dirty or not, etc.

> It occurs to me that within the prototype patch, even though
> usage_count is incremented in a vastly slower fashion (in a wall time
> sense), clock sweep doesn't take advantage of that. I should probably
> investigate having clock sweep become more aggressive in decrementing
> in response to realizing that it won't get some buffer's usage_count
> down to zero on the next revolution either. There are certainly
> problems with that, but they might be fixable. Within the patch, in
> order for it to be possible for the usage_count to be incremented in
> the interim, an average of 1.5 seconds must pass, so if clock sweep
> were to anticipate another no-set-to-zero revolution, it seems pretty
> likely that it would be exactly right, or if not then close enough,
> since it can only really fail to correct for some buffers getting
> incremented once more in the interim. Conceptually, it would be like
> multiple logical revolutions were merged into one actual one,
> sufficient to have the next revolution find a victim buffer.

Why use time at all? Why not synchronize usage bumpability to clock sweeps?

I'd use a simple bit that the clock sweep clears, and the users set.
Only one increase per sweep.

Or maybe use a decreasing loop count instead of a bit. In any case,
measuring "time" in terms of clock sweeps sounds like a better
proposition.
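
A minimal standalone sketch of that bit-per-sweep idea (Buf is a made-up
stand-in for the buffer descriptor):

    #include <stdbool.h>

    /* Hypothetical, simplified buffer descriptor. */
    typedef struct Buf
    {
        int     usage_count;
        bool    bumped_this_sweep;  /* set by pins, cleared when the hand passes */
    } Buf;

    /* On pin: at most one usage_count increase per clock revolution. */
    static void
    pin_bump(Buf *buf, int max_usage)
    {
        if (!buf->bumped_this_sweep)
        {
            if (buf->usage_count < max_usage)
                buf->usage_count++;
            buf->bumped_this_sweep = true;
        }
    }

    /* When the clock hand passes: decrement and re-arm the bit. */
    static bool
    sweep_visit(Buf *buf)
    {
        buf->bumped_this_sweep = false;
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
            return false;           /* not a victim yet */
        }
        return true;                /* usage_count == 0: candidate for eviction */
    }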



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Wed, Apr 16, 2014 at 4:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Aren't you interested in the significance of the patch, and the test case?
>
> Not particularly in the specifics to be honest. The tradeoffs of the
> techniques you used in there seem prohibitive to me. It's easy to make
> individual cases faster by sacrificing others.

You're the one poring over the specifics of what I've done, to my
consternation. I am not prepared to defend the patch at that level, as
I've made abundantly clear. I've called it a sketch, a proof of
concept half a dozen times already. I don't understand your difficulty
with that. I also don't understand how you can be so dismissive of the
benchmark, given the numbers involved. You're being unreasonable.

If I didn't write this patch, and I talked to people about this issue
at pgCon, I'm not sure that anyone would be convinced that it was a
problem, or at least that it was this much of a problem.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 16, 2014 at 1:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Apr 16, 2014 at 4:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> Aren't you interested in the significance of the patch, and the test case?
>>
>> Not particularly in the specifics to be honest. The tradeoffs of the
>> techniques you used in there seem prohibitive to me. It's easy to make
>> individual cases faster by sacrificing others.
>
> You're the one poring over the specifics of what I've done, to my
> consternation. I am not prepared to defend the patch at that level, as
> I've made abundantly clear. I've called it a sketch, a proof of
> concept half a dozen times already. I don't understand your difficulty
> with that. I also don't understand how you can be so dismissive of the
> benchmark, given the numbers involved. You're being unreasonable.

I don't think he's being unreasonable, and I don't understand why
you're getting bent out of shape about it.  You proposed a patch, he
articulated a problem, you don't want to fix it right now.  All of
which is fine.  Why the ad hominem accusations?

> If I didn't write this patch, and I talked to people about this issue
> at pgCon, I'm not sure that anyone would be convinced that it was a
> problem, or at least that it was this much of a problem.

I agree with that, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Wed, Apr 16, 2014 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think he's being unreasonable, and I don't understand why
> you're getting bent out of shape about it.  You proposed a patch, he
> articulated a problem, you don't want to fix it right now.  All of
> which is fine.  Why the ad hominem accusations?

I just think it's bad form to hold something like this to the same
standards as a formal commitfest submission. I am well aware that the
patch probably has several scalability issues.


-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Wed, Apr 16, 2014 at 1:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Apr 16, 2014 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think he's being unreasonable, and I don't understand why
>> you're getting bent out of shape about it.  You proposed a patch, he
>> articulated a problem, you don't want to fix it right now.  All of
>> which is fine.  Why the ad hominem accusations?
>
> I just think it's bad form to hold something like this to the same
> standards as a formal commitfest submission. I am well aware that the
> patch probably has several scalability issues.

In fairness to Andres, while *you* may know that issuing an expensive
syscall in a tight loop is on the list of Forbidden Things, a lot of
people don't and it's pretty reasonable to issue methodology
objections in order to get them documented.

Anyways, I'm still curious if you can post similar numbers basing the
throttling on gross allocation counts instead of time.  Meaning: some
number of buffer allocations has to have occurred before you consider
eviction.  Besides being faster I think it's a better implementation:
an intermittently loaded server will give more consistent behavior.

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> Anyways, I'm still curious if you can post similar numbers basing the
> throttling on gross allocation counts instead of time.  Meaning: some
> number of buffer allocations has to have occurred before you consider
> eviction.  Besides being faster I think it's a better implementation:
> an intermittently loaded server will give more consistent behavior.

Yeah --- I think wall-clock-based throttling is fundamentally the wrong
thing anyway.  Are we going to start needing a CPU speed measurement to
tune the algorithm with?  Not the place to be going.  But driving it off
the number of allocations that've been done could be sensible.  (OTOH,
that means you need a central counter, which itself would be a
bottleneck.)
        regards, tom lane



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 16, 2014 at 2:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Apr 16, 2014 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think he's being unreasonable, and I don't understand why
>> you're getting bent out of shape about it.  You proposed a patch, he
>> articulated a problem, you don't want to fix it right now.  All of
>> which is fine.  Why the ad hominem accusations?
>
> I just think it's bad form to hold something like this to the same
> standards as a formal commitfest submission. I am well aware that the
> patch probably has several scalability issues.

I don't agree.  I think it's perfectly appropriate to raise potential
issues at the earliest possible time.  People have regularly been
heard to complain in this forum that those objecting to a patch did
not object soon enough for them to make changes.  That's a hard
problem to fix because we can't force people whose salaries we're not
paying to attention to patches over whatever else they may have to do,
but we shouldn't label it as a bad thing when people choose to get
involved and provide feedback early.  Early feedback is exactly what
we want to encourage here.

And regardless of any of that, I think "person X is being
unreasonable" is a personal attack that has exactly zero place on this
mailing list.  We are here to talk about technology, not anyone's
character.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Wed, Apr 16, 2014 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> Anyways, I'm still curious if you can post similar numbers basing the
>> throttling on gross allocation counts instead of time.  Meaning: some
>> number of buffer allocations has to have occurred before you consider
>> eviction.  Besides being faster I think it's a better implementation:
>> an intermittently loaded server will give more consistent behavior.
>
> Yeah --- I think wall-clock-based throttling is fundamentally the wrong
> thing anyway.  Are we going to start needing a CPU speed measurement to
> tune the algorithm with?  Not the place to be going.  But driving it off
> the number of allocations that've been done could be sensible.  (OTOH,
> that means you need a central counter, which itself would be a
> bottleneck.)

sure -- note we already track that in BufferStrategyControl
(everything in buffer allocation is already centrally managed
essentially).
       /*
        * Statistics.  These counters should be wide enough that they can't
        * overflow during a single bgwriter cycle.
        */
       uint32          completePasses; /* Complete cycles of the clock sweep */
       uint32          numBufferAllocs;        /* Buffers allocated since last reset */
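
For illustration, a minimal standalone sketch of allocation-count-based
throttling (the interval and field names are invented, and a bare atomic
stands in for the shared-memory counter):

    #include <stdint.h>
    #include <stdatomic.h>

    #define BUMP_ALLOC_INTERVAL 256 /* made-up: re-bump only after this many allocations */
    #define MAX_USAGE_COUNT     5

    /* Stand-in for an allocation counter like numBufferAllocs. */
    static _Atomic uint32_t buffer_allocs;

    /* Hypothetical buffer descriptor carrying the alloc count at its last bump. */
    typedef struct Buf
    {
        int         usage_count;
        uint32_t    bump_alloc_no;
    } Buf;

    /* Called wherever a victim buffer is allocated for a new page. */
    static void
    note_allocation(void)
    {
        atomic_fetch_add(&buffer_allocs, 1);
    }

    /* On pin: bump only if enough allocations happened since the last bump. */
    static void
    maybe_bump(Buf *buf)
    {
        uint32_t    now = atomic_load(&buffer_allocs);

        /* unsigned subtraction copes with counter wraparound */
        if (now - buf->bump_alloc_no >= BUMP_ALLOC_INTERVAL &&
            buf->usage_count < MAX_USAGE_COUNT)
        {
            buf->usage_count++;
            buf->bump_alloc_no = now;
        }
    }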

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Wed, Apr 16, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't agree.  I think it's perfectly appropriate to raise potential
> issues at the earliest possible time.

If I didn't *strongly* emphasize my intent in writing the patch up
front, I'd certainly agree. I just don't see why what I've done cannot
be accepted in the spirit in which it was intended.

> And regardless of any of that, I think "person X is being
> unreasonable" is a personal attack that has exactly zero place on this
> mailing list.  We are here to talk about technology, not anyone's
> character.

Telling someone they're being unreasonable is not a personal attack.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Wed, Apr 16, 2014 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah --- I think wall-clock-based throttling is fundamentally the wrong
>> thing anyway.  Are we going to start needing a CPU speed measurement to
>> tune the algorithm with?  Not the place to be going.  But driving it off
>> the number of allocations that've been done could be sensible.  (OTOH,
>> that means you need a central counter, which itself would be a
>> bottleneck.)

> sure -- note we already track that in BufferStrategyControl
> (everything in buffer allocation is already centrally managed
> essentially).

Indeed, but I'd think getting rid of that property would be one of the
top priorities for any attempt to do anything at all in this area of
the code.
        regards, tom lane



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Wed, Apr 16, 2014 at 7:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 2. Improving the choice of which buffers we evict, which is what
> Peter's talking about, or at least what I think he's talking about.
>
> Those things are both important, but they're different, and I'm not
> sure that working on one precludes working on the other.  There's
> certainly the potential for overlap, but not necessarily.

That's certainly my primary area of interest here, but there may be
some overlap with other areas, or areas where cooperation turns out to
be appropriate. At the risk of stating the very obvious, if the
caching algorithm is making poor decisions about caching, and leaf
pages are continually swapped in and out of shared_buffers for no good
reason, that is likely to constrain the scalability of the buffer
manager. That could be very significant.

My immediate concern here is getting recognition of the importance of
weighing frequency of access in *some* way.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> My immediate concern here is getting recognition of the importance of
> weighing frequency of access in *some* way.

That's a completely content-free statement; certainly the existing
clock-sweep code is sensitive to frequency of access, as would be
any other algorithm we'd be likely to adopt.  It may well be that
we can do better than what we've got, but sweeping generalities
are unlikely to help us much.
        regards, tom lane



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Amit Kapila
Date:
On Wed, Apr 16, 2014 at 6:25 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Apr 15, 2014 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Just to summarize you about the previous discussion and the
>> improvements that we decided to do in this area based on feedback
>> are as follows:
>>
>> 1. Bgwriter needs to be improved so that it can help in reducing
>>     usage count and finding next victim buffer (run the clock sweep
>>     and add buffers to the free list).
>> 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist
>>     are less.
>> 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer
>>     (a spinlock for the freelist, and an lwlock for the clock sweep).
>>     Here we can try to make it lock free based on atomic ops as
>>     well.
>> 4. Bgwriter needs to be more aggressive, logic based on which it
>>     calculates how many buffers it needs to process needs to be
>>     improved.
>> 5. Contention around buffer mapping locks.
>> 6. Cacheline bouncing around the buffer header spinlocks, is there
>>     anything we can do to reduce this?
>> 7. Choose Optimistically used buffer in StrategyGetBuffer().
>> 8. Don't bump the usage count every time buffer is pinned.
>
> What about:  9. Don't wait on locked buffer in the clock sweep.

Right, I remember you have written a patch for it; I think we should
consider it along with point 6.  In general, I think unless we first
improve the situation for BufFreelistLock and the eviction strategy,
it might not show a benefit.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> This isn't a fundamental property of the usage-count idea; it's an
> artifact of the fact that usage count decreases are tied to eviction
> pressure rather than access pressure.  For example, suppose we made a
> rule that if the total usage counts of all buffers exceed 3 *
> NBuffers, then every time you bump the usage count of a buffer from N
> to N+1, you're required to advance the clock sweep far enough to
> decrease the reference count of a buffer by one.

This sounds like the right way to reason about it.

From what I remember in school the idea with the clock sweep is to set
the usage flags to the maximum whenever the buffer is used and
decrement (actually iirc typically shift right)  it when the clock
sweep goes by. Ie, simulate a LRU where when the buffer is accessed it
jumps to the head of the list and when the clock comes by it moves
gradually down the list.

What you're pointing out is that the clock might not come by very
often, resulting in everything being at the head of the list. In that case
I'm not clear it really matters what gets evicted though. And the cpu
effort of running the clock n times sounds bad but doing the work
earlier doesn't really change the amount of work being done, it just
amortizes it over more calls.

But if you want to do that it seems to me the way to do it is every
time a buffer is pinned set to the maximum and then run the clock
max_value - previous_value. So the total usage counts of all buffers
remains constant. If that results in contention one way to reduce it
is to do this probabilistically. Run the clock 1% of the time but run
it 100x as much as you would normally.
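
As a minimal standalone sketch of that (names are made up, and
advance_clock_by() is a stub standing in for the real sweep):

    #include <stdlib.h>

    #define MAX_USAGE_COUNT 5

    /* Hypothetical, simplified buffer descriptor. */
    typedef struct Buf
    {
        int     usage_count;
    } Buf;

    /* Stub: the real version would move the clock hand, decrementing counts. */
    static void
    advance_clock_by(int steps)
    {
        (void) steps;
    }

    /*
     * On pin: jump to the top, then pay the debt by pushing the hand forward,
     * so the sum of all usage counts stays roughly constant.  To cut down on
     * contention, do it only ~1% of the time but push 100x as far.
     */
    static void
    pin_and_balance(Buf *buf)
    {
        int     debt = MAX_USAGE_COUNT - buf->usage_count;

        buf->usage_count = MAX_USAGE_COUNT;

        if (debt > 0 && rand() % 100 == 0)
            advance_clock_by(debt * 100);
    }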

But I think you've misidentified the problem and what those other
algorithms are trying to solve. The problem is not that Postgres will
pick a bad buffer to evict. If all the buffers have been used since the
last time the clock came around then they're all "hot" anyways and it
doesn't really matter which one we evict. The problem is that we
expend an inordinate amount of work finding the few non-hot buffers.
When you have a really large amount of memory and 99.9% of it is hot
but 0.1% is whatever random non-hot page was needed last then there's
an obvious buffer to evict when you need a new one. But we spend a lot
of work decrementing every hot buffer's usage count 4 times only to
have them immediately incremented again just to find the 1 buffer
where the usage count was 4 or 3. The goal of these algorithms that
divide the buffers into groups is to avoid having to do so much work
to find the colder buffers. Once the hot buffers migrate to the hot
pool we only need to run the clock there when we find we have new hot
pages that we want to promote. All the thrashing in the cold pool can
be more efficient because there's many fewer pages to consider.

-- 
greg



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Thu, Apr 17, 2014 at 9:40 AM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This isn't a fundamental property of the usage-count idea; it's an
>> artifact of the fact that usage count decreases are tied to eviction
>> pressure rather than access pressure.  For example, suppose we made a
>> rule that if the total usage counts of all buffers exceed 3 *
>> NBuffers, then every time you bump the usage count of a buffer from N
>> to N+1, you're required to advance the clock sweep far enough to
>> decrease the reference count of a buffer by one.
>
> This sounds like the right way to reason about it.
>
> From what I remember in school the idea with the clock sweep is to set
> the usage flags to the maximum whenever the buffer is used and
> decrement (actually iirc typically shift right)  it when the clock
> sweep goes by. Ie, simulate a LRU where when the buffer is accessed it
> jumps to the head of the list and when the clock comes by it moves
> gradually down the list.
>
> What you're pointing out is that the clock might not come by very
> often, resulting in everything being at the head of the list. In that case
> I'm not clear it really matters what gets evicted though. And the cpu
> effort of running the clock n times sounds bad but doing the work
> earlier doesn't really change the amount of work being done, it just
> amortizes it over more calls.
>
> But if you want to do that it seems to me the way to do it is every
> time a buffer is pinned set to the maximum and then run the clock
> max_value - previous_value. So the total usage counts of all buffers
> remains constant. If that results in contention one way to reduce it
> is to do this probabilistically. Run the clock 1% of the time but run
> it 100x as much as you would normally.
>
> But I think you've misidentified the problem and what those other
> algorithms are trying to solve. The problem is not that Postgres will
> pick a bad buffer to evict. If all the buffers have been used since the
> last time the clock came around then they're all "hot" anyways and it
> doesn't really matter which one we evict. The problem is that we
> expend an inordinate amount of work finding the few non-hot buffers.
> When you have a really large amount of memory and 99.9% of it is hot
> but 0.1% is whatever random non-hot page was needed last then there's
> an obvious buffer to evict when you need a new one. But we spend a lot
> of work decrementing every hot buffer's usage count 4 times only to
> have them immediately incremented again just to find the 1 buffer
> where the usage count was 4 or 3. The goal of these algorithms that
> divide the buffers into groups is to avoid having to do so much work
> to find the colder buffers. Once the hot buffers migrate to the hot
> pool we only need to run the clock there when we find we have new hot
> pages that we want to promote. All the thrashing in the cold pool can
> be more efficient because there's many fewer pages to consider.

Well, I think Peter has proved that PostgreSQL *will* pick a bad
buffer to evict.  The proof is that when he changed the choice of
buffer to evict, he got a significant performance improvement.

I also believe this to be the case on first principles and my own
experiments.  Suppose you have a workload that fits inside
shared_buffers.  All of the usage counts will converge to 5.  Then,
somebody accesses a table that is not cached, so something's got to be
evicted.  Because all the usage counts are the same, the eviction at
this point is completely indiscriminate.  We're just as likely to kick
out a btree root page or a visibility map page as we are to kick out a
random heap page, even though the former have probably been accessed
several orders of magnitude more often.  That's clearly bad.  On
systems that are not too heavily loaded it doesn't matter too much
because we just fault the page right back in from the OS pagecache.
But I've done pgbench runs where such decisions lead to long stalls,
because the page has to be brought back in from disk, and there's a
long I/O queue; or maybe just because the kernel thinks PostgreSQL is
issuing too many I/O requests and makes some of them wait to cool
things down.

Of course, the overhead of repeated clock sweeps to push down the
usage counts isn't a great thing either.  I'm not saying that isn't a
problem.  But I think bad decisions about what to evict are also a
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Bruce Momjian
Date:
On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote:
> I also believe this to be the case on first principles and my own
> experiments.  Suppose you have a workload that fits inside
> shared_buffers.  All of the usage counts will converge to 5.  Then,
> somebody accesses a table that is not cached, so something's got to be
> evicted.  Because all the usage counts are the same, the eviction at
> this point is completely indiscriminate.  We're just as likely to kick
> out a btree root page or a visibility map page as we are to kick out a
> random heap page, even though the former have probably been accessed
> several orders of magnitude more often.  That's clearly bad.  On
> systems that are not too heavily loaded it doesn't matter too much
> because we just fault the page right back in from the OS pagecache.
> But I've done pgbench runs where such decisions lead to long stalls,
> because the page has to be brought back in from disk, and there's a
> long I/O queue; or maybe just because the kernel thinks PostgreSQL is
> issuing too many I/O requests and makes some of them wait to cool
> things down.

I understand now.  If there is no memory pressure, every buffer gets the
max usage count, and when a new buffer comes in, it isn't the max so it
is swiftly removed until the clock sweep has time to decrement the old
buffers.  Decaying buffers when there is no memory pressure creates
additional overhead and gets into timing issues of when to decay.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote:
>> I also believe this to be the case on first principles and my own
>> experiments.  Suppose you have a workload that fits inside
>> shared_buffers.  All of the usage counts will converge to 5.  Then,
>> somebody accesses a table that is not cached, so something's got to be
>> evicted.  Because all the usage counts are the same, the eviction at
>> this point is completely indiscriminate.  We're just as likely to kick
>> out a btree root page or a visibility map page as we are to kick out a
>> random heap page, even though the former have probably been accessed
>> several orders of magnitude more often.  That's clearly bad.  On
>> systems that are not too heavily loaded it doesn't matter too much
>> because we just fault the page right back in from the OS pagecache.
>> But I've done pgbench runs where such decisions lead to long stalls,
>> because the page has to be brought back in from disk, and there's a
>> long I/O queue; or maybe just because the kernel thinks PostgreSQL is
>> issuing too many I/O requests and makes some of them wait to cool
>> things down.
>
> I understand now.  If there is no memory pressure, every buffer gets the
> max usage count, and when a new buffer comes in, it isn't the max so it
> is swiftly removed until the clock sweep has time to decrement the old
> buffers.  Decaying buffers when there is no memory pressure creates
> additional overhead and gets into timing issues of when to decay.

That can happen, but the real problem I was trying to get at is that
when all the buffers get up to max usage count, they all appear
equally important.  But in reality they're not.  So when we do start
evicting those long-resident buffers, it's essentially random which
one we kick out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Bruce Momjian
Date:
On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote:
> On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote:
> >> I also believe this to be the case on first principles and my own
> >> experiments.  Suppose you have a workload that fits inside
> >> shared_buffers.  All of the usage counts will converge to 5.  Then,
> >> somebody accesses a table that is not cached, so something's got to be
> >> evicted.  Because all the usage counts are the same, the eviction at
> >> this point is completely indiscriminate.  We're just as likely to kick
> >> out a btree root page or a visibility map page as we are to kick out a
> >> random heap page, even though the former have probably been accessed
> >> several orders of magnitude more often.  That's clearly bad.  On
> >> systems that are not too heavily loaded it doesn't matter too much
> >> because we just fault the page right back in from the OS pagecache.
> >> But I've done pgbench runs where such decisions lead to long stalls,
> >> because the page has to be brought back in from disk, and there's a
> >> long I/O queue; or maybe just because the kernel thinks PostgreSQL is
> >> issuing too many I/O requests and makes some of them wait to cool
> >> things down.
> >
> > I understand now.  If there is no memory pressure, every buffer gets the
> > max usage count, and when a new buffer comes in, it isn't the max so it
> > is swiftly removed until the clock sweep has time to decrement the old
> > buffers.  Decaying buffers when there is no memory pressure creates
> > additional overhead and gets into timing issues of when to decay.
> 
> That can happen, but the real problem I was trying to get at is that
> when all the buffers get up to max usage count, they all appear
> equally important.  But in reality they're not.  So when we do start
> evicting those long-resident buffers, it's essentially random which
> one we kick out.

True.  Ideally we would have some way to know that _all_ the buffers had
reached the maximum and kick off a sweep to decrement them all.  I am
unclear how we would do that.  One odd idea would be to have a global
counter that is incremented every time a buffer goes from 4 to 5 (max)
--- when the counter equals 50% of all buffers, do a clock sweep.  Of
course, then the counter becomes a bottleneck.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Thu, Apr 17, 2014 at 10:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> > I understand now.  If there is no memory pressure, every buffer gets the
>> > max usage count, and when a new buffer comes in, it isn't the max so it
>> > is swiftly removed until the clock sweep has time to decrement the old
>> > buffers.  Decaying buffers when there is no memory pressure creates
>> > additional overhead and gets into timing issues of when to decay.
>>
>> That can happen, but the real problem I was trying to get at is that
>> when all the buffers get up to max usage count, they all appear
>> equally important.  But in reality they're not.  So when we do start
>> evicting those long-resident buffers, it's essentially random which
>> one we kick out.
>
> True.  Ideally we would have some way to know that _all_ the buffers had
> reached the maximum and kick off a sweep to decrement them all.  I am
> unclear how we would do that.  One odd idea would be to have a global
> counter that is incremented every time a buffer goes from 4 to 5 (max)
> --- when the counter equals 50% of all buffers, do a clock sweep.  Of
> course, then the counter becomes a bottleneck.

Yeah, I think that's the right general line of thinking.  But it
doesn't have to be as coarse-grained as "do a whole clock sweep".  It
can be, you know, for every buffer that gets incremented from 4 to 5,
run the clock sweep far enough to decrement the usage count of some
other buffer by one.  That's similar to your idea but you can do it a
bit at a time rather than having to make a complete pass over
shared_buffers all at once.
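
To sketch what I mean (purely illustrative; these aren't the real bufmgr
structures, and all locking is omitted):

/*
 * Illustrative sketch only: every time a buffer's usage count saturates at
 * the maximum, advance a second "balancing" hand far enough to decrement one
 * other buffer.  Total popularity in the pool then stays roughly constant
 * instead of everything piling up at the maximum.  Names are hypothetical.
 */
#define MAX_USAGE_COUNT 5

typedef struct SketchBufferDesc
{
    int usage_count;
    /* pins, flags, locks elided */
} SketchBufferDesc;

extern SketchBufferDesc *SketchBuffers;
extern int      SketchNBuffers;

static int      balance_hand = 0;

static void
sketch_bump_usage(SketchBufferDesc *buf)
{
    if (buf->usage_count >= MAX_USAGE_COUNT)
        return;

    buf->usage_count++;
    if (buf->usage_count < MAX_USAGE_COUNT)
        return;

    /* We just saturated this buffer: pay for it by decrementing one other. */
    for (int scanned = 0; scanned < SketchNBuffers; scanned++)
    {
        SketchBufferDesc *other = &SketchBuffers[balance_hand];

        balance_hand = (balance_hand + 1) % SketchNBuffers;
        if (other != buf && other->usage_count > 0)
        {
            other->usage_count--;
            break;
        }
    }
}

Whether the balancing hand just becomes another contended hot spot is, of
course, the same worry as with the counter.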

Your other point, that the counter can become the bottleneck, is quite
right also and a major problem in this area.  I don't know how to
solve it right at the moment, but I'm hopeful that there may be a way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-17 10:48:15 -0400, Bruce Momjian wrote:
> On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote:
> > That can happen, but the real problem I was trying to get at is that
> > when all the buffers get up to max usage count, they all appear
> > equally important.  But in reality they're not.  So when we do start
> > evicting those long-resident buffers, it's essentially random which
> > one we kick out.
> 
> True.  Ideally we would have some way to know that _all_ the buffers had
> reached the maximum and kick off a sweep to decrement them all.  I am
> unclear how we would do that.  One odd idea would be to have a global
> counter that is incremented every time a buffer goes from 4 to 5 (max)
> --- when the counter equals 50% of all buffers, do a clock sweep.  Of
> course, then the counter becomes a bottleneck.

I have my doubts that we'll make the current scheme, where buffer
reclaim essentially is O(NBuffers), work much better, especially as CPU
cache effects make such large, high-frequency accesses really
expensive.
I think we need more drastic changes.

I am *not* suggesting that we do that, but I believe it'd be possible to
implement a full LRU and be faster than today in scenarios with
nontrivial amounts of shared buffers.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Because all the usage counts are the same, the eviction at
> this point is completely indiscriminate.  We're just as likely to kick
> out a btree root page or a visibility map page as we are to kick out a
> random heap page, even though the former have probably been accessed
> several orders of magnitude more often.  That's clearly bad.

That's not clear at all. In that circumstance, regardless of what page
you evict, you're incurring precisely one page fault i/o when the page
is read back in. Incurring that i/o is bad but it's unavoidable and
it's the same badness regardless of what page it's for. The only way
to prefer one page over another is if one of them won't be needed for
long enough that its buffer can usefully cache this new page (or
mixture of pages) for multiple accesses. If you can't do that then it
doesn't matter which buffer you use, since it'll just be evicted to
read the original page back in again.



-- 
greg



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
On Tue, Apr 15, 2014 at 7:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Frankly, there doesn't need to be any research on this, because it's
> just common sense that probabilistically, leaf pages are much more
> useful than heap pages in servicing index scan queries if we assume a
> uniform distribution. If we don't assume that, then they're still more
> useful on average.

I don't think "common sense" is compelling. I think you need to pin
down exactly what it is about btree intermediate pages that the LRU
isn't capturing and not just argue they're more useful. The LRU is
already capturing which pages are more heavily used than others so you
need to identify what it is that makes index pages *even more* useful
than their frequency and recency of access indicates. Not just that
they're more useful than an average page.

So what I think is missing is that indexes are always accessed from
the root down to the leaf. So the most recent page accessed will
always be the leaf. And in whatever chain of pages was used to reach
the last leaf page, the least recently accessed will always be the
root. But we'll need the root page again on the subsequent descent,
even if it's to reach the same leaf page we kept in RAM in preference
to it.

Now it doesn't *always* make sense to keep an intermediate page over
leaf pages. Imagine an index that we always do full traversals of.
We'll always descend from the root down the left-most pages and then
follow the right pointers across. All the other intermediate pages
will be cold. If we do an occasional descent probing for other keys
those leaf pages shouldn't be cached since they won't be needed again
for the common full index traversals and the next occasional probe
will probably be looking for different keys.

But if we're often probing for the same keys the last thing we want to
do is throw away one of the intermediate pages for those keys when we
could throw away a leaf page. But that's what would happen in a strict
LRU.  It's almost like what we would really want to do is mark the
pages as least recently used in the opposite order from the order
they're actually accessed when descending. Or perhaps bump the usage
count to max+1 when it's an intermediate page so that it takes one
extra cycle of decrementing before it's considered old compared to a
leaf page.
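
As a rough sketch of that last idea (the flag is invented for illustration;
nothing in the buffer headers knows about inner pages today):

/*
 * Rough sketch: let inner/root pages saturate one notch above leaf pages, so
 * they survive one extra pass of the clock hand when touched equally often.
 * The is_btree_inner flag is hypothetical; the sweep side stays unchanged,
 * it simply takes one more decrement before an inner page reaches zero.
 */
#define MAX_USAGE_COUNT   5
#define INNER_PAGE_BONUS  1

typedef struct SketchBuf
{
    int usage_count;
} SketchBuf;

static void
sketch_touch(SketchBuf *buf, int is_btree_inner)
{
    int ceiling = MAX_USAGE_COUNT + (is_btree_inner ? INNER_PAGE_BONUS : 0);

    if (buf->usage_count < ceiling)
        buf->usage_count++;
}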


-- 
greg



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> several orders of magnitude more often.  That's clearly bad.  On
> systems that are not too heavily loaded it doesn't matter too much
> because we just fault the page right back in from the OS pagecache.

Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
long enough, the kernel will happily evict it as 'cold' from *its*
cache, leading to...

> But I've done pgbench runs where such decisions lead to long stalls,
> because the page has to be brought back in from disk, and there's a
> long I/O queue; or maybe just because the kernel thinks PostgreSQL is
> issuing too many I/O requests and makes some of them wait to cool
> things down.

Exactly this.

> Of course, the overhead of repeated clock sweeps to push down the
> usage counts isn't a great thing either.  I'm not saying that isn't a
> problem.  But I think bad decisions about what to evict are also a
> problem.

Using a bit more CPU here and there, particularly if it's done in a
background worker, or ideally multiple background workers (one for each
buffer pool), would be much better than evicting a hot page that isn't
in the kernel's cache either, because we've held on to it long enough
that the kernel thinks it's cold.
Thanks,
    Stephen

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
> long enough, the kernel will happily evict it as 'cold' from *its*
> cache, leading to...


This is a whole nother problem.

It is worrisome that we could be benchmarking the page replacement
algorithm in Postgres and choose a page replacement algorithm that
performs well because it tends to evict pages that
are in the OS cache. And then one day (hopefully not too far off)
we'll fix the double buffering problem and end up with a strange
choice of page replacement algorithm.

It also means that every benchmark is super sensitive to how large
a fraction of system memory Postgres is managing. A benchmark of a
page replacement algorithm with 3GB shared buffers might perform well
compared to others on a system with 8GB or 32GB total RAM but actually
be choosing pages very poorly in normal terms, and perform terribly on
a system with 4GB total RAM.

Ideally what I would like to see is instrumentation of Postgres's
buffer pinning so we can generate various test loads and then just run
the different algorithms on them and measure precisely how many page
evictions it's causing, how often it's choosing pages that
need to be read in soon after, and so on. We shouldn't have to run
Postgres to get these counts at all, just run the algorithm as we read
through a text file (or database table) listing the pages being
accessed.
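
To show how little machinery that takes, here's a stand-alone sketch that
replays such a trace (one block number per line on stdin) against a plain
CLOCK policy with a saturating usage count and reports hits and misses; the
pool size is made up and nothing here is PostgreSQL code:

/*
 * Stand-alone sketch: replay a trace of page accesses against a simple
 * CLOCK policy and report hits and misses.  Only an illustration of the
 * replay idea, not a model of PostgreSQL's buffer manager.
 */
#include <stdio.h>

#define POOL_SIZE  1024         /* simulated shared_buffers, in pages */
#define MAX_USAGE  5

typedef struct
{
    long    page;               /* block number held by this slot, -1 if empty */
    int     usage;              /* saturating usage count */
} Slot;

int
main(void)
{
    Slot    pool[POOL_SIZE];
    int     hand = 0;
    long    page, hits = 0, misses = 0;

    for (int i = 0; i < POOL_SIZE; i++)
    {
        pool[i].page = -1;
        pool[i].usage = 0;
    }

    while (scanf("%ld", &page) == 1)
    {
        int     found = -1;

        /* linear lookup keeps the sketch simple; a real tool would hash */
        for (int i = 0; i < POOL_SIZE; i++)
        {
            if (pool[i].page == page)
            {
                found = i;
                break;
            }
        }

        if (found >= 0)
        {
            hits++;
            if (pool[found].usage < MAX_USAGE)
                pool[found].usage++;
            continue;
        }

        /* miss: run the clock until we find an empty or fully decayed slot */
        misses++;
        for (;;)
        {
            Slot   *s = &pool[hand];

            hand = (hand + 1) % POOL_SIZE;
            if (s->page == -1 || s->usage == 0)
            {
                s->page = page;
                s->usage = 1;
                break;
            }
            s->usage--;
        }
    }

    printf("hits=%ld misses=%ld\n", hits, misses);
    return 0;
}

Swapping in LRU, LRU-K, 2Q and so on and diffing the miss counts on the same
trace is then mechanical.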

-- 
greg



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 9:21 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> several orders of magnitude more often.  That's clearly bad.  On
>> systems that are not too heavily loaded it doesn't matter too much
>> because we just fault the page right back in from the OS pagecache.
>
> Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
> long enough, the kernel will happily evict it as 'cold' from *its*
> cache, leading to...
>
>> But I've done pgbench runs where such decisions lead to long stalls,
>> because the page has to be brought back in from disk, and there's a
>> long I/O queue; or maybe just because the kernel thinks PostgreSQL is
>> issuing too many I/O requests and makes some of them wait to cool
>> things down.
>
> Exactly this.

Yes, I believe that's why this is so effective.


-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:
* Greg Stark (stark@mit.edu) wrote:
> On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
> > long enough, the kernel will happily evict it as 'cold' from *its*
> > cache, leading to...
>
> This is a whole nother problem.
>
> It is worrisome that we could be benchmarking the page replacement
> algorithm in Postgres and choose a page replacement algorithm that
> chooses pages that performs well because it tends to evict pages that
> are in the OS cache. And then one day (hopefully not too far off)
> we'll fix the double buffering problem and end up with a strange
> choice of page replacement algorithm.

That's certainly possible but I don't see the double buffering problem
going away any time particularly soon and, even if it does, it's likely
to either a) mean we're just using the kernel's cache (eg: something w/
mmap, etc), or b) involve so many other changes that this will end
up getting changed anyway.  In any case, while I think we should
document any such cache management system we employ as having this risk,
I don't think we should worry about it terribly much.

> It also means that every benchmark is super sensitive to the how large
> a fraction of system memory Postgres is managing. If A benchmark of a
> page replacement algorithm with 3GB shared buffers might perform well
> compared to others on a system with 8GB or 32GB total RAM but actually
> be choosing pages very poorly in normal terms and perform terribly on
> a system with 4GB total ram.

I'm not following you here- benchmarks are already sensitive to how much
of the system's memory PG is managing (and how much ends up being
*dedicated* to PG's cache and therefore unavailable for other work).

> Ideally what I would like to see is instrumentation of Postgres's
> buffer pinning so we can generate various test loads and then just run
> the different algorithms on them and measure precisely how many page
> evictions it's causing and when how often it's choosing pages that
> need to be read in soon after and so on. We shouldn't have to run
> Postgres to get these counts at all, just run the algorithm as we read
> through a text file (or database table) listing the pages being
> accessed.

Go for it.  I'd love to see that also.
Thanks,
    Stephen

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Heikki Linnakangas
Date:
On 04/17/2014 09:38 PM, Stephen Frost wrote:
> * Greg Stark (stark@mit.edu) wrote:
>> On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>> Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
>>> long enough, the kernel will happily evict it as 'cold' from *its*
>>> cache, leading to...
>>
>> This is a whole nother problem.
>>
>> It is worrisome that we could be benchmarking the page replacement
>> algorithm in Postgres and choose a page replacement algorithm that
>> chooses pages that performs well because it tends to evict pages that
>> are in the OS cache. And then one day (hopefully not too far off)
>> we'll fix the double buffering problem and end up with a strange
>> choice of page replacement algorithm.
>
> That's certainly possible but I don't see the double buffering problem
> going away any time particularly soon and, even if it does, it's likely
> to either a) mean we're just using the kernel's cache (eg: something w/
> mmap, etc), or b) will involve so many other changes that this will end
> up getting changed anyway.  In any case, while I think we should
> document any such cache management system we employ as having this risk,
> I don't think we should worry about it terribly much.

Note that if we somehow come up with a page replacement algorithm that 
tends to evict pages that are in the OS cache, we have effectively 
solved the double buffering problem. When a page is cached in both 
caches, evicting it from one of them eliminates the double buffering. 
Granted, you might prefer to evict it from the OS cache instead, and 
such an algorithm could be bad in other ways. But if a page replacement 
algorithm happens to avoid double buffering, that's a genuine merit for 
that algorithm.

- Heikki



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote:
> On 04/17/2014 09:38 PM, Stephen Frost wrote:
> >* Greg Stark (stark@mit.edu) wrote:
> >>On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >>>Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
> >>>long enough, the kernel will happily evict it as 'cold' from *its*
> >>>cache, leading to...
> >>
> >>This is a whole nother problem.
> >>
> >>It is worrisome that we could be benchmarking the page replacement
> >>algorithm in Postgres and choose a page replacement algorithm that
> >>chooses pages that performs well because it tends to evict pages that
> >>are in the OS cache. And then one day (hopefully not too far off)
> >>we'll fix the double buffering problem and end up with a strange
> >>choice of page replacement algorithm.
> >
> >That's certainly possible but I don't see the double buffering problem
> >going away any time particularly soon and, even if it does, it's likely
> >to either a) mean we're just using the kernel's cache (eg: something w/
> >mmap, etc), or b) will involve so many other changes that this will end
> >up getting changed anyway.  In any case, while I think we should
> >document any such cache management system we employ as having this risk,
> >I don't think we should worry about it terribly much.
> 
> Note that if we somehow come up with a page replacement algorithm that tends
> to evict pages that are in the OS cache, we have effectively solved the
> double buffering problem. When a page is cached in both caches, evicting it
> from one of them eliminates the double buffering. Granted, you might prefer
> to evict it from the OS cache instead, and such an algorithm could be bad in
> other ways. But if a page replacement algorithm happens avoid double
> buffering, that's a genuine merit for that algorithm.

I don't think it's a good idea to try to synchronize algorithms with the
OSes. The caching logic in e.g. Linux changes so much that it won't stay
effective for very long.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:
* Andres Freund (andres@2ndquadrant.com) wrote:
> > Note that if we somehow come up with a page replacement algorithm that tends
> > to evict pages that are in the OS cache, we have effectively solved the
> > double buffering problem. When a page is cached in both caches, evicting it
> > from one of them eliminates the double buffering. Granted, you might prefer
> > to evict it from the OS cache instead, and such an algorithm could be bad in
> > other ways. But if a page replacement algorithm happens avoid double
> > buffering, that's a genuine merit for that algorithm.
>
> I don't think it's a good idea to try to synchronize algorithms with the
> OSs. There's so much change about the caching logic in e.g. linux that
> it won't stay effective for very long.

There's also more than one OS...
Thanks,
    Stephen

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Thu, Apr 17, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote:
>> On 04/17/2014 09:38 PM, Stephen Frost wrote:
>> >* Greg Stark (stark@mit.edu) wrote:
>> >>On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> >>>Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
>> >>>long enough, the kernel will happily evict it as 'cold' from *its*
>> >>>cache, leading to...
>> >>
>> >>This is a whole nother problem.
>> >>
>> >>It is worrisome that we could be benchmarking the page replacement
>> >>algorithm in Postgres and choose a page replacement algorithm that
>> >>chooses pages that performs well because it tends to evict pages that
>> >>are in the OS cache. And then one day (hopefully not too far off)
>> >>we'll fix the double buffering problem and end up with a strange
>> >>choice of page replacement algorithm.
>> >
>> >That's certainly possible but I don't see the double buffering problem
>> >going away any time particularly soon and, even if it does, it's likely
>> >to either a) mean we're just using the kernel's cache (eg: something w/
>> >mmap, etc), or b) will involve so many other changes that this will end
>> >up getting changed anyway.  In any case, while I think we should
>> >document any such cache management system we employ as having this risk,
>> >I don't think we should worry about it terribly much.
>>
>> Note that if we somehow come up with a page replacement algorithm that tends
>> to evict pages that are in the OS cache, we have effectively solved the
>> double buffering problem. When a page is cached in both caches, evicting it
>> from one of them eliminates the double buffering. Granted, you might prefer
>> to evict it from the OS cache instead, and such an algorithm could be bad in
>> other ways. But if a page replacement algorithm happens avoid double
>> buffering, that's a genuine merit for that algorithm.
>
> I don't think it's a good idea to try to synchronize algorithms with the
> OSs. There's so much change about the caching logic in e.g. linux that
> it won't stay effective for very long.

No. But if you were very judicious, maybe you could hint to the o/s
(posix_fadvise) that pages likely to stay hot in shared buffers aren't
needed in its cache.

I doubt that's necessary though -- if the postgres caching algorithm
improves such that there is a better tendency for hot pages to stay in
s_b, eventually the O/S will deschedule the page for something else
that needs it.  In other words, otherwise preventable double
buffering is really a measurement of bad eviction policy, because it
manifests in volatility of frequently accessed pages.

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 11:53 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> No. but if you were very judicious, maybe you could hint the o/s
> (posix_fadvise) about pages that are likely to stay hot that you don't
> need them.

Mitsumasa KONDO wrote a patch like that. I don't think the results
were that promising, but things change quickly.


-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:
* Merlin Moncure (mmoncure@gmail.com) wrote:
> I doubt that's necessary though -- if the postgres caching algorithm
> improves such that there is a better tendency for hot pages to stay in
> s_b,  Eventually the O/S will deschedule the page for something else
> that needs it.   In other words, otherwise preventable double
> buffering is really a measurement of bad eviction policy because it
> manifests in volatility of frequency accessed pages.

I wonder if it would help to actually tell the OS to read in buffers
that we're *evicting*...  On the general notion that if the OS already
has them buffered then it's almost a no-op, and if it doesn't and it's
actually a 'hot' buffer that we're gonna need again shortly, the OS will
have it.

In other words, try to make the OS more like a secondary cache to ours
by encouraging it to cache things we're evicting.
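
Mechanically it would just be an asynchronous readahead hint issued as we
give the buffer up; a minimal sketch (the helper is hypothetical, only
posix_fadvise() itself is real):

/*
 * Sketch of "ask the kernel to (re)read the block we are about to evict",
 * so the OS page cache acts more like a victim cache behind shared_buffers.
 * POSIX_FADV_WILLNEED initiates readahead without blocking the caller.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>

#define BLCKSZ 8192

static void
hint_os_keep_block(int fd, long blocknum)
{
    (void) posix_fadvise(fd,
                         (off_t) blocknum * BLCKSZ,
                         BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

Whether that's a win obviously depends on the page actually being requested
again soon.
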
Thanks,        Stephen

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Thu, Apr 17, 2014 at 2:00 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Merlin Moncure (mmoncure@gmail.com) wrote:
>> I doubt that's necessary though -- if the postgres caching algorithm
>> improves such that there is a better tendency for hot pages to stay in
>> s_b,  Eventually the O/S will deschedule the page for something else
>> that needs it.   In other words, otherwise preventable double
>> buffering is really a measurement of bad eviction policy because it
>> manifests in volatility of frequency accessed pages.
>
> I wonder if it would help to actually tell the OS to read in buffers
> that we're *evicting*...  On the general notion that if the OS already
> has them buffered then it's almost a no-op, and if it doesn't and it's
> actually a 'hot' buffer that we're gonna need again shortly, the OS will
> have it.
>
> In other words, try to make the OS more like a secondary cache to ours
> by encouraging it to cache things we're evicting.
I don't think this would work unless we kept some kind of
tracking information on the page itself, which seems not worth a write
operation (maybe it could be snuck in if the page is dirtied anyway,
though...).  IOW, it would only make sense to do this if we knew
that this page was likely to be read in again.  This might be true in
general on particular workloads but is probably a pretty flimsy
assumption without supporting evidence; probably better to let the O/S
deal with it.

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> I wonder if it would help to actually tell the OS to read in buffers
> that we're *evicting*...  On the general notion that if the OS already
> has them buffered then it's almost a no-op, and if it doesn't and it's
> actually a 'hot' buffer that we're gonna need again shortly, the OS will
> have it.

But if it's actually gone cold, you're just forcing unnecessary read I/O,
not to mention possibly causing something slightly warmer to be lost from
kernel cache.
        regards, tom lane



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:
* Merlin Moncure (mmoncure@gmail.com) wrote:
>  I don't think this would work unless we would keep some kind of
> tracking information on the page itself which seems not worth a write
> operation to do (maybe if the page is dirtied it could be snuck in
> there though...).  IOW, it would only make sense to do this if we knew
> that this page was likely to be read in again.  This might be true in
> general on particular workloads but is probably a pretty flimsy
> assumption without supporting evidence; probably better to let the O/S
> deal with it.

The trouble is that we're ending up "hiding" the information from the OS
about the frequency of utilization of that page.  You have a good point
and we wouldn't want to do this for pages that are just accessed once or
similar, but perhaps just mark a page that's reached the 'max' as having
been 'hot' and then, for those pages, advise the OS that while we're
under pressure and need to push this page out, it was once pretty hotly
used and therefore we may want it again soon.

For pages that never reach the 'max' level, we wouldn't do anything on
the assumption that those were only temporarily needed.

Just some thoughts.
Thanks,
    Stephen

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I wonder if it would help to actually tell the OS to read in buffers
> > that we're *evicting*...  On the general notion that if the OS already
> > has them buffered then it's almost a no-op, and if it doesn't and it's
> > actually a 'hot' buffer that we're gonna need again shortly, the OS will
> > have it.
>
> But if it's actually gone cold, you're just forcing unnecessary read I/O,
> not to mention possibly causing something slightly warmer to be lost from
> kernel cache.

Certainly possible- see the email I just sent about another thought
around this.

Obviously, none of these thoughts are really fully formed solutions and
are, instead, just speculation and ideas.
Thanks,
    Stephen

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Thu, Apr 17, 2014 at 2:16 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Merlin Moncure (mmoncure@gmail.com) wrote:
>>  I don't think this would work unless we would keep some kind of
>> tracking information on the page itself which seems not worth a write
>> operation to do (maybe if the page is dirtied it could be snuck in
>> there though...).  IOW, it would only make sense to do this if we knew
>> that this page was likely to be read in again.  This might be true in
>> general on particular workloads but is probably a pretty flimsy
>> assumption without supporting evidence; probably better to let the O/S
>> deal with it.
>
> The trouble is that we're ending up "hiding" the information from the OS
> about the frequency of utilization of that page.  You have a good point
> and we wouldn't want to do this for pages that are just accessed once or
> similar, but perhaps just mark a page that's reached the 'max' as having
> been 'hot' and then, for those pages, advise the OS that while we're
> under pressure and need to push this page out, it was once pretty hottly
> used and therefore we may want it again soon.
>
> For pages that never reach the 'max' level, we wouldn't do anything on
> the assumption that those were only temporairly needed.

yeah -- the thing is, we are already too spendy on supplemental write
i/o (hint bits, visibility bits, freezing, etc) and it's likely not
worth it to throw something else on the pile unless the page is already
dirty; the medium term trend in storage is that read vs write
performance is becoming increasingly asymmetric, particularly on the
random side, so it's very unlikely to balance out.

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:


On Thursday, April 17, 2014, Merlin Moncure <mmoncure@gmail.com> wrote:
yeah -- the thing is, we are already too spendy already on
supplemental write i/o (hint bits, visible bits, freezing, etc) and
likely not worth it to throw something else on the pile unless the
page is already dirty; the medium term trend in storage is that read
vs write performance is becoming increasingly asymmetric, particularly
on the random side so it's very unlikely to balance out.

Guess I wasn't clear but I was thinking to read the page in, not do any writing, and do it asynchronously with respect to the process doing the evicting. 

Thanks,

Stephen 

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Thu, Apr 17, 2014 at 2:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
>
> On Thursday, April 17, 2014, Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>> yeah -- the thing is, we are already too spendy already on
>> supplemental write i/o (hint bits, visible bits, freezing, etc) and
>> likely not worth it to throw something else on the pile unless the
>> page is already dirty; the medium term trend in storage is that read
>> vs write performance is becoming increasingly asymmetric, particularly
>> on the random side so it's very unlikely to balance out.
>
> Guess I wasn't clear but I was thinking to read the page in, not do any
> writing, and do it in a asynchronous way to the process doing the evicting.

no -- I got you. My point was, that's a pure guess unless you base it
on evidence recorded on the page itself.  Without that evidence (which
requires writing), the operating system is in a better place to make
that guess, so it's probably better to defer that decision.

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Stephen Frost
Date:


On Thursday, April 17, 2014, Merlin Moncure <mmoncure@gmail.com> wrote:
no -- I got you. My point was, that's a pure guess unless you base it
on evidence recorded on the page itself.  Without that evidence (which
requires writing), the operating system is in a better place to make
that guess, so it's probably better to defer that decision.

Well, we'd only need that info to be stored in the buffer cache somehow -- it wouldn't have to go to disk or cause more I/O, of course. My thinking was that we could track it with the existing counter too, avoiding even that small amount of locking to write to the buffer page. 

Thanks,

Stephen 

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 8:10 AM, Greg Stark <stark@mit.edu> wrote:
> I don't think "common sense" is compelling. I think you need to pin
> down exactly what it is about btree intermediate pages that the LRU
> isn't capturing and not just argue they're more useful. The LRU is
> already capturing which pages are more heavily used than others so you
> need to identify what it is that makes index pages *even more* useful
> than their frequency and recency of access indicates. Not just that
> they're more useful than an average page.

See example 1.1 within the LRU-K paper.

> So what I think is missing is that indexes are always accessed from
> the root down to the leaf. So the most recent page accessed will
> always be the leaf. And in whatever chain of pages was used to reach
> the last leaf page the least recently accessed will always be the
> root. But we'll need the root page again on the subsequent descent
> even if it's to reach the same leaf page we kept in ram in preference
> to it.

I can't imagine that this is much of a problem in practice. Consider
the break-down of pages within indexes when pgbench scale is 5,000, as
in my original benchmark:

[local] pg@pgbench=# with tots as (
  SELECT count(*) c, type, relname from
    (select relname, relpages, generate_series(1, relpages - 1) i
     from pg_class c join pg_namespace n on c.relnamespace = n.oid
     where relkind = 'i' and nspname = 'public') r,
    lateral (select * from bt_page_stats(relname, i)) u
  group by relname, type)
select tots.relname, relpages - 1 as non_meta_pages, c,
       c/sum(c) over (partition by tots.relname) as prop_of_index, type
from tots join pg_class c on c.relname = tots.relname
order by 2 desc, 1, type;
        relname         | non_meta_pages |    c    |       prop_of_index        | type
------------------------+----------------+---------+----------------------------+------
 pgbench_accounts_pkey  |        1370950 |    4828 |     0.00352164557423684307 | i
 pgbench_accounts_pkey  |        1370950 | 1366121 |     0.99647762500455888253 | l
 pgbench_accounts_pkey  |        1370950 |       1 | 0.000000729421204274408257 | r
 pgbench_tellers_pkey   |            274 |     273 |     0.99635036496350364964 | l
 pgbench_tellers_pkey   |            274 |       1 |     0.00364963503649635036 | r
 pgbench_branches_pkey  |             28 |      27 |     0.96428571428571428571 | l
 pgbench_branches_pkey  |             28 |       1 |     0.03571428571428571429 | r
(7 rows)

Time: 14562.297 ms

Just over 99.6% of pages (leaving aside the meta page) in the big 10
GB pgbench_accounts_pkey index are leaf pages. The inner pages and
root page are at an enormous advantage. In this example, the other
indexes don't even have what would be separately classified as an
inner page (and not a root page) at all, because it's perfectly
sufficient to only have a root page to get to any one of, say, 273
leaf pages (in the case of pgbench_tellers_pkey here).

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Andres Freund
Date:
On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote:
> Just over 99.6% of pages (leaving aside the meta page) in the big 10
> GB pgbench_accounts_pkey index are leaf pages.

That's a rather nice number. I knew it was big, but I'd have guessed
it'd be a percent lower.

Do you happen to have the same stat handy for a sensibly wide text or
numeric real world index? It'd be interesting to see what the worst case
there is.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 1:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote:
>> Just over 99.6% of pages (leaving aside the meta page) in the big 10
>> GB pgbench_accounts_pkey index are leaf pages.
>
> That's a rather nice number. I knew it was big, but I'd have guessed
> it'd be a percent lower.

Yes, it's usually past 99.5% for int4. It's really bad if it's as low
as 96%, and I think that often points to what are arguably bad
indexing choices, like indexing text columns that have long text
strings.

> Do you happen to have the same stat handy for a sensibly wide text or
> numeric real world index? It'd be interesting to see what the worst case
> there is.

Yes, as it happens I do:
http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com

I was working off my Mouse Genome database, which is actually
real-world data used by medical researchers, stored in a PostgreSQL
database by those researchers and made available for the benefit of
other medical researchers.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 1:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I can't imagine that this is much of a problem in practice.

Although I will add that not caching highly useful inner pages for the
medium term, because that index isn't being used at all for 5 minutes,
is probably very bad. Using the 4,828 buffers that it would take to
store all the inner pages (as in my large primary index example) to go
store something else is probably penny wise and pound foolish.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Although I will add that not caching highly useful inner pages for the
> medium term, because that index isn't being used at all for 5 minutes
> probably is very bad. Using the 4,828 buffers that it would take to
> store all the inner pages (as in my large primary index example) to go
> store something else is probably penny wise and pound foolish.

But there could easily be 20 unused indexes for every 1 index that is
being used.



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 6:50 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Although I will add that not caching highly useful inner pages for the
>> medium term, because that index isn't being used at all for 5 minutes
>> probably is very bad. Using the 4,828 buffers that it would take to
>> store all the inner pages (as in my large primary index example) to go
>> store something else is probably penny wise and pound foolish.
>
> But there could easily be 20 unused indexes for every 1 index that is
> being used.

Sure, but then there might not be. Obviously there is a trade-off to
be made between recency and frequency. One interesting observation in
the LRU-K paper is that for their test case, a pure LFU actually works
very well, despite, as the authors acknowledge, being a terrible
algorithm in the real world. That's because their test case is so
simple, and concerns only one table/index, with a uniform
distribution.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Atri Sharma
Date:

On Fri, Apr 18, 2014 at 7:27 AM, Peter Geoghegan <pg@heroku.com> wrote:

A way I have in mind about eviction policy is to introduce a way to have an ageing factor in each buffer and take the ageing factor into consideration when evicting a buffer.

Consider a case where a table is pretty huge and spread across multiple pages. The querying pattern is like a time series pattern i.e. a set of rows is queried pretty frequently for some time, making the corresponding page hot. Then, the next set of rows is queried frequently making that page hot and so on.

Consider a new page entering the shared buffers with refcount 1 and usage_count 1. If that page is a part of the workload described above, it is likely that it shall not be used for a considerable amount of time after it has entered the buffers but will be used eventually.

Now, the current hypothetical situation is that we have three pages:

1) The page that used to be hot at the previous time window but is no longer hot and is actually the correct candidate for eviction.
2) The current hot page (it won't be evicted anyway for now).
3) The new page which just got in and should not be evicted since it can be hot soon (for this workload it will be hot in the next time window).

When the clock sweep runs the next time, it will see the new buffer page as the one to be evicted (since page (1) may still have usage_count > 0, i.e. it may be 'cooling' but not 'cool' yet).

This can be changed by introducing an ageing factor that sees how much time the current buffer has spent in shared buffers. If the time that the buffer has spent is large enough (relatively) and it is not hot currently, that means it has had its chance and can be evicted. This shall save the new page (3) from being evicted, since its time in shared buffers shall not be high enough to mandate eviction and it shall be given more chances.

Since gettimeofday() is an expensive call and hence cannot be done in the tight loop, we can count the number of clocksweeps the current buffer has seen (rather, survived). This shall give us a rough estimate of the relative age of the buffer.

When an eviction happens, all the candidates with refcount = 0 shall be taken. Then, among them, the one with the highest ageing factor shall be evicted.

Of course, there may be better ways of doing the same, but I want to highlight the point (or possibility) of introducing an ageing factor to prevent eviction of relatively younger pages early in the eviction process.

The overhead isn't too big. We just need to add another attribute in the buffer header for the number of clocksweeps seen (rather, survived) and check it when an eviction is taking place. The existing spinlock for buffer headers should be good enough to protect access against contention. The access rules can be similar to those of usage_count.
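
To make the bookkeeping concrete, a very rough sketch (field and function names are hypothetical, locking omitted) of stamping buffers with the sweep count and picking the oldest unpinned candidate:

/*
 * Rough sketch of the ageing idea: count completed clock sweeps as a logical
 * clock, stamp each buffer with the sweep number at which it entered shared
 * buffers, and, among buffers with refcount 0, prefer to evict the one that
 * has survived the most sweeps.
 */
typedef struct SketchBufDesc
{
    int             refcount;
    int             usage_count;
    unsigned long   entered_at_sweep;   /* stamped when the page is read in */
} SketchBufDesc;

static unsigned long completed_sweeps = 0;  /* bumped once per full revolution */

static SketchBufDesc *
sketch_choose_victim(SketchBufDesc *bufs, int nbufs)
{
    SketchBufDesc  *best = NULL;
    unsigned long   best_age = 0;

    for (int i = 0; i < nbufs; i++)
    {
        unsigned long age;

        if (bufs[i].refcount != 0)
            continue;                   /* pinned buffers are never candidates */

        age = completed_sweeps - bufs[i].entered_at_sweep;
        if (best == NULL || age > best_age)
        {
            best = &bufs[i];
            best_age = age;
        }
    }
    return best;                        /* NULL if everything is pinned */
}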

Thoughts and comments?

Regards,

Atri

--
Regards,
 
Atri
l'apprenant

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Thu, Apr 17, 2014 at 5:00 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Because all the usage counts are the same, the eviction at
>> this point is completely indiscriminate.  We're just as likely to kick
>> out a btree root page or a visibility map page as we are to kick out a
>> random heap page, even though the former have probably been accessed
>> several orders of magnitude more often.  That's clearly bad.
>
> That's not clear at all. In that circumstance regardless of what page
> you evict you're incurring precisely one page fault i/o when the page
> is read back in.

I am a bit confused by this remark.  In *any* circumstance when you
evict you're incurring precisely one page fault I/O when the page is
read back in.   That doesn't mean that the choice of which page to
evict is irrelevant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I am a bit confused by this remark.  In *any* circumstance when you
> evict you're incurring precisely one page fault I/O when the page is
> read back in.   That doesn't mean that the choice of which page to
> evict is irrelevant.

But you might be evicting a page that will be needed soon or one that
won't be needed for a while. If it's not needed for a while you might
be able to avoid many page evictions by caching a page that will be
used several times.

If all the pages currently in RAM are hot -- meaning they're hot
enough that they'll be needed again before the page you're reading in
-- then they're all equally bad to evict.

I'm trying to push us away from the gut instinct that frequently used
pages are important to cache and towards actually counting how many
i/os we're saving. In the extreme it's possible to simulate any cache
algorithm on a recorded list of page requests and count how many page
misses it generates to compare it with an optimal cache algorithm.

-- 
greg



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Bruce Momjian
Date:
On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote:
> This can be changed by introducing an ageing factor that sees how much time the
> current buffer has spend in shared buffers. If the time that the buffer has
> spent is large enough (relatively) and it is not hot currently, that means it
> has had its chance and can be evicted. This shall save the new page (3) from
> being evicted since it's time in shared buffers shall not be high enough to
> mandate eviction and it shall be given more chances.
> 
> Since gettimeofday() is an expensive call and hence cannot be done in the tight
> loop, we can count the number of clocksweeps the current buffer has seen
> (rather, survived). This shall give us a rough idea of the estimate of the
> relative age of the buffer.

Counting clock sweeps is an interesting idea.  I think one concern was
tracking hot buffers in cases where there is no memory pressure, and
hence the clock sweep isn't running --- I am not sure how this would
help in that case.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Atri Sharma
Date:



On Sat, Apr 19, 2014 at 1:07 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote:
> This can be changed by introducing an ageing factor that sees how much time the
> current buffer has spend in shared buffers. If the time that the buffer has
> spent is large enough (relatively) and it is not hot currently, that means it
> has had its chance and can be evicted. This shall save the new page (3) from
> being evicted since it's time in shared buffers shall not be high enough to
> mandate eviction and it shall be given more chances.
>
> Since gettimeofday() is an expensive call and hence cannot be done in the tight
> loop, we can count the number of clocksweeps the current buffer has seen
> (rather, survived). This shall give us a rough idea of the estimate of the
> relative age of the buffer.

Counting clock sweeps is an interesting idea.  I think one concern was
tracking hot buffers in cases where there is no memory pressure, and
hence the clock sweep isn't running --- I am not sure how this would
help in that case.


I feel that if there is no memory pressure, frankly it doesn't matter much what gets out and what doesn't. The case I am specifically targeting is when the clocksweep gets to move about a lot, i.e. high memory pressure workloads. Of course, I may be totally wrong here.

One thing that I discussed with Merlin offline and am now concerned about is how the actual eviction will work. We cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the oldest one.

Any thoughts there would be appreciated.

Regards,

Atri

--
Regards,
 
Atri
l'apprenant

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jason Petersen
Date:
On Apr 18, 2014, at 1:51 PM, Atri Sharma <atri.jiit@gmail.com> wrote:

Counting clock sweeps is an interesting idea.  I think one concern was
tracking hot buffers in cases where there is no memory pressure, and
hence the clock sweep isn't running --- I am not sure how this would
help in that case.


Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype.

The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system.

What’s wrong with using a transaction number or some similar sequence? It would accurately track “age” in the sense we care about: how long ago in “units of real work being done by the DB” something was added.
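
As a sketch of what I mean (hypothetical names; a real version would need an atomic counter):

/*
 * Sketch of a transaction-driven virtual clock: a shared counter is bumped
 * once per transaction start, each buffer remembers the counter value at its
 * last touch, and "age" is measured in transactions since last use.
 */
#include <stdint.h>

static uint64_t xact_ticks = 0;         /* bumped at each transaction start */

typedef struct VClockBuf
{
    uint64_t last_used_tick;            /* xact_ticks value at the last touch */
} VClockBuf;

static void
vclock_transaction_start(void)
{
    xact_ticks++;
}

static void
vclock_touch(VClockBuf *buf)
{
    buf->last_used_tick = xact_ticks;
}

static uint64_t
vclock_age(const VClockBuf *buf)
{
    return xact_ticks - buf->last_used_tick;
}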

—Jason

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Atri Sharma
Date:





Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype.

The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system.

What’s wrong with using a transaction number or some similar sequence? It would accurately track “age” in the sense we care about: how long ago in “units of real work being done by the DB” something was added.



Well, AIUI, we only need the 'relative' age of buffers in relation to the youngest buffer present. So, the guy who has seen the maximum amount of clocksweeps is the guy who has been around the most.

I do not see a need for an accurate estimate of the time spent in the buffer for any purpose right now. It may be useful in the future though.

How do you get the transaction ID? By accessing a tuple on the page and reading its XMIN?

Regards,

Atri



--
Regards,
 
Atri
l'apprenant

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Fri, Apr 18, 2014 at 1:11 PM, Jason Petersen <jason@citusdata.com> wrote:
> Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday
> seems silly to me: it was something quick for the prototype.

The gettimeofday() call doesn't need to happen in a tight loop. It can
be reasonably passed down from higher up, very probably without
consequence. The LRU-K paper actually recommends a delay of 5 seconds.
There is at least one other major database system that unambiguously
uses a wall-clock delay of 3 seconds (by default) for this exact
purpose - avoiding what the LRU-K paper calls "correlated references".
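
For illustration, the correlated reference period amounts to something like
this (hypothetical names; the timestamp would be computed once per statement
or so and passed down):

/*
 * Sketch of a correlated reference period: a buffer's popularity is only
 * bumped if at least DELAY_USECS have passed since its previous bump, so a
 * burst of touches within one scan counts as a single reference.  Nothing
 * here calls gettimeofday() in a tight loop.
 */
#include <stdint.h>

#define DELAY_USECS      (3 * 1000000L)     /* e.g. a 3 second delay */
#define MAX_USAGE_COUNT  5

typedef struct DelayBuf
{
    int      usage_count;
    int64_t  last_bump_usec;        /* wall-clock time of the previous bump */
} DelayBuf;

static void
delay_touch(DelayBuf *buf, int64_t now_usec)
{
    if (now_usec - buf->last_bump_usec < DELAY_USECS)
        return;                     /* correlated reference: don't count it */

    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;
    buf->last_bump_usec = now_usec;
}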

I'm not saying that we should go with that particular scheme for these
reasons. However, it's plainly untrue that using wall time like this
represents some kind of insurmountable scalability obstacle.

> The problem with the clocksweeps is they don’t actually track the
> progression of “time” within the PostgreSQL system.
>
> What’s wrong with using a transaction number or some similar sequence? It
> would accurately track “age” in the sense we care about: how long ago in
> “units of real work being done by the DB” something was added.

The LRU-K paper suggests that a scheme like this could be preferable.
I have my doubts that it can be made to work with Postgres.

--
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Bruce Momjian
Date:
On Sat, Apr 19, 2014 at 01:21:29AM +0530, Atri Sharma wrote:
> I feel that if there is no memory pressure, frankly it doesnt matter much about
> what gets out and what not. The case I am specifically targeting is when the
> clocksweep gets to move about a lot i.e. high memory pressure workloads. Of
> course,  I may be totally wrong here.
> 
> One thing that I discussed with Merlin offline and am now concerned about is
> how will the actual eviction work. We cannot traverse the entire list and then
> find all the buffers with refcount 0 and then do another traversal to find the
> oldest one.

I thought if there was memory pressure the clock sweep would run and we
wouldn't have everything at the max counter access value.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Atri Sharma
Date:



On Sat, Apr 19, 2014 at 3:37 AM, Bruce Momjian <bruce@momjian.us> wrote:

> One thing that I discussed with Merlin offline and am now concerned about is
> how will the actual eviction work. We cannot traverse the entire list and then
> find all the buffers with refcount 0 and then do another traversal to find the
> oldest one.

I thought if there was memory pressure the clock sweep would run and we
wouldn't have everything at the max counter access value.


Hmm, I see your point.

With that applicable as well, I feel that the clocksweep counting/logical clock system shall be useful when deciding between multiple candidates for eviction. At worst, it can serve to replace the gettimeofday() calls.

One thing I have thought of, with ideas and inputs from Joshua Yanowski offline, is that we can probably have a max-heap keyed on the logical clock age of buffers. Each time the clocksweep sees a buffer whose refcount has become zero, it will push that buffer onto the heap. This can be a new representation of the freelist or an additional data structure.
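
A sketch of that data structure (plain integers for buffer ids and ages, no locking, caller checks capacity and emptiness) could be:

/*
 * Sketch of a heap of eviction candidates keyed on logical (clock-sweep)
 * age: the clock hand pushes refcount-0 buffers here, and eviction pops the
 * oldest candidate in O(log n) instead of rescanning shared buffers.
 */
typedef struct Candidate
{
    int             buf_id;
    unsigned long   age;            /* sweeps survived when it was pushed */
} Candidate;

typedef struct AgeHeap
{
    Candidate  *items;
    int         size;
    int         capacity;
} AgeHeap;

static void
heap_push(AgeHeap *h, Candidate c)
{
    int i = h->size++;              /* caller ensures size < capacity */

    /* sift up: a parent must be at least as old as its children */
    while (i > 0 && h->items[(i - 1) / 2].age < c.age)
    {
        h->items[i] = h->items[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    h->items[i] = c;
}

static Candidate
heap_pop_oldest(AgeHeap *h)
{
    Candidate   top = h->items[0];  /* caller ensures size > 0 */

    h->items[0] = h->items[--h->size];

    /* sift down: keep the oldest candidate at the root */
    for (int i = 0;;)
    {
        int left = 2 * i + 1, right = 2 * i + 2, oldest = i;

        if (left < h->size && h->items[left].age > h->items[oldest].age)
            oldest = left;
        if (right < h->size && h->items[right].age > h->items[oldest].age)
            oldest = right;
        if (oldest == i)
            break;

        Candidate tmp = h->items[i];
        h->items[i] = h->items[oldest];
        h->items[oldest] = tmp;
        i = oldest;
    }
    return top;
}

Entries whose buffers get re-pinned before they are evicted would have to be skipped lazily when popped.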

This still does not solve the problem of the clocksweep having to see the entire list, even if the addition of the heap makes the eviction itself O(1).

I am working on a PoC patch but am stuck on this point. My current approach scans the entire shared buffers list to search for any candidate buffers.

Another thing that is a pain point here is the concurrency and locking overheads of introducing a new data structure. Can the existing buffer header spinlock handle this problem or is it hitting the granularity of the spinlock too much?

I see some blockers for this idea still. Nevertheless, the idea of clocksweep counts as logical clocks seems promising, at least intuitively.

Thoughts and comments?

Regards,

Atri


--
Regards,
 
Atri
l'apprenant

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/15/14, 1:15 PM, Peter Geoghegan wrote:
> On Tue, Apr 15, 2014 at 9:30 AM, Merlin Moncure<mmoncure@gmail.com>  wrote:
>> >There are many reports of improvement from lowering shared_buffers.
>> >The problem is that it tends to show up on complex production
>> >workloads and that there is no clear evidence pointing to problems
>> >with the clock sweep; it could be higher up in the partition locks or
>> >something else entirely (like the O/S).  pgbench is also not the
>> >greatest tool for sniffing out these cases: it's too random and for
>> >large database optimization is generally an exercise in de-randomizing
>> >i/o patterns.  We really, really need a broader testing suite that
>> >covers more usage patterns.
> I find it quite dissatisfying that we know so little about this.

This is an area where additional stats gathering would be very valuable. We're running 8G buffers on 512G servers
because bumping it up hurt performance, but we have no clue why. Was it due to buffer pins? How many times the clock had
to sweep to find a victim? Something else entirely? No idea... :(
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/18/14, 2:51 PM, Atri Sharma wrote:
>
> I feel that if there is no memory pressure, frankly it doesnt matter much about what gets out and what not. The case
> I am specifically targeting is when the clocksweep gets to move about a lot i.e. high memory pressure workloads. Of
> course, I may be totally wrong here.
 

Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting
anything*.

> One thing that I discussed with Merlin offline and am now concerned about is how will the actual eviction work. We
> cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the
> oldest one.
 

This is why OSes use multiple page pools. If we're going to use a clock sweep at all I think we need to use the same.

Every time we discuss this stuff it feels like we're completely reinventing the wheel that was solved by OSes years
ago. :(
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/16/14, 10:28 AM, Robert Haas wrote:
> Also, I think the scalability problems around buffer eviction are
> eminently solvable, and in particular I'm hopeful that Amit is going
> to succeed in solving them.  Suppose we have a background process
> (whether the background writer or some other) that runs the clock
> sweep, identifies good candidates for eviction, and pushes them on a
> set of, say, 16 free-lists protected by spinlocks.  (The optimal
> number of free-lists probably depends on the size of shared_buffers.)

How *certain* are we that a single freelist lock (that actually ONLY protects the freelist) would be that big a deal? I
suspect it wouldn't be much of an issue at all:
 

- Right now (IIRC) it's tied into the clock as well, so immediate fail on scaling...
- The clock is WAY more expensive than grabbing one buffer off the free list. Last I looked it was so bad that even if
  the next buffer the clock hit was free it was still worse than hitting the free list.
 

I strongly suspect that a single freelist lock (that didn't protect anything else) would be fine. I think it'd be folly
to start with a more complex multi-lock/multi-freelist implementation before we knew we needed one.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Clock sweep not caching enough B-Tree leaf pages?

From
David G Johnston
Date:
Jim Nasby-2 wrote
>> I feel that if there is no memory pressure, frankly it doesnt matter much
>> about what gets out and what not. The case I am specifically targeting is
>> when the clocksweep gets to move about a lot i.e. high memory pressure
>> workloads. Of course,  I may be totally wrong here.
> 
> Well, there's either memory pressure or there isn't. If there isn't then
> it's all moot *because we're not evicting anything*.

The trade-off I'm seeing here is between measuring when there is no memory
pressure - and thus eating at performance while not actually evicting
buffers - and not measuring but then encountering memory pressure and not
having a clue as to what should be evicted.

David J.









Re: Clock sweep not caching enough B-Tree leaf pages?

From
Claudio Freire
Date:
On Mon, Apr 21, 2014 at 8:07 PM, David G Johnston
<david.g.johnston@gmail.com> wrote:
> Jim Nasby-2 wrote
>>> I feel that if there is no memory pressure, frankly it doesnt matter much
>>> about what gets out and what not. The case I am specifically targeting is
>>> when the clocksweep gets to move about a lot i.e. high memory pressure
>>> workloads. Of course,  I may be totally wrong here.
>>
>> Well, there's either memory pressure or there isn't. If there isn't then
>> it's all moot *because we're not evicting anything*.
>
> The trade-off I'm seeing here is between measuring when there is no memory
> pressure - and thus eating at performance while not actually evicting
> buffers - and not measuring but then encountering memory pressure and not
> having a clue as to what should be evicted.


I believe that for the intended use discussed in this thread, a
compile-time switch would be more than enough control, and it would
avoid that tradeoff.



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Tom Lane
Date:
Jim Nasby <jim@nasby.net> writes:
> How *certain* are we that a single freelist lock (that actually ONLY
> protects the freelist) would be that big a deal?

We used to have one.  It was a big bottleneck --- and this was years
ago, when the buffer manager was much less scalable than it is today.
(IIRC, getting rid of a central lock was one of the main advantages
of the current clock sweep code over its predecessor.)

The real issue here is that in the modern code, we hardly ever actually
have anything in the freelist: only when a relation is dropped, or
something like that, do buffers ever get put to the freelist.  So your
argument that removing a buffer from the freelist is cheaper than running
the clock sweep is largely missing the point.  We'd have to run a clock
sweep in order to find something to put in the freelist.
        regards, tom lane



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We used to have one.  It was a big bottleneck --- and this was years
> ago, when the buffer manager was much less scalable than it is today.
> (IIRC, getting rid of a central lock was one of the main advantages
> of the current clock sweep code over its predecessor.)

Yes, it was. This is a major advantage of clock sweep, and anything
that replaces it will need to maintain the same advantage. Didn't
someone indicate that clock sweep could beat ARC around that time,
presumably for this reason? If no one did, then my reading of a
variety of other papers on caching indicates that this is probably the
case.


-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We used to have one.  It was a big bottleneck --- and this was years
>> ago, when the buffer manager was much less scalable than it is today.
>> (IIRC, getting rid of a central lock was one of the main advantages
>> of the current clock sweep code over its predecessor.)

> Yes, it was. This is a major advantage of clock sweep, and anything
> that replaces it will need to maintain the same advantage. Didn't
> someone indicate that clock sweep could beat ARC around that time,
> presumably for this reason? If no one did, then my reading of a
> variety of other papers on caching indicates that this is probably the
> case.

ARC *was* the predecessor algorithm.  See commit 5d5087363.
        regards, tom lane



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ARC *was* the predecessor algorithm.  See commit 5d5087363.

I believe that the main impetus for replacing ARC with clock sweep
came from patent issues, though. It was a happy coincidence that clock
sweep happened to be better than ARC, but that doesn't mean that ARC
didn't have some clear advantages, even if it wasn't worth it on
balance. LRU-K, and 2Q have roughly the same advantages. I'm
reasonably confident you can have the best of both worlds, or
something closer to it.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ARC *was* the predecessor algorithm.  See commit 5d5087363.

> I believe that the main impetus for replacing ARC with clock sweep
> came from patent issues, though.

That was one issue, but performance gains were a large part of it too,
and the main reason why we picked clock sweep rather than something else.
Did you read the commit message I pointed to?

(See also 4e8af8d27.)
        regards, tom lane



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Mon, Apr 21, 2014 at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Did you read the commit message I pointed to?

Yes.

> (See also 4e8af8d27.)

Oh, I wasn't actually aware of the fact that 2Q made it into the tree.
I thought that the first commit message you referred to just
referenced on-list discussion of 2Q. Interesting.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Mon, Apr 21, 2014 at 5:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
> LRU-K, and 2Q have roughly the same advantages. I'm
> reasonably confident you can have the best of both worlds, or
> something closer to it.

Having said that, a big part of what I'd like to accomplish here is to
address the more general problem of "correlated references". That's
probably something that has independent value.


-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
Here is a benchmark that is similar to my earlier one, but with a rate
limit of 125 tps, to help us better characterize how the prototype
patch helps performance:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/

Again, these are 15 minute runs with unlogged tables at multiple
client counts, and scale 5,000.

Every test run should have managed to hit that limit, but in the case
of master 1 test did not. I have included vmstat, iostat and meminfo
OS instrumentation this time around, which is really interesting for
this particular limit based benchmark. The prototype patch tested here
is a slight refinement on my earlier prototype. Apart from reducing
the number of gettimeofday() calls, and doing them out of the critical
path, I increased the initial usage_count value to 6. I also set
BM_MAX_USAGE_COUNT to 30. I guess I couldn't resist the temptation to
tweak things, which I actually did very little of prior to publishing
my initial results. This helps, but there isn't a huge additional
benefit.

The benchmark results show that master cannot even meet the 125 tps
limit on 1 test out of 9. More interestingly, the background writer
consistently cleans about 10,000 buffers per test run when testing the
patch. At the same time, buffers are allocated at a very consistent
rate of around 262,000 for the patched test runs. Leaving aside the
first test run, with the patch there is only a tiny variance in the
number cleaned between each test, a variance of just a few hundred
buffers. In contrast, master has enormous variance. During just over
half of the tests, the background writer does not clean even a single
buffer. Then, on 2 tests out of 9, it cleans an enormous ~350,000
buffers. The second time this happens leads to master failing to even
meet the 125 tps limit (albeit with only one client).

If you drill down to individual test runs, a similar pattern is
evident. You'll now find operating system information (meminfo dirty
memory) graphed here. The majority of the time, master does not hold
more than 1,000 kB of dirty memory at a time. Once or twice it's 0 kB
for multiple minutes. However, during the test runs where we also see
huge spikes in background-writer-cleaned pages, we also see huge
spikes in the total amount of dirty memory (and correlated huge spikes
in latency). It can get to highs of ~700,000 kB at one point. In
contrast, the patched tests show very consistent amounts of dirty
memory. Per test, it almost always tops out at 4,000 kB - 6,000 kB
(there is a single 12,000 kB spike, though). There is consistently a
distinct zig-zag pattern to the dirty memory graph with the patched
tests, regardless of client count or where checkpoints occur. Master
shows mountains and valleys for those two particularly problematic
tests, correlating with a panicked background writer's aggressive
feedback loop. Master also shows less rhythmic zig-zag patterns that
only peak at about 600 kB - 1,000 kB for the entire duration of many
individual test runs.

Perhaps most notably, average and worst case latency is far improved
with the patch. On average it's less than half of master with 1
client, and less than a quarter of master with 32 clients.

I think that the rate limiting feature of pgbench is really useful for
characterizing how work like this improves performance. I see a far
smoother and more consistent pattern of I/O that superficially looks
like Postgres is cooperating with the operating system much more than
it does in the baseline. It sort of makes sense that the operating
system cache doesn't care about frequency while Postgres does. If the
OS cache did weigh frequency, it would surely not alter the outcome
very much, since OS cached data has presumably not been accessed very
frequently recently. I suspect that I've cut down on double buffering
by quite a bit. I would like to come up with a simple way of measuring
that, using something like pgfincore, but the available interfaces
don't seem well-suited to quantifying how much of a problem this is
and remains. I guess calling pgfincore on the index segment files might
be interesting, since shared_buffers mostly holds index pages. This
has been verified using pg_buffercache.

It would be great to test this out with something involving
non-uniform distributions, like Gaussian and Zipfian distributions.
The LRU-K paper tests Zipfian too. The uniform distribution pgbench
uses here, while interesting, doesn't tell the full story at all and
is less representative of reality (TPC-C is formally required to have
a non-uniform distribution [1] for some things, for example). A
Gaussian distribution might show essentially the same failure to
properly credit pages with frequency of access in one additional
dimension, so to speak. Someone should probably look at the TPC-C-like
DBT-2, since in the past that was considered to be a problematic
workload for PostgreSQL [2] due to the heavy I/O. Zipfian is a lot
less sympathetic than uniform if the LRU-K paper is any indication,
and so if we're looking for a worst-case, that's probably a good place
to start. It's not obvious how you'd go about actually constructing a
practical test case for either, though.

I should acknowledge that I clearly have regressed one aspect that is
evident from the benchmark: Cleanup time (which is recorded per test)
takes more than twice as long. This makes intuitive sense, though.
VACUUM first creates a list of tuples to kill by scanning the heap. It
then goes to the indexes to kill them there first. It then returns to
the heap, and kills heap tuples in a final scan. Clearly it is bad for
VACUUM that shared_buffers mostly contains index pages when it begins.
That said, I haven't actually considered the interactions with buffer
access strategies here. It might well be more complicated than I've
suggested.

To be thorough, I've repeated each test set. There is a "do over" for
both master and patched, which serves to show how repeatable the
original test sets are.

[1] Clause 2.1.6, TPC-C specification:
http://www.tpc.org/tpcc/spec/tpcc_current.pdf

[2] https://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting#DBT-2_I.2FO_Performance
-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Albe Laurenz
Date:
Jason Petersen wrote:
> Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was
> something quick for the prototype.
> 
> The problem with the clocksweeps is they don’t actually track the progression of “time” within the
> PostgreSQL system.

Would it make sense to just cache the result of the latest gettimeofday() call
and use that as an approximation for wall time?
The busier the system is, the more accurate that should be.
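
A minimal sketch of the shape I mean (the names are invented for
illustration, and it glosses over shared memory and locking):

#include <sys/time.h>

/* Cached wall-clock value, refreshed by whoever already pays for a
 * gettimeofday() call on some other code path. */
static struct timeval cached_now;

static void
refresh_cached_time(void)
{
    gettimeofday(&cached_now, NULL);
}

/* Cheap, approximate "now" for consumers that only need coarse
 * recency information; the busier the system, the fresher this is. */
static struct timeval
approx_now(void)
{
    return cached_now;
}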

Yours,
Laurenz Albe

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Atri Sharma
Date:



On Tue, Apr 22, 2014 at 12:59 PM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> Jason Petersen wrote:
>> Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was
>> something quick for the prototype.
>>
>> The problem with the clocksweeps is they don’t actually track the progression of “time” within the
>> PostgreSQL system.
>
> Would it make sense to just cache the result of the latest gettimeofday() call
> and use that as an approximation for wall time?
> The busier the system is, the more accurate that should be.


That sounds...risky. How will the invalidation/updating of the cache work?

How will we track the time window in which the cached value is still valid and applicable?

My first thoughts only. I may be missing the point though.

Regards,

Atri



--
Regards,
 
Atri
l'apprenant

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Hannu Krosing
Date:
On 04/17/2014 10:39 PM, Andres Freund wrote:
> On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote:
>> Just over 99.6% of pages (leaving aside the meta page) in the big 10
>> GB pgbench_accounts_pkey index are leaf pages.

What is the depth of the b-tree at this percentage?

Cheers
Hannu



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/21/14, 6:07 PM, David G Johnston wrote:
> Jim Nasby-2 wrote
>>> I feel that if there is no memory pressure, frankly it doesnt matter much
>>> about what gets out and what not. The case I am specifically targeting is
>>> when the clocksweep gets to move about a lot i.e. high memory pressure
>>> workloads. Of course, I may be totally wrong here.
>>
>> Well, there's either memory pressure or there isn't. If there isn't then
>> it's all moot *because we're not evicting anything*.
> The trade-off I'm seeing here is between measuring when there is no memory
> pressure - and thus eating at performance while not actually evicting
> buffers - and not measuring but then encountering memory pressure and not
> having a clue as to what should be evicted.

Right. OSes handle this by keeping a certain ratio of active vs inactive pages, regardless of pressure for free pages.
That way when you need more pages in the free list you can pull them from the inactive list knowing that you're making a
good decision.

One of the really nice things about this approach is that if memory pressure is low enough that you don't need more
pages on the inactive list you don't even need to run that clock.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Tue, Apr 22, 2014 at 2:03 AM, Hannu Krosing <hannu@krosing.net> wrote:
> What is the depth of the b-tree at this percentage?

Well, this percentage of B-Tree pages that are leaf pages doesn't have
much to do with the depth. The percentage seems very consistent for
each B-Tree, irrespective of the total size of the B-Tree.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Mon, Apr 21, 2014 at 11:57 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Here is a benchmark that is similar to my earlier one, but with a rate
> limit of 125 tps, to help us better characterize how the prototype
> patch helps performance:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/

I've added some more test sets to this result report, again with a 125
TPS limit, but on this occasion with a pgbench Gaussian distribution.
I used V13 of the recently proposed Gaussian distribution pgbench
patch [1] to accomplish this, including the Gaussian variant of tpc-b
that the pgbench patch has baked in. The distribution threshold used
was consistently 5, causing the patched pgbench to report for each
test:

transaction type: Custom query
scaling factor: 5000
standard deviation threshold: 5.00000
access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741

It looks like the patch continues to have much lower latency than
master for this somewhat distinct workload. Actually, even though the
background writer is somewhat working harder than in the uniform
distribution case, the average latency with patched is appreciably
lower. Total buffers allocated are just as consistent as before for
patched, but the number is markedly lower than for the prior uniform
distribution case. Dirty memory graphs start off similar to the
uniform case with patched, but get a bit spikier towards the end of
each test run there. It's still *markedly* better than master for
either distribution type, which is still really aggressive at times
for master, and other times by far isn't aggressive enough, in much
the same way as before.

In general, with the Gaussian distribution, average latency is lower,
but worst case is higher. The patch maintains its clear lead for
average case, albeit a smaller lead than with uniform, and with worst
case things are much better relatively speaking. Absolute worst case
(and not worst case averaged across client counts) is 1.4 seconds with
patched, to 8.3 with master...and that terrible worst case happens
*twice* with master. For uniform distribution, the same figure was 5.4
- 5.8 seconds for master, and 0.6 seconds for patched.

What is curious is that with master and with the Gaussian
distribution, I see distinct latency "no man's land" in multiple test
runs, like this one here for example:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/49/index.html
It looks like there is a clear differentiation between going to disk
and not going to disk, or something like that. I don't see this for
any other case, and it's quite obviously a consistent and distinct
feature of master + Gaussian when the OS isn't aggressively writing
out a mountain of dirty memory. This is something that I personally
have never seen before.

I also note that master had 3 huge background writer spikes with a
Gaussian distribution, rather than 2 and 1 small one, as was
(consistently) demonstrated to happen with a uniform distribution.
What's more, 90th percentile latency is very consistent across client
counts for the new patched test run, as opposed to being very much
higher with higher client counts when master is tested.

[1] http://www.postgresql.org/message-id/alpine.DEB.2.10.1404011107220.2557@sto
-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
I've now done a non-limited comparative benchmark of master against
the patch (once again, with usage_count starting at 6, and
BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again,
the distribution threshold used was consistently 5.0, causing the
patched pgbench to report for each test:

transaction type: Custom query
scaling factor: 5000
standard deviation threshold: 5.00000
access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741

Results are available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Fri, Apr 18, 2014 at 11:46 AM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I am a bit confused by this remark.  In *any* circumstance when you
>> evict you're incurring precisely one page fault I/O when the page is
>> read back in.   That doesn't mean that the choice of which page to
>> evict is irrelevant.
>
> But you might be evicting a page that will be needed soon or one that
> won't be needed for a while. If it's not needed for a while you might
> be able to avoid many page evictions by caching a page that will be
> used several times.

Sure.

> If all the pages currently in RAM are hot -- meaning they're hot
> enough that they'll be needed again before the page you're reading in
> -- then they're all equally bad to evict.

Also true.  But the problem is that it is very rarely, if ever, the
case that all pages are *equally* hot.  On a pgbench workload, for
example, I'm very confident that while there's not really any cold
data, the btree roots and visibility map pages are a whole lot hotter
than a randomly-selected heap page.  If you evict a heap page, you're
going to need it back pretty quick, because it won't be long until the
random-number generator again chooses a key that happens to be located
on that page.  But if you evict the root of the btree index, you're
going to need it back *immediately*, because the very next query, no
matter what key it's looking for, is going to need that page.  I'm
pretty sure that's a significant difference.

> I'm trying to push us away from the gut instinct that frequently used
> pages are important to cache and towards actually counting how many
> i/os we're saving. In the extreme it's possible to simulate any cache
> algorithm on a recorded list of page requests and count how many page
> misses it generates to compare it with an optimal cache algorithm.

There's another issue, which Simon clued me into a few years back:
evicting the wrong page can cause system-wide stalls.  In the pgbench
case, evicting a heap page will force the next process that chooses a
random number that maps to a tuple on that page to wait for the page
to be faulted back in.  That's sad, but unless the scale factor is
small compared to the number of backends, there will probably be only
ONE process waiting.  On the other hand, if we evict the btree root,
within a fraction of a second, EVERY process that isn't already
waiting on some other I/O will be waiting for that I/O to complete.
The impact on throughput is much bigger in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby <jim@nasby.net> wrote:
>> I feel that if there is no memory pressure, frankly it doesnt matter much
>> about what gets out and what not. The case I am specifically targeting is
>> when the clocksweep gets to move about a lot i.e. high memory pressure
>> workloads. Of course,  I may be totally wrong here.
>
> Well, there's either memory pressure or there isn't. If there isn't then
> it's all moot *because we're not evicting anything*.

I don't think that's really true.  A workload can fit within
shared_buffers at some times and spill beyond it at others.  Every
time it fits within shared_buffers for even a short period of time,
the reference count of any buffer that's not ice-cold goes to 5 and we
essentially lose all knowledge of which buffers are relatively hotter.
Then, when we spill out again, evictions are random.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Mon, Apr 28, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Also true.  But the problem is that it is very rarely, if ever, the
> case that all pages are *equally* hot.  On a pgbench workload, for
> example, I'm very confident that while there's not really any cold
> data, the btree roots and visibility map pages are a whole lot hotter
> than a randomly-selected heap page.  If you evict a heap page, you're
> going to need it back pretty quick, because it won't be long until the
> random-number generator again chooses a key that happens to be located
> on that page.  But if you evict the root of the btree index, you're
> going to need it back *immediately*, because the very next query, no
> matter what key it's looking for, is going to need that page.  I'm
> pretty sure that's a significant difference.

I emphasized leaf pages because even with master the root and inner
pages are still going to be so hot as to make them constantly in
cache, at least with pgbench's use of a uniform distribution. You'd
have to have an absolutely enormous scale factor before this might not
be the case. As such, I'm not all that worried about inner pages when
performing these simple benchmarks. However, in the case of the
pgbench_accounts table, each of the B-Tree leaf pages that comprise
about 99.5% of the total is still going to be about six times more
frequently accessed than each heap page. That's a small enough
difference for it to easily go unappreciated, and yet a big enough
difference for it to hurt a lot.


-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Fri, Apr 25, 2014 at 10:45 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I've now done a non-limited comparative benchmark of master against
> the patch (once again, with usage_count starting at 6, and
> BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again,
> the distribution threshold used was consistently 5.0, causing the
> patched pgbench to report for each test:
>
> transaction type: Custom query
> scaling factor: 5000
> standard deviation threshold: 5.00000
> access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741
>
> Results are available from:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/

I updated this with various changes in bgwriter configuration. Perhaps
unsurprisingly, disabling the background writer entirely helps for
both master and patched. It is perhaps notable that the largest
difference between two comparable patch + master test runs is seen
when the background writer is disabled entirely, and 32 clients.


-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/28/14, 8:04 AM, Robert Haas wrote:
> On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby <jim@nasby.net> wrote:
>>> I feel that if there is no memory pressure, frankly it doesnt matter much
>>> about what gets out and what not. The case I am specifically targeting is
>>> when the clocksweep gets to move about a lot i.e. high memory pressure
>>> workloads. Of course,  I may be totally wrong here.
>>
>> Well, there's either memory pressure or there isn't. If there isn't then
>> it's all moot *because we're not evicting anything*.
>
> I don't think that's really true.  A workload can fit within
> shared_buffers at some times and spill beyond it at others.  Every
> time it fits within shared_buffers for even a short period of time,
> the reference count of any buffer that's not ice-cold goes to 5 and we
> essentially lose all knowledge of which buffers are relatively hotter.
>   Then, when we spill out again, evictions are random.

That's a separate problem, but yes, just because we're not evicting something doesn't mean we can't end up with every
buffer marked as equally important.
 

OSes handle this by splitting pages between active and inactive, and maintaining a relative balance between the two
(actually a bit more complex because there's a separate inactive/dirty pool).
 

In our case this could maybe be handled by simply not incrementing counts when there's no eviction... but I'm more a
fan of separate pools/clocks, because that means you can do things like a LFU for active and an LRU for inactive.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Kevin Grittner
Date:
Jim Nasby <jim@nasby.net> wrote:

> In our case this could maybe be handled by simply not
> incrementing counts when there's no eviction... but I'm more a
> fan of separate pools/clocks, because that means you can do
> things like a LFU for active and an LRU for inactive.

I have hesitated to mention some benchmarks I did for optimal
caching techniques for a database load, because they were so old,
but maybe the ideas might spark something of value in the
discussion.  I'm talking about early 1985 on 80286 hardware on DOS
with a Terminate and Stay Resident (TSR) cache external to the
database.  The external cache used LRU caching, and I was looking
at what caching I could do inside the database to improve real
database workloads which tended to include both OLTP and reporting.

I found two types of caches improved performance.  Neither was a
substitute for the LRU cache closer to the hardware, and
eliminating either reduced performance over having both.  One was
index-specific -- each connection caused to be held in cache the
last page at each level of the index.  This proved useful because
in our real life applications it turned out that the next "random"
access on an index was very often the same or near the previous.
The other was a "weighted average" of access counts -- each access
bumped a count and after a certain number of bumps all counts were
reduced by 25%.  This was accomplished by setting each count to the
sum of its existing value shifted right by one and shifted right by
by two.
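
In code, the decay step described above amounts to roughly the
following (a sketch for illustration, not the original implementation):

#include <stdint.h>

/* Each access bumps a one-byte counter; every so often all counters
 * are scaled to about 75% of their value via two right shifts. */
static void
bump(uint8_t *count)
{
    if (*count < 255)
        (*count)++;
}

static void
decay_all(uint8_t *counts, int n)
{
    for (int i = 0; i < n; i++)
        counts[i] = (counts[i] >> 1) + (counts[i] >> 2);  /* ~0.75 * count */
}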

I understand that with the much larger RAM caches available 30
years later there could be some problems with passing all the
counts atomically without causing a noticeable pause, and the
higher connection counts may cause more contention issues.  But if
those issues could be solved (or somehow dodged for a proof of
concept benchmark) it might be interesting to see how that worked
out.

FWIW, I recall that we used a one byte counter for each page,
running 0 to 255.  I don't recall the number at which we
effectively multiplied by 0.75, and there was nothing particularly
magic about that multiplier other than it was pretty fast and
worked better than 0.5 in my benchmarks.  I also don't remember
what we used as the initial value on a page load.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote:

> each connection caused to be held in cache the last page at each
> level of the index.

Apologies for ambiguous terminology there.

To be clear: the most recently accessed page at each level of the index.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> Anyways, I'm still curious if you can post similar numbers basing the
>> throttling on gross allocation counts instead of time.  Meaning: some
>> number of buffer allocations has to have occurred before you consider
>> eviction.  Besides being faster I think it's a better implementation:
>> an intermittently loaded server will give more consistent behavior.
>
> Yeah --- I think wall-clock-based throttling is fundamentally the wrong
> thing anyway.  Are we going to start needing a CPU speed measurement to
> tune the algorithm with?  Not the place to be going.  But driving it off
> the number of allocations that've been done could be sensible.  (OTOH,
> that means you need a central counter, which itself would be a
> bottleneck.)

So, I was thinking about this a little bit more today, prodded by my
coworker John Gorman.  I'm wondering if we could drive this off of the
clock sweep; that is, every time the clock sweep touches a buffer, its
usage count goes down by one, but we also set two flag bits.  Every
time the buffer gets touched thereafter, we check whether any flag
bits are set; if so, we clear one and increase the usage count, else
we do nothing.  So the usage count can increase at most twice per
clock sweep.  The advantage of that is that, as with Peter's approach,
it is harder for the usage count of a buffer to max out - to get
there, you need sustained access over a longer period of time.  But
it's not time-dependent, so replaying the same workload at twice the
speed or half the speed on faster or slower hardware doesn't change
the choice of which buffer to evict, which seems good.  And it will
cause us to prefer buffers which are accessed repeatedly over a period
of time rather than buffers that are accessed a bunch of times in
quick succession and then not touched again for a while, which seems
like a good bet.
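
As a standalone sketch, the idea looks roughly like this (names are
invented for illustration; this is not actual bufmgr code):

#define MAX_USAGE_COUNT 5

typedef struct SketchBuffer
{
    int usage_count;   /* eviction priority, 0 .. MAX_USAGE_COUNT */
    int bump_credits;  /* the "two flag bits", reset by the sweep hand */
} SketchBuffer;

/* Clock sweep hand passes over the buffer while hunting for a victim. */
static void
sweep_touch(SketchBuffer *buf)
{
    if (buf->usage_count > 0)
        buf->usage_count--;
    buf->bump_credits = 2;      /* at most two increments until next pass */
}

/* A backend pins/uses the buffer. */
static void
access_touch(SketchBuffer *buf)
{
    if (buf->bump_credits > 0)
    {
        buf->bump_credits--;
        if (buf->usage_count < MAX_USAGE_COUNT)
            buf->usage_count++;
    }
    /* otherwise: already bumped twice since the last sweep pass */
}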

I can't convince myself that this fully solves the (currently
existing) problem of usage counts increasing too fast.  In theory,
every time the clock sweep hand advances, it decrements the usage
count of one buffer by one, but also makes it possible for the usage
count of that buffer to increase by up to two (or by only one if the
buffer previously had a usage count of five).  So with the right
access pattern, it seems like you could still get to a place where a
typical buffer eviction has to do a lot of scanning before it finds a
victim buffer.  As with the present algorithm, we'd be most likely to
have that problem when the buffer hit ratio is very high, so that
there are many opportunities for usage counts to go up between each
opportunity to push usage counts back down.  But it seems like it
would at least limit the damage.

I haven't tried this or anything, so this is just random brainstorming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Tue, Apr 14, 2015 at 2:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, I was thinking about this a little bit more today, prodded by my
> coworker John Gorman.  I'm wondering if we could drive this off of the
> clock sweep; that is, every time the clock sweep touches a buffer, its
> usage count goes down by one, but we also set two flag bits.  Every
> time the buffer gets touched thereafter, we check whether any flag
> bits are set; if so, we clear one and increase the usage count, else
> we do nothing.  So the usage count can increase at most twice per
> clock sweep.  The advantage of that is that, as with Peter's approach,
> it is harder for the usage count of a buffer to max out - to get
> there, you need sustained access over a longer period of time.  But
> it's not time-dependent, so replaying the same workload at twice the
> speed or half the speed on faster or slower hardware doesn't change
> the choice of which buffer to evict, which seems good.

Why is that good?

My little prototype basically implemented the LRU-K approach to
preventing correlated references (which tpc-b has some of, in a really
obvious way - not sure how significant that is). This would have
accidentally made it weight frequency better, but the prototype did
not pretend to actually implement something like LRU-K (which is not
the state of the art in any case), because it did not consider the
recency of the second-to-last access of the buffer at all.

The prototype's performance started off well, but regressed in later
pgbench-collector iterations (due to the influence of VACUUM, I
think). If I wanted to cheat, I could have only done one large 45
minute run, which would have made the prototype look much better
still.

> And it will
> cause us to prefer buffers which are accessed repeatedly over a period
> of time rather than buffers that are accessed a bunch of times in
> quick succession and then not touched again for a while, which seems
> like a good bet.

I think that people were all too quick to dismiss the idea of a wall
time interval playing some role here (at least as a defense against
correlated references, as a correlated reference period). I suppose
that that's because it doesn't fit with an intuition that says that
that kind of interval ought to be derived algebraically - magic delay
settings are considered suspect. However, the same can be said of "the
Five-minute rule", which is a highly influential rule of thumb that
Jim Gray came up with. The Five minute rule is now obsolete, but that
took a long time, and the fundamental observation still applies
(Wikipedia says it depends on what type of disks you have these days,
but the fact remains that rules of thumb like this can be more or less
robust).

I think that correlated reference type delays *are* used effectively
in real world systems without it being fragile/overly workload
dependent.

> I can't convince myself that this fully solves the (currently
> existing) problem of usage counts increasing too fast.  In theory,
> every time the clock sweep hand advances, it decrements the usage
> count of one buffer by one, but also makes it possible for the usage
> count of that buffer to increase by up to two (or by only one if the
> buffer previously had a usage count of five).  So with the right
> access pattern, it seems like you could still get to a place where a
> typical buffer eviction has to do a lot of scanning before it finds a
> victim buffer.  As with the present algorithm, we'd be most likely to
> have that problem when the buffer hit ratio is very high, so that
> there are many opportunities for usage counts to go up between each
> opportunity to push usage counts back down.  But it seems like it
> would at least limit the damage.

> I haven't tried this or anything, so this is just random brainstorming.

That's what it'll take, I think -- random brainstorming, and crude
prototypes to test theories.

As long as we're doing random brainstorming, I'd suggest looking at
making clocksweep actually approximate LRU-K/LRU-2 (which, again, to
be clear, my prototype did not do). The clocksweep could maintain
statistics about the recency of the second-to-last access across all
buffers, and discriminate against buffers according to what bucket of
the population they fit in to. Not sure how aggressively we'd penalize
those buffers that had very old penultimate references (or credit
those that had very recent penultimate references), or what the bucket
partitioning scheme is, but that's probably where I'd take it next.
For example, buffers with a penultimate reference that is more than a
standard deviation below the mean would be double penalized (and maybe
the opposite, for those buffers with penultimate accesses a stddev
above the mean). If that didn't work so well, then I'd look into an
ARC style recency and frequency list (while remembering things about
already evicted blocks, which LRU-K does not do....although that paper
is from the early 1990s).
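
To make the bucketing idea above a little more concrete, here is a very
loose sketch (all names are hypothetical; maintaining the population
mean and stddev incrementally is assumed rather than shown):

typedef struct SketchBuffer
{
    double  last_access;        /* e.g. in clock-sweep ticks */
    double  penultimate_access; /* the access before that */
    int     usage_count;
} SketchBuffer;

static void
note_access(SketchBuffer *buf, double now)
{
    buf->penultimate_access = buf->last_access;
    buf->last_access = now;
}

/* Sweep hand passes over the buffer: penalize or credit it according
 * to how its penultimate access compares with the population. */
static void
sweep_touch(SketchBuffer *buf, double mean, double stddev)
{
    int decrement = 1;

    if (buf->penultimate_access < mean - stddev)
        decrement = 2;          /* double-penalize stale penultimate refs */
    else if (buf->penultimate_access > mean + stddev)
        decrement = 0;          /* credit very recent penultimate refs */

    buf->usage_count -= decrement;
    if (buf->usage_count < 0)
        buf->usage_count = 0;
}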

There are approaches to relieving lock contention with ARC (e.g., CAR,
or what OpenZFS now does [1]). So maybe we could just look to doing
something similar if simpler approaches turn out to be less effective.

[1] http://blog.delphix.com/prakash/2015/03/23/openzfs-reducing-arc-lock-contention/
-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/14/15 5:22 PM, Peter Geoghegan wrote:
> As long as we're doing random brainstorming, I'd suggest looking at
> making clocksweep actually approximate LRU-K/LRU-2 (which, again, to
> be clear, my prototype did not do). The clocksweep could maintain
> statistics about the recency of the second-to-last access across all
> buffers, and discriminate against buffers according to what bucket of
> the population they fit in to. Not sure how aggressively we'd penalize
> those buffers that had very old penultimate references (or credit
> those that had very recent penultimate references), or what the bucket
> partitioning scheme is, but that's probably where I'd take it next.
> For example, buffers with a penultimate reference that is more than a
> standard deviation below the mean would be double penalized (and maybe
> the opposite, for those buffers with penultimate accesses a stddev
> above the mean). If that didn't work so well, then I'd look into an
> ARC style recency and frequency list (while remembering things about
> already evicted blocks, which LRU-K does not do....although that paper
> is from the early 1990s).

Along the lines of brainstorming... why do we even allow usage_count > 
1? Clocksweep was used pretty successfully by at least FreeBSD, but they 
simply used a bit to indicate recently used. Anything that wasn't 
recently used moved from the active pool to the inactive pool (which 
tended to be far larger than the active pool with decent amounts of 
memory), and a small number of buffers were kept on the 'free' list by 
pulling them out of the inactive pool and writing them if they were 
dirty. All of this was done on an LRU basis.

Given how common it is for the vast bulk of shared_buffers in an install 
to be stuck at 5, I'd think the first thing we should try is a 
combination of greatly reducing the max for usage_count (maybe to 2 
instead of 1 to simulate 2 pools), and running the clock sweep a lot 
more aggressively in a background process.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
<p dir="ltr">I've been meaning to write this since PGConf and now isn't a great time since I'm on my phone but I think
it'stime.<p dir="ltr">The way the clock sweep algorithm is meant to be thought about is that it's an approximate lru.
Eachusage count corresponds to an ntile of the lru. So we don't know which buffer is least recently used but it must be
inthe set of buffers with usage count 0 and that should be 1/nth of all the buffers.<p dir="ltr">In order for that
propertyto be maintained though the usage count for some buffer should be getting decremented every time we touch a
buffer.That is, every time we promote one buffer to the most recently moved ntile we should be demoting some other
buffer.<pdir="ltr">The way our cache works we promote when a buffer is accessed but we only demote when a buffer is
flushed.We flush a lot less often than we touch buffers so it's not surprising that the cache ends up full of buffers
thatare all in the "most recently used" section.<p dir="ltr">Now it's complicated by the fact that we aren't promoting
buffersdirectly to the most recently used ntile. We're incrementing the usage count by one. That makes it more of a
"leastfrequently used" list rather than a lru. I think that's a mistake but I recall some debate about that when it
firstwent in. 

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Why is that good?

We did discuss this before.  I've recapped some of what I believe to
be the most salient points below.

> I think that people were all too quick to dismiss the idea of a wall
> time interval playing some role here (at least as a defense against
> correlated references, as a correlated reference period). I suppose
> that that's because it doesn't fit with an intuition that says that
> that kind of interval ought to be derived algebraically - magic delay
> settings are considered suspect.

Yep, Tom gave that reason here:

http://www.postgresql.org/message-id/11258.1397673898@sss.pgh.pa.us

But there was also this point from Andres - gettimeofday is not free:

http://www.postgresql.org/message-id/20140416075307.GC3906@awork2.anarazel.de

And this point from me - this can degrade to random eviction under
high pressure:

http://www.postgresql.org/message-id/CA+TgmoayUxr55zuEaPP6d2XByicJWACC9Myyn5aT4TiNdSJqYw@mail.gmail.com

You'll notice that my proposal avoids all three of those objections.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Amit Kapila
Date:
On Wed, Apr 15, 2015 at 2:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Merlin Moncure <mmoncure@gmail.com> writes:
> >> Anyways, I'm still curious if you can post similar numbers basing the
> >> throttling on gross allocation counts instead of time.  Meaning: some
> >> number of buffer allocations has to have occurred before you consider
> >> eviction.  Besides being faster I think it's a better implementation:
> >> an intermittently loaded server will give more consistent behavior.
> >
> > Yeah --- I think wall-clock-based throttling is fundamentally the wrong
> > thing anyway.  Are we going to start needing a CPU speed measurement to
> > tune the algorithm with?  Not the place to be going.  But driving it off
> > the number of allocations that've been done could be sensible.  (OTOH,
> > that means you need a central counter, which itself would be a
> > bottleneck.)
>
> So, I was thinking about this a little bit more today, prodded by my
> coworker John Gorman.  I'm wondering if we could drive this off of the
> clock sweep; that is, every time the clock sweep touches a buffer, its
> usage count goes down by one, but we also set two flag bits.  Every
> time the buffer gets touched thereafter, we check whether any flag
> bits are set; if so, we clear one and increase the usage count, else
> we do nothing.  So the usage count can increase at most twice per
> clock sweep.  The advantage of that is that, as with Peter's approach,
> it is harder for the usage count of a buffer to max out - to get
> there, you need sustained access over a longer period of time.  But
> it's not time-dependent, so replaying the same workload at twice the
> speed or half the speed on faster or slower hardware doesn't change
> the choice of which buffer to evict, which seems good.  And it will
> cause us to prefer buffers which are accessed repeatedly over a period
> of time rather than buffers that are accessed a bunch of times in
> quick succession and then not touched again for a while, which seems
> like a good bet.
>

IIUC, this will allow us to increase the usage count only when the buffer
is touched by the clocksweep to decrement the usage count.
I think such a solution will be good for the cases when many evictions
need to be performed to satisfy the workload.  OTOH, when there are
not too many evictions that need to be done, some of
the buffers that are accessed much more will have an equal probability of
getting evicted compared to buffers which are less accessed.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Tue, Apr 14, 2015 at 7:10 PM, Greg Stark <stark@mit.edu> wrote:
> The way the clock sweep algorithm is meant to be thought about is that it's
> an approximate lru. Each usage count corresponds to an ntile of the lru. So
> we don't know which buffer is least recently used but it must be in the set
> of buffers with usage count 0 and that should be 1/nth of all the buffers.

Agreed.

> In order for that property to be maintained though the usage count for some
> buffer should be getting decremented every time we touch a buffer. That is,
> every time we promote one buffer to the most recently moved ntile we should
> be demoting some other buffer.

Agreed.  It's not easy to get this behavior exactly, though, because
the buffer you kick out necessarily has a usage count of 0 and the one
you bring in probably shouldn't.  And we don't wanna have to run the
clock sweep every time somebody touches a non-maximal usage count.
But I think it is still true that this is, to some degree, what our
algorithm is trying to approximate, and I also think it's pretty clear
that our current approximation isn't that great.

> The way our cache works we promote when a buffer is accessed but we only
> demote when a buffer is flushed. We flush a lot less often than we touch
> buffers so it's not surprising that the cache ends up full of buffers that
> are all in the "most recently used" section.

This isn't really correct.  We promote when it's accessed, but we
demote it when the clock sweep hand passes over it, which happens each
time we consider it for eviction.  It does not have to do with
flushing dirty data to disk, and it does not happen only when the
buffer is actually evicted.

> Now it's complicated by the fact that we aren't promoting buffers directly
> to the most recently used ntile. We're incrementing the usage count by one.
> That makes it more of a "least frequently used" list rather than a lru. I
> think that's a mistake but I recall some debate about that when it first
> went in.

Note that the discussion of 2Q, LRU(k), and perhaps others ask not
only how recently the page was used, but how frequently it was used.
The recency of the next-to-last access is often used as a proxy for
frequency.  Consider two buffers. One gets touched 5 times in a row,
once a day.  The other gets touched 5 times per day, at equal
intervals.  In general, the second buffer is a better choice to retain
than the first, even if it has been touched less recently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> IIUC, this will allow us to increase usage count only when the buffer
> is touched by clocksweep to decrement the usage count.

Yes.

> I think such a solution will be good for the cases when many evictions
> needs to be performed to satisfy the workload,  OTOH when there are
> not too many evictions that needs to be done, in such a case some of
> the buffers that are accessed much more will have equal probability to
> get evicted as compare to buffers which are less accessed.

Possibly, but I think it's even worse under the current algorithm.
Under this proposal, if we go for a long time without any buffer
evictions, every buffer's usage count will top out at 2 more than
wherever it was after the last clock sweep.   In the worst case, every
buffer (or most of them) could end up with the same usage count.  But
under the status quo, they'll all go to 5, which is an even bigger
loss of information, and which will make the first eviction much more
expensive than if they are all pegged at 2 or 3.

There could be ways to improve things further, such as by slowly
advancing the clock sweep even when no eviction is required.  But
that's a bit tricky to do, too.  You'd like (perhaps) to advance it
one step for every buffer allocation, but you don't want a centralized
counter, or unnecessary contention on the clock sweep when no eviction
is necessary in the first place.  There's probably some way to make it
work, though, if we put our mind to it.  I'm inclined to try the
simpler approach first and see what it buys.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Amit Kapila
Date:
On Wed, Apr 15, 2015 at 10:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > IIUC, this will allow us to increase usage count only when the buffer
> > is touched by clocksweep to decrement the usage count.
>
> Yes.
>
> > I think such a solution will be good for the cases when many evictions
> > needs to be performed to satisfy the workload,  OTOH when there are
> > not too many evictions that needs to be done, in such a case some of
> > the buffers that are accessed much more will have equal probability to
> > get evicted as compare to buffers which are less accessed.
>
> Possibly, but I think it's even worse under the current algorithm.
> Under this proposal, if we go for a long time without any buffer
> evictions, every buffer usage's count will top out at 2 more than
> wherever it was after the last clock sweep.   In the worst case, every
> buffer (or most of them) could end up with the same usage count.  But
> under the status quo, they'll all go to 5, which is an even bigger
> loss of information, and which will make the first eviction much more
> expensive than if they are all pegged at 2 or 3.
>

Okay, got your point.  On further thinking about this idea, it occurred to me
that with this idea you are trying to reduce the clock-sweep time, which
will in turn improve the chances of finding a usable buffer in less
time under high pressure.  If that is true, then I think we should have seen
a similar gain even with the bgreclaimer idea which I tried some time
back, because that also reduced the time for clock-sweep.

> There could be ways to improve things further, such as by slowly
> advancing the clock sweep even when no eviction is required.  But
> that's a bit tricky to do, too.  You'd like (perhaps) to advance it
> one step for every buffer allocation, but you don't want a centralized
> counter, or unnecessary contention on the clock sweep when no eviction
> is necessary in the first place.  There's probably some way to make it
> work, though, if we put our mind to it.  I'm inclined to try the
> simpler approach first and see what it buys.
>

Sure, I think it is worth a try.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Martijn van Oosterhout
Date:
On Wed, Apr 15, 2015 at 12:37:44AM -0400, Robert Haas wrote:
> > I think such a solution will be good for the cases when many evictions
> > needs to be performed to satisfy the workload,  OTOH when there are
> > not too many evictions that needs to be done, in such a case some of
> > the buffers that are accessed much more will have equal probability to
> > get evicted as compare to buffers which are less accessed.
>
> Possibly, but I think it's even worse under the current algorithm.
> Under this proposal, if we go for a long time without any buffer
> evictions, every buffer usage's count will top out at 2 more than
> wherever it was after the last clock sweep.   In the worst case, every
> buffer (or most of them) could end up with the same usage count.  But
> under the status quo, they'll all go to 5, which is an even bigger
> loss of information, and which will make the first eviction much more
> expensive than if they are all pegged at 2 or 3.

I've been following this thread from the side with interest and got
twigged by the point about loss of information.  If you'd like better
information about relative ages, you can achieve this by raising the
cap on the usage count and dividing (or right-shifting) each sweep.

This would allow you to remember much more about the relative
worth of often used pages.  With a cap of 32 you'd have the same effect
as now where after 5 sweeps the buffer is evicted.  Mathematically the
count would converge to the number of times the block is used per
sweep.

If you wanted to be really clever, you could at the beginning of each
sweep take an estimate of the number of buffers used since the last
sweep (from the stats collector perhaps) and use that to drive your
divisor, so if you have a lots of allocations you become more
aggressive about reducing the counts.  Or if the load is light fall
back to just subtracting one.  Then you don't need a cap at all.
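
A sketch of what that could look like (the cap and the pressure
thresholds below are arbitrary, just to show the shape of it):

#define USAGE_CAP 32

static void
on_access(int *usage_count)
{
    if (*usage_count < USAGE_CAP)
        (*usage_count)++;
}

static void
on_sweep_pass(int *usage_count, int allocs_since_last_sweep, int nbuffers)
{
    if (allocs_since_last_sweep >= nbuffers)
        *usage_count >>= 2;     /* heavy pressure: decay harder */
    else if (allocs_since_last_sweep > nbuffers / 8)
        *usage_count >>= 1;     /* moderate pressure: halve */
    else if (*usage_count > 0)
        (*usage_count)--;       /* light load: fall back to subtract one */
}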

(Apologies if this has been suggested before, Google didn't find
anything for me).

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: Clock sweep not caching enough B-Tree leaf pages?

From
Greg Stark
Date:
On Wed, Apr 15, 2015 at 5:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The way our cache works we promote when a buffer is accessed but we only
>> demote when a buffer is flushed. We flush a lot less often than we touch
>> buffers so it's not surprising that the cache ends up full of buffers that
>> are all in the "most recently used" section.
>
> This isn't really correct.  We promote when it's accessed, but we
> demote it when the clock sweep hand passes over it, which happens each
> time we consider it for eviction.  It does not have to do with
> flushing dirty date to disk, and it does not happen only when the
> buffer is actually evicted.


This is my point though (you're right that "flushed" isn't always the
same as eviction but that's not the important point here). Right now
we only demote when we consider buffers for eviction. But we promote
when we pin buffers. Those two things aren't necessarily happening at
the same rate and in fact are often orders of magnitude different. So
it makes sense that we end up with a lot of buffers promoted all the
way to the most recently used ntile and then have to do n passes
around the clock and have no good information about which buffer to
evict.

What I'm saying is that we should demote a buffer every time we
promote a buffer. So every time we pin a buffer we should advance the
clock a corresponding amount. (I know I'm being intentionally vague
about what the corresponding amount is.) The important thing is that
the two should be tied together.
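
[Editor's note: for illustration only, a crude sketch of that
coupling, with made-up names, an arbitrary pins-per-tick ratio, and
no locking; advance_clock_one_buffer() stands in for whatever routine
would actually move the hand and decay the buffer under it.]

    /* Hypothetical: tie clock advancement to buffer pins. */
    extern void advance_clock_one_buffer(void);    /* stand-in, not real */

    static long pending_pins = 0;   /* pins seen since the hand last moved */
    #define PINS_PER_TICK 8         /* made-up promotion/demotion ratio */

    static void
    on_buffer_pinned(void)
    {
        /* a promotion happened; owe the clock a corresponding demotion */
        if (++pending_pins >= PINS_PER_TICK)
        {
            pending_pins = 0;
            advance_clock_one_buffer();
        }
    }

However the ratio is chosen, the point is that decay then happens at a
rate proportional to promotion, rather than only when eviction forces
a sweep.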

-- 
greg



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> I've been following this thread from the side with interest and got
> twigged by the point about loss of information.  If you'd like better
> information about relative ages, you can achieve this by raising the
> cap on the usage count and dividing (or right-shifting) each sweep.

Yeah, I thought about that, too.  It might be worth experimenting with.

> This would allow you to remember much more about the relative
> worth of often used pages.  With a cap of 32 you'd have the same effect
> as now where after 5 sweeps the buffer is evicted.  Mathematically the
> count would converge to the number of times the block is used per
> sweep.

Hmm, interesting point.  It's possible that we'd still have problems
with everything maxing out at 32 on some workloads, but at least it'd
be a little harder to max out at 32 than at 5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Merlin Moncure
Date:
On Mon, Apr 20, 2015 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout
> <kleptog@svana.org> wrote:
>> I've been following this thread from the side with interest and got
>> twigged by the point about loss of information.  If you'd like better
>> information about relative ages, you can achieve this by raising the
>> cap on the usage count and dividing (or right-shifting) each sweep.
>
> Yeah, I thought about that, too.  It might be worth experimenting with.
>
>> This would allow you to remember much more about the relative
>> worth of often used pages.  With a cap of 32 you'd have the same effect
>> as now where after 5 sweeps the buffer is evicted.  Mathematically the
>> count would converge to the number of times the block is used per
>> sweep.
>
> Hmm, interesting point.  It's possible that we'd still have problems
> with everything maxing out at 32 on some workloads, but at least it'd
> be a little harder to max out at 32 than at 5.

Do we have any reproducible test cases to evaluate these assumptions?

I haven't looked at this stuff for a while, but my main issue with
the clock sweep was finding sweep heavy cases that did not also have
trouble with the buffer mapping locks (although the facts on the
ground may have changed since then).

merlin



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark <stark@mit.edu> wrote:
> This is my point though (you're right that "flushed" isn't always the
> same as eviction but that's not the important point here). Right now
> we only demote when we consider buffers for eviction. But we promote
> when we pin buffers. Those two things aren't necessarily happening at
> the same rate and in fact are often orders of magnitude different.

I am absolutely, positively, violently in 100% agreement with this.  I
have made the same point before, but it sure is nice to hear someone
else thinking about it the same way.

> So
> it makes sense that we end up with a lot of buffers promoted all the
> way to the most recently used ntile and then have to do n passes
> around the clock and have no good information about which buffer to
> evict.

Right.

> What I'm saying is that we should demote a buffer every time we
> promote a buffer. So every time we pin a buffer we should advance the
> clock a corresponding amount. (I know I'm being intentionally vague
> about what the corresponding amount is.) The important thing is that
> the two should be tied together.

Yes, absolutely.  If you tilt your head the right way, my proposal of
limiting the number of promotions per clock sweep has the effect of
tying buffer demotion and buffer promotion together much more tightly
than is the case right now.  You are limited to 2 promotions per
demotion; and practically speaking not all buffers eligible to be
promoted will actually get accessed, so the number of promotions per
demotion will in reality be somewhere between 0 and 2.  Ideally it
would be exactly 1, but 1 +/- 1 is still a tighter limit than we have
at present.  Which is not to say there isn't some other idea that is
better still.
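
[Editor's note: as a purely illustrative sketch of that accounting,
again with made-up names and none of the real locking, one way to read
"at most 2 promotions per demotion" is as a per-sweep budget that
accesses draw down and the clock hand refreshes.]

    /* Hypothetical: cap promotions between visits by the clock hand. */
    #define PROMOTIONS_PER_DEMOTION 2

    typedef struct
    {
        int     usage_count;
        int     promotions_left;    /* budget refreshed when the hand passes */
    } BudgetedBuffer;

    static void
    on_access(BudgetedBuffer *buf)
    {
        /* only promote while this sweep's budget lasts */
        if (buf->promotions_left > 0)
        {
            buf->promotions_left--;
            buf->usage_count++;
        }
    }

    static void
    on_clock_hand_pass(BudgetedBuffer *buf)
    {
        if (buf->usage_count > 0)
            buf->usage_count--;                          /* demotion */
        buf->promotions_left = PROMOTIONS_PER_DEMOTION;  /* new budget */
    }

Between two passes of the hand a buffer can gain at most 2, and loses
exactly 1 when the hand arrives, which is the 1 +/- 1 bound described
above.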

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Mon, Apr 20, 2015 at 11:00 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Hmm, interesting point.  It's possible that we'd still have problems
>> with everything maxing out at 32 on some workloads, but at least it'd
>> be a little harder to max out at 32 than at 5.
>
> Do we have any reproducible test cases to evaluate these assumptions?

The particular point that you are responding to here is easy to
reproduce.  Just create any workload that fits in shared_buffers - a
small pgbench database, for example - and access stuff.  No usage
counts will go down, but every access will drive usage counts up.
Eventually everything will hit any maximum you care to install, and
actually I don't think it takes very long.  You can use pg_buffercache
to see the results.

>  I haven't looked at this stuff for a while, but my main issue with
> the clock sweep was finding sweep heavy cases that did not also have
> trouble with the buffer mapping locks (although the facts on the
> ground may have changed since then).

We increased the number of buffer mapping locks to 128, so that
problem should be considerably ameliorated now.  But it's not totally
gone, as demonstrated by Andres's experiments with my chash patch.

There was also a patch that eliminated BufFreelistLock in favor of a
spinlock held for much shorter time periods, and then Andres took that
one step further and made it use atomics.  That used to be a
*terrible* bottleneck on eviction-heavy workloads and is now much
improved.  More work may remain to be done, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Jim Nasby
Date:
On 4/20/15 11:11 AM, Robert Haas wrote:
> On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark <stark@mit.edu> wrote:
>> This is my point though (you're right that "flushed" isn't always the
>> same as eviction but that's not the important point here). Right now
>> we only demote when we consider buffers for eviction. But we promote
>> when we pin buffers. Those two things aren't necessarily happening at
>> the same rate and in fact are often orders of magnitude different.
>
> I am absolutely, positively, violently in 100% agreement with this.  I
> have made the same point before, but it sure is nice to hear someone
> else thinking about it the same way.

+1

>> What I'm saying is that we should demote a buffer every time we
>> promote a buffer. So every time we pin a buffer we should advance the
>> clock a corresponding amount. (I know I'm being intentionally vague
>> about what the corresponding amount is.) The important thing is that
>> the two should be tied together.
>
> Yes, absolutely.  If you tilt your head the right way, my proposal of
> limiting the number of promotions per clock sweep has the effect of
> tying buffer demotion and buffer promotion together much more tightly
> than is the case right now.  You are limited to 2 promotions per
> demotion; and practically speaking not all buffers eligible to be
> promoted will actually get accessed, so the number of promotions per
> demotion will in reality be somewhere between 0 and 2.  Ideally it
> would be exactly 1, but 1 +/- 1 is still a tighter limit than we have
> at present.  Which is not to say there isn't some other idea that is
> better still.

I think that would help, but it still leaves user backends trying to 
advance the clock, which is quite painful. Has anyone tested running the 
clock in the background? We need a wiki page with all the ideas that 
have been tested around buffer management...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Peter Geoghegan
Date:
On Tue, Apr 14, 2015 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Why is that good?
>
> We did discuss this before.  I've recapped some of what I believe to
> be the most salient points below.
>
>> I think that people were all too quick to dismiss the idea of a wall
>> time interval playing some role here (at least as a defense against
>> correlated references, as a correlated reference period). I suppose
>> that that's because it doesn't fit with an intuition that says that
>> that kind of interval ought to be derived algebraically - magic delay
>> settings are considered suspect.
>
> Yep, Tom gave that reason here:
>
> http://www.postgresql.org/message-id/11258.1397673898@sss.pgh.pa.us
>
> But there was also this point from Andres - gettimeofday is not free:
>
> http://www.postgresql.org/message-id/20140416075307.GC3906@awork2.anarazel.de
>
> And this point from me - this can degrade to random eviction under
> high pressure:
>
> http://www.postgresql.org/message-id/CA+TgmoayUxr55zuEaPP6d2XByicJWACC9Myyn5aT4TiNdSJqYw@mail.gmail.com
>
> You'll notice that my proposal avoids all three of those objections.

All I'm saying is that you shouldn't dismiss the idea without trying
it out properly. The LRU-K paper, from the early 1990s, recommends
this, and gettimeofday() calls were a lot more expensive back then.
I'm sure that there is a way to overcome these issues if it turns out
to be worth it (by amortizing gettimeofday() calls, for example by
driving them from an auxiliary process like the bgwriter). In fact, I'm
almost certain there is, because at least one other major database
system uses just such a reference period.
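
[Editor's note: a hand-wavy sketch of what "amortizing gettimeofday()"
could mean here; the names and the shared-memory plumbing are
hypothetical. A background process refreshes a coarse timestamp a few
times per second, and backends stamp buffers by reading that value
instead of making a syscall on every access.]

    #include <sys/time.h>

    /* Hypothetical coarse clock, refreshed by an auxiliary process. */
    static volatile long coarse_now_ms = 0;

    /* Run by something like the bgwriter every few hundred milliseconds. */
    static void
    refresh_coarse_clock(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        coarse_now_ms = (long) tv.tv_sec * 1000 + tv.tv_usec / 1000;
    }

    /* Backends stamp a buffer with the cached value on access. */
    static void
    stamp_buffer_access(long *last_access_ms)
    {
        *last_access_ms = coarse_now_ms;
    }

A correlated-reference period only needs that kind of coarse
resolution anyway, so the cost objection largely goes away.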

Back when ARC (or was it 2Q?) was committed before being reverted
shortly thereafter, there was a similar idea actually implemented, but
with XIDs preventing correlated references (which the LRU-K paper also
hints at). I think that an actual delay is more robust than that,
though. Even though this correlated reference delay is all I
implemented with my prototype, it's just one piece of the puzzle.

-- 
Peter Geoghegan



Re: Clock sweep not caching enough B-Tree leaf pages?

From
Robert Haas
Date:
On Mon, Apr 20, 2015 at 2:53 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I think that would help, but it still leaves user backends trying to advance
> the clock, which is quite painful. Has anyone tested running the clock in
> the background? We need a wiki page with all the ideas that have been tested
> around buffer management...

Amit's bgreclaimer patch did that, but we weren't able to demonstrate
a clear benefit.  I haven't given up on the idea yet, but we've got to
be able to prove that it's a good idea in practice as well as in
theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company