Thread: Clock sweep not caching enough B-Tree leaf pages?
I have some theories about the PostgreSQL buffer manager/clock sweep. To motivate the reader to get through the material presented here, I present up-front a benchmark of a proof-of-concept patch of mine: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/ Test Set 4 represents the patch's performance here. This shows some considerable improvements for a tpc-b workload, with 15 minute runs, where the buffer manager struggles with moderately intense cache pressure. shared_buffers is 8GiB, with 32GiB of system memory in total. The scale factor is 5,000 here, so that puts the primary index of the accounts table at a size that makes it impossible to cache entirely within shared_buffers, by a margin of a couple of GiB. pgbench_accounts_pkey is ~"10GB", and pgbench_accounts is ~"63 GB". Obviously the heap is much larger, since for that table heap tuples are several times the size of index tuples (the ratio here is probably well below the mean, if I can be permitted to make a vast generalization). PostgreSQL implements a clock sweep algorithm, which gets us something approaching an LRU for the buffer manager in trade-off for less contention on core structures. Buffers have a usage_count/"popularity" that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK algorithm only has one bit for what approximates our "usage_count" (so it's either 0 or 1). I think that at its core CLOCK is an algorithm that has some very desirable properties that I am sure must be preserved. Actually, I think it's more accurate to say we use a variant of clock pro, a refinement of the original CLOCK. In the past, various hackers have noted problems they've observed with this scheme. A common pathology is to see frantic searching for a victim buffer only to find all buffer usage_count values at 5. It may take multiple revolutions of the clock hand before a victim buffer is found, as usage_count is decremented for each and every buffer. Also, BufFreelistLock contention is considered a serious bottleneck [1], which is related. I think a very real problem may be that approximating an LRU is bad because an actual LRU is bad, though not due to the problems that CLOCK/our clock sweep algorithm, to its great credit, ameliorates. I don't think that we need to trade off between an LRU and an MRU as Atri once suggested [2]; rather, I think we need a trade-off between an LRU and an LFU (very loosely speaking). Something like a pure LRU can make very bad choices for database workloads. This is described extensively in the literature. As I recently remarked upon for unrelated reasons [3], B+Trees perform excellently when partially cached. There is a remarkable asymmetry between how many pages we must have cached and how much I/O is avoided, particularly the random I/O that is characteristic of fully uncached B-Trees when scanned. Approximately 99% of pages in our nbtree structures can be expected to be leaf pages in many common cases. The very wide fan-out of B-Tree structures makes this possible. A lot of material on the subject doesn't emphasize this basic aspect strongly enough in my view. But it also bears mentioning that B-Tree pages are, in an important sense, far denser than heap pages. Let's leave aside inner/root pages though, because they're so dramatically useful when in a primary index on a tpc-b table that they'll always be cached by any non-terrible algorithm.
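For concreteness, the victim search just described boils down to something like the following heavily simplified sketch. It is for illustration only: it ignores buffer header locks, the freelist, the bgwriter and pinned-buffer accounting, and the names are illustrative rather than the actual identifiers in the buffer manager.

#define SKETCH_MAX_USAGE_COUNT 5        /* BM_MAX_USAGE_COUNT on master */

typedef struct
{
    int         refcount;               /* pins currently held */
    int         usage_count;            /* "popularity", bumped on pin, capped */
} BufSketch;

/* Pinning a buffer bumps its popularity, saturating at the cap. */
void
sketch_pin(BufSketch *buf)
{
    buf->refcount++;
    if (buf->usage_count < SKETCH_MAX_USAGE_COUNT)
        buf->usage_count++;
}

/*
 * Victim search: advance the clock hand, decrementing usage_count as we
 * go, until we find an unpinned buffer whose count has already reached
 * zero.  When every buffer is saturated at 5, that takes multiple full
 * revolutions of the hand.
 */
BufSketch *
sketch_get_victim(BufSketch *buffers, int nbuffers, int *hand)
{
    for (;;)
    {
        BufSketch  *buf = &buffers[*hand];

        *hand = (*hand + 1) % nbuffers;
        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return buf;             /* evict this one */
            buf->usage_count--;         /* survives one more revolution */
        }
    }
}

Note that everything here is driven by eviction pressure: a buffer's usage_count only ever comes back down when the clock hand passes it during a victim search.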
It beggars belief that the still relatively dense (and consequently *popular*) B+Tree leaf pages get so little credit for being of such long-term utility (in the view of our clock sweep algorithm as implemented). The algorithm has what could be loosely described as an excessively short-term perspective. There is clearly a better balance to be had here. I don't think the answer is that we have the B-Tree code give its pages some special treatment that makes them harder to evict, although I will admit that I briefly entertained the idea. I am aware of the false start that we had with ARC back in 2005. I believe that effort tried to address some of these problems. I am mindful of the pitfalls there. I'm inclined to think that the decision to create a CLOCK variant in preference to ARC was the right one at the time, because clock sweep really is at a considerable advantage with regard to lock contention. The 1993 paper "The LRU-K Page Replacement Algorithm For Database Disk Buffering" [4] proposes what is (roughly speaking) an algorithm that is an adaptive hybrid of LRU and LFU. Consider Example 1.1. from that paper; it describes a very simple scenario in which it's possible to have slightly more heap pages cached than B-Tree pages. This scenario is essentially a "pgbench -S" workload, comparable to the simplistic tpc-a; it's nothing more than that. The authors aren't clear on this, but I assumed a uniform distribution (IIRC, the 2Q paper, which was published a year or two later, also explicitly assumes this of the very same earlier LRU-K/LRU-2 example). It is argued within that example, quite cogently in my opinion, that it's very bad that a simple LRU cache will see just under 50% of buffers used to cache B-Tree leaf pages, while over 50% are used for "data" (heap) pages. In actual fact, it is preferable to almost exclusively cache B-Tree pages. Using 50% of the cache on B-Tree pages when it is optimal to cache a number approaching 100% of the cache is a pretty large disparity, particularly for such a simple workload. This is literally the furthest possible thing from a contrived corner case. I believe that it isn't hard to get clock sweep to do just the same as predicted in the '93 paper, even with a "usage_count". It is nothing more than an LRU approximation, or at least that's what the relevant comments say. I did some experiments here, but it probably isn't worth sharing the raw data, given what I already have here. The example surrounding caching B-Tree leaf pages in the paper draws attention to a particularly acute pathology, but there are probably others. Leaf pages are in many representative cases far denser, and therefore can be expected to be accessed far more frequently to service queries. Our current failure to credit them in a way that weighs the frequency with which they're accessed is something I suggest thinking long and hard about. Has anyone thought about this in the last few years? I know that Tom examined the LRU-K paper back in 2000 [5], but was discouraged by some kind of contention or CPU overhead (although he did say he intended to revisit the question). Obviously a lot has changed in the past 14 years, particularly with regard to CPU characteristics. Anyway, having gone through the requisite background information, I'll get back to the subject of the pgbench-tools tpc-b benchmark that I started out with, and what I've actually done in the attached patch to improve matters. Let me preface this by saying: this is a rough prototype.
The way I add a field to the buffer descriptor struct clearly isn't going to fly, since we have every reason to believe that that is itself performance critical in other scenarios (just look at Andres' recent work on the alignment of these structures in memory). Actually, it probably matters a lot in this scenario too, just not enough to mask the general outcome. I've arrived at this prototype through plenty of principled theorizing, and a bit of unprincipled frobbing. I'm a big fan of building simple prototypes to test theories. In the interest of reproducibility I have not attempted to clean up what I have here just yet, in case I accidentally invalidate the benchmark presented. The prototype patch: 1) Throttles incrementation of usage_count temporally. It becomes impossible to increment usage_count for any given buffer more frequently than every 3 seconds, while decrementing usage_count is totally unaffected. This is thought to give the algorithm a short to medium term "appreciation" of the frequency of access. It also prevents very close increments in usage_count due to pinning a buffer twice in the same query or transaction, just because the code in the executor happens to be accidentally structured such that that happens (as when inserting two tuples within a single INSERT DML query), or whatever. Clearly the tpc-b accounts table is accessed twice in succession in the same transaction by our tpc-b script, creating a sort of misrepresentation of buffer popularity that is likely in and of itself undesirable. 2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not 5 as before. This gives us a more granular representation of the popularity/usefulness of each buffer, which I believe spans a dramatically large spectrum (i.e. my guess is that testing will show I didn't go far enough). This step on its own would be assumed extremely counter-productive by those in the know, but I believe that other measures ameliorate the downsides. I could be wrong about how true that is in other cases, but then the case helped here isn't what you'd call a narrow benchmark. It's the pgbench default (although I do use unlogged tables, in part because the I/O subsystem on this server is quite under-powered, even though it has plenty of memory). 3) Has buffer usage_count start at 3. I also tried 4 (and 1, the current value within master), but settled on 3 for now. Seemingly this makes a very large difference compared to only doing 1) and 2), since it gives each page a fair shot at proving its usefulness. Presumably the throttling in 1) makes frantic buffer scanning much less prevalent, since clock sweep has the upper hand, so to speak. A tug of war between clock sweep and backends pinning buffers can be disastrous, but that seems much less likely than before. Semi-magical constant amounts of time are something you see surprisingly often in the research. Probably most notably, the "five-minute rule" [6] argues that one should only cache randomly accessed pages that are re-used every 5 minutes or less (it's actually a bit more complicated these days, but not much). This rule is a mainstay of database caching research theory. Anyway, I'm not sure whether or not this delay should be exposed as a GUC. I lean towards "no". The amount of evidence it takes to validate something like this is generally enormous. Validating the just-in-time bgwriter strategy was the original reason that Greg Smith wrote pgbench-tools. Still, I'm confident that I've identified a real problem.
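In outline, the usage_count handling in the prototype amounts to something like the following. This is a simplified sketch of the approach rather than the patch itself; the names, the bare time_t field, and the unconditional gettimeofday() call per pin are illustrative only and obviously not production quality.

#include <sys/time.h>
#include <time.h>

#define SKETCH_MAX_USAGE_COUNT   10    /* 2) raised from 5 */
#define SKETCH_INITIAL_USAGE      3    /* 3) new buffers start at 3, not 1 */
#define SKETCH_BUMP_INTERVAL_SECS 3    /* 1) at most one bump per 3 seconds */

typedef struct
{
    int         refcount;
    int         usage_count;
    time_t      last_bump;             /* illustrative extra descriptor field */
} BufSketch;

void
sketch_init_buffer(BufSketch *buf)
{
    buf->refcount = 0;
    buf->usage_count = SKETCH_INITIAL_USAGE;
    buf->last_bump = 0;
}

/*
 * Pin: the usage_count is bumped only if this buffer hasn't had a bump in
 * the last SKETCH_BUMP_INTERVAL_SECS seconds.  The decrement path in the
 * clock sweep is entirely unaffected.
 */
void
sketch_pin(BufSketch *buf)
{
    struct timeval now;

    buf->refcount++;
    gettimeofday(&now, NULL);          /* one call per pin: prototype only */
    if (buf->usage_count < SKETCH_MAX_USAGE_COUNT &&
        now.tv_sec - buf->last_bump >= SKETCH_BUMP_INTERVAL_SECS)
    {
        buf->usage_count++;
        buf->last_bump = now.tv_sec;
    }
}

Because only the increment path is throttled, a buffer has to keep getting pinned, spread out over wall-clock time, in order to hold on to a high usage_count.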
The big picture here is that pages have a fair go at proving their worth, while allowing for certain pages to be recognized as being of dramatically higher long-term utility. Generally speaking, as a caching algorithm, a pure LFU is inappropriate for almost all use-cases, because it never forgets anything about a page until the page is finally evicted, at which point it forgets everything. There is a balance to be had here, in terms of the extent to which we allow pages to "rest on their laurels". I wouldn't be surprised to learn I have the balance wrong right now. pgbench-tools benchmark interpretation ============================= I also attach a LibreOffice spreadsheet comparing hit rates for each table (and its indexes) for each run that you see in the pgbench-tools report (except the final do-over of master baseline). I built this with the help of a small pgbench-tools hack, resetting pg_statio* views, measuring hit rates per pgbench invocation. The hit rate shown would probably be a lot more interesting if I'd used a rate limit, so that the amount of work performed is consistent across test sets (that is, variants of the patch, and master). Greg Smith is very enthusiastic about how much further insight can be had while using that pgbench feature, and I'd say he's probably right to be. Right now, in terms of hit rate you mostly see a slightly smaller one for indexes, and a considerably larger one for heap buffers as compared to the master baseline. If you run down the individual runs in the pgbench-tools report, and consider what actually happens and when, you tend to see a degradation in performance over successive runs for different client counts for certain cases tested. It's hard to be sure, but I think that might be because earlier runs have a big advantage due to the index being well cached (pgbench-tools performs vacuuming at the start of each run, excluding the first run; so VACUUM reverses the situation initially seen, since we kill heap tuples last when vacuuming, whereas the initial data load creates the index last). In contrast, the impressive performance of the patch (where all 3 of the above measures are implemented) shows more consistent throughput even with that noise, which seems to support my theories about what is going on here. Leaving aside cache effectiveness as such, I suspect that in general we currently pay a high price for all too frequently having buffers fall out of shared_buffers into the OS cache, and fall back in again. Having arrived at the conclusion that these optimizations were worth sharing here, I decided to "do over" the original master baseline (Test Set 1). I ended up with a result (Test Set 5) that is actually quite different from the original result, without it being all that clear whether it was better or worse than the first time I took a baseline (it was probably worse). This might call into question the accuracy of my results. However, to see what's really going on here, it is necessary to drill down to the individual TPS/latency figures for individual pgbench invocations. That's the real story here. In general, as I said, the I/O system here is quite under-specified for this workload. This is dedicated physical hardware, but it happens to be what was immediately available to me. There are a couple of SATA 7200 rpm disks in RAID 1. The system specifications, as advertised by my hosting provider, are detailed here: https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 . As always with pgbench-tools, you can drill down to see more about each test (including kernel vm settings, etc).
Inevitably, with this disk setup, which is I imagine particularly bad with random I/O, and also with so much memory, it's only a matter of time before things get ugly, even if there is an initial high performance burst (even on master) before we have to face the music, unsurprisingly. It seems like the pain is felt at slightly different times between Test Set 1 and Test Set 5 (i.e. for each test of master). With either of those 2 test sets, if you drill down to see the moment-to-moment throughput and latency, you'll find tests from each test set where both latency and throughput were on the floor for as long as a couple of minutes at a time, after which there is a sudden recovery. We're subject to the vagaries of kernel I/O scheduling to a much greater extent than with the good patched test sets, it seems. What is seen with master looks very much like sync phase checkpoint spikes. Once or twice, things are consistently very poor for almost an entire master invocation of pgbench. In general things are very inconsistent, although to be fair it's possible that at its best master has greater throughput than patched for brief moments (of course, the buffer descriptor bloating which I have yet to deal with may well be all that is to blame here). If you look at the test sets that this patch covers (with all the tricks applied), there are pretty good figures throughout. You can kind of see the pain towards the end, but there are no dramatic falls in responsiveness for minutes at a time. There are latency spikes, but they're *far* shorter, and much better hidden. Without looking at individual multiple minute spikes, at the macro level (all client counts for all runs) average latency is about half of what is seen on master. Does anyone have any high-end hardware that they could use to test this out? [1] http://www.postgresql.org/message-id/CA+TgmobJm0GHk58nUPRQHCGwY25n1DCkU4ku9aQeczZEjiz9mQ@mail.gmail.com [2] http://www.postgresql.org/message-id/CAOeZVic4HikhmzVD=ZP4JY9g8PgpyiQQOXOELWP=kR+=H1Frgg@mail.gmail.com [3] http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com [4] http://www.cs.cmu.edu/~christos/courses/721-resources/p297-o_neil.pdf [5] http://www.postgresql.org/message-id/1601.967421129@sss.pgh.pa.us [6] http://en.wikipedia.org/wiki/Five-minute_rule -- Peter Geoghegan
On 4/14/14, 12:11 PM, Peter Geoghegan wrote: > I have some theories about the PostgreSQL buffer manager/clock sweep. > To motivate the reader to get through the material presented here, I > present up-front a benchmark of a proof-of-concept patch of mine: > > http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/ > > Test Set 4 represents the patch's performance here. > > This shows some considerable improvements for a tpc-b workload, with > 15 minute runs, where the buffer manager struggles with moderately > intense cache pressure. shared_buffers is 8GiB, with 32GiB of system > memory in total. The scale factor is 5,000 here, so that puts the > primary index of the accounts table at a size that makes it impossible > to cache entirely within shared_buffers, by a margin of a couple of > GiB. pgbench_accounts_pkey is ~"10GB", and pgbench_accounts is ~"63 > GB". Obviously the heap is much larger, since for that table heap > tuples are several times the size of index tuples (the ratio here is > probably well below the mean, if I can be permitted to make a vast > generalization). > > PostgreSQL implements a clock sweep algorithm, which gets us something > approaching an LRU for the buffer manager in trade-off for less > contention on core structures. Buffers have a usage_count/"popularity" > that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK > algorithm only has one bit for what approximates our "usage_count" (so > it's either 0 or 1). I think that at its core CLOCK is an algorithm > that has some very desirable properties that I am sure must be > preserved. Actually, I think it's more accurate to say we use a > variant of clock pro, a refinement of the original CLOCK. I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of which has its own clock. IIRC one of the arguments for us supporting a count>1 was we could get the benefits of multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page pool in an OS *run at different rates* based on system demands. I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this difference any time we look at what OSes do. > If you look at the test sets that this patch covers (with all the > tricks applied), there are pretty good figures throughout. You can > kind of see the pain towards the end, but there are no dramatic falls > in responsiveness for minutes at a time. There are latency spikes, but > they're *far* shorter, and much better hidden. Without looking at > individual multiple minute spikes, at the macro level (all client > counts for all runs) average latency is about half of what is seen on > master. My guess would be that those latency spikes are caused by a need to run the clock for an extended period. IIRC there's code floating around that makes it possible to measure that. I suspect it would be very interesting to see what happens if your patch is combined with the work that (Greg?) did to reduce the odds of individual backends needing to run the clock. (I know part of that work looked at proactively keeping pages on the free list, but I think there was more to it than that). -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Mon, Apr 14, 2014 at 10:11:53AM -0700, Peter Geoghegan wrote: > Has anyone thought about this in the last few years? I know that Tom > examined the LRU-K paper back in 2000 [5], but was discouraged by some > kind of contention or CPU overhead (although he did say he intended to > revisit the question). Obviously a lot has changed in the past 14 > years, particularly with regard to CPU characteristics. I am glad you are looking at this. You are right that it requires a huge amount of testing, but clearly our code needs improvement in this area. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
* Jim Nasby (jim@nasby.net) wrote: > I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of which has its own clock. IIRC one of the arguments for us supporting a count>1 was we could get the benefits of multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page pool in an OS *run at different rates* based on system demands. They're also maintained in *parallel*, no? That's something that I've been talking over with a few folks at various conferences - that we should consider breaking up shared buffers and then have new backend processes which work through each pool independently and in parallel. > I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this difference any time we look at what OSes do. It's my suspicion that the one-big-pool is exactly why we see many cases where PG performs worse when the pool is more than a few gigs. Of course, this is all speculation and proper testing needs to be done. Thanks, Stephen
On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote: > I am glad you are looking at this. You are right that it requires a > huge amount of testing, but clearly our code needs improvement in this > area. Thanks. Does anyone recall the original justification for the recommendation that shared_buffers never exceed 8GiB? I'd like to revisit the test case, if such a thing exists. -- Peter Geoghegan
On Mon, Apr 14, 2014 at 05:45:56PM -0700, Peter Geoghegan wrote: > On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote: > > I am glad you are looking at this. You are right that it requires a > > huge amount of testing, but clearly our code needs improvement in this > > area. > > Thanks. > > Does anyone recall the original justification for the recommendation > that shared_buffers never exceed 8GiB? I'd like to revisit the test > case, if such a thing exists. I have understood it to be that the overhead of managing over 1 million buffers is too large if you aren't accessing more than 8GB of data in a five-minute period. If you are accessing that much, it might be possible to have a win over 8GB. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 4/14/14, 7:43 PM, Stephen Frost wrote: > * Jim Nasby (jim@nasby.net) wrote: >> I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of which has its own clock. IIRC one of the arguments for us supporting a count>1 was we could get the benefits of multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page pool in an OS *run at different rates* based on system demands. > > They're also maintained in *parallel*, no? That's something that I've > been talking over with a few folks at various conferences - that we > should consider breaking up shared buffers and then have new backend > processes which work through each pool independently and in parallel. I suspect that varies based on the OS, but it certainly happens in a separate process from user processes. The expectation is that there should always be pages on the free list so requests for memory can happen quickly. http://www.freebsd.org/doc/en/articles/vm-design/freeing-pages.html contains a good overview of what FreeBSD does. See http://www.freebsd.org/doc/en/articles/vm-design/allen-briggs-qa.html#idp62990256 as well. >> I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this difference any time we look at what OSes do. > > It's my suspicion that the one-big-pool is exactly why we see many cases > where PG performs worse when the pool is more than a few gigs. Of > course, this is all speculation and proper testing needs to be done. I think there are some critical take-aways from FreeBSD that apply here (in no particular order): 1: The system is driven by memory pressure. No pressure means no processing. 2: It sounds like the active list is LFU, not LRU. The cache list is LRU. 3: *The use counter is maintained by a clock.* Because the clock only runs so often this means there is no run-away incrementing like we see in Postgres. 4: Once a page is determined to not be active it goes onto a separate list depending on whether it's clean or dirty. 5: Dirty pages are only written to maintain a certain clean/dirty ratio and again, only when there's actual memory pressure. 6: The system maintains a list of free pages to serve memory requests quickly. In fact, lower level functions (ie: http://www.leidinger.net/FreeBSD/dox/vm/html/d4/d65/vm__phys_8c_source.html#l00862) simply return NULL if they can't find pages on the free list. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Mon, Apr 14, 2014 at 7:45 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote: >> I am glad you are looking at this. You are right that it requires a >> huge amount of testing, but clearly our code needs improvement in this >> area. > > Thanks. > > Does anyone recall the original justification for the recommendation > that shared_buffers never exceed 8GiB? I'd like to revisit the test > case, if such a thing exists. There are many reports of improvement from lowering shared_buffers. The problem is that it tends to show up on complex production workloads and that there is no clear evidence pointing to problems with the clock sweep; it could be higher up in the partition locks or something else entirely (like the O/S). pgbench is also not the greatest tool for sniffing out these cases: it's too random and for large database optimization is generally an exercise in de-randomizing i/o patterns. We really, really need a broader testing suite that covers more usage patterns. I was suspicious for a while that spinlock contention inside the clocksweep was causing stalls and posted a couple of different patches to try and reduce the chance of that. I basically gave up when I couldn't demonstrate that case in simulated testing. I still think there is no good reason for the clock to pedantically adjust usage count on contented buffers...better to throw a single TTAS and bail to the next buffer if either 'T' signals a lock. merlin
On Tue, Apr 15, 2014 at 9:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > There are many reports of improvement from lowering shared_buffers. > The problem is that it tends to show up on complex production > workloads and that there is no clear evidence pointing to problems > with the clock sweep; it could be higher up in the partition locks or > something else entirely (like the O/S). pgbench is also not the > greatest tool for sniffing out these cases: it's too random and for > large database optimization is generally an exercise in de-randomizing > i/o patterns. We really, really need a broader testing suite that > covers more usage patterns. I find it quite dissatisfying that we know so little about this. I'm finding that my patch helps much less when shared_buffers is sized large enough to fit the index entirely (although there are still some localized stalls on master, where there are none with patched). shared_buffers is still far too small to fit the entire heap. With shared_buffers=24GB (which still leaves just under 8GB of memory for the OS to use as cache, since this system has 32GB of main memory), the numbers are much less impressive relative to master with the same configuration. Both sets of numbers are still better than what you've already seen with shared_buffers=8GB, since of course the "no more than 8GB" recommendation is not an absolute, and as you say its efficacy seemingly cannot be demonstrated with pgbench. My guess is that the patch doesn't help because once there is more than enough room to cache the entire index (slightly over twice as many buffers as would be required to do so), even on master it becomes virtually impossible to evict those relatively popular index pages, since they still have an early advantage. It doesn't matter that master's clock sweep has what I've called an excessively short-term perspective, because there is always enough pressure relative to the number of leaf pages being pinned to prefer to evict heap pages. There is still a lot of buffers that can fit some moderate proportion of all heap pages even after buffering the entire index (something like ~13GB). You might say that with this new shared_buffers setting, clock sweep doesn't need to have a "good memory", because it can immediately observe the usefulness of B-Tree leaf pages. There is no need to limit myself to speculation here, of course. I'll check it out using pg_buffercache. -- Peter Geoghegan
On Mon, Apr 14, 2014 at 8:11 PM, Peter Geoghegan <pg@heroku.com> wrote: > PostgreSQL implements a clock sweep algorithm, which gets us something > approaching an LRU for the buffer manager in trade-off for less > contention on core structures. Buffers have a usage_count/"popularity" > that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK > algorithm only has one bit for what approximates our "usage_count" (so > it's either 0 or 1). I think that at its core CLOCK is an algorithm > that has some very desirable properties that I am sure must be > preserved. Actually, I think it's more accurate to say we use a > variant of clock pro, a refinement of the original CLOCK. PostgreSQL replacement algorithm is more similar to Generalized CLOCK or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm that approximates LIRS[3]. LIRS is what MySQL implements[4] and CLOCK-Pro is implemented by NetBSD [5] and there has been some work on trying it on Linux [6]. Both LIRS and CLOCK-Pro work by keeping double the cache size metadata entries and detect pages that have been recently referenced. Basically they provide an adaptive tradeoff between LRU and LFU. > In the past, various hackers have noted problems they've observed with > this scheme. A common pathology is to see frantic searching for a > victim buffer only to find all buffer usage_count values at 5. It may > take multiple revolutions of the clock hand before a victim buffer is > found, as usage_count is decremented for each and every buffer. Also, > BufFreelistLock contention is considered a serious bottleneck [1], > which is related. There's a paper on a non blocking GCLOCK algorithm, that does lock free clock sweep and buffer pinning[7]. If we decide to stay with GCLOCK it may be interesting, although I still believe that some variant of buffer nailing[8] is a better idea, my experience shows that most of the locking overhead is cache line bouncing ignoring the extreme cases where our naive spinlock implementation blows up. > Let's leave aside inner/root pages though, because they're so > dramatically useful when in a primary index on a tpb-b table that > they'll always be cached by any non-terrible algorithm. It beggars > belief that the still relatively dense (and consequently *popular*) > B+Tree leaf pages get so little credit for being of such long-term > utility (in the view of our clock sweep algorithm as implemented). The > algorithm has what could be loosely described as an excessively > short-term perspective. There is clearly a better balance to be had > here. I don't think the answer is that we have the B-Tree code give > its pages some special treatment that makes them harder to evict, > although I will admit that I briefly entertained the idea. There has been some research that indicates that for TPC-A workloads giving index pages higher weights increases hitrates[1]. I think the hardest hurdle for any changes in this area will be showing that we don't have any nasty regressions. I think the best way to do that would be to study separately the performance overhead of the replacement algorithm and optimality of the replacement choices. If we capture a bunch of buffer reference traces by instrumenting PinBuffer, we can pretty accurately simulate the behavior of different algorithm and tuning choices with different shared buffer sizes. Obviously full scale tests are still needed due to interactions with OS, controller and disk caches and other miscellaneous influences. 
But even so, simulation would get us much better coverage of various workloads and at least some confidence that it's a good change overall. It will be very hard and time consuming to gather equivalent evidence with full scale tests. [1] http://www.csd.uoc.gr/~hy460/pdf/p35-nicola.pdf [2] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf [3] http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang02_LIRS.pdf [4] http://lists.mysql.com/commits/28601 [5] http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD [6] http://lwn.net/Articles/147879/ [7] http://derby-nb.googlecode.com/svn-history/r41/trunk/derby-nb/ICDE10_conf_full_409.pdf [8] http://www.postgresql.org/message-id/CA+TgmoZYPeYHWAUeJVYy9A5aNDoULcF33WTnprfR9SYcw30vAg@mail.gmail.com Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote: > PostgreSQL replacement algorithm is more similar to Generalized CLOCK > or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm > that approximates LIRS[3]. LIRS is what MySQL implements[4] and > CLOCK-Pro is implemented by NetBSD [5] and there has been some work on > trying it on Linux [6]. Both LIRS and CLOCK-Pro work by keeping double > the cache size metadata entries and detect pages that have been > recently referenced. Basically they provide an adaptive tradeoff > between LRU and LFU. That's good to know. > There's a paper on a non blocking GCLOCK algorithm, that does lock > free clock sweep and buffer pinning[7]. If we decide to stay with > GCLOCK it may be interesting, although I still believe that some > variant of buffer nailing[8] is a better idea, my experience shows > that most of the locking overhead is cache line bouncing ignoring the > extreme cases where our naive spinlock implementation blows up. You might be right about that, but lets handle one problem at a time. Who knows what the bottleneck will end up being if and when we address the naivety around frequency? I want to better characterize that problem first. > There has been some research that indicates that for TPC-A workloads > giving index pages higher weights increases hitrates[1]. Frankly, there doesn't need to be any research on this, because it's just common sense that probabilistically, leaf pages are much more useful than heap pages in servicing index scan queries if we assume a uniform distribution. If we don't assume that, then they're still more useful on average. > I think the hardest hurdle for any changes in this area will be > showing that we don't have any nasty regressions. I think the best way > to do that would be to study separately the performance overhead of > the replacement algorithm and optimality of the replacement choices. > If we capture a bunch of buffer reference traces by instrumenting > PinBuffer, we can pretty accurately simulate the behavior of different > algorithm and tuning choices with different shared buffer sizes. > Obviously full scale tests are still needed due to interactions with > OS, controller and disk caches and other miscellaneous influences. But > even so, simulation would get us much better coverage of various > workloads and at least some confidence that it's a good change > overall. It will be very hard and time consuming to gather equivalent > evidence with full scale tests. I think I agree with all of that. The fact that we as a community don't appear to have too much to say about what workloads to prioritize somewhat frustrates this. The other problem is that sizing shared_buffers appropriately involves a surprising amount of deference to rules of thumb that in practice no one is quite prepared to rigorously defend - who is to say what apportionment of memory to Postgres is appropriate here? I too was hopeful that we could evaluate this work purely in terms of observed improvements to hit rate (at least initially), but now I doubt even that. It would be great to be able to say "here are the parameters of this discussion", and have everyone immediately agree with that, but in this instance that's legitimately not possible. -- Peter Geoghegan
On Wed, Apr 16, 2014 at 5:00 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote: >> There's a paper on a non blocking GCLOCK algorithm, that does lock >> free clock sweep and buffer pinning[7]. If we decide to stay with >> GCLOCK it may be interesting, although I still believe that some >> variant of buffer nailing[8] is a better idea, my experience shows >> that most of the locking overhead is cache line bouncing ignoring the >> extreme cases where our naive spinlock implementation blows up. > > You might be right about that, but lets handle one problem at a time. > Who knows what the bottleneck will end up being if and when we address > the naivety around frequency? I want to better characterize that > problem first. Just to summarize the previous discussion, the improvements that we decided to do in this area based on feedback are as follows: 1. Bgwriter needs to be improved so that it can help in reducing usage count and finding the next victim buffer (run the clock sweep and add buffers to the free list). 2. SetLatch for bgwriter (wakeup bgwriter) when elements in the freelist are low. 3. Split the work done under the global lock (BufFreelistLock) in StrategyGetBuffer (a spinlock for the freelist, and an lwlock for the clock sweep). Here we can try to make it lock-free based on atomic ops as well. 4. Bgwriter needs to be more aggressive; the logic based on which it calculates how many buffers it needs to process needs to be improved. 5. Contention around buffer mapping locks. 6. Cacheline bouncing around the buffer header spinlocks; is there anything we can do to reduce this? 7. Choose an optimistically used buffer in StrategyGetBuffer(). 8. Don't bump the usage count every time a buffer is pinned. I have already addressed some of these improvements in the patch [1] and, for the others, I plan to work on them for 9.5. I think here you want to address the improvements related to usage count and see if it can get us a win in some commonly used scenarios, without affecting any other commonly used scenario. I feel this is a good idea to pursue and see if we can get good benefits with it. In fact, a few days back I had run some tests manually to see the problems around BufFreelistLock (currently I don't have the script ready), and more recently Jason Petersen has done some benchmarking in this area, which you can refer to here [2]. I wonder if we can work together to improve things in this area. [1] http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com [2] https://googledrive.com/host/0Bx33JCTmOADOeTIwaE9KX21yWEk/Concurrency%20Limits%20with%20Large%20Working%20Sets With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote: > In the past, various hackers have noted problems they've observed with > this scheme. A common pathology is to see frantic searching for a > victim buffer only to find all buffer usage_count values at 5. It may > take multiple revolutions of the clock hand before a victim buffer is > found, as usage_count is decremented for each and every buffer. Also, > BufFreelistLock contention is considered a serious bottleneck [1], > which is related. I think that the basic problem here is that usage counts increase when buffers are referenced, but they decrease when buffers are evicted, and those two things are not in any necessary way connected to each other. In particular, if no eviction is happening, usage counts will converge to the maximum value. I've read a few papers about algorithms that attempt to segregate the list of buffers into "hot" and "cold" lists, and an important property of such algorithms is that they mustn't be allowed to make everything hot. It's easy to be too simplistic, here: an algorithm that requires that no more than half the list be hot will fall over badly on a workload where the working set exceeds the available cache and the really hot portion of the working set is 60% of the available cache. So you need a more sophisticated algorithm than that. But that core property that not all buffers can be hot must somehow be preserved, and our algorithm doesn't. This isn't a fundamental property of the usage-count idea; it's an artifact of the fact that usage count decreases are tied to eviction pressure rather than access pressure. For example, suppose we made a rule that if the total usage counts of all buffers exceed 3 * NBuffers, then every time you bump the usage count of a buffer from N to N+1, you're required to advance the clock sweep far enough to decrease the usage count of a buffer by one. When you want to reclaim a buffer, you advance a separate clock sweep until you find a buffer with a zero usage count; if you circle the whole ring without finding one, then you reclaim the buffer you saw with the lowest usage count. There are obvious scalability problems here (everyone fighting over the right to advance the clock sweep) but ignore that for the sake of the thought experiment: now you have an algorithm where not all buffers can be hot. If some buffers are hotter than others, then whenever their usage count is decreased it will immediately get pushed back up again, but some other buffer then has to absorb the decrease. Only the buffers that are really hot can maintain high usage counts, because *somebody* has to have a low usage count. Even ignoring scalability concerns, this might not be (and probably isn't) exactly what we want to implement, but I think it illustrates an important control principle all the same: buffer "cooling" needs to be driven by the same underlying phenomenon - probably buffer access - as buffer "heating". If they're driven by unrelated phenomena, then the rates may be wildly incomparable, and you'll end up with everything hot or everything cold. If that happens, you lose, because with everything the same, there's no principled way to decide which things are actually best to evict. If we come up with some good solution for shared buffers, we should also consider applying it to SLRU eviction. I believe that the current situation around CLOG eviction is none too pretty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
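A minimal sketch of the thought experiment described above, purely for illustration (this is not PostgreSQL code, and it ignores locking entirely, which is exactly the scalability problem already conceded): cooling is driven by access pressure, so once the pool is "too hot" overall, every bump has to be paid for by cooling some other buffer.

typedef struct
{
    int         refcount;
    int         usage_count;
} BufSketch;

typedef struct
{
    BufSketch  *buffers;
    int         nbuffers;
    long        total_usage;           /* running sum of all usage_counts */
    int         cool_hand;             /* hand advanced on "heating" */
    int         evict_hand;            /* hand advanced on reclaim */
} PoolSketch;

/* Advance the cooling hand until exactly one usage_count is decremented. */
static void
cool_one_buffer(PoolSketch *pool)
{
    for (;;)                           /* only reached when total_usage > 0 */
    {
        BufSketch  *buf = &pool->buffers[pool->cool_hand];

        pool->cool_hand = (pool->cool_hand + 1) % pool->nbuffers;
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
            pool->total_usage--;
            return;
        }
    }
}

/*
 * Buffer access: a bump from N to N+1 must be matched by a decrement
 * elsewhere once the pool exceeds 3 * NBuffers in total usage.
 */
void
sketch_access(PoolSketch *pool, BufSketch *buf)
{
    if (buf->usage_count < 5)
    {
        buf->usage_count++;
        pool->total_usage++;
        if (pool->total_usage > 3L * pool->nbuffers)
            cool_one_buffer(pool);
    }
}

/*
 * Reclaim: one full revolution looking for usage_count == 0; failing that,
 * fall back to the lowest count seen (NULL only if everything is pinned).
 */
BufSketch *
sketch_reclaim(PoolSketch *pool)
{
    BufSketch  *coldest = NULL;
    int         i;

    for (i = 0; i < pool->nbuffers; i++)
    {
        BufSketch  *buf = &pool->buffers[pool->evict_hand];

        pool->evict_hand = (pool->evict_hand + 1) % pool->nbuffers;
        if (buf->refcount != 0)
            continue;
        if (buf->usage_count == 0)
            return buf;
        if (coldest == NULL || buf->usage_count < coldest->usage_count)
            coldest = buf;
    }
    return coldest;
}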
On Tue, Apr 15, 2014 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote: >> In the past, various hackers have noted problems they've observed with >> this scheme. A common pathology is to see frantic searching for a >> victim buffer only to find all buffer usage_count values at 5. It may >> take multiple revolutions of the clock hand before a victim buffer is >> found, as usage_count is decremented for each and every buffer. Also, >> BufFreelistLock contention is considered a serious bottleneck [1], >> which is related. > > I think that the basic problem here is that usage counts increase when > buffers are referenced, but they decrease when buffers are evicted, > and those two things are not in any necessary way connected to each > other. In particular, if no eviction is happening, reference counts > will converge to the maximum value. I've read a few papers about > algorithms that attempt to segregate the list of buffers into "hot" > and "cold" lists, and an important property of such algorithms is that > they mustn't be allowed to make everything hot. It's possible that I've misunderstood what you mean here, but do you really think it's likely that everything will be hot, in the event of using something like what I've sketched here? I think it's an important measure against this general problem that buffers really earn the right to be considered hot, so to speak. With my prototype, in order for a buffer to become as hard to evict as possible, at a minimum it must be *continually* pinned for at least 30 seconds. That's actually a pretty tall order. Although, as I said, I wouldn't be surprised if it was worth making it possible for buffers to be even more difficult to evict than that. It should be extremely difficult to evict a root B-Tree page, and to a lesser extent inner pages even under a lot of cache pressure, for example. There are lots of workloads in which that can happen, and I have a hard time believing that it's worth it to evict given the extraordinary difference in their utility as compared to a lot of other things. I can imagine a huge barrier against evicting what is actually a relatively tiny number of pages being worth it. I don't want to dismiss what you're saying about heating and cooling being unrelated, but I don't find the conclusion that not everything can be hot obvious. Maybe "heat" should be relative rather than absolute, and maybe that's actually what you meant. There is surely some workload where buffer access actually is perfectly uniform, and what do you do there? What "temperature" are those buffers? It occurs to me that within the prototype patch, even though usage_count is incremented in a vastly slower fashion (in a wall time sense), clock sweep doesn't take advantage of that. I should probably investigate having clock sweep become more aggressive in decrementing in response to realizing that it won't get some buffer's usage_count down to zero on the next revolution either. There are certainly problems with that, but they might be fixable. Within the patch, in order for it to be possible for the usage_count to be incremented in the interim, an average of 1.5 seconds must pass, so if clock sweep were to anticipate another no-set-to-zero revolution, it seems pretty likely that it would be exactly right, or if not then close enough, since it can only really fail to correct for some buffers getting incremented once more in the interim. 
Conceptually, it would be like multiple logical revolutions were merged into one actual one, sufficient to have the next revolution find a victim buffer. -- Peter Geoghegan
Hi, It's good to see focus on this - some improvements around s_b are sorely needed. On 2014-04-14 10:11:53 -0700, Peter Geoghegan wrote: > 1) Throttles incrementation of usage_count temporally. It becomes > impossible to increment usage_count for any given buffer more > frequently than every 3 seconds, while decrementing usage_count is > totally unaffected. I think this is unfortunately completely out of the question. For one, a gettimeofday() for every buffer pin will become a significant performance problem. Even the computation of the xact/stmt start/stop timestamps shows up pretty heavily in profiles today - and they are far less frequent than buffer pins. And that's on x86 linux, where gettimeofday() is implemented as something more lightweight than a full syscall. The other significant problem I see with this is that it's not adaptive to the actual throughput of buffers in s_b. In many cases there's hundreds of clock cycles through shared buffers in 3 seconds. By only increasing the usagecount that often you've destroyed the little semblance of a working LRU there is right now. It also wouldn't work well for situations with a fast changing workload >> s_b. If you have frequent queries that take a second or so and access some data repeatedly (index nodes or whatnot), only increasing the usagecount once will mean they'll continually fall back to disk access. > 2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not > 5 as before. ... This step on its own would be assumed extremely > counter-productive by those in the know, but I believe that other > measures ameliorate the downsides. I could be wrong about how true > that is in other cases, but then the case helped here isn't what you'd > call a narrow benchmark. I don't see which of the mechanisms you have suggested counters this? I think having a more granular usagecount is a good idea, but I don't think it can realistically be implemented with the current method of choosing victim buffers. The number of cacheline misses around that is already a major scalability limit; we surely can't make this even worse. I think it'd be possible to get back to this if we had a better bgwriter implementation. Greetings, Andres Freund
On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I think this is unfortunately completely out of question. For one a > gettimeofday() for every uffer pin will become a significant performance > problem. Even the computation of the xact/stm start/stop timestamps > shows up pretty heavily in profiles today - and they are far less > frequent than buffer pins. And that's on x86 linux, where gettimeofday() > is implemented as something more lightweight than a full syscall. Come on, Andres. Of course exactly what I've done here is completely out of the question as a patch that we can go and commit right now. I've numerous caveats about bloating the buffer descriptors, and about it being a proof of concept. I'm pretty sure we can come up with a scheme to significantly cut down on the number of gettimeofday() calls if it comes down to it. In any case, I'm interested in advancing our understanding of the problem right now. Let's leave the minutiae to one side for the time being. > The other significant problem I see with this is that its not adaptive > to the actual throughput of buffers in s_b. In many cases there's > hundreds of clock cycles through shared buffers in 3 seconds. By only > increasing the usagecount that often you've destroyed the little > semblance to a working LRU there is right now. If a usage_count can get to BM_MAX_USAGE_COUNT from its initial allocation within an instant, that's bad. It's that simple. Consider all the ways in which that can happen almost by accident. You could probably reasonably argue that the trade-off or lack of adaption (between an LRU and an LFU) that this particular sketch of mine represents is inappropriate or sub-optimal, but I don't understand why you're criticizing the patch for doing what I expressly set out to do. I wrote "I think a very real problem that may be that approximating an LRU is bad because an actual LRU is bad". > It also wouldn't work well for situations with a fast changing > workload >> s_b. If you have frequent queries that take a second or so > and access some data repeatedly (index nodes or whatnot) only increasing > the usagecount once will mean they'll continually fall back to disk access. No, it shouldn't, because there is a notion of buffers getting a fair chance to prove themselves. Now, it might well be the case that there are workloads where what I've done to make that happen in this prototype doesn't work out too well - I've already said so. But should a buffer get a usage count of 5 just because the user inserted 5 tuples within a single DML command, for example? If so, why? -- Peter Geoghegan
On 2014-04-16 01:58:23 -0700, Peter Geoghegan wrote: > On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I think this is unfortunately completely out of the question. For one, a > > gettimeofday() for every buffer pin will become a significant performance > > problem. Even the computation of the xact/stmt start/stop timestamps > > shows up pretty heavily in profiles today - and they are far less > > frequent than buffer pins. And that's on x86 linux, where gettimeofday() > > is implemented as something more lightweight than a full syscall. > > Come on, Andres. Of course exactly what I've done here is completely > out of the question as a patch that we can go and commit right now. > I've numerous caveats about bloating the buffer descriptors, and about > it being a proof of concept. I'm pretty sure we can come up with a > scheme to significantly cut down on the number of gettimeofday() calls > if it comes down to it. In any case, I'm interested in advancing our > understanding of the problem right now. Let's leave the minutiae to > one side for the time being. *I* don't think any scheme that involves measuring the time around buffer pins is going to be acceptable. It's better that I say that now rather than when you've invested significant time into the approach, no? > > The other significant problem I see with this is that it's not adaptive > > to the actual throughput of buffers in s_b. In many cases there's > > hundreds of clock cycles through shared buffers in 3 seconds. By only > > increasing the usagecount that often you've destroyed the little > > semblance of a working LRU there is right now. > > If a usage_count can get to BM_MAX_USAGE_COUNT from its initial > allocation within an instant, that's bad. It's that simple. Consider > all the ways in which that can happen almost by accident. Yes, I agree that that's a problem. It immediately going down to zero is a problem as well though. And that's what will happen in many scenarios, because you have time limits on increasing the usagecount, but not when decreasing. > > It also wouldn't work well for situations with a fast changing > > workload >> s_b. If you have frequent queries that take a second or so > > and access some data repeatedly (index nodes or whatnot) only increasing > > the usagecount once will mean they'll continually fall back to disk access. > > No, it shouldn't, because there is a notion of buffers getting a fair > chance to prove themselves. If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock cycles/second, how does *any* buffer have a chance to prove itself? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr 16, 2014 at 2:18 AM, Andres Freund <andres@2ndquadrant.com> wrote: > *I* don't think any scheme that involves measuring the time around > buffer pins is going to be acceptable. It's better than I say that now > rather than when you've invested significant time into the approach, no? Well, I do think that it will be possible to make something like that work. LRU-K/LRU-2 involves remembering the last two access times (not the last one). Researchers considered preeminent authorities on caching algorithms thought that was a good idea in 1993. There are plenty of other examples of similar schemes too. >> No, it shouldn't, because there is a notion of buffers getting a fair >> chance to prove themselves. > > If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock > cycles/second, how does *any* buffer has a chance to prove itself? There could be lots of ways. I thought about representing that more directly. I don't think that it's useful to have a large number of revolutions in search of a victim under any circumstances. Fundamentally, you're asking "what if any scheme here leans too heavily towards frequency?". That could certainly be a problem, as I've said, and we could think about adaptation over heuristics, as I've said, but it is very obviously a big problem that clock sweep doesn't really care about frequency one bit right now. Why should I be the one with all the answers? Aren't you interested in the significance of the patch, and the test case? -- Peter Geoghegan
Hi, On 2014-04-16 02:57:54 -0700, Peter Geoghegan wrote: > Why should I be the one with all the answers? Who said you need to be? The only thing I am saying is that I don't agree with some of your suggestions? I only responded to the thread now because downthread (in CAM3SWZQa2OAVUrfPL-df=we1sMozKBR392SW_NoVuKZEPXhu9w@mail.gmail.com) you further argued using the timestamp - which I think is a flawed concept. So I thought it'd be fair to argue against it now, rather than later. > Aren't you interested in the significance of the patch, and the test case? Not particularly in the specifics to be honest. The tradeoffs of the techniques you used in there seem prohibitive to me. It's easy to make individual cases faster by sacrificing others. Sometimes it's useful to prototype solutions while narrowing the scope for evaluation to get faster feedback, but as I don't see the solutions to be applicable in the general case... I think it's very important to improve upon the current state. It's imo one of postgres' biggest issues. But it's also far from trivial, otherwise it'd be done already. I *personally* don't think it's very likely that we can improve significantly upon the current state as long as every process regularly participates in the clock sweep. ISTM that prevents many more elaborate techniques to be used (cache misses/bus traffic, locking). But that's just gut feeling. I also think there are bigger issues than the actual LRU/whatever behaviour, namely the scalability issues around shared buffers making both small and big s_b settings major bottlenecks. But that's just where I have seen more issues personally. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr 16, 2014 at 3:22 AM, Peter Geoghegan <pg@heroku.com> wrote: > It's possible that I've misunderstood what you mean here, but do you > really think it's likely that everything will be hot, in the event of > using something like what I've sketched here? I think it's an > important measure against this general problem that buffers really > earn the right to be considered hot, so to speak. With my prototype, > in order for a buffer to become as hard to evict as possible, at a > minimum it must be *continually* pinned for at least 30 seconds. > That's actually a pretty tall order. Although, as I said, I wouldn't > be surprised if it was worth making it possible for buffers to be even > more difficult to evict than that. It should be extremely difficult to > evict a root B-Tree page, and to a lesser extent inner pages, even > under a lot of cache pressure, for example. There are lots of > workloads in which that can happen, and I have a hard time believing > that it's worth it to evict them given the extraordinary difference in > their utility as compared to a lot of other things. I can imagine a > huge barrier against evicting what is actually a relatively tiny > number of pages being worth it. I'm making a general statement about a property that I think a buffer eviction algorithm ought to have. I actually didn't say anything about the algorithm you've chosen one way or the other. Obviously, you've built in some protections against everything becoming hot, and that's a good thing as far as it goes. But you also have a greatly increased risk of everything becoming cold. All you need is a rate of buffer eviction that circles shared_buffers more often than once every 3 seconds, and everything will gradually cool down until you once again can't distinguish which stuff is hot from which stuff isn't. > I don't want to dismiss what you're saying about heating and cooling > being unrelated, but I don't find the conclusion that not everything > can be hot obvious. Maybe "heat" should be relative rather than > absolute, and maybe that's actually what you meant. There is surely > some workload where buffer access actually is perfectly uniform, and > what do you do there? What "temperature" are those buffers? Obviously, some value lower than the maximum and higher than the minimum. If they're all at max temperature and then a new buffer (a btree root or vm page, for example) comes along and is much hotter, there's no room on the scale left to express that. If they're all at min temperature and then a new buffer comes along that is just used once and thrown out, there's no room left on the scale for that buffer to emerge as a good candidate for eviction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 16, 2014 at 7:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think that the basic problem here is that usage counts increase when > buffers are referenced, but they decrease when buffers are evicted, > and those two things are not in any necessary way connected to each > other. In particular, if no eviction is happening, reference counts > will converge to the maximum value. I've read a few papers about > algorithms that attempt to segregate the list of buffers into "hot" > and "cold" lists, and an important property of such algorithms is that > they mustn't be allowed to make everything hot. It's easy to be too > simplistic, here: an algorithm that requires that no more than half > the list be hot will fall over badly on a workload where the working > set exceeds the available cache and the really hot portion of the > working set is 60% of the available cache. So you need a more > sophisticated algorithm than that. But that core property that not > all buffers can be hot must somehow be preserved, and our algorithm > doesn't. FWIW in CLOCK-Pro segregating buffers between hot and cold is tied to eviction and the clock sweep, the ratio between hot and cold is dynamically adapted based on prior experience. The main downside is that it seems to require an indirection somewhere either in the clock sweep or buffer lookup. Maybe it's possible to avoid that with some clever engineering if we think hard enough. CLOCK-Pro may also have too little memory of hotness, making it too easy to blow the whole cache away with a burst of activity. It may be useful to have a (possibly tunable) notion of fairness where one query/backend can't take over the cache even though it may be an overall win in terms of total number of I/Os performed. Maybe we need to invent Generalized CLOCK-Pro with a larger number of levels, ranging from cold, hot and scalding to infernal. :) Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On Tue, Apr 15, 2014 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Apr 16, 2014 at 5:00 AM, Peter Geoghegan <pg@heroku.com> wrote: >> On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote: >>> There's a paper on a non blocking GCLOCK algorithm, that does lock >>> free clock sweep and buffer pinning[7]. If we decide to stay with >>> GCLOCK it may be interesting, although I still believe that some >>> variant of buffer nailing[8] is a better idea, my experience shows >>> that most of the locking overhead is cache line bouncing ignoring the >>> extreme cases where our naive spinlock implementation blows up. >> >> You might be right about that, but lets handle one problem at a time. >> Who knows what the bottleneck will end up being if and when we address >> the naivety around frequency? I want to better characterize that >> problem first. > > Just to summarize you about the previous discussion and the > improvements that we decided to do in this area based on feedback > are as follows: > > 1. Bgwriter needs to be improved so that it can help in reducing > usage count and finding next victim buffer (run the clock sweep > and add buffers to the free list). > 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist > are less. > 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer > (a spinlock for the freelist, and an lwlock for the clock sweep). > Here we can try to make it lock free based on atomic ops as > well. > 4. Bgwriter needs to be more aggressive, logic based on which it > calculates how many buffers it needs to process needs to be > improved. > 5. Contention around buffer mapping locks. > 6. Cacheline bouncing around the buffer header spinlocks, is there > anything we can do to reduce this? > 7. Choose Optimistically used buffer in StrategyGetBuffer(). > 8. Don't bump the usage count every time buffer is pinned. What about: 9. Don't wait on locked buffer in the clock sweep. merlin
On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote: > > 1. Bgwriter needs to be improved so that it can help in reducing > > usage count and finding next victim buffer (run the clock sweep > > and add buffers to the free list). > > 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist > > are less. > > 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer > > (a spinlock for the freelist, and an lwlock for the clock sweep). > > Here we can try to make it lock free based on atomic ops as > > well. > > 4. Bgwriter needs to be more aggressive, logic based on which it > > calculates how many buffers it needs to process needs to be > > improved. > > 5. Contention around buffer mapping locks. > > 6. Cacheline bouncing around the buffer header spinlocks, is there > > anything we can do to reduce this? > > 7. Choose Optimistically used buffer in StrategyGetBuffer(). > > 8. Don't bump the usage count every time buffer is pinned. > > What about: 9. Don't wait on locked buffer in the clock sweep. I don't think we do that? Or are you referring to locked buffer headers? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 15, 2014 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think that the basic problem here is that usage counts increase when > buffers are referenced, but they decrease when buffers are evicted, > and those two things are not in any necessary way connected to each > other. In particular, if no eviction is happening, reference counts > will converge to the maximum value. I've read a few papers about > algorithms that attempt to segregate the list of buffers into "hot" > and "cold" lists, and an important property of such algorithms is that > they mustn't be allowed to make everything hot. It's easy to be too > simplistic, here: an algorithm that requires that no more than half > the list be hot will fall over badly on a workload where the working > set exceeds the available cache and the really hot portion of the > working set is 60% of the available cache. So you need a more > sophisticated algorithm than that. But that core property that not > all buffers can be hot must somehow be preserved, and our algorithm > doesn't. A while back you sketched out an idea that did something like that: hotly accessed buffers became 'perma-pinned' such that they no longer participated in the clock sweep for eviction, and there was a side-line process that did a two-stage eviction (IIRC) from the super hot stack in order to mitigate locking. This idea had a couple of nice properties: 1) very hot buffers no longer get refcounted, reducing spinlock contention (which has been documented in real world workloads); 2) the eviction loop shrinks. Although you still have to check the 'very hot' flag, that's an unlocked check (again, IIRC) and no further processing is done. The downside of this approach was complexity and the difficulty of testing for edge cases. I would like to point out though that while i/o efficiency gains are nice, I think contention issues are the bigger fish to fry. On Mon, Apr 14, 2014 at 12:11 PM, Peter Geoghegan <pg@heroku.com> wrote: > 1) Throttles incrementation of usage_count temporally. It becomes > impossible to increment usage_count for any given buffer more > frequently than every 3 seconds, while decrementing usage_count is > totally unaffected. Hm, that's expensive. How about a heuristic based on the number of buffer allocations and the size of the buffer pool? On Wed, Apr 16, 2014 at 8:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote: >> What about: 9. Don't wait on locked buffer in the clock sweep. > > I don't think we do that? Or are you referring to locked buffer headers? Right -- exactly. I posted a patch for this a while back. It's quite trivial: implement a trylock variant of the buffer header lock macro and further guard the check with a non-locking test (which TAS() already does generally, but the idea is to avoid the cache line lock in likely cases of contention). I believe this to be unambiguously better: even if it's self-healing or unlikely, there is no good reason to jump into a spinlock fray or even request a contended cache line while holding a critical lock. merlin
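As a rough illustration of the trylock idea Merlin describes, here is a standalone sketch. The names are invented, a C11 atomic_flag stands in for the real buffer header spinlock, and the separate non-locking pre-test he mentions is omitted because an atomic_flag cannot be read without a test-and-set; only the "try, and skip if busy" part is shown.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct SketchBufDesc
{
    atomic_flag hdr_lock;       /* stands in for the buffer header spinlock */
    int         usage_count;
} SketchBufDesc;

/* Non-waiting acquire: false means "busy, caller should just move on". */
static bool hdr_trylock(SketchBufDesc *buf)
{
    return !atomic_flag_test_and_set_explicit(&buf->hdr_lock,
                                              memory_order_acquire);
}

static void hdr_unlock(SketchBufDesc *buf)
{
    atomic_flag_clear_explicit(&buf->hdr_lock, memory_order_release);
}

/* One sweep step: returns true if this buffer was chosen as the victim. */
static bool sweep_consider(SketchBufDesc *buf)
{
    bool victim = false;

    if (!hdr_trylock(buf))
        return false;           /* contended: skip instead of spinning */

    if (buf->usage_count > 0)
        buf->usage_count--;
    else
        victim = true;
    hdr_unlock(buf);
    return victim;
}

int main(void)
{
    SketchBufDesc bufs[4];
    int init_usage[4] = {2, 0, 0, 5};

    for (int i = 0; i < 4; i++)
    {
        atomic_flag_clear(&bufs[i].hdr_lock);
        bufs[i].usage_count = init_usage[i];
    }

    for (int i = 0; i < 4; i++)
        printf("buffer %d: %s\n", i,
               sweep_consider(&bufs[i]) ? "victim" : "skipped or decremented");
    return 0;
}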
On 2014-04-16 08:25:23 -0500, Merlin Moncure wrote: > The downside of this approach was complexity and the difficulty of > testing for edge cases. I would like to point out though that while i/o > efficiency gains are nice, I think contention issues are the bigger > fish to fry. That's my feeling as well. > > On Wed, Apr 16, 2014 at 8:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote: > >> What about: 9. Don't wait on locked buffer in the clock sweep. > > > > I don't think we do that? Or are you referring to locked buffer headers? > > Right -- exactly. I posted a patch for this a while back. It's quite > trivial: implement a trylock variant of the buffer header lock macro > and further guard the check with a non-locking test (which TAS() > already does generally, but the idea is to avoid the cache line lock > in likely cases of contention). I believe this to be unambiguously > better: even if it's self-healing or unlikely, there is no good reason > to jump into a spinlock fray or even request a contended cache line > while holding a critical lock. IIRC you had problems proving the benefits of that, right? I think that's because the locking times of buffer headers are short enough that it's really unlikely to read a locked buffer header spinlock. The spinlock acquisition will have made the locker the exclusive owner of the spinlock in the majority of cases, and as soon as that happens the cache miss/transfer will take far longer than the lock takes. I think this is the wrong level to optimize things. Imo there's two possible solutions (that don't exclude each other): * perform the clock sweep in one process so there's a very fast way to get to a free buffer. Possibly in a partitioned way. * Don't take a global exclusive lock while performing the clock sweep. Instead increase StrategyControl->nextVictimBuffer in chunks under an exclusive lock, and then scan the potential victim buffers in those chunks without a global lock held. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
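A standalone sketch (invented names and chunk size, not PostgreSQL code) of the second bullet, advancing the victim pointer in chunks: the only shared operation is an atomic fetch-add that claims a block of clock positions, after which the caller examines those buffers without any global lock held. In real code the per-buffer work would still need the buffer header lock; that is elided here.

#include <stdatomic.h>
#include <stdio.h>

#define NBUFFERS    1024
#define SWEEP_CHUNK 16

static atomic_uint next_victim;         /* monotonically increasing hand */
static int usage_count[NBUFFERS];       /* stand-in for the buffer headers */

/* Claim SWEEP_CHUNK positions and return a victim, or -1 if none found. */
static int sweep_chunk(void)
{
    unsigned start = atomic_fetch_add(&next_victim, SWEEP_CHUNK);
    int      victim = -1;

    for (unsigned i = 0; i < SWEEP_CHUNK; i++)
    {
        unsigned buf = (start + i) % NBUFFERS;

        if (usage_count[buf] > 0)
            usage_count[buf]--;        /* would require the header lock */
        else if (victim < 0)
            victim = (int) buf;
    }
    return victim;
}

int main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = i % 6;        /* arbitrary initial popularity */

    for (int i = 0; i < 8; i++)
        printf("chunk %d -> victim %d\n", i, sweep_chunk());
    return 0;
}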
On Wed, Apr 16, 2014 at 9:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I think this is the wrong level to optimize things. Imo there's two > possible solutions (that don't exclude each other): > > * perform the clock sweep in one process so there's a very fast way to > get to a free buffer. Possibly in a partitioned way. > > * Don't take a global exclusive lock while performing the clock > sweep. Instead increase StrategyControl->nextVictimBuffer in chunks > under an exclusive lock, and then scan the potential victim buffers in > those chunks without a global lock held. I definitely agree with both of these ideas. But isn't it sort of off-topic for this thread? There are two issues here: 1. Improving the rate at which we can evict buffers, which is what you're talking about here. 2. Improving the choice of which buffers we evict, which is what Peter's talking about, or at least what I think he's talking about. Those things are both important, but they're different, and I'm not sure that working on one precludes working on the other. There's certainly the potential for overlap, but not necessarily. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-04-16 10:29:29 -0400, Robert Haas wrote: > On Wed, Apr 16, 2014 at 9:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I think this is the wrong level to optimize things. Imo there's two > > possible solutions (that don't exclude each other): > > > > * perform the clock sweep in one process so there's a very fast way to > > get to a free buffer. Possibly in a partitioned way. > > > > * Don't take a global exclusive lock while performing the clock > > sweep. Instead increase StrategyControl->nextVictimBuffer in chunks > > under an exclusive lock, and then scan the potential victim buffers in > > those chunks without a global lock held. > > I definitely agree with both of these ideas. But isn't it sort of > off-topic for this thread? Yes, I agree it's somewhat offtopic - I only started on it (I think) because Merlin commented on it. But I also agree with Merlin's that comment at the moment that the scalability issues (concurrency and size of shared buffers). If you can't use a large enough s_b to contain a significant portion of your workload, you're relying on the OS cache anyway. > 1. Improving the rate at which we can evict buffers, which is what > you're talking about here. > > 2. Improving the choice of which buffers we evict, which is what > Peter's talking about, or at least what I think he's talking about. > > Those things are both important, but they're different, and I'm not > sure that working on one precludes working on the other. There's > certainly the potential for overlap, but not necessarily. I don't think that that they neccessarily preclude each other either. But my gut feeling tells me that it'll be hard to have interesting algorithmic improvements on the buffer eviction choice because any additional complexity around that will have prohibitively high scalability impacts due to the coarse locking. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, Stephen flagged an ENOPARSE: On 2014-04-16 16:49:55 +0200, Andres Freund wrote: > But I also agree with Merlin's that comment at the moment that the > scalability issues (concurrency and size of shared buffers). That should have been: But I also agree with Merlin's comment that at the moment the scalability issues are bigger than the cache eviction choices. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr 16, 2014 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> 1. Improving the rate at which we can evict buffers, which is what >> you're talking about here. >> >> 2. Improving the choice of which buffers we evict, which is what >> Peter's talking about, or at least what I think he's talking about. >> >> Those things are both important, but they're different, and I'm not >> sure that working on one precludes working on the other. There's >> certainly the potential for overlap, but not necessarily. > > I don't think that that they neccessarily preclude each other > either. But my gut feeling tells me that it'll be hard to have > interesting algorithmic improvements on the buffer eviction choice > because any additional complexity around that will have prohibitively > high scalability impacts due to the coarse locking. Doesn't that amount to giving up? I mean, I'm not optimistic about the particular approach Peter's chosen here being practical for the reasons that you and I already articulated. But I don't think that means there *isn't* a viable approach; and I think Peter's test results demonstrate that the additional complexity of a better algorithm can more than pay for itself. That's a pretty important point to keep in mind. Also, I think the scalability problems around buffer eviction are eminently solvable, and in particular I'm hopeful that Amit is going to succeed in solving them. Suppose we have a background process (whether the background writer or some other) that runs the clock sweep, identifies good candidates for eviction, and pushes them on a set of, say, 16 free-lists protected by spinlocks. (The optimal number of free-lists probably depends on the size of shared_buffers.) Backends try to reclaim by popping buffers off of one of these free-lists and double-checking whether the page is still a good candidate for eviction (i.e. it's still clean and unpinned). If the free list is running low, they kick the background process via a latch to make sure it's awake and working to free up more stuff, and if necessary, advance the clock sweep themselves. This can even be done by multiple processes at once, if we adopt your idea of advancing the clock sweep hand by N buffers at a time and then scanning them afterwards without any global lock. In such a world, it's still not permissible for reclaim calculations to be super-complex, but you hope that most of the activity is happening in the background process, so cycle-shaving becomes less critical. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
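Here is a minimal standalone sketch of the partitioned free-list scheme Robert outlines. Nothing in it is PostgreSQL code: the list count, capacity, and watermark are invented, pthread mutexes stand in for spinlocks, and a plain flag stands in for setting the background process's latch. The double-check that a popped buffer is still clean and unpinned is left to the caller, as in Robert's description.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_FREELISTS 16
#define LIST_CAPACITY 64
#define LOW_WATERMARK 8

typedef struct FreeList
{
    pthread_mutex_t lock;               /* stands in for a spinlock */
    int             bufs[LIST_CAPACITY];
    int             nbufs;
} FreeList;

static FreeList freelists[NUM_FREELISTS];
static bool     bgwriter_wakeup;        /* stands in for SetLatch() */

/* Background side: push a victim candidate onto one of the lists. */
static void bg_push(int buf)
{
    FreeList *fl = &freelists[buf % NUM_FREELISTS];

    pthread_mutex_lock(&fl->lock);
    if (fl->nbufs < LIST_CAPACITY)
        fl->bufs[fl->nbufs++] = buf;
    pthread_mutex_unlock(&fl->lock);
}

/* Backend side: pop a candidate; caller must recheck it is still evictable. */
static int backend_pop(int my_partition)
{
    FreeList *fl = &freelists[my_partition % NUM_FREELISTS];
    int       buf = -1;

    pthread_mutex_lock(&fl->lock);
    if (fl->nbufs > 0)
        buf = fl->bufs[--fl->nbufs];
    if (fl->nbufs < LOW_WATERMARK)
        bgwriter_wakeup = true;         /* would be SetLatch(bgwriter) */
    pthread_mutex_unlock(&fl->lock);
    return buf;
}

int main(void)
{
    for (int i = 0; i < NUM_FREELISTS; i++)
        pthread_mutex_init(&freelists[i].lock, NULL);

    for (int buf = 0; buf < 100; buf++)
        bg_push(buf);
    for (int i = 0; i < 10; i++)
        printf("backend got buffer %d\n", backend_pop(i));
    printf("bgwriter wakeup requested: %s\n", bgwriter_wakeup ? "yes" : "no");
    return 0;
}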
On 2014-04-16 11:28:04 -0400, Robert Haas wrote: > On Wed, Apr 16, 2014 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> 1. Improving the rate at which we can evict buffers, which is what > >> you're talking about here. > >> > >> 2. Improving the choice of which buffers we evict, which is what > >> Peter's talking about, or at least what I think he's talking about. > >> > >> Those things are both important, but they're different, and I'm not > >> sure that working on one precludes working on the other. There's > >> certainly the potential for overlap, but not necessarily. > > > > I don't think that that they neccessarily preclude each other > > either. But my gut feeling tells me that it'll be hard to have > > interesting algorithmic improvements on the buffer eviction choice > > because any additional complexity around that will have prohibitively > > high scalability impacts due to the coarse locking. > > Doesn't that amount to giving up? I mean, I'm not optimistic about > the particular approach Peter's chosen here being practical for the > reasons that you and I already articulated. But I don't think that > means there *isn't* a viable approach; and I think Peter's test > results demonstrate that the additional complexity of a better > algorithm can more than pay for itself. That's a pretty important > point to keep in mind. Well, I think it could be a very good idea to invest more resources (cpu, bus, memory) in buffer management - but doing so right *now* where it's all done under one monolithic lock will have noticeable consequences for many workloads. Spending more cycles per buffer won't be very noticeable if it's not done under a gigantic lock - right now it will be. > [ reasonable proposal ]. In such a world, it's still not > permissible for reclaim calculations to be super-complex, but you hope > that most of the activity is happening in the background process, so > cycle-shaving becomes less critical. Yes, agreed. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr 16, 2014 at 4:22 AM, Peter Geoghegan <pg@heroku.com> wrote: > > I don't want to dismiss what you're saying about heating and cooling > being unrelated, but I don't find the conclusion that not everything > can be hot obvious. Maybe "heat" should be relative rather than > absolute, and maybe that's actually what you meant. There is surely > some workload where buffer access actually is perfectly uniform, and > what do you do there? What "temperature" are those buffers? In that case, hotness, or retention priority, should be relative to re-population cost. IE: whether it's likely to still be in the OS cache or not, whether it's dirty or not, etc. > It occurs to me that within the prototype patch, even though > usage_count is incremented in a vastly slower fashion (in a wall time > sense), clock sweep doesn't take advantage of that. I should probably > investigate having clock sweep become more aggressive in decrementing > in response to realizing that it won't get some buffer's usage_count > down to zero on the next revolution either. There are certainly > problems with that, but they might be fixable. Within the patch, in > order for it to be possible for the usage_count to be incremented in > the interim, an average of 1.5 seconds must pass, so if clock sweep > were to anticipate another no-set-to-zero revolution, it seems pretty > likely that it would be exactly right, or if not then close enough, > since it can only really fail to correct for some buffers getting > incremented once more in the interim. Conceptually, it would be like > multiple logical revolutions were merged into one actual one, > sufficient to have the next revolution find a victim buffer. Why use time at all? Why not synchronize usage bumpability to clock sweeps? I'd use a simple bit that the clock sweep clears, and the users set. Only one increase per sweep. Or maybe use a decreasing loop count instead of a bit. In any case, measuring "time" in terms of clock sweeps sounds like a better proposition.
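A standalone sketch (invented names, not a proposal for the real data structures) of the "one bump per sweep" idea suggested above: each buffer carries a flag that a pin sets and the clock hand clears, so usage_count can rise by at most one between two passes of the hand, and "time" is effectively measured in sweep revolutions rather than in seconds.

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS  8
#define MAX_USAGE 5

typedef struct SketchBuf
{
    int  usage_count;
    bool bumped_this_sweep;
} SketchBuf;

static SketchBuf bufs[NBUFFERS];

/* Called on buffer pin: at most one bump between two passes of the hand. */
static void pin_buffer(int i)
{
    SketchBuf *b = &bufs[i];

    if (!b->bumped_this_sweep && b->usage_count < MAX_USAGE)
    {
        b->usage_count++;
        b->bumped_this_sweep = true;
    }
}

/* Called as the clock hand passes the buffer. */
static void sweep_buffer(int i)
{
    SketchBuf *b = &bufs[i];

    if (b->usage_count > 0)
        b->usage_count--;
    b->bumped_this_sweep = false;       /* re-arm for the next revolution */
}

int main(void)
{
    /* Pin buffer 0 a thousand times: it still only gains one count. */
    for (int n = 0; n < 1000; n++)
        pin_buffer(0);
    printf("buffer 0 after burst: usage_count = %d\n", bufs[0].usage_count);

    sweep_buffer(0);                    /* hand passes, flag is cleared */
    pin_buffer(0);
    printf("buffer 0 after sweep+pin: usage_count = %d\n", bufs[0].usage_count);
    return 0;
}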
On Wed, Apr 16, 2014 at 4:01 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> Aren't you interested in the significance of the patch, and the test case? > > Not particularly in the specifics to be honest. The tradeoffs of the > techniques you used in there seem prohibitive to me. It's easy to make > individual cases faster by sacrificing others. You're the one poring over the specifics of what I've done, to my consternation. I am not prepared to defend the patch at that level, as I've made abundantly clear. I've called it a sketch, a proof of concept half a dozen times already. I don't understand your difficulty with that. I also don't understand how you can be so dismissive of the benchmark, given the numbers involved. You're being unreasonable. If I didn't write this patch, and I talked to people about this issue at pgCon, I'm not sure that anyone would be convinced that it was a problem, or at least that it was this much of a problem. -- Peter Geoghegan
On Wed, Apr 16, 2014 at 1:42 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Apr 16, 2014 at 4:01 AM, Andres Freund <andres@2ndquadrant.com> wrote: >>> Aren't you interested in the significance of the patch, and the test case? >> >> Not particularly in the specifics to be honest. The tradeoffs of the >> techniques you used in there seem prohibitive to me. It's easy to make >> individual cases faster by sacrificing others. > > You're the one poring over the specifics of what I've done, to my > consternation. I am not prepared to defend the patch at that level, as > I've made abundantly clear. I've called it a sketch, a proof of > concept half a dozen times already. I don't understand your difficulty > with that. I also don't understand how you can be so dismissive of the > benchmark, given the numbers involved. You're being unreasonable. I don't think he's being unreasonable, and I don't understand why you're getting bent out of shape about it. You proposed a patch, he articulated a problem, you don't want to fix it right now. All of which is fine. Why the ad hominem accusations? > If I didn't write this patch, and I talked to people about this issue > at pgCon, I'm not sure that anyone would be convinced that it was a > problem, or at least that it was this much of a problem. I agree with that, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 16, 2014 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't think he's being unreasonable, and I don't understand why > you're getting bent out of shape about it. You proposed a patch, he > articulated a problem, you don't want to fix it right now. All of > which is fine. Why the ad hominem accusations? I just think it's bad form to hold something like this to the same standards as a formal commitfest submission. I am well aware that the patch probably has several scalability issues. -- Peter Geoghegan
On Wed, Apr 16, 2014 at 1:07 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Apr 16, 2014 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't think he's being unreasonable, and I don't understand why >> you're getting bent out of shape about it. You proposed a patch, he >> articulated a problem, you don't want to fix it right now. All of >> which is fine. Why the ad hominem accusations? > > I just think it's bad form to hold something like this to the same > standards as a formal commitfest submission. I am well aware that the > patch probably has several scalability issues. In fairness to Andres, while *you* may know that issuing an expensive syscall in a tight loop is on the list of Forbidden Things, a lot of people don't and it's pretty reasonable to issue methodology objections in order to get them documented. Anyways, I'm still curious if you can post similar numbers basing the throttling on gross allocation counts instead of time. Meaning: some number of buffer allocations has to have occurred before you consider eviction. Besides being faster I think it's a better implementation: an intermittently loaded server will give more consistent behavior. merlin
Merlin Moncure <mmoncure@gmail.com> writes: > Anyways, I'm still curious if you can post similar numbers basing the > throttling on gross allocation counts instead of time. Meaning: some > number of buffer allocations has to have occurred before you consider > eviction. Besides being faster I think it's a better implementation: > an intermittently loaded server will give more consistent behavior. Yeah --- I think wall-clock-based throttling is fundamentally the wrong thing anyway. Are we going to start needing a CPU speed measurement to tune the algorithm with? Not the place to be going. But driving it off the number of allocations that've been done could be sensible. (OTOH, that means you need a central counter, which itself would be a bottleneck.) regards, tom lane
On Wed, Apr 16, 2014 at 2:07 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Apr 16, 2014 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't think he's being unreasonable, and I don't understand why >> you're getting bent out of shape about it. You proposed a patch, he >> articulated a problem, you don't want to fix it right now. All of >> which is fine. Why the ad hominem accusations? > > I just think it's bad form to hold something like this to the same > standards as a formal commitfest submission. I am well aware that the > patch probably has several scalability issues. I don't agree. I think it's perfectly appropriate to raise potential issues at the earliest possible time. People have regularly been heard to complain in this forum that those objecting to a patch did not object soon enough for them to make changes. That's a hard problem to fix because we can't force people whose salaries we're not paying to pay attention to patches over whatever else they may have to do, but we shouldn't label it as a bad thing when people choose to get involved and provide feedback early. Early feedback is exactly what we want to encourage here. And regardless of any of that, I think "person X is being unreasonable" is a personal attack that has exactly zero place on this mailing list. We are here to talk about technology, not anyone's character. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 16, 2014 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> Anyways, I'm still curious if you can post similar numbers basing the >> throttling on gross allocation counts instead of time. Meaning: some >> number of buffer allocations has to have occurred before you consider >> eviction. Besides being faster I think it's a better implementation: >> an intermittently loaded server will give more consistent behavior. > > Yeah --- I think wall-clock-based throttling is fundamentally the wrong > thing anyway. Are we going to start needing a CPU speed measurement to > tune the algorithm with? Not the place to be going. But driving it off > the number of allocations that've been done could be sensible. (OTOH, > that means you need a central counter, which itself would be a > bottleneck.) sure -- note we already track that in BufferStrategyControl (everything in buffer allocation is already centrally managed essentially). /* * Statistics. These counters should be wide enough that they can't * overflow during a single bgwritercycle. */ uint32 completePasses; /* Complete cycles of the clock sweep */ uint32 numBufferAllocs; /* Buffers allocated since last reset */ merlin
On Wed, Apr 16, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't agree. I think it's perfectly appropriate to raise potential > issues at the earliest possible time. If I didn't *strongly* emphasize my intent in writing the patch up front, I'd certainly agree. I just don't see why what I've done cannot be accepted in the spirit in which it was intended. > And regardless of any of that, I think "person X is being > unreasonable" is a personal attack that has exactly zero place on this > mailing list. We are here to talk about technology, not anyone's > character. Telling someone they're being unreasonable is not a personal attack. -- Peter Geoghegan
Merlin Moncure <mmoncure@gmail.com> writes: > On Wed, Apr 16, 2014 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah --- I think wall-clock-based throttling is fundamentally the wrong >> thing anyway. Are we going to start needing a CPU speed measurement to >> tune the algorithm with? Not the place to be going. But driving it off >> the number of allocations that've been done could be sensible. (OTOH, >> that means you need a central counter, which itself would be a >> bottleneck.) > sure -- note we already track that in BufferStrategyControl > (everything in buffer allocation is already centrally managed > essentially). Indeed, but I'd think getting rid of that property would be one of the top priorities for any attempt to do anything at all in this area of the code. regards, tom lane
On Wed, Apr 16, 2014 at 7:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: > 2. Improving the choice of which buffers we evict, which is what > Peter's talking about, or at least what I think he's talking about. > > Those things are both important, but they're different, and I'm not > sure that working on one precludes working on the other. There's > certainly the potential for overlap, but not necessarily. That's certainly my primary area of interest here, but there may be some overlap with other areas, or areas where cooperation turns out to be appropriate. At the risk of stating the very obvious, if the caching algorithm is making poor decisions about caching, and leaf pages are continually swapped in and out of shared_buffers for no good reason, that is likely to constrain the scalability of the buffer manager. That could be very significant. My immediate concern here is getting recognition of the importance of weighing frequency of access in *some* way. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > My immediate concern here is getting recognition of the importance of > weighing frequency of access in *some* way. That's a completely content-free statement; certainly the existing clock-sweep code is sensitive to frequency of access, as would be any other algorithm we'd be likely to adopt. It may well be that we can do better than what we've got, but sweeping generalities are unlikely to help us much. regards, tom lane
On Wed, Apr 16, 2014 at 6:25 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Apr 15, 2014 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> Just to summarize you about the previous discussion and the >> improvements that we decided to do in this area based on feedback >> are as follows: >> >> 1. Bgwriter needs to be improved so that it can help in reducing >> usage count and finding next victim buffer (run the clock sweep >> and add buffers to the free list). >> 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist >> are less. >> 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer >> (a spinlock for the freelist, and an lwlock for the clock sweep). >> Here we can try to make it lock free based on atomic ops as >> well. >> 4. Bgwriter needs to be more aggressive, logic based on which it >> calculates how many buffers it needs to process needs to be >> improved. >> 5. Contention around buffer mapping locks. >> 6. Cacheline bouncing around the buffer header spinlocks, is there >> anything we can do to reduce this? >> 7. Choose Optimistically used buffer in StrategyGetBuffer(). >> 8. Don't bump the usage count every time buffer is pinned. > > What about: 9. Don't wait on locked buffer in the clock sweep. Right, I remember you have written patch for it, I think we should consider it along with point-6. In general, I think unless we first improve the situation for BufFreelistLock and eviction strategy, it might not show benefit. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > This isn't a fundamental property of the usage-count idea; it's an > artifact of the fact that usage count decreases are tied to eviction > pressure rather than access pressure. For example, suppose we made a > rule that if the total usage counts of all buffers exceed 3 * > NBuffers, then every time you bump the usage count of a buffer from N > to N+1, you're required to advance the clock sweep far enough to > decrease the reference count of a buffer by one. This sounds like the right way to reason about it. From what I remember in school the idea with the clock sweep is to set the usage flags to the maximum whenever the buffer is used and decrement (actually iirc typically shift right) it when the clock sweep goes by. Ie, simulate an LRU where when the buffer is accessed it jumps to the head of the list and when the clock comes by it moves gradually down the list. What you're pointing out is that the clock might not come by very often, resulting in everything being at the head of the list. In that case I'm not clear it really matters what gets evicted though. And the cpu effort of running the clock n times sounds bad but doing the work earlier doesn't really change the amount of work being done, it just amortizes it over more calls. But if you want to do that it seems to me the way to do it is every time a buffer is pinned set it to the maximum and then run the clock max_value - previous_value times. So the total usage counts of all buffers remains constant. If that results in contention one way to reduce it is to do this probabilistically. Run the clock 1% of the time but run it 100x as much as you would normally. But I think you've misidentified the problem and what those other algorithms are trying to solve. The problem is not that Postgres will pick a bad buffer to evict. If all the buffers have been used since the last time the clock came around then they're all "hot" anyways and it doesn't really matter which one we evict. The problem is that we expend an inordinate amount of work finding the few non-hot buffers. When you have a really large amount of memory and 99.9% of it is hot but 0.1% is whatever random non-hot page was needed last then there's an obvious buffer to evict when you need a new one. But we spend a lot of work decrementing every hot buffer's usage count 4 times only to have them immediately incremented again just to find the 1 buffer where the usage count was 4 or 3. The goal of these algorithms that divide the buffers into groups is to avoid having to do so much work to find the colder buffers. Once the hot buffers migrate to the hot pool we only need to run the clock there when we find we have new hot pages that we want to promote. All the thrashing in the cold pool can be more efficient because there's many fewer pages to consider. -- greg
On Thu, Apr 17, 2014 at 9:40 AM, Greg Stark <stark@mit.edu> wrote: > On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> This isn't a fundamental property of the usage-count idea; it's an >> artifact of the fact that usage count decreases are tied to eviction >> pressure rather than access pressure. For example, suppose we made a >> rule that if the total usage counts of all buffers exceed 3 * >> NBuffers, then every time you bump the usage count of a buffer from N >> to N+1, you're required to advance the clock sweep far enough to >> decrease the reference count of a buffer by one. > > This sounds like the right way to reason about it. > > From what I remember in school the idea with the clock sweep is to set > the usage flags to the maximum whenever the buffer is used and > decrement (actually iirc typically shift right) it when the clock > sweep goes by. Ie, simulate a LRU where when the buffer is accessed it > jumps to the head of the list and when the clock comes by it moves > gradually down the list. > > What you're pointing out is that the clock might not come by very > often resulting everything being at the head of the list. In that case > I'm not clear it really matters what gets evicted though. And the cpu > effort of running the clock n times sounds bad but doing the work > earlier doesn't really change the amount of work being done, it just > amortizes it over more calls. > > But if you want to do that it seems to me the way to do it is every > time a buffer is pinned set to the maximum and then run the clock > max_value - previous_value. So the total usage counts of all buffers > remains constant. If that results in contention one way to reduce it > is to do this probabilistically. Run the clock 1% of the time but run > it 100x as much as you would normally. > > But I think you've misidentified the problem and what those other > algorithms are trying to solve. The problem is not that Postgres will > pick a bad buffer to evict. If all the buffers have been since the > last time the clock came around then they're all "hot" anyways and it > doesn't really matter which one we evict. The problem is that we > expend an inordinate amount of work finding the few non-hot buffers. > When you have a really large amount of memory and 99.9% of it is hot > but 0.1% is whatever random non-hot page was needed last then there's > an obvious buffer to evict when you need a new one. But we spend a lot > of work decrementing every hot buffer's usage count 4 times only to > have them immediately incremented again just to find the 1 buffer > where the usage count was 4 or 3. The goal of these algorithms that > divide the buffers into groups is to avoid having to do so much work > to find the colder buffers. Once the hot buffers migrate to the hot > pool we only need to run the clock there when we find we have new hot > pages that we want to promote. All the thrashing in the cold pool can > be more efficient because there's many fewer pages to consider. Well, I think Peter has proved that PostgreSQL *will* pick a bad buffer to evict. The proof is that when he changed the choice of buffer to evict, he got a significant performance improvement. I also believe this to be the case on first principles and my own experiments. Suppose you have a workload that fits inside shared_buffers. All of the usage counts will converge to 5. Then, somebody accesses a table that is not cached, so something's got to be evicted. 
Because all the usage counts are the same, the eviction at this point is completely indiscriminate. We're just as likely to kick out a btree root page or a visibility map page as we are to kick out a random heap page, even though the former have probably been accessed several orders of magnitude more often. That's clearly bad. On systems that are not too heavily loaded it doesn't matter too much because we just fault the page right back in from the OS pagecache. But I've done pgbench runs where such decisions lead to long stalls, because the page has to be brought back in from disk, and there's a long I/O queue; or maybe just because the kernel thinks PostgreSQL is issuing too many I/O requests and makes some of them wait to cool things down. Of course, the overhead of repeated clock sweeps to push down the usage counts isn't a great thing either. I'm not saying that isn't a problem. But I think bad decisions about what to evict are also a problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote: > I also believe this to be the case on first principles and my own > experiments. Suppose you have a workload that fits inside > shared_buffers. All of the usage counts will converge to 5. Then, > somebody accesses a table that is not cached, so something's got to be > evicted. Because all the usage counts are the same, the eviction at > this point is completely indiscriminate. We're just as likely to kick > out a btree root page or a visibility map page as we are to kick out a > random heap page, even though the former have probably been accessed > several orders of magnitude more often. That's clearly bad. On > systems that are not too heavily loaded it doesn't matter too much > because we just fault the page right back in from the OS pagecache. > But I've done pgbench runs where such decisions lead to long stalls, > because the page has to be brought back in from disk, and there's a > long I/O queue; or maybe just because the kernel thinks PostgreSQL is > issuing too many I/O requests and makes some of them wait to cool > things down. I understand now. If there is no memory pressure, every buffer gets the max usage count, and when a new buffer comes in, it isn't the max so it is swiftly removed until the clock sweep has time to decrement the old buffers. Decaying buffers when there is no memory pressure creates additional overhead and gets into timing issues of when to decay. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote: >> I also believe this to be the case on first principles and my own >> experiments. Suppose you have a workload that fits inside >> shared_buffers. All of the usage counts will converge to 5. Then, >> somebody accesses a table that is not cached, so something's got to be >> evicted. Because all the usage counts are the same, the eviction at >> this point is completely indiscriminate. We're just as likely to kick >> out a btree root page or a visibility map page as we are to kick out a >> random heap page, even though the former have probably been accessed >> several orders of magnitude more often. That's clearly bad. On >> systems that are not too heavily loaded it doesn't matter too much >> because we just fault the page right back in from the OS pagecache. >> But I've done pgbench runs where such decisions lead to long stalls, >> because the page has to be brought back in from disk, and there's a >> long I/O queue; or maybe just because the kernel thinks PostgreSQL is >> issuing too many I/O requests and makes some of them wait to cool >> things down. > > I understand now. If there is no memory pressure, every buffer gets the > max usage count, and when a new buffer comes in, it isn't the max so it > is swiftly removed until the clock sweep has time to decrement the old > buffers. Decaying buffers when there is no memory pressure creates > additional overhead and gets into timing issues of when to decay. That can happen, but the real problem I was trying to get at is that when all the buffers get up to max usage count, they all appear equally important. But in reality they're not. So when we do start evicting those long-resident buffers, it's essentially random which one we kick out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote: > On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote: > >> I also believe this to be the case on first principles and my own > >> experiments. Suppose you have a workload that fits inside > >> shared_buffers. All of the usage counts will converge to 5. Then, > >> somebody accesses a table that is not cached, so something's got to be > >> evicted. Because all the usage counts are the same, the eviction at > >> this point is completely indiscriminate. We're just as likely to kick > >> out a btree root page or a visibility map page as we are to kick out a > >> random heap page, even though the former have probably been accessed > >> several orders of magnitude more often. That's clearly bad. On > >> systems that are not too heavily loaded it doesn't matter too much > >> because we just fault the page right back in from the OS pagecache. > >> But I've done pgbench runs where such decisions lead to long stalls, > >> because the page has to be brought back in from disk, and there's a > >> long I/O queue; or maybe just because the kernel thinks PostgreSQL is > >> issuing too many I/O requests and makes some of them wait to cool > >> things down. > > > > I understand now. If there is no memory pressure, every buffer gets the > > max usage count, and when a new buffer comes in, it isn't the max so it > > is swiftly removed until the clock sweep has time to decrement the old > > buffers. Decaying buffers when there is no memory pressure creates > > additional overhead and gets into timing issues of when to decay. > > That can happen, but the real problem I was trying to get at is that > when all the buffers get up to max usage count, they all appear > equally important. But in reality they're not. So when we do start > evicting those long-resident buffers, it's essentially random which > one we kick out. True. Ideally we would have some way to know that _all_ the buffers had reached the maximum and kick off a sweep to decrement them all. I am unclear how we would do that. One odd idea would be to have a global counter that is incremented everytime a buffer goes from 4 to 5 (max) --- when the counter equals 50% of all buffers, do a clock sweep. Of course, then the counter becomes a bottleneck. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, Apr 17, 2014 at 10:48 AM, Bruce Momjian <bruce@momjian.us> wrote: >> > I understand now. If there is no memory pressure, every buffer gets the >> > max usage count, and when a new buffer comes in, it isn't the max so it >> > is swiftly removed until the clock sweep has time to decrement the old >> > buffers. Decaying buffers when there is no memory pressure creates >> > additional overhead and gets into timing issues of when to decay. >> >> That can happen, but the real problem I was trying to get at is that >> when all the buffers get up to max usage count, they all appear >> equally important. But in reality they're not. So when we do start >> evicting those long-resident buffers, it's essentially random which >> one we kick out. > > True. Ideally we would have some way to know that _all_ the buffers had > reached the maximum and kick off a sweep to decrement them all. I am > unclear how we would do that. One odd idea would be to have a global > counter that is incremented everytime a buffer goes from 4 to 5 (max) > --- when the counter equals 50% of all buffers, do a clock sweep. Of > course, then the counter becomes a bottleneck. Yeah, I think that's the right general line of thinking. But it doesn't have to be as coarse-grained as "do a whole clock sweep". It can be, you know, for every buffer that gets incremented from 4 to 5, run the clock sweep far enough to decrement the usage count of some other buffer by one. That's similar to your idea but you can do it a bit at a time rather than having to make a complete pass over shared_buffers all at once. Your other point, that the counter can become the bottleneck, is quite right also and a major problem in this area. I don't know how to solve it right at the moment, but I'm hopeful that there may be a way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
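A standalone sketch (invented names and sizes, not the actual buffer manager) of the rule Robert describes: once the sum of all usage counts is high enough, every successful bump advances the hand far enough to take one count away from some other buffer, so heavily pinned buffers can still distinguish themselves even after everything has drifted toward the maximum.

#include <stdio.h>

#define NBUFFERS  8
#define MAX_USAGE 5

static int usage[NBUFFERS];
static int total_usage;
static int hand;

/* Advance the clock hand until one usage count has been taken away. */
static void decay_one(int skip)
{
    for (int scanned = 0; scanned < NBUFFERS; scanned++)
    {
        int i = hand;

        hand = (hand + 1) % NBUFFERS;
        if (i != skip && usage[i] > 0)
        {
            usage[i]--;
            total_usage--;
            return;
        }
    }
}

/* Bump on pin, but pay for it elsewhere once overall pressure is high. */
static void bump(int i)
{
    if (usage[i] >= MAX_USAGE)
        return;
    usage[i]++;
    total_usage++;
    if (total_usage > 3 * NBUFFERS)
        decay_one(i);
}

int main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
    {
        usage[i] = 4;                   /* everything looks equally hot */
        total_usage += 4;
    }

    /* Keep pinning buffers 0 and 1; the untouched buffers start to cool. */
    for (int n = 0; n < 20; n++)
    {
        bump(0);
        bump(1);
    }

    for (int i = 0; i < NBUFFERS; i++)
        printf("buffer %d: usage_count %d\n", i, usage[i]);
    return 0;
}

The central total_usage counter here is exactly the bottleneck Robert mentions; the sketch only illustrates the decay rule, not a scalable way of maintaining it.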
On 2014-04-17 10:48:15 -0400, Bruce Momjian wrote: > On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote: > > That can happen, but the real problem I was trying to get at is that > > when all the buffers get up to max usage count, they all appear > > equally important. But in reality they're not. So when we do start > > evicting those long-resident buffers, it's essentially random which > > one we kick out. > > True. Ideally we would have some way to know that _all_ the buffers had > reached the maximum and kick off a sweep to decrement them all. I am > unclear how we would do that. One odd idea would be to have a global > counter that is incremented everytime a buffer goes from 4 to 5 (max) > --- when the counter equals 50% of all buffers, do a clock sweep. Of > course, then the counter becomes a bottleneck. I have my doubts that we'll make the current scheme, where buffer reclaim essentially is O(NBuffers), work much better. Especially as CPU cache effects make such large, high-frequency accesses really expensive. I think we need more drastic changes. I am *not* suggesting that we do that, but I believe it'd be possible to implement a full LRU and be faster than today in scenarios with nontrivial amounts of shared buffers. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Because all the usage counts are the same, the eviction at > this point is completely indiscriminate. We're just as likely to kick > out a btree root page or a visibility map page as we are to kick out a > random heap page, even though the former have probably been accessed > several orders of magnitude more often. That's clearly bad. That's not clear at all. In that circumstance regardless of what page you evict you're incurring precisely one page fault i/o when the page is read back in. Incurring that i/o is bad but it's unavoidable and it's the same badness regardless of what page it's for. The only way to prefer one page over another is if one page won't be needed for long enough for the page to be useful for caching this new buffer (or mixture of buffers) for multiple accesses. If you can't do that then it doesn't matter which buffer you use since it'll just be evicted to read back in the original page again. -- greg
On Tue, Apr 15, 2014 at 7:30 PM, Peter Geoghegan <pg@heroku.com> wrote: > Frankly, there doesn't need to be any research on this, because it's > just common sense that probabilistically, leaf pages are much more > useful than heap pages in servicing index scan queries if we assume a > uniform distribution. If we don't assume that, then they're still more > useful on average. I don't think "common sense" is compelling. I think you need to pin down exactly what it is about btree intermediate pages that the LRU isn't capturing and not just argue they're more useful. The LRU is already capturing which pages are more heavily used than others so you need to identify what it is that makes index pages *even more* useful than their frequency and recency of access indicates. Not just that they're more useful than an average page. So what I think is missing is that indexes are always accessed from the root down to the leaf. So the most recent page accessed will always be the leaf. And in whatever chain of pages was used to reach the last leaf page the least recently accessed will always be the root. But we'll need the root page again on the subsequent descent even if it's to reach the same leaf page we kept in ram in preference to it. Now it doesn't *always* make sense to keep an intermediate page over leaf pages. Imagine an index that we always do full traversals of. We'll always descend from the root down the left-most pages and then follow the right pointers across. All the other intermediate pages will be cold. If we do an occasional descent probing for other keys those leaf pages shouldn't be cached since they won't be needed again for the common full index traversals and the next occasional probe will probably be looking for different keys. But if we're often probing for the same keys the last thing we want to do is throw away one of the intermediate pages for those keys when we could throw away a leaf page. But that's what would happen in a strict LRU. It's almost like what we would really want to do is mark the pages as least recently used in the opposite order from the order they're actually accessed when descending. Or perhaps bump the usage count to max+1 when it's an intermediate page so that it takes one extra cycle of decrementing before it's considered old compared to a leaf page. -- greg
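A standalone sketch (invented names, not a patch proposal) of Greg's closing suggestion: give B-Tree root and inner pages one extra notch of usage-count headroom, so the clock hand has to pass them one more time than an equally popular leaf or heap page before they become eviction candidates.

#include <stdbool.h>
#include <stdio.h>

#define MAX_USAGE 5

typedef struct SketchBuf
{
    int  usage_count;
    bool is_index_inner;        /* root or internal B-Tree page */
} SketchBuf;

static int usage_ceiling(const SketchBuf *b)
{
    return b->is_index_inner ? MAX_USAGE + 1 : MAX_USAGE;
}

static void bump(SketchBuf *b)
{
    if (b->usage_count < usage_ceiling(b))
        b->usage_count++;
}

/* Returns true if the buffer may be evicted after this pass of the hand. */
static bool sweep(SketchBuf *b)
{
    if (b->usage_count > 0)
    {
        b->usage_count--;
        return false;
    }
    return true;
}

int main(void)
{
    SketchBuf leaf  = {0, false};
    SketchBuf inner = {0, true};

    for (int i = 0; i < 10; i++)
    {
        bump(&leaf);
        bump(&inner);
    }

    int passes_leaf = 0, passes_inner = 0;
    while (!sweep(&leaf))  passes_leaf++;
    while (!sweep(&inner)) passes_inner++;
    printf("hand passes before eviction: leaf %d, inner %d\n",
           passes_leaf, passes_inner);
    return 0;
}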
* Robert Haas (robertmhaas@gmail.com) wrote: > several orders of magnitude more often. That's clearly bad. On > systems that are not too heavily loaded it doesn't matter too much > because we just fault the page right back in from the OS pagecache. Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... > But I've done pgbench runs where such decisions lead to long stalls, > because the page has to be brought back in from disk, and there's a > long I/O queue; or maybe just because the kernel thinks PostgreSQL is > issuing too many I/O requests and makes some of them wait to cool > things down. Exactly this. > Of course, the overhead of repeated clock sweeps to push down the > usage counts isn't a great thing either. I'm not saying that isn't a > problem. But I think bad decisions about what to evict are also a > problem. Using a bit more CPU here and there, particularly if it's done in a background worker, or ideally multiple background workers (for each buffer pool) would be much better than evicting a hot page that isn't in the kernel's buffer either 'cause we've held on to it long enough that the kernel thinks it's cold. Thanks, Stephen
On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote: > Ehhh. No. If it's a hot page that we've been holding in *our* cache > long enough, the kernel will happily evict it as 'cold' from *its* > cache, leading to... This is a whole nother problem. It is worrisome that we could be benchmarking the page replacement algorithm in Postgres and choose one that performs well only because it tends to evict pages that are in the OS cache. And then one day (hopefully not too far off) we'll fix the double buffering problem and end up with a strange choice of page replacement algorithm. It also means that every benchmark is super sensitive to how large a fraction of system memory Postgres is managing. A benchmark of a page replacement algorithm with 3GB shared_buffers might perform well compared to others on a system with 8GB or 32GB of total RAM, but actually be choosing pages very poorly in absolute terms and perform terribly on a system with 4GB of total RAM. Ideally what I would like to see is instrumentation of Postgres's buffer pinning, so we can generate various test loads and then just run the different algorithms on them and measure precisely how many page evictions each one causes, how often it chooses pages that need to be read in again soon after, and so on. We shouldn't have to run Postgres to get these counts at all, just run the algorithm as we read through a text file (or database table) listing the pages being accessed. -- greg
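To illustrate the kind of offline replay being suggested, here is a rough, self-contained sketch; the trace format (one page number per line on stdin), the fixed pool size and all names are assumptions. It replays an access trace through a CLOCK-with-usage_count cache and reports hit and miss counts, which is the sort of number different algorithms could then be compared on.

#include <stdio.h>
#include <stdlib.h>

#define NBUFFERS  1024      /* simulated pool size, in pages */
#define MAX_USAGE 5

typedef struct
{
    long page;              /* page number held by this slot, -1 if empty */
    int  usage;             /* clock sweep "popularity" */
} Slot;

int
main(void)
{
    Slot *buf = calloc(NBUFFERS, sizeof(Slot));
    long  page, hits = 0, misses = 0;
    int   hand = 0, i;

    for (i = 0; i < NBUFFERS; i++)
        buf[i].page = -1;

    /* trace format assumed here: one page number per line on stdin */
    while (scanf("%ld", &page) == 1)
    {
        int found = -1;

        for (i = 0; i < NBUFFERS; i++)      /* linear scan; a real tool would hash */
        {
            if (buf[i].page == page)
            {
                found = i;
                break;
            }
        }

        if (found >= 0)
        {
            hits++;
            if (buf[found].usage < MAX_USAGE)
                buf[found].usage++;
            continue;
        }

        misses++;
        /* clock sweep: decrement usage counts until a victim is found */
        while (buf[hand].page != -1 && buf[hand].usage > 0)
        {
            buf[hand].usage--;
            hand = (hand + 1) % NBUFFERS;
        }
        buf[hand].page = page;
        buf[hand].usage = 1;
        hand = (hand + 1) % NBUFFERS;
    }

    printf("hits=%ld misses=%ld\n", hits, misses);
    free(buf);
    return 0;
}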
On Thu, Apr 17, 2014 at 9:21 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> several orders of magnitude more often. That's clearly bad. On >> systems that are not too heavily loaded it doesn't matter too much >> because we just fault the page right back in from the OS pagecache. > > Ehhh. No. If it's a hot page that we've been holding in *our* cache > long enough, the kernel will happily evict it as 'cold' from *its* > cache, leading to... > >> But I've done pgbench runs where such decisions lead to long stalls, >> because the page has to be brought back in from disk, and there's a >> long I/O queue; or maybe just because the kernel thinks PostgreSQL is >> issuing too many I/O requests and makes some of them wait to cool >> things down. > > Exactly this. Yes, I believe that's why this is so effective. -- Peter Geoghegan
* Greg Stark (stark@mit.edu) wrote: > On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote: > > Ehhh. No. If it's a hot page that we've been holding in *our* cache > > long enough, the kernel will happily evict it as 'cold' from *its* > > cache, leading to... > > This is a whole nother problem. > > It is worrisome that we could be benchmarking the page replacement > algorithm in Postgres and choose a page replacement algorithm that > chooses pages that performs well because it tends to evict pages that > are in the OS cache. And then one day (hopefully not too far off) > we'll fix the double buffering problem and end up with a strange > choice of page replacement algorithm. That's certainly possible but I don't see the double buffering problem going away any time particularly soon and, even if it does, it's likely to either a) mean we're just using the kernel's cache (eg: something w/ mmap, etc), or b) will involve so many other changes that this will end up getting changed anyway. In any case, while I think we should document any such cache management system we employ as having this risk, I don't think we should worry about it terribly much. > It also means that every benchmark is super sensitive to the how large > a fraction of system memory Postgres is managing. If A benchmark of a > page replacement algorithm with 3GB shared buffers might perform well > compared to others on a system with 8GB or 32GB total RAM but actually > be choosing pages very poorly in normal terms and perform terribly on > a system with 4GB total ram. I'm not following you here- benchmarks are already sensitive to how much of the system's memory PG is managing (and how much ends up being *dedicated* to PG's cache and therefore unavailable for other work). > Ideally what I would like to see is instrumentation of Postgres's > buffer pinning so we can generate various test loads and then just run > the different algorithms on them and measure precisely how many page > evictions it's causing and when how often it's choosing pages that > need to be read in soon after and so on. We shouldn't have to run > Postgres to get these counts at all, just run the algorithm as we read > through a text file (or database table) listing the pages being > accessed. Go for it. I'd love to see that also. Thanks, Stephen
On 04/17/2014 09:38 PM, Stephen Frost wrote: > * Greg Stark (stark@mit.edu) wrote: >> On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote: >>> Ehhh. No. If it's a hot page that we've been holding in *our* cache >>> long enough, the kernel will happily evict it as 'cold' from *its* >>> cache, leading to... >> >> This is a whole nother problem. >> >> It is worrisome that we could be benchmarking the page replacement >> algorithm in Postgres and choose a page replacement algorithm that >> chooses pages that performs well because it tends to evict pages that >> are in the OS cache. And then one day (hopefully not too far off) >> we'll fix the double buffering problem and end up with a strange >> choice of page replacement algorithm. > > That's certainly possible but I don't see the double buffering problem > going away any time particularly soon and, even if it does, it's likely > to either a) mean we're just using the kernel's cache (eg: something w/ > mmap, etc), or b) will involve so many other changes that this will end > up getting changed anyway. In any case, while I think we should > document any such cache management system we employ as having this risk, > I don't think we should worry about it terribly much. Note that if we somehow come up with a page replacement algorithm that tends to evict pages that are in the OS cache, we have effectively solved the double buffering problem. When a page is cached in both caches, evicting it from one of them eliminates the double buffering. Granted, you might prefer to evict it from the OS cache instead, and such an algorithm could be bad in other ways. But if a page replacement algorithm happens to avoid double buffering, that's a genuine merit for that algorithm. - Heikki
On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote: > On 04/17/2014 09:38 PM, Stephen Frost wrote: > >* Greg Stark (stark@mit.edu) wrote: > >>On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote: > >>>Ehhh. No. If it's a hot page that we've been holding in *our* cache > >>>long enough, the kernel will happily evict it as 'cold' from *its* > >>>cache, leading to... > >> > >>This is a whole nother problem. > >> > >>It is worrisome that we could be benchmarking the page replacement > >>algorithm in Postgres and choose a page replacement algorithm that > >>chooses pages that performs well because it tends to evict pages that > >>are in the OS cache. And then one day (hopefully not too far off) > >>we'll fix the double buffering problem and end up with a strange > >>choice of page replacement algorithm. > > > >That's certainly possible but I don't see the double buffering problem > >going away any time particularly soon and, even if it does, it's likely > >to either a) mean we're just using the kernel's cache (eg: something w/ > >mmap, etc), or b) will involve so many other changes that this will end > >up getting changed anyway. In any case, while I think we should > >document any such cache management system we employ as having this risk, > >I don't think we should worry about it terribly much. > > Note that if we somehow come up with a page replacement algorithm that tends > to evict pages that are in the OS cache, we have effectively solved the > double buffering problem. When a page is cached in both caches, evicting it > from one of them eliminates the double buffering. Granted, you might prefer > to evict it from the OS cache instead, and such an algorithm could be bad in > other ways. But if a page replacement algorithm happens avoid double > buffering, that's a genuine merit for that algorithm. I don't think it's a good idea to try to synchronize algorithms with the OSs. There's so much change about the caching logic in e.g. linux that it won't stay effective for very long. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Andres Freund (andres@2ndquadrant.com) wrote: > > Note that if we somehow come up with a page replacement algorithm that tends > > to evict pages that are in the OS cache, we have effectively solved the > > double buffering problem. When a page is cached in both caches, evicting it > > from one of them eliminates the double buffering. Granted, you might prefer > > to evict it from the OS cache instead, and such an algorithm could be bad in > > other ways. But if a page replacement algorithm happens avoid double > > buffering, that's a genuine merit for that algorithm. > > I don't think it's a good idea to try to synchronize algorithms with the > OSs. There's so much change about the caching logic in e.g. linux that > it won't stay effective for very long. There's also more than one OS... Thanks, Stephen
On Thu, Apr 17, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote: >> On 04/17/2014 09:38 PM, Stephen Frost wrote: >> >* Greg Stark (stark@mit.edu) wrote: >> >>On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost <sfrost@snowman.net> wrote: >> >>>Ehhh. No. If it's a hot page that we've been holding in *our* cache >> >>>long enough, the kernel will happily evict it as 'cold' from *its* >> >>>cache, leading to... >> >> >> >>This is a whole nother problem. >> >> >> >>It is worrisome that we could be benchmarking the page replacement >> >>algorithm in Postgres and choose a page replacement algorithm that >> >>chooses pages that performs well because it tends to evict pages that >> >>are in the OS cache. And then one day (hopefully not too far off) >> >>we'll fix the double buffering problem and end up with a strange >> >>choice of page replacement algorithm. >> > >> >That's certainly possible but I don't see the double buffering problem >> >going away any time particularly soon and, even if it does, it's likely >> >to either a) mean we're just using the kernel's cache (eg: something w/ >> >mmap, etc), or b) will involve so many other changes that this will end >> >up getting changed anyway. In any case, while I think we should >> >document any such cache management system we employ as having this risk, >> >I don't think we should worry about it terribly much. >> >> Note that if we somehow come up with a page replacement algorithm that tends >> to evict pages that are in the OS cache, we have effectively solved the >> double buffering problem. When a page is cached in both caches, evicting it >> from one of them eliminates the double buffering. Granted, you might prefer >> to evict it from the OS cache instead, and such an algorithm could be bad in >> other ways. But if a page replacement algorithm happens avoid double >> buffering, that's a genuine merit for that algorithm. > > I don't think it's a good idea to try to synchronize algorithms with the > OSs. There's so much change about the caching logic in e.g. linux that > it won't stay effective for very long. No. but if you were very judicious, maybe you could hint the o/s (posix_fadvise) about pages that are likely to stay hot that you don't need them. I doubt that's necessary though -- if the postgres caching algorithm improves such that there is a better tendency for hot pages to stay in s_b, Eventually the O/S will deschedule the page for something else that needs it. In other words, otherwise preventable double buffering is really a measurement of bad eviction policy because it manifests in volatility of frequency accessed pages. merlin
On Thu, Apr 17, 2014 at 11:53 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > No. but if you were very judicious, maybe you could hint the o/s > (posix_fadvise) about pages that are likely to stay hot that you don't > need them. Mitsumasa KONDO wrote a patch like that. I don't think the results were that promising, but things change quickly. -- Peter Geoghegan
* Merlin Moncure (mmoncure@gmail.com) wrote: > I doubt that's necessary though -- if the postgres caching algorithm > improves such that there is a better tendency for hot pages to stay in > s_b, Eventually the O/S will deschedule the page for something else > that needs it. In other words, otherwise preventable double > buffering is really a measurement of bad eviction policy because it > manifests in volatility of frequency accessed pages. I wonder if it would help to actually tell the OS to read in buffers that we're *evicting*... On the general notion that if the OS already has them buffered then it's almost a no-op, and if it doesn't and it's actually a 'hot' buffer that we're gonna need again shortly, the OS will have it. In other words, try to make the OS more like a secondary cache to ours by encouraging it to cache things we're evicting. Thanks, Stephen
On Thu, Apr 17, 2014 at 2:00 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Merlin Moncure (mmoncure@gmail.com) wrote: >> I doubt that's necessary though -- if the postgres caching algorithm >> improves such that there is a better tendency for hot pages to stay in >> s_b, Eventually the O/S will deschedule the page for something else >> that needs it. In other words, otherwise preventable double >> buffering is really a measurement of bad eviction policy because it >> manifests in volatility of frequency accessed pages. > > I wonder if it would help to actually tell the OS to read in buffers > that we're *evicting*... On the general notion that if the OS already > has them buffered then it's almost a no-op, and if it doesn't and it's > actually a 'hot' buffer that we're gonna need again shortly, the OS will > have it. > > In other words, try to make the OS more like a secondary cache to ours > by encouraging it to cache things we're evicting. I don't think this would work unless we would keep some kind of tracking information on the page itself which seems not worth a write operation to do (maybe if the page is dirtied it could be snuck in there though...). IOW, it would only make sense to do this if we knew that this page was likely to be read in again. This might be true in general on particular workloads but is probably a pretty flimsy assumption without supporting evidence; probably better to let the O/S deal with it. merlin
Stephen Frost <sfrost@snowman.net> writes: > I wonder if it would help to actually tell the OS to read in buffers > that we're *evicting*... On the general notion that if the OS already > has them buffered then it's almost a no-op, and if it doesn't and it's > actually a 'hot' buffer that we're gonna need again shortly, the OS will > have it. But if it's actually gone cold, you're just forcing unnecessary read I/O, not to mention possibly causing something slightly warmer to be lost from kernel cache. regards, tom lane
* Merlin Moncure (mmoncure@gmail.com) wrote: > I don't think this would work unless we would keep some kind of > tracking information on the page itself which seems not worth a write > operation to do (maybe if the page is dirtied it could be snuck in > there though...). IOW, it would only make sense to do this if we knew > that this page was likely to be read in again. This might be true in > general on particular workloads but is probably a pretty flimsy > assumption without supporting evidence; probably better to let the O/S > deal with it. The trouble is that we're ending up "hiding" the information from the OS about the frequency of utilization of that page. You have a good point and we wouldn't want to do this for pages that are just accessed once or similar, but perhaps just mark a page that's reached the 'max' as having been 'hot' and then, for those pages, advise the OS that while we're under pressure and need to push this page out, it was once pretty hotly used and therefore we may want it again soon. For pages that never reach the 'max' level, we wouldn't do anything on the assumption that those were only temporarily needed. Just some thoughts. Thanks, Stephen
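In case it helps the discussion, the shape of that idea is roughly the sketch below. The helper name, the was_hot flag and the call site are all assumptions; the only real API used is posix_fadvise with POSIX_FADV_WILLNEED, which asks the kernel to start reading the block into its own page cache.

#include <stdbool.h>
#include <sys/types.h>
#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Illustrative only: when evicting a buffer whose usage_count had
 * saturated, i.e. one we considered "hot", hint the kernel to pull that
 * block into its page cache so a re-read shortly afterwards is cheap.
 * Pages that never reached the max get no hint at all.
 */
static void
advise_kernel_on_eviction(int fd, long blocknum, bool was_hot)
{
    if (was_hot)
        (void) posix_fadvise(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
}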
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > I wonder if it would help to actually tell the OS to read in buffers > > that we're *evicting*... On the general notion that if the OS already > > has them buffered then it's almost a no-op, and if it doesn't and it's > > actually a 'hot' buffer that we're gonna need again shortly, the OS will > > have it. > > But if it's actually gone cold, you're just forcing unnecessary read I/O, > not to mention possibly causing something slightly warmer to be lost from > kernel cache. Certainly possible- see the email I just sent about another thought around this. Obviously, none of these thoughts are really fully formed solutions and are, instead, just speculation and ideas. Thanks, Stephen
On Thu, Apr 17, 2014 at 2:16 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Merlin Moncure (mmoncure@gmail.com) wrote: >> I don't think this would work unless we would keep some kind of >> tracking information on the page itself which seems not worth a write >> operation to do (maybe if the page is dirtied it could be snuck in >> there though...). IOW, it would only make sense to do this if we knew >> that this page was likely to be read in again. This might be true in >> general on particular workloads but is probably a pretty flimsy >> assumption without supporting evidence; probably better to let the O/S >> deal with it. > > The trouble is that we're ending up "hiding" the information from the OS > about the frequency of utilization of that page. You have a good point > and we wouldn't want to do this for pages that are just accessed once or > similar, but perhaps just mark a page that's reached the 'max' as having > been 'hot' and then, for those pages, advise the OS that while we're > under pressure and need to push this page out, it was once pretty hottly > used and therefore we may want it again soon. > > For pages that never reach the 'max' level, we wouldn't do anything on > the assumption that those were only temporairly needed. yeah -- the thing is, we are already too spendy already on supplemental write i/o (hint bits, visible bits, freezing, etc) and likely not worth it to throw something else on the pile unless the page is already dirty; the medium term trend in storage is that read vs write performance is becoming increasingly asymmetric, particularly on the random side so it's very unlikely to balance out. merlin
On Thursday, April 17, 2014, Merlin Moncure <mmoncure@gmail.com> wrote:
> yeah -- the thing is, we are already too spendy already on
> supplemental write i/o (hint bits, visible bits, freezing, etc) and
> likely not worth it to throw something else on the pile unless the
> page is already dirty; the medium term trend in storage is that read
> vs write performance is becoming increasingly asymmetric, particularly
> on the random side so it's very unlikely to balance out.
Guess I wasn't clear but I was thinking to read the page in, not do any writing, and do it in an asynchronous way to the process doing the evicting.
Thanks,
Stephen
On Thu, Apr 17, 2014 at 2:28 PM, Stephen Frost <sfrost@snowman.net> wrote: > > > On Thursday, April 17, 2014, Merlin Moncure <mmoncure@gmail.com> wrote: >> >> yeah -- the thing is, we are already too spendy already on >> supplemental write i/o (hint bits, visible bits, freezing, etc) and >> likely not worth it to throw something else on the pile unless the >> page is already dirty; the medium term trend in storage is that read >> vs write performance is becoming increasingly asymmetric, particularly >> on the random side so it's very unlikely to balance out. > > Guess I wasn't clear but I was thinking to read the page in, not do any > writing, and do it in an asynchronous way to the process doing the evicting. no -- I got you. My point was, that's a pure guess unless you base it on evidence recorded on the page itself. Without that evidence (which requires writing), the operating system is in a better place to make that guess, so it's probably better to defer that decision. merlin
On Thursday, April 17, 2014, Merlin Moncure <mmoncure@gmail.com> wrote:
> no -- I got you. My point was, that's a pure guess unless you base it
> on evidence recorded on the page itself. Without that evidence
> (which requires writing), the operating system is in a better place to make
> that guess, so it's probably better to defer that decision.
Well, we'd only need that info to be stored in the buffer cache somehow - it wouldn't have to go to disk or cause more I/O, of course. My thinking was that we could track it with the existing counter too, avoiding even that small amount of locking to write to the buffer page.
Thanks,
Stephen
On Thu, Apr 17, 2014 at 8:10 AM, Greg Stark <stark@mit.edu> wrote: > I don't think "common sense" is compelling. I think you need to pin > down exactly what it is about btree intermediate pages that the LRU > isn't capturing and not just argue they're more useful. The LRU is > already capturing which pages are more heavily used than others so you > need to identify what it is that makes index pages *even more* useful > than their frequency and recency of access indicates. Not just that > they're more useful than an average page. See example 1.1 within the LRU-K paper. > So what I think is missing is that indexes are always accessed from > the root down to the leaf. So the most recent page accessed will > always be the leaf. And in whatever chain of pages was used to reach > the last leaf page the least recently accessed will always be the > root. But we'll need the root page again on the subsequent descent > even if it's to reach the same leaf page we kept in ram in preference > to it. I can't imagine that this is much of a problem in practice. Consider the break-down of pages within indexes when pgbench scale is 5,000, as in my original benchmark:

[local] pg@pgbench=# with tots as (
    SELECT count(*) c, type, relname
    from (select relname, relpages, generate_series(1, relpages - 1) i
          from pg_class c join pg_namespace n on c.relnamespace = n.oid
          where relkind = 'i' and nspname = 'public') r,
         lateral (select * from bt_page_stats(relname, i)) u
    group by relname, type)
select tots.relname, relpages - 1 as non_meta_pages, c,
       c / sum(c) over (partition by tots.relname) as prop_of_index, type
from tots join pg_class c on c.relname = tots.relname
order by 2 desc, 1, type;

        relname        | non_meta_pages |    c    |       prop_of_index        | type
-----------------------+----------------+---------+----------------------------+------
 pgbench_accounts_pkey |        1370950 |    4828 |     0.00352164557423684307 | i
 pgbench_accounts_pkey |        1370950 | 1366121 |     0.99647762500455888253 | l
 pgbench_accounts_pkey |        1370950 |       1 | 0.000000729421204274408257 | r
 pgbench_tellers_pkey  |            274 |     273 |     0.99635036496350364964 | l
 pgbench_tellers_pkey  |            274 |       1 |     0.00364963503649635036 | r
 pgbench_branches_pkey |             28 |      27 |     0.96428571428571428571 | l
 pgbench_branches_pkey |             28 |       1 |     0.03571428571428571429 | r
(7 rows)

Time: 14562.297 ms

Just over 99.6% of pages (leaving aside the meta page) in the big 10 GB pgbench_accounts_pkey index are leaf pages. The inner pages and root page are at an enormous advantage. In this example, the other indexes don't even have what would be separately classified as an inner page (and not a root page) at all, because it's perfectly sufficient to only have a root page to get to any one of, say, 273 leaf pages (in the case of pgbench_tellers_pkey here). -- Peter Geoghegan
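As a sanity check, the implied fan-out can be backed out directly from those figures; a throwaway calculation using nothing beyond the numbers already shown above:

#include <stdio.h>

int
main(void)
{
    /* figures from the pgbench_accounts_pkey breakdown above */
    double leaf  = 1366121.0;   /* leaf pages */
    double inner = 4828.0;      /* inner pages, excluding the root */
    double total = 1370950.0;   /* all non-meta pages */

    printf("approximate fan-out: %.0f\n", leaf / inner);   /* ~283 */
    printf("leaf page fraction:  %.4f\n", leaf / total);   /* ~0.9965 */
    return 0;
}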
On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote: > Just over 99.6% of pages (leaving aside the meta page) in the big 10 > GB pgbench_accounts_pkey index are leaf pages. That's a rather nice number. I knew it was big, but I'd have guessed it'd be a percent lower. Do you happen to have the same stat handy for a sensibly wide text or numeric real world index? It'd be interesting to see what the worst case there is. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 17, 2014 at 1:39 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote: >> Just over 99.6% of pages (leaving aside the meta page) in the big 10 >> GB pgbench_accounts_pkey index are leaf pages. > > That's a rather nice number. I knew it was big, but I'd have guessed > it'd be a percent lower. Yes, it's usually past 99.5% for int4. It's really bad if it's as low as 96%, and I think that often points to what are arguably bad indexing choices, like indexing text columns that have long text strings. > Do you happen to have the same stat handy for a sensibly wide text or > numeric real world index? It'd be interesting to see what the worst case > there is. Yes, as it happens I do: http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com I was working off of my Mouse Genome database, which is actually real-world data used by medical researchers, stored in a PostgreSQL database by those researchers and made available for the benefit of other medical researchers. -- Peter Geoghegan
On Thu, Apr 17, 2014 at 1:33 PM, Peter Geoghegan <pg@heroku.com> wrote: > I can't imagine that this is much of a problem in practice. Although I will add that not caching highly useful inner pages for the medium term, because that index isn't being used at all for 5 minutes probably is very bad. Using the 4,828 buffers that it would take to store all the inner pages (as in my large primary index example) to go store something else is probably penny wise and pound foolish. -- Peter Geoghegan
On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan <pg@heroku.com> wrote: > Although I will add that not caching highly useful inner pages for the > medium term, because that index isn't being used at all for 5 minutes > probably is very bad. Using the 4,828 buffers that it would take to > store all the inner pages (as in my large primary index example) to go > store something else is probably penny wise and pound foolish. But there could easily be 20 unused indexes for every 1 index that is being used.
On Thu, Apr 17, 2014 at 6:50 PM, Greg Stark <stark@mit.edu> wrote: > On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Although I will add that not caching highly useful inner pages for the >> medium term, because that index isn't being used at all for 5 minutes >> probably is very bad. Using the 4,828 buffers that it would take to >> store all the inner pages (as in my large primary index example) to go >> store something else is probably penny wise and pound foolish. > > But there could easily be 20 unused indexes for every 1 index that is > being used. Sure, but then there might not be. Obviously there is a trade-off to be made between recency and frequency. One interesting observation in the LRU-K paper is that for their test case, a pure LFU actually works very well, despite, as the authors acknowledge, being a terrible algorithm in the real world. That's because their test case is so simple, and concerns only one table/index, with a uniform distribution. -- Peter Geoghegan
One idea I have in mind for the eviction policy is to introduce an ageing factor in each buffer and take the ageing factor into consideration when evicting a buffer.
Consider a case where a table is pretty huge and spread across multiple pages. The querying pattern is like a time series pattern i.e. a set of rows is queried pretty frequently for some time, making the corresponding page hot. Then, the next set of rows is queried frequently making that page hot and so on.
Consider a new page entering the shared buffers with refcount 1 and usage_count 1. If that page is a part of the workload described above, it is likely that it shall not be used for a considerable amount of time after it has entered the buffers but will be used eventually.
Now, the current hypothetical situation is that we have three pages:
1) The page that used to be hot at the previous time window but is no longer hot and is actually the correct candidate for eviction.
2) The current hot page (it won't be evicted anyway for now).
3) The new page which just got in and should not be evicted since it can be hot soon (for this workload it will be hot in the next time window).
When the clocksweep algorithm runs the next time, it will see the new buffer page as the one to be evicted (since page (1) may still have usage_count > 0, i.e. it may be 'cooling' but not 'cool' yet).
This can be changed by introducing an ageing factor that sees how much time the current buffer has spent in shared buffers. If the time that the buffer has spent is large enough (relatively) and it is not hot currently, that means it has had its chance and can be evicted. This shall save the new page (3) from being evicted, since its time in shared buffers shall not be high enough to mandate eviction and it shall be given more chances.
Since gettimeofday() is an expensive call and hence cannot be done in the tight loop, we can count the number of clocksweeps the current buffer has seen (rather, survived). This shall give us a rough estimate of the relative age of the buffer.
When an eviction happens, all the candidates with refcount = 0 shall be taken. Then, among them, the one with the highest ageing factor shall be evicted.
Of course, there may be better ways of doing the same, but I want to highlight the point (or possibility) of introducing an ageing factor to prevent eviction of relatively younger pages early in the eviction process.
The overhead isn't too big. We just need to add another attribute in the buffer header for the number of clocksweeps seen (rather, survived) and check it when an eviction is taking place. The existing spinlock for buffer headers shall be good for protecting contention and access. The access rules can be similar to that of usage_count.
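A minimal sketch of the bookkeeping being proposed (the struct, field and function names are invented for illustration; the real buffer header, its spinlock rules and the refcount handling are all glossed over here):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint32_t refcount;
    uint16_t usage_count;
    uint32_t sweeps_survived;    /* proposed: clock sweeps this buffer has lived through */
} SketchBufferHeader;

/* called for each buffer the clock hand passes over */
static void
clock_visit(SketchBufferHeader *buf)
{
    buf->sweeps_survived++;      /* cheap logical age, no gettimeofday() needed */
    if (buf->usage_count > 0)
        buf->usage_count--;
}

/* among refcount == 0 candidates, prefer the buffer that has been around longest */
static bool
prefer_for_eviction(const SketchBufferHeader *a, const SketchBufferHeader *b)
{
    return a->sweeps_survived > b->sweeps_survived;
}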
Thoughts and comments?
Regards,
Atri
--
Regards,
Atri
l'apprenant
On Thu, Apr 17, 2014 at 5:00 PM, Greg Stark <stark@mit.edu> wrote: > On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Because all the usage counts are the same, the eviction at >> this point is completely indiscriminate. We're just as likely to kick >> out a btree root page or a visibility map page as we are to kick out a >> random heap page, even though the former have probably been accessed >> several orders of magnitude more often. That's clearly bad. > > That's not clear at all. In that circumstance regardless of what page > you evict you're incurring precisely one page fault i/o when the page > is read back in. I am a bit confused by this remark. In *any* circumstance when you evict you're incurring precisely one page fault I/O when the page is read back in. That doesn't mean that the choice of which page to evict is irrelevant. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I am a bit confused by this remark. In *any* circumstance when you > evict you're incurring precisely one page fault I/O when the page is > read back in. That doesn't mean that the choice of which page to > evict is irrelevant. But you might be evicting a page that will be needed soon or one that won't be needed for a while. If it's not needed for a while you might be able to avoid many page evictions by caching a page that will be used several times. If all the pages currently in RAM are hot -- meaning they're hot enough that they'll be needed again before the page you're reading in -- then they're all equally bad to evict. I'm trying to push us away from the gut instinct that frequently used pages are important to cache and towards actually counting how many i/os we're saving. In the extreme it's possible to simulate any cache algorithm on a recorded list of page requests and count how many page misses it generates to compare it with an optimal cache algorithm. -- greg
On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote: > This can be changed by introducing an ageing factor that sees how much time the > current buffer has spend in shared buffers. If the time that the buffer has > spent is large enough (relatively) and it is not hot currently, that means it > has had its chance and can be evicted. This shall save the new page (3) from > being evicted since it's time in shared buffers shall not be high enough to > mandate eviction and it shall be given more chances. > > Since gettimeofday() is an expensive call and hence cannot be done in the tight > loop, we can count the number of clocksweeps the current buffer has seen > (rather, survived). This shall give us a rough idea of the estimate of the > relative age of the buffer. Counting clock sweeps is an interesting idea. I think one concern was tracking hot buffers in cases where there is no memory pressure, and hence the clock sweep isn't running --- I am not sure how this would help in that case. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Sat, Apr 19, 2014 at 1:07 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote:
> > This can be changed by introducing an ageing factor that sees how much time the
> > current buffer has spend in shared buffers. If the time that the buffer has
> > spent is large enough (relatively) and it is not hot currently, that means it
> > has had its chance and can be evicted. This shall save the new page (3) from
> > being evicted since it's time in shared buffers shall not be high enough to
> > mandate eviction and it shall be given more chances.
> >
> > Since gettimeofday() is an expensive call and hence cannot be done in the tight
> > loop, we can count the number of clocksweeps the current buffer has seen
> > (rather, survived). This shall give us a rough idea of the estimate of the
> > relative age of the buffer.
>
> Counting clock sweeps is an interesting idea. I think one concern was
> tracking hot buffers in cases where there is no memory pressure, and
> hence the clock sweep isn't running --- I am not sure how this would
> help in that case.
I feel that if there is no memory pressure, frankly it doesn't matter much about what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot, i.e. high memory pressure workloads. Of course, I may be totally wrong here.
One thing that I discussed with Merlin offline and am now concerned about is how the actual eviction will work. We cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the oldest one.
Any thoughts there would be appreciated.
Regards,
Atri
Regards,
Atri
l'apprenant
On Apr 18, 2014, at 1:51 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype.
> Counting clock sweeps is an interesting idea. I think one concern was
> tracking hot buffers in cases where there is no memory pressure, and
> hence the clock sweep isn't running --- I am not sure how this would
> help in that case.
The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system.
What’s wrong with using a transaction number or some similar sequence? It would accurately track “age” in the sense we care about: how long ago in “units of real work being done by the DB” something was added.
—Jason
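A sketch of the sort of thing being suggested here (all names are invented; an XID-style counter would serve just as well as the per-allocation tick shown): stamp buffers with a counter that only advances when the database does real work, rather than with wall-clock time.

#include <stdint.h>

static uint64_t work_clock = 0;      /* advances per unit of work, never with idle time */

typedef struct
{
    uint64_t last_access_tick;       /* work_clock value at the last access */
} BufferAge;

/* e.g. called once per buffer allocation, or once per transaction */
static void
advance_work_clock(void)
{
    work_clock++;
}

static void
note_buffer_access(BufferAge *age)
{
    age->last_access_tick = work_clock;   /* "how long ago" is work_clock minus this value */
}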
Jason Petersen wrote:
> Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype.
> The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system.
> What’s wrong with using a transaction number or some similar sequence? It would accurately track “age” in the sense we care about: how long ago in “units of real work being done by the DB” something was added.
Well, AIUI, we only need the 'relative' age of buffers in relation to the youngest buffer present. So, the guy who has seen the maximum amount of clocksweeps is the guy who has been around the most.
I do not see a need for an accurate estimate of the time spent in the buffer for any purpose right now. It may be useful in the future though.
How do you get the transaction ID? By accessing a tuple on the page and reading its XMIN?
Regards,
Atri
--
Regards,
Atri
l'apprenant
On Fri, Apr 18, 2014 at 1:11 PM, Jason Petersen <jason@citusdata.com> wrote: > Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday > seems silly to me: it was something quick for the prototype. The gettimeofday() call doesn't need to happen in a tight loop. It can be reasonably passed down from higher up, very probably without consequence. The LRU-K paper actually recommends a delay of 5 seconds. There is at least one other major database system that unambiguously uses a wall-clock delay of 3 seconds (by default) for this exact purpose - avoiding what the LRU-K paper calls "correlated references". I'm not saying that we should go with that particular scheme for these reasons. However, it's plainly untrue that using wall time like this represents some kind of insurmountable scalability obstacle. > The problem with the clocksweeps is they don’t actually track the > progression of “time” within the PostgreSQL system. > > What’s wrong with using a transaction number or some similar sequence? It > would accurately track “age” in the sense we care about: how long ago in > “units of real work being done by the DB” something was added. The LRU-K paper suggests that a scheme like this could be preferable. I have my doubts that it can be made to work with Postgres. -- Peter Geoghegan
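For reference, the "correlated references" guard amounts to something like the sketch below. Everything here is illustrative: the names are invented, the coarse timestamp is assumed to be obtained once per batch of work higher up the call stack rather than per buffer touch, and the cap of 30 simply mirrors what the prototype described upthread uses.

#include <stdbool.h>

#define CORRELATED_REFERENCE_PERIOD 3.0   /* seconds; the LRU-K paper suggests ~5 */
#define USAGE_CAP 30

typedef struct
{
    double last_reference;    /* coarse timestamp of the last *distinct* reference */
    int    usage_count;
} RefHistory;

/*
 * Accesses that fall within the window are treated as a single logical
 * reference, so a burst of touches during one scan does not inflate the
 * page's apparent popularity.
 */
static void
note_reference(RefHistory *h, double now)
{
    if (now - h->last_reference >= CORRELATED_REFERENCE_PERIOD)
    {
        if (h->usage_count < USAGE_CAP)
            h->usage_count++;
        h->last_reference = now;
    }
}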
On Sat, Apr 19, 2014 at 01:21:29AM +0530, Atri Sharma wrote: > I feel that if there is no memory pressure, frankly it doesnt matter much about > what gets out and what not. The case I am specifically targeting is when the > clocksweep gets to move about a lot i.e. high memory pressure workloads. Of > course, I may be totally wrong here. > > One thing that I discussed with Merlin offline and am now concerned about is > how will the actual eviction work. We cannot traverse the entire list and then > find all the buffers with refcount 0 and then do another traversal to find the > oldest one. I thought if there was memory pressure the clock sweep would run and we wouldn't have everything at the max counter access value. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Sat, Apr 19, 2014 at 3:37 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > One thing that I discussed with Merlin offline and am now concerned about is
> > how will the actual eviction work. We cannot traverse the entire list and then
> > find all the buffers with refcount 0 and then do another traversal to find the
> > oldest one.
>
> I thought if there was memory pressure the clock sweep would run and we
> wouldn't have everything at the max counter access value.
Hmm, I see your point.
Granting that, I feel that the clocksweep-counting/logical-clock system shall still be useful when deciding between multiple candidates for eviction. At worst, it can serve to replace the gettimeofday() calls.
One thing I have thought of, with ideas and inputs from Joshua Yanowski offline, is that we can probably have a max-heap keyed on the logical clock age of buffers. Each time the clocksweep sees a buffer whose refcount has become zero, it will push the buffer into the max-heap. This could be a new representation of the freelist or an additional data structure.
This still does not solve the problem of the clocksweep having to scan the entire list, even if it makes the eviction process itself O(1) with the addition of the max-heap.
I am working on a PoC patch but am stuck on this point. My current approach scans the entire shared buffers list to search for candidate buffers.
Another thing that is a pain point here is the concurrency and locking overheads of introducing a new data structure. Can the existing buffer header spinlock handle this problem or is it hitting the granularity of the spinlock too much?
I see some blockers for this idea still. Nevertheless, the idea of clocksweep counts as logical clocks seems promising, at least intuitively.
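To make the heap part concrete, here is a rough sketch of an array-backed binary max-heap keyed on sweeps survived (all names are invented, the fixed capacity is arbitrary, and locking is ignored entirely; popping the root yields the oldest refcount == 0 buffer):

#include <stdint.h>

#define MAX_CANDIDATES 1024

typedef struct
{
    int      buf_id;
    uint32_t sweeps_survived;    /* logical age: larger means older */
} Candidate;

typedef struct
{
    Candidate items[MAX_CANDIDATES];
    int       n;
} CandidateHeap;

static void
heap_push(CandidateHeap *h, int buf_id, uint32_t age)
{
    int i;

    if (h->n >= MAX_CANDIDATES)
        return;                              /* sketch: silently drop when full */
    i = h->n++;
    h->items[i].buf_id = buf_id;
    h->items[i].sweeps_survived = age;
    while (i > 0)                            /* sift up towards the root */
    {
        int parent = (i - 1) / 2;
        Candidate tmp;

        if (h->items[parent].sweeps_survived >= h->items[i].sweeps_survived)
            break;
        tmp = h->items[parent];
        h->items[parent] = h->items[i];
        h->items[i] = tmp;
        i = parent;
    }
}

/* caller must ensure h->n > 0; returns the oldest candidate's buf_id */
static int
heap_pop(CandidateHeap *h)
{
    int victim = h->items[0].buf_id;
    int i = 0;

    h->items[0] = h->items[--h->n];
    for (;;)                                 /* sift the new root back down */
    {
        int l = 2 * i + 1, r = 2 * i + 2, best = i;
        Candidate tmp;

        if (l < h->n && h->items[l].sweeps_survived > h->items[best].sweeps_survived)
            best = l;
        if (r < h->n && h->items[r].sweeps_survived > h->items[best].sweeps_survived)
            best = r;
        if (best == i)
            break;
        tmp = h->items[best];
        h->items[best] = h->items[i];
        h->items[i] = tmp;
        i = best;
    }
    return victim;
}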
Thoughts and comments?
Regards,
Atri
--
Regards,
Atri
l'apprenant
On 4/15/14, 1:15 PM, Peter Geoghegan wrote: > On Tue, Apr 15, 2014 at 9:30 AM, Merlin Moncure<mmoncure@gmail.com> wrote: >> >There are many reports of improvement from lowering shared_buffers. >> >The problem is that it tends to show up on complex production >> >workloads and that there is no clear evidence pointing to problems >> >with the clock sweep; it could be higher up in the partition locks or >> >something else entirely (like the O/S). pgbench is also not the >> >greatest tool for sniffing out these cases: it's too random and for >> >large database optimization is generally an exercise in de-randomizing >> >i/o patterns. We really, really need a broader testing suite that >> >covers more usage patterns. > I find it quite dissatisfying that we know so little about this. This is an area where additional stats gathering would be very valuable. We're running 8G buffers on 512G servers because bumping it up hurt performance, but we have no clue why. Was it due to buffer pins? How many times the clock had to sweep to find a victim? Something else entirely? No idea... :( -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 4/18/14, 2:51 PM, Atri Sharma wrote: > > I feel that if there is no memory pressure, frankly it doesnt matter much about what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot i.e. high memory pressure workloads. Of course, I may be totally wrong here. Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting anything*. > One thing that I discussed with Merlin offline and am now concerned about is how will the actual eviction work. We cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the oldest one. This is why OSes use multiple page pools. If we're going to use a clock sweep at all I think we need to use the same. Every time we discuss this stuff it feels like we're completely reinventing the wheel that was solved by OSes years ago. :( -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 4/16/14, 10:28 AM, Robert Haas wrote: > Also, I think the scalability problems around buffer eviction are > eminently solvable, and in particular I'm hopeful that Amit is going > to succeed in solving them. Suppose we have a background process > (whether the background writer or some other) that runs the clock > sweep, identifies good candidates for eviction, and pushes them on a > set of, say, 16 free-lists protected by spinlocks. (The optimal > number of free-lists probably depends on the size of shared_buffers.) How *certain* are we that a single freelist lock (that actually ONLY protects the freelist) would be that big a deal? I suspect it wouldn't be much of an issue at all:
- Right now (IIRC) it's tied into the clock as well, so immediate fail on scaling...
- The clock is WAY more expensive than grabbing one buffer off the free list. Last I looked it was so bad that even if the next buffer the clock hit was free it was still worse than hitting the free list.
I strongly suspect that a single freelist lock (that didn't protect anything else) would be fine. I think it'd be folly to start with a more complex multi-lock/multi-freelist implementation before we knew we needed one. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby-2 wrote >> I feel that if there is no memory pressure, frankly it doesnt matter much >> about what gets out and what not. The case I am specifically targeting is >> when the clocksweep gets to move about a lot i.e. high memory pressure >> workloads. Of course, I may be totally wrong here. > > Well, there's either memory pressure or there isn't. If there isn't then > it's all moot *because we're not evicting anything*. The trade-off I'm seeing here is between measuring when there is no memory pressure - and thus eating at performance while not actually evicting buffers - and not measuring but then encountering memory pressure and not having a clue as to what should be evicted. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Clock-sweep-not-caching-enough-B-Tree-leaf-pages-tp5799947p5800988.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Mon, Apr 21, 2014 at 8:07 PM, David G Johnston <david.g.johnston@gmail.com> wrote: > Jim Nasby-2 wrote >>> I feel that if there is no memory pressure, frankly it doesnt matter much >>> about what gets out and what not. The case I am specifically targeting is >>> when the clocksweep gets to move about a lot i.e. high memory pressure >>> workloads. Of course, I may be totally wrong here. >> >> Well, there's either memory pressure or there isn't. If there isn't then >> it's all moot *because we're not evicting anything*. > > The trade-off I'm seeing here is between measuring when there is no memory > pressure - and thus eating at performance while not actually evicting > buffers - and not measuring but then encountering memory pressure and not > having a clue as to what should be evicted. I believe that for the intended use discussed in this thread, a compile-time switch would be more than enough control, and it would avoid that tradeoff.
Jim Nasby <jim@nasby.net> writes: > How *certain* are we that a single freelist lock (that actually ONLY > protects the freelist) would be that big a deal? We used to have one. It was a big bottleneck --- and this was years ago, when the buffer manager was much less scalable than it is today. (IIRC, getting rid of a central lock was one of the main advantages of the current clock sweep code over its predecessor.) The real issue here is that in the modern code, we hardly ever actually have anything in the freelist: only when a relation is dropped, or something like that, do buffers ever get put to the freelist. So your argument that removing a buffer from the freelist is cheaper than running the clock sweep is largely missing the point. We'd have to run a clock sweep in order to find something to put in the freelist. regards, tom lane
On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > We used to have one. It was a big bottleneck --- and this was years > ago, when the buffer manager was much less scalable than it is today. > (IIRC, getting rid of a central lock was one of the main advantages > of the current clock sweep code over its predecessor.) Yes, it was. This is a major advantage of clock sweep, and anything that replaces it will need to maintain the same advantage. Didn't someone indicate that clock sweep could beat ARC around that time, presumably for this reason? If no one did, then my reading of a variety of other papers on caching indicates that this is probably the case. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> We used to have one. It was a big bottleneck --- and this was years >> ago, when the buffer manager was much less scalable than it is today. >> (IIRC, getting rid of a central lock was one of the main advantages >> of the current clock sweep code over its predecessor.) > Yes, it was. This is a major advantage of clock sweep, and anything > that replaces it will need to maintain the same advantage. Didn't > someone indicate that clock sweep could beat ARC around that time, > presumably for this reason? If no one did, then my reading of a > variety of other papers on caching indicates that this is probably the > case. ARC *was* the predecessor algorithm. See commit 5d5087363. regards, tom lane
On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ARC *was* the predecessor algorithm. See commit 5d5087363. I believe that the main impetus for replacing ARC with clock sweep came from patent issues, though. It was a happy coincidence that clock sweep happened to be better than ARC, but that doesn't mean that ARC didn't have some clear advantages, even if it wasn't worth it on balance. LRU-K, and 2Q have roughly the same advantages. I'm reasonably confident you can have the best of both worlds, or something closer to it. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ARC *was* the predecessor algorithm. See commit 5d5087363. > I believe that the main impetus for replacing ARC with clock sweep > came from patent issues, though. That was one issue, but performance gains were a large part of it too, and the main reason why we picked clock sweep rather than something else. Did you read the commit message I pointed to? (See also 4e8af8d27.) regards, tom lane
On Mon, Apr 21, 2014 at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Did you read the commit message I pointed to? Yes. > (See also 4e8af8d27.) Oh, I wasn't actually aware of the fact that 2Q made it into the tree. I thought that the first commit message you referred to just referenced on-list discussion of 2Q. Interesting. -- Peter Geoghegan
On Mon, Apr 21, 2014 at 5:59 PM, Peter Geoghegan <pg@heroku.com> wrote: > LRU-K, and 2Q have roughly the same advantages. I'm > reasonably confident you can have the best of both worlds, or > something closer to it. Having said that, a big part of what I'd like to accomplish here is to address the more general problem of "correlated references". That's probably something that has independent value. -- Peter Geoghegan
Here is a benchmark that is similar to my earlier one, but with a rate limit of 125 tps, to help us better characterize how the prototype patch helps performance: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/ Again, these are 15 minute runs with unlogged tables at multiple client counts, and scale 5,000. Every test run should have managed to hit that limit, but in the case of master 1 test did not. I have included vmstat, iostat and meminfo OS instrumentation this time around, which is really interesting for this particular limit based benchmark. The prototype patch tested here is a slight refinement on my earlier prototype. Apart from reducing the number of gettimeofday() calls, and doing them out of the critical path, I increased the initial usage_count value to 6. I also set BM_MAX_USAGE_COUNT to 30. I guess I couldn't resist the temptation to tweak things, which I actually did very little of prior to publishing my initial results. This helps, but there isn't a huge additional benefit. The benchmark results show that master cannot even meet the 125 tps limit on 1 test out of 9. More interestingly, the background writer consistently cleans about 10,000 buffers per test run when testing the patch. At the same time, buffers are allocated at a very consistent rate of around 262,000 for the patched test runs. Leaving aside the first test run, with the patch there is only a tiny variance in the number cleaned between each test, a variance of just a few hundred buffers. In contrast, master has enormous variance. During just over half of the tests, the background writer does not clean even a single buffer. Then, on 2 tests out of 9, it cleans an enormous ~350,000 buffers. The second time this happens leads to master failing to even meet the 125 tps limit (albeit with only one client). If you drill down to individual test runs, a similar pattern is evident. You'll now find operating system information (meminfo dirty memory) graphed here. The majority of the time, master does not hold more than 1,000 kB of dirty memory at a time. Once or twice it's 0 kB for multiple minutes. However, during the test runs where we also see huge spikes in background-writer-cleaned pages, we also see huge spikes in the total amount of dirty memory (and correlated huge spikes in latency). It can get to highs of ~700,000 kB at one point. In contrast, the patched tests show very consistent amounts of dirty memory. Per test, it almost always tops out at 4,000 kB - 6,000 kB (there is a single 12,000 kB spike, though). There is consistently a distinct zig-zag pattern to the dirty memory graph with the patched tests, regardless of client count or where checkpoints occur. Master shows mountains and valleys for those two particularly problematic tests, correlating with a panicked background writer's aggressive feedback loop. Master also shows less rhythmic zig-zag patterns that only peak at about 600 kB - 1,000 kB for the entire duration of many individual test runs. Perhaps most notably, average and worst case latency is far improved with the patch. On average it's less than half of master with 1 client, and less than a quarter of master with 32 clients. I think that the rate limiting feature of pgbench is really useful for characterizing how work like this improves performance. I see a far smoother and more consistent pattern of I/O that superficially looks like Postgres is cooperating with the operating system much more than it does in the baseline. 
It sort of makes sense that the operating system cache doesn't care about frequency while Postgres does. If the OS cache did weigh frequency, it would surely not alter the outcome very much, since OS cached data has presumably not been accessed very frequently recently. I suspect that I've cut down on double buffering by quite a bit. I would like to come up with a simple way of measuring that, using something like pgfincore, but the available interfaces don't seem well-suited to quantifying how much of a problem this is and remains. I guess calling pgfincore on the index segment files might be interesting, since shared_buffers mostly holds index pages. This has been verified using pg_buffercache. It would be great to test this out with something involving non-uniform distributions, like Gaussian and Zipfian distributions. The LRU-K paper tests Zipfian too. The uniform distribution pgbench uses here, while interesting, doesn't tell the full story at all and is less representative of reality (TPC-C is formally required to have a non-uniform distribution [1] for some things, for example). A Gaussian distribution might show essentially the same failure to properly credit pages with frequency of access in one additional dimension, so to speak. Someone should probably look at the TPC-C-like DBT-2, since in the past that was considered to be a problematic workload for PostgreSQL [2] due to the heavy I/O. Zipfian is a lot less sympathetic than uniform if the LRU-K paper is any indication, and so if we're looking for a worst-case, that's probably a good place to start. It's not obvious how you'd go about actually constructing a practical test case for either, though. I should acknowledge that I clearly have regressed one aspect that is evident from the benchmark: Cleanup time (which is recorded per test) takes more than twice as long. This makes intuitive sense, though. VACUUM first creates a list of tuples to kill by scanning the heap. It then goes to the indexes to kill them there first. It then returns to the heap, and kills heap tuples in a final scan. Clearly it is bad for VACUUM that shared_buffers mostly contains index pages when it begins. That said, I haven't actually considered the interactions with buffer access strategies here. It might well be more complicated than I've suggested. To be thorough, I've repeated each test set. There is a "do over" for both master and patched, which serves to show how repeatable the original test sets are. [1] Clause 2.1.6, TPC-C specification: http://www.tpc.org/tpcc/spec/tpcc_current.pdf [2] https://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting#DBT-2_I.2FO_Performance -- Peter Geoghegan
Jason Petersen wrote:
> Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was
> something quick for the prototype.
>
> The problem with the clocksweeps is they don’t actually track the progression of “time” within the
> PostgreSQL system.

Would it make sense to just cache the result of the latest gettimeofday() call and use that as an approximation for wall time? The busier the system is, the more accurate that should be.

Yours,
Laurenz Albe
On Tue, Apr 22, 2014 at 12:59 PM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> Jason Petersen wrote:
>> Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was
>> something quick for the prototype.
>>
>> The problem with the clocksweeps is they don’t actually track the progression of “time” within the
>> PostgreSQL system.
>
> Would it make sense to just cache the result of the latest gettimeofday() call
> and use that as an approximation for wall time?
> The busier the system is, the more accurate that should be.
That sounds...risky. How will the invalidation/update of the cache work?
How will we track the time window in which the cached value is still valid and applicable?
My first thoughts only. I may be missing the point though.
--
Regards,
Atri
l'apprenant
On 04/17/2014 10:39 PM, Andres Freund wrote:
> On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote:
>> Just over 99.6% of pages (leaving aside the meta page) in the big 10
>> GB pgbench_accounts_pkey index are leaf pages.

What is the depth of the B-Tree at this percentage?

Cheers,
Hannu
On 4/21/14, 6:07 PM, David G Johnston wrote:
> Jim Nasby-2 wrote
>>> I feel that if there is no memory pressure, frankly it doesnt matter much
>>> about what gets out and what not. The case I am specifically targeting is
>>> when the clocksweep gets to move about a lot i.e. high memory pressure
>>> workloads. Of course, I may be totally wrong here.
>>
>> Well, there's either memory pressure or there isn't. If there isn't then
>> it's all moot *because we're not evicting anything*.
>
> The trade-off I'm seeing here is between measuring when there is no memory
> pressure - and thus eating at performance while not actually evicting
> buffers - and not measuring but then encountering memory pressure and not
> having a clue as to what should be evicted.

Right. OSes handle this by keeping a certain ratio of active vs inactive pages, regardless of pressure for free pages. That way when you need more pages in the free list you can pull them from the inactive list knowing that you're making a good decision.

One of the really nice things about this approach is that if memory pressure is low enough that you don't need more pages on the inactive list you don't even need to run that clock.
--
Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, Apr 22, 2014 at 2:03 AM, Hannu Krosing <hannu@krosing.net> wrote: > What is the depth of b-tree at this percentage ? Well, this percentage of B-Tree pages that are leaf pages doesn't have much to do with the depth. The percentage seems very consistent for each B-Tree, irrespective of the total size of the B-Tree. -- Peter Geoghegan
On Mon, Apr 21, 2014 at 11:57 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Here is a benchmark that is similar to my earlier one, but with a rate
> limit of 125 tps, to help us better characterize how the prototype
> patch helps performance:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/

I've added some more test sets to this result report, again with a 125 TPS limit, but on this occasion with a pgbench Gaussian distribution. I used V13 of the recently proposed Gaussian distribution pgbench patch [1] to accomplish this, including the Gaussian variant of tpc-b that the pgbench patch has baked in. The distribution threshold used was consistently 5, causing the patched pgbench to report for each test:

transaction type: Custom query
scaling factor: 5000
standard deviation threshold: 5.00000
access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741

It looks like the patch continues to have much lower latency than master for this somewhat distinct workload. Actually, even though the background writer is working somewhat harder than in the uniform distribution case, the average latency with patched is appreciably lower. Total buffers allocated are just as consistent as before for patched, but the number is markedly lower than for the prior uniform distribution case. Dirty memory graphs start off similar to the uniform case with patched, but get a bit spikier towards the end of each test run there. It's still *markedly* better than master for either distribution type; master is still really aggressive at times, and at other times not nearly aggressive enough, in much the same way as before.

In general, with the Gaussian distribution, average latency is lower, but worst case is higher. The patch maintains its clear lead for average case, albeit a smaller lead than with uniform, and with worst case things are much better relatively speaking. Absolute worst case (and not worst case averaged across client counts) is 1.4 seconds with patched, to 8.3 with master...and that terrible worst case happens *twice* with master. For uniform distribution, the same figure was 5.4 - 5.8 seconds for master, and 0.6 seconds for patched.

What is curious is that with master and with the Gaussian distribution, I see a distinct latency "no man's land" in multiple test runs, like this one here for example: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/49/index.html. It looks like there is a clear differentiation between going to disk and not going to disk, or something like that. I don't see this for any other case, and it's quite obviously a consistent and distinct feature of master + Gaussian when the OS isn't aggressively writing out a mountain of dirty memory. This is something that I personally have never seen before. I also note that master had 3 huge background writer spikes with a Gaussian distribution, rather than 2 and 1 small one, as was (consistently) demonstrated to happen with a uniform distribution.

What's more, 90th percentile latency is very consistent across client counts for the new patched test run, as opposed to being very much higher with higher client counts when master is tested.

[1] http://www.postgresql.org/message-id/alpine.DEB.2.10.1404011107220.2557@sto
--
Peter Geoghegan
I've now done a non-limited comparative benchmark of master against the patch (once again, with usage_count starting at 6, and BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again, the distribution threshold used was consistently 5.0, causing the patched pgbench to report for each test:

transaction type: Custom query
scaling factor: 5000
standard deviation threshold: 5.00000
access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741

Results are available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/
--
Peter Geoghegan
On Fri, Apr 18, 2014 at 11:46 AM, Greg Stark <stark@mit.edu> wrote: > On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I am a bit confused by this remark. In *any* circumstance when you >> evict you're incurring precisely one page fault I/O when the page is >> read back in. That doesn't mean that the choice of which page to >> evict is irrelevant. > > But you might be evicting a page that will be needed soon or one that > won't be needed for a while. If it's not needed for a while you might > be able to avoid many page evictions by caching a page that will be > used several times. Sure. > If all the pages currently in RAM are hot -- meaning they're hot > enough that they'll be needed again before the page you're reading in > -- then they're all equally bad to evict. Also true. But the problem is that it is very rarely, if ever, the case that all pages are *equally* hot. On a pgbench workload, for example, I'm very confident that while there's not really any cold data, the btree roots and visibility map pages are a whole lot hotter than a randomly-selected heap page. If you evict a heap page, you're going to need it back pretty quick, because it won't be long until the random-number generator again chooses a key that happens to be located on that page. But if you evict the root of the btree index, you're going to need it back *immediately*, because the very next query, no matter what key it's looking for, is going to need that page. I'm pretty sure that's a significant difference. > I'm trying to push us away from the gut instinct that frequently used > pages are important to cache and towards actually counting how many > i/os we're saving. In the extreme it's possible to simulate any cache > algorithm on a recorded list of page requests and count how many page > misses it generates to compare it with an optimal cache algorithm. There's another issue, which Simon clued me into a few years back: evicting the wrong page can cause system-wide stalls. In the pgbench case, evicting a heap page will force the next process that chooses a random number that maps to a tuple on that page to wait for the page to be faulted back in. That's sad, but unless the scale factor is small compared to the number of backends, there will probably be only ONE process waiting. On the other hand, if we evict the btree root, within a fraction of a second, EVERY process that isn't already waiting on some other I/O will be waiting for that I/O to complete. The impact on throughput is much bigger in that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby <jim@nasby.net> wrote:
>> I feel that if there is no memory pressure, frankly it doesnt matter much
>> about what gets out and what not. The case I am specifically targeting is
>> when the clocksweep gets to move about a lot i.e. high memory pressure
>> workloads. Of course, I may be totally wrong here.
>
> Well, there's either memory pressure or there isn't. If there isn't then
> it's all moot *because we're not evicting anything*.

I don't think that's really true. A workload can fit within shared_buffers at some times and spill beyond it at others. Every time it fits within shared_buffers for even a short period of time, the reference count of any buffer that's not ice-cold goes to 5 and we essentially lose all knowledge of which buffers are relatively hotter. Then, when we spill out again, evictions are random.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 28, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Also true. But the problem is that it is very rarely, if ever, the > case that all pages are *equally* hot. On a pgbench workload, for > example, I'm very confident that while there's not really any cold > data, the btree roots and visibility map pages are a whole lot hotter > than a randomly-selected heap page. If you evict a heap page, you're > going to need it back pretty quick, because it won't be long until the > random-number generator again chooses a key that happens to be located > on that page. But if you evict the root of the btree index, you're > going to need it back *immediately*, because the very next query, no > matter what key it's looking for, is going to need that page. I'm > pretty sure that's a significant difference. I emphasized leaf pages because even with master the root and inner pages are still going to be so hot as to make them constantly in cache, at least with pgbench's use of a uniform distribution. You'd have to have an absolutely enormous scale factor before this might not be the case. As such, I'm not all that worried about inner pages when performing these simple benchmarks. However, in the case of the pgbench_accounts table, each of the B-Tree leaf pages that comprise about 99.5% of the total is still going to be about six times more frequently accessed than each heap page. That's a small enough difference for it to easily go unappreciated, and yet a big enough difference for it to hurt a lot. -- Peter Geoghegan
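To spell out the arithmetic behind that "about six times" (back-of-the-envelope, from the sizes quoted upthread rather than any new measurement): at scale 5,000 the heap and the primary index cover the same 500 million rows, but the heap is ~63 GB (roughly 8.3 million 8 kB pages) while the leaf level of the index is ~10 GB (roughly 1.3 million pages). With uniformly random account lookups, each lookup lands on one leaf page and one heap page of the accounts table, so any individual leaf page is hit about 63/10, or a little over six times as often as any individual heap page.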
On Fri, Apr 25, 2014 at 10:45 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I've now done a non-limited comparative benchmark of master against
> the patch (once again, with usage_count starting at 6, and
> BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again,
> the distribution threshold used was consistently 5.0, causing the
> patched pgbench to report for each test:
>
> transaction type: Custom query
> scaling factor: 5000
> standard deviation threshold: 5.00000
> access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741
>
> Results are available from:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/

I updated this with various changes in bgwriter configuration. Perhaps unsurprisingly, disabling the background writer entirely helps for both master and patched. It is perhaps notable that the largest difference between two comparable patch + master test runs is seen when the background writer is disabled entirely, with 32 clients.
--
Peter Geoghegan
On 4/28/14, 8:04 AM, Robert Haas wrote:
> On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby <jim@nasby.net> wrote:
>>> I feel that if there is no memory pressure, frankly it doesnt matter much
>>> about what gets out and what not. The case I am specifically targeting is
>>> when the clocksweep gets to move about a lot i.e. high memory pressure
>>> workloads. Of course, I may be totally wrong here.
>>
>> Well, there's either memory pressure or there isn't. If there isn't then
>> it's all moot *because we're not evicting anything*.
>
> I don't think that's really true. A workload can fit within
> shared_buffers at some times and spill beyond it at others. Every
> time it fits within shared_buffers for even a short period of time,
> the reference count of any buffer that's not ice-cold goes to 5 and we
> essentially lose all knowledge of which buffers are relatively hotter.
> Then, when we spill out again, evictions are random.

That's a separate problem, but yes, just because we're not evicting something doesn't mean we can end up with every buffer marked as equally important.

OSes handle this by splitting pages between active and inactive, and maintaining a relative balance between the two (actually a bit more complex because there's a separate inactive/dirty pool).

In our case this could maybe be handled by simply not incrementing counts when there's no eviction... but I'm more a fan of separate pools/clocks, because that means you can do things like a LFU for active and an LRU for inactive.
--
Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby <jim@nasby.net> wrote:
> In our case this could maybe be handled by simply not
> incrementing counts when there's no eviction... but I'm more a
> fan of separate pools/clocks, because that means you can do
> things like a LFU for active and an LRU for inactive.

I have hesitated to mention some benchmarks I did for optimal caching techniques for a database load, because they were so old, but maybe the ideas might spark something of value in the discussion. I'm talking about early 1985 on 80286 hardware on DOS with a Terminate and Stay Resident (TSR) cache external to the database. The external cache used LRU caching, and I was looking at what caching I could do inside the database to improve real database workloads, which tended to include both OLTP and reporting.

I found two types of caches improved performance. Neither was a substitute for the LRU cache closer to the hardware, and eliminating either reduced performance over having both.

One was index-specific -- each connection caused to be held in cache the last page at each level of the index. This proved useful because in our real life applications it turned out that the next "random" access on an index was very often the same as or near the previous one.

The other was a "weighted average" of access counts -- each access bumped a count and after a certain number of bumps all counts were reduced by 25%. This was accomplished by setting each count to the sum of its existing value shifted right by one and shifted right by two.

I understand that with the much larger RAM caches available 30 years later there could be some problems with passing all the counts atomically without causing a noticeable pause, and the higher connection counts may cause more contention issues. But if those issues could be solved (or somehow dodged for a proof of concept benchmark) it might be interesting to see how that worked out.

FWIW, I recall that we used a one byte counter for each page, running 0 to 255. I don't recall the number at which we effectively multiplied by 0.75, and there was nothing particularly magic about that multiplier other than it was pretty fast and worked better than 0.5 in my benchmarks. I also don't remember what we used as the initial value on a page load.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote: > each connection caused to be held in cache the last page at each > level of the index. Apologies for ambiguous terminology there. To be clear: the most recently accessed page at each level of the index. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
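In case a concrete toy helps, the decay scheme Kevin describes above might look something like this (names invented, one-byte counters, the shift-based 0.75 multiplier; obviously nothing like the original 1985 code):

#include <stdint.h>
#include <stdio.h>

#define NBUFFERS 8

static uint8_t access_count[NBUFFERS];

/* Each access bumps a one-byte counter, saturating at 255. */
static void
bump(int buf)
{
    if (access_count[buf] < 255)
        access_count[buf]++;
}

/* Periodically, every count is multiplied by ~0.75 using only shifts. */
static void
decay_all(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        access_count[i] = (uint8_t) ((access_count[i] >> 1) + (access_count[i] >> 2));
}

int
main(void)
{
    /* Buffer 0 stays hot; buffer 1 was hot once and then went cold. */
    for (int round = 0; round < 5; round++)
    {
        for (int i = 0; i < 40; i++)
            bump(0);
        if (round == 0)
        {
            for (int i = 0; i < 40; i++)
                bump(1);
        }
        decay_all();
        printf("round %d: hot=%u formerly-hot=%u\n",
               round, (unsigned) access_count[0], (unsigned) access_count[1]);
    }
    return 0;
}

The count of the buffer that keeps getting touched settles at a level proportional to its per-period access rate, while the formerly hot buffer's count drains away by roughly 25% per period -- which is the frequency-plus-recency balance being argued for in this thread.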
On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> Anyways, I'm still curious if you can post similar numbers basing the >> throttling on gross allocation counts instead of time. Meaning: some >> number of buffer allocations has to have occurred before you consider >> eviction. Besides being faster I think it's a better implementation: >> an intermittently loaded server will give more consistent behavior. > > Yeah --- I think wall-clock-based throttling is fundamentally the wrong > thing anyway. Are we going to start needing a CPU speed measurement to > tune the algorithm with? Not the place to be going. But driving it off > the number of allocations that've been done could be sensible. (OTOH, > that means you need a central counter, which itself would be a > bottleneck.) So, I was thinking about this a little bit more today, prodded by my coworker John Gorman. I'm wondering if we could drive this off of the clock sweep; that is, every time the clock sweep touches a buffer, its usage count goes down by one, but we also set two flag bits. Every time the buffer gets touched thereafter, we check whether any flag bits are set; if so, we clear one and increase the usage count, else we do nothing. So the usage count can increase at most twice per clock sweep. The advantage of that is that, as with Peter's approach, it is harder for the usage count of a buffer to max out - to get there, you need sustained access over a longer period of time. But it's not time-dependent, so replaying the same workload at twice the speed or half the speed on faster or slower hardware doesn't change the choice of which buffer to evict, which seems good. And it will cause us to prefer buffers which are accessed repeatedly over a period of time rather than buffers that are accessed a bunch of times in quick succession and then not touched again for a while, which seems like a good bet. I can't convince myself that this fully solves the (currently existing) problem of usage counts increasing too fast. In theory, every time the clock sweep hand advances, it decrements the usage count of one buffer by one, but also makes it possible for the usage count of that buffer to increase by up to two (or by only one if the buffer previously had a usage count of five). So with the right access pattern, it seems like you could still get to a place where a typical buffer eviction has to do a lot of scanning before it finds a victim buffer. As with the present algorithm, we'd be most likely to have that problem when the buffer hit ratio is very high, so that there are many opportunities for usage counts to go up between each opportunity to push usage counts back down. But it seems like it would at least limit the damage. I haven't tried this or anything, so this is just random brainstorming. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
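A toy sketch of that scheme, to make it concrete (names invented; the two flag bits are written as a small 0..2 credit counter rather than literal bits, and this is obviously not a patch):

#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_USAGE_COUNT 5

typedef struct
{
    uint8_t usage_count;
    uint8_t credits;            /* 0..2, refilled each time the clock hand visits */
} ToyBufDesc;

/* A backend touches the buffer: promote only if a credit is available,
 * so the usage count can rise by at most two per clock sweep. */
static void
toy_buffer_touched(ToyBufDesc *buf)
{
    if (buf->credits > 0)
    {
        buf->credits--;
        if (buf->usage_count < TOY_MAX_USAGE_COUNT)
            buf->usage_count++;
    }
}

/* The clock hand visits the buffer: demote it and hand out two fresh
 * credits; the buffer is a victim candidate once its count reaches zero. */
static bool
toy_clock_sweep_visit(ToyBufDesc *buf)
{
    if (buf->usage_count > 0)
        buf->usage_count--;
    buf->credits = 2;
    return buf->usage_count == 0;
}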
On Tue, Apr 14, 2015 at 2:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > So, I was thinking about this a little bit more today, prodded by my > coworker John Gorman. I'm wondering if we could drive this off of the > clock sweep; that is, every time the clock sweep touches a buffer, its > usage count goes down by one, but we also set two flag bits. Every > time the buffer gets touched thereafter, we check whether any flag > bits are set; if so, we clear one and increase the usage count, else > we do nothing. So the usage count can increase at most twice per > clock sweep. The advantage of that is that, as with Peter's approach, > it is harder for the usage count of a buffer to max out - to get > there, you need sustained access over a longer period of time. But > it's not time-dependent, so replaying the same workload at twice the > speed or half the speed on faster or slower hardware doesn't change > the choice of which buffer to evict, which seems good. Why is that good? My little prototype basically implemented the LRU-K approach to preventing correlated references (which tpc-b has some of, in a really obvious way - not sure how significant that is). This would have accidentally made it weight frequency better, but the prototype did not pretend to actually implement something like LRU-K (which is not the state of the art in any case), because it did not consider the recency of the second-to-last access of the buffer at all. The prototype's performance started off well, but regressed in later pgbench-collector iterations (due to the influence of VACUUM, I think). If I wanted to cheat, I could have only done one large 45 minute run, which would have made the prototype look much better still. > And it will > cause us to prefer buffers which are accessed repeatedly over a period > of time rather than buffers that are accessed a bunch of times in > quick succession and then not touched again for a while, which seems > like a good bet. I think that people were all too quick to dismiss the idea of a wall time interval playing some role here (at least as a defense against correlated references, as a correlated reference period). I suppose that that's because it doesn't fit with an intuition that says that that kind of interval ought to be derived algebraically - magic delay settings are considered suspect. However, the same can be said of "the Five-minute rule", which is a highly influential rule of thumb that Jim Gray came up with. The Five minute rule is now obsolete, but that took a long time, and the fundamental observation still applies (Wikipedia says it depends on what type of disks you have these days, but the fact remains that rule of thumbs like this can be more or less robust). I think that correlated reference type delays *are* used effectively in real world systems without it being fragile/overly workload dependent. > I can't convince myself that this fully solves the (currently > existing) problem of usage counts increasing too fast. In theory, > every time the clock sweep hand advances, it decrements the usage > count of one buffer by one, but also makes it possible for the usage > count of that buffer to increase by up to two (or by only one if the > buffer previously had a usage count of five). So with the right > access pattern, it seems like you could still get to a place where a > typical buffer eviction has to do a lot of scanning before it finds a > victim buffer. 
> As with the present algorithm, we'd be most likely to > have that problem when the buffer hit ratio is very high, so that > there are many opportunities for usage counts to go up between each > opportunity to push usage counts back down. But it seems like it > would at least limit the damage. > I haven't tried this or anything, so this is just random brainstorming. That's what it'll take, I think -- random brainstorming, and crude prototypes to test theories. As long as we're doing random brainstorming, I'd suggest looking at making clocksweep actually approximate LRU-K/LRU-2 (which, again, to be clear, my prototype did not do). The clocksweep could maintain statistics about the recency of the second-to-last access across all buffers, and discriminate against buffers according to what bucket of the population they fit into. Not sure how aggressively we'd penalize those buffers that had very old penultimate references (or credit those that had very recent penultimate references), or what the bucket partitioning scheme is, but that's probably where I'd take it next. For example, buffers with a penultimate reference that is more than a standard deviation below the mean would be double penalized (and maybe the opposite, for those buffers with penultimate accesses a stddev above the mean). If that didn't work so well, then I'd look into an ARC style recency and frequency list (while remembering things about already evicted blocks, which LRU-K does not do... although that paper is from the early 1990s). There are approaches to relieving lock contention with ARC (e.g., CAR, or what OpenZFS now does [1]). So maybe we could just look to doing something similar if simpler approaches turn out to be less effective. [1] http://blog.delphix.com/prakash/2015/03/23/openzfs-reducing-arc-lock-contention/ -- Peter Geoghegan
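A very loose sketch of how that discrimination might look per buffer (all names and the specific penalty policy are invented for illustration, and how the mean and stddev would be maintained cheaply is hand-waved away entirely):

#include <stdint.h>

#define TOY_MAX_USAGE_COUNT 5

typedef struct
{
    uint8_t  usage_count;
    uint64_t last_access;   /* some cheap logical clock, not gettimeofday() */
    uint64_t prev_access;   /* the penultimate access */
} ToyBufDesc;

/* On access: remember the previous access time and promote as usual. */
static void
toy_buffer_touched(ToyBufDesc *buf, uint64_t now)
{
    buf->prev_access = buf->last_access;
    buf->last_access = now;
    if (buf->usage_count < TOY_MAX_USAGE_COUNT)
        buf->usage_count++;
}

/* On a clock sweep visit: demote more aggressively when the penultimate
 * access is unusually old, and go easy when it is unusually recent. */
static void
toy_clock_sweep_visit(ToyBufDesc *buf, uint64_t now,
                      double mean_penultimate_age, double stddev)
{
    double  age = (double) (now - buf->prev_access);
    uint8_t decrement = 1;

    if (age > mean_penultimate_age + stddev)
        decrement = 2;          /* double-penalize: probably a correlated burst */
    else if (age < mean_penultimate_age - stddev)
        decrement = 0;          /* credit genuinely frequently used buffers */

    buf->usage_count = (buf->usage_count > decrement) ?
        (uint8_t) (buf->usage_count - decrement) : 0;
}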
On 4/14/15 5:22 PM, Peter Geoghegan wrote:
> As long as we're doing random brainstorming, I'd suggest looking at
> making clocksweep actually approximate LRU-K/LRU-2 (which, again, to
> be clear, my prototype did not do). The clocksweep could maintain
> statistics about the recency of the second-to-last access across all
> buffers, and discriminate against buffers according to what bucket of
> the population they fit into. Not sure how aggressively we'd penalize
> those buffers that had very old penultimate references (or credit
> those that had very recent penultimate references), or what the bucket
> partitioning scheme is, but that's probably where I'd take it next.
> For example, buffers with a penultimate reference that is more than a
> standard deviation below the mean would be double penalized (and maybe
> the opposite, for those buffers with penultimate accesses a stddev
> above the mean). If that didn't work so well, then I'd look into an
> ARC style recency and frequency list (while remembering things about
> already evicted blocks, which LRU-K does not do... although that paper
> is from the early 1990s).

Along the lines of brainstorming... why do we even allow usage_count > 1? Clocksweep was used pretty successfully by at least FreeBSD, but they simply used a bit to indicate recently used. Anything that wasn't recently used moved from the active pool to the inactive pool (which tended to be far larger than the active pool with decent amounts of memory), and a small number of buffers were kept on the 'free' list by pulling them out of the inactive pool and writing them if they were dirty. All of this was done on an LRU basis.

Given how common it is for the vast bulk of shared_buffers in an install to be stuck at 5, I'd think the first thing we should try is a combination of greatly reducing the max for usage_count (maybe to 2 instead of 1 to simulate 2 pools), and running the clock sweep a lot more aggressively in a background process.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
<p dir="ltr">I've been meaning to write this since PGConf and now isn't a great time since I'm on my phone but I think it'stime.<p dir="ltr">The way the clock sweep algorithm is meant to be thought about is that it's an approximate lru. Eachusage count corresponds to an ntile of the lru. So we don't know which buffer is least recently used but it must be inthe set of buffers with usage count 0 and that should be 1/nth of all the buffers.<p dir="ltr">In order for that propertyto be maintained though the usage count for some buffer should be getting decremented every time we touch a buffer.That is, every time we promote one buffer to the most recently moved ntile we should be demoting some other buffer.<pdir="ltr">The way our cache works we promote when a buffer is accessed but we only demote when a buffer is flushed.We flush a lot less often than we touch buffers so it's not surprising that the cache ends up full of buffers thatare all in the "most recently used" section.<p dir="ltr">Now it's complicated by the fact that we aren't promoting buffersdirectly to the most recently used ntile. We're incrementing the usage count by one. That makes it more of a "leastfrequently used" list rather than a lru. I think that's a mistake but I recall some debate about that when it firstwent in.
On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan <pg@heroku.com> wrote: > Why is that good? We did discuss this before. I've recapped some of what I believe to be the most salient points below. > I think that people were all too quick to dismiss the idea of a wall > time interval playing some role here (at least as a defense against > correlated references, as a correlated reference period). I suppose > that that's because it doesn't fit with an intuition that says that > that kind of interval ought to be derived algebraically - magic delay > settings are considered suspect. Yep, Tom gave that reason here: http://www.postgresql.org/message-id/11258.1397673898@sss.pgh.pa.us But there was also this point from Andres - gettimeofday is not free: http://www.postgresql.org/message-id/20140416075307.GC3906@awork2.anarazel.de And this point from me - this can degrade to random eviction under high pressure: http://www.postgresql.org/message-id/CA+TgmoayUxr55zuEaPP6d2XByicJWACC9Myyn5aT4TiNdSJqYw@mail.gmail.com You'll notice that my proposal avoids all three of those objections. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 15, 2015 at 2:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Merlin Moncure <mmoncure@gmail.com> writes:
> >> Anyways, I'm still curious if you can post similar numbers basing the
> >> throttling on gross allocation counts instead of time. Meaning: some
> >> number of buffer allocations has to have occurred before you consider
> >> eviction. Besides being faster I think it's a better implementation:
> >> an intermittently loaded server will give more consistent behavior.
> >
> > Yeah --- I think wall-clock-based throttling is fundamentally the wrong
> > thing anyway. Are we going to start needing a CPU speed measurement to
> > tune the algorithm with? Not the place to be going. But driving it off
> > the number of allocations that've been done could be sensible. (OTOH,
> > that means you need a central counter, which itself would be a
> > bottleneck.)
>
> So, I was thinking about this a little bit more today, prodded by my
> coworker John Gorman. I'm wondering if we could drive this off of the
> clock sweep; that is, every time the clock sweep touches a buffer, its
> usage count goes down by one, but we also set two flag bits. Every
> time the buffer gets touched thereafter, we check whether any flag
> bits are set; if so, we clear one and increase the usage count, else
> we do nothing. So the usage count can increase at most twice per
> clock sweep. The advantage of that is that, as with Peter's approach,
> it is harder for the usage count of a buffer to max out - to get
> there, you need sustained access over a longer period of time. But
> it's not time-dependent, so replaying the same workload at twice the
> speed or half the speed on faster or slower hardware doesn't change
> the choice of which buffer to evict, which seems good. And it will
> cause us to prefer buffers which are accessed repeatedly over a period
> of time rather than buffers that are accessed a bunch of times in
> quick succession and then not touched again for a while, which seems
> like a good bet.
>
IIUC, this will allow us to increase usage count only when the buffer
is touched by clocksweep to decrement the usage count.
I think such a solution will be good for the cases when many evictions
needs to be performed to satisfy the workload, OTOH when there are
not too many evictions that needs to be done, in such a case some of
the buffers that are accessed much more will have equal probability to
get evicted as compare to buffers which are less accessed.
On Tue, Apr 14, 2015 at 7:10 PM, Greg Stark <stark@mit.edu> wrote:
> The way the clock sweep algorithm is meant to be thought about is that it's
> an approximate lru. Each usage count corresponds to an ntile of the lru. So
> we don't know which buffer is least recently used but it must be in the set
> of buffers with usage count 0 and that should be 1/nth of all the buffers.

Agreed.

> In order for that property to be maintained though the usage count for some
> buffer should be getting decremented every time we touch a buffer. That is,
> every time we promote one buffer to the most recently used ntile we should
> be demoting some other buffer.

Agreed. It's not easy to get this behavior exactly, though, because the buffer you kick out necessarily has a usage count of 0 and the one you bring in probably shouldn't. And we don't wanna have to run the clock sweep every time somebody touches a non-maximal usage count. But I think it is still true that this is, to some degree, what our algorithm is trying to approximate, and I also think it's pretty clear that our current approximation isn't that great.

> The way our cache works we promote when a buffer is accessed but we only
> demote when a buffer is flushed. We flush a lot less often than we touch
> buffers so it's not surprising that the cache ends up full of buffers that
> are all in the "most recently used" section.

This isn't really correct. We promote when it's accessed, but we demote it when the clock sweep hand passes over it, which happens each time we consider it for eviction. It does not have to do with flushing dirty data to disk, and it does not happen only when the buffer is actually evicted.

> Now it's complicated by the fact that we aren't promoting buffers directly
> to the most recently used ntile. We're incrementing the usage count by one.
> That makes it more of a "least frequently used" list rather than a lru. I
> think that's a mistake but I recall some debate about that when it first
> went in.

Note that the discussions of 2Q, LRU(k), and perhaps others ask not only how recently the page was used, but how frequently it was used. The recency of the next-to-last access is often used as a proxy for frequency. Consider two buffers. One gets touched 5 times in a row, once a day. The other gets touched 5 times per day, at equal intervals. In general, the second buffer is a better choice to retain than the first, even if it has been touched less recently.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > IIUC, this will allow us to increase usage count only when the buffer > is touched by clocksweep to decrement the usage count. Yes. > I think such a solution will be good for the cases when many evictions > needs to be performed to satisfy the workload, OTOH when there are > not too many evictions that needs to be done, in such a case some of > the buffers that are accessed much more will have equal probability to > get evicted as compare to buffers which are less accessed. Possibly, but I think it's even worse under the current algorithm. Under this proposal, if we go for a long time without any buffer evictions, every buffer usage's count will top out at 2 more than wherever it was after the last clock sweep. In the worst case, every buffer (or most of them) could end up with the same usage count. But under the status quo, they'll all go to 5, which is an even bigger loss of information, and which will make the first eviction much more expensive than if they are all pegged at 2 or 3. There could be ways to improve things further, such as by slowly advancing the clock sweep even when no eviction is required. But that's a bit tricky to do, too. You'd like (perhaps) to advance it one step for every buffer allocation, but you don't want a centralized counter, or unnecessary contention on the clock sweep when no eviction is necessary in the first place. There's probably some way to make it work, though, if we put our mind to it. I'm inclined to try the simpler approach first and see what it buys. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 15, 2015 at 10:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > IIUC, this will allow us to increase usage count only when the buffer
> > is touched by clocksweep to decrement the usage count.
>
> Yes.
>
> > I think such a solution will be good for the cases when many evictions
> > needs to be performed to satisfy the workload, OTOH when there are
> > not too many evictions that needs to be done, in such a case some of
> > the buffers that are accessed much more will have equal probability to
> > get evicted as compare to buffers which are less accessed.
>
> Possibly, but I think it's even worse under the current algorithm.
> Under this proposal, if we go for a long time without any buffer
> evictions, every buffer usage's count will top out at 2 more than
> wherever it was after the last clock sweep. In the worst case, every
> buffer (or most of them) could end up with the same usage count. But
> under the status quo, they'll all go to 5, which is an even bigger
> loss of information, and which will make the first eviction much more
> expensive than if they are all pegged at 2 or 3.
>
Okay, got your point. On further thought, it occurred to me that with
this idea you are trying to reduce the clock-sweep time, which will in
turn improve the chances of finding a usable buffer quickly under high
pressure. If that is true, then I think we should have seen a similar
gain with the bgreclaimer idea which I tried some time back, because
that also reduced the time spent in the clock sweep.
> There could be ways to improve things further, such as by slowly
> advancing the clock sweep even when no eviction is required. But
> that's a bit tricky to do, too. You'd like (perhaps) to advance it
> one step for every buffer allocation, but you don't want a centralized
> counter, or unnecessary contention on the clock sweep when no eviction
> is necessary in the first place. There's probably some way to make it
> work, though, if we put our mind to it. I'm inclined to try the
> simpler approach first and see what it buys.
>
Sure, I think it is worth a try.
On Wed, Apr 15, 2015 at 12:37:44AM -0400, Robert Haas wrote:
> > I think such a solution will be good for the cases when many evictions
> > needs to be performed to satisfy the workload, OTOH when there are
> > not too many evictions that needs to be done, in such a case some of
> > the buffers that are accessed much more will have equal probability to
> > get evicted as compare to buffers which are less accessed.
>
> Possibly, but I think it's even worse under the current algorithm.
> Under this proposal, if we go for a long time without any buffer
> evictions, every buffer usage's count will top out at 2 more than
> wherever it was after the last clock sweep. In the worst case, every
> buffer (or most of them) could end up with the same usage count. But
> under the status quo, they'll all go to 5, which is an even bigger
> loss of information, and which will make the first eviction much more
> expensive than if they are all pegged at 2 or 3.

I've been following this thread from the side with interest and got twigged by the point about loss of information. If you'd like better information about relative ages, you can achieve this by raising the cap on the usage count and dividing (or right-shifting) each sweep.

This would allow you to remember much more about the relative worth of often used pages. With a cap of 32 you'd have the same effect as now where after 5 sweeps the buffer is evicted. Mathematically the count would converge to the number of times the block is used per sweep.

If you wanted to be really clever, you could at the beginning of each sweep take an estimate of the number of buffers used since the last sweep (from the stats collector perhaps) and use that to drive your divisor, so if you have lots of allocations you become more aggressive about reducing the counts. Or if the load is light fall back to just subtracting one. Then you don't need a cap at all.

(Apologies if this has been suggested before, Google didn't find anything for me).

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
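A minimal per-buffer sketch of the divide-per-sweep idea (names invented; the adaptive-divisor variant would just replace the fixed shift):

#include <stdint.h>

#define USAGE_CAP 32            /* raised cap, instead of today's 5 */

/* Clock hand passes over the buffer: halve the count instead of subtracting one. */
static void
sweep_visit(uint8_t *usage_count)
{
    *usage_count >>= 1;
}

/* Buffer is accessed: bump the count, saturating at the higher cap. */
static void
buffer_touched(uint8_t *usage_count)
{
    if (*usage_count < USAGE_CAP)
        (*usage_count)++;
}

An untouched buffer drains from the cap to zero within a handful of sweeps, roughly what usage_count 5 gives today, while a buffer that keeps getting hit settles at a level that tracks how often it is touched per sweep.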
On Wed, Apr 15, 2015 at 5:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The way our cache works we promote when a buffer is accessed but we only
>> demote when a buffer is flushed. We flush a lot less often than we touch
>> buffers so it's not surprising that the cache ends up full of buffers that
>> are all in the "most recently used" section.
>
> This isn't really correct. We promote when it's accessed, but we
> demote it when the clock sweep hand passes over it, which happens each
> time we consider it for eviction. It does not have to do with
> flushing dirty data to disk, and it does not happen only when the
> buffer is actually evicted.

This is my point though (you're right that "flushed" isn't always the same as eviction but that's not the important point here). Right now we only demote when we consider buffers for eviction. But we promote when we pin buffers. Those two things aren't necessarily happening at the same rate and in fact are often orders of magnitude different. So it makes sense that we end up with a lot of buffers promoted all the way to the most recently used ntile and then have to do n passes around the clock and have no good information about which buffer to evict.

What I'm saying is that we should demote a buffer every time we promote a buffer. So every time we pin a buffer we should advance the clock a corresponding amount. (I know I'm being intentionally vague about what the corresponding amount is.) The important thing is that the two should be tied together.
--
greg
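For the sake of discussion, here is one literal-minded way to tie the two together (names invented, locking ignored entirely, and with "a corresponding amount" pinned at exactly one demotion per promotion):

#include <stdint.h>

#define NBUFFERS  16384
#define MAX_USAGE 5

static uint8_t usage_count[NBUFFERS];
static int     demotion_hand = 0;   /* advanced on every promotion, not only on eviction */

static void
pin_buffer(int buf)
{
    /* promote the pinned buffer... */
    if (usage_count[buf] < MAX_USAGE)
        usage_count[buf]++;

    /* ...and pay for it by demoting whatever the demotion hand points at */
    if (usage_count[demotion_hand] > 0)
        usage_count[demotion_hand]--;
    demotion_hand = (demotion_hand + 1) % NBUFFERS;
}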
On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout <kleptog@svana.org> wrote: > I've been following this thread from the side with interest and got > twigged by the point about loss of information. If you'd like better > information about relative ages, you can acheive this by raising the > cap on the usage count and dividing (or right-shifting) each sweep. Yeah, I thought about that, too. It might be worth experimenting with. > This would allow you to remember much more about about the relative > worth of often used pages. With a cap of 32 you'd have the same effect > as now where after 5 sweeps the buffer is evicted. Mathematically the > count would converge to the number of times the block is used per > sweep. Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 20, 2015 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout
> <kleptog@svana.org> wrote:
>> I've been following this thread from the side with interest and got
>> twigged by the point about loss of information. If you'd like better
>> information about relative ages, you can acheive this by raising the
>> cap on the usage count and dividing (or right-shifting) each sweep.
>
> Yeah, I thought about that, too. It might be worth experimenting with.
>
>> This would allow you to remember much more about about the relative
>> worth of often used pages. With a cap of 32 you'd have the same effect
>> as now where after 5 sweeps the buffer is evicted. Mathematically the
>> count would converge to the number of times the block is used per
>> sweep.
>
> Hmm, interesting point. It's possible that we'd still have problems
> with everything maxing out at 32 on some workloads, but at least it'd
> be a little harder to max out at 32 than at 5.

Do we have any reproducible test cases to evaluate these assumptions? I haven't looked at this stuff for a while, but my main issue with the clock sweep was finding sweep-heavy cases that did not also have trouble with the buffer mapping locks (although the facts on the ground may have changed since then).

merlin
On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark <stark@mit.edu> wrote: > This is my point though (you're right that "flushed" isn't always the > same as eviction but that's not the important point here). Right now > we only demote when we consider buffers for eviction. But we promote > when we pin buffers. Those two things aren't necessarily happening at > the same rate and in fact are often orders of magnitude different. I am absolutely, positively, violently in 100% agreement with this. I have made the same point before, but it sure is nice to hear someone else thinking about it the same way. > So > it makes sense that we end up with a lot of buffers promoted all the > way to the most recently used ntile and then have to do n passes > around the clock and have no good information about which buffer to > evict. Right. > What I'm saying is that we should demote a buffer every time we > promote a buffer. So every time we pin a buffer we should advance the > clock a corresponding amount. I know I'm being intentionally vague > about what the corresponding amount is.) The important thing is that > the two should be tied together. Yes, absolutely. If you tilt your head the right way, my proposal of limiting the number of promotions per clock sweep has the effect of tying buffer demotion and buffer promotion together much more tightly than is the case right now. You are limited to 2 promotions per demotion; and practically speaking not all buffers eligible to be promoted will actually get accessed, so the number of promotions per demotion will in reality be somewhere between 0 and 2. Ideally it would be exactly 1, but 1 +/- 1 is still a tighter limit than we have at present. Which is not to say there isn't some other idea that is better still. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 20, 2015 at 11:00 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> Hmm, interesting point. It's possible that we'd still have problems >> with everything maxing out at 32 on some workloads, but at least it'd >> be a little harder to max out at 32 than at 5. > > Do we have any reproducible test cases to evaluate these assumptions? The particular point that you are responding to here is easy to reproduce. Just create any workload that fits in shared_buffers - a small pgbench database, for example - and access stuff. No usage counts will go down, but every access will drive usage counts up. Eventually everything will hit any maximum you care to install, and actually I don't think it takes very long. You can use pg_buffercache to see the results. > I haven't looked at this stuff for a while, but my main issue with > the clock sweep was finding sweep heavy cases that did not also have > trouble with the buffer mapping locks (although the facts on the > ground my have changed since then). We increased the number of buffer mapping locks to 128, so that problem should be considerably ameliorated now. But it's not totally gone, as demonstrated by Andres's experiments with my chash patch. There was also a patch that eliminated BufFreelistLock in favor of a spinlock held for much shorter time periods, and then Andres took that one step further and made it use atomics. That used to be a *terrible* bottleneck on eviction-heavy workloads and is now much improved. More work may remain to be done, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/20/15 11:11 AM, Robert Haas wrote: > On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark <stark@mit.edu> wrote: >> This is my point though (you're right that "flushed" isn't always the >> same as eviction but that's not the important point here). Right now >> we only demote when we consider buffers for eviction. But we promote >> when we pin buffers. Those two things aren't necessarily happening at >> the same rate and in fact are often orders of magnitude different. > > I am absolutely, positively, violently in 100% agreement with this. I > have made the same point before, but it sure is nice to hear someone > else thinking about it the same way. +1 >> What I'm saying is that we should demote a buffer every time we >> promote a buffer. So every time we pin a buffer we should advance the >> clock a corresponding amount. I know I'm being intentionally vague >> about what the corresponding amount is.) The important thing is that >> the two should be tied together. > > Yes, absolutely. If you tilt your head the right way, my proposal of > limiting the number of promotions per clock sweep has the effect of > tying buffer demotion and buffer promotion together much more tightly > than is the case right now. You are limited to 2 promotions per > demotion; and practically speaking not all buffers eligible to be > promoted will actually get accessed, so the number of promotions per > demotion will in reality be somewhere between 0 and 2. Ideally it > would be exactly 1, but 1 +/- 1 is still a tighter limit than we have > at present. Which is not to say there isn't some other idea that is > better still. I think that would help, but it still leaves user backends trying to advance the clock, which is quite painful. Has anyone tested running the clock in the background? We need a wiki page with all the ideas that have been tested around buffer management... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 14, 2015 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Why is that good? > > We did discuss this before. I've recapped some of what I believe to > be the most salient points below. > >> I think that people were all too quick to dismiss the idea of a wall >> time interval playing some role here (at least as a defense against >> correlated references, as a correlated reference period). I suppose >> that that's because it doesn't fit with an intuition that says that >> that kind of interval ought to be derived algebraically - magic delay >> settings are considered suspect. > > Yep, Tom gave that reason here: > > http://www.postgresql.org/message-id/11258.1397673898@sss.pgh.pa.us > > But there was also this point from Andres - gettimeofday is not free: > > http://www.postgresql.org/message-id/20140416075307.GC3906@awork2.anarazel.de > > And this point from me - this can degrade to random eviction under > high pressure: > > http://www.postgresql.org/message-id/CA+TgmoayUxr55zuEaPP6d2XByicJWACC9Myyn5aT4TiNdSJqYw@mail.gmail.com > > You'll notice that my proposal avoids all three of those objections. All I'm saying is that you shouldn't dismiss the idea without trying it out properly. The LRU-K paper, from the early 1990s, recommends this, and gettimeofday() calls were a lot more expensive back then. I'm sure that there is a way to overcome these issues if it turns out to be worth it (by amortizing gettimeofday() calls by driving it from an auxiliary process like the bgwriter, for example). In fact, I'm almost certain there is, because at least one other major database system uses just such a reference period. Back when ARC (or was it 2Q?) was committed before being reverted shortly thereafter, there was a similar idea actually implemented, but with XIDs preventing correlated references (which the LRU-K paper also hints at). I think that an actual delay is more robust than that, though. Even though this correlated reference delay is all I implemented with my prototype,it's just one piece of the puzzle. -- Peter Geoghegan
On Mon, Apr 20, 2015 at 2:53 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > I think that would help, but it still leaves user backends trying to advance > the clock, which is quite painful. Has anyone tested running the clock in > the background? We need a wiki page with all the ideas that have been tested > around buffer management... Amit's bgreclaimer patch did that, but we weren't able to demonstrate a clear benefit. I haven't given up on the idea yet, but we've got to be able to prove that it's a good idea in practice as well as in theory. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company