Thread: less expensive pg_buffercache on big shmem
Hi hackers, Recently I have finished my work on a patch for pg_buffercache contrib, I think it's time to share my results. Introduction ============ I want to offer you the implementation that allows to decrease system workload by partially sacrificing (fully snapshot consistency) data consistency. Sometimes we do not need full data consistency, for example on quantitative rather than qualitative analysis of memory contents, or when we want to catch insufficient memory resources or how often relation is used. Problem description =================== Currently, the pg_buffercache v1.1 and prior takes an exclusive lock on all shared buffers, which greatly affects system performance. Usually we use pg_buffercache to find out why DB is working slower than expected or examine what occupies the entire memory. So when we run pg_buffercache on such system, we make it even slower. Implementation ============== Vanilla implementation contains loop which collecting statistic from whole shared memory acquire, read and release Spinlocks one by one, page by page while holding LWLock. V1.2 implementation contains flexible loop which can collect shared memory statistic using three different methods: 1) with holding LWLock only on one partition of shared memory (semiconsistent method) 2) without LWLocks (nonconsistent method), 3) or in vanilia way (consistent method) The aforementioned allow us to launch pg_buffercache in the three different ways. Each of them have some advantages and some disadvantages: Consistent: + 100% consistency of shared memory snapshot - Slowdown the system with whole shared memory exclusive lock Semiconsistent: + Faster than consistent method + Mostly doesn`t affect on the system load - Speed of taking that snapshot is low Nonconsistent: The fastest + Doesn`t noticeably affects on the systems - <3% lost of snapshot consistency What works ========== Actually, it work well even on big load, but of course there might be things I've overlooked. VIEW pg_buffercache_cons VIEW pg_buffercache_noncons VIEW pg_buffercache_semicons Examples from docs in new realization: SELECT c.relname, count(*) AS buffers FROM pg_buffercache_noncons b INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10; SELECT c.relname, count(*) AS buffers FROM pg_buffercache_semicons b INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10; Testing the implementation ====================== How implementation tested: 1) Start server 2) Make pgbench tps pgbench -c 250 -s 1000 -T 200 -P1 3) Compare how tps sags under load if: SELECT count(*) FROM pg_buffercache_cons; SELECT count(*) FROM pg_buffercache_semicons; SELECT count(*) FROM pg_buffercache_noncons; This test was made on server (server parameters) Model name: Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz CPU(s): 144 Socket(s): 4 Shared_buffers: 200GB Results of testing ====================== Our DBA team obtained the following results: Nonconsistent: * 10% faster then consistent method * doesn`t noticeably affects on the systems * the maximum loss of accuracy was less then 3%* ( in most situation it is permissible accuracy loss ) Semiconsistent: * 5 time slower then nonconsistent * made less affects on system compared to consistent Overall results: Our clients was pleased with this implementation. Implementation is made with backward compatibility, as a conclusion old pg_buffercache v1.1 queries will work well. Semiconsistent show results approaching to nonconsistent on SELECTONLY queries. * this values were obtained from our DBA tests. What can be better =============== It is unclear how to optimize the semiconsistent method to make it faster, and reduce temporary effect that appears from time to time. I will be glad to see your feedback! --- Ivan Kartyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Aug 31, 2016 at 8:27 PM, Ivan Kartyshov <i.kartyshov@postgrespro.ru> wrote: > Recently I have finished my work on a patch for pg_buffercache contrib, I > think it's time to share my results. Thanks for sharing your results. > V1.2 implementation contains flexible loop which can collect shared memory > statistic using three different methods: > 1) with holding LWLock only on one partition of shared memory > (semiconsistent method) > 2) without LWLocks (nonconsistent method), > 3) or in vanilia way (consistent method) > > The aforementioned allow us to launch pg_buffercache in the three different > ways. I am a little skeptical about the idea of offering the user three different choices about how to do this. That seems like it is exposing what ought to be an internal implementation detail and I'm not sure users will really know how to make an intelligent choice between the three options. But of course others may have a different view on that. I wonder whether we ought to just switch from the consistent method to the semiconsistent method and call it good. I agree with you that taking every buffer partition lock simultaneously seems like too much locking. And in the future if we replace the buffer mapping hash with something that is lock-free or close to it, then we wouldn't even have buffer partition locks any more, and probably no way at all to get an entirely consistent snapshot. What do you think of this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-09-02 08:31:42 +0530, Robert Haas wrote: > I wonder whether we ought to just switch from the consistent method to > the semiconsistent method and call it good. +1. I think, before long, we're going to have to switch away from having locks & partitions in the first place. So I don't see a problem relaxing this. It's not like that consistency really buys you anything... I'd even consider not using any locks. Andres
On Thu, Sep 1, 2016 at 8:19 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-09-02 08:31:42 +0530, Robert Haas wrote: >> I wonder whether we ought to just switch from the consistent method to >> the semiconsistent method and call it good. > > +1. I think, before long, we're going to have to switch away from having > locks & partitions in the first place. So I don't see a problem relaxing > this. It's not like that consistency really buys you anything... I'd > even consider not using any locks. Right. ISTM that the consistency guarantee was added on the off chance that it mattered, without actually being justified. I would like to be able to run pg_buffercache in production from time to time. -- Peter Geoghegan
On Fri, Sep 2, 2016 at 8:49 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-09-02 08:31:42 +0530, Robert Haas wrote: >> I wonder whether we ought to just switch from the consistent method to >> the semiconsistent method and call it good. > > +1. I think, before long, we're going to have to switch away from having > locks & partitions in the first place. So I don't see a problem relaxing > this. It's not like that consistency really buys you anything... I'd > even consider not using any locks. I think we certainly want to lock the buffer header, because otherwise we might get a torn read of the buffer tag, which doesn't seem good. But it's not obvious to me that there's any point in taking the lock on the buffer mapping partition; I'm thinking that doesn't really do anything unless we lock them all, and we all seem to agree that's going too far. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/02/2016 06:01 AM, Robert Haas wrote: > I wonder whether we ought to just switch from the consistent method to > the semiconsistent method and call it good. I agree with you that > taking every buffer partition lock simultaneously seems like too much > locking. And in the future if we replace the buffer mapping hash with > something that is lock-free or close to it, then we wouldn't even have > buffer partition locks any more, and probably no way at all to get an > entirely consistent snapshot. > What do you think of this? I fully agree with you that it would be preferred in the future to replace the buffer mapping hash with some of lock-free algorithms. In the question of replacing the consistent method I agree with you, Andres Freund and Peter Geoghegan: the consistent method does not bring any considerable benefits. You might be right regarding the three different modes, but our DBAs asked if we could preserve a legacy mode too, thus the choice. On 09/02/2016 06:19 AM, Andres Freund wrote: > +1. I think, before long, we're going to have to switch away from having >locks & partitions in the first place. So I don't see a problem relaxing > this. It's not like that consistency really buysyou anything... I'd > even consider not using any locks. If we will replace consistent method, then we should replace it with the partially consistent method (called "nonconsistent") because: 1) it's based on fast spinlocks (it's not fully up to its name, though) 2) it's *almost* the fastest one (the less time needed for execution of method, the less data core will change and as a consequence the more consistent snapshot will be) 3) and it has less influence on the entire system during query processing. On 09/02/2016 06:30 AM, Peter Geoghegan wrote: > I would like to be able to run pg_buffercache in production from time >to time. Yes, in our experience the usage of fully consistent pg_buffercache in production is quite a courageous decision. --- Ivan Kartyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hi Ivan, This patch needs a rebase, as 06d7fd6e bumped the version to 1.2. On 09/02/2016 01:38 PM, Ivan Kartyshov wrote: > On 09/02/2016 06:01 AM, Robert Haas wrote: >> I wonder whether we ought to just switch from the consistent method to >> the semiconsistent method and call it good. I agree with you that >> taking every buffer partition lock simultaneously seems like too much >> locking. And in the future if we replace the buffer mapping hash with >> something that is lock-free or close to it, then we wouldn't even have >> buffer partition locks any more, and probably no way at all to get an >> entirely consistent snapshot. >> What do you think of this? > > I fully agree with you that it would be preferred in the future to > replace the buffer mapping hash with some of lock-free algorithms. > > In the question of replacing the consistent method I agree with you, > Andres Freund and Peter Geoghegan: the consistent method does not bring > any considerable benefits. > > You might be right regarding the three different modes, but our DBAs > asked if we could preserve a legacy mode too, thus the choice. > > On 09/02/2016 06:19 AM, Andres Freund wrote: > > +1. I think, before long, we're going to have to switch away from having > > locks & partitions in the first place. So I don't see a problem relaxing > > this. It's not like that consistency really buys you anything... I'd > > even consider not using any locks. > > If we will replace consistent method, then we should replace it with the > partially consistent method (called "nonconsistent") because: > 1) it's based on fast spinlocks (it's not fully up to its name, though) In other words, it does exactly the thing proposed up-thread, i.e. locking only buffer headers. What do you mean by fast spinlocks? And why aren't they up to the name? > 2) it's *almost* the fastest one (the less time needed for execution of > method, the less data core will change and as a consequence the more > consistent snapshot will be) I'm not sure I understand what you're saying here? What do you mean by "almost the fastest one"? I'm a bit confused by the results you reported before, i.e. that 1) nonconsistent is 10% faster then consistent method 2) semiconsistent is 5-times slower then nonconsistent method What does that mean? Are you refering to duration of the queries reading data from pg_buffercache, or to impact on the other queries? How can be semiconsistent 5x slower than nonconsistent? Wouldn't that make it significantly slower than the consistent method? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/03/2016 05:04 AM, Tomas Vondra wrote:> This patch needs a rebase, as 06d7fd6e bumped the version to 1.2. Thank you for a valuable hint. > > If we will replace consistent method, then we should replace it with the> > partially consistent method (called "nonconsistent") because:> > 1) it's based on fast spinlocks (it's not fullyup to its name, though)>> In other words, it does exactly the thing proposed up-thread, i.e.> locking only buffer headers.>>What do you mean by fast spinlocks? And why aren't they up to the name? Not they (spinlocks), but the name “nonconsistent” is somewhat misleading. At the moment we can't implement concurrent shared memory access without locks in general, so most inconsistent method that has been proposed was the "nonconsistent" one. But roughly speaking *nonconsistent* is not as such by the name, because it contains a locking mechanism (spinlocks). > > 2) it's *almost* the fastest one (the less time needed for execution of> > method, the less data core will change andas a consequence the more> > consistent snapshot will be)>> I'm not sure I understand what you're saying here? What doyou mean by> "almost the fastest one"? I mean the fastest one that has been proposed ("consistent"| "semiconsistent"| "nonconsistent"). > I'm a bit confused by the results you reported before, i.e. that>> 1) nonconsistent is 10% faster than consistent method>2) semiconsistent is 5-times slower than nonconsistent method>> What does that mean? Are you refering to durationof the queries reading> data from pg_buffercache, or to impact on the other queries? Here I mean "working duration time". > How can be semiconsistent 5x slower than nonconsistent? Wouldn't that> make it significantly slower than the consistentmethod? Firstly, when we want to measure the quality of pg_buffercache, we must measure several values: 1) Execution time (duration of the queries reading data from pg_buffercache) 2) How it influences the system (and other queries) during its work Secondly, the semiconsistent is slower than nonconsistent and consistent method, but it makes less influence on other queries then consistent. Thirdly, it is slower because it locks the partitions of shared memory in a different way than in consistent or nonconsistent methods. The semi-consistent strategy implies that only one buffer is locked at a time. Therefor has no significant effect on the system, but it takes more time. --- Ivan Kartyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 02/09/16 15:19, Andres Freund wrote: > On 2016-09-02 08:31:42 +0530, Robert Haas wrote: >> I wonder whether we ought to just switch from the consistent method to >> the semiconsistent method and call it good. > +1. I think, before long, we're going to have to switch away from having > locks & partitions in the first place. So I don't see a problem relaxing > this. It's not like that consistency really buys you anything... I'd > even consider not using any locks. > +1 as well. When I wrote the original module I copied the design of the pg_locks view - as it was safe and consistent. Now it is clear that the ability to look at (semi-consistent) contents of the buffer cache is more important than having a theoretically correct point in time snapshot. Go for it :-) regards Mark
On 09/05/2016 06:19 PM, Ivan Kartyshov wrote: > On 09/03/2016 05:04 AM, Tomas Vondra wrote: >> This patch needs a rebase, as 06d7fd6e bumped the version to 1.2. > Thank you for a valuable hint. > So, will we get a rebased patch? I see the patch is back in 'needs review' but there's no new version. >> > If we will replace consistent method, then we should replace it with > the >> > partially consistent method (called "nonconsistent") because: >> > 1) it's based on fast spinlocks (it's not fully up to its name, though) >> >> In other words, it does exactly the thing proposed up-thread, i.e. >> locking only buffer headers. >> >> What do you mean by fast spinlocks? And why aren't they up to the name? > > Not they (spinlocks), but the name “nonconsistent” is somewhat > misleading. At the moment we can't implement concurrent shared memory > access without locks in general, so most inconsistent method that has > been proposed was the "nonconsistent" one. But roughly speaking > *nonconsistent* is not as such by the name, because it contains a > locking mechanism (spinlocks). > I'm a bit confused by the naming, but my understanding is that nonconsistent only acquires (spin)locks on individual buffers, so it does not have a consistent view on shared_buffers as a whole, but data about individual buffers are consistent. I think that's fine and perfectly acceptable for this extension. >> > 2) it's *almost* the fastest one (the less time needed for execution of >> > method, the less data core will change and as a consequence the more >> > consistent snapshot will be) >> >> I'm not sure I understand what you're saying here? What do you mean by >> "almost the fastest one"? > > I mean the fastest one that has been proposed ("consistent"| > "semiconsistent"| "nonconsistent"). > >> I'm a bit confused by the results you reported before, i.e. that >> >> 1) nonconsistent is 10% faster than consistent method >> 2) semiconsistent is 5-times slower than nonconsistent method >> >> What does that mean? Are you refering to duration of the queries reading >> data from pg_buffercache, or to impact on the other queries? > > Here I mean "working duration time". > I have no idea what "working duration time" is. >> How can be semiconsistent 5x slower than nonconsistent? Wouldn't that >> make it significantly slower than the consistent method? > > Firstly, when we want to measure the quality of pg_buffercache, we must > measure several values: > 1) Execution time (duration of the queries reading data from > pg_buffercache) > 2) How it influences the system (and other queries) during its work > > Secondly, the semiconsistent is slower than nonconsistent and consistent > method, but it makes less influence on other queries then consistent. > > Thirdly, it is slower because it locks the partitions of shared memory > in a different way than in consistent or nonconsistent methods. > The semi-consistent strategy implies that only one buffer is locked at a > time. Therefor has no significant effect on the system, but it takes > more time. > It'd be really useful if you could provide actual numbers, explain what metrics you compare and how. I'm sure it makes sense to you but it's utterly confusing for everyone else, and it also makes it impossible to reproduce the benchmark. I've looked at the code (applied on top of e3b607c) and I see this in pg_buffercache_pages: for (j = 0; j < num_partit; j++) { if (snaptype == CONS_SNAP) for (i = 0; i < NUM_BUFFER_PARTITIONS;i++) LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED); else if (snaptype == SEMICONS_SNAP) LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED); for (i = 0; i < NBuffers; i++) { ... walk through all shared buffers ... } } How is the SEMICONS_SNAP case supposed to only scan the currently locked partition, when we simply scans all shared buffers for SEMICONS_SNAP? That seems outright broken, it should scan only the currently locked partition. Secondly, I see this bit added to the loop over buffers: if (bufHdr->tag.forkNum == -1) { fctx->record[i].blocknum = InvalidBlockNumber; continue; } and I have no idea why this is needed (when it was not needed before). The patch also breaks a few comments, because it adds code between the comment and the related code. For example the (likely unnecessary) bit of code changes this: /* Lock each buffer header before inspecting. */ buf_state = LockBufHdr(bufHdr); into this /* Lock each buffer header before inspecting. */ if (bufHdr->tag.forkNum == -1) { fctx->record[i].blocknum= InvalidBlockNumber; continue; } buf_state = LockBufHdr(bufHdr); which is confusing (and I'd argue broken). Similarly when releasing the locks, the original comment/code block looks like this: /* * And release locks. We do this in reverse order for two reasons: * (1) Anyone else who needs more than one ofthe locks will be trying * to lock them in increasing order; we don't want to release the * other process until itcan get all the locks it needs. (2) This * avoids O(N^2) behavior inside LWLockRelease. */ for (i = NUM_BUFFER_PARTITIONS;--i >= 0;) LWLockRelease(BufMappingPartitionLockByIndex(i)); but the patch changes is to this: if (snaptype == CONS_SNAP) for (i = NUM_BUFFER_PARTITIONS; --i >= 0;) LWLockRelease(BufMappingPartitionLockByIndex(i)); else if (snaptype == SEMICONS_SNAP) /* * And release locks. Wedo this in reverse order for two reasons: * (1) Anyone else who needs more than one of the locks will be * tryingto lock them in increasing order; we don't want to * release the other process until it can get all the locks it * needs. (2) This avoids O(N^2) behavior inside LWLockRelease. */ LWLockRelease(BufMappingPartitionLockByIndex(j)); Notice how the comment is moved from the loop to the block that only unlocks a single partition. That clearly makes no sense, as the comment is about the CONS_SNAP loop. Now that I read the comment, I wonder whether the semiconsistent case (locking one partition at a time) changes this reverse-order unlocking behavior or not - clearly, the guarantee that the other process will get all the locks at once is no longer true, as pg_buffercache will likely hold the next one. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/02/2016 11:01 AM, Robert Haas wrote: > On Fri, Sep 2, 2016 at 8:49 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-09-02 08:31:42 +0530, Robert Haas wrote: >>> I wonder whether we ought to just switch from the consistent method to >>> the semiconsistent method and call it good. >> >> +1. I think, before long, we're going to have to switch away from having >> locks & partitions in the first place. So I don't see a problem relaxing >> this. It's not like that consistency really buys you anything... I'd >> even consider not using any locks. > > I think we certainly want to lock the buffer header, because otherwise > we might get a torn read of the buffer tag, which doesn't seem good. > But it's not obvious to me that there's any point in taking the lock > on the buffer mapping partition; I'm thinking that doesn't really do > anything unless we lock them all, and we all seem to agree that's > going too far. +1 from me to only locking the buffer headers. IMHO that's perfectly fine for the purpose of this extension. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 20, 2016 at 7:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 09/02/2016 11:01 AM, Robert Haas wrote: >> >> On Fri, Sep 2, 2016 at 8:49 AM, Andres Freund <andres@anarazel.de> wrote: >>> >>> On 2016-09-02 08:31:42 +0530, Robert Haas wrote: >>>> >>>> I wonder whether we ought to just switch from the consistent method to >>>> the semiconsistent method and call it good. >>> >>> >>> +1. I think, before long, we're going to have to switch away from having >>> locks & partitions in the first place. So I don't see a problem relaxing >>> this. It's not like that consistency really buys you anything... I'd >>> even consider not using any locks. >> >> I think we certainly want to lock the buffer header, because otherwise >> we might get a torn read of the buffer tag, which doesn't seem good. >> But it's not obvious to me that there's any point in taking the lock >> on the buffer mapping partition; I'm thinking that doesn't really do >> anything unless we lock them all, and we all seem to agree that's >> going too far. > > +1 from me to only locking the buffer headers. IMHO that's perfectly fine > for the purpose of this extension. So, I think we have agreement on the way forward here, but what we don't have is a committable patch. I'm willing to commit one before the end of this CommitFest if somebody produces one RSN; otherwise, this is going to have to go into the "Returned with Feedback" bucket. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello everyone, patch was rebased. Thank you Tomas for your reviewing this patch and for your valuable comments. From the very beginning we had the misunderstanding with the naming of meethods. > It'd be really useful if you could provide actual numbers, explain what > metrics you compare and how. I'm sure it makes sense to you but it's > utterly confusing for everyone else, and it also makes it impossible to > reproduce the benchmark. I test it as I wrote earlier, I run it a several times collecting TPS in each series, and calculate average value. > Secondly, I see this bit added to the loop over buffers: > > if (bufHdr->tag.forkNum == -1) > { > fctx->record[i].blocknum = InvalidBlockNumber; > continue; > } > > and I have no idea why this is needed (when it was not needed before). This helps to skip not used bufferpages. It is valuable on big and not warmup shared memory. On 09/02/2016 12:01 PM, Robert Haas wrote: > I think we certainly want to lock the buffer header, because otherwise > we might get a torn read of the buffer tag, which doesn't seem good. > But it's not obvious to me that there's any point in taking the lock > on the buffer mapping partition; I'm thinking that doesn't really do > anything unless we lock them all, and we all seem to agree that's > going too far. Replace consistent method with semiconsistent (that lock buffer headers without partition lock). Made some additional fixes thanks to reviewers. -- Ivan Kartyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 09/29/2016 02:45 AM, Ivan Kartyshov wrote: > > Secondly, I see this bit added to the loop over buffers: > > > > if (bufHdr->tag.forkNum == -1) > > { > > fctx->record[i].blocknum = InvalidBlockNumber; > > continue; > > } > > > > and I have no idea why this is needed (when it was not needed before). > > This helps to skip not used bufferpages. It is valuable on big and not > warmup shared memory. That seems like an unrelated change. Checking for forkNum, instead of e.g. BM_VALID seems a bit surprising, too. > On 09/02/2016 12:01 PM, Robert Haas wrote: >> I think we certainly want to lock the buffer header, because otherwise >> we might get a torn read of the buffer tag, which doesn't seem good. >> But it's not obvious to me that there's any point in taking the lock >> on the buffer mapping partition; I'm thinking that doesn't really do >> anything unless we lock them all, and we all seem to agree that's >> going too far. > > Replace consistent method with semiconsistent (that lock buffer headers > without partition lock). Made some additional fixes thanks to reviewers. This patch also does: +++ b/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql @@ -0,0 +1,6 @@ +/* contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_buffercache(text) UPDATE TO '1.3'" to load this file. \quit + +ALTER FUNCTION pg_buffercache_pages() PARALLEL SAFE; Why? That seems unrelated to what's been discussed in this thread. I have committed the straightforward part of this, removing the partition-locking. I think we're done here for this commitfest, but feel free to post new patches for that PARALLEL SAFE and the quick-check for unused buffers, if you think it's worthwhile. - Heikki