Thread: Readme of Buffer Management seems to have wrong sentence
While going through the README in backend\storage\buffer, I found a point misleading.

    Normal Buffer Replacement Strategy
    ----------------------------------
    ...
    Each buffer header contains a usage counter, which is incremented (up to a
    small limit value) whenever the buffer is unpinned. (This requires only the
    buffer header spinlock, which would have to be taken anyway to decrement the
    buffer reference count, so it's nearly free.)

I have checked the code, and the logic is actually that the usage counter is increased when the buffer is pinned.

Another doubt: why, in function BufferAlloc, does it need to hold the BufFreelistLock until it pins the buffer (which increases its reference count)? As what I understood, BufFreelistLock is to protect the StrategyControl structure, and till it finds
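For readers following along, the behavior Amit's report points at can be sketched in a few lines (a simplified model, not the actual PostgreSQL structures: only BM_MAX_USAGE_COUNT is a real constant name, and the real code does this under the buffer-header spinlock):

```c
#include <assert.h>

#define BM_MAX_USAGE_COUNT 5  /* PostgreSQL's cap on the usage counter */

typedef struct BufferDesc {
    int refcount;     /* number of concurrent pins */
    int usage_count;  /* clock-sweep popularity counter */
} BufferDesc;

/* Pinning a buffer bumps both the reference count and, contrary to the
 * old README wording, the usage count (capped at BM_MAX_USAGE_COUNT). */
static void pin_buffer(BufferDesc *buf)
{
    buf->refcount++;
    if (buf->usage_count < BM_MAX_USAGE_COUNT)
        buf->usage_count++;
}

/* Unpinning only drops the reference count. */
static void unpin_buffer(BufferDesc *buf)
{
    assert(buf->refcount > 0);
    buf->refcount--;
}
```

So the README's "incremented whenever the buffer is unpinned" was describing the pin side of the operation.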
On Tue, May 8, 2012 at 9:37 PM, Amit Kapila <amit.kapila@huawei.com> wrote: > I have checked the code and logic according to which usage counter is > increased when the buffer is pinned. Fixed, thanks for the report. > Another Doubt : Why in function BufferAlloc, it needs to hold the > BufFreelistLock till it pin the buffer which increases its reference count. Well, I think the problem is that, if we didn't do that, then, in theory, the strategy point could wrap all the way around shared_buffers and someone else could pin the buffer, and then we'd be hosed. Mind you, I think this whole area of the code needs some reengineering for better performance, but I'm not sure this is the right place to start. What I think is really bad is that we're forcing every BufferAlloc() to iterate over buffers checking whether each one is evictable. I think we ought to put only those buffers that we think are likely to be evictable on the freelist, and then the actual buffer eviction code would only need to recheck that nothing had changed, instead of needing to scan over a potentially quite large number of buffers that never had any chance of being selected in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
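The per-allocation work Robert is criticizing is essentially the GCLOCK victim-search loop. A stripped-down sketch (hypothetical names; the real code also handles locking, dirty buffers, and buffer-mapping updates, all omitted here) assuming at least one unpinned buffer exists:

```c
#include <assert.h>

#define NBUFFERS 8

typedef struct Buf {
    int refcount;     /* pinned if > 0 */
    int usage_count;
} Buf;

static Buf buffers[NBUFFERS];
static int next_victim = 0;   /* models StrategyControl->nextVictimBuffer */

/* Every allocation may sweep over many buffers, decrementing usage
 * counts as it goes, until it finds one that is unpinned and has a
 * zero usage count.  This scan is the work being criticized above. */
static int clock_sweep_victim(void)
{
    for (;;)
    {
        Buf *buf = &buffers[next_victim];
        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return (int)(buf - buffers);   /* evictable: take it */
            buf->usage_count--;                /* popular: spare it once */
        }
    }
}
```

Robert's proposal amounts to moving this loop's scanning work out of the foreground path, leaving only a cheap recheck.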
Robert Haas <robertmhaas@gmail.com> writes: > Mind you, I think this whole area of the code needs some reengineering > for better performance, but I'm not sure this is the right place to > start. What I think is really bad is that we're forcing every > BufferAlloc() to iterate over buffers checking whether each one is > evictable. Well, keep in mind that that action is not merely there to obtain a victim buffer; it is also maintaining the global LRU state (by decrementing the usage counts of buffers it passes over). I don't think you can change it to simply look only at a predetermined freelist without seriously compromising the overall quality of our buffer replacement decisions. regards, tom lane
On Tue, May 22, 2012 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Mind you, I think this whole area of the code needs some reengineering >> for better performance, but I'm not sure this is the right place to >> start. What I think is really bad is that we're forcing every >> BufferAlloc() to iterate over buffers checking whether each one is >> evictable. > > Well, keep in mind that that action is not merely there to obtain a > victim buffer; it is also maintaining the global LRU state (by > decrementing the usage counts of buffers it passes over). I don't think > you can change it to simply look only at a predetermined freelist > without seriously compromising the overall quality of our buffer > replacement decisions. The idea would be to have a background process (like bgwriter) maintain the global LRU state and push candidate buffers onto the freelist. Then foreground processes can just pop them off the list and recheck that they haven't been pinned meanwhile. As long as we don't let the background sweep get too far ahead of actual allocation needs, I don't think this would change the quality of buffer allocation much at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, May 22, 2012 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, keep in mind that that action is not merely there to obtain a >> victim buffer; it is also maintaining the global LRU state (by >> decrementing the usage counts of buffers it passes over). I don't think >> you can change it to simply look only at a predetermined freelist >> without seriously compromising the overall quality of our buffer >> replacement decisions. > The idea would be to have a background process (like bgwriter) > maintain the global LRU state and push candidate buffers onto the > freelist. Amit was trying to convince me of the same idea at PGCon, but I don't buy it. bgwriter doesn't scan the buffer array nearly fast enough to provide useful adjustment of the usage counts under load. And besides if the decrements are decoupled from the allocation requests it's no longer obvious that the algorithm is even an approximation of LRU. But the larger issue here is that if that processing is a bottleneck (which I agree it is), how does it help to force a single process to be responsible for it? Any real improvement in scalability here will need to decentralize the operation more, not less. My own thoughts about this had pointed in the direction of getting rid of the central freelist entirely, instead letting each backend run its own independent clock sweep as needed. The main problem with that is that if there's no longer any globally-visible clock sweep state, it's pretty hard to figure out what the control logic for the bgwriter should look like. Maybe it would be all right to have global variables that are just statistics counters for allocations and buffers swept over, which backends would need to spinlock for just long enough to increment the counters at the end of each buffer allocation. regards, tom lane
On Tue, May 22, 2012 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The idea would be to have a background process (like bgwriter) >> maintain the global LRU state and push candidate buffers onto the >> freelist. > > Amit was trying to convince me of the same idea at PGCon, but I don't > buy it. bgwriter doesn't scan the buffer array nearly fast enough to > provide useful adjustment of the usage counts under load. And besides > if the decrements are decoupled from the allocation requests it's no > longer obvious that the algorithm is even an approximation of LRU. Well, bgwriter is *supposed* to anticipate which buffers are about to be allocated and clean any of those that are dirty. Having it decrement the usage counts and stuff the resulting list of buffers into a linked list seems like a pretty reasonable extension of that, assuming that it works in the first place. If it doesn't, then we need a rethink. > But the larger issue here is that if that processing is a bottleneck > (which I agree it is), how does it help to force a single process to > be responsible for it? Any real improvement in scalability here will > need to decentralize the operation more, not less. Sure. I think we could have the freelist and the clock sweep protected by different locks. The background writer would lock out other people running the clock sweep, but the freelist could be protected by a spinlock which no one would ever need to take for more than a few cycles. Right there, you should get a significant scalability improvement, since the critical section would be so much shorter than it is now. If that's not enough, you could have several freelists protected by different spinlocks; the bgwriter would put 1/Nth of the reusable buffers on each freelist, and backends would pick a freelist at random to pull buffers off of. 
> My own thoughts about this had pointed in the direction of getting rid > of the central freelist entirely, instead letting each backend run its > own independent clock sweep as needed. The main problem with that is > that if there's no longer any globally-visible clock sweep state, it's > pretty hard to figure out what the control logic for the bgwriter should > look like. Maybe it would be all right to have global variables that > are just statistics counters for allocations and buffers swept over, > which backends would need to spinlock for just long enough to increment > the counters at the end of each buffer allocation. Hmm, that's certainly an interesting idea. I fear that if the clock sweeps from the different backends ended up too closely synchronized, you would end up evicting whatever was in the way, be it hot or cold. It might almost be better to have individual backends choose buffers to evict at random; if the chosen buffer isn't evictable, we decrement its usage count and pick another one, also at random. With respect to the control logic for the background writer, one idea I had was to get rid of the idea that the background writer's job is to write in advance of the strategy point. Instead, every time the clock sweep passes over a dirty buffer that is otherwise evictable, we add it to a queue of things that the bgwriter should clean. Those buffers, once cleaned, go on the free list. Maybe some variant of that could work with your idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
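The partitioned-freelist variant Robert floats above could be sketched like this (entirely hypothetical structures and names; real PostgreSQL freelists are threaded through the buffer headers, and each partition would be guarded by its own spinlock, which is omitted here):

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_FREELISTS 4
#define LIST_CAPACITY 64

/* Each partition holds its own stack of reusable buffer ids; with one
 * spinlock per partition, foreground backends would contend roughly
 * 1/Nth as often as with a single list. */
typedef struct Freelist {
    int ids[LIST_CAPACITY];
    int top;
} Freelist;

static Freelist freelists[NUM_FREELISTS];

/* bgwriter side: spread reusable buffers across the partitions */
static void freelist_push(int buf_id)
{
    Freelist *fl = &freelists[buf_id % NUM_FREELISTS];
    assert(fl->top < LIST_CAPACITY);
    fl->ids[fl->top++] = buf_id;
}

/* backend side: pick a partition at random; fall back to scanning the
 * others if the chosen one is empty */
static int freelist_pop(void)
{
    int start = rand() % NUM_FREELISTS;
    for (int i = 0; i < NUM_FREELISTS; i++)
    {
        Freelist *fl = &freelists[(start + i) % NUM_FREELISTS];
        if (fl->top > 0)
            return fl->ids[--fl->top];
    }
    return -1;   /* all partitions empty: caller must run the clock sweep */
}
```

A popped buffer would still need the recheck Robert mentions (that nobody pinned it in the meantime) before it is actually evicted.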
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, May 22, 2012 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> My own thoughts about this had pointed in the direction of getting rid >> of the central freelist entirely, instead letting each backend run its >> own independent clock sweep as needed. > Hmm, that's certainly an interesting idea. I fear that if the clock > sweeps from the different backends ended up too closely synchronized, > you would end up evicting whatever was in the way, be it hot or cold. > It might almost be better to have individual backends choose buffers > to evict at random; if the chosen buffer isn't evictable, we decrement > its usage count and pick another one, also at random. Hmm ... yeah, in principle that could be better. It's still a clock sweep but following a hard-to-predict ordering of the buffers. Being hard to predict doesn't necessarily guarantee no bad behavior though. Maybe we could do something a bit cheaper than random() and more amenable to analysis, that would still ensure that backends couldn't end up with closely sync'd scans. I'm imagining advancing not 1 buffer each time, but K buffers where each backend will use a different value of K. Perhaps backend with backendID N could use the N'th prime number, say. You'd want something relatively prime to shared_buffers to guarantee that the backend can visit all the buffers, and just making it prime would do fine. > With respect to the control logic for the background writer, one idea > I had was to get rid of the idea that the background writer's job is > to write in advance of the strategy point. Instead, every time the > clock sweep passes over a dirty buffer that is otherwise evictable, we > add it to a queue of things that the bgwriter should clean. Those > buffers, once cleaned, go on the free list. Maybe some variant of > that could work with your idea. 
Both ends of that imply centralized data structures that could become contention bottlenecks, so it doesn't sound tremendously appealing to me. Maybe it can work as long as nobody has to lock the lists for long, but I'd rather think of approaches with no central point of contention. regards, tom lane
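Tom's prime-stride idea above rests on the stride being relatively prime to shared_buffers, so that each backend's sweep still visits every buffer; a small sketch with hypothetical helper names (the real scheme would presumably pick the N'th prime for backend N, as he suggests):

```c
#include <assert.h>

static const int primes[] = {2, 3, 5, 7, 11, 13, 17, 19};

static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

/* Give each backend a different stride, skipping any prime that
 * happens to divide nbuffers (e.g. 2 when nbuffers is even). */
static int sweep_stride(int backend_id, int nbuffers)
{
    for (int i = 0; i < 8; i++) {
        int k = primes[(backend_id + i) % 8];
        if (gcd(k, nbuffers) == 1)
            return k;
    }
    return 1;   /* fall back to the plain sweep */
}

/* How many distinct buffers does a stride visit in nbuffers steps?
 * A coprime stride covers all of them exactly once per rotation. */
static int coverage(int stride, int nbuffers)
{
    int visited[1024] = {0}, pos = 0, count = 0;
    assert(nbuffers <= 1024);
    for (int i = 0; i < nbuffers; i++) {
        if (!visited[pos]) { visited[pos] = 1; count++; }
        pos = (pos + stride) % nbuffers;
    }
    return count;
}
```

Since different backends advance by different amounts from wherever they start, their sweeps drift apart rather than marching in lockstep.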
On Tue, May 22, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> With respect to the control logic for the background writer, one idea >> I had was to get rid of the idea that the background writer's job is >> to write in advance of the strategy point. Instead, every time the >> clock sweep passes over a dirty buffer that is otherwise evictable, we >> add it to a queue of things that the bgwriter should clean. Those >> buffers, once cleaned, go on the free list. Maybe some variant of >> that could work with your idea. > > Both ends of that imply centralized data structures that could become > contention bottlenecks, so it doesn't sound tremendously appealing to > me. Maybe it can work as long as nobody has to lock the lists for long, > but I'd rather think of approaches with no central point of contention. If we're going to throw our current algorithm over wholesale, I'd rather use some approach that has been demonstrated to work well in other systems. Buffer eviction is a problem that's been around since the 1970s, and our algorithm is just about that old. I realize that there are (legitimate) concerns about what might be patented, but hopefully that doesn't mean we're not allowed to consult the literature in any way. If that were the case, we wouldn't have SSI. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > If we're going to throw our current algorithm over wholesale, I'd > rather use some approach that has been demonstrated to work well in > other systems. Buffer eviction is a problem that's been around since > the 1970s, and our algorithm is just about that old. Um, if you're claiming that that code dates from Berkeley days, you're quite mistaken. We adopted the clock sweep in 2005, after trying a few other things whose performance was worse. I've not seen any argument in this thread that suggests we should abandon clock sweep + usage counts entirely. Rather, to me the issue is that we haven't completely gotten rid of the last vestiges of the old global freelist approach. BTW, it might be worth pointing out something I was trying to explain to Amit at PGCon: the key reason that we went with usage counters rather than something like a global LRU chain is that in the fast path where ReadBuffer finds the requested block already in a buffer, it does not have to contend for any global data structure to update the buffer's usage information. It just has to bump the usage count in the buffer's header. The free list, and the contention for BufFreelistLock, only matter in the slow path where you're going to have to incur I/O anyway (or at least a visit to the kernel). That seemed plenty good enough in 2005. Our ambitions have now advanced further, so I'm on board with trying to reduce contention here too, but I think it would be a mistake to make the fast case any slower. regards, tom lane
On Tue, May 22, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> If we're going to throw our current algorithm over wholesale, I'd >> rather use some approach that has been demonstrated to work well in >> other systems. Buffer eviction is a problem that's been around since >> the 1970s, and our algorithm is just about that old. > > Um, if you're claiming that that code dates from Berkeley days, you're > quite mistaken. Not the code: the algorithm. I believe that what we have implemented is GCLOCK, which is very old. In the interest of full disclosure, descriptions of exactly what GCLOCK is seem to vary a bit from paper to paper, but at least some papers describe the exact algorithm we use. > We adopted the clock sweep in 2005, after trying a few > other things whose performance was worse. I've not seen any argument in > this thread that suggests we should abandon clock sweep + usage counts > entirely. Rather, to me the issue is that we haven't completely gotten > rid of the last vestiges of the old global freelist approach. Well, I think that switching from one clock sweep to a clock sweep per backend would be basically an abandonment of the current approach. The results might be better or worse, but they'd surely be different. > BTW, it might be worth pointing out something I was trying to explain > to Amit at PGCon: the key reason that we went with usage counters rather > than something like a global LRU chain is that in the fast path where > ReadBuffer finds the requested block already in a buffer, it does not > have to contend for any global data structure to update the buffer's > usage information. It just has to bump the usage count in the buffer's > header. The free list, and the contention for BufFreelistLock, only > matter in the slow path where you're going to have to incur I/O anyway > (or at least a visit to the kernel). That seemed plenty good enough > in 2005. 
> Our ambitions have now advanced further, so I'm on board with > trying to reduce contention here too, but I think it would be a mistake > to make the fast case any slower. Totally agreed. We're not the first people to think of this, either: CLOCK and GCLOCK have been extensively studied and found to be almost as good as LRU in selecting good victim pages, but with less contention. That's why people are using them. Here's a paper that defines GCLOCK to be the algorithm we use (page 2, second column, second paragraph from the bottom), and furthermore mentions PostgreSQL (top of page 3): http://staff.aist.go.jp/m.yui/publications/ICDE10_conf_full_409.pdf -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 22, 2012 at 8:28 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Totally agreed. We're not the first people to think of this, either: > CLOCK and GCLOCK have been extensively studied and found to be almost > as good as LRU in selecting good victim pages, but with less > contention. That's why people are using them. > > Here's a paper that defines GCLOCK to be the algorithm we use (page 2, > second column, second paragraph from the bottom), and furthermore > mentions PostgreSQL (top of page 3): > > http://staff.aist.go.jp/m.yui/publications/ICDE10_conf_full_409.pdf Interesting paper, they make the whole buffer management lock-free. Getting rid of not only the BufFreelistLock but also the BufMappingLocks and buffer header spinlocks. Now AFAIK the BufMappingLocks and buffer header spinlocks aren't really a big issue for us, and at least the latter looks like turning it lock-free would entail lots of pretty hairy and bug-prone code. On the other hand, making the clock sweep lock-free looks relatively easy. As far as I can see there is no reason why nextVictimBuffer couldn't be read, incremented and atomically cmpxchg'ed to let others continue before checking for the usage count. Likewise completePasses and numBufferAllocs can easily be atomically incremented; bgwriter shouldn't mind if the values are slightly out of date. The free list itself is a bit trickier, but if it's still necessary/useful then SC->firstFreeBuffer and buf->freeNext are in effect a linked-list stack, and there should be plenty of tested lock-free algorithms floating around for that. (btw. lastFreeBuffer looks like dead code, is that correct?) As for better buffer management algorithms, have you read about CLOCK-Pro? [1] It looks like it's an adaptive variant of LIRS, the algorithm MySQL uses (or at least used some time ago). Looks like Linux kernel devs also at least thought about implementing it [2] [3] (hard to tell exactly, their docs are pretty chaotic compared to pg). 
According to [4], LIRS is almost as good as or better than LRU, by extension I'd expect CLOCK-Pro to be better than GLOCK. It still has a single clock mechanism, so better replacement policies won't help a whole lot with lock contention on buffer allocation. [1] http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf [2] http://linux-mm.org/ClockProApproximation [3] http://linux-mm.org/PageReplacementDesign [4] http://www.almaden.ibm.com/cs/people/dmodha/ARC.pdf Cheers, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
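The lock-free advance of nextVictimBuffer that Ants sketches in words could look roughly like this with C11 atomics (a toy model, not the actual StrategyControl code; it also simplifies his read/cmpxchg loop into an unconditional fetch-add on a free-running counter, from which a completePasses count could be derived as ticket / NBUFFERS):

```c
#include <assert.h>
#include <stdatomic.h>

#define NBUFFERS 128

/* Free-running sweep position; each claimant gets a unique ticket with
 * a single atomic instruction and maps it into the buffer array by
 * modulo, so concurrent sweepers never block one another. */
static atomic_uint next_victim = 0;

static unsigned claim_next_victim(void)
{
    unsigned ticket = atomic_fetch_add(&next_victim, 1);
    return ticket % NBUFFERS;
}
```

The buffer the claimant lands on may still be pinned or popular, so the caller would inspect it and, if it is unusable, simply claim the next ticket.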
>> And besides if the decrements are decoupled from the allocation requests it's no
>> longer obvious that the algorithm is even an approximation of LRU.

I was trying to highlight that we can do the clock sweep in bgwriter and keep the backends' logic as it is currently. The core idea is that it will reduce the work done by backends and improve their chances of getting a free buffer sooner than they do currently.

Some of the other ideas about it which I have discussed are:
1. Moving clean buffers to the freelist when any are found by bgwriter or checkpoint. This is to let backends get buffers early, without going into the clock sweep algorithm.
2. Having multiple lists of hot and cold buffers; backends will look for a buffer in the following order, if the required buffer is not already there:
   a. freelist
   b. cold list
   c. hot list
3. Experimenting with different values of the usage count under heavy-load scenarios, to check whether it makes any difference.

I will try to prototype these ideas and publish the results here, so that we can check whether they give any benefit. There are some other ideas in this thread which I shall also check, like:
1. Clock sweeps from the different backends ending up too closely synchronized. If this really causes problems, I will try to address it.
2. Not having the same lock for the whole algorithm for finding a free buffer.

I would like to build a prototype as follows:
1. Prepare or identify an existing test case where these problems would be more evident.
2. Change the code to implement the ideas in this mail thread and what I think can improve things.

Do you feel I can attempt to address this problem with some prototypes and discuss here after a few days when I have some results ready?
On 05/23/2012 11:36 AM, Amit Kapila wrote: > Do you feel I can attempt to address this problem with some prototypes and > discuss here after few days when I have some results ready. I don't think there is a clear picture yet of what benchmark to use for testing changes here. Items like "Consider adding buffers the background writer finds reusable to the free list" have been on the TODO list since 2007; neither ideas nor code are the problem here. Proving that a) a new policy helps on some workloads and b) it doesn't harm any important workload, those are the hard parts here. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
>> I don't think there is a clear picture yet of what benchmark to use for testing changes here.

I will first try to generate such a scenario (benchmark). I have still not thought it through completely. However, the idea in my mind is a scenario where the buffer list is heavily operated upon: operations where shared buffers are much smaller compared to the data on disk, and the operations are distributed such that they require accessing most of the data on disk randomly.

>> Proving that a) a new policy helps on some workloads

I thought that to prove this, I should write proof-of-concept code and then test it with appropriate tests, or do you think it needs to be proved some other way?

>> and b) it doesn't harm any important workload, those are the hard parts here

Do you have something in mind which needs to be taken care of or thought about before attempting the idea?
On Wed, May 23, 2012 at 10:33 AM, Amit Kapila <amit.kapila@huawei.com> wrote: >>>I don't think there is a clear picture yet of what benchmark to use for > testing changes here. > I will first try to generate such a scenario(benchmark). I have still not > thought completely. > However the idea in my mind is that scenario where buffer list is heavily > operated upon. > Operations where shared buffers are much less compare to the data in disk > and the operations are distributed such that > they require to access most of the data in disk randomly. If most buffer reads actually have to read from disk, then that will so throttle your throughput that you will not be able to make anything else be relevant. You need to have shared_buffers be much smaller than RAM, and have almost all the "disk" data resident in RAM but not in shared_buffers. Cheers, Jeff
On Wed, May 23, 2012 at 8:36 AM, Amit Kapila <amit.kapila@huawei.com> wrote: >>>And besides >>>if the decrements are decoupled from the allocation requests it's no >>>longer obvious that the algorithm is even an approximation of LRU. > > I was trying to highlight that we can do the clocksweep in bgwriter and keep > the backends logic as it is currently. > The core idea is that it will reduce the work of backends and chances of > them to get the free buffer early than currently will be more. Do we have any evidence that this is actually a problem which needs to be solved? I've seen plenty of evidence that contention for the lock can be a problem, but no evidence that the amount of work done under the lock, once it is obtained, is a problem. > Some of the other ideas about it which I have discussed are > 1. moving clean buffers to freelist when any found by bgwriter or > checkpoint. This is to get the buffers early by backends without going into > clock sweep algorithm. You would need to lock once to put the buffer on, and again to take it off. That increases the traffic through a contended lock, rather than decreasing it. So unless this is coupled with reducing the lock strength, I don't see how it can win. > > 2. having multiple lists of hot and cold buffers and backends will find the > buffers in following order if the required buffer is not already there > a. freelist > b. cold list > c. hot list > > 3. try to experiment with different values of usage count under heavy load > scenarios and check does it make any difference. One thing I wanted to play with is having newly read buffers get a usage count of 0 rather than 1. The problem is that there is no way to test it in enough different situations to convince people it would be a general improvement. Cheers, Jeff
>> You need to have shared_buffers be much smaller than RAM, and have almost all the "disk" data resident in RAM but not in shared_buffers.

Sure, this is a better way to generate heavy buffer activity.
Jeff Janes <jeff.janes@gmail.com> writes: > One thing I wanted to play with is having newly read buffers get a > usage count of 0 rather than 1. The problem is that there is no way > to test it in enough different situations to convince people it would > be a general improvement. Hmm ... ISTM that that was discussed back when we instituted buffer usage counts, and rejected on the grounds that a newly-read buffer could then have negligible life expectancy. The clock sweep might be just about to pass over it. By starting at 1, it's guaranteed to have at least 1 sweep cycle time in which it might accumulate more hits. In other words, we have a choice of whether a buffer's initial lifetime is between 0 and 1 sweep times, or between 1 and 2 sweep times; and the discrimination against an unlucky buffer position is infinite in the first case versus at most 2X in the second case. regards, tom lane
>> Do we have any evidence that this is actually a problem which needs to be solved?

I don't have very clear evidence; I haven't generated any profiling report yet. However, in the scenario where there are many buffers in the list and a backend has to traverse it, there should be a reasonable amount of work done under the lock.

>> You would need to lock once to put the buffer on, and again to take it off. That increases the traffic through a >> contended lock, rather than decreasing it.

But it shall reduce the time backends spend traversing the buffer list, as in some cases they will get a buffer directly from the freelist. I understand that some new contention will be created, but I wanted to see whether the contention removed can outweigh the new contention that gets created.

>> The problem is that there is no way to test it in enough different situations to convince people it would be a general >> improvement.

I think in this whole idea the main point was to have scenarios which can prove all the points are a win. This is the same point raised by Greg as well. My first thought was that generating heavy contention on shared buffers would show it up, but I still need to think more on it. Any more ideas and suggestions to generate the scenarios?
Do we have any evidence that this is actually a problem which needs to be solved? I've seen plenty of evidence that contention for the lock can be a problem, but no evidence that the amount of work done under the lock, once it is obtained, is a problem. > Some of the other ideas about it which I have discussed are > 1. moving clean buffers to freelist when any found by bgwriter or > checkpoint. This is to get the buffers early by backends without going into > clock sweep algorithm. You would need to lock once to put the buffer on, and again to take it off. That increases the traffic through a contended lock, rather than decreasing it. So unless this is coupled with reducing the lock strength, I don't see how it can win. > > 2. having multiple lists of hot and cold buffers and backends will find the > buffers in following order if the required buffer is not already there > a. freelist > b. cold list > c. hot list > > 3. try to experiment with different values of usage count under heavy load > scenarios and check does it make any difference. One thing I wanted to play with is having newly read buffers get a usage count of 0 rather than 1. The problem is that there is no way to test it in enough different situations to convince people it would be a general improvement. Cheers, Jeff
On Wed, May 23, 2012 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> One thing I wanted to play with is having newly read buffers get a
>> usage count of 0 rather than 1.  The problem is that there is no way
>> to test it in enough different situations to convince people it would
>> be a general improvement.
>
> Hmm ... ISTM that that was discussed back when we instituted buffer
> usage counts, and rejected on the grounds that a newly-read buffer could
> then have negligible life expectancy.  The clock sweep might be just
> about to pass over it.

I guess that could be the case if the buffer just came off of the
linked list, but that situation is very rare in general use (and the
sweep hand wouldn't be moving at all until the linked list is empty).
If it were allocated through the clock, then the sweep hand should
have just passed over it and so is unlikely to be just about to pass
over it again.

If the clock sweep is moving so fast that it makes nearly a complete
rotation in the time it takes to read a buffer from disk, then I think
the system is probably beyond redemption.  But I guess that that is
something to be tested for, if I can engineer a test.

> By starting at 1, it's guaranteed to have at
> least 1 sweep cycle time in which it might accumulate more hits.
>
> In other words, we have a choice of whether a buffer's initial lifetime
> is between 0 and 1 sweep times, or between 1 and 2 sweep times; and the
> discrimination against an unlucky buffer position is infinite in the
> first case versus at most 2X in the second case.

But the cost of preventing an occasional buffer from being unlucky is
that the length of the average clock sweep is almost doubled, and thus
it is also easier for hot-ish buffers to get accidentally evicted.
This last part could perhaps be ameliorated by having the usage
incremented by 2 rather than 1 each time a buffer hits.

Also, if the just-read buffer does get unlucky, either it won't be
needed again soon anyway and so it deserves to be unlucky, or the next
time it is needed it will probably still be in RAM, will be read in
quickly before the clock hand has moved much, and will have plenty of
time to accumulate new hits the next time around.

Cheers,

Jeff
On Wed, May 23, 2012 at 2:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, May 23, 2012 at 10:33 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>>> I don't think there is a clear picture yet of what benchmark to use for
>>> testing changes here.
>> I will first try to generate such a scenario (benchmark). I have still not
>> thought completely.
>> However the idea in my mind is that scenario where buffer list is heavily
>> operated upon.
>> Operations where shared buffers are much less compare to the data in disk
>> and the operations are distributed such that
>> they require to access most of the data in disk randomly.
>
> If most buffer reads actually have to read from disk, then that will
> so throttle your throughput that you will not be able to make anything
> else be relevant.  You need to have shared_buffers be much smaller
> than RAM, and have almost all the "disk" data resident in RAM but not
> in shared_buffers.

But this is pretty common, since we advise people to set
shared_buffers relatively low as compared to physical memory.  The
problem is visible on the graph I posted here:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

When the scale factor gets large enough to exceed shared_buffers,
performance peaks in the 36-44 client range.  When it's small enough
to fit in shared_buffers, performance continues to increase through 64
clients and even a bit beyond.  Note that the absolute *performance*
is not much worse with the larger scale factor, if you have only one
client.  It's the *scalability* that goes out the window.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, May 23, 2012 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm ... ISTM that that was discussed back when we instituted buffer
>> usage counts, and rejected on the grounds that a newly-read buffer could
>> then have negligible life expectancy.  The clock sweep might be just
>> about to pass over it.

> I guess that could be the case if the buffer just came off of the
> linked list, but that situation is very rare in general use (and the
> sweep hand wouldn't be moving at all until the linked list is empty).
> If it were allocated through the clock, then the sweep hand should
> have just passed over it and so is unlikely to be just about to pass
> over it again.

Hmm ... that's true as long as we have one central clock sweep.  With
the nearby discussion of per-backend sweeps I'm less sure that I believe
it.  Still, you might be right that it's worth experimenting with to see
if the average clock sweep distance can be shortened.

So, back to the discussion of what would make acceptable test case(s)
for this.  You might look back at the 2005 discussions to see what we
were using at the time of the last major rejiggering in this area.  I
concur with your point that we should look at cases where the bulk of
the database fits in kernel disk buffers but not shared_buffers.

			regards, tom lane
On Tue, May 22, 2012 at 11:36 PM, Ants Aasma <ants@cybertec.at> wrote:
> ... The free list
> itself is a bit trickier, but if it's still necessary/useful then
> SC->firstFreeBuffer and buf->freeNext are in effect a linked-list
> stack; there should be plenty of tested lock-free algorithms floating
> around for that. (btw. lastFreeBuffer looks like dead code, is that
> correct?)

Thinking about it a bit more: if the freelist is mostly empty, a
simpler alternative would be to make an unprotected read to check
SC->firstFreeBuffer and only acquire BufFreelistLock if there's
anything to pop.  This would reduce the lock-free parts to just
atomically incrementing a variable.

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de