Thread: BufFreelistLock
I think that the BufFreelistLock can be a contention bottleneck on a system
with a lot of CPUs that do a lot of shared-buffer allocations which can be
fulfilled by the OS buffer cache.  That is, read-mostly queries where the
working data set fits in RAM, but not in shared_buffers.  (You can always
increase shared_buffers, but that leads to other problems, and who wants to
spend their time micromanaging the size of shared_buffers as workloads
slowly change?)  I can't prove it is a contention bottleneck without first
solving the putative problem and timing the difference, but it is the
dominant blocking lock showing up under LWLOCK_STATS for one benchmark I've
done using 8 CPUs.

So I had two questions:

1) Would it be useful for BufFreelistLock to be partitioned, like
BufMappingLock, or via some kind of clever "virtual partitioning" that
could get the same benefit via another means?  I don't know if both the
linked list and the clock sweep would have to be partitioned, or if some
other arrangement could be made.

2) Could BufFreelistLock simply go away, by reducing it from a lwlock to a
spinlock?  Or at least in the most common paths?

For doing away with it, I think that any manipulation of the freelist is
short enough (just a few instructions) that it could be done under a
spinlock.  If you somehow obtained a pinned buffer or one with a nonzero
usage_count, you would have to retake the spinlock to look at the new head
of the chain, but the comments in StrategyGetBuffer suggest that that
should be rare or impossible.

For the clock sweep algorithm, I think you could access nextVictimBuffer
without any type of locking.  If a non-atomic increment causes an
occasional buffer to be skipped or examined twice, that doesn't seem like a
correctness problem.  When nextVictimBuffer gets reset to zero and
completePasses gets incremented, that would probably need to be protected
to prevent a double-increment of completePasses from throwing off the
background writer's usage estimations.  But again, a spinlock should be
enough for that, and it shouldn't occur all that often.

If potentially inaccurate non-atomic increments of numBufferAllocs are a
problem, it could be incremented under the same spinlock used to protect
the test of firstFreeBuffer that determines whether the freelist is empty.

Doing away with the lock without some form of partitioning might just move
the contention to the BufHdr spinlocks.  But if most of the processes
entering the code at about the same time perceive each other's increments
to nextVictimBuffer, they would all start out offset from each other and
shouldn't collide too badly.

Does any of this sound like it might be fruitful to look into?

Cheers,

Jeff
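A rough sketch of what the unlocked clock-hand idea above might look like.
This is illustration only, not the actual freelist.c code: the struct and
function names are invented, and it assumes storage/s_lock.h for slock_t
and SpinLockAcquire/Release and the NBuffers global from bufmgr.h.  Only
the wrap to zero takes the spinlock, so completePasses cannot be
double-incremented:

    /*
     * Sketch only -- not existing PostgreSQL code.  A single spinlock
     * stands in for BufFreelistLock, protecting only the freelist head,
     * completePasses, and numBufferAllocs; nextVictimBuffer is advanced
     * with no lock at all, tolerating occasional skipped or twice-visited
     * buffers.
     */
    typedef struct
    {
        slock_t     lock;               /* spinlock replacing BufFreelistLock */
        int         firstFreeBuffer;    /* head of freelist, or -1 if empty */
        int         nextVictimBuffer;   /* clock hand, advanced unlocked */
        uint32      completePasses;     /* protected by lock */
        uint32      numBufferAllocs;    /* protected by lock */
    } SketchStrategyControl;

    static volatile SketchStrategyControl *sketchCtl;

    /* Advance the clock hand; only the wrap to zero takes the spinlock. */
    static int
    sketch_clock_next(void)
    {
        int     victim = sketchCtl->nextVictimBuffer++;     /* non-atomic */

        if (victim >= NBuffers)
        {
            SpinLockAcquire(&sketchCtl->lock);
            if (sketchCtl->nextVictimBuffer >= NBuffers)
            {
                sketchCtl->nextVictimBuffer = 0;
                sketchCtl->completePasses++;    /* exactly once per wrap */
            }
            SpinLockRelease(&sketchCtl->lock);
            victim %= NBuffers;
        }
        return victim;      /* caller still checks the pin and usage_count */
    }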
Jeff Janes <jeff.janes@gmail.com> writes:
> I think that the BufFreelistLock can be a contention bottleneck on a
> system with a lot of CPUs that do a lot of shared-buffer allocations
> which can be fulfilled by the OS buffer cache.

Really?  buffer/README says

    The buffer management policy is designed so that BufFreelistLock need
    not be taken except in paths that will require I/O, and thus will be
    slow anyway.

It's hard to see how it's going to be much of a problem if you're going
to be doing kernel calls as well.  Is the test case you're looking at
really representative of any common situation?

> 1) Would it be useful for BufFreelistLock to be partitioned, like
> BufMappingLock, or via some kind of clever "virtual partitioning" that
> could get the same benefit via another means?

Maybe, but you could easily end up with a net loss if the partitioning
makes buffer allocation significantly stupider (ie, higher probability
of picking a less-than-optimal buffer to recycle).

> For the clock sweep algorithm, I think you could access
> nextVictimBuffer without any type of locking.

This is wrong, mainly because you wouldn't have any security against two
processes decrementing the usage count of the same buffer because they'd
fetched the same value of nextVictimBuffer.  That would probably happen
often enough to severely compromise the accuracy of the usage counts and
thus the accuracy of the LRU eviction behavior.  See above.

It might be worth looking into actual partitioning, so that more than one
processor can usefully be working on the usage count management.  But
simply dropping the locking primitives isn't going to lead to anything
except severe screw-ups.

			regards, tom lane
On Wed, Dec 8, 2010 at 8:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I think that the BufFreelistLock can be a contention bottleneck on a
>> system with a lot of CPUs that do a lot of shared-buffer allocations
>> which can be fulfilled by the OS buffer cache.
>
> Really?  buffer/README says
>
>     The buffer management policy is designed so that BufFreelistLock need
>     not be taken except in paths that will require I/O, and thus will be
>     slow anyway.

True, but very large memory means they often don't require true disk I/O
anyway.

> It's hard to see how it's going to be much of a problem if you're going
> to be doing kernel calls as well.

Are kernel calls really all that slow?  I thought they had been greatly
optimized on recent hardware and kernels.  I'm not sure how to create a
test case to distinguish that.

> Is the test case you're looking at
> really representative of any common situation?

That's always the question.  I took the "pick a random number and use it
to look up a pgbench_accounts row by primary key" logic from pgbench -S,
and put it into a stored procedure where it loops 10,000 times, to remove
the overhead of ping-ponging messages back and forth for every query.
(But doing so also removes the overhead of taking AccessShareLock for
every select, so those two changes are entangled.)  This type of workload
could be representative of a nested loop join.

I started looking into it because someone
(http://archives.postgresql.org/pgsql-performance/2010-11/msg00350.php)
thought that pgbench -S might more or less match their real-world
workload.  But by the time I moved most of the selecting into a stored
procedure, maybe it no longer does (it's not even clear if they were
using prepared statements).

But separating things into their component potential bottlenecks, which do
you tackle first?  The more fundamental.  The easiest to analyze.  The one
that can't be gotten around by fine-tuning.  The more interesting :).

>> 1) Would it be useful for BufFreelistLock to be partitioned, like
>> BufMappingLock, or via some kind of clever "virtual partitioning" that
>> could get the same benefit via another means?
>
> Maybe, but you could easily end up with a net loss if the partitioning
> makes buffer allocation significantly stupider (ie, higher probability
> of picking a less-than-optimal buffer to recycle).
>
>> For the clock sweep algorithm, I think you could access
>> nextVictimBuffer without any type of locking.
>
> This is wrong, mainly because you wouldn't have any security against two
> processes decrementing the usage count of the same buffer because they'd
> fetched the same value of nextVictimBuffer.  That would probably happen
> often enough to severely compromise the accuracy of the usage counts and
> thus the accuracy of the LRU eviction behavior.  See above.

Ah, I hadn't considered that.

Cheers,

Jeff
On Dec 8, 2010, at 11:44 PM, Jeff Janes wrote:
>>> For the clock sweep algorithm, I think you could access
>>> nextVictimBuffer without any type of locking.
>>
>> This is wrong, mainly because you wouldn't have any security against two
>> processes decrementing the usage count of the same buffer because they'd
>> fetched the same value of nextVictimBuffer.  That would probably happen
>> often enough to severely compromise the accuracy of the usage counts and
>> thus the accuracy of the LRU eviction behavior.  See above.
>
> Ah, I hadn't considered that.

Ideally, the clock sweep would be run by bgwriter and not individual
backends.  In that case it shouldn't matter much what the performance of
the sweep is.  To do that I think we'd want the bgwriter to target there
being X number of buffers on the free list instead of (or in addition to)
targeting how many dirty buffers need to be written.  This would mirror
what operating systems do; they strive to keep X number of pages on the
free list so that when a process needs memory it can get it quickly.
--
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
Excerpts from Jim Nasby's message of jue dic 09 16:54:24 -0300 2010:

> Ideally, the clock sweep would be run by bgwriter and not individual
> backends.  In that case it shouldn't matter much what the performance of
> the sweep is.  To do that I think we'd want the bgwriter to target there
> being X number of buffers on the free list instead of (or in addition to)
> targeting how many dirty buffers need to be written.  This would mirror
> what operating systems do; they strive to keep X number of pages on the
> free list so that when a process needs memory it can get it quickly.

Isn't that what it does if you set bgwriter_lru_maxpages to some very
large value?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Dec 10, 2010 at 5:45 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Jim Nasby's message of jue dic 09 16:54:24 -0300 2010:
>
>> Ideally, the clock sweep would be run by bgwriter and not individual
>> backends.  In that case it shouldn't matter much what the performance of
>> the sweep is.

Lock contention between the bgwriter and the individual backends would
matter very much.  This might actually make things worse.  Now you need
two BufFreelistLocks, one to stick it on the freelist, and one to take it
off.

>> To do that I think we'd want the bgwriter to target there being X number
>> of buffers on the free list instead of (or in addition to) targeting how
>> many dirty buffers need to be written.  This would mirror what operating
>> systems do; they strive to keep X number of pages on the free list so
>> that when a process needs memory it can get it quickly.
>
> Isn't that what it does if you set bgwriter_lru_maxpages to some very
> large value?

As far as I can tell, bgwriter never adds things to the freelist.  That is
only done at start up, and when a relation or a database is dropped.  The
clock sweep does the vast majority of the work.

But I could be wrong.

Cheers,

Jeff
Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:

> On Fri, Dec 10, 2010 at 5:45 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Excerpts from Jim Nasby's message of jue dic 09 16:54:24 -0300 2010:
> >> To do that I think we'd want the bgwriter to target there being X
> >> number of buffers on the free list instead of (or in addition to)
> >> targeting how many dirty buffers need to be written.  This would
> >> mirror what operating systems do; they strive to keep X number of
> >> pages on the free list so that when a process needs memory it can
> >> get it quickly.
> >
> > Isn't that what it does if you set bgwriter_lru_maxpages to some very
> > large value?
>
> As far as I can tell, bgwriter never adds things to the freelist.
> That is only done at start up, and when a relation or a database is
> dropped.  The clock sweep does the vast majority of the work.

AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:
>> As far as I can tell, bgwriter never adds things to the freelist.
>> That is only done at start up, and when a relation or a database is
>> dropped.  The clock sweep does the vast majority of the work.

> AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).

I think bgwriter just tries to write out dirty buffers so they'll be
clean when the clock sweep reaches them.  It doesn't try to move them to
the freelist.  There might be some advantage in having it move buffers
to a freelist that's just protected by a simple spinlock (or at least,
a lock different from the one that protects the clock sweep).  The
idea would be that most of the time, backends just need to lock the
freelist for long enough to take a buffer off it, and don't run clock
sweep at all.

			regards, tom lane
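A minimal sketch of that division of labor, assuming the freelist gets its
own spinlock and stays chained through each BufferDesc's freeNext field.
The struct and function names here are invented for illustration; this is
not a real patch, it assumes storage/buf_internals.h and storage/s_lock.h,
and it omits the recheck that a popped buffer is still unpinned with zero
usage_count:

    /*
     * Sketch only: a freelist guarded by its own spinlock, refilled by the
     * bgwriter, so backends normally just pop a buffer here and never
     * touch the lock that protects the clock sweep.
     */
    typedef struct
    {
        slock_t     lock;   /* protects head only */
        int         head;   /* buffer id of first free buffer, -1 if empty */
    } FreelistSketch;

    /* Backend side: try the freelist first; -1 means run the clock sweep. */
    static int
    freelist_try_pop(FreelistSketch *fl)
    {
        int     buf = -1;

        SpinLockAcquire(&fl->lock);
        if (fl->head >= 0)
        {
            buf = fl->head;
            fl->head = BufferDescriptors[buf].freeNext;     /* unlink */
        }
        SpinLockRelease(&fl->lock);
        return buf;
    }

    /* bgwriter side: after cleaning a reusable buffer, push it on the list. */
    static void
    freelist_push(FreelistSketch *fl, int buf)
    {
        SpinLockAcquire(&fl->lock);
        BufferDescriptors[buf].freeNext = fl->head;
        fl->head = buf;
        SpinLockRelease(&fl->lock);
    }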
On Dec 10, 2010, at 10:49 AM, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:
>>> As far as I can tell, bgwriter never adds things to the freelist.
>>> That is only done at start up, and when a relation or a database is
>>> dropped.  The clock sweep does the vast majority of the work.
>
>> AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).
>
> I think bgwriter just tries to write out dirty buffers so they'll be
> clean when the clock sweep reaches them.  It doesn't try to move them to
> the freelist.

Yeah, it calls SyncOneBuffer, which does nothing for the clock sweep.

> There might be some advantage in having it move buffers
> to a freelist that's just protected by a simple spinlock (or at least,
> a lock different from the one that protects the clock sweep).  The
> idea would be that most of the time, backends just need to lock the
> freelist for long enough to take a buffer off it, and don't run clock
> sweep at all.

Yeah, the clock sweep code is very intensive compared to pulling a buffer
from the freelist, yet AFAICT nothing will run the clock sweep except
backends.  Unless I'm missing something, the free list is practically
useless because buffers are only put there by InvalidateBuffer, which is
only called by DropRelFileNodeBuffers and DropDatabaseBuffers.  So we make
backends queue up behind the freelist lock with very low odds of getting a
buffer, then we make them queue up for the clock sweep lock and make them
actually run the clock sweep.

BTW, when we moved from 96G to 192G servers I tried increasing shared
buffers from 8G to 28G and performance went down enough to be noticeable
(we don't have any good benchmarks, so I can't really quantify the
degradation).  Going back to 8G brought performance back up, so it seems
like it was the change in shared buffers that caused the issue (the larger
servers also have 24 cores vs 16).  My immediate thought was that we
needed more lock partitions, but I haven't had the chance to see if that
helps.  ISTM the issue could just as well be due to the clock sweep
suddenly taking over 3x longer than before.

We're working on getting a performance test environment set up, so
hopefully in a month or two we'd be able to actually run some testing on
this.
--
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
On Dec 12, 2010, at 8:48 PM, Jim Nasby wrote:
>> There might be some advantage in having it move buffers
>> to a freelist that's just protected by a simple spinlock (or at least,
>> a lock different from the one that protects the clock sweep).  The
>> idea would be that most of the time, backends just need to lock the
>> freelist for long enough to take a buffer off it, and don't run clock
>> sweep at all.
>
> Yeah, the clock sweep code is very intensive compared to pulling a buffer
> from the freelist, yet AFAICT nothing will run the clock sweep except
> backends.  Unless I'm missing something, the free list is practically
> useless because buffers are only put there by InvalidateBuffer, which is
> only called by DropRelFileNodeBuffers and DropDatabaseBuffers.  So we
> make backends queue up behind the freelist lock with very low odds of
> getting a buffer, then we make them queue up for the clock sweep lock
> and make them actually run the clock sweep.

Looking at the code, it seems to be pretty trivial to have SyncOneBuffer
decrement the usage count of every buffer it's handed.  The challenge is
that the code that estimates how many buffers we need to sync looks at
where the clock hand is, and I think it uses that information as part of
its calculation.

So the real challenge here is coming up with a good model for how many
buffers we need to sync on each pass *and* how far the clock needs to be
swept.  There is also (currently) an interdependency here: the LRU scan
will not sync buffers that have a usage_count > 0.  So unless the clock
sweep is being run well enough, the LRU scan becomes completely useless.

My thought is that the clock sweep should be scheduled the same way that
OS VMs handle their free list: they attempt to keep X number of pages on
the free list at all times.  We already track the rate of buffer
allocations, so that can be used to estimate how many pages are being
consumed per cycle.  Plus we'd want some number of extra pages as a
buffer.
--
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
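A back-of-the-envelope sketch of that scheduling rule.  The smoothing and
slack constants are arbitrary placeholders rather than tuned values, and
no function like this exists today; the bgwriter does already keep a
smoothed allocation estimate inside BgBufferSync, which is what the
allocation-rate input below would correspond to:

    /*
     * Sketch only: how many buffers the bgwriter might try to push onto
     * the freelist each cycle, given the recent allocation rate.
     */
    static int
    freelist_refill_target(int allocs_last_cycle, int currently_free)
    {
        static float smoothed_allocs = 0;
        const float smoothing = 0.3f;   /* assumed smoothing factor */
        const float slack = 2.0f;       /* assumed: ~2 cycles' worth free */
        int         want;

        /* exponentially smoothed allocations-per-cycle estimate */
        smoothed_allocs += (allocs_last_cycle - smoothed_allocs) * smoothing;

        want = (int) (smoothed_allocs * slack) - currently_free;
        return want > 0 ? want : 0;     /* buffers to free onto the list now */
    }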
On Sun, Dec 12, 2010 at 6:48 PM, Jim Nasby <jim@nasby.net> wrote:
> On Dec 10, 2010, at 10:49 AM, Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:
>>>> As far as I can tell, bgwriter never adds things to the freelist.
>>>> That is only done at start up, and when a relation or a database is
>>>> dropped.  The clock sweep does the vast majority of the work.
>>
>>> AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).
>>
>> I think bgwriter just tries to write out dirty buffers so they'll be
>> clean when the clock sweep reaches them.  It doesn't try to move them to
>> the freelist.
>
> Yeah, it calls SyncOneBuffer, which does nothing for the clock sweep.
>
>> There might be some advantage in having it move buffers
>> to a freelist that's just protected by a simple spinlock (or at least,
>> a lock different from the one that protects the clock sweep).  The
>> idea would be that most of the time, backends just need to lock the
>> freelist for long enough to take a buffer off it, and don't run clock
>> sweep at all.
>
> Yeah, the clock sweep code is very intensive compared to pulling a buffer
> from the freelist, yet AFAICT nothing will run the clock sweep except
> backends.  Unless I'm missing something, the free list is practically
> useless because buffers are only put there by InvalidateBuffer, which is
> only called by DropRelFileNodeBuffers and DropDatabaseBuffers.

Buffers are also put on the freelist at start up (all of them).  But of
course any busy system with more data than buffers will rapidly deplete
them, and DropRelFileNodeBuffers and DropDatabaseBuffers are generally not
going to happen enough to be meaningful on most setups, I would think.

I was wondering whether, if the steady-state condition is to always use
the clock sweep, that shouldn't be the only mechanism that exists.

> So we make backends queue up behind the freelist lock with very low odds
> of getting a buffer, then we make them queue up for the clock sweep lock
> and make them actually run the clock sweep.

It is the same lock that governs both.  Given the simplicity of checking
that the freelist is empty, I don't think it adds much overhead.

> BTW, when we moved from 96G to 192G servers I tried increasing shared
> buffers from 8G to 28G and performance went down enough to be noticeable
> (we don't have any good benchmarks, so I can't really quantify the
> degradation).  Going back to 8G brought performance back up, so it seems
> like it was the change in shared buffers that caused the issue (the
> larger servers also have 24 cores vs 16).

What kind of workload do you have (intensity of reading versus writing)?
How intensely concurrent is the access?

> My immediate thought was that we needed more lock partitions, but I
> haven't had the chance to see if that helps.  ISTM the issue could just
> as well be due to the clock sweep suddenly taking over 3x longer than
> before.

It would surprise me if most clock sweeps need to make anything near a
full pass over the buffers for each allocation (though technically it
wouldn't need to do that in order to take 3x longer: it could be that the
fraction of a pass it needs to make is merely proportional to
shared_buffers.  That too would surprise me, though).  You could compare
the number of passes with the number of allocations to see how much
sweeping is done per allocation.  However, I don't think the number of
passes is reported anywhere, unless you compile with #define BGW_DEBUG and
run with debug2.
I wouldn't expect an increase in shared_buffers to make contention on BufFreelistLock worse. If the increased buffers are used to hold heavily-accessed data, then you will find the pages you want in shared_buffers more often, and so need to run the clock-sweep less often. That should make up for longer sweeps. But if the increased buffers are used to hold data that is just read once and thrown away, then the clock sweep shouldn't need to sweep very far before finding a candidate. But of course being able to test would be better than speculation. Cheers, Jeff
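For what it's worth, the "passes versus allocations" comparison mentioned
above boils down to arithmetic like the following.  The counters
correspond to what StrategySyncStart() hands to BgBufferSync(), but this
helper itself is illustration only, not existing code:

    /*
     * Sketch only: buffers inspected per buffer allocated, computed from
     * two snapshots of (completePasses, nextVictimBuffer, numBufferAllocs).
     * A result near 1 means almost every inspected buffer was immediately
     * evictable; large values mean long sweeps past high-usage_count
     * buffers.
     */
    static double
    sweeps_per_allocation(uint32 passes0, int hand0, uint32 allocs0,
                          uint32 passes1, int hand1, uint32 allocs1)
    {
        double  scanned = (double) (passes1 - passes0) * NBuffers
                          + (hand1 - hand0);
        double  allocs = (double) (allocs1 - allocs0);

        return (allocs > 0) ? scanned / allocs : 0.0;
    }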
On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
> On Sun, Dec 12, 2010 at 6:48 PM, Jim Nasby <jim@nasby.net> wrote:
>>
>> BTW, when we moved from 96G to 192G servers I tried increasing shared
>> buffers from 8G to 28G and performance went down enough to be noticeable
>> (we don't have any good benchmarks, so I can't really quantify the
>> degradation).  Going back to 8G brought performance back up, so it seems
>> like it was the change in shared buffers that caused the issue (the
>> larger servers also have 24 cores vs 16).
>
> What kind of workload do you have (intensity of reading versus writing)?
> How intensely concurrent is the access?

It writes at the rate of ~3-5MB/s, doing ~700TPS on average.  It's hard to
judge the exact read mix, because it's running on a 192G server (actually,
512G now, but 192G when I tested).  The working set is definitely between
96G and 192G; we saw a major performance improvement last year when we
went to 192G, but we haven't seen any improvement moving to 512G.  We
typically have 10-20 active queries at any point.

>> My immediate thought was that we needed more lock partitions, but I
>> haven't had the chance to see if that helps.  ISTM the issue could just
>> as well be due to the clock sweep suddenly taking over 3x longer than
>> before.
>
> It would surprise me if most clock sweeps need to make anything near a
> full pass over the buffers for each allocation (though technically it
> wouldn't need to do that in order to take 3x longer: it could be that
> the fraction of a pass it needs to make is merely proportional to
> shared_buffers.  That too would surprise me, though).  You could compare
> the number of passes with the number of allocations to see how much
> sweeping is done per allocation.  However, I don't think the number of
> passes is reported anywhere, unless you compile with #define BGW_DEBUG
> and run with debug2.
>
> I wouldn't expect an increase in shared_buffers to make contention on
> BufFreelistLock worse.  If the increased buffers are used to hold
> heavily-accessed data, then you will find the pages you want in
> shared_buffers more often, and so need to run the clock-sweep less
> often.  That should make up for longer sweeps.  But if the increased
> buffers are used to hold data that is just read once and thrown away,
> then the clock sweep shouldn't need to sweep very far before finding a
> candidate.

Well, we're talking about a working set that's between 96 and 192G, but
only 8G (or 28G) of shared buffers.  So there's going to be a pretty large
amount of buffer replacement happening.  We also have 210 tables where the
ratio of heap buffer hits to heap reads is over 1000, so the stuff that is
in shared buffers probably keeps usage_count quite high.  Put these two
together, and we're probably spending a fairly significant amount of time
running the clock sweep.

Even excluding our admittedly unusual workload, there is still significant
overhead in running the clock sweep vs just grabbing something off of the
free list (assuming we had separate locks for the two operations).  Does
anyone know what the overhead of getting a block from the filesystem cache
is?  I wonder how many buffers you can move through in the same amount of
time.  Put another way, at some point you have to check enough buffers to
find a free one that you've just doubled the amount of time it takes to
get data from the filesystem cache into a shared buffer.

> But of course being able to test would be better than speculation.

Yeah, I'm working on getting pg_buffercache installed so we can see what's
actually in the cache.  Hmm...
I wonder how hard it would be to hack something up that has a separate
process that does nothing but run the clock sweep.  We'd obviously not run
a hack in production, but we're working on being able to reproduce a
production workload.  If we had a separate clock-sweep process we could
get an idea of exactly how much work was involved in keeping free buffers
available.

BTW, given our workload I can't see any way of running at debug2 without
having a large impact on performance.
--
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
On Tue, Dec 14, 2010 at 1:42 PM, Jim Nasby <jim@nasby.net> wrote:
>
> On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
>
>> I wouldn't expect an increase in shared_buffers to make contention on
>> BufFreelistLock worse.  If the increased buffers are used to hold
>> heavily-accessed data, then you will find the pages you want in
>> shared_buffers more often, and so need to run the clock-sweep less
>> often.  That should make up for longer sweeps.  But if the increased
>> buffers are used to hold data that is just read once and thrown away,
>> then the clock sweep shouldn't need to sweep very far before finding a
>> candidate.
>
> Well, we're talking about a working set that's between 96 and 192G, but
> only 8G (or 28G) of shared buffers.  So there's going to be a pretty
> large amount of buffer replacement happening.  We also have 210 tables
> where the ratio of heap buffer hits to heap reads is over 1000, so the
> stuff that is in shared buffers probably keeps usage_count quite high.
> Put these two together, and we're probably spending a fairly significant
> amount of time running the clock sweep.

The thing that makes me think the bottleneck is elsewhere is that
increasing from 8G to 28G made it worse.  If buffer unpins are happening
at about the same rate, then my gut feeling is that the clock sweep has to
do about the same amount of decrementing before it gets to a free buffer
under steady-state conditions.  Whether it has to decrement 8G of buffers
three and a half times each, or 28G of buffers one time each, it would do
about the same amount of work.  This is all hand waving, of course.

> Even excluding our admittedly unusual workload, there is still
> significant overhead in running the clock sweep vs just grabbing
> something off of the free list (assuming we had separate locks for the
> two operations).

But do we actually know that?  Doing a clock sweep is only a lot of
overhead if it has to pass over many buffers in order to find a good one,
and we don't know the numbers on that.  I think you can sweep a lot of
buffers for the overhead of a single contended lock.

If the sweep and the freelist had separate locks, you would still need to
lock the freelist to add to it the things discovered during the sweep.

> Does anyone know what the overhead of getting a block from the
> filesystem cache is?

I did tests on this a few days ago.  It took on average 20 microseconds
per row to select one row via primary key when everything was in shared
buffers.  When everything was in RAM but not shared buffers, it took 40
microseconds.  Of this, about 10 microseconds were the kernel calls to
seek and read from the OS cache to shared_buffers, and the other 10
microseconds is some kind of PG overhead, I don't know where.  The timings
are per select, not per page, and one select usually reads two pages, one
for the index leaf and one for the table.

This was all single-client usage on a 2.8GHz AMD Opteron.  Not all the
components of the timings will scale equally with additional clients on
additional CPUs, of course.  I think the time spent in the kernel calls to
do the seek and read will scale better than most other parts.

> BTW, given our workload I can't see any way of running at debug2 without
> having a large impact on performance.

As long as you are adding #define BGW_DEBUG and recompiling, you might as
well promote all the DEBUG2 in src/backend/storage/buffer/bufmgr.c to
DEBUG1 or LOG.  I think this will only generate a couple of log messages
per bgwriter_delay.  That should be tolerable, especially for testing
purposes.

Cheers,

Jeff
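Putting rough numbers on the earlier "how many buffers can you move
through in the same amount of time" question, using the ~10 microsecond
kernel-call figure above together with an assumed, unmeasured cost of a
few tens of nanoseconds to lock and examine one uncontended,
cache-resident buffer header:

    /* Back-of-the-envelope only; the per-header cost is an assumption. */
    #include <stdio.h>

    int
    main(void)
    {
        double  kernel_call_us = 10.0;  /* measured above, per buffer read */
        double  per_header_us = 0.05;   /* assumed ~50ns per header examined */

        /* prints roughly 200: the sweep could inspect a couple hundred
         * such headers in the time of one seek+read kernel call */
        printf("headers per kernel call: %.0f\n",
               kernel_call_us / per_header_us);
        return 0;
    }

Once the header spinlocks are contended, or the headers fall out of CPU
cache, the per-header cost would of course be considerably higher.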
On Dec 15, 2010, at 2:40 PM, Jeff Janes wrote:
> On Tue, Dec 14, 2010 at 1:42 PM, Jim Nasby <jim@nasby.net> wrote:
>>
>> On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
>>> I wouldn't expect an increase in shared_buffers to make contention on
>>> BufFreelistLock worse.  If the increased buffers are used to hold
>>> heavily-accessed data, then you will find the pages you want in
>>> shared_buffers more often, and so need to run the clock-sweep less
>>> often.  That should make up for longer sweeps.  But if the increased
>>> buffers are used to hold data that is just read once and thrown away,
>>> then the clock sweep shouldn't need to sweep very far before finding a
>>> candidate.
>>
>> Well, we're talking about a working set that's between 96 and 192G, but
>> only 8G (or 28G) of shared buffers.  So there's going to be a pretty
>> large amount of buffer replacement happening.  We also have 210 tables
>> where the ratio of heap buffer hits to heap reads is over 1000, so the
>> stuff that is in shared buffers probably keeps usage_count quite high.
>> Put these two together, and we're probably spending a fairly
>> significant amount of time running the clock sweep.
>
> The thing that makes me think the bottleneck is elsewhere is that
> increasing from 8G to 28G made it worse.  If buffer unpins are happening
> at about the same rate, then my gut feeling is that the clock sweep has
> to do about the same amount of decrementing before it gets to a free
> buffer under steady-state conditions.  Whether it has to decrement 8G of
> buffers three and a half times each, or 28G of buffers one time each, it
> would do about the same amount of work.  This is all hand waving, of
> course.

While we're waving hands... I think the issue is that our working set
size is massive.  That means that there will be a lot of activity driving
usage_count up on buffers.  Increasing shared buffers will help reduce
that effect as they begin to contain more and more of the working set, but
I suspect that going from 8G to 28G wouldn't have made much difference.
That means that we now have *more* buffers with a high usage count that
the sweep has to slog through.

Anyway, once I'm able to get the buffer stats contrib module installed
we'll have a better idea of what's actually happening.

>> Even excluding our admittedly unusual workload, there is still
>> significant overhead in running the clock sweep vs just grabbing
>> something off of the free list (assuming we had separate locks for the
>> two operations).
>
> But do we actually know that?  Doing a clock sweep is only a lot of
> overhead if it has to pass over many buffers in order to find a good
> one, and we don't know the numbers on that.  I think you can sweep a lot
> of buffers for the overhead of a single contended lock.
>
> If the sweep and the freelist had separate locks, you would still need
> to lock the freelist to add to it the things discovered during the
> sweep.

I'm hoping we could actually use separate locks for adding and removing,
assuming we discover this is actually a consideration.

>> Does anyone know what the overhead of getting a block from the
>> filesystem cache is?
>
> I did tests on this a few days ago.  It took on average 20 microseconds
> per row to select one row via primary key when everything was in shared
> buffers.  When everything was in RAM but not shared buffers, it took 40
> microseconds.  Of this, about 10 microseconds were the kernel calls to
> seek and read from the OS cache to shared_buffers, and the other 10
> microseconds is some kind of PG overhead, I don't know where.
> The timings are per select, not per page, and one select usually reads
> two pages, one for the index leaf and one for the table.
>
> This was all single-client usage on a 2.8GHz AMD Opteron.  Not all the
> components of the timings will scale equally with additional clients on
> additional CPUs, of course.  I think the time spent in the kernel calls
> to do the seek and read will scale better than most other parts.

Interesting info.  I wonder if that 10us of unknown overhead was related
to shared buffers.  Do you know if you had room in shared buffers when you
ran that test?  It would be interesting to see the differences between
having buffers on the free list, no buffers on the free list but buffers
with 0 usage count (though I'm not sure how you could set that up), and
shared buffers with a high usage count.

>> BTW, given our workload I can't see any way of running at debug2 without
>> having a large impact on performance.
>
> As long as you are adding #define BGW_DEBUG and recompiling, you might
> as well promote all the DEBUG2 in src/backend/storage/buffer/bufmgr.c to
> DEBUG1 or LOG.  I think this will only generate a couple of log messages
> per bgwriter_delay.  That should be tolerable, especially for testing
> purposes.

Good ideas; I'll try to get that in place once we can benchmark, though
it'll be easier to get pg_buffercache in place, so I'll focus on that
first.
--
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net