Thread: Scaling shared buffer eviction
As mentioned previously, I am interested in improving shared buffer eviction, especially by reducing contention around BufFreelistLock, and I would like to share my progress on the same. The test used for this work is mainly the case where all the data doesn't fit in shared buffers but does fit in memory. It is based on a previous comparison done by Robert for a similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html

To start with, I have taken an LWLOCK_STATS report to confirm the contention around BufFreelistLock; the data for HEAD is as follows:

M/c details
IBM POWER-7 16 cores, 64 hardware threads
RAM - 64GB

Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5mins
./pgbench -c 64 -j 64 -T 300 -S postgres

LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023

Here the high *blk* count for scale factor 3000 clearly shows that, to find a usable buffer when the data doesn't fit in shared buffers, a backend has to wait. To solve this issue, I have implemented a patch which makes sure that there are always enough buffers on the freelist so that the need for a backend to run the clock sweep is minimal. The implementation idea is more or less the same as discussed previously in the thread below, so I will explain it at the end of this mail.
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

LWLOCK_STATS data after patch (test used is the same as used for HEAD):
BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

Here the low *exacq* and *blk* counts show that the need for a backend to run the clock sweep has reduced significantly.

Performance Data
-------------------------------
shared_buffers = 8GB
number of threads = 64
sc = scale factor

          sc      tps
Head    3000    45569
Patch   3000    46457
Head    1000    93037
Patch   1000    92711

The above data shows that there is no significant change in performance or scalability even after the contention around BufFreelistLock is reduced significantly. I have analyzed the patch both with perf record and LWLOCK_STATS, and both indicate that there is high contention around the BufMappingLocks.

Data with perf record -a -g
-----------------------------------------
+  10.14%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   7.77%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
+   6.88%  postgres  [kernel.kallsyms]  [k] .function_trace_call
+   4.15%  pgbench   [kernel.kallsyms]  [k] .try_to_wake_up
+   3.20%  swapper   [kernel.kallsyms]  [k] .function_trace_call
+   2.99%  pgbench   [kernel.kallsyms]  [k] .function_trace_call
+   2.41%  postgres  postgres           [.] AllocSetAlloc
+   2.38%  postgres  [kernel.kallsyms]  [k] .try_to_wake_up
+   2.27%  pgbench   [kernel.kallsyms]  [k] ._raw_spin_lock
+   1.49%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock_irq
+   1.36%  postgres  postgres           [.] AllocSetFreeIndex
+   1.09%  swapper   [kernel.kallsyms]  [k] ._raw_spin_lock
+   0.91%  postgres  postgres           [.] GetSnapshotData
+   0.90%  postgres  postgres           [.] MemoryContextAllocZeroAligned

Expanded graph
------------------------------
-  10.14%  swapper  [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
   - .pseries_dedicated_idle_sleep
      - 10.13% .pseries_dedicated_idle_sleep
         - 10.13% .cpu_idle
            - 10.00% .start_secondary
                     .start_secondary_prolog
-   7.77%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
   - ._raw_spin_lock
      - 6.63% ._raw_spin_lock
         - 5.95% .double_rq_lock
                 .load_balance
            - 5.95% .__schedule
                    .schedule
               - 3.27% .SyS_semtimedop
                       .SyS_ipc
                       syscall_exit
                       semop
                       PGSemaphoreLock
                       LWLockAcquireCommon
                  - LWLockAcquire
                     - 3.27% BufferAlloc
                             ReadBuffer_common
                        - ReadBufferExtended
                           - 3.27% ReadBuffer
                              - 2.73% ReleaseAndReadBuffer
                                 - 1.70% _bt_relandgetbuf
                                         _bt_search
                                         _bt_first
                                         btgettuple

It shows BufferAlloc->LWLockAcquire as a top contributor, and we take the BufMappingLocks in BufferAlloc. I have checked the other expanded calls as well; StrategyGetBuffer is not present in the top contributors.

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after the patch there is no contention for BufFreeListLock; rather there is huge contention around the BufMappingLocks. I have checked that HEAD also has contention around the BufMappingLocks.

As per my analysis till now, I think reducing contention around BufFreelistLock is not sufficient to improve scalability; we need to work on reducing contention around the BufMappingLocks as well.

Details of patch
------------------------
1. Changed bgwriter to move buffers (having usage_count of zero) to the freelist based on a threshold (high_watermark) and to decrement the usage count if it is greater than zero.
2. StrategyGetBuffer() will wake up bgwriter when the number of buffers on the freelist drops below low_watermark. Currently I am using hard-coded values; we can choose to make them configurable later if required.
3. The work done to get a buffer from the freelist is done under a spinlock, and the clock sweep still runs under BufFreelistLock.

This is still a WIP patch, and some of the changes are just a kind of prototype to check the idea; for example, I have hacked the bgwriter code such that it continuously fills the freelist until it is able to put enough buffers on the freelist to reach high_watermark, and commented out some part of the previous code.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
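To make the watermark scheme described above concrete, here is a minimal, self-contained C model of the idea. This is an illustration only, not code from the patch: buffer headers, pinning, locking, I/O, and dirty-buffer handling are all omitted, and names such as HIGH_WATERMARK and refill_freelist are invented for the sketch.

/* Stand-alone model of the watermark-based freelist refill idea. */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS        16
#define HIGH_WATERMARK   5      /* refill the freelist up to this many entries */

static int  usage_count[NBUFFERS];      /* simplified buffer usage counts */
static bool on_freelist[NBUFFERS];
static int  freelist[NBUFFERS];
static int  nfree = 0;
static int  next_victim = 0;            /* clock hand */

/*
 * What the patched bgwriter would do when woken: sweep the clock hand,
 * decrementing usage counts, and push usage_count == 0 buffers onto the
 * freelist until the high watermark is reached.
 */
static void
refill_freelist(void)
{
    while (nfree < HIGH_WATERMARK)
    {
        int     buf = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;
        if (on_freelist[buf])
            continue;
        if (usage_count[buf] == 0)
        {
            freelist[nfree++] = buf;
            on_freelist[buf] = true;
        }
        else
            usage_count[buf]--;         /* give the buffer another chance */
    }
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = i % 3;         /* arbitrary usage counts */

    refill_freelist();

    printf("freelist:");
    for (int i = 0; i < nfree; i++)
        printf(" %d", freelist[i]);
    printf("\n");
    return 0;
}

In this toy model, backends would pop entries from freelist[] instead of sweeping themselves, which is the contention reduction the patch is after.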
shared_buffers = 8GB
scale factor = 3000
RAM - 64GB

              Thrds (64)   Thrds (128)
HEAD               45562         17128
HEAD + 64          57904         32810
V1 + 64           105557         81011
HEAD + 128         58383         32997
V1 + 128          110705        114544

shared_buffers = 8GB
scale factor = 1000
RAM - 64GB

              Thrds (64)   Thrds (128)
HEAD               92142         31050
HEAD + 64         108120         86367
V1 + 64           117454        123429
HEAD + 128        107762         86902
V1 + 128          123641        124822
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I'm having a little trouble following this. These figures are transactions per second for a 300 second pgbench tpc-b run? What does "Thrds" denote?
Peter Geoghegan
>
>
> On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> shared_buffers= 8GB
>> scale factor = 3000
>> RAM - 64GB
>
> I'm having a little trouble following this. These figures are transactions per second for a 300 second pgbench tpc-b run?
./pgbench -c 64 -j 64 -T 300 -S postgres
>
> I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on its own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault.
>
> In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different.
I have still not read the complete code, but by just going through initial file
On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>               Thrds (64)   Thrds (128)
> HEAD               45562         17128
> HEAD + 64          57904         32810
> V1 + 64           105557         81011
> HEAD + 128         58383         32997
> V1 + 128          110705        114544

I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on its own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault.
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
Client Count/patch_ver (tps) | 8 | 16 | 32 | 64 | 128 |
Head | 26220 | 48686 | 70779 | 45232 | 17310 |
Patch | 26402 | 50726 | 75574 | 111468 | 114521 |
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> wrote:

> I have improved the patch by making following changes:
> a. Improved the bgwriter logic to log for xl_running_xacts info and
> removed the hibernate logic as bgwriter will now work only when
> there is scarcity of buffer's in free list. Basic idea is when the
> number of buffers on freelist drops below the low threshold, the
> allocating backend sets the latch and bgwriter wakesup and begin
> adding buffer's to freelist until it reaches high threshold and then
> again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above concerns me. My tuning of the bgwriter in production has generally *not* been aimed at keeping pages on the freelist, but toward preventing shared_buffers from accumulating a lot of dirty pages, which were leading to cascades of writes between caches and thus to write stalls. By pushing dirty pages into the (*much* larger) OS cache, and letting write combining happen there, where the OS could pace based on the total number of dirty pages instead of having some of them hidden and appearing rather suddenly, latency spikes were avoided while not causing any noticeable increase in the number of OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page was always pushed out to the OS cache within three seconds, which led to a healthy balance of writes between the checkpoint process and the bgwriter. Backend processes related to user connections still performed about 30% of the writes, and this work shows promise toward bringing that down, which would be great; but please don't eliminate the ability to prevent write stalls in the process.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I have improved the patch by making following changes:
> > a. Improved the bgwriter logic to log for xl_running_xacts info and
> > removed the hibernate logic as bgwriter will now work only when
> > there is scarcity of buffer's in free list. Basic idea is when the
> > number of buffers on freelist drops below the low threshold, the
> > allocating backend sets the latch and bgwriter wakesup and begin
> > adding buffer's to freelist until it reaches high threshold and then
> > again goes back to sleep.
>
> The numbers from your benchmarks are very exciting, but the above
> concerns me. My tuning of the bgwriter in production has generally
> *not* been aimed at keeping pages on the freelist, but toward
> preventing shared_buffers from accumulating a lot of dirty pages,
> which were leading to cascades of writes between caches and thus to
> write stalls. By pushing dirty pages into the (*much* larger) OS
> cache, and letting write combining happen there, where the OS could
> pace based on the total number of dirty pages instead of having
> some hidden and appearing rather suddenly, latency spikes were
> avoided while not causing any noticeable increase in the number of
> OS writes to the RAID controller's cache.
> Essentially I was able to tune the bgwriter so that a dirty page
> was always pushed out to the OS cache within three seconds, which led
> to a healthy balance of writes between the checkpoint process and
> the bgwriter. Backend processes related to user connections still
> performed about 30% of the writes, and this work shows promise
> toward bringing that down, which would be great; but please don't
> eliminate the ability to prevent write stalls in the process.
On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have improved the patch by making following changes:
>> a. Improved the bgwriter logic to log for xl_running_xacts info and
>> removed the hibernate logic as bgwriter will now work only when
>> there is scarcity of buffer's in free list. Basic idea is when the
>> number of buffers on freelist drops below the low threshold, the
>> allocating backend sets the latch and bgwriter wakesup and begin
>> adding buffer's to freelist until it reaches high threshold and then
>> again goes back to sleep.
>
> The numbers from your benchmarks are very exciting, but the above
> concerns me. My tuning of the bgwriter in production has generally
> *not* been aimed at keeping pages on the freelist,

Just to be clear, prior to this patch, the bgwriter has never been in the business of putting pages on the freelist in the first place, so it wouldn't have been possible for you to tune for that.

> Essentially I was able to tune the bgwriter so that a dirty page
> was always pushed out to the OS cache within three seconds, which led
> to a healthy balance of writes between the checkpoint process and
> the bgwriter. Backend processes related to user connections still
> performed about 30% of the writes, and this work shows promise
> toward bringing that down, which would be great; but please don't
> eliminate the ability to prevent write stalls in the process.

I think, as Amit says downthread, that the crucial design question here is whether we need two processes, one to populate the freelist so that regular backends don't need to run the clock sweep, and a second to flush dirty buffers, or whether a single process can serve both needs.

In favor of a single process, many people have commented that the background writer doesn't seem to do much right now. If the process is mostly sitting around idle, then giving it more responsibilities might be OK.

In favor of having a second process, I'm a little concerned that if the background writer gets busy writing a page, it might then be unavailable to populate the freelist until it finishes, which might be a very long time relative to the buffer allocation needs of other backends.

I'm not sure what the right answer is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > Backend processes related to user connections still
> > performed about 30% of the writes, and this work shows promise
> > toward bringing that down, which would be great; but please don't
> > eliminate the ability to prevent write stalls in the process.
>
>
> I am planning to take some more performance data, part of which will
> be write load as well, but I am not sure if that can anyway show the
> need as mentioned by you.
Client count/TPS | 64 | 128 |
Un-patched | 45232 | 17310 |
sbe_v3 | 111468 | 114521 |
sbe_v4 | 153137 | 160752 |
Client count/TPS | 64 | 128 |
Un-patched | 825 | 784 |
sbe_v4 | 814 | 845 |
I have improved the patch by making the following changes:

a. Improved the bgwriter logic to log for xl_running_xacts info and removed the hibernate logic, as bgwriter will now work only when there is scarcity of buffers in the free list. The basic idea is that when the number of buffers on the freelist drops below the low threshold, the allocating backend sets the latch and bgwriter wakes up and begins adding buffers to the freelist until it reaches the high threshold, and then again goes back to sleep.

b. New stats for the number of buffers on the freelist have been added; some old ones like maxwritten_clean can be removed as the new logic for syncing buffers and moving them to the free list doesn't use them. However, I think it's better to remove them once the new logic is accepted. Added some new logs for info related to the free list under BGW_DEBUG.

c. Used the already existing bgwriterLatch in BufferStrategyControl to wake bgwriter when the number of buffers in the freelist drops below the threshold.

d. Autotune the low and high thresholds for the freelist for various configurations. Generally, if we keep a small number (200~2000) of buffers always available on the freelist, then even for high shared_buffers like 15GB it appears to be sufficient. However, when the value of shared_buffers is less, then we need a much smaller number. I think we can provide these as config knobs for the user as well, but for now, based on LWLOCK_STATS results, I have chosen some hard-coded values for the low and high threshold values for the freelist. Values for the low and high thresholds have been decided based on the total number of shared buffers; basically I have divided them into 5 categories (16~100, 100~1000, 1000~10000, 10000~100000, 100000 and above) and then ran tests (read-only pgbench) for various configurations falling under these categories. The reason for keeping fewer categories for larger shared_buffers is that if there is a small number (200~2000) of buffers available on the free list, then it seems to be sufficient for quite high loads; however, as the total number of shared buffers decreases we need to be more careful, as if we keep the number too low then it will lead to more clock sweeps by backends (which means freelist lock contention), and if we keep the number higher bgwriter will evict many useful buffers. Results based on LWLOCK_STATS are at the end of the mail.
Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save the values of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity. Release the spinlock. Then, set the bgwriterLatch if needed. In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock. Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.
Under this locking strategy, StrategyNotifyBgWriter would use freelist_lck. Right now, the patch removes the only caller, and should therefore remove the function as well, but if we go with the new-process idea listed above that part would get reverted, and then you'd need to make it use the correct spinlock. You should also go through this patch and remove all the commented-out bits and pieces that you haven't cleaned up; those are distracting and unhelpful.
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
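A toy, self-contained C model may help make the two-spinlock sequence described above clearer. This is an illustration of the proposal only, not code from the patch: pthread mutexes stand in for spinlocks, buffer pins and headers are reduced to a bare usage count, and the latch handling is replaced by a simple low-watermark check with invented names (Strategy, get_buffer, wake_reclaimer).

/* Toy model of the freelist_lck / victimbuf_lck split described above. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS        128
#define LOW_WATERMARK     5

typedef struct
{
    pthread_mutex_t freelist_lck;   /* protects freelist and numBufferAllocs */
    pthread_mutex_t victimbuf_lck;  /* protects nextVictimBuffer only */
    int         freelist[NBUFFERS];
    int         nfree;
    int         numBufferAllocs;
    int         nextVictimBuffer;
} Strategy;

static int  usage_count[NBUFFERS];

static void
wake_reclaimer(void)
{
    /* In the real design this would be SetLatch() on the background
     * process's latch, done after the spinlock has been released. */
}

static int
get_buffer(Strategy *s)
{
    int         buf = -1;
    bool        need_wakeup;

    /* Step 1: everything freelist-related happens under freelist_lck. */
    pthread_mutex_lock(&s->freelist_lck);
    s->numBufferAllocs++;
    need_wakeup = (s->nfree < LOW_WATERMARK);
    if (s->nfree > 0)
        buf = s->freelist[--s->nfree];
    pthread_mutex_unlock(&s->freelist_lck);

    /* Step 2: wake the background process outside the lock if needed. */
    if (need_wakeup)
        wake_reclaimer();

    /* Step 3: use the popped buffer if it is still a good victim; otherwise
     * run the clock sweep, taking victimbuf_lck only to advance the hand. */
    for (;;)
    {
        if (buf >= 0 && usage_count[buf] == 0)
            return buf;             /* the real code also checks pins here */

        pthread_mutex_lock(&s->victimbuf_lck);
        buf = s->nextVictimBuffer;
        s->nextVictimBuffer = (s->nextVictimBuffer + 1) % NBUFFERS;
        pthread_mutex_unlock(&s->victimbuf_lck);

        if (usage_count[buf] > 0)
        {
            usage_count[buf]--;     /* not usable yet, keep sweeping */
            buf = -1;
        }
    }
}

int
main(void)
{
    Strategy    s = {0};

    pthread_mutex_init(&s.freelist_lck, NULL);
    pthread_mutex_init(&s.victimbuf_lck, NULL);
    usage_count[0] = 3;             /* make the first clock victim unattractive */
    printf("victim buffer = %d\n", get_buffer(&s));
    return 0;
}

The point of the split is visible in get_buffer(): the freelist pop and the allocation counter are covered by one short critical section, and the clock hand by another, so neither path serializes on a single long-held lock.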
> On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync(). Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere. Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed. I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting. Likewise, if you've rendered GUCs or statistics counters removed, you need to rip them out, so that the scope of the changes you've made is clear to reviewers.
>
> A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least one behavior that some people (at least, me) will care about, which is the guarantee that the background writer will scan the entire buffer pool at least every couple of minutes.
>> b. New stats for number of buffers on freelist has been added, some
>> old one's like maxwritten_clean can be removed as new logic for
>> syncing buffers and moving them to free list doesn't use them.
>> However I think it's better to remove them once the new logic is
>> accepted. Added some new logs for info related to free list under
>> BGW_DEBUG
>
>
> If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist. That makes no sense.
>>
>> d. Autotune the low and high threshold for freelist for various
>> configurations.
>
> I think we need to come up with some kind of formula here rather than just a list of hard-coded constants.
>
> Aside from those specific remarks, I think the elephant in the room is the question of whether it really makes sense to have one process which is responsible both for populating the free list and for writing buffers to disk. One problem, which I alluded to above under point (1), is that we might sometimes want to ensure that dirty buffers are written out to disk without decrementing usage counts or adding anything to the free list. This is a potentially solvable problem, though, because we can figure out the number of buffers that we need to scan for freelist population and the number that we need to scan for minimum buffer pool cleaning (one cycle every 2 minutes). Once we've met the first goal, any further buffers we run into under the second goal get cleaned if appropriate but their usage counts don't get pushed down nor do they get added to the freelist. Once we meet the second goal, we can go back to sleep.
>
> But the other problem, which I think is likely unsolvable, is that writing a dirty page can take a long time on a busy system (multiple seconds) and the freelist can be emptied much, much quicker than that (milliseconds). Although your benchmark results show great speed-ups on read-only workloads, we're not really going to get the benefit consistently on read-write workloads -- unless of course the background writer fails to actually write anything, which should be viewed as a bug, not a feature -- because the freelist will often be empty while the background writer is blocked on I/O.
>
> I'm wondering if it would be a whole lot simpler and better to introduce a new background process, maybe with a name like bgreclaim.
>
> Incidentally, while I generally think your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need significant cleanup. Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things that used to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not OK.
>
> Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save the values of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity. Release the spinlock. Then, set the bgwriterLatch if needed. In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock. Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.
>
> Under this locking strategy, StrategyNotifyBgWriter would use freelist_lck. Right now, the patch removes the only caller, and should therefore remove the function as well, but if we go with the new-process idea listed above that part would get reverted, and then you'd need to make it use the correct spinlock. You should also go through this patch and remove all the commented-out bits and pieces that you haven't cleaned up; those are distracting and unhelpful.
On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

>> If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist. That makes no sense.
>
> I think using 'number of buffers currently on the freelist' and
> 'number of recently allocated buffers' for consecutive cycles,
> we can figure out approximately how many buffer allocations
> needs clock sweep assuming low and high threshold water
> marks are fixed. However there can be cases where it is not
> easy to estimate that number.

Counters should be designed in such a way that you can read it, and then read it again later, and make sense of it - you should not need to read the counter on *consecutive* cycles to interpret it.

>> I think what you should be counting is the number of allocations that are being satisfied from the free-list. Then, by comparing the rate at which that value is incrementing to the rate at which buffers_alloc is incrementing, somebody can figure out what percentage of allocations are requiring a clock-sweep run. Actually, I think it's better to flip it around: count the number of allocations that require an individual backend to run the clock sweep (vs. being satisfied from the free-list); call it, say, buffers_backend_clocksweep. We can then try to tune the patch to make that number as small as possible under varying workloads.
>
> This can give us clear idea to tune the patch, however we need
> to maintain 3 counters for it in code (recent_alloc (needed for
> current bgwriter logic) and other 2 suggested by you). Do you
> want to retain such counters in code or it's for kind of debug info
> for patch?

I only mean to propose one new counter, and I'd imagine including that in the final patch. We already have a counter of total buffer allocations; that's buffers_alloc. I'm proposing to add an additional counter for the number of those allocations not satisfied from the free list, with a name like buffers_alloc_clocksweep (I said buffers_backend_clocksweep above, but that's probably not best, as the existing buffers_backend counts buffer *writes*, not allocations). I think we would definitely want to retain this counter in the final patch, as an additional column in pg_stat_bgwriter.

>>> d. Autotune the low and high threshold for freelist for various
>>> configurations.
>>
>> I think we need to come up with some kind of formula here rather than just a list of hard-coded constants.
>
> That was my initial intention as well and I have tried based
> on number of shared buffers like keeping threshold values as
> percentage of shared buffers but nothing could satisfy different
> kind of workloads. The current values I have choosen are based
> on experiments for various workloads at different thresholds. I have
> shown the lwlock_stats data for various loads based on current
> thresholds upthread. Another way could be to make them as config
> knobs and use the values as given by user incase it is provided by
> user else go with fixed values.

How did you go about determining the optimal value for a particular workload? When the list is kept short, it's less likely that a value on the list will be referenced or dirtied again before the page is actually recycled. That's clearly good. But when the list is long, it's less likely to become completely empty and thereby force individual backends to run the clock-sweep.

My suspicion is that, when the number of buffers is small, the impact of the list being too short isn't likely to be very significant, because running the clock-sweep isn't all that expensive anyway - even if you have to scan through the entire buffer pool multiple times, there aren't that many buffers. But when the number of buffers is large, those repeated scans can cause a major performance hit, so having an adequate pool of free buffers becomes much more important.

I think your list of high-watermarks is far too generous for low buffer counts. With more than 100k shared buffers, you've got a high-watermark of 2k buffers, which means that 2% or less of the buffers will be on the freelist, which seems a little on the high side to me, but probably in the ballpark of what we should be aiming for. But at 10001 shared buffers, you can have 1000 of them on the freelist, which is 10% of the buffer pool; that seems high. At 101 shared buffers, 75% of the buffers in the system can be on the freelist; that seems ridiculous. The chances of a buffer still being unused by the time it reaches the head of the freelist seem very small.

Based on your existing list of thresholds, and taking the above into account, I'd suggest something like this: let the high-watermark for the freelist be 0.5% of the total number of buffers, with a maximum of 2000 and a minimum of 5. Let the low-watermark be 20% of the high-watermark. That might not be best, but I think some kind of formula like that can likely be made to work.

I would suggest focusing your testing on configurations with *large* settings for shared_buffers, say 1-64GB, rather than small configurations. Anyone who cares greatly about performance isn't going to be running with only 8MB of shared_buffers anyway. Arguably we shouldn't even run the reclaim process on very small configurations; I think there should probably be a GUC (PGC_SIGHUP) to control whether it gets launched.

I think it would be a good idea to analyze how frequently the reclaim process gets woken up. In the worst case, this happens once per (high watermark - low watermark) allocations; that is, the system reaches the low watermark and then does no further allocations until the reclaim process brings the freelist back up to the high watermark. But if more allocations occur between the time the reclaim process is woken and the time it reaches the high watermark, then it should run for longer, until the high watermark is reached. At least for debugging purposes, I think it would be useful to have a counter of reclaim wakeups. I'm not sure whether that's worth including in the final patch, but it might be.

> That will certainly help in retaining the current behaviour of
> bgwriter and make the idea cleaner. I will modify the patch
> to have a new background process unless somebody thinks
> otherwise.
>
> If we go with this approach, one thing which we need to decide
> is what to do incase buf which has usage_count as zero is *dirty*,
> as I don't think it is good idea to put it in freelist.

I thought a bit about this yesterday. I think the problem is that we might be in a situation where buffers are being dirtied faster than they can be cleaned. In that case, if we only put clean buffers on the freelist, then every backend in the system will be fighting over the ever-dwindling supply of clean buffers until, in the worst case, there's maybe only 1 clean buffer which is getting evicted repeatedly at top speed - or maybe even no clean buffers, and the reclaim process just spins in an infinite loop looking for clean buffers that aren't there.

To put that another way, the rate at which buffers are being dirtied can't exceed the rate at which they are being cleaned forever. Eventually, somebody is going to have to wait. Having the backends wait by being forced to write some dirty buffers does not seem like a bad way to accomplish that. So I favor just putting the buffers on the freelist without regard to whether they are clean or dirty. If this turns out not to work well we can look at other options (probably some variant of (b) from your list).

>> Instead, it would just run the clock sweep (i.e. the last loop inside
>> StrategyGetBuffer) and put the buffers onto the free list.
>
> Don't we need to do more than just last loop inside StrategyGetBuffer(),
> as clock sweep in strategy get buffer is responsible for getting one
> buffer with usage_count = 0 where as we need to run the loop till it
> finds and moves enough such buffers so that it can populate freelist
> with number of buffers equal to high water mark of freelist.

Yeah, that's what I meant. Of course, it should add each buffer to the freelist individually, not batch them up and add them all at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
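For reference, the suggested watermark formula works out as below. This is only a sketch of the arithmetic in the message above; the function name and the exact clamping order are illustrative and not taken from any posted patch.

/* Sketch of the suggested formula: high water mark is 0.5% of shared
 * buffers, clamped to [5, 2000]; low water mark is 20% of that. */
#include <stdio.h>

static void
freelist_watermarks(int NBuffers, int *low, int *high)
{
    int     hi = NBuffers / 200;        /* 0.5% of the buffer pool */

    if (hi > 2000)
        hi = 2000;
    if (hi < 5)
        hi = 5;

    *high = hi;
    *low = hi / 5;                      /* 20% of the high water mark */
}

int
main(void)
{
    int     sizes[] = {101, 10001, 131072, 1048576};    /* 1048576 bufs = 8GB */
    int     low, high;

    for (int i = 0; i < 4; i++)
    {
        freelist_watermarks(sizes[i], &low, &high);
        printf("NBuffers=%7d  high=%4d  low=%3d\n", sizes[i], high, low);
    }
    return 0;
}

With this, a 101-buffer pool gets a high water mark of only 5 rather than 75% of the pool, while an 8GB pool (1048576 buffers) caps out at 2000, which matches the proportions argued for above.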
On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
> On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > This essentially removes BgWriterDelay, but it's still mentioned in
> BgBufferSync(). Looking further, I see that with the patch applied,
> BgBufferSync() is still present in the source code but is no longer called
> from anywhere. Please don't submit patches that render things unused
> without actually removing them; it makes it much harder to see what you've
> changed. I realize you probably left it that way for testing purposes, but
> you need to clean such things up before submitting. Likewise, if you've
> rendered GUCs or statistics counters removed, you need to rip them out, so
> that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email almost unreadable because it misses quoting signs after linebreaks in quoted content.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
> > On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > This essentially removes BgWriterDelay, but it's still mentioned in
> > BgBufferSync(). Looking further, I see that with the patch applied,
> > BgBufferSync() is still present in the source code but is no longer called
> > from anywhere. Please don't submit patches that render things unused
> > without actually removing them; it makes it much harder to see what you've
> > changed. I realize you probably left it that way for testing purposes, but
> > you need to clean such things up before submitting. Likewise, if you've
> > rendered GUCs or statistics counters removed, you need to rip them out, so
> > that the scope of the changes you've made is clear to reviewers.
>
> FWIW, I found this email amost unreadable because it misses quoting
> signs after linebreaks in quoted content.
EnterpriseDB: http://www.enterprisedb.com
On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
> Overall, the main changes required in patch as per above feedback
> are:
> 1. add an additional counter for the number of those
> allocations not satisfied from the free list, with a
> name like buffers_alloc_clocksweep.
> 2. Autotune the low and high threshold values for buffers
> in freelist. In the patch, I have kept them as hard-coded
> values.
> 3. For populating freelist, have a separate process (bgreclaim)
> instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go to be honest. Seems like a huge bandaid to me.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>
> On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
> > Overall, the main changes required in patch as per above feedback
> > are:
> > 1. add an additional counter for the number of those
> > allocations not satisfied from the free list, with a
> > name like buffers_alloc_clocksweep.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. In the patch, I have kept them as hard-coded
> > values.
> > 3. For populating freelist, have a separate process (bgreclaim)
> > instead of doing it by bgwriter.
>
> I'm not convinced that 3) is the right way to go to be honest. Seems
> like a huge bandaid to me.
Doing both (populating freelist and flushing dirty buffers) via bgwriter
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Incidentally, while I generally think your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need significant cleanup. Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things that used to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not OK. I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else.
>
> Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save the values of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity. Release the spinlock. Then, set the bgwriterLatch if needed. In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock. Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.

Today, while working on updating the patch to improve locking, I found that as now we are going to have a new process, we need a separate latch in StrategyControl to wakeup that process.

Another point is I think it will be better to protect StrategyControl->completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it.

I thought it is better to mention the above points so that if you have any different thoughts about them, then it is better to discuss them now rather than after I take performance data with this locking protocol.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think you should get rid of BufFreelistLock completely and just
>>> decide that freelist_lck will protect everything the freeNext links, plus
>>> everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
>>> will protect nextVictimBuffer and nothing else.

> Another point is I think it will be better to protect
> StrategyControl->completePasses with victimbuf_lck rather than
> freelist_lck, as when we are going to update it we will already be
> holding the victimbuf_lck and it doesn't make much sense to release
> the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect fields a,b,c with one lock and fields x,y,z in the same struct with some other lock.

A minimum requirement for that to work safely at all is that the fields are of atomically fetchable/storable widths; which might be okay here but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to be in the same cache line which means contention between processes that are trying to fetch or store them concurrently. For a patch whose sole excuse for existence is to improve performance, that should be a very scary concern.

(And yes, I realize these issues already affect the freelist. Perhaps that's part of the reason we have performance issues with it.)

regards, tom lane
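The cache-line concern raised above can be illustrated with a small, self-contained C sketch: if the two spinlocks and the fields they protect are forced onto separate cache lines, operations on one no longer bounce the line holding the other. This shows the general padding technique only; the member names mirror the discussion, the 64-byte line size and the GCC/Clang alignment attribute are assumptions, and it is not the layout used by any posted patch.

/* Illustration of padding two locks onto separate cache lines. */
#include <stddef.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 64

typedef struct
{
    /* freelist_lck and the fields it protects share one cache line */
    volatile int freelist_lck __attribute__((aligned(CACHE_LINE_SIZE)));
    int         firstFreeBuffer;
    int         numBufferAllocs;

    /* victimbuf_lck and its field start on their own cache line */
    volatile int victimbuf_lck __attribute__((aligned(CACHE_LINE_SIZE)));
    int         nextVictimBuffer;
} PaddedStrategyControl;

int
main(void)
{
    printf("sizeof = %zu, offsetof(victimbuf_lck) = %zu\n",
           sizeof(PaddedStrategyControl),
           offsetof(PaddedStrategyControl, victimbuf_lck));
    return 0;
}

On a 64-byte-line machine this prints a structure size of 128 with victimbuf_lck starting at offset 64, i.e. the two locks can no longer falsely share a line; whether the extra memory is worth it is exactly the trade-off discussed downthread.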
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Another point is I think it will be better to protect
> > StrategyControl->completePasses with victimbuf_lck rather than
> > freelist_lck, as when we are going to update it we will already be
> > holding the victimbuf_lck and it doesn't make much sense to release
> > the victimbuf_lck and reacquire freelist_lck to update it.
>
> I'm rather concerned by this cavalier assumption that we can protect
> fields a,b,c with one lock and fields x,y,z in the same struct with some
> other lock.
In some cases, it could be beneficial especially when a,b,c are
> A minimum requirement for that to work safely at all is that the fields
> are of atomically fetchable/storable widths; which might be okay here
> but it's a restriction that bears thinking about (and documenting).
>
> But quite aside from safety, the fields are almost certainly going to
> be in the same cache line which means contention between processes that
> are trying to fetch or store them concurrently.
On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> I think you should get rid of BufFreelistLock completely and just
>>>> decide that freelist_lck will protect everything the freeNext links, plus
>>>> everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
>>>> will protect nextVictimBuffer and nothing else.
>
>> Another point is I think it will be better to protect
>> StrategyControl->completePasses with victimbuf_lck rather than
>> freelist_lck, as when we are going to update it we will already be
>> holding the victimbuf_lck and it doesn't make much sense to release
>> the victimbuf_lck and reacquire freelist_lck to update it.
>
> I'm rather concerned by this cavalier assumption that we can protect
> fields a,b,c with one lock and fields x,y,z in the same struct with some
> other lock.
>
> A minimum requirement for that to work safely at all is that the fields
> are of atomically fetchable/storable widths; which might be okay here
> but it's a restriction that bears thinking about (and documenting).
>
> But quite aside from safety, the fields are almost certainly going to
> be in the same cache line which means contention between processes that
> are trying to fetch or store them concurrently. For a patch whose sole
> excuse for existence is to improve performance, that should be a very
> scary concern.
>
> (And yes, I realize these issues already affect the freelist. Perhaps
> that's part of the reason we have performance issues with it.)

False sharing is certainly a concern that has crossed my mind while looking at Amit's work, but the performance numbers he's posted upthread are stellar. Maybe we can squeeze some additional performance out of this by padding out the cache lines, but it's probably minor compared to the gains he's already seeing.

I think we should focus on trying to lock in those gains, and then we can consider what further things we may want to do after that. If it turns out that structure-padding is among those things, that's easy enough to do as a separate patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Today, while working on updating the patch to improve locking
> I found that as now we are going to have a new process, we need
> a separate latch in StrategyControl to wakeup that process.
> Another point is I think it will be better to protect
> StrategyControl->completePasses with victimbuf_lck rather than
> freelist_lck, as when we are going to update it we will already be
> holding the victimbuf_lck and it doesn't make much sense to release
> the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable. I think the key thing at this point is to get a new version of the patch with the background reclaim running in a different process than the background writer. I don't see much point in fine-tuning the locking regimen until that's done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
All the data is in tps and taken using pgbench read-only load
Client Count/Patch_ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 58614 | 107370 | 140717 | 104357 | 65010 |
Patch | 60849 | 118701 | 165631 | 209226 | 213029 |
On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have updated the patch to address the feedback. Main changes are:
>
> 1. For populating freelist, have a separate process (bgreclaimer)
> instead of doing it by bgwriter.
> 2. Autotune the low and high threshold values for buffers
> in freelist. I have used the formula as suggested by you upthread.
> 3. Cleanup of locking regimen as discussed upthread (completely
> eliminated BufFreelist Lock).
> 4. Improved comments and general code cleanup.

Overall this looks quite promising to me. I had thought to call the new process just "bgreclaim" rather than "bgreclaimer", but perhaps your name is better after all. At least, it matches what we do elsewhere. But I don't care for the use "Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or else "bgreclaimer".

This is unclear:

+buffers for replacement. Earlier to protect freelist, we use LWLOCK as that
+is needed to perform clock sweep which is a longer operation, however now we
+are using two spinklocks freelist_lck and victimbuf_lck to perform operations
+on freelist and run clock sweep respectively.

I would drop the discussion of what was done before and say something like this: The data structures relating to buffer eviction are protected by two spinlocks. freelist_lck protects the freelist and related data structures, while victimbuf_lck protects information related to the current clock sweep condition.

+always in this list. We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background process
+reclaimer. The list is singly-linked using fields in the

I suggest: Allocating pages from this list is much cheaper than running the "clock sweep" algorithm, which may encounter many buffers that are poor candidates for eviction before finding a good candidate. Therefore, we have a background process called bgreclaimer which works to keep this list populated.

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with "bgreclaimer".

+and move the the unpinned and zero usage count buffers to freelist. It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.

s/keep/keeps/
s/become equal/reaches the/
s/high threshold/high water mark/
s/of freelist//

Please change the other places that say threshold to use the "water mark" terminology.

+        if (StrategyMoveBufferToFreeListEnd (bufHdr))

Extra space.

+ * buffers in consecutive cycles.

s/consecutive/later/

+        /* Execute the LRU scan */

s/LRU scan/clock sweep/ ?

+        while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop start and then just decremented. Shouldn't we loop and do the whole thing over once we reach the high watermark, only stopping when StrategySyncStartAndEnd() says num_to_free is 0?

+                /* choose next victim buffer to clean. */

This process doesn't clean buffers; it puts them on the freelist.

+ * high threshold of freelsit), we drasticaly reduce the odds for

Two typos.

+ * of buffers in freelist fall below low threshold of freelist.

s/fall/falls/

In freelist.c, it seems like a poor idea to have two spinlocks as consecutive structure members; they'll be in the same cache line, leading to false sharing. If we merge them into a single spinlock, does that hurt performance? If we put them further apart, e.g. by moving the freelist_lck to the start of the structure, followed by the latches, and leaving victimbuf_lck where it is, does that help performance?

+        /*
+         * If the buffer is pinned or has a nonzero usage_count, we cannot use
+         * it; discard it and retry.  (This can only happen if VACUUM put a
+         * valid buffer in the freelist and then someone else used it before
+         * we got to it.  It's probably impossible altogether as of 8.3, but
+         * we'd better check anyway.)
+         */
+

This comment is clearly obsolete.

> I have not yet added statistics (buffers_backend_clocksweep) as
> for that we need to add one more variable in BufferStrategyControl
> structure where I have already added few variables for this patch.
> I think it is important to have such a stat available via
> pg_stat_bgwriter, but not sure if it is worth to make the structure
> bit more bulky.

I think it's worth it.

> Another minor point is about changes in lwlock.h
> lwlock.h
> * if you remove a lock, consider leaving a gap in the numbering
> * sequence for the benefit of DTrace and other external debugging
> * scripts.
>
> As I have removed BufFreelist lock, I have adjusted the numbering
> as well in lwlock.h. There is a meesage on top of lock definitions
> which suggest to leave gap if we remove any lock, however I was not
> sure whether this case (removing the first element) can effect anything,
> so for now, I have adjusted the numbering.

Let's leave slot 0 unused, instead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> I have yet to collect data under varying loads, however I have
> collected performance data for 8GB shared buffers which shows
> reasonably good performance and scalability.
>
> I think the main part left for this patch is more data for various loads
> which I will share in next few days, however I think patch is ready for
> next round of review, so I will mark it as Needs Review.
I have collected more data with the patch. I understand that you have given more review comments due to which the patch requires changes; however, I think it will not affect the performance data to a great extent, and I have anyway taken the data, so sharing the same.
> Performance Data:
> -------------------------------
>
> Configuration and Db Details
> IBM POWER-7 16 cores, 64 hardware threads
> RAM = 64GB
> Database Locale =C
> checkpoint_segments=256
> checkpoint_timeout =15min
> scale factor = 3000
> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> Duration of each individual run = 5mins
>
> All the data is in tps and taken using pgbench read-only load
Common configuration remains same as above.
Client Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 56248 | 100112 | 121341 | 81128 | 56552 |
Patch | 59389 | 112483 | 157034 | 185740 | 166725 |
Client Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 56401 | 102557 | 121643 | 81686 | 57091 |
Patch | 59361 | 114813 | 157528 | 188209 | 167752 |
Client Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 60059 | 110582 | 152051 | 130718 | 97014 |
Patch | 61462 | 117377 | 169767 | 225803 | 229083 |
Client Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 60005 | 112928 | 153650 | 135203 | 36343 |
Patch | 61345 | 115569 | 168767 | 226150 | 36985 |
>
> On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have yet to collect data under varying loads, however I have
> > collected performance data for 8GB shared buffers which shows
> > reasonably good performance and scalability.
> >
> > I think the main part left for this patch is more data for various loads
> > which I will share in next few days, however I think patch is ready for
> > next round of review, so I will mark it as Needs Review.
>
> I have collected more data with the patch. I understand that you
> have given more review comments due to which patch require
> changes, however I think it will not effect the performance data
> to a great extent and I have anyway taken the data, so sharing the
> same.
>
>
> > Performance Data:
> > -------------------------------
> >
> > Configuration and Db Details
> > IBM POWER-7 16 cores, 64 hardware threads
> > RAM = 64GB
> > Database Locale =C
> > checkpoint_segments=256
> > checkpoint_timeout =15min
> > scale factor = 3000
> > Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> > Duration of each individual run = 5mins
> >
> > All the data is in tps and taken using pgbench read-only load
>
> Common configuration remains same as above.
Forgot to mention that data is a median of 3 runs and attached
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have updated the patch to address the feedback. Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".
I don't mind changing it, but currently used title is based on similar
title "Background Writer's Processing". It is used in previous
paragraph. Is there a reason to title this differently?
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".
Again what I have used is matching with BgWriter's explanation. I thought
it would be better if wording used is similar.
> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented.
It is based on the idea what bgwriter does for num_to_scan and
calling it once has advantage that we need to take freelist_lck
just once.
> Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?
Do you mean to say that for high water mark check, we should
always refer StrategySyncStartAndEnd() rather than getting the
value in the beginning?
Are you thinking that somebody else would have already put some
buffers onto freelist which would change the initially identified high water
mark?
>
> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing. If we merge them into a single spinlock,
> does that hurt performance?
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?
I can investigate.
> + /*
> + * If the buffer is pinned or has a nonzero usage_count,
> we cannot use
> + * it; discard it and retry. (This can only happen if VACUUM put a
> + * valid buffer in the freelist and then someone else
> used it before
> + * we got to it. It's probably impossible altogether as
> of 8.3, but
> + * we'd better check anyway.)
> + */
> +
>
> This comment is clearly obsolete.
Okay, but this patch hasn't changed anything w.r.t above comment,
so I haven't changed it. Do you want me to remove second part of
comment starting with "(This can only happen"?
On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> +Background Reclaimer's Processing
>> +---------------------------------
>>
>> I suggest titling this section "Background Reclaim".
>
> I don't mind changing it, but currently used title is based on similar
> title "Background Writer's Processing". It is used in previous
> paragraph. Is there a reason to title this differently?

Oh, I didn't see that. Seems like weird phrasing to me, but I guess
it's probably better to keep it consistent.

>> +The background reclaimer is designed to move buffers to freelist that are
>>
>> I suggest replacing the first three words of this sentence with
>> "bgreclaimer".
>
> Again what I have used is matching with BgWriter's explanation. I thought
> it would be better if wording used is similar.

OK.

>> + while (tmp_num_to_free > 0)
>>
>> I am not sure it's a good idea for this value to be fixed at loop
>> start and then just decremented.
>
> It is based on the idea what bgwriter does for num_to_scan and
> calling it once has advantage that we need to take freelist_lck
> just once.

Right, we shouldn't call it every loop iteration. However, consider
this scenario: there are no remaining buffers on the list and the high
watermark is 2000. We add 2000 buffers to the list. But by the time
we get done, other backends have already done 500 more allocations, so
now there are only 1500 buffers on the list. If this should occur, we
should add an additional 500 buffers to the list before we consider
sleeping. We want bgreclaimer to be able to run continuously if the
demand for buffers is high enough.

>> In freelist.c, it seems like a poor idea to have two spinlocks as
>> consecutive structure members; they'll be in the same cache line,
>> leading to false sharing. If we merge them into a single spinlock,
>> does that hurt performance?
>
> I have kept them separate so that backends searching for a buffer
> in freelist doesn't contend with bgreclaimer (while doing clock sweep)
> or clock sweep being done by other backends. I think it will be bit
> tricky to devise a test where this can hurt, however it doesn't seem
> too bad to have two separate locks in this case.

It's not. But if they are in the same cache line, they will behave
almost like one lock, because the CPU will lock the entire cache line
for each atomic op. See Tom's comments upthread.

> Okay, but this patch hasn't changed anything w.r.t above comment,
> so I haven't changed it. Do you want me to remove second part of
> comment starting with "(This can only happen"?

Right. Clearly it can happen again once we have this patch: that's
the entire point of the patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03/09/14 16:22, Amit Kapila wrote:
> On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>>
>>> I have yet to collect data under varying loads, however I have
>>> collected performance data for 8GB shared buffers which shows
>>> reasonably good performance and scalability.
>>>
>>> I think the main part left for this patch is more data for various loads
>>> which I will share in next few days, however I think patch is ready for
>>> next round of review, so I will mark it as Needs Review.
>>
>> I have collected more data with the patch. I understand that you
>> have given more review comments due to which patch require
>> changes, however I think it will not effect the performance data
>> to a great extent and I have anyway taken the data, so sharing the
>> same.
>>
>>
>>> Performance Data:
>>> -------------------------------
>>>
>>> Configuration and Db Details
>>> IBM POWER-7 16 cores, 64 hardware threads
>>> RAM = 64GB
>>> Database Locale =C
>>> checkpoint_segments=256
>>> checkpoint_timeout =15min
>>> scale factor = 3000
>>> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
>>> Duration of each individual run = 5mins
>>>
>>> All the data is in tps and taken using pgbench read-only load
>>
>> Common configuration remains same as above.
>
> Forgot to mention that data is a median of 3 runs and attached
> sheet contains data for individual runs.
>

Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Cheers

Mark
>
>
> Hi Amit,
>
> Results look pretty good. Does it help in the read-write case too?
> On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >> + while (tmp_num_to_free > 0)
> >>
> >> I am not sure it's a good idea for this value to be fixed at loop
> >> start and then just decremented.
> >
> > It is based on the idea what bgwriter does for num_to_scan and
> > calling it once has advantage that we need to take freelist_lck
> > just once.
>
> Right, we shouldn't call it every loop iteration. However, consider
> this scenario: there are no remaining buffers on the list and the high
> watermark is 2000. We add 2000 buffers to the list. But by the time
> we get done, other backends have already done 500 more allocations, so
> now there are only 1500 buffers on the list. If this should occur, we
> should add an additional 500 buffers to the list before we consider
> sleeping. We want bgreclaimer to be able to run continuously if the
> demand for buffers is high enough.
It's not difficult to handle such cases, but it can have downside also
for the cases where demand from backends is not high.
Consider in above case if instead of 500 more allocations, it just
does 5 more allocations, then bgreclaimer will again have to go through
the list and move 5 buffers and same can happen again by the time
it moves 5 buffers. Another point to keep in mind here is that in this
scheme the backend which first finds that the buffers in freelist falls
below low water mark can wake the bgreclaimer.
> >> In freelist.c, it seems like a poor idea to have two spinlocks as
> >> consecutive structure members; they'll be in the same cache line,
> >> leading to false sharing. If we merge them into a single spinlock,
> >> does that hurt performance?
> >
> > I have kept them separate so that backends searching for a buffer
> > in freelist doesn't contend with bgreclaimer (while doing clock sweep)
> > or clock sweep being done by other backends. I think it will be bit
> > tricky to devise a test where this can hurt, however it doesn't seem
> > too bad to have two separate locks in this case.
>
> It's not. But if they are in the same cache line, they will behave
> almost like one lock, because the CPU will lock the entire cache line
> for each atomic op. See Tom's comments upthread.
I think to avoid having them in same cache line, we might need to
add some padding (at least 72 bytes) as the structure size including both
the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.
I have taken performance data as well by keeping them further apart
as suggested by you upthread and by introducing padding, but the
difference in performance is less than 1.5% (on 64 and 128 client count)
which also might be due to variation of data across runs. So now to
proceed we have below options:
a. use two spinlocks as in patch, but keep them as far apart as possible.
This might not have an advantage as compared to what is used currently
in patch, but in future we can add padding to take the advantage if
possible (currently on PPC64, it doesn't show any noticeable advantage,
however on some other m/c, it might show the advantage).
b. use only one spinlock, this can have disadvantage in certain cases
as mentioned upthread, however those might not be usual cases, so for
now we can consider them as lower priority and can choose this option.
Another point in this regard is that I have to make use of volatile
pointer to prevent code rearrangement in this case.
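To illustrate option (a), a layout along the lines of the sketch below would keep the two spinlocks on different cache lines, assuming the 128-byte line size mentioned above. The type and struct names here are placeholders, not the patch's actual definitions, and the real structure has more members (firstFreeBuffer, numFreeListBuffers, latches, and so on):

typedef int slock_t;            /* placeholder for the real spinlock type */

#define CACHE_LINE_SIZE 128

typedef struct SketchStrategyControl
{
    slock_t     freelist_lck;   /* protects freelist and its counters */

    /* pad so victimbuf_lck starts on the next cache line */
    char        pad[CACHE_LINE_SIZE - sizeof(slock_t)];

    slock_t     victimbuf_lck;  /* protects the clock-sweep position */
} SketchStrategyControl;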
> Performance Data:
> -------------------------------
>
> Configuration and Db Details
> IBM POWER-7 16 cores, 64 hardware threads
> RAM = 64GB
> Database Locale =C
> checkpoint_segments=256
> checkpoint_timeout =15min
> scale factor = 3000
> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> Duration of each individual run = 5mins
>
> All the data is in tps and taken using pgbench read-only load

Common configuration remains same as above.
Shared_Buffers = 500MB
Client Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 56248 | 100112 | 121341 | 81128 | 56552 |
Patch | 59389 | 112483 | 157034 | 185740 | 166725 |
..
Observations
---------------------
1. Performance improvement is upto 2~3 times for higher client counts (64, 128).
2. For lower client count (8), we can see 2~5 % performance improvement.
3. Overall, this improves the read scalability.
4. For lower number of shared buffers, we see that there is a minor dip in tps
even after patch (it might be that we can improve it by tuning higher water mark
for the number of buffers on freelist, I will try this by varying high water mark).
Client Count/Patch_Ver (Data in tps) | 128 |
HM=0.5;LM=20 | 166725 |
HM=1;LM=20 | 166556 |
HM=2;LM=30 | 166463 |
HM=5;LM=30 | 166107 |
HM=10;LM=30 | 167231 |
On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Its not difficult to handle such cases, but it can have downside also
> for the cases where demand from backends is not high.
> Consider in above case if instead of 500 more allocations, it just
> does 5 more allocations, then bgreclaimer will again have to go through
> the list and move 5 buffers and same can happen again by the time
> it moves 5 buffers.

That's exactly the scenario in which we *want* the looping behavior.
If that's happening, then it means it's taking us exactly as long to
find 5 buffers as it takes the rest of the system to use 5 buffers.
We need to run continuously to keep up.

>> It's not. But if they are in the same cache line, they will behave
>> almost like one lock, because the CPU will lock the entire cache line
>> for each atomic op. See Tom's comments upthread.
>
> I think to avoid having them in same cache line, we might need to
> add some padding (at least 72 bytes) as the structure size including both
> the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.
> I have taken performance data as well by keeping them further apart
> as suggested by you upthread and by introducing padding, but the
> difference in performance is less than 1.5% (on 64 and 128 client count)
> which also might be due to variation of data across runs. So now to
> proceed we have below options:
>
> a. use two spinlocks as in patch, but keep them as far apart as possible.
> This might not have an advantage as compare to what is used currently
> in patch, but in future we can adding padding to take the advantage if
> possible (currently on PPC64, it doesn't show any noticeable advantage,
> however on some other m/c, it might show the advantage).
>
> b. use only one spinlock, this can have disadvantage in certain cases
> as mentioned upthread, however those might not be usual cases, so for
> now we can consider them as lower priority and can choose this option.

I guess I don't care that much. I only mentioned it because Tom
brought it up; I don't really see a big problem with the way you're
doing it.

> Another point in this regard is that I have to make use of volatile
> pointer to prevent code rearrangement in this case.

Yep. Or we need to get off our duff and fix it so that's not necessary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Its not difficult to handle such cases, but it can have downside also
>> for the cases where demand from backends is not high.
>> Consider in above case if instead of 500 more allocations, it just
>> does 5 more allocations, then bgreclaimer will again have to go through
>> the list and move 5 buffers and same can happen again by the time
>> it moves 5 buffers.
>
> That's exactly the scenario in which we *want* the looping behavior.
> If that's happening, then it means it's taking us exactly as long to
> find 5 buffers as it takes the rest of the system to use 5 buffers.
> We need to run continuously to keep up.

That's what I was thinking, as long as there isn't a lot of overhead
to starting and finishing a cycle. If there is, my inclination would be
to try to fix that rather than to sleep and hope things don't get out
of hand before it wakes up again.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> +Background Reclaimer's Processing
>>> +---------------------------------
>>>
>>> I suggest titling this section "Background Reclaim".
>>
>> I don't mind changing it, but currently used title is based on similar
>> title "Background Writer's Processing". It is used in previous
>> paragraph. Is there a reason to title this differently?
>
> Oh, I didn't see that. Seems like weird phrasing to me, but I guess
> it's probably better to keep it consistent.

... or you can also change the other one.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04/09/14 14:42, Amit Kapila wrote:
> On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
> wrote:
>>
>> Hi Amit,
>>
>> Results look pretty good. Does it help in the read-write case too?
>
> Last time I ran the tpc-b test of pgbench (results of which are
> posted earlier in this thread), there doesn't seem to be any major
> gain for that, however for cases where read is predominant, you
> might see better gains.
>
> I am again planing to take that data in next few days.
>

FWIW below are some test results on the 60 core beast with this patch
applied to 9.4. I'll need to do more runs to iron out the variation, but
it looks like the patch helps the standard (write heavy) pgbench workload
a little, and clearly helps the read only case.

4x E7-4890 15 cores each.
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD
Ubuntu 14.04 (Linux 3.13)
Postgres 9.4 beta2 + buffer eviction patch v5

Pgbench scale 2000

Non default params:
max_connections = 400;
shared_buffers = "10GB";
maintenance_work_mem = "1GB";
effective_io_concurrency = 10;
wal_buffers = "256MB";
checkpoint_segments = 1920;
checkpoint_completion_target = 0.8;
ssl = 'off';
wal_sync_method = 'open_datasync';

read write (elapsed 600s)

Clients | tps    | tps (unpatched)
--------+--------+----------------
      6 |   8279 |   8328
     12 |  16260 |  16381
     24 |  23639 |  23451
     48 |  31430 |  31004
     96 |  38516 |  34777
    192 |  33535 |  32443
    384 |  27978 |  25068
    384 |  30589 |  28798

read only (elapsed 300s)

Clients | tps    | tps (unpatched)
--------+--------+----------------
      6 |  57654 |  57255
     12 | 111361 | 112360
     24 | 220304 | 187967
     48 | 384567 | 230961
     96 | 380309 | 241947
    192 | 330865 | 214570
    384 | 315516 | 207548

Regards

Mark
>
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have updated the patch to address the feedback. Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> Overall this looks quite promising to me.
>
> I had thought to call the new process just "bgreclaim" rather than
> "bgreclaimer", but perhaps your name is better after all. At least,
> it matches what we do elsewhere. But I don't care for the use of
> "Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
> else "bgreclaimer".
Changed it to bgreclaimer.
> This is unclear:
>
> +buffers for replacement. Earlier to protect freelist, we use LWLOCK as that
> +is needed to perform clock sweep which is a longer operation, however now we
> +are using two spinklocks freelist_lck and victimbuf_lck to perform operations
> +on freelist and run clock sweep respectively.
>
> I would drop the discussion of what was done before and say something
> like this: The data structures relating to buffer eviction are
> protected by two spinlocks. freelist_lck protects the freelist and
> related data structures, while victimbuf_lck protects information
> related to the current clock sweep condition.
Changed, but I have not used exact wording mentioned above; let me know
if the revised wording needs further changes.
> +always in this list. We also throw buffers into this list if we consider
> +their pages unlikely to be needed soon; this is done by background process
> +reclaimer. The list is singly-linked using fields in the
>
> I suggest: Allocating pages from this list is much cheaper than
> running the "clock sweep" algorithm, which may encounter many buffers
> that are poor candidates for eviction before finding a good candidate.
> Therefore, we have a background process called bgreclaimer which works
> to keep this list populated.
Changed as per your suggestion.
> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".
>
> +The background reclaimer is designed to move buffers to freelist that are
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".
As per discussion in thread, I have kept it as it is.
> +and move the the unpinned and zero usage count buffers to freelist. It
> +keep on doing this until the number of buffers in freelist become equal
> +high threshold of freelist.
>
> s/keep/keeps/
> s/become equal/reaches the/
> s/high threshold/high water mark/
> s/of freelist//
Changed as per your suggestion.
> Please change the other places that say threshold to use the "water
> mark" terminology.
>
> + if (StrategyMoveBufferToFreeListEnd (bufHdr))
>
> Extra space.
>
> + * buffers in consecutive cycles.
>
> s/consecutive/later/
>
> + /* Execute the LRU scan */
>
> s/LRU scan/clock sweep/ ?
Changed as per your suggestion.
>
> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented. Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?
Okay, changed the loop as per your suggestion.
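For what it's worth, the reworked control flow could look roughly like the sketch below. This is illustrative only: the state variables and the clock-sweep helper are hypothetical stand-ins (the patch's actual code works against StrategyControl via StrategySyncStartAndEnd and the clock sweep). The point is simply that the shortfall against the high water mark is re-computed after every pass, so bgreclaimer only goes back to sleep once a fresh check finds nothing left to do:

/* Hypothetical stand-ins, just so the sketch is self-contained. */
static int  freelist_len;
static int  high_water_mark = 2000;

static int
move_one_buffer_to_freelist(void)
{
    /* trivial stub standing in for one successful clock-sweep step */
    freelist_len++;
    return 1;
}

/*
 * Keep refilling until a fresh check says the freelist has reached the
 * high water mark, instead of working from a count fixed once at entry.
 */
static void
reclaim_until_high_watermark(void)
{
    for (;;)
    {
        int     shortfall = high_water_mark - freelist_len;

        if (shortfall <= 0)
            break;              /* now it is safe to go back to sleep */

        while (shortfall > 0)
        {
            if (move_one_buffer_to_freelist())
                shortfall--;
        }
    }
}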
> + /* choose next victim buffer to clean. */
>
> This process doesn't clean buffers; it puts them on the freelist.
Right. Changed it to match what it does.
> + * high threshold of freelsit), we drasticaly reduce the odds for
>
> Two typos.
Fixed.
> + * of buffers in freelist fall below low threshold of freelist.
>
> s/fall/falls/
Changed as per your suggestion.
> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing. If we merge them into a single spinlock,
> does that hurt performance? If we put them further apart, e.g. by
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?
As per discussion, I have kept them as it is and added a comment
indicating that we can consider having both locks in separate
cache lines.
> + /*
> + * If the buffer is pinned or has a nonzero usage_count,
> we cannot use
> + * it; discard it and retry. (This can only happen if VACUUM put a
> + * valid buffer in the freelist and then someone else
> used it before
> + * we got to it. It's probably impossible altogether as
> of 8.3, but
> + * we'd better check anyway.)
> + */
> +
>
> This comment is clearly obsolete.
Removed.
> > I have not yet added statistics (buffers_backend_clocksweep) as
> > for that we need to add one more variable in BufferStrategyControl
> > structure where I have already added few variables for this patch.
> > I think it is important to have such a stat available via
> > pg_stat_bgwriter, but not sure if it is worth to make the structure
> > bit more bulky.
>
> I think it's worth it.
Okay added new statistic.
> > Another minor point is about changes in lwlock.h
> > lwlock.h
> > * if you remove a lock, consider leaving a gap in the numbering
> > * sequence for the benefit of DTrace and other external debugging
> > * scripts.
> >
> > As I have removed BufFreelist lock, I have adjusted the numbering
> > as well in lwlock.h. There is a message on top of lock definitions
> > which suggests leaving a gap if we remove any lock, however I was not
> > sure whether this case (removing the first element) can affect anything,
> > so for now, I have adjusted the numbering.
>
> Let's leave slot 0 unused, instead.
Sure, that makes sense.
Performance Data with updated patch
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
All the data is in tps and taken using pgbench read-only load
Client Count/Patch_Ver (tps) | 8 | 16 | 32 | 64 | 128 |
HEAD | 58614 | 107370 | 140717 | 104357 | 65010 |
Patch | 60092 | 113564 | 165014 | 213848 | 216065 |
Attachment
>
> On 04/09/14 14:42, Amit Kapila wrote:
>>
>> On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
>> wrote:
>>>
>>>
>>>
>>> Hi Amit,
>>>
>>> Results look pretty good. Does it help in the read-write case too?
>>
>>
>> Last time I ran the tpc-b test of pgbench (results of which are
>> posted earlier in this thread), there doesn't seem to be any major
>> gain for that, however for cases where read is predominant, you
>> might see better gains.
>>
>> I am again planing to take that data in next few days.
>>
>
> FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation,
>
Thanks for doing the test. I think if possible you can take
the performance data with higher scale factor (4000) as it
seems your m/c has 1TB of RAM. You might want to use
latest patch I have posted today.
>
> Apart from above, I think for this patch, cat version bump is required
> as I have modified system catalog. However I have not done the
> same in patch as otherwise it will be bit difficult to take performance
> data.
One regression failed on linux due to spacing issue which is
fixed in attached patch.
Attachment
On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Client Count/Patch_Ver (tps) 8 16 32 64 128
> HEAD 58614 107370 140717 104357 65010
> Patch 60092 113564 165014 213848 216065
>
> This data is median of 3 runs, detailed report is attached with mail.
> I have not repeated the test for all configurations, as there is no
> major change in design/algorithm which can effect performance.
> Mark has already taken tpc-b data which ensures that there is
> no problem with it, however I will also take it once with latest version.

Well, these numbers are pretty much amazing. Question: It seems
there's obviously quite a bit of contention left; do you think
there's still a significant amount of time in the clock sweep, or is
the primary bottleneck the buffer mapping locks?

merlin
On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Apart from above, I think for this patch, cat version bump is required
>> as I have modified system catalog. However I have not done the
>> same in patch as otherwise it will be bit difficult to take performance
>> data.
>
> One regression failed on linux due to spacing issue which is
> fixed in attached patch.

I took another read through this patch. Here are some further review comments:

1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
both num_to_free and tmp_num_to_free. I'd get rid of tmp_num_to_free
and move the declaration of num_to_free inside the outer loop. I'd
also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
tmp_recent_backend_clocksweep into the innermost scope in which they
are used.

2. Also in that function, I think the innermost bit of logic could be
rewritten more compactly, and in such a way as to make it clearer for
what set of instructions the buffer header will be locked.

    LockBufHdr(bufHdr);
    if (bufHdr->refcount == 0)
    {
        if (bufHdr->usage_count > 0)
            bufHdr->usage_count--;
        else
            add_to_freelist = true;
    }
    UnlockBufHdr(bufHdr);
    if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
        num_to_free--;

3. This comment is now obsolete:

+	/*
+	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+	 * not do so while holding freelist_lck; so set it after releasing the
+	 * freelist_lck.  This is annoyingly tedious, but it happens at most once
+	 * per bgwriter cycle, so the performance hit is minimal.
+	 */
+

We're not actually holding any lock in need of releasing at that point
in the code, so this can be shortened to "If bgwriterLatch is set, we
need to waken the bgwriter."

	 * Ideally numFreeListBuffers should get called under freelist spinlock,

That doesn't make any sense. numFreeListBuffers is a variable, so you
can't "call" it. The value should be *read* under the spinlock, but it
is. I think this whole comment can be deleted and replaced with "If the
number of free buffers has fallen below the low water mark, awaken the
bgreclaimer to repopulate it."

4. StrategySyncStartAndEnd() is kind of a mess. One, it can return the
same victim buffer that's being handed out - at almost the same time -
to a backend running the clock sweep; if it does, they'll fight over
the buffer. Two, the *end out parameter actually returns a count, not
an endpoint. I think we should have BgMoveBuffersToFreelist() call
StrategySyncNextVictimBuffer() at the top of the inner loop rather than
the bottom, and change StrategySyncStartAndEnd() so that it knows
nothing about victimbuf_lck. Let's also change StrategyGetBuffer() to
call StrategySyncNextVictimBuffer() so that the logic is centralized in
one place, and rename StrategySyncStartAndEnd() to something that
better matches its revised purpose. Maybe StrategyGetReclaimInfo().

5. Have you tested that this new bgwriter statistic is actually
working? Because it looks to me like BgMoveBuffersToFreelist is
changing BgWriterStats but never calling pgstat_send_bgwriter(), which
I'm thinking will mean the counters accumulate forever inside the
reclaimer but never get forwarded to the stats collector.

6. StrategyInitialize() uses #defines for
HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
for clamping. Let's have constants for all of those (and omit mention
of the specific values in the comments).

7. The line you've added to the definition of view pg_stat_bgwriter
doesn't seem to be indented the same way as all the others. Tab vs.
space problem?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Apart from above, I think for this patch, cat version bump is required
> as I have modified system catalog. However I have not done the
> same in patch as otherwise it will be bit difficult to take performance
> data.
One regression failed on linux due to spacing issue which is fixed in attached patch.
Thom
>
> On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Client Count/Patch_Ver (tps) 8 16 32 64 128
> > HEAD 58614 107370 140717 104357 65010
> > Patch 60092 113564 165014 213848 216065
> >
> > This data is median of 3 runs, detailed report is attached with mail.
> > I have not repeated the test for all configurations, as there is no
> > major change in design/algorithm which can effect performance.
> > Mark has already taken tpc-b data which ensures that there is
> > no problem with it, however I will also take it once with latest version.
>
> Well, these numbers are pretty much amazing. Question: It seems
> there's obviously quite a bit of contention left; do you think
> there's still a significant amount of time in the clock sweep, or is
> the primary bottleneck the buffer mapping locks?
> On 5 September 2014 14:19, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > Apart from above, I think for this patch, cat version bump is required
>> > as I have modified system catalog. However I have not done the
>> > same in patch as otherwise it will be bit difficult to take performance
>> > data.
>>
>> One regression failed on linux due to spacing issue which is
>> fixed in attached patch.
>
>
> Here's a set of test results against this patch:
Many thanks for taking the performance data. This data clearly shows
the benefit of the patch at higher client counts.
On 05/09/14 23:50, Amit Kapila wrote:
> On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
> >
> > On 04/09/14 14:42, Amit Kapila wrote:
> >>
> >> On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
> >> wrote:
> >>>
> >>> Hi Amit,
> >>>
> >>> Results look pretty good. Does it help in the read-write case too?
> >>
> >> Last time I ran the tpc-b test of pgbench (results of which are
> >> posted earlier in this thread), there doesn't seem to be any major
> >> gain for that, however for cases where read is predominant, you
> >> might see better gains.
> >>
> >> I am again planing to take that data in next few days.
> >
> > FWIW below are some test results on the 60 core beast with this patch
> > applied to 9.4. I'll need to do more runs to iron out the variation,
> > but it looks like the patch helps the standard (write heavy) pgbench
> > workload a little, and clearly helps the read only case.
>
> Thanks for doing the test. I think if possible you can take
> the performance data with higher scale factor (4000) as it
> seems your m/c has 1TB of RAM. You might want to use
> latest patch I have posted today.

Here's some fairly typical data from read-write and read-only runs at
scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not
seeing much variation between repeated read-write runs with the same
config (which is nice - sleep 30 and explicit checkpoint call between
each one seem to help there).

Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be
better at higher client counts than beta1 was...

In terms of the effect of the patch - looks pretty similar to the scale
2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even
better perhaps) than in the scale 2000 case.

read write (600s)

Clients | tps    | tps (unpatched)
--------+--------+----------------
      6 |   9395 |   9334
     12 |  16605 |  16525
     24 |  24634 |  24910
     48 |  32170 |  31275
     96 |  35675 |  36533
    192 |  35579 |  31137
    384 |  30528 |  28308

read only (300s)

Clients | tps    | tps (unpatched)
--------+--------+----------------
      6 |  35743 |  35362
     12 | 111019 | 106579
     24 | 199746 | 160305
     48 | 327026 | 198407
     96 | 380309 | 171863
    192 | 356623 | 152224
    384 | 340878 | 128308

regards

Mark
On Tue, Sep 9, 2014 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> One regression failed on linux due to spacing issue which is
>> fixed in attached patch.

I just read the latest patch by curiosity, wouldn't it make more sense
to split the patch into two different patches for clarity: one for the
reclaimer worker centralized around BgMoveBuffersToFreelist and a second
for the pg_stat portion? Those seem two different features.

Regards,
--
Michael
> On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Apart from above, I think for this patch, cat version bump is required
> >> as I have modified system catalog. However I have not done the
> >> same in patch as otherwise it will be bit difficult to take performance
> >> data.
> >
> > One regression failed on linux due to spacing issue which is
> > fixed in attached patch.
>
> I took another read through this patch. Here are some further review comments:
>
> 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
> both num_to_free and tmp_num_to_free.
num_to_free is used to accumulate total number of buffers that are
freed in one cycle of BgMoveBuffersToFreelist() which is reported
for debug info (BGW_DEBUG) and tmp_num_to_free is used as a temporary
number which is a count of number of buffers to be freed in one sub-cycle
(inner while loop)
> and move the declaration of num_to_free inside the outer loop. I'd
> also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
> tmp_recent_backend_clocksweep into the innermost scope in which they
> are used.
okay, I have moved the tmp_* variables in innermost scope.
> 2. Also in that function, I think the innermost bit of logic could be
> rewritten more compactly, and in such a way as to make it clearer for
> what set of instructions the buffer header will be locked.
> LockBufHdr(bufHdr);
> if (bufHdr->refcount == 0)
> {
>     if (bufHdr->usage_count > 0)
>         bufHdr->usage_count--;
>     else
>         add_to_freelist = true;
> }
> UnlockBufHdr(bufHdr);
> if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
>     num_to_free--;
Changed as per suggestion.
> 3. This comment is now obsolete:
>
> + /*
> + * If bgwriterLatch is set, we need to waken the bgwriter, but we should
> + * not do so while holding freelist_lck; so set it after releasing the
> + * freelist_lck. This is annoyingly tedious, but it happens
> at most once
> + * per bgwriter cycle, so the performance hit is minimal.
> + */
> +
>
> We're not actually holding any lock in need of releasing at that point
> in the code, so this can be shortened to "If bgwriterLatch is set, we
> need to waken the bgwriter."
Changed as per suggestion.
> * Ideally numFreeListBuffers should get called under freelist spinlock,
>
> That doesn't make any sense. numFreeListBuffers is a variable, so you
> can't "call" it. The value should be *read* under the spinlock, but
> it is. I think this whole comment can be deleted and replaced with
> "If the number of free buffers has fallen below the low water mark,
> awaken the bgreclaimer to repopulate it."
Changed as per suggestion.
> 4. StrategySyncStartAndEnd() is kind of a mess. One, it can return
> the same victim buffer that's being handed out - at almost the same
> time - to a backend running the clock sweep; if it does, they'll fight
> over the buffer. Two, the *end out parameter actually returns a
> count, not an endpoint. I think we should have
> BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
> top of the inner loop rather than the bottom, and change
> StrategySyncStartAndEnd() so that it knows nothing about
> victimbuf_lck. Let's also change StrategyGetBuffer() to call
> StrategySyncNextVictimBuffer() so that the logic is centralized in one
> place, and rename StrategySyncStartAndEnd() to something that better
> matches its revised purpose.
I have changed it to StrategyGetFreelistAccessInfo(), as most other
functions in freelist.c use names that refer to something related
to buffers.
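As a rough illustration of the centralization being suggested above, something like the helper sketched below would be the single place where the clock hand advances. The names and variables here are stand-ins; the real code advances StrategyControl->nextVictimBuffer under victimbuf_lck, which is reduced to comments so the fragment stays self-contained:

/* Stand-ins for StrategyControl fields; illustrative only. */
static unsigned int next_victim_buffer; /* protected by victimbuf_lck in the patch */
static unsigned int complete_passes;
static unsigned int n_buffers = 16384;  /* NBuffers stand-in */

/*
 * Advance the clock hand and return the slot to inspect next.  Both
 * bgreclaimer and backends would call this, so wraparound and
 * completePasses accounting live in exactly one place.
 */
static unsigned int
sync_next_victim_buffer(void)
{
    unsigned int victim;

    /* SpinLockAcquire(&victimbuf_lck); */
    victim = next_victim_buffer;
    if (++next_victim_buffer >= n_buffers)
    {
        next_victim_buffer = 0;
        complete_passes++;
    }
    /* SpinLockRelease(&victimbuf_lck); */

    return victim;
}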
> 5. Have you tested that this new bgwriter statistic is actually
> working? Because it looks to me like BgMoveBuffersToFreelist is
> changing BgWriterStats but never calling pgstat_send_bgwriter(), which
> I'm thinking will mean the counters accumulate forever inside the
> reclaimer but never get forwarded to the stats collector.
pgstat_send_bgwriter() is called in bgreclaimer loop (caller of
BgMoveBuffersToFreelist, this is similar to how bgwriter does).
I have done few tests with it before sending the previous patch.
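As a self-contained outline of that loop, a sketch might look like the following. The empty function bodies are stubs standing in for the patch's BgMoveBuffersToFreelist() and pgstat_send_bgwriter(); the real loop waits on the process latch and also handles shutdown and error recovery:

/* Empty stubs standing in for the patch's functions. */
static void BgMoveBuffersToFreelist(void) { }
static void pgstat_send_bgwriter(void) { }
static void wait_until_freelist_below_low_watermark(void) { }

/*
 * One pass per wakeup: refill the freelist up to the high water mark,
 * forward the accumulated counters to the stats collector, then sleep
 * until a backend signals that the freelist fell below the low water
 * mark.
 */
static void
bgreclaimer_main_loop(void)
{
    for (;;)
    {
        BgMoveBuffersToFreelist();
        pgstat_send_bgwriter();
        wait_until_freelist_below_low_watermark();
    }
}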
> 6. StrategyInitialize() uses #defines for
> HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
> LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
> for clamping. Let's have constants for all of those (and omit mention
> of the specific values in the comments).
Changed as per suggestion.
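For illustration, the clamping could be expressed with named constants along these lines. The percentage and clamp values are the ones quoted elsewhere in this thread; deriving the low water mark as a fraction of the high water mark is an assumption about the formula, not something spelled out here:

#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT    0.005
#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT     0.2
#define MIN_HIGH_WATER_MARK                         5
#define MAX_HIGH_WATER_MARK                         2000

static int
compute_high_water_mark(int nbuffers)
{
    int     hwm = (int) (nbuffers * HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT);

    if (hwm < MIN_HIGH_WATER_MARK)
        hwm = MIN_HIGH_WATER_MARK;
    if (hwm > MAX_HIGH_WATER_MARK)
        hwm = MAX_HIGH_WATER_MARK;
    return hwm;
}

static int
compute_low_water_mark(int high_water_mark)
{
    /* low water mark as a fraction of the high water mark (assumption) */
    return (int) (high_water_mark * LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT);
}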
> 7. The line you've added to the definition of view pg_stat_bgwriter
> doesn't seem to be indented the same way as all the others. Tab vs.
> space problem?
Fixed.
Performance Data:
-------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
All the data is in tps and taken using pgbench read-only load
Client Count/Patch_ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 58614 | 107370 | 140717 | 104357 | 65010 |
Patch | 61825 | 115152 | 170952 | 217389 | 220994 |
Attachment
>
> On 05/09/14 23:50, Amit Kapila wrote:
>>
>> On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
>> > FWIW below are some test results on the 60 core beast with this patch
>> applied to 9.4. I'll need to do more runs to iron out the variation,
>> > but it looks like the patch helps the standard (write heavy) pgbench
>> workload a little, and clearly helps the read only case.
>> >
>>
>> Thanks for doing the test. I think if possible you can take
>> the performance data with higher scale factor (4000) as it
>> seems your m/c has 1TB of RAM. You might want to use
>> latest patch I have posted today.
>>
>
> Here's some fairly typical data from read-write and read-only runs at scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not seeing much variation between repeated read-write runs with the same config (which is nice - sleep 30 and explicit checkpoint call between each one seem to help there).
>
> Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be better at higher client counts than beta1 was...
>
> In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case.
Yeah, that's what I was expecting, the benefit of this patch
will be more at higher client count when there is large data
and all the data can fit in RAM.
On 10/09/14 18:54, Amit Kapila wrote:
> On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
> >
> > In terms of the effect of the patch - looks pretty similar to the
> > scale 2000 results for read-write, but read-only is a different and more
> > interesting story - unpatched 9.4 is noticeably impacted in the higher
> > client counts, whereas the patched version scales as well (or even
> > better perhaps) than in the scale 2000 case.
>
> Yeah, that's what I was expecting, the benefit of this patch
> will be more at higher client count when there is large data
> and all the data can fit in RAM.
>
> Many thanks for doing the performance test for patch.

No worries, this is looking like a patch we're going to apply to 9.4 to
make the 60 core beast scale a bit better, so thanks very much for your
work in this area. If you would like more tests run at higher scales let
me know (we have two of these machines at pre-production state currently
so I can run benchmarks as reqd)!

Regards

Mark
Hi, On 2014-09-10 12:17:34 +0530, Amit Kapila wrote: > include $(top_srcdir)/src/backend/common.mk > diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c > new file mode 100644 > index 0000000..3df2337 > --- /dev/null > +++ b/src/backend/postmaster/bgreclaimer.c A fair number of comments in that file refer to bgwriter... > @@ -0,0 +1,302 @@ > +/*------------------------------------------------------------------------- > + * > + * bgreclaimer.c > + * > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It > + * attempts to keep regular backends from having to run clock sweep (which > + * they would only do when they don't find a usable shared buffer from > + * freelist to read in another page). That's not really accurate. Freelist pages are often also needed to write new pages, without reading anything in. I'd phrase it as "which they only need to do if they don't find a victim buffer from the freelist" > In the best scenario all requests > + * for shared buffers will be fulfilled from freelist as the background > + * reclaimer process always tries to maintain buffers on freelist. However, > + * regular backends are still empowered to run clock sweep to find a usable > + * buffer if the bgreclaimer fails to maintain enough buffers on freelist. "empowered" sounds strange to me. 'still can run the clock sweep'? > + * The bgwriter is started by the postmaster as soon as the startup subprocess > + * finishes, or as soon as recovery begins if we are doing archive recovery. Why only archive recovery? I guess (only read this far...) it's not just during InArchiveRecoveryb recovery but also StandbyMode? But I don't see why we shouldn't use it during normal crash recovery. That's also often painfully slow and the reclaimer could help? Less, but still. > +static void bgreclaim_quickdie(SIGNAL_ARGS); > +static void BgreclaimSigHupHandler(SIGNAL_ARGS); > +static void ReqShutdownHandler(SIGNAL_ARGS); > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. > + /* > + * If an exception is encountered, processing resumes here. > + * > + * See notes in postgres.c about the design of this coding. > + */ > + if (sigsetjmp(local_sigjmp_buf, 1) != 0) > + { > + /* Since not using PG_TRY, must reset error stack by hand */ > + error_context_stack = NULL; > + > + /* Prevent interrupts while cleaning up */ > + HOLD_INTERRUPTS(); > + > + /* Report the error to the server log */ > + EmitErrorReport(); > + > + /* > + * These operations are really just a minimal subset of > + * AbortTransaction(). We don't have very many resources to worry > + * about in bgreclaim, but we do have buffers and file descriptors. > + */ > + UnlockBuffers(); > + AtEOXact_Buffers(false); > + AtEOXact_Files(); No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). > + /* > + * Now return to normal top-level context and clear ErrorContext for > + * next time. > + */ > + MemoryContextSwitchTo(bgreclaim_context); > + FlushErrorState(); > + > + /* Flush any leaked data in the top-level context */ > + MemoryContextResetAndDeleteChildren(bgreclaim_context); > + > + /* Now we can allow interrupts again */ > + RESUME_INTERRUPTS(); Other processes sleep for a second here, I think that's a good idea. E.g. that bit: /* * Sleep at least 1 second after any error. A write error is likely * to be repeated,and we don't want to be filling the error logs as * fast as we can. 
*/ pg_usleep(1000000L); > + /* > + * Loop forever > + */ > + for (;;) > + { > + int rc; > + > + > + /* > + * Backend will signal bgreclaimer when the number of buffers in > + * freelist falls below than low water mark of freelist. > + */ > + rc = WaitLatch(&MyProc->procLatch, > + WL_LATCH_SET | WL_POSTMASTER_DEATH, > + -1); That's probably not going to work well directly after a (re)start of bgreclaim (depending on how you handle the water mark, I'll see in a bit). Maybe it should rather be ResetLatch(); BgMoveBuffersToFreelist(); pgstat_send_bgwriter(); rc = WaitLatch() if (rc & WL_POSTMASTER_DEATH) exit(1) > +Background Reclaimer's Processing > +--------------------------------- > + > +The background reclaimer is designed to move buffers to freelist that are > +likely to be recycled soon, thereby offloading the need to perform > +clock sweep work from active backends. To do this, it runs the clock sweep > +and move the the unpinned and zero usage count buffers to freelist. It > +keeps on doing this until the number of buffers in freelist reaches the > +high water mark. > + > +Two water mark indicators are used to maintain sufficient number of buffers > +on freelist. Low water mark indicator is used by backends to wake bgreclaimer > +when the number of buffers in freelist falls below it. High water mark > +indicator is used by bgreclaimer to move buffers to freelist. For me the description of the high water as stated here doesn't seem to explain anything. This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out? > /* > + * Move buffers with reference and usage_count as zero to freelist. > + * By maintaining enough number of buffers on freelist (equal to > + * high water mark of freelist), we drastically reduce the odds for > + * backend's to perform clock sweep. Move buffers with reference and a usage_count *of* zero to freelist. By maintaining enough buffers in the freelist (up to the list's high water mark), we drastically reduce the likelihood of individual backends having to perform the clock sweep themselves. > + * This is called by the background reclaim process when the number > + * of buffers in freelist falls below low water mark of freelist. > + */ The logic used here *definitely* needs to be documented in another form somewhere in the source. > +void > +BgMoveBuffersToFreelist(void) > +{ > + uint32 num_to_free = 0; > + uint32 recent_alloc = 0; > + uint32 recent_backend_clocksweep = 0; > + volatile uint32 next_victim = 0; > + > + /* Execute the clock sweep */ > + for (;;) > + { > + uint32 tmp_num_to_free; > + uint32 tmp_recent_alloc; > + uint32 tmp_recent_backend_clocksweep; > + > + StrategyGetFreelistAccessInfo(&tmp_num_to_free, > + &tmp_recent_alloc, > + &tmp_recent_backend_clocksweep); > + > + num_to_free += tmp_num_to_free; > + recent_alloc += tmp_recent_alloc; > + recent_backend_clocksweep += tmp_recent_backend_clocksweep; > + > + if (tmp_num_to_free == 0) > + break; num_to_free isn't a convincing name if I understand what this is doing correctly. Maybe 'move_to_freelist', 'freelist_needed', 'needed_on_freelist' or something like that? I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? 
> + while (tmp_num_to_free > 0) > + { > + volatile BufferDesc *bufHdr; > + bool add_to_freelist = false; > + > + /* > + * Choose next victim buffer to look if that can be moved > + * to freelist. > + */ > + StrategySyncNextVictimBuffer(&next_victim); > + > + bufHdr = &BufferDescriptors[next_victim]; > + > + /* > + * If the buffer is pinned or has a nonzero usage_count, we cannot > + * move it to freelist; decrement the usage_count (unless pinned) > + * and keep scanning. > + */ > + LockBufHdr(bufHdr); Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it very shortly afterwards will be touched afterwards by the pinning backend. > +/* > + * Water mark indicators for maintaining buffers on freelist. When the > + * number of buffers on freelist drops below the low water mark, the > + * allocating backend sets the latch and bgreclaimer wakesup and begin > + * adding buffer's to freelist until it reaches high water mark and then > + * again goes back to sleep. > + */ s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/; /to freelist/to the freelist/; s/reaches high water/reaches the high water/ > +int freelistLowWaterMark; > +int freelistHighWaterMark; > + > +/* Percentage indicators for maintaining buffers on freelist */ > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005 > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2 > +#define MIN_HIGH_WATER_MARK 5 > +#define MAX_HIGH_WATER_MARK 2000 I'm confused. The high water mark percentage is smaller than the low water mark? What's the reasoning for these numbers? What's the justification for the max of 2k buffers for the high watermark? That's not much on a busy database with large s_b? > - *lock_held = true; > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > + SpinLockAcquire(&StrategyControl->freelist_lck); > > /* > - * We count buffer allocation requests so that the bgwriter can estimate > - * the rate of buffer consumption. Note that buffers recycled by a > - * strategy object are intentionally not counted here. > + * We count buffer allocation requests so that the bgwriter or bgreclaimer > + * can know the rate of buffer consumption and report it as stats. Note > + * that buffers recycled by a strategy object are intentionally not counted > + * here. > */ > StrategyControl->numBufferAllocs++; > > /* > - * If bgwriterLatch is set, we need to waken the bgwriter, but we should > - * not do so while holding BufFreelistLock; so release and re-grab. This > - * is annoyingly tedious, but it happens at most once per bgwriter cycle, > - * so the performance hit is minimal. > + * Remember the values of bgwriter and bgreclaimer latch so that they can > + * be set outside spin lock and try to get a buffer from the freelist. > */ > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch; > bgwriterLatch = StrategyControl->bgwriterLatch; I don't understand why these need to be grabbed under the spinlock? > if (bgwriterLatch) > - { > StrategyControl->bgwriterLatch = NULL; > - LWLockRelease(BufFreelistLock); > - SetLatch(bgwriterLatch); > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > - } > > - /* > - * Try to get a buffer from the freelist. Note that the freeNext fields > - * are considered to be protected by the BufFreelistLock not the > - * individual buffer spinlocks, so it's OK to manipulate them without > - * holding the spinlock. 
> - */ > - while (StrategyControl->firstFreeBuffer >= 0) > + numFreeListBuffers = StrategyControl->numFreeListBuffers; > + > + if (StrategyControl->firstFreeBuffer >= 0) > { > buf = &BufferDescriptors[StrategyControl->firstFreeBuffer]; > Assert(buf->freeNext != FREENEXT_NOT_IN_LIST); > @@ -169,35 +198,82 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held) > /* Unconditionally remove buffer from freelist */ > StrategyControl->firstFreeBuffer = buf->freeNext; > buf->freeNext = FREENEXT_NOT_IN_LIST; > + --StrategyControl->numFreeListBuffers; > + } > + else > + StrategyControl->numBufferBackendClocksweep++; > + > + SpinLockRelease(&StrategyControl->freelist_lck); > + /* If bgwriterLatch is set, we need to waken the bgwriter */ > + if (bgwriterLatch) > + SetLatch(bgwriterLatch); > + > + /* > + * If the number of free buffers has fallen below the low water mark, > + * awaken the bgreclaimer to repopulate it. bgreclaimerLatch is initialized in > + * early phase of BgReclaimer startup, however we still check before using > + * it to avoid any problem incase we reach here before its initializion. > + */ > + if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch) > + SetLatch(StrategyControl->bgreclaimerLatch); > + > + if (buf != NULL) > + { > /* > - * If the buffer is pinned or has a nonzero usage_count, we cannot use > - * it; discard it and retry. (This can only happen if VACUUM put a > - * valid buffer in the freelist and then someone else used it before > - * we got to it. It's probably impossible altogether as of 8.3, but > - * we'd better check anyway.) > + * Try to get a buffer from the freelist. Note that the freeNext fields > + * are considered to be protected by the freelist_lck not the > + * individual buffer spinlocks, so it's OK to manipulate them without > + * holding the buffer spinlock. > */ > - LockBufHdr(buf); > - if (buf->refcount == 0 && buf->usage_count == 0) > + for(;;) > { > - if (strategy != NULL) > - AddBufferToRing(strategy, buf); > - return buf; > + /* > + * If the buffer is pinned or has a nonzero usage_count, we cannot use > + * it; discard it and retry. > + */ > + LockBufHdr(buf); > + if (buf->refcount == 0 && buf->usage_count == 0) > + { > + if (strategy != NULL) > + AddBufferToRing(strategy, buf); > + return buf; > + } > + UnlockBufHdr(buf); > + > + SpinLockAcquire(&StrategyControl->freelist_lck); > + > + if (StrategyControl->firstFreeBuffer >= 0) > + { > + buf = &BufferDescriptors[StrategyControl->firstFreeBuffer]; > + Assert(buf->freeNext != FREENEXT_NOT_IN_LIST); > + > + /* Unconditionally remove buffer from freelist */ > + StrategyControl->firstFreeBuffer = buf->freeNext; > + buf->freeNext = FREENEXT_NOT_IN_LIST; > + --StrategyControl->numFreeListBuffers; > + > + SpinLockRelease(&StrategyControl->freelist_lck); > + } > + else > + { > + StrategyControl->numBufferBackendClocksweep++; > + SpinLockRelease(&StrategyControl->freelist_lck); > + break; > + } > } > - UnlockBufHdr(buf); > } I think it makes sense to break out this bit into its own function. That'll make StrategyGetBuffer() a good bit easier to read. 
> /* Nothing on the freelist, so run the "clock sweep" algorithm */ > trycounter = NBuffers; > + > for (;;) > { > - buf = &BufferDescriptors[StrategyControl->nextVictimBuffer]; > + volatile uint32 next_victim; > > - if (++StrategyControl->nextVictimBuffer >= NBuffers) > - { > - StrategyControl->nextVictimBuffer = 0; > - StrategyControl->completePasses++; > - } > + StrategySyncNextVictimBuffer(&next_victim); > + > + buf = &BufferDescriptors[next_victim]; I'd also move this into its own function, but thats's more debatable. > /* > * If the buffer is pinned or has a nonzero usage_count, we cannot use > @@ -241,7 +317,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held) > void > StrategyFreeBuffer(volatile BufferDesc *buf) > { > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > + SpinLockAcquire(&StrategyControl->freelist_lck); > > /* > * It is possible that we are told to put something in the freelist that > @@ -253,11 +329,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf) > if (buf->freeNext < 0) > StrategyControl->lastFreeBuffer = buf->buf_id; > StrategyControl->firstFreeBuffer = buf->buf_id; > + ++StrategyControl->numFreeListBuffers; > + } > + > + SpinLockRelease(&StrategyControl->freelist_lck); > +} > + > +/* > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist > + */ > +bool > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf) > +{ Should maybe rather be named *Tail? > + bool freed = false; > + SpinLockAcquire(&StrategyControl->freelist_lck); > + > + /* > + * It is possible that we are told to put something in the freelist that > + * is already in it; don't screw up the list if so. > + */ When/Why is that possible? > /* > + * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer > + * to move unused buffers to freelist. > + * > + * The result is the number of buffers that are required to be moved to > + * freelist and count of recent buffer allocs and buffer allocs not > + * satisfied from freelist. > + */ > +void > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc, > + uint32 *num_buf_backend_clocksweep) > +{ > + int curfreebuffers; > + > + SpinLockAcquire(&StrategyControl->freelist_lck); > + curfreebuffers = StrategyControl->numFreeListBuffers; > + if (curfreebuffers < freelistHighWaterMark) > + *num_buf_to_free = freelistHighWaterMark - curfreebuffers; > + else > + *num_buf_to_free = 0; > + > + if (num_buf_alloc) > + { > + *num_buf_alloc = StrategyControl->numBufferAllocs; > + StrategyControl->numBufferAllocs = 0; > + } > + if (num_buf_backend_clocksweep) > + { > + *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep; > + StrategyControl->numBufferBackendClocksweep = 0; > + } > + SpinLockRelease(&StrategyControl->freelist_lck); > + > + return; > +} Do we need the if (num_buf_alloc) bits? Can't we make it unconditional? Some more general remarks: * I think there's a fair amount of unexplained heuristics in here * Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes busilyspinning on it... I think it's a acceptable risk, but we should think about it. * I'm not convinced that these changes can be made without also changing the bgwriter logic. Have you measured whether thereare differences in how effective the bgwriter is? Not that it's very effective right now :) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Thanks for reviewing, Andres.

On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> +static void bgreclaim_quickdie(SIGNAL_ARGS);
>> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
>> +static void ReqShutdownHandler(SIGNAL_ARGS);
>> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
>
> This looks inconsistent.

It's exactly the same as what bgwriter.c does.

> No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> good idea, regardless of it possibly being true today (which I'm not
> sure about yet).

We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea.

>> +Two water mark indicators are used to maintain sufficient number of buffers
>> +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
>> +when the number of buffers in freelist falls below it. High water mark
>> +indicator is used by bgreclaimer to move buffers to freelist.
>
> For me the description of the high water as stated here doesn't seem to
> explain anything.

Yeah, let me try to revise and expand on that a bit:

Background Reclaimer's Processing
---------------------------------

The background reclaimer runs the clock sweep to identify buffers that are good candidates for eviction and puts them on the freelist. This makes buffer allocation much faster, since removing a buffer from the head of a linked list is much cheaper than linearly scanning the whole buffer pool until a promising candidate is found. It's possible that a buffer we add to the freelist may be accessed or even pinned before it's evicted; if that happens, the backend that would have evicted it will simply disregard it and take the next buffer instead (or run the clock sweep itself, if necessary). However, to make sure that doesn't happen too often, we need to keep the freelist as short as possible, so that there won't be many other buffer accesses between the time a buffer is added to the freelist and the time when it's actually evicted.

We use two water marks to control the activity of the bgreclaimer process. Each time bgreclaimer is awoken, it will move buffers to the freelist until the length of the free list reaches the high water mark. It will then sleep. When the number of buffers on the freelist reaches the low water mark, backends attempting to allocate new buffers will set the bgreclaimer's latch, waking it up again. While it's important for the high water mark to be small (for the reasons described above), we also need to ensure adequate separation between the low and high water marks, so that the bgreclaimer isn't constantly being awoken to find just a handful of additional candidate buffers, and we need to ensure that the low watermark is adequate to keep the freelist from becoming completely empty before bgreclaimer has time to wake up and begin filling it again.

> This section should have a description of how the reclaimer interacts
> with the bgwriter logic. Do we put dirty buffers on the freelist that
> are then cleaned by the bgwriter? Which buffers does the bgwriter write
> out?

The bgwriter is cleaning ahead of the strategy point, and the bgreclaimer is advancing the strategy point.
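In rough code form, the wake/refill protocol described above might look like this. This is only a simplified sketch: names such as freelistLowWaterMark, freelistHighWaterMark, StrategyControl->bgreclaimerLatch and StrategyMoveBufferToFreeListTail() follow the patch, while StrategyNumFreeListBuffers() and ClockSweepNextCandidate() are made-up placeholders, not functions from the patch.

    /* bgreclaimer main loop (sketch) */
    for (;;)
    {
        /* Sleep until a backend notices the freelist is running low. */
        WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
        ResetLatch(&MyProc->procLatch);

        /* Refill the freelist up to the high water mark, then sleep again. */
        while (StrategyNumFreeListBuffers() < freelistHighWaterMark)
        {
            volatile BufferDesc *buf = ClockSweepNextCandidate();

            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
            {
                UnlockBufHdr(buf);
                StrategyMoveBufferToFreeListTail(buf);
            }
            else
            {
                /* pinned or recently used: age it and keep scanning */
                if (buf->usage_count > 0)
                    buf->usage_count--;
                UnlockBufHdr(buf);
            }
        }
    }

    /* Backend side (sketch): after popping a buffer in StrategyGetBuffer() */
    if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch)
        SetLatch(StrategyControl->bgreclaimerLatch);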
I think we should consider having the bgreclaimer wake the bgwriter if it comes across a dirty buffer, because while the bgwriter only estimates the rate of buffer allocation, bgreclaimer *knows* the rate of allocation, because its own activity is tied to the allocation rate. I think there's the potential for this kind of thing to make the background writer significantly more effective than it is today, but I'm heavily in favor of leaving it for a separate patch. > I wonder if we don't want to increase the high watermark when > tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. > Hm. Perhaps we should do a bufHdr->refcount != zero check without > locking here? The atomic op will transfer the cacheline exclusively to > the reclaimer's CPU. Even though it very shortly afterwards will be > touched afterwards by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. > * Are we sure that the freelist_lck spinlock won't cause pain? Right now > there will possibly be dozens of processes busily spinning on it... I > think it's a acceptable risk, but we should think about it. As you and I have talked about before, we could reduce contention here by partitioning the freelist, or by using a CAS loop to pop items off of it. But I am not convinced either is necessary; I think it's hard for the system to accumulate enough people hitting the freelist simultaneously to matter, because the stuff they've got to do between one freelist access and the next is generally going to be something much more expensive, like reading 8kB from the OS. One question in my mind is whether we ought to separate this patch into two - one for the changes to the locking regime, and another for the addition of the bgreclaimer process. Those things are really two different features, although they are tightly enough coupled that maybe it's OK to keep them together. I also think it would be good to get some statistics on how often regular backends are running the clocksweep vs. how often bgreclaimer is satisfying their needs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
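For what it's worth, the CAS-loop idea mentioned above would look roughly like the following on an index-linked freelist (buf_id -> next buf_id). This is illustration only, using C11 atomics rather than anything from the patch, and a bare CAS pop like this is exposed to the ABA problem, so a real implementation would need a version counter or similar.

    #include <stdatomic.h>

    #define FREENEXT_END_OF_LIST (-1)

    static _Atomic int firstFreeBuffer = FREENEXT_END_OF_LIST;  /* head of the freelist */
    static int freeNext[1024];          /* stand-in for NBuffers entries */

    static int
    freelist_pop(void)
    {
        int head = atomic_load(&firstFreeBuffer);

        while (head >= 0 &&
               !atomic_compare_exchange_weak(&firstFreeBuffer, &head,
                                             freeNext[head]))
        {
            /* 'head' was refreshed by the failed CAS; just retry */
        }
        return head;        /* FREENEXT_END_OF_LIST means the list was empty */
    }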
On 2014-09-11 09:02:34 -0400, Robert Haas wrote: > Thanks for reviewing, Andres. > > On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> +static void bgreclaim_quickdie(SIGNAL_ARGS); > >> +static void BgreclaimSigHupHandler(SIGNAL_ARGS); > >> +static void ReqShutdownHandler(SIGNAL_ARGS); > >> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); > > > > This looks inconsistent. > > It's exactly the same as what bgwriter.c does. So what? There's no code in common, so I see no reason to have one signal handler using underscores and the next one camelcase names. > > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a > > good idea, regardless of it possibly being true today (which I'm not > > sure about yet). > > We really need a more centralized way to handle error cleanup in > auxiliary processes. The current state of affairs is really pretty > helter-skelter. Agreed. There really should be three variants: * full abort including support for transactions * full abort without transactions being used (most background processes) * abort without shared memory interactions > But for this patch, I think we should aim to mimic > the existing style, as ugly as it is. Agreed. > Background Reclaimer's Processing > --------------------------------- > > The background reclaimer runs the clock sweep to identify buffers that > are good candidates for eviction and puts them on the freelist. This > makes buffer allocation much faster, since removing a buffer from the > head of a linked list is much cheaper than linearly scanning the whole > buffer pool until a promising candidate is found. It's possible that > a buffer we add to the freelist may be accessed or even pinned before > it's evicted; if that happens, the backend that would have evicted it > will simply disregard it and take the next buffer instead (or run the > clock sweep itself, if necessary). However, to make sure that doesn't > happen too often, we need to keep the freelist as short as possible, > so that there won't be many other buffer accesses between when the > time a buffer is added to the freelist and the time when it's actually > evicted. > > We use two water marks to control the activity of the bgreclaimer > process. Each time bgreclaimer is awoken, it will move buffers to the > freelist until the length of the free list reaches the high water > mark. It will then sleep. I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... > > This section should have a description of how the reclaimer interacts > > with the bgwriter logic. Do we put dirty buffers on the freelist that > > are then cleaned by the bgwriter? Which buffers does the bgwriter write > > out? > > The bgwriter is cleaning ahead of the strategy point, and the > bgreclaimer is advancing the strategy point. That sentence, in some form, should be in the above paragraph. > I think we should > consider having the bgreclaimer wake the bgwriter if it comes across a > dirty buffer, because while the bgwriter only estimates the rate of > buffer allocation, bgreclaimer *knows* the rate of allocation, because > its own activity is tied to the allocation rate. 
I think there's the > potential for this kind of thing to make the background writer > significantly more effective than it is today, but I'm heavily in > favor of leaving it for a separate patch. Yes, doing that sounds like a good plan. I'm happy with that being done in a separate patch. > > I wonder if we don't want to increase the high watermark when > > tmp_recent_backend_clocksweep > 0? > > That doesn't really work unless there's some countervailing force to > eventually reduce it again; otherwise, it'd just converge to infinity. > And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit sceptic about untunable, fixed, numbers proving to be accurate for widely varied workloads. > > Hm. Perhaps we should do a bufHdr->refcount != zero check without > > locking here? The atomic op will transfer the cacheline exclusively to > > the reclaimer's CPU. Even though it very shortly afterwards will be > > touched afterwards by the pinning backend. > > Meh. I'm not in favor of adding more funny games with locking unless > we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read IO intensive backend, there's pretty much only process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... > > * Are we sure that the freelist_lck spinlock won't cause pain? Right now > > there will possibly be dozens of processes busily spinning on it... I > > think it's a acceptable risk, but we should think about it. > > As you and I have talked about before, we could reduce contention here > by partitioning the freelist, or by using a CAS loop to pop items off > of it. But I am not convinced either is necessary; I think it's hard > for the system to accumulate enough people hitting the freelist > simultaneously to matter, because the stuff they've got to do between > one freelist access and the next is generally going to be something > much more expensive, like reading 8kB from the OS. > > One question in my mind is whether we ought to separate this patch > into two - one for the changes to the locking regime, and another for > the addition of the bgreclaimer process. Those things are really two > different features, although they are tightly enough coupled that > maybe it's OK to keep them together. I think it's ok to commit them together. Hm, although: It'd actually be highly interesting to see the effect of replacing the freelist lock with a spinlock without the rest of these changes. I think that's really a number we want to see at least once. > I also think it would be good to > get some statistics on how often regular backends are running the > clocksweep vs. how often bgreclaimer is satisfying their needs. I think that's necessary. The patch added buf_backend_clocksweep. Maybe we just also need buf_backend_from_freelist? Greetings, Andres Freund
>
> Thanks for reviewing, Andres.
>
> On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> >> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> >> +static void ReqShutdownHandler(SIGNAL_ARGS);
> >> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> It's exactly the same as what bgwriter.c does.
>
> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> We really need a more centralized way to handle error cleanup in
> auxiliary processes. The current state of affairs is really pretty
> helter-skelter. But for this patch, I think we should aim to mimic
> the existing style, as ugly as it is. I'm not sure whether Amit's got
> the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
> > We really need a more centralized way to handle error cleanup in
> > auxiliary processes. The current state of affairs is really pretty
> > helter-skelter. But for this patch, I think we should aim to mimic
> > the existing style, as ugly as it is. I'm not sure whether Amit's got
> > the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
> > is probably a good idea.
>
> Code related to bgreclaimer logic itself doesn't take any LWLock, do
> you suspect the same might be required due to some Signal/Interrupt
> handling?

I suspect it might creep in at some point at some unrelated place. Which will only ever break in production scenarios. Say, a lwlock in config file processing. I seem to recall somebody seeing a version of a patch adding a lwlock there... :). Or a logging hook. Or ...

The savings from not doing LWLockReleaseAll() are nonexistent, so ...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
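For reference, the kind of cleanup block being discussed, modelled loosely on what bgwriter.c's sigsetjmp() block does, might look like the sketch below. Exactly which of these calls bgreclaimer actually needs is the open question here; this is not the patch's code.

    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        /* Since not using PG_TRY, must reset error stack by hand */
        error_context_stack = NULL;

        /* Prevent interrupts while cleaning up */
        HOLD_INTERRUPTS();

        /* Report the error to the server log */
        EmitErrorReport();

        /*
         * Release anything shared we might be holding, even if we believe
         * bgreclaimer never takes it today.
         */
        LWLockReleaseAll();
        AbortBufferIO();
        UnlockBuffers();

        /* Discard the error state */
        FlushErrorState();

        /* Now we can allow interrupts again */
        RESUME_INTERRUPTS();

        /* Sleep at least 1 second after any error, as bgwriter.c does */
        pg_usleep(1000000L);
    }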
On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> It's exactly the same as what bgwriter.c does. > > So what? There's no code in common, so I see no reason to have one > signal handler using underscores and the next one camelcase names. /me shrugs. It's not always possible to have things be consistent with each other within a file and also with what gets done in other files. I'm not sure we should fault patch authors for choosing a different one than we would have chosen. FWIW, I probably would have done it the way Amit did it. I don't actually care, though. > I wonder if we should recheck the number of freelist items before > sleeping. As the latch currently is reset before sleeping (IIRC) we > might miss being woken up soon. It very well might be that bgreclaim > needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. >> > I wonder if we don't want to increase the high watermark when >> > tmp_recent_backend_clocksweep > 0? >> >> That doesn't really work unless there's some countervailing force to >> eventually reduce it again; otherwise, it'd just converge to infinity. >> And it doesn't really seem necessary at the moment. > > Right, it obviously needs to go both ways. I'm a bit sceptic about > untunable, fixed, numbers proving to be accurate for widely varied > workloads. Me, too, but I'm *even more* skeptical about making things complicated on the pure theory that a simple solution can't be correct. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. When, and if, we develop a theory about specifically how it falls short then, sure, let's adjust it. But I think it would be a serious error to try to engineer a perfect algorithm here based on the amount of testing that we can reasonably do pre-commit. We have no chance of getting that right, and I'd rather have an algorithm that is simple and imperfect than an algorithm that is complex and still imperfect. No matter what, it's better than what we have now. >> > Hm. Perhaps we should do a bufHdr->refcount != zero check without >> > locking here? The atomic op will transfer the cacheline exclusively to >> > the reclaimer's CPU. Even though it very shortly afterwards will be >> > touched afterwards by the pinning backend. >> >> Meh. I'm not in favor of adding more funny games with locking unless >> we can prove they're necessary for performance. > > Well, this in theory increases the number of processes touching buffer > headers regularly. Currently, if you have one read IO intensive backend, > there's pretty much only process touching the cachelines. This will make > it two. I don't think it's unreasonable to try to reduce the cacheline > pingpong caused by that... It's not unreasonable, but this is a good place to apply Knuth's first law of optimization. There's no proof we need to do this, so let's not until there is. >> I also think it would be good to >> get some statistics on how often regular backends are running the >> clocksweep vs. how often bgreclaimer is satisfying their needs. > > I think that's necessary. The patch added buf_backend_clocksweep. Maybe > we just also need buf_backend_from_freelist? That's just (or should be just) buf_alloc - buf_backend_clocksweep. 
I think buf_backend_clocksweep should really be called buf_alloc_clocksweep, and should be added (in all relevant places) right next to buf_alloc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-11 09:48:10 -0400, Robert Haas wrote: > On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I wonder if we should recheck the number of freelist items before > > sleeping. As the latch currently is reset before sleeping (IIRC) we > > might miss being woken up soon. It very well might be that bgreclaim > > needs to run for more than one cycle in a row to keep up... > > The outer loop in BgMoveBuffersToFreelist() was added to address > precisely this point, which I raised in a previous review. Hm, right. But then let's move BgWriterStats.m_buf_alloc =+, ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end up being continously busy without being visible. > >> > I wonder if we don't want to increase the high watermark when > >> > tmp_recent_backend_clocksweep > 0? > >> > >> That doesn't really work unless there's some countervailing force to > >> eventually reduce it again; otherwise, it'd just converge to infinity. > >> And it doesn't really seem necessary at the moment. > > > > Right, it obviously needs to go both ways. I'm a bit sceptic about > > untunable, fixed, numbers proving to be accurate for widely varied > > workloads. > > Me, too, but I'm *even more* skeptical about making things complicated > on the pure theory that a simple solution can't be correct. Fair enough. > I'm not > blind to the possibility that the current logic is inadequate, but > testing proves that it works well enough to produce a massive > performance boost over where we are now. But, to be honest, the testing so far was pretty "narrow" in the kind of workloads that were run if I crossread things accurately. Don't get me wrong, I'm *really* happy about having this patch, that just doesn't mean every detail is right ;) > >> > Hm. Perhaps we should do a bufHdr->refcount != zero check without > >> > locking here? The atomic op will transfer the cacheline exclusively to > >> > the reclaimer's CPU. Even though it very shortly afterwards will be > >> > touched afterwards by the pinning backend. > >> > >> Meh. I'm not in favor of adding more funny games with locking unless > >> we can prove they're necessary for performance. > > > > Well, this in theory increases the number of processes touching buffer > > headers regularly. Currently, if you have one read IO intensive backend, > > there's pretty much only process touching the cachelines. This will make > > it two. I don't think it's unreasonable to try to reduce the cacheline > > pingpong caused by that... > > It's not unreasonable, but this is a good place to apply Knuth's first > law of optimization. There's no proof we need to do this, so let's > not until there is. That's true for new (pieces of) software; less so, when working with a installed base that you might regress... But whatever. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
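The suggested change is roughly the following: report statistics once per outer-loop iteration in BgMoveBuffersToFreelist(), rather than once per wakeup, so a continuously busy bgreclaimer still shows up in the stats. This is only a sketch; StrategyGetFreelistAccessInfo() and the variable names follow the patch, and the clock-sweep body is elided.

    for (;;)
    {
        uint32      num_to_free;
        uint32      num_alloc;
        uint32      num_backend_clocksweep;

        StrategyGetFreelistAccessInfo(&num_to_free, &num_alloc,
                                      &num_backend_clocksweep);
        if (num_to_free == 0)
            break;

        /* ... run the clock sweep and move num_to_free buffers ... */

        /* make this cycle's work visible before (possibly) looping again */
        BgWriterStats.m_buf_alloc += num_alloc;
        pgstat_send_bgwriter();
    }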
On Thu, Sep 11, 2014 at 10:03 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-09-11 09:48:10 -0400, Robert Haas wrote: >> On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > I wonder if we should recheck the number of freelist items before >> > sleeping. As the latch currently is reset before sleeping (IIRC) we >> > might miss being woken up soon. It very well might be that bgreclaim >> > needs to run for more than one cycle in a row to keep up... >> >> The outer loop in BgMoveBuffersToFreelist() was added to address >> precisely this point, which I raised in a previous review. > > Hm, right. But then let's move BgWriterStats.m_buf_alloc =+, > ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end > up being continously busy without being visible. Good idea. >> I'm not >> blind to the possibility that the current logic is inadequate, but >> testing proves that it works well enough to produce a massive >> performance boost over where we are now. > > But, to be honest, the testing so far was pretty "narrow" in the kind of > workloads that were run if I crossread things accurately. Don't get me > wrong, I'm *really* happy about having this patch, that just doesn't > mean every detail is right ;) Oh, sure. Totally agreed. And, to the extent that we're improving things based on actual testing, I'm A-OK with that. I just don't want to start speculating, or we'll never get this thing off the ground. Some possibly-interesting test cases would be: (1) A read-only pgbench workload that is just a tiny bit larger than shared_buffers, say size of shared_buffers plus 0.01%. Such workloads tend to stress buffer eviction heavily. (2) A workload that maximizes the rate of concurrent buffer eviction relative to other tasks. Read-only pgbench is not bad for this, but maybe somebody's got a better idea. As I sort of mentioned in what I was writing for the bufmgr README, there are, more or less, three ways this can fall down, at least that I can see: (1) if the high water mark is too high, then we'll start finding buffers in the freelist that have already been touched since we added them: (2) if the low water mark is too low, the freelist will run dry; and (3) if the low and high water marks are too close together, the bgreclaimer will be constantly getting woken up and going to sleep again. I can't personally think of a workload that will enable us to get a better handle on those cases than high-concurrency pgbench, but you're known to be ingenious at coming up with destruction workloads, so if you have an idea, by all means fire away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > +++ b/src/backend/postmaster/bgreclaimer.c
>
> A fair number of comments in that file refer to bgwriter...
Will fix.
> > @@ -0,0 +1,302 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * bgreclaimer.c
> > + *
> > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It
> > + * attempts to keep regular backends from having to run clock sweep (which
> > + * they would only do when they don't find a usable shared buffer from
> > + * freelist to read in another page).
>
> That's not really accurate. Freelist pages are often also needed to
> write new pages, without reading anything in.
> they only need to do if they don't find a victim buffer from the
> freelist"
> > In the best scenario all requests
> > + * for shared buffers will be fulfilled from freelist as the background
> > + * reclaimer process always tries to maintain buffers on freelist. However,
> > + * regular backends are still empowered to run clock sweep to find a usable
> > + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
>
> "empowered" sounds strange to me. 'still can run the clock sweep'?
> > + * The bgwriter is started by the postmaster as soon as the startup subprocess
> > + * finishes, or as soon as recovery begins if we are doing archive recovery.
>
> Why only archive recovery? I guess (only read this far...) it's not just
> during InArchiveRecovery but also StandbyMode?
> why we shouldn't use it during normal crash recovery. That's also often
> painfully slow and the reclaimer could help? Less, but still.
>
> > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
>
> This looks inconsistent.
I have kept based on bgwriter, so not sure if it's good to change.
> > + /*
> > + * If an exception is encountered, processing resumes here.
> > + *
> > + * See notes in postgres.c about the design of this coding.
> > + */
> > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > + {
> > + /* Since not using PG_TRY, must reset error stack by hand */
> > + error_context_stack = NULL;
..
>
> No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> good idea, regardless of it possibly being true today (which I'm not
> sure about yet).
> > +
> > + /* Now we can allow interrupts again */
> > + RESUME_INTERRUPTS();
>
> Other processes sleep for a second here, I think that's a good
> idea. E.g. that bit:
Agreed, will make change as per suggestion.
>
> > + /*
> > + * Loop forever
> > + */
> > + for (;;)
> > + {
> > + int rc;
> > +
> > +
> > + /*
> > + * Backend will signal bgreclaimer when the number of buffers in
> > + * freelist falls below than low water mark of freelist.
> > + */
> > + rc = WaitLatch(&MyProc->procLatch,
> > + WL_LATCH_SET | WL_POSTMASTER_DEATH,
> > + -1);
>
> That's probably not going to work well directly after a (re)start of
> bgreclaim (depending on how you handle the water mark, I'll see in a
> bit).
>
> > +Background Reclaimer's Processing
> > +---------------------------------
..
> > +Two water mark indicators are used to maintain sufficient number of buffers
> > +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
> > +when the number of buffers in freelist falls below it. High water mark
> > +indicator is used by bgreclaimer to move buffers to freelist.
>
> For me the description of the high water as stated here doesn't seem to
> explain anything.
>
> This section should have a description of how the reclaimer interacts
> with the bgwriter logic. Do we put dirty buffers on the freelist that
> are then cleaned by the bgwriter? Which buffers does the bgwriter write
> out?
As discussed in thread, I will change this accordingly.
> > /*
> > + * Move buffers with reference and usage_count as zero to freelist.
> > + * By maintaining enough number of buffers on freelist (equal to
> > + * high water mark of freelist), we drastically reduce the odds for
> > + * backend's to perform clock sweep.
>
> Move buffers with reference and a usage_count *of* zero to freelist. By
> maintaining enough buffers in the freelist (up to the list's high water
> mark), we drastically reduce the likelihood of individual backends
> having to perform the clock sweep themselves.
Okay will rephrase it as per your suggestion.
> > + * This is called by the background reclaim process when the number
> > + * of buffers in freelist falls below low water mark of freelist.
> > + */
>
> The logic used here *definitely* needs to be documented in another form
> somewhere in the source.
> > +
> > + if (tmp_num_to_free == 0)
> > + break;
>
> num_to_free isn't a convincing name if I understand what this is doing
> correctly. Maybe 'move_to_freelist', 'freelist_needed',
> 'needed_on_freelist' or something like that?
>
> I wonder if we don't want to increase the high watermark when
> tmp_recent_backend_clocksweep > 0?
>
> > + while (tmp_num_to_free > 0)
> > + {
..
> > + /*
> > + * If the buffer is pinned or has a nonzero usage_count, we cannot
> > + * move it to freelist; decrement the usage_count (unless pinned)
> > + * and keep scanning.
> > + */
> > + LockBufHdr(bufHdr);
>
> Hm. Perhaps we should do a bufHdr->refcount != zero check without
> locking here? The atomic op will transfer the cacheline exclusively to
> the reclaimer's CPU. Even though it very shortly afterwards will be
> touched afterwards by the pinning backend.
> > +/*
> > + * Water mark indicators for maintaining buffers on freelist. When the
> > + * number of buffers on freelist drops below the low water mark, the
> > + * allocating backend sets the latch and bgreclaimer wakesup and begin
> > + * adding buffer's to freelist until it reaches high water mark and then
> > + * again goes back to sleep.
> > + */
>
> s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
> /to freelist/to the freelist/; s/reaches high water/reaches the high water/
Will change as per suggestions.
> > +int freelistLowWaterMark;
> > +int freelistHighWaterMark;
> > +
> > +/* Percentage indicators for maintaining buffers on freelist */
> > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> > +#define MIN_HIGH_WATER_MARK 5
> > +#define MAX_HIGH_WATER_MARK 2000
>
> I'm confused. The high water mark percentage is smaller than the low
> water mark?
> max of 2k buffers for the high watermark? That's not much on a busy
> database with large s_b?
> > - *lock_held = true;
> > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> > + SpinLockAcquire(&StrategyControl->freelist_lck);
> >
..
> > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> > bgwriterLatch = StrategyControl->bgwriterLatch;
>
> I don't understand why these need to be grabbed under the spinlock?
..
> > + if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch)
> > + SetLatch(StrategyControl->bgreclaimerLatch);
> > +
..
> > }
> > - UnlockBufHdr(buf);
> > }
>
> I think it makes sense to break out this bit into its own
> function. That'll make StrategyGetBuffer() a good bit easier to read.
I will try to move it into a new function GetBufferFromFreelist().
> > /* Nothing on the freelist, so run the "clock sweep" algorithm */
> > trycounter = NBuffers;
> > +
...
> > + buf = &BufferDescriptors[next_victim];
>
> I'd also move this into its own function, but thats's more debatable.
> > +/*
> > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> > + */
> > +bool
> > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> > +{
>
> Should maybe rather be named *Tail?
Tail is better suited than End, I will change it.
> > + bool freed = false;
> > + SpinLockAcquire(&StrategyControl->freelist_lck);
> > +
> > + /*
> > + * It is possible that we are told to put something in the freelist that
> > + * is already in it; don't screw up the list if so.
> > + */
>
> When/Why is that possible?
> > +void
> > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
> > + uint32 *num_buf_backend_clocksweep)
> > +
> > + if (num_buf_alloc)
> > + {
> > + *num_buf_alloc = StrategyControl->numBufferAllocs;
> > + StrategyControl->numBufferAllocs = 0;
> > + }
> > + if (num_buf_backend_clocksweep)
> > + {
> > + *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
> > + StrategyControl->numBufferBackendClocksweep = 0;
> > + }
> > + SpinLockRelease(&StrategyControl->freelist_lck);
> > +
> > + return;
> > +}
>
> Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?
>
> Some more general remarks:
> * I think there's a fair amount of unexplained heuristics in here
> * Are we sure that the freelist_lck spinlock won't cause pain? Right now
> there will possibly be dozens of processes busily spinning on it... I
> think it's a acceptable risk, but we should think about it.
> * I'm not convinced that these changes can be made without also changing
> the bgwriter logic. Have you measured whether there are differences in
> how effective the bgwriter is? Not that it's very effective right now :)
On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > Hm. Perhaps we should do a bufHdr->refcount != zero check without >> > locking here? The atomic op will transfer the cacheline exclusively to >> > the reclaimer's CPU. Even though it very shortly afterwards will be >> > touched afterwards by the pinning backend. >> >> Meh. I'm not in favor of adding more funny games with locking unless >> we can prove they're necessary for performance. > > Well, this in theory increases the number of processes touching buffer > headers regularly. Currently, if you have one read IO intensive backend, > there's pretty much only process touching the cachelines. This will make > it two. I don't think it's unreasonable to try to reduce the cacheline > pingpong caused by that... I don't think it will help much. A pinned buffer is pretty likely to be in modified state in the cache of the cpu of the pinning backend. Even taking a look at the refcount will trigger a writeback and demotion to shared state. When time comes to unpin the buffer the cacheline must again be promoted to exclusive state introducing coherency traffic. Not locking the buffer only saves transfering the cacheline back to the pinning backend, not a huge amount of savings. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 2014-09-12 12:38:48 +0300, Ants Aasma wrote: > On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > Hm. Perhaps we should do a bufHdr->refcount != zero check without > >> > locking here? The atomic op will transfer the cacheline exclusively to > >> > the reclaimer's CPU. Even though it very shortly afterwards will be > >> > touched afterwards by the pinning backend. > >> > >> Meh. I'm not in favor of adding more funny games with locking unless > >> we can prove they're necessary for performance. > > > > Well, this in theory increases the number of processes touching buffer > > headers regularly. Currently, if you have one read IO intensive backend, > > there's pretty much only process touching the cachelines. This will make > > it two. I don't think it's unreasonable to try to reduce the cacheline > > pingpong caused by that... > > I don't think it will help much. A pinned buffer is pretty likely to > be in modified state in the cache of the cpu of the pinning backend. Right. Unless you're on a MOESI platforms. I'd really like to know why that's not more widely used. > Even taking a look at the refcount will trigger a writeback and > demotion to shared state. When time comes to unpin the buffer the > cacheline must again be promoted to exclusive state introducing > coherency traffic. Not locking the buffer only saves transfering the > cacheline back to the pinning backend, not a huge amount of savings. Yes. But: In many, if not most, cases the cacheline will be read a couple times before modifying it via the spinlock. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
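For concreteness, the unlocked pre-check being discussed would look roughly like this in bgreclaimer's scan loop. It is not something the patch currently does; it only makes the cacheline argument concrete. The dirty read is harmless in the sense that a stale answer only costs skipping, or uselessly locking, one buffer header.

    if (bufHdr->refcount != 0)
        continue;               /* clearly pinned; don't touch the spinlock */

    LockBufHdr(bufHdr);
    if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
    {
        UnlockBufHdr(bufHdr);
        StrategyMoveBufferToFreeListTail(bufHdr);
    }
    else
        UnlockBufHdr(bufHdr);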
On 9/11/14, 7:01 AM, Andres Freund wrote: > I'm not convinced that these changes can be made without also changing > the bgwriter logic. Have you measured whether there are differences in > how effective the bgwriter is? Not that it's very effective right now :) The current background writer tuning went out of its way to do nothing when it wasn't clear there was something that always worked. What happened with all of the really clever schemes was that they worked on some workloads, and just trashed others. Most of the gain from the 8.3 rewrite came from looking at well theorized ideas for how to handle things like pre-emptive LRU scanning for writes, and just throwing them out altogether in favor of ignoring the problem. The magic numbers left in or added to the code were tuned to do very little work, intentionally. If anything, since then the pressure to do nothing has gone up in the default install, because now people are very concerned about extra wakeups using power. To find bad cases before, I was running about 30 different test combinations by the end, Heikki was running another set in the EDB lab, I believe there was a lab at NTT running their own set too. What went in was the combination that didn't break any of them badly--not the one that performed best on the good workloads. This looks like it's squashed one of the very fundamental buffer scaling issues though; well done Amit. What I'd like to see is preserving the heart of that while touching as little as possible. When in doubt, do nothing; let the backends suck it up and do the work themselves. I had to take a health break from community development for a while, and I'm hoping to jump back into review again for the rest of the current development cycle. I'll go back to my notes and try to recreate the pathological cases that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync spreading effort I did; get those running again and see how they do on this new approach. I have a decent sized 24 core server that should be good enough for this job. I'll see what I can do. -- Greg Smith greg.smith@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
> On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > > +++ b/src/backend/postmaster/bgreclaimer.c
> >
> > A fair number of comments in that file refer to bgwriter...
>
> Will fix.
Fixed.
> > > @@ -0,0 +1,302 @@
> > > +/*-------------------------------------------------------------------------
> > > + *
> > > + * bgreclaimer.c
> > > + *
> > > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It
> > > + * attempts to keep regular backends from having to run clock sweep (which
> > > + * they would only do when they don't find a usable shared buffer from
> > > + * freelist to read in another page).
> >
> > That's not really accurate. Freelist pages are often also needed to
> > write new pages, without reading anything in.
>
> Agreed, but the same is used in bgwriter file as well; so if we
> change here, we might want to change bgwriter file header as well.
I have just changed bgreclaimer for this comment, if the same
> > I'd phrase it as "which
> > they only need to do if they don't find a victim buffer from the
> > freelist"
>
> victim buffer sounds more like a buffer which it will get from
> clock sweep, how about next candidate (same is used in function
> header of StrategyGetBuffer()).
Fixed.
> > > In the best scenario all requests
> > > + * for shared buffers will be fulfilled from freelist as the background
> > > + * reclaimer process always tries to maintain buffers on freelist. However,
> > > + * regular backends are still empowered to run clock sweep to find a usable
> > > + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
> >
> > "empowered" sounds strange to me. 'still can run the clock sweep'?
>
> No harm in changing like what you are suggesting, however the same is
> used in file header of bgwriter.c as well, so I think lets keep this usage as
> it is because there is no correctness issue here.
Not changed anything for this comment.
> >
> > > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> I have kept based on bgwriter, so not sure if it's good to change.
> However, if we want consistency in naming, I would like to keep
> something like:
>
> ReclaimShutdownHandler
> ReclaimQuickDieHandler
> ..
> ..
Changed function names to make them consistent.
> > > + /*
> > > + * If an exception is encountered, processing resumes here.
> > > + *
> > > + * See notes in postgres.c about the design of this coding.
> > > + */
> > > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > > + {
> > > + /* Since not using PG_TRY, must reset error stack by hand */
> > > + error_context_stack = NULL;
> ..
> >
> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> I will add LWLockReleaseAll() in exception handling as discussed
> elsewhere in thread.
Done.
>
> > > +
> > > + /* Now we can allow interrupts again */
> > > + RESUME_INTERRUPTS();
> >
> > Other processes sleep for a second here, I think that's a good
> > idea. E.g. that bit:
>
> Agreed, will make change as per suggestion.
Done.
> >
> > > + /*
> > > + * Loop forever
> > > + */
> > > + for (;;)
> > > + {
> > > + int rc;
> > > +
> > > +
> > > + /*
> > > + * Backend will signal bgreclaimer when the number of buffers in
> > > + * freelist falls below than low water mark of freelist.
> > > + */
> > > + rc = WaitLatch(&MyProc->procLatch,
> > > + WL_LATCH_SET | WL_POSTMASTER_DEATH,
> > > + -1);
> >
> > That's probably not going to work well directly after a (re)start of
> > bgreclaim (depending on how you handle the water mark, I'll see in a
> > bit).
>
> Could you please be more specific here?
I wasn't sure if any change is required here, so kept the code
> >
> > > +Background Reclaimer's Processing
> > > +---------------------------------
> ..
> > > +Two water mark indicators are used to maintain sufficient number of buffers
> > > +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
> > > +when the number of buffers in freelist falls below it. High water mark
> > > +indicator is used by bgreclaimer to move buffers to freelist.
> >
> > For me the description of the high water as stated here doesn't seem to
> > explain anything.
> >
> > This section should have a description of how the reclaimer interacts
> > with the bgwriter logic. Do we put dirty buffers on the freelist that
> > are then cleaned by the bgwriter? Which buffers does the bgwriter write
> > out?
>
> As discussed in thread, I will change this accordingly.
Done
> > Move buffers with reference and a usage_count *of* zero to freelist. By
> > maintaining enough buffers in the freelist (up to the list's high water
> > mark), we drastically reduce the likelihood of individual backends
> > having to perform the clock sweep themselves.
>
> Okay will rephrase it as per your suggestion.
Done
> > > + * This is called by the background reclaim process when the number
> > > + * of buffers in freelist falls below low water mark of freelist.
> > > + */
> >
> > The logic used here *definitely* needs to be documented in another form
> > somewhere in the source.
>
> I think the way Robert has suggested to modify Readme addresses
> this to an extent, however if you think it is better to go in more
> detail, then I can expand function header of function
> BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?
>
> > > +
> > > + if (tmp_num_to_free == 0)
> > > + break;
> >
> > num_to_free isn't a convincing name if I understand what this is doing
> > correctly. Maybe 'move_to_freelist', 'freelist_needed',
> > 'needed_on_freelist' or something like that?
>
> I think keeping num or count in name could be more helpful as it is used
> to loop for finding usable buffers. How about 'num_needed_on_freelist',
> without any count or num it sounds more like a boolean variable.
Changed.
> >
> > I wonder if we don't want to increase the high watermark when
> > tmp_recent_backend_clocksweep > 0?
> >
> > > + while (tmp_num_to_free > 0)
> > > + {
> ..
> > > + /*
> > > + * If the buffer is pinned or has a nonzero usage_count, we cannot
> > > + * move it to freelist; decrement the usage_count (unless pinned)
> > > + * and keep scanning.
> > > + */
> > > + LockBufHdr(bufHdr);
> >
> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> > locking here? The atomic op will transfer the cacheline exclusively to
> > the reclaimer's CPU. Even though it very shortly afterwards will be
> > touched afterwards by the pinning backend.
>
> As per discussion, the conclusion seems to be that we can do some
> more tests to see if we need such a change, I will do more tests on
> the lines suggested by Robert in below mails and then we can decide
> if any thing is required.
Still performance data related to this needs to be collected.
> > > +/*
> > > + * Water mark indicators for maintaining buffers on freelist. When the
> > > + * number of buffers on freelist drops below the low water mark, the
> > > + * allocating backend sets the latch and bgreclaimer wakesup and begin
> > > + * adding buffer's to freelist until it reaches high water mark and then
> > > + * again goes back to sleep.
> > > + */
> >
> > s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
> > /to freelist/to the freelist/; s/reaches high water/reaches the high water/
>
> Will change as per suggestions.
Done
> > > +int freelistLowWaterMark;
> > > +int freelistHighWaterMark;
> > > +
> > > +/* Percentage indicators for maintaining buffers on freelist */
> > > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> > > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> > > +#define MIN_HIGH_WATER_MARK 5
> > > +#define MAX_HIGH_WATER_MARK 2000
> >
> > I'm confused. The high water mark percentage is smaller than the low
> > water mark?
>
> High water mark is percentage of NBuffers (shared_buffers).
> Low water mark is percentage of High water mark.
>
> I will add a comment.
Added a comment.
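In numbers, and assuming the 0.005 constant is applied as a plain fraction of NBuffers (i.e. 0.5%, which is consistent with the later remark that 0.5% of shared buffers exceeds 2000) and that the low water mark is 20% of the high water mark as described above, the marks work out like this. This is a standalone illustration, not the patch's code, and the exact clamping in the patch may differ slightly.

    #include <stdio.h>

    #define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
    #define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT  0.2
    #define MIN_HIGH_WATER_MARK 5
    #define MAX_HIGH_WATER_MARK 2000

    int
    main(void)
    {
        /* 8GB and 12GB of shared_buffers, expressed in 8kB pages */
        int sizes[] = {1048576, 1572864};

        for (int i = 0; i < 2; i++)
        {
            int nbuffers = sizes[i];
            int high = (int) (nbuffers * HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT);
            int low;

            if (high < MIN_HIGH_WATER_MARK)
                high = MIN_HIGH_WATER_MARK;
            if (high > MAX_HIGH_WATER_MARK)
                high = MAX_HIGH_WATER_MARK;
            low = (int) (high * LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT);

            printf("NBuffers=%d high=%d low=%d\n", nbuffers, high, low);
        }
        return 0;
    }

For both the 8GB and 12GB cases this clamps the high water mark to 2000 buffers with a low water mark of 400, which matches the "2000 (high water mark)" figure mentioned later in the thread.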
>
>
> > > - *lock_held = true;
> > > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> > > + SpinLockAcquire(&StrategyControl->freelist_lck);
> > >
> ..
> > > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> > > bgwriterLatch = StrategyControl->bgwriterLatch;
> >
> > I don't understand why these need to be grabbed under the spinlock?
>
> In earlier version of patch, it was done without spinlock,
> however Robert has given comment that as it was previously done
> with BufFreelist lock, these should be under spinlock (at least
> that is what I understood) and I tested the patch again by having
> them under spinlock and didn't find any difference, so I have moved
> them under spinlock.
Not changed anything related to this.
> ..
> > > + if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch)
> > > + SetLatch(StrategyControl->bgreclaimerLatch);
> > > +
> ..
> > > }
> > > - UnlockBufHdr(buf);
> > > }
> >
> > I think it makes sense to break out this bit into its own
> > function. That'll make StrategyGetBuffer() a good bit easier to read.
>
> I will try to move it into a new function GetBufferFromFreelist().
Moved this part of code into new function.
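Based on the diff upthread, the broken-out function might look something like the condensed sketch below; the actual function in the new patch may differ, and details such as the backend-clocksweep counter and the bgreclaimer wakeup are omitted here.

    /*
     * Sketch of GetBufferFromFreelist(): returns NULL if the freelist is
     * empty (caller falls back to the clock sweep), otherwise returns a
     * usable buffer with its header spinlock still held, as
     * StrategyGetBuffer() expects.
     */
    static volatile BufferDesc *
    GetBufferFromFreelist(BufferAccessStrategy strategy)
    {
        volatile BufferDesc *buf;

        for (;;)
        {
            SpinLockAcquire(&StrategyControl->freelist_lck);
            if (StrategyControl->firstFreeBuffer < 0)
            {
                SpinLockRelease(&StrategyControl->freelist_lck);
                return NULL;
            }
            buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;
            --StrategyControl->numFreeListBuffers;
            SpinLockRelease(&StrategyControl->freelist_lck);

            /* Skip buffers that were touched after being put on the freelist. */
            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
            {
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
            UnlockBufHdr(buf);
        }
    }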
>
> > > +/*
> > > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> > > + */
> > > +bool
> > > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> > > +{
> >
> > Should maybe rather be named *Tail?
>
> Tail is better suited than End, I will change it.
Changed.
> > > +void
> > > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
> > > + uint32 *num_buf_backend_clocksweep)
> > > +
> > > + if (num_buf_alloc)
> > > + {
> > > + *num_buf_alloc = StrategyControl->numBufferAllocs;
> > > + StrategyControl->numBufferAllocs = 0;
> > > + }
> > > + if (num_buf_backend_clocksweep)
> > > + {
> > > + *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
> > > + StrategyControl->numBufferBackendClocksweep = 0;
> > > + }
> > > + SpinLockRelease(&StrategyControl->freelist_lck);
> > > +
> > > + return;
> > > +}
> >
> > Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?
>
> This API is more or less designed on lines of StrategySyncStart().
> Currently there is no case where we can't do it unconditional (same is
> true for num_buf_backend_clocksweep), however there is some merit
> to keep it in sync with existing API. Having said that if you feel we should
> go about doing it unconditionally, I will change it in next patch.
I have made it unconditional.
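Presumably that ends up looking roughly like the sketch below, which is just the version quoted above with the NULL tests dropped; the function in the updated patch may differ in detail.

    void
    StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
                                  uint32 *num_buf_backend_clocksweep)
    {
        int         curfreebuffers;

        SpinLockAcquire(&StrategyControl->freelist_lck);

        curfreebuffers = StrategyControl->numFreeListBuffers;
        if (curfreebuffers < freelistHighWaterMark)
            *num_buf_to_free = freelistHighWaterMark - curfreebuffers;
        else
            *num_buf_to_free = 0;

        /* callers always pass valid pointers now, so no NULL tests needed */
        *num_buf_alloc = StrategyControl->numBufferAllocs;
        StrategyControl->numBufferAllocs = 0;

        *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
        StrategyControl->numBufferBackendClocksweep = 0;

        SpinLockRelease(&StrategyControl->freelist_lck);
    }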
> Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
> ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
> up being continously busy without being visible.
Done and I have removed pgstat_send_bgwriter() call from
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
wal_buffers = 256MB
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 30mins
Client_count/Patch_ver | 32 | 64 | 128 |
HEAD | 420 | 535 | 556 |
Patch (sbe_v8) | 431 | 537 | 568 |
On 14/09/14 19:00, Amit Kapila wrote: > On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith > <gregsmithpgsql@gmail.com <mailto:gregsmithpgsql@gmail.com>> wrote: > > > This looks like it's squashed one of the very fundamental buffer > > scaling issues though; well done Amit. > > Thanks. > > > I'll go back to my notes and try to recreate the pathological cases > > that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync > > spreading effort I did; get those running again and see how they > > do on this new approach. I have a decent sized 24 core server > > that should be good enough for this job. I'll see what I can do. > > It will be really helpful if you can try out those cases. > > And if you want 'em run on the 60 core beast, just let me know the details and I'll do some runs for you. Cheers Mark
On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
I will post the data with the latest patch separately (where I will focus on new cases discussed between Robert and Andres).
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
Client_Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 58614 | 107370 | 140717 | 104357 | 65010 |
sbe_v9 | 62943 | 119064 | 172246 | 220174 | 220904 |
> shared_buffers, say size of shared_buffers plus 0.01%. Such workloads
> tend to stress buffer eviction heavily.
Client_Count/Patch_Ver | 1 | 8 | 16 | 32 | 64 | 128 |
HEAD | 8406 | 68712 | 132222 | 198481 | 290340 | 289828 |
sbe_v9 | 8504 | 68546 | 131926 | 195789 | 289959 | 289021 |
Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)
Client_Count/Patch_Ver | 1 | 8 | 16 | 32 | 64 | 128 |
HEAD | 8428 | 68609 | 128092 | 196596 | 292066 | 293812 |
sbe_v9 | 8386 | 68546 | 126926 | 197126 | 289959 | 287621 |
Observations
> (2) A workload that maximizes the rate of concurrent buffer eviction
> relative to other tasks. Read-only pgbench is not bad for this, but
> maybe somebody's got a better idea.
I think the first test of pgbench (scale_factor-3000;shared_buffers-8GB)
> As I sort of mentioned in what I was writing for the bufmgr README,
> there are, more or less, three ways this can fall down, at least that
> I can see: (1) if the high water mark is too high, then we'll start
> finding buffers in the freelist that have already been touched since
> we added them:
In most cases performance with the patch is slightly less as compared to HEAD and the difference is generally less than 1%, and in a case or 2 close to 2%. I think the main reason for the slight difference is that when the size of shared buffers is almost the same as the data size, the number of buffers it needs from clock sweep is very small; as an example, in the first case (when the size of shared buffers is 12286MB), it actually needs at most 256 additional buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high water mark) additional buffers (0.5% of shared buffers is greater than 2000) in the free list, so bgreclaimer does some extra work when it is not required and it also leads to the condition you mentioned down-thread (the freelist will contain buffers that have already been touched since we added them). Now for case 2 (12166MB), we need more than 2000 additional buffers, but not too many, so it can also have a similar effect.
I think we have the below options related to this observation:
a. Some further tuning in bgreclaimer, so that instead of putting the buffers up to the high water mark on the freelist, it puts just 1/4th or 1/2 of the high water mark and then checks whether the free list still contains less than or equal to the low water mark; if yes it continues, and if not then it can wait (or maybe some other way).
b. Instead of waking bgreclaimer when the number of buffers falls below the low water mark, wake it when the number of times backends run the clock sweep crosses a certain threshold.
c. Expose the low and high water marks as config knobs, so that in some rare cases users can use them for tuning.
d. Let's not do anything, as if a user does such a configuration, he should be educated to configure shared buffers in a better way, and/or the performance hit doesn't seem to justify any further work.
On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> In most cases performance with the patch is slightly less as compared to HEAD and the difference is generally less than 1%, and in a case or 2 close to 2%. I think the main reason for the slight difference is that when the size of shared buffers is almost the same as the data size, the number of buffers it needs from clock sweep is very small; as an example, in the first case (when the size of shared buffers is 12286MB), it actually needs at most 256 additional buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high water mark) additional buffers (0.5% of shared buffers is greater than 2000) in the free list, so bgreclaimer does some extra work when it is not required and it also leads to the condition you mentioned down-thread (the freelist will contain buffers that have already been touched since we added them). Now for case 2 (12166MB), we need more than 2000 additional buffers, but not too many, so it can also have a similar effect.

So there are two suboptimal things that can happen and they pull in opposite directions. I think you should instrument the server to see how often each is happening. #1 is that we can pop a buffer from the freelist and find that it's been touched. That means we wasted the effort of putting it on the freelist in the first place. #2 is that we can want to pop a buffer from the freelist and find it empty and thus be forced to run the clock sweep ourselves. If we're having problem #1, we could improve things by reducing the water marks. If we're having problem #2, we could improve things by increasing the water marks. If we're having both problems, then I dunno. But let's get some numbers on the frequency of these specific things, rather than just overall tps numbers.
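Concretely, the two counts would be bumped in places roughly like these. The clock-sweep counter matches the patch's numBufferBackendClocksweep; numBuffersTouchedFreelist is a made-up name here, standing in for whatever counter ends up feeding the buffers_touched_freelist figure shown below. Both would be maintained under freelist_lck like the patch's other counters.

    /* #2: backend wanted a buffer but the freelist was empty */
    if (StrategyControl->firstFreeBuffer < 0)
        StrategyControl->numBufferBackendClocksweep++;   /* -> buffers_backend_clocksweep */

    /* #1: popped a freelist buffer, but it was touched after being added */
    LockBufHdr(buf);
    if (buf->refcount != 0 || buf->usage_count != 0)
        StrategyControl->numBuffersTouchedFreelist++;    /* -> buffers_touched_freelist */
    UnlockBufHdr(buf);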
Shared_Buffers - 12286MB (Total db size is 12288MB)
buffers_alloc              | 1531023
buffers_backend_clocksweep | 0
buffers_touched_freelist   | 0

Shared_Buffers - 12166MB (Total db size is 12288MB)
buffers_alloc              | 1531010
buffers_backend_clocksweep | 0
buffers_touched_freelist   | 0
Client_Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 68424 | 132540 | 195496 | 279511 | 283280 |
sbe_v9 | 68565 | 132709 | 194631 | 284351 | 289333 |
Client_Count/Patch_Ver | 8 | 16 | 32 | 64 | 128 |
HEAD | 68331 | 127752 | 196385 | 274387 | 281753 |
sbe_v9 | 68922 | 131314 | 194452 | 284292 | 287221 |
> d. Let's not do anything, as if a user does such a configuration, he should
> be educated to configure shared buffers in a better way, and/or the
> performance hit doesn't seem to be justified to do any further work.

At least worth entertaining.
Attachment
>
> Apart from this, I have changed oid of newly added function as
> due to recent commit, the oid I was using is no longer available.
After sending last patch I have realized that oid used in patch for
Attachment
On 9/16/14, 8:18 AM, Amit Kapila wrote:
> I think the main reason for slight difference is that when the size of
> shared buffers is almost same as data size, the number of buffers it needs
> from clock sweep are very less, as an example in first case (when size of
> shared buffers is 12286MB), it actually needs at most 256 additional
> buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high
> water mark) additional buffers (0.5% of shared buffers is greater than
> 2000) in free list, so bgreclaimer does some extra work when it is not
> required

This is exactly what I was warning about, as the sort of lesson learned from the last round of such tuning. There are going to be spots where trying to tune the code to be aggressive on the hard cases will work great. But you need to make that dynamic to some degree, such that the code doesn't waste a lot of time sweeping buffers when the demand for them is actually weak. That will make all sorts of cases that look like this slower.

We should be able to tell these apart if there's enough instrumentation and solid logic inside of the program itself though. The 8.3 era BGW coped with a lot of these issues using a particular style of moving average with fast reaction time, plus instrumenting the buffer allocation rate as accurately as it could. So before getting into high/low water mark questions, are you comfortable that there's a clear, accurate number that measures the activity level that's important here? And have you considered ways it might be averaging over time or have a history that's analyzed?

The exact fast approach / slow decay weighted moving average approach of the 8.3 BGW, the thing that tried to smooth the erratic data set possible here, was a pretty critical part of getting itself auto-tuning to workload size. It ended up being much more important than the work of setting the arbitrary watermark levels.
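For readers who don't remember the 8.3-era background writer internals, the "fast approach / slow decay" smoothing described above amounts to a one-sided weighted moving average. The sketch below is illustrative only: the function name and the 1/16 decay weight are assumptions for the example, not the actual bgwriter coefficients.

```c
/*
 * Illustrative sketch of a "fast attack, slow decay" moving average of the
 * buffer allocation rate.  The average jumps up immediately when demand
 * spikes, but bleeds off slowly when demand falls, so the writer/reclaimer
 * does not stop working the moment a burst ends.
 */
static double
sketch_smooth_alloc_rate(double smoothed_alloc, unsigned int recent_alloc)
{
    if ((double) recent_alloc >= smoothed_alloc)
        smoothed_alloc = (double) recent_alloc;     /* react to spikes at once */
    else
        smoothed_alloc +=
            ((double) recent_alloc - smoothed_alloc) / 16.0;   /* decay slowly */

    return smoothed_alloc;
}
```

The asymmetric update is the point: averaging symmetrically would both lag behind sudden demand and keep sweeping long after demand has collapsed, which is exactly the waste being warned about here.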
Specific numbers of both the configurations for which I have posted data in the previous mail are as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)
Client and Thread Count = 64

buffers_touched_freelist - count of buffers that backends found touched after popping from freelist.
buffers_backend_clocksweep - count of buffer allocations not satisfied from freelist.
buffers_alloc              1531023
buffers_backend_clocksweep 0
buffers_touched_freelist   0
--
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2014-09-23 10:31:24 -0400, Robert Haas wrote:
> I suggest we count these things:
>
> 1. The number of buffers the reclaimer has put back on the free list.
> 2. The number of times a backend has run the clocksweep.
> 3. The number of buffers past which the reclaimer has advanced the clock
> sweep (i.e. the number of buffers it had to examine in order to reclaim the
> number counted by #1).
> 4. The number of buffers past which a backend has advanced the clocksweep
> (i.e. the number of buffers it had to examine in order to allocate the
> number of buffers counted by #3).
> 5. The number of buffers allocated from the freelist which the backend did
> not use because they'd been touched (what you're calling
> buffers_touched_freelist).

Sounds good.

> It's hard to come up with good names for all of these things that are
> consistent with the somewhat wonky existing names. Here's an attempt:
>
> 1. bgreclaim_freelist

bgreclaim_alloc_clocksweep?

> 2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I
> think we want to make it more parallel with buffers_alloc, which is the
> number of buffers allocated, not buffers_backend, the number of buffers
> *written* by a backend)
> 3. clocksweep_bgreclaim
> 4. clocksweep_backend

I think bgreclaim/backend should always either be a prefix or a postfix, not a prefix in some variables and a postfix in others.

> 5. freelist_touched

I wonder if we shouldn't move all this to a new view, instead of stuffing it somewhere where it really doesn't belong. pg_stat_buffers or something like it.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote: > [ review ] Oh, by the way, I noticed that this patch breaks pg_buffercache. If we're going to have 128 lock partitions, we need to bump MAX_SIMUL_LWLOCKS. But this gets at another point: the way we're benchmarking this right now, we're really conflating the effects of three different things: 1. Changing the locking regimen around the freelist and clocksweep. 2. Adding a bgreclaimer process. 3. Raising the number of buffer locking partitions. I think it's pretty clear that #1 and #2 are a good idea. #3 is a mixed bag, and it might account for the regressions you saw on some test cases. Increasing the number of buffer mapping locks means that those locks take up more cache lines, which could slow things down in cases where there's no reduction in contention. It also means that the chances of an allocated buffer ending up in the same buffer mapping lock partition are 1/128 instead of 1/16, which means about 5.4 additional lwlock acquire/release cycles per 100 allocations. That's not a ton, but it's not free either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
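As a side note on the arithmetic behind that last figure: assuming the old and new buffer tags hash uniformly and independently across partitions, an allocation needs an extra lwlock acquire/release pair whenever the two tags land in different partitions. Going from 16 to 128 partitions therefore adds, per 100 allocations:

$$
100\left[\left(1-\tfrac{1}{128}\right)-\left(1-\tfrac{1}{16}\right)\right]
 = 100\left(\tfrac{1}{16}-\tfrac{1}{128}\right)
 = \tfrac{700}{128}\approx 5.47,
$$

which is where the "about 5.4" number comes from, give or take rounding.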
On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote: > But this gets at another point: the way we're benchmarking this right > now, we're really conflating the effects of three different things: > > 1. Changing the locking regimen around the freelist and clocksweep. > 2. Adding a bgreclaimer process. > 3. Raising the number of buffer locking partitions. I did some more experimentation on this. Attached is a patch that JUST does #1, and, as previously suggested, it uses a single spinlock instead of using two of them that are probably in the same cacheline. Without this patch, on a 32-client, read-only pgbench at scale factor 1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about LWLock-inspired preemption: - LWLockAcquire + 68.41% ReadBuffer_common + 31.59% StrategyGetBuffer With the patch, unsurprisingly, StrategyGetBuffer disappears and the only lwlocks that are causing context-switches are the individual buffer locks. But are we suffering spinlock contention instead as a result? Yes, but not much. s_lock is at 0.41% in the corresponding profile, and only 6.83% of those calls are from the patched StrategyGetBuffer. In a similar profile of master, s_lock is at 0.31%. So there's a bit of additional s_lock contention, but I think it's considerably less than the contention over the lwlock it's replacing, because the spinlock is only held for the minimal amount of time needed, whereas the lwlock could be held across taking and releasing many individual buffer locks. TPS results are a little higher with the patch - these are alternating 5 minute runs: master tps = 176010.647944 (including connections establishing) master tps = 176615.291149 (including connections establishing) master tps = 175648.370487 (including connections establishing) reduce-replacement-locking tps = 177888.734320 (including connections establishing) reduce-replacement-locking tps = 177797.842410 (including connections establishing) reduce-replacement-locking tps = 177894.822656 (including connections establishing) The picture is similar at 64 clients, but the benefit is a little more: master tps = 179037.231597 (including connections establishing) master tps = 180500.937068 (including connections establishing) master tps = 181565.706514 (including connections establishing) reduce-replacement-locking tps = 185741.503425 (including connections establishing) reduce-replacement-locking tps = 188598.728062 (including connections establishing) reduce-replacement-locking tps = 187340.977277 (including connections establishing) What's interesting is that I can't see in the perf output any real sign that the buffer mapping locks are slowing things down, but they clearly are, because when I take this patch and also boost NUM_BUFFER_PARTITIONS to 128, the performance goes up: reduce-replacement-locking-128 tps = 251001.812843 (including connections establishing) reduce-replacement-locking-128 tps = 247368.925927 (including connections establishing) reduce-replacement-locking-128 tps = 250775.304177 (including connections establishing) The performance also goes up if I do that on master, but the effect is substantially less: master-128 tps = 219301.492902 (including connections establishing) master-128 tps = 219786.249076 (including connections establishing) master-128 tps = 219821.220271 (including connections establishing) I think this shows pretty clearly that, even without the bgreclaimer, there's a lot of merit in getting rid of BufFreelistLock and using a spinlock held for the absolutely 
minimal number of instructions instead. There's already some benefit without doing anything about the buffer mapping locks, and we'll get a lot more benefit once that issue is addressed. I think we need to do some serious study to see whether bgreclaimer is even necessary, because it looks like this change alone, which is much simpler, takes a huge bite out of the scalability problem. I welcome further testing and comments, but my current inclination is to go ahead and push the attached patch. To my knowledge, nobody has at any point objected to this aspect of what's being proposed, and it looks safe to me and seems to be a clear win. We can then deal with the other issues on their own merits. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
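For readers who cannot open the attachment, a very rough sketch of the shape of the change being described follows. This is not the actual reduce-replacement-locking.patch: the struct and function names are invented, and the real code also has to deal with usage counts, pins, and buffer-access strategies, all omitted here.

```c
#include "postgres.h"
#include "storage/spin.h"

/* Invented names; the real patch modifies freelist.c's StrategyControl. */
typedef struct SketchStrategyControl
{
	slock_t		buffer_strategy_lock;	/* protects the two fields below */
	int			firstFreeBuffer;		/* head of freelist, -1 when empty */
	int			nextVictimBuffer;		/* clock-sweep hand */
} SketchStrategyControl;

/*
 * Pick a candidate victim buffer.  The spinlock is held only for the few
 * instructions needed to pop the freelist or advance the clock hand;
 * examining the buffer's usage_count/refcount happens outside it, so the
 * critical section stays tiny compared to holding BufFreelistLock across
 * the whole search.
 */
static int
sketch_get_victim(SketchStrategyControl *ctl, const int *freeNext, int nbuffers)
{
	int			victim;

	SpinLockAcquire(&ctl->buffer_strategy_lock);
	if (ctl->firstFreeBuffer >= 0)
	{
		victim = ctl->firstFreeBuffer;
		ctl->firstFreeBuffer = freeNext[victim];	/* unlink from freelist */
	}
	else
	{
		victim = ctl->nextVictimBuffer;				/* fall back to clock sweep */
		if (++ctl->nextVictimBuffer >= nbuffers)
			ctl->nextVictimBuffer = 0;
	}
	SpinLockRelease(&ctl->buffer_strategy_lock);

	return victim;				/* caller rechecks the buffer and may loop */
}
```

If the candidate turns out to be pinned or recently used, the caller simply calls the function again, which is why keeping the locked region this small matters: every retry re-enters the critical section.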
On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I did some more experimentation on this. Attached is a patch that > JUST does #1, and, ... ...and that was the wrong patch. Thanks to Heikki for pointing that out. Second try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I did some more experimentation on this. Attached is a patch that >> JUST does #1, and, ... > ...and that was the wrong patch. Thanks to Heikki for point that out. > Second try. But the results you gave in the previous message were correctly attributed? regards, tom lane
On Tue, Sep 23, 2014 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I did some more experimentation on this. Attached is a patch that >>> JUST does #1, and, ... > >> ...and that was the wrong patch. Thanks to Heikki for point that out. >> Second try. > > But the results you gave in the previous message were correctly > attributed? The patch I attached the first time was just the last commit in the git repository where I wrote the patch, rather than the changes that I made on top of that commit. So, yes, the results from the previous message are with the patch attached to the follow-up. I just typed the wrong git command when attempting to extract that patch to attach it to the email. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9/23/14, 10:31 AM, Robert Haas wrote: > I suggest we count these things: > > 1. The number of buffers the reclaimer has put back on the free list. > 2. The number of times a backend has run the clocksweep. > 3. The number of buffers past which the reclaimer has advanced the > clock sweep (i.e. the number of buffers it had to examine in order to > reclaim the number counted by #1). > 4. The number of buffers past which a backend has advanced the > clocksweep (i.e. the number of buffers it had to examine in order to > allocate the number of buffers count by #3). > 5. The number of buffers allocated from the freelist which the backend > did not use because they'd been touched (what you're calling > buffers_touched_freelist). All sound reasonable. To avoid wasting time here, I think it's only worth doing all of these as DEBUG level messages for now. Then only go through the overhead of exposing the ones that actually seem relevant. That's what I did for the 8.3 work, and most of that data at this level was barely relevant to anyone but me then or since. We don't want the system views to include so much irrelevant trivia that finding the important parts becomes overwhelming. I'd like to see that level of instrumentation--just the debug level messages--used to quantify the benchmarks that people are running already, to prove they are testing what they think they are. That would have caught the test error you already stumbled on for example. Simple paranoia says there may be more issues like that hidden in here somewhere, and this set you've identified should find any more of them around. If all that matches up so the numbers for the new counters seem sane, I think that's enough to commit the first basic improvement here. Then make a second pass over exposing the useful internals that let everyone quantify either the improvement or things that may need to be tunable. -- Greg Smith greg.smith@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
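To make the suggestion concrete, the five counters listed above could initially surface as nothing more than a periodic DEBUG message. The snippet below is purely illustrative: the counter names and the idea of reporting once per reclaimer cycle are assumptions for the example, not anything taken from the patches in this thread.

```c
#include "postgres.h"

/* Hypothetical counter names; emitted, say, once per bgreclaimer cycle. */
static void
sketch_report_reclaim_stats(uint64 reclaim_freelist,
							uint64 backend_clocksweeps,
							uint64 reclaim_swept,
							uint64 backend_swept,
							uint64 freelist_touched)
{
	elog(DEBUG1,
		 "reclaim: put " UINT64_FORMAT " buffers on freelist (swept " UINT64_FORMAT
		 "); backends ran clocksweep " UINT64_FORMAT " times (swept " UINT64_FORMAT
		 "); " UINT64_FORMAT " freelist buffers found already touched",
		 reclaim_freelist, reclaim_swept,
		 backend_clocksweeps, backend_swept, freelist_touched);
}
```

Keeping it at DEBUG level costs essentially nothing when the level is not enabled, which fits the "prove the benchmark is testing what you think it is" use before any of it is promoted into a system view.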
On 2014-09-23 16:29:16 -0400, Robert Haas wrote: > On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > But this gets at another point: the way we're benchmarking this right > > now, we're really conflating the effects of three different things: > > > > 1. Changing the locking regimen around the freelist and clocksweep. > > 2. Adding a bgreclaimer process. > > 3. Raising the number of buffer locking partitions. > > I did some more experimentation on this. Attached is a patch that > JUST does #1, and, as previously suggested, it uses a single spinlock > instead of using two of them that are probably in the same cacheline. > Without this patch, on a 32-client, read-only pgbench at scale factor > 1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about > LWLock-inspired preemption: > > - LWLockAcquire > + 68.41% ReadBuffer_common > + 31.59% StrategyGetBuffer > > With the patch, unsurprisingly, StrategyGetBuffer disappears and the > only lwlocks that are causing context-switches are the individual > buffer locks. But are we suffering spinlock contention instead as a > result? Yes, but not much. s_lock is at 0.41% in the corresponding > profile, and only 6.83% of those calls are from the patched > StrategyGetBuffer. In a similar profile of master, s_lock is at > 0.31%. So there's a bit of additional s_lock contention, but I think > it's considerably less than the contention over the lwlock it's > replacing, because the spinlock is only held for the minimal amount of > time needed, whereas the lwlock could be held across taking and > releasing many individual buffer locks. Am I understanding you correctly that you also measured context switches for spinlocks? If so, I don't think that's a valid comparison. LWLocks explicitly yield the CPU as soon as there's any contention while spinlocks will, well, spin. Sure they also go to sleep, but it'll take longer. It's also worthwile to remember in such comparisons that lots of the cost of spinlocks will be in the calling routines, not s_lock() - the first TAS() is inlined into it. And that's the one that'll incur cache misses and such... Note that I'm explicitly *not* doubting the use of a spinlock itself. Given the short acquiration times and the exclusive use of exlusive acquiration a spinlock makes more sense. The lwlock's spinlock alone will have about as much contention. I think it might be possible to construct some cases where the spinlock performs worse than the lwlock. But I think those will be clearly in the minority. And at least some of those will be fixed by bgwriter. As an example consider a single backend COPY ... FROM of large files with a relatively large s_b. That's a seriously bad workload for the current victim buffer selection because frequently most of the needs to be searched for a buffer with usagecount 0. And this patch will double the amount of atomic ops during that. Let me try to quantify that. > What's interesting is that I can't see in the perf output any real > sign that the buffer mapping locks are slowing things down, but they > clearly are, because when I take this patch and also boost > NUM_BUFFER_PARTITIONS to 128, the performance goes up: What did events did you profile? > I welcome further testing and comments, but my current inclination is > to go ahead and push the attached patch. To my knowledge, nobody has > at any point objected to this aspect of what's being proposed, and it > looks safe to me and seems to be a clear win. 
We can then deal with > the other issues on their own merits. I've took a look at this, and all the stuff I saw that I disliked were there before this patch. So +1 for going ahead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 23, 2014 at 6:02 PM, Gregory Smith <gregsmithpgsql@gmail.com> wrote: > On 9/23/14, 10:31 AM, Robert Haas wrote: >> I suggest we count these things: >> >> 1. The number of buffers the reclaimer has put back on the free list. >> 2. The number of times a backend has run the clocksweep. >> 3. The number of buffers past which the reclaimer has advanced the clock >> sweep (i.e. the number of buffers it had to examine in order to reclaim the >> number counted by #1). >> 4. The number of buffers past which a backend has advanced the clocksweep >> (i.e. the number of buffers it had to examine in order to allocate the >> number of buffers count by #3). >> 5. The number of buffers allocated from the freelist which the backend did >> not use because they'd been touched (what you're calling >> buffers_touched_freelist). > > All sound reasonable. To avoid wasting time here, I think it's only worth > doing all of these as DEBUG level messages for now. Then only go through > the overhead of exposing the ones that actually seem relevant. That's what > I did for the 8.3 work, and most of that data at this level was barely > relevant to anyone but me then or since. We don't want the system views to > include so much irrelevant trivia that finding the important parts becomes > overwhelming. I think we expose far too little information in our system views. Just to take one example, we expose no useful information about lwlock acquire or release, but a lot of real-world performance problems are caused by lwlock contention. There are of course difficulties in exposing huge numbers of counters, but we're not talking about many here, so I'd lean toward exposing them in the final patch if they seem at all useful. > I'd like to see that level of instrumentation--just the debug level > messages--used to quantify the benchmarks that people are running already, > to prove they are testing what they think they are. That would have caught > the test error you already stumbled on for example. Simple paranoia says > there may be more issues like that hidden in here somewhere, and this set > you've identified should find any more of them around. Right. > If all that matches up so the numbers for the new counters seem sane, I > think that's enough to commit the first basic improvement here. Then make a > second pass over exposing the useful internals that let everyone quantify > either the improvement or things that may need to be tunable. Well, I posted a patch a bit ago that I think is the first basic improvement - and none of these counters are relevant to that. It doesn't add a new background process or anything; it just does pretty much the same thing we do now with less-painful locking. There are no water marks to worry about, or tunable thresholds, or anything; and because it's so much simpler, it's far easier to reason about than the full patch, which is why I feel quite confident pressing on toward a commit. Once that is in, I think we should revisit the idea of a bgreclaimer process, and see how much more that improves things - if at all - on top of what that basic patch already does. For that we'll need these counters, and maybe others. But let's make that phase two. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Am I understanding you correctly that you also measured context switches > for spinlocks? If so, I don't think that's a valid comparison. LWLocks > explicitly yield the CPU as soon as there's any contention while > spinlocks will, well, spin. Sure they also go to sleep, but it'll take > longer. No, I measured CPU consumption attributable to s_lock. (I checked context-switches, too.) > It's also worthwile to remember in such comparisons that lots of the > cost of spinlocks will be in the calling routines, not s_lock() - the > first TAS() is inlined into it. And that's the one that'll incur cache > misses and such... True. I can check that - I did not. > Note that I'm explicitly *not* doubting the use of a spinlock > itself. Given the short acquiration times and the exclusive use of > exlusive acquiration a spinlock makes more sense. The lwlock's spinlock > alone will have about as much contention. Right. > I think it might be possible to construct some cases where the spinlock > performs worse than the lwlock. But I think those will be clearly in the > minority. And at least some of those will be fixed by bgwriter. As an > example consider a single backend COPY ... FROM of large files with a > relatively large s_b. That's a seriously bad workload for the current > victim buffer selection because frequently most of the needs to be > searched for a buffer with usagecount 0. And this patch will double the > amount of atomic ops during that. It will actually be far worse than that, because we'll acquire and release the spinlock for every buffer over which we advance the clock sweep, instead of just once for the whole thing. That's reason to hope that a smart bgreclaimer process may be a good improvement on top of this. It can do things like advance the clock sweep hand 16 buffers at a time and then sweep them all after-the-fact, reducing contention on the mutex by an order-of-magnitude, if that turns out to be an important consideration. But I think it's right to view that as something we need to test vs. the baseline established by this patch. What's clear today is that workloads which stress buffer-eviction fall to pieces, because the entire buffer-eviction process is essentially single-threaded. One process can't begin evicting a buffer until another has finished doing so. This lets multiple backends do that at the same time. We may find cases where that leads to an unpleasant amount of contention, but since we have several ideas for how to mitigate that as needs be, I think it's OK to go ahead. The testing we're doing on the combined patch is conflating the effects the new locking regimen with however the bgreclaimer affects things, and it's very clear to me now that we need to make sure those are clearly separate. > Let me try to quantify that. Please do. >> What's interesting is that I can't see in the perf output any real >> sign that the buffer mapping locks are slowing things down, but they >> clearly are, because when I take this patch and also boost >> NUM_BUFFER_PARTITIONS to 128, the performance goes up: > > What did events did you profile? cs. >> I welcome further testing and comments, but my current inclination is >> to go ahead and push the attached patch. To my knowledge, nobody has >> at any point objected to this aspect of what's being proposed, and it >> looks safe to me and seems to be a clear win. We can then deal with >> the other issues on their own merits. 
> > I've took a look at this, and all the stuff I saw that I disliked were > there before this patch. So +1 for going ahead. Cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-23 19:21:10 -0400, Robert Haas wrote: > On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > I think it might be possible to construct some cases where the spinlock > > performs worse than the lwlock. But I think those will be clearly in the > > minority. And at least some of those will be fixed by bgwriter. Err, this should read bgreclaimer, not bgwriter. > > As an > > example consider a single backend COPY ... FROM of large files with a > > relatively large s_b. That's a seriously bad workload for the current > > victim buffer selection because frequently most of the needs to be > > searched for a buffer with usagecount 0. And this patch will double the > > amount of atomic ops during that. > > It will actually be far worse than that, because we'll acquire and > release the spinlock for every buffer over which we advance the clock > sweep, instead of just once for the whole thing. I said double, because we already acquire the buffer header's spinlock every tick. > That's reason to hope that a smart bgreclaimer process may be a good > improvement on top of this. Right. That's what I was trying to say with bgreclaimer above. > But I think it's right to view that as something we need to test vs. > the baseline established by this patch. Agreed. I think the possible downsides are at the very fringe of already bad cases. That's why I agreed that you should go ahead. > > Let me try to quantify that. > > Please do. I've managed to find a ~1.5% performance regression. But the setup was plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into a fillfactor 10 table with the column set to PLAIN storage. With the resulting table size chosen so it's considerably bigger than s_b, but smaller than the dirty writeback limit of the kernel. That's perfectly reasonable. I can think of a couple other cases, but they're all similarly absurd. > >> What's interesting is that I can't see in the perf output any real > >> sign that the buffer mapping locks are slowing things down, but they > >> clearly are, because when I take this patch and also boost > >> NUM_BUFFER_PARTITIONS to 128, the performance goes up: > > > > What did events did you profile? > > cs. Ah. My guess is that most of the time will probably actually be spent in the lwlock's spinlock, not the the lwlock putting itself to sleep. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 23, 2014 at 7:42 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> It will actually be far worse than that, because we'll acquire and >> release the spinlock for every buffer over which we advance the clock >> sweep, instead of just once for the whole thing. > > I said double, because we already acquire the buffer header's spinlock > every tick. Oh, good point. >> > Let me try to quantify that. >> >> Please do. > > I've managed to find a ~1.5% performance regression. But the setup was > plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into > a fillfactor 10 table with the column set to PLAIN storage. With the > resulting table size chosen so it's considerably bigger than s_b, but > smaller than the dirty writeback limit of the kernel. > > That's perfectly reasonable. > > I can think of a couple other cases, but they're all similarly absurd. Well, it's not insane to worry about such things, but if you can only manage 1.5% on such an extreme case, I'm encouraged. This is killing us on OLTP workloads, and fixing that is a lot more important than a couple percent on an extreme case. > Ah. My guess is that most of the time will probably actually be spent in > the lwlock's spinlock, not the the lwlock putting itself to sleep. Ah, OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9/23/14, 7:13 PM, Robert Haas wrote: > I think we expose far too little information in our system views. Just > to take one example, we expose no useful information about lwlock > acquire or release, but a lot of real-world performance problems are > caused by lwlock contention. I sent over a proposal for what I was calling Performance Events about a year ago. The idea was to provide a place to save data about lock contention, weird checkpoint sync events, that sort of thing. Replacing log parsing to get at log_lock_waits data was my top priority. Once that's there, lwlocks was an obvious next target. Presumably we just needed collection to be low enough overhead, and then we can go down to whatever shorter locks we want; lower the overhead, faster the event we can measure. Sometimes the database will never be able to instrument some of its fastest events without blowing away the event itself. We'll still have perf / dtrace / systemtap / etc. for those jobs. But those are not the problems of the average Postgres DBA's typical day. The data people need to solve this sort of thing in production can't always show up in counters. You'll get evidence the problem is there, but you need more details to actually find the culprit. Some info about the type of lock, tables and processes involved, maybe the query that's running, that sort of thing. You can kind of half-ass the job if you make per-tables counter for everything, but we really need more, both to serve our users and to compare well against what other databases provide for tools. That's why I was trying to get the infrastructure to capture all that lock detail, without going through the existing logging system first. Actually building Performance Events fell apart on the storage side: figuring out where to put it all without waiting for a log file to hit disk. I wanted in-memory storage so clients don't wait for anything, then a potentially lossy persistence writer. I thought I could get away with a fixed size buffer like pg_stat_statements uses. That was optimistic. Trying to do better got me lost in memory management land without making much progress. I think the work you've now done on dynamic shared memory gives the right shape of infrastructure that I could pull this off now. I even have funding to work on it again, and it's actually the #2 thing I'd like to take on as I get energy for new feature development. (#1 is the simple but time consuming job of adding block write counters, the lack of which which is just killing me on some fast growing installs) I have a lot of unread messages on this list to sort through right now. I know I saw someone try to revive the idea of saving new sorts of performance log data again recently; can't seem to find it again right now. That didn't seem like it went any farther than thinking about the specifications though. The last time I jumped right over that and hit a wall with this one hard part of the implementation instead, low overhead memory management for saving everything. -- Greg Smith greg.smith@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The patch I attached the first time was just the last commit in the
> git repository where I wrote the patch, rather than the changes that I
> made on top of that commit. So, yes, the results from the previous
> message are with the patch attached to the follow-up. I just typed
> the wrong git command when attempting to extract that patch to attach
> it to the email.

Here are some more results. TL;DR: The patch still looks good, but we should raise the number of buffer mapping partitions as well.

On the IBM POWER7 machine, I ran read-only pgbench tests with 1 client, 8 clients, and all multiples of 16 up to 96. I ran these tests at scale factor 1000 and scale factor 3000. I tested four builds: master as of commit df4077cda2eae3eb4a5cf387da0c1e7616e73204, that same commit with the number of buffer mapping partitions raised to 128, that commit with reduce-replacement-locking.patch applied, and that commit with reduce-replacement-locking.patch applied AND the number of buffer mapping partitions raised to 128. The results from each configuration are reported in that order on each of the lines below; each is the median of three results. shared_buffers=8GB for all tests.

scale factor 1000

clients |        master |    master-128 | reduce-locking | reduce-locking-128
      1 |   8119.907618 |   8230.853237 |    8153.515217 |        8145.045004
      8 |  65457.006762 |  65826.439701 |   65851.010116 |       65703.168020
     16 | 125263.858855 | 125723.441853 |  125020.598728 |      129506.037997
     32 | 176696.288187 | 182376.232631 |  178278.917581 |      186440.340283
     48 | 193251.602743 | 214243.417591 |  197958.562641 |      226782.327868
     64 | 182264.276909 | 218655.105894 |  190364.759052 |      256863.652885
     80 | 171719.210488 | 203104.673512 |  179861.241080 |      274065.020956
     96 | 162525.883898 | 190960.622943 |  169759.271356 |      277820.128782

scale factor 3000

clients |        master |    master-128 | reduce-locking | reduce-locking-128
      1 |   7690.357053 |   7723.925932 |    7772.207513 |        7684.079850
      8 |  60789.325087 |  61688.547446 |   61485.398967 |       62546.166411
     16 | 112509.423777 | 115138.385501 |  115606.858594 |      120015.350112
     32 | 147881.211900 | 161359.994902 |  153302.501020 |      173063.463752
     48 | 129748.929652 | 153986.160920 |  136164.387103 |      204935.207578
     64 | 114364.542340 | 132174.970721 |  116705.371890 |      224636.957891
     80 | 101375.265389 | 117279.931095 |  102374.794412 |      232966.076908
     96 |  93144.724830 | 106676.309224 |   92787.650325 |      233862.872939

Analysis:

1. To see the effect of reduce-replacement-locking.patch, compare the first TPS number in each line to the third, or the second to the fourth. At scale factor 1000, the patch wins in all of the cases with 32 or more clients and exactly half of the cases with 1, 8, or 16 clients. The variations at low client counts are quite small, and the patch isn't expected to do much at low concurrency levels, so that's probably just random variation. At scale factor 3000, the situation is more complicated. With only 16 bufmappinglocks, the patch gets its biggest win at 48 clients, and by 96 clients it's actually losing to unpatched master. But with 128 bufmappinglocks, it wins - often massively - on everything but the single-client test, which is a small loss, hopefully within experimental variation.

2. To see the effect of increasing the number of buffer mapping locks to 128, compare the first TPS number in each line to the second, or the third to the fourth. Without reduce-replacement-locking.patch, that's a win at every concurrency level at both scale factors. With that patch, the 1 and 8 client tests are small losses at scale factor 1000, and the 1 client test is a small loss at scale factor 3000.

The single-client results, which are often a concern for scalability patches, bear a bit of further comment. In this round of testing, either patch alone improved things slightly, and both patches together made them slightly worse. Even if that is reproducible, I don't think it should be cause for concern, because it tends to indicate (at least to me) that the shifting around is just the result of slightly different placement of code across cache lines, or other minor factors we can't really pin down.

So I'm inclined to (a) push reduce-replacement-locking.patch and then also (b) bump up the number of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS accordingly so that pg_buffercache doesn't get unhappy).

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > 1. To see the effect of reduce-replacement-locking.patch, compare the > first TPS number in each line to the third, or the second to the > fourth. At scale factor 1000, the patch wins in all of the cases with > 32 or more clients and exactly half of the cases with 1, 8, or 16 > clients. The variations at low client counts are quite small, and the > patch isn't expected to do much at low concurrency levels, so that's > probably just random variation. At scale factor 3000, the situation > is more complicated. With only 16 bufmappinglocks, the patch gets its > biggest win at 48 clients, and by 96 clients it's actually losing to > unpatched master. But with 128 bufmappinglocks, it wins - often > massively - on everything but the single-client test, which is a small > loss, hopefully within experimental variation. > > Comments? Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times but has this ever been measured? I'm starting to wonder if the # mapping locks should be dependent on some other value, perhaps the # of shared bufffers... merlin
On Thu, Sep 25, 2014 at 10:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> 1. To see the effect of reduce-replacement-locking.patch, compare the >> first TPS number in each line to the third, or the second to the >> fourth. At scale factor 1000, the patch wins in all of the cases with >> 32 or more clients and exactly half of the cases with 1, 8, or 16 >> clients. The variations at low client counts are quite small, and the >> patch isn't expected to do much at low concurrency levels, so that's >> probably just random variation. At scale factor 3000, the situation >> is more complicated. With only 16 bufmappinglocks, the patch gets its >> biggest win at 48 clients, and by 96 clients it's actually losing to >> unpatched master. But with 128 bufmappinglocks, it wins - often >> massively - on everything but the single-client test, which is a small >> loss, hopefully within experimental variation. >> >> Comments? > > Why stop at 128 mapping locks? Theoretical downsides to having more > mapping locks have been mentioned a few times but has this ever been > measured? I'm starting to wonder if the # mapping locks should be > dependent on some other value, perhaps the # of shared bufffers... Good question. My belief is that the number of buffer mapping locks required to avoid serious contention will be roughly proportional to the number of hardware threads. At the time the value 16 was chosen, there were probably not more than 8-core CPUs in common use; but now we've got a machine with 64 hardware threads and, what do you know but it wants 128 locks. I think the long-term solution here is that we need a lock-free hash table implementation for our buffer mapping tables, because I'm pretty sure that just cranking the number of locks up and up is going to start to have unpleasant side effects at some point. We may be able to buy a few more years by just cranking it up, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-25 09:51:17 -0400, Robert Haas wrote: > On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > The patch I attached the first time was just the last commit in the > > git repository where I wrote the patch, rather than the changes that I > > made on top of that commit. So, yes, the results from the previous > > message are with the patch attached to the follow-up. I just typed > > the wrong git command when attempting to extract that patch to attach > > it to the email. > > Here are some more results. TL;DR: The patch still looks good, but we > should raise the number of buffer mapping partitions as well. > So I'm inclined to (a) push > reduce-replacement-locking.patch and then also (b) bump up the number > of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS > accordingly so that pg_buffercache doesn't get unhappy). I'm happy with that. I don't think it's likely that a moderate increase in the number of mapping lwlocks will be noticeably bad for any workload. One difference is that the total number of lwlock acquirations will be a bit higher because currently it's more likely for the old and new to fall into different partitions. But that's not really significant. The other difference is the number of cachelines touched. Currently, in concurrent workloads, there's already lots of L1 cache misses around the buffer mapping locks because they're exclusively owned by a different core/socket. So, to be effectively worse, the increase would need to lead to lower overall cache hit rates by them not being in *any* cache or displacing other content. That leads me to wonder: Have you measured different, lower, number of buffer mapping locks? 128 locks is, if we'd as we should align them properly, 8KB of memory. Common L1 cache sizes are around 32k... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
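The 8KB figure presumably assumes each lwlock is padded out to its own 64-byte cache line (an assumption here, but the natural reading of "align them properly"):

$$
128\ \text{locks} \times 64\ \tfrac{\text{bytes}}{\text{cache line}} = 8192\ \text{bytes} = 8\,\text{KB},
$$

i.e. a quarter of a typical 32 KB L1 data cache just for the buffer mapping locks.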
On 2014-09-25 09:02:25 -0500, Merlin Moncure wrote: > On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > 1. To see the effect of reduce-replacement-locking.patch, compare the > > first TPS number in each line to the third, or the second to the > > fourth. At scale factor 1000, the patch wins in all of the cases with > > 32 or more clients and exactly half of the cases with 1, 8, or 16 > > clients. The variations at low client counts are quite small, and the > > patch isn't expected to do much at low concurrency levels, so that's > > probably just random variation. At scale factor 3000, the situation > > is more complicated. With only 16 bufmappinglocks, the patch gets its > > biggest win at 48 clients, and by 96 clients it's actually losing to > > unpatched master. But with 128 bufmappinglocks, it wins - often > > massively - on everything but the single-client test, which is a small > > loss, hopefully within experimental variation. > > > > Comments? > > Why stop at 128 mapping locks? Theoretical downsides to having more > mapping locks have been mentioned a few times but has this ever been > measured? I'm starting to wonder if the # mapping locks should be > dependent on some other value, perhaps the # of shared bufffers... Wrong way round. You need to prove the upside of increasing it further, not the contrary. The primary downside is cache hit ratio and displacing other cache entries... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > That leads me to wonder: Have you measured different, lower, number of > buffer mapping locks? 128 locks is, if we'd as we should align them > properly, 8KB of memory. Common L1 cache sizes are around 32k... Amit has some results upthread showing 64 being good, but not as good as 128. I haven't verified that myself, but have no reason to doubt it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-25 10:22:47 -0400, Robert Haas wrote: > On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > That leads me to wonder: Have you measured different, lower, number of > > buffer mapping locks? 128 locks is, if we'd as we should align them > > properly, 8KB of memory. Common L1 cache sizes are around 32k... > > Amit has some results upthread showing 64 being good, but not as good > as 128. I haven't verified that myself, but have no reason to doubt > it. How about you push the spinlock change and I crosscheck the partition number on a multi socket x86 machine? Seems worthwile to make sure that it doesn't cause problems on x86. I seriously doubt it'll, but ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> Why stop at 128 mapping locks? Theoretical downsides to having more >> mapping locks have been mentioned a few times but has this ever been >> measured? I'm starting to wonder if the # mapping locks should be >> dependent on some other value, perhaps the # of shared bufffers... > > Wrong way round. You need to prove the upside of increasing it further, > not the contrary. The primary downside is cache hit ratio and displacing > other cache entries... I can't do that because I don't have the hardware. I wasn't suggesting to just set it but to measure the affects of setting it. But the benefits from going from 16 to 128 are pretty significant at least on this hardware; I'm curious how much further it can be pushed...what's wrong with trying it out? merlin
On 2014-09-25 10:09:30 -0400, Robert Haas wrote: > I think the long-term solution here is that we need a lock-free hash > table implementation for our buffer mapping tables, because I'm pretty > sure that just cranking the number of locks up and up is going to > start to have unpleasant side effects at some point. We may be able > to buy a few more years by just cranking it up, though. I think mid to long term we actually need something else than a hashtable. Capable of efficiently looking for the existance of 'neighboring' buffers so we can intelligently prefetch far enough that the read actually completes when we get there. Also I'm pretty sure that we'll need a way to efficiently remove all buffers for a relfilenode from shared buffers - linearly scanning for that isn't a good solution. So I think we need a different data structure. I've played a bit around with just replacing buf_table.c with a custom handrolled hashtable because I've seen more than one production workload where hash_search_with_hash_value() is both cpu and cache miss wise top#1 of profiles. With most calls coming from the buffer mapping and then from the lock manager. There's two reasons for that: a) dynahash just isn't very good and it does a lot of things that will never be necessary for these hashes. b) the key into the hash table is *far* too wide. A significant portion of the time is spent comparing buffer/lock tags. The aforementioned replacement hash table was a good bit faster for fully cached workloads - but at the time I wrote I could still make it crash in very high cache pressure workloads, so that should be taken with a fair bit of salt. I think we can comparatively easily get rid of the tablespace in buffer tags. Getting rid of the database already would be a fair bit harder. I haven't really managed to get an idea how to remove the fork number without making the catalog much more complicated. I don't think we can go too long without at least some of these steps :(. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
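For context, the tag being hashed and compared on every buffer lookup looks roughly like the following (paraphrased from PostgreSQL's buf_internals.h of that era; the size annotations are added for this discussion):

```c
/* Paraphrased from buf_internals.h; all five fields are 4 bytes wide
 * on common platforms, which is where the 20-byte memcmp comes from. */
typedef struct buftag
{
	RelFileNode rnode;			/* tablespace, database, relation OIDs (12 bytes) */
	ForkNumber	forkNum;		/* main/fsm/vm/init fork (4 bytes) */
	BlockNumber blockNum;		/* block number within the fork (4 bytes) */
} BufferTag;					/* 20 bytes total, hashed and compared per probe */
```

The tablespace and database fields are the ones being discussed as candidates for removal from the key; the fork number is the part that is harder to get rid of without catalog changes.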
On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-09-25 10:22:47 -0400, Robert Haas wrote: >> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > That leads me to wonder: Have you measured different, lower, number of >> > buffer mapping locks? 128 locks is, if we'd as we should align them >> > properly, 8KB of memory. Common L1 cache sizes are around 32k... >> >> Amit has some results upthread showing 64 being good, but not as good >> as 128. I haven't verified that myself, but have no reason to doubt >> it. > > How about you push the spinlock change and I crosscheck the partition > number on a multi socket x86 machine? Seems worthwile to make sure that > it doesn't cause problems on x86. I seriously doubt it'll, but ... OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-25 09:34:57 -0500, Merlin Moncure wrote: > On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> Why stop at 128 mapping locks? Theoretical downsides to having more > >> mapping locks have been mentioned a few times but has this ever been > >> measured? I'm starting to wonder if the # mapping locks should be > >> dependent on some other value, perhaps the # of shared bufffers... > > > > Wrong way round. You need to prove the upside of increasing it further, > > not the contrary. The primary downside is cache hit ratio and displacing > > other cache entries... > > I can't do that because I don't have the hardware. One interesting part of this is making sure it doesn't regress older/smaller machines. So at least that side you could check... > what's wrong with trying it out? If somebody is willing to do it: nothing. I'd just much rather do the, by now proven, simple change before starting with more complex solutions. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/25/2014 05:40 PM, Andres Freund wrote: > There's two reasons for that: a) dynahash just isn't very good and it > does a lot of things that will never be necessary for these hashes. b) > the key into the hash table is*far* too wide. A significant portion of > the time is spent comparing buffer/lock tags. Hmm. Is it the comparing, or calculating the hash? We could precalculate the hash for RelFileNode+ForkNumber, and store it RelationData. At a lookup, you'd only need to mix in the block number. - Heikki
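A minimal sketch of the idea, assuming a 32-bit hash of (RelFileNode, ForkNumber) were computed once and cached in RelationData; the mixing function and names below are illustrative, not an actual API:

```c
#include <stdint.h>

/*
 * Hypothetical lookup-time hash: the per-relation/fork part is computed
 * once and cached; only the block number is mixed in per lookup.  The
 * mixer is a boost-style hash_combine, chosen purely for illustration.
 */
static inline uint32_t
sketch_buftag_hash(uint32_t cached_relfork_hash, uint32_t block_num)
{
	uint32_t	h = cached_relfork_hash;

	h ^= block_num + 0x9e3779b9u + (h << 6) + (h >> 2);
	return h;
}
```

The saving is that the expensive part of the tag hash moves out of the per-lookup path, at the cost of keeping the cached value coherent with the relation's relfilenode.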
On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote: > On 09/25/2014 05:40 PM, Andres Freund wrote: > >There's two reasons for that: a) dynahash just isn't very good and it > >does a lot of things that will never be necessary for these hashes. b) > >the key into the hash table is*far* too wide. A significant portion of > >the time is spent comparing buffer/lock tags. > > Hmm. Is it the comparing, or calculating the hash? Neither, really. The hash calculation is visible in the profile, but not that pronounced yet. The primary thing noticeable in profiles (besides cache efficiency) is the comparison of the full tag after locating a possible match in a bucket. 20 byte memcmp's aren't free. Besides making the hashtable more efficent, a smaller key (say, 4 byte relfilenode, 4 byte blocknumber) would also make using a radix tree or similar more realistic. I've prototyped that once and it has nice properties, but the tree is too deep. Obviousy it'd also help making buffer descriptors smaller, which is also good from a cache efficiency perspective... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
First of all thanks for committing part-1 of these changes, and it seems you are planning to commit part-3 based on the results of tests which Andres is planning to do. For the remaining part (part-2), today I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                 |     16 |     32 |     64 |    128
reduce-replacement-locking.patch + 128 Buf Partitions  | 157732 | 229547 | 271536 | 245295
scalable_buffer_eviction_v9.patch                      | 163762 | 230753 | 275147 | 248309

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                 |     16 |     32 |     64 |    128
reduce-replacement-locking.patch + 128 Buf Partitions  | 157781 | 212134 | 202209 | 171176
scalable_buffer_eviction_v9.patch                      | 160301 | 213922 | 208680 | 172720

The results indicate that in all cases there is a benefit from doing part-2 (bgreclaimer). Though the benefit at this configuration is not high, there might be more benefit at some higher configurations (scale factor - 10000). Do you see any merit in pursuing further to accomplish part-2 as well?
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/26/2014 03:26 PM, Andres Freund wrote:
> On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
>> On 09/25/2014 05:40 PM, Andres Freund wrote:
>>> There's two reasons for that: a) dynahash just isn't very good and it
>>> does a lot of things that will never be necessary for these hashes. b)
>>> the key into the hash table is *far* too wide. A significant portion of
>>> the time is spent comparing buffer/lock tags.
>>
>> Hmm. Is it the comparing, or calculating the hash?
>
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

Hmm. We could provide a custom compare function instead of relying on memcmp. We can do somewhat better than generic memcmp when we know that the BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and it's always exactly 20 bytes. I wonder if you're actually just seeing a cache miss showing up in the profile, though.

- Heikki
On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Neither, really. The hash calculation is visible in the profile, but not > that pronounced yet. The primary thing noticeable in profiles (besides > cache efficiency) is the comparison of the full tag after locating a > possible match in a bucket. 20 byte memcmp's aren't free. Would it be faster to test the relfilenode first, and then test the rest only if that matches? One idea for improving this is to get rid of relation forks. Like all true PostgreSQL patriots, I love the free space map and the visibility map passionately, and I believe that those features are one of the biggest contributors to the success of the project over the last half-decade. But I'd love them even more if they didn't triple our rate of inode consumption and bloat our buffer tags. More, it's just not an extensible mechanism: too many things have to loop over all forks, and it just doesn't scale to keep adding more of them. If we added a metapage to each heap, we could have the FSM and VM have their own relfilenode and just have the heap point at them. Or (maybe better still) we could store the data in the heap itself. It would be a lot of work, though. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Neither, really. The hash calculation is visible in the profile, but not > that pronounced yet. The primary thing noticeable in profiles (besides > cache efficiency) is the comparison of the full tag after locating a > possible match in a bucket. 20 byte memcmp's aren't free. I'm not arguing against a more efficient hash table, but one simple win would be to have a buffer tag specific comparison function. I mean something like the attached patch. This way we avoid the loop counter overhead, can check the blocknum first and possibly have better branch prediction. Do you have a workload where I could test if this helps alleviate the comparison overhead? Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Attachment
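The attachment itself is not reproduced in this archive. As a rough illustration of the kind of tag-specific comparison being described (check blockNum first, then the remaining fields, instead of a generic 20-byte memcmp), a sketch could look like the following; the type definitions mirror the usual BufferTag layout, but the function is invented here and is not the posted patch.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;
typedef int32_t ForkNumber;

typedef struct RelFileNode
{
    Oid         spcNode;            /* tablespace */
    Oid         dbNode;             /* database */
    Oid         relNode;            /* relation */
} RelFileNode;

typedef struct BufferTag
{
    RelFileNode rnode;              /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;           /* block number within the fork */
} BufferTag;

/* Field-by-field equality; blockNum is the field most likely to differ. */
static inline bool
BufferTagsEqual(const BufferTag *a, const BufferTag *b)
{
    return a->blockNum == b->blockNum &&
           a->rnode.relNode == b->rnode.relNode &&
           a->rnode.dbNode == b->rnode.dbNode &&
           a->rnode.spcNode == b->rnode.spcNode &&
           a->forkNum == b->forkNum;
}

Compared with memcmp over the whole struct, this avoids the byte loop and lets the common mismatch (a different block of the same relation) fall out after a single comparison.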
On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: First of all, thanks for committing part-1 of these changes. It seems you are planning to commit part-3 based on the results of tests which Andres is planning to do, and for the remaining part (part-2), today I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB
Patch_Ver/Client_Count                                   16      32      64      128
reduce-replacement-locking.patch + 128 Buf Partitions    157732  229547  271536  245295
scalable_buffer_eviction_v9.patch                        163762  230753  275147  248309

Scale Factor - 3000, Shared_buffer - 8GB
Patch_Ver/Client_Count                                   16      32      64      128
reduce-replacement-locking.patch + 128 Buf Partitions    157781  212134  202209  171176
scalable_buffer_eviction_v9.patch                        160301  213922  208680  172720

The results indicate that in all cases there is a benefit from doing part-2 (bgreclaimer). Though the benefit at this configuration is not high, there might be more benefit at some higher configurations (scale factor - 10000). Do you see any merit in pursuing further to accomplish part-2 as well?

Interesting results. Thanks for gathering this data.
If this is the best we can do with the bgreclaimer, I think the case for pursuing it is somewhat marginal. The biggest jump you've got seems to be at scale factor 3000 with 64 clients, where you picked up about 4%. 4% isn't nothing, but it's not a lot, either. On the other hand, this might not be the best we can do. There may be further improvements to bgreclaimer that make the benefit larger. Backing up a bit, to what extent have we actually solved the problem here? If we had perfectly removed all of the scalability bottlenecks, what would we expect to see? You didn't say which machine this testing was done on
, or how many cores it had, but for example on the IBM POWER7 machine, we probably wouldn't expect the throughput at 64 clients to be 4 times the throughput at 16 clients, because up to 16 clients each one can have a full CPU core, whereas after that and out to 64 each is getting a hardware thread, which is not quite as good. Still, we'd expect performance to go up, or at least not go down. Your data shows a characteristic performance knee: between 16 and 32 clients we go up, but then between 32 and 64 we go down,
and between 64 and 128 we go down more. You haven't got enough data points there to show very precisely where the knee is, but unless you tested this on a smaller box than what you have been using, we're certainly hitting the knee sometime before we run out of physical cores. That implies a remaining contention bottleneck. My results from yesterday were a bit different. I tested 1 client, 8 clients, and multiples of 16 clients out to 96. With reduce-replacement-locking.patch + 128 buffer mapping partitions, performance continued to rise all the way out to 96 clients. It definitely wasn't linear, but it went up, not down. I don't know why this is different from what you are seeing.
Anyway there's a little more ambiguity there about how much contention remains, but my bet is that there is at least some contention that we could still hope to remove. We need to understand where that contention is. Are the buffer mapping locks still contended? Is the new spinlock contended? Are there other contention points? I won't be surprised if it turns out that the contention is on the new spinlock and that a proper design for bgreclaimer is the best way to remove that contention .... but I also won't be surprised if it turns out that there are bigger wins elsewhere. So I think you should try to figure out where the remaining contention is first, and then we can discuss what to do about it.
On another point, I think it would be a good idea to rebase the bgreclaimer patch over what I committed, so that we have a clean patch against master to test with.
On 2014-09-26 17:01:52 +0300, Ants Aasma wrote: > On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Neither, really. The hash calculation is visible in the profile, but not > > that pronounced yet. The primary thing noticeable in profiles (besides > > cache efficiency) is the comparison of the full tag after locating a > > possible match in a bucket. 20 byte memcmp's aren't free. > > I'm not arguing against a more efficient hash table, but one simple > win would be to have a buffer tag specific comparison function. I mean > something like the attached patch. This way we avoid the loop counter > overhead, can check the blocknum first and possibly have better branch > prediction. Heh. Yea. As I wrote to Heikki, 64bit compares were the thing showing most benefits - at least with my own hashtable implementation. > Do you have a workload where I could test if this helps alleviate the > comparison overhead? Fully cached readonly pgbench workloads with -M prepared already show it. But it gets more pronounced for workloads that access buffers at a higher frequency. At two customers I've seen this really badly in the profile because they have OLTP statements that do some index nested loops. Often looking the same buffer up *over and over*. There often isn't a better plan (with current pg join support at least) for joining a somewhat small number of rows out of a large table to a small/mid sized table. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-26 16:47:55 +0300, Heikki Linnakangas wrote: > On 09/26/2014 03:26 PM, Andres Freund wrote: > >On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote: > >>On 09/25/2014 05:40 PM, Andres Freund wrote: > >>>There's two reasons for that: a) dynahash just isn't very good and it > >>>does a lot of things that will never be necessary for these hashes. b) > >>>the key into the hash table is *far* too wide. A significant portion of > >>>the time is spent comparing buffer/lock tags. > >> > >>Hmm. Is it the comparing, or calculating the hash? > > > >Neither, really. The hash calculation is visible in the profile, but not > >that pronounced yet. The primary thing noticeable in profiles (besides > >cache efficiency) is the comparison of the full tag after locating a > >possible match in a bucket. 20 byte memcmp's aren't free. > > Hmm. We could provide a custom compare function instead of relying on > memcmp. We can do somewhat better than generic memcmp when we know that the > BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and it's > always exactly 20 bytes. That might give a little benefit. I haven't experimented with that with dynahash.c. I've compared memcmp() and custom comparison with my own hashtable and there were some differences, but negligible. The biggest was using 64bit compares. Either way, it all ends up being rather branch heavy with high misprediction rates. > I wonder if you're actually just seeing a cache miss showing up in the > profile, though. I don't think so. I hacked (by moving it to the end of RelfileNode/BufferTag and comparing only the front part) the tablespace out of buffer tags and that produced measurable benefits. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
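To make the "64bit compares" remark concrete, here is a hedged sketch (not code from any posted patch) of comparing a 20-byte, 4-byte-aligned tag as two 8-byte words plus one 4-byte word; the memcpy calls keep the unaligned-width accesses well defined, and compilers typically lower them to plain loads.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumes the tag is exactly 20 bytes with no internal padding, as BufferTag is. */
static inline bool
tag_equal_wordwise(const void *a, const void *b)
{
    uint64_t a0, a1, b0, b1;
    uint32_t a2, b2;

    memcpy(&a0, a, 8);
    memcpy(&a1, (const char *) a + 8, 8);
    memcpy(&a2, (const char *) a + 16, 4);
    memcpy(&b0, b, 8);
    memcpy(&b1, (const char *) b + 8, 8);
    memcpy(&b2, (const char *) b + 16, 4);

    /* Combine the differences so there is only one data-dependent branch. */
    return (((a0 ^ b0) | (a1 ^ b1)) | (uint64_t) (a2 ^ b2)) == 0;
}

As noted above, even this kind of variant only helped a little in Andres's measurements; the probes stay branch-heavy and hard to predict, which is part of the argument for shrinking the key itself.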
On 2014-09-26 09:59:41 -0400, Robert Haas wrote: > On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Neither, really. The hash calculation is visible in the profile, but not > > that pronounced yet. The primary thing noticeable in profiles (besides > > cache efficiency) is the comparison of the full tag after locating a > > possible match in a bucket. 20 byte memcmp's aren't free. > > Would it be faster to test the relfilenode first, and then test the > rest only if that matches? I tried that and I couldn't see that much benefit. Check my message to Heikki. > One idea for improving this is to get rid of relation forks. I think that's the hardest end to start from. A cool goal, but hard. Getting rid of the tablespace sounds comparatively simple. And even getting rid of the database in the buffer tag seems to be simpler although already pretty hard. > Like all > true PostgreSQL patriots, I love the free space map and the visibility > map passionately, and I believe that those features are one of the > biggest contributors to the success of the project over the last > half-decade. But I'd love them even more if they didn't triple our > rate of inode consumption and bloat our buffer tags. More, it's just > not an extensible mechanism: too many things have to loop over all > forks, and it just doesn't scale to keep adding more of them. If we > added a metapage to each heap, we could have the FSM and VM have their > own relfilenode and just have the heap point at them. Or (maybe > better still) we could store the data in the heap itself. > > It would be a lot of work, though. :-( Yea, it's really hard. And nearly impossible to do without breaking binary compatibility. What I've been wondering about was to give individual forks their own relfilenodes and manage them via columns in pg_class. But that's also a heck of a lot of work and gets complicated for unlogged relations because we need to access those during recovery when we don't have catalog access yet and can't access all databases anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Part of this patch was already committed, and the overall patch has had its fair share of review for this commitfest, so I'm marking this as "Returned with feedback". The benchmark results for the bgreclaimer showed a fairly small improvement, so it doesn't seem like anyone's going to commit the rest of the patch soon, not without more discussion and testing anyway. Let's not hold up the commitfest for that. - Heikki
On 2014-09-25 16:50:44 +0200, Andres Freund wrote: > On 2014-09-25 10:44:40 -0400, Robert Haas wrote: > > On Thu, Sep 25, 2014 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > > On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > >> On 2014-09-25 10:22:47 -0400, Robert Haas wrote: > > >>> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > >>> > That leads me to wonder: Have you measured different, lower, number of > > >>> > buffer mapping locks? 128 locks is, if we'd as we should align them > > >>> > properly, 8KB of memory. Common L1 cache sizes are around 32k... > > >>> > > >>> Amit has some results upthread showing 64 being good, but not as good > > >>> as 128. I haven't verified that myself, but have no reason to doubt > > >>> it. > > >> > > >> How about you push the spinlock change and I crosscheck the partition > > >> number on a multi socket x86 machine? Seems worthwile to make sure that > > >> it doesn't cause problems on x86. I seriously doubt it'll, but ... > > > > > > OK. > > > > Another thought is that we should test what impact your atomics-based > > lwlocks have on this. > > Yes, I'd planned to test that as well. I think that it will noticeably > reduce the need to increase the number of partitions for workloads that > fit into shared_buffers. But it won't do much about exclusive > acquirations of the buffer mapping locks. So I think there's independent > benefit of increasing the number.

Here we go. Postgres was configured with.

-c shared_buffers=8GB \
-c log_line_prefix="[%m %p] " \
-c log_min_messages=debug1 \
-p 5440 \
-c checkpoint_segments=600 \
-c max_connections=200

Each individual measurement (#TPS) is the result of a pgbench -h /tmp/ -p 5440 postgres -n -M prepared -c $clients -j $clients -S -T 10 run. Master is as of ef8863844bb0b0dab7b92c5f278302a42b4bf05a.

First, a scale 200 run. That fits entirely into shared_buffers:

#scale #client #partitions #TPS
200   1    16   8353.547724   8145.296655   8263.295459
200   16   16   171014.763118 193971.091518 133992.128348
200   32   16   259119.988034 234619.421322 201879.618322
200   64   16   178909.038670 179425.091562 181391.354613
200   96   16   141402.895201 138392.705402 137216.416951
200   128  16   125643.089677 124465.288860 122527.209125

(other runs here stricken, they were contorted due to some concurrent activity. But nothing interesting.)

So, there's quite some variation in here. Not very surprising given the short runtimes, but still. Looking at a profile nearly all the contention is around GetSnapshotData(). That might hide the interesting scalability effects of the partition number. So I next tried my rwlock-contention branch.
#scale #client #partitions #TPS
200   1    1     8540.390223   8285.628397   8497.022656
200   16   1     136875.484896 164302.769380 172053.413980
200   32   1     308624.650724 240502.019046 260825.231470
200   64   1     453004.188676 406226.943046 406973.325822
200   96   1     442608.459701 450185.431848 445549.710907
200   128  1     487138.077973 496233.594356 457877.992783
200   1    16    9477.217454   8181.098317   8457.276961
200   16   16    154224.573476 170238.637315 182941.035416
200   32   16    302230.215403 285124.708236 265917.729628
200   64   16    405151.647136 443473.797835 456072.782722
200   96   16    443360.377281 457164.981119 474049.685940
200   128  16    490616.257063 458273.380238 466429.948417
200   1    64    8410.981874   11554.708966  8359.294710
200   16   64    139378.312883 168398.919590 166184.744944
200   32   64    288657.701012 283588.901083 302241.706222
200   64   64    424838.919754 416926.779367 436848.292520
200   96   64    462352.017671 446384.114441 483332.592663
200   128  64    471578.594596 488862.395621 466692.726385
200   1    128   8350.274549   8140.699687   8305.975703
200   16   128   144553.966808 154711.927715 202437.837908
200   32   128   290193.349170 213242.292597 261016.779185
200   64   128   413792.389493 431267.716855 456587.450294
200   96   128   490459.212833 456375.442210 496430.996055
200   128  128   470067.179360 464513.801884 483485.000502

Not much there either. So, on to the next scale, 1000. That doesn't fit into s_b anymore.

master:
#scale #client #partitions #TPS
1000  1    1     7378.370717   7110.988121   7164.977746
1000  16   1     66439.037413  85151.814130  85047.296626
1000  32   1     71505.487093  75687.291060  69803.895496
1000  64   1     42148.071099  41934.631603  43253.528849
1000  96   1     33760.812746  33969.800564  33598.640121
1000  128  1     30382.414165  30047.284982  30144.576494
1000  1    16    7228.883843   9479.793813   7217.657145
1000  16   16    105203.710528 112375.187471 110919.986283
1000  32   16    146294.286762 145391.938025 144620.709764
1000  64   16    134411.772164 134536.943367 136196.793573
1000  96   16    107626.878208 105289.783922 96480.468107
1000  128  16    92597.909379  86128.040557  92417.727720
1000  1    64    7130.392436   12801.641683  7019.999330
1000  16   64    120180.196384 125319.373819 126137.930478
1000  32   64    181876.697461 190578.106760 189412.973015
1000  64   64    216233.590299 222561.774501 225802.194056
1000  96   64    171928.358031 165922.395721 168283.712990
1000  128  64    139303.139631 137564.877450 141534.449640
1000  1    128   8215.702354   7209.520152   7026.888706
1000  16   128   116196.740200 123018.284948 127045.761518
1000  32   128   183391.488566 185428.757458 185732.926794
1000  64   128   218547.133675 218096.002473 208679.436158
1000  96   128   155209.830821 156327.200412 157542.582637
1000  128  128   131127.769076 132084.933955 124706.336737

rwlock:
#scale #client #partitions #TPS
1000  1    1     7377.270393   7494.260136   7207.898866
1000  16   1     79289.755569  88032.480145  86810.772569
1000  32   1     83006.336151  88961.964680  88508.832253
1000  64   1     44135.036648  46582.727314  45119.421278
1000  96   1     35036.174438  35687.025568  35469.127697
1000  128  1     30597.870830  30782.335225  30342.454439
1000  1    16    7114.602838   7265.863826   7205.225737
1000  16   16    128507.292054 131868.678603 124507.097065
1000  32   16    212779.122153 185666.608338 210714.373254
1000  64   16    239776.079534 239923.393293 242476.922423
1000  96   16    169240.934839 166021.430680 169187.643644
1000  128  16    136601.409985 139340.961857 141731.068752
1000  1    64    13271.722885  11348.028311  12531.188689
1000  16   64    129074.053482 125334.720264 125140.499619
1000  32   64    198405.463848 196605.923684 198354.818005
1000  64   64    250463.474112 249543.622897 251517.159399
1000  96   64    251715.751133 254168.028451 251502.783058
1000  128  64    243596.368933 234671.592026 239123.259642
1000  1    128   7376.371403   7301.077478   7240.526379
1000  16   128   127992.070372 133537.637394 123382.418747
1000  32   128   185807.703422 194303.674428 184919.586634
1000  64   128   270233.496350 271576.483715 262281.662510
1000  96   128   266023.529574 272484.352878 271921.597420
1000  128  128   260004.301457 266710.469926 263713.245868

Based on this I think we can fairly conclude that increasing the number of partitions is quite the win on larger x86 machines too. Independent of the rwlock patch, although it moves the contention points to some degree. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-01 20:54:39 +0200, Andres Freund wrote: > Here we go. > > Postgres was configured with. > -c shared_buffers=8GB \ > -c log_line_prefix="[%m %p] " \ > -c log_min_messages=debug1 \ > -p 5440 \ > -c checkpoint_segments=600 > -c max_connections=200 Robert reminded me that I missed to report the hardware aspect of this... 4x E5-4620 @ 2.20GHz: 32 cores, 64 threads 256GB RAM Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-25 10:42:29 -0400, Robert Haas wrote: > On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-09-25 10:22:47 -0400, Robert Haas wrote: > >> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > That leads me to wonder: Have you measured different, lower, number of > >> > buffer mapping locks? 128 locks is, if we'd as we should align them > >> > properly, 8KB of memory. Common L1 cache sizes are around 32k... > >> > >> Amit has some results upthread showing 64 being good, but not as good > >> as 128. I haven't verified that myself, but have no reason to doubt > >> it. > > > > How about you push the spinlock change and I crosscheck the partition > > number on a multi socket x86 machine? Seems worthwile to make sure that > > it doesn't cause problems on x86. I seriously doubt it'll, but ... > > OK. Given that the results look good, do you plan to push this? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> OK. > > Given that the results look good, do you plan to push this? By "this", you mean the increase in the number of buffer mapping partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS? If so, and if you don't have any reservations, yeah I'll go change it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-10-02 10:40:30 -0400, Robert Haas wrote: > On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> OK. > > > > Given that the results look good, do you plan to push this? > > By "this", you mean the increase in the number of buffer mapping > partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS? Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like #define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64) or something like that? > If so, and if you don't have any reservations, yeah I'll go change it. Yes, I'm happy with going forward. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-10-02 10:40:30 -0400, Robert Haas wrote: >> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> >> OK. >> > >> > Given that the results look good, do you plan to push this? >> >> By "this", you mean the increase in the number of buffer mapping >> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS? > > Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like > #define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64) > or something like that? Nah. That assumes NUM_BUFFER_PARTITIONS will always be the biggest thing, and I don't see any reason to assume that, even if we're making it true for now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-10-02 10:56:05 -0400, Robert Haas wrote: > On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-10-02 10:40:30 -0400, Robert Haas wrote: > >> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> >> OK. > >> > > >> > Given that the results look good, do you plan to push this? > >> > >> By "this", you mean the increase in the number of buffer mapping > >> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS? > > > > Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like > > #define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64) > > or something like that? > > Nah. That assumes NUM_BUFFER_PARTITIONS will always be the biggest > thing, and I don't see any reason to assume that, even if we're making > it true for now. The reason I'm suggesting it is that NUM_BUFFER_PARTITIONS (and NUM_LOCK_PARTITIONS) are the cases where we can expect many lwlocks to be held at the same time. It doesn't seem friendly to require users experimenting with changing this to know about a define that's private to lwlock.c. But I'm fine with doing this not at all or separately - there's no need to actually do it together. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
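A rough sketch of the relationship under discussion (this is not the actual lwlock.c code, and the bookkeeping layout shown is only an assumption): each backend remembers the lwlocks it currently holds in a fixed-size array bounded by MAX_SIMUL_LWLOCKS, so any code path that holds every buffer mapping partition lock at once needs NUM_BUFFER_PARTITIONS to fit under that bound.

#define NUM_BUFFER_PARTITIONS   128
#define MAX_SIMUL_LWLOCKS       (NUM_BUFFER_PARTITIONS + 64)    /* the define suggested above */

typedef struct LWLock LWLock;   /* opaque for this sketch */

static int     num_held_lwlocks = 0;
static LWLock *held_lwlocks[MAX_SIMUL_LWLOCKS];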
On 10/02/2014 05:40 PM, Robert Haas wrote: > On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: >>> OK. >> >> Given that the results look good, do you plan to push this? > > By "this", you mean the increase in the number of buffer mapping > partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS? Hmm, do we actually ever need to hold all the buffer partition locks at the same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, I couldn't find any place where we'd do that. I bumped up NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did "make check". It passed. - Heikki
On 2014-10-02 20:04:58 +0300, Heikki Linnakangas wrote: > On 10/02/2014 05:40 PM, Robert Haas wrote: > >On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >>>OK. > >> > >>Given that the results look good, do you plan to push this? > > > >By "this", you mean the increase in the number of buffer mapping > >partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS? > > Hmm, do we actually ever need to hold all the buffer partition locks at the > same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, I > couldn't find any place where we'd do that. I bumped up > NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did > "make check". It passed. Do a make check-world and it'll hopefully fail ;). Check pg_buffercache_pages.c. I'd actually quite like to have a pg_buffercache version that, at least optionally, doesn't do this, but that's a separate thing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 2, 2014 at 1:07 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Do a make check-world and it'll hopefully fail ;). Check > pg_buffercache_pages.c. Yep. Committed, with an update to the comments in lwlock.c to allude to the pg_buffercache issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
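The check-world failure mentioned above comes from pg_buffercache taking and holding every buffer mapping partition lock at once to get a consistent view of the buffer pool. A simplified sketch of that pattern follows; the lock routines are stand-ins so the example is self-contained, and the real contrib code differs in detail.

#define NUM_BUFFER_PARTITIONS 128

typedef int LWLockId;
typedef enum { LW_SHARED, LW_EXCLUSIVE } LWLockMode;

/* Stand-ins for the real lock routines, just to keep the sketch compilable. */
static void LWLockAcquire(LWLockId lock, LWLockMode mode) { (void) lock; (void) mode; }
static void LWLockRelease(LWLockId lock) { (void) lock; }

static const LWLockId FirstBufMappingLock = 0;  /* placeholder lock id */

static void
snapshot_buffer_pool(void)
{
    int         i;

    /* All partition locks are held simultaneously, so MAX_SIMUL_LWLOCKS must have room. */
    for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
        LWLockAcquire(FirstBufMappingLock + i, LW_SHARED);

    /* ... copy out the buffer headers here ... */

    for (i = NUM_BUFFER_PARTITIONS - 1; i >= 0; i--)
        LWLockRelease(FirstBufMappingLock + i);
}

Holding all 128 locks for the duration of the scan is itself heavyweight, which is presumably why Andres would like an optional pg_buffercache mode that skips it.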
>
> On another point, I think it would be a good idea to rebase the
> bgreclaimer patch over what I committed, so that we have a
> clean patch against master to test with.
Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.
Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 5000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
Below data is median of 3 runs.
patch_ver/client_count | 1 | 8 | 32 | 64 | 128 | 256 |
HEAD | 18884 | 118628 | 251093 | 216294 | 186625 | 177505 |
PATCH | 18743 | 122578 | 247243 | 205521 | 179712 | 175031 |
Here we can see that the performance dips at higher client
Attachment
On 2014-10-09 18:17:09 +0530, Amit Kapila wrote: > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > > On another point, I think it would be a good idea to rebase the > > bgreclaimer patch over what I committed, so that we have a > > clean patch against master to test with. > > Please find the rebased patch attached with this mail. I have taken > some performance data as well and done some analysis based on > the same. > > Performance Data > ---------------------------- > IBM POWER-8 24 cores, 192 hardware threads > RAM = 492GB > max_connections =300 > Database Locale =C > checkpoint_segments=256 > checkpoint_timeout =15min > shared_buffers=8GB > scale factor = 5000 > Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8) > Duration of each individual run = 5mins I don't think OLTP really is the best test case for this. Especially not pgbench with relatively small rows *and* a uniform distribution of access. Try parallel COPY TO. Batch write loads is where I've seen this hurt badly. > patch_ver/client_count 1 8 32 64 128 256 > HEAD 18884 118628 251093 216294 186625 177505 > PATCH 18743 122578 247243 205521 179712 175031 So, pretty much no benefits on any scale, right? > Here we can see that the performance dips at higher client > count(>=32) which was quite surprising for me, as I was expecting > it to improve, because bgreclaimer reduces the contention by making > buffers available on free list. So I tried to analyze the situation by > using perf and found that in above configuration, there is a contention > around freelist spinlock with HEAD and the same is removed by Patch, > but still the performance goes down with Patch. On further analysis, I > observed that actually after Patch there is an increase in contention > around ProcArrayLock (shared LWlock) via GetSnapshotData which > sounds bit odd, but that's what I can see in profiles. Based on analysis, > few ideas which I would like to further investigate are: > a. As there is an increase in spinlock contention, I would like to check > with Andres's latest patch which reduces contention around shared > lwlocks. > b. Reduce some instructions added by patch in StrategyGetBuffer(), > like instead of awakening bgreclaimer at low threshold, awaken when > it tries to do clock sweep. > Are you sure you didn't mix up the profiles here? The head vs. patched look more like profiles from different client counts than different versions of the code. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
>
> On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
> > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On another point, I think it would be a good idea to rebase the
> > > bgreclaimer patch over what I committed, so that we have a
> > > clean patch against master to test with.
> >
> > Please find the rebased patch attached with this mail. I have taken
> > some performance data as well and done some analysis based on
> > the same.
> >
> > Performance Data
> > ----------------------------
> > IBM POWER-8 24 cores, 192 hardware threads
> > RAM = 492GB
> > max_connections =300
> > Database Locale =C
> > checkpoint_segments=256
> > checkpoint_timeout =15min
> > shared_buffers=8GB
> > scale factor = 5000
> > Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> > Duration of each individual run = 5mins
>
> I don't think OLTP really is the best test case for this. Especially not
> pgbench with relatilvely small rows *and* a uniform distribution of
> access.
>
> Try parallel COPY TO. Batch write loads is where I've seen this hurt
> badly.
>
> > HEAD 18884 118628 251093 216294 186625 177505
> > PATCH 18743 122578 247243 205521 179712 175031
>
> So, pretty much no benefits on any scale, right?
Almost right; there seems to be a slight benefit at client count
> > Here we can see that the performance dips at higher client
> > count(>=32) which was quite surprising for me, as I was expecting
> > it to improve, because bgreclaimer reduces the contention by making
> > buffers available on free list. So I tried to analyze the situation by
> > using perf and found that in above configuration, there is a contention
> > around freelist spinlock with HEAD and the same is removed by Patch,
> > but still the performance goes down with Patch. On further analysis, I
> > observed that actually after Patch there is an increase in contention
> > around ProcArrayLock (shared LWlock) via GetSnapshotData which
> > sounds bit odd, but that's what I can see in profiles. Based on analysis,
> > few ideas which I would like to further investigate are:
> > a. As there is an increase in spinlock contention, I would like to check
> > with Andres's latest patch which reduces contention around shared
> > lwlocks.
> > b. Reduce some instructions added by patch in StrategyGetBuffer(),
> > like instead of awakening bgreclaimer at low threshold, awaken when
> > it tries to do clock sweep.
> >
>
> Are you sure you didn't mix up the profiles here?
On 2014-10-09 16:01:55 +0200, Andres Freund wrote: > On 2014-10-09 18:17:09 +0530, Amit Kapila wrote: > > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > On another point, I think it would be a good idea to rebase the > > > bgreclaimer patch over what I committed, so that we have a > > > clean patch against master to test with. > > > > Please find the rebased patch attached with this mail. I have taken > > some performance data as well and done some analysis based on > > the same. > > > > Performance Data > > ---------------------------- > > IBM POWER-8 24 cores, 192 hardware threads > > RAM = 492GB > > max_connections =300 > > Database Locale =C > > checkpoint_segments=256 > > checkpoint_timeout =15min > > shared_buffers=8GB > > scale factor = 5000 > > Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8) > > Duration of each individual run = 5mins > > I don't think OLTP really is the best test case for this. Especially not > pgbench with relatilvely small rows *and* a uniform distribution of > access. > > Try parallel COPY TO. Batch write loads is where I've seen this hurt > badly.

As an example. The attached scripts go from:

progress: 5.3 s, 20.9 tps, lat 368.917 ms stddev 49.655
progress: 10.1 s, 21.0 tps, lat 380.326 ms stddev 64.525
progress: 15.1 s, 14.1 tps, lat 568.108 ms stddev 226.040
progress: 20.4 s, 12.0 tps, lat 634.557 ms stddev 300.519
progress: 25.2 s, 17.5 tps, lat 461.738 ms stddev 136.257
progress: 30.2 s, 9.8 tps, lat 850.766 ms stddev 305.454
progress: 35.3 s, 12.2 tps, lat 670.473 ms stddev 271.075
progress: 40.2 s, 7.9 tps, lat 972.617 ms stddev 313.152
progress: 45.3 s, 14.9 tps, lat 546.056 ms stddev 211.987
progress: 50.2 s, 13.2 tps, lat 610.608 ms stddev 271.780
progress: 55.5 s, 16.9 tps, lat 468.757 ms stddev 156.516
progress: 60.5 s, 14.3 tps, lat 548.913 ms stddev 190.414
progress: 65.7 s, 9.3 tps, lat 821.293 ms stddev 353.665
progress: 70.1 s, 16.0 tps, lat 524.240 ms stddev 174.903
progress: 75.2 s, 17.0 tps, lat 485.692 ms stddev 194.273
progress: 80.2 s, 19.9 tps, lat 396.295 ms stddev 78.891
progress: 85.3 s, 18.3 tps, lat 423.744 ms stddev 105.798
progress: 90.1 s, 14.5 tps, lat 577.373 ms stddev 270.914
progress: 95.3 s, 12.0 tps, lat 649.434 ms stddev 247.001
progress: 100.3 s, 14.6 tps, lat 563.693 ms stddev 275.236
tps = 14.812222 (including connections establishing)

to:

progress: 5.1 s, 18.9 tps, lat 409.766 ms stddev 75.032
progress: 10.3 s, 20.2 tps, lat 396.781 ms stddev 67.593
progress: 15.1 s, 19.1 tps, lat 418.545 ms stddev 109.431
progress: 20.3 s, 20.6 tps, lat 388.606 ms stddev 74.259
progress: 25.1 s, 19.5 tps, lat 406.591 ms stddev 109.050
progress: 30.0 s, 19.1 tps, lat 420.199 ms stddev 157.005
progress: 35.0 s, 18.4 tps, lat 421.102 ms stddev 124.019
progress: 40.3 s, 12.3 tps, lat 640.640 ms stddev 88.409
progress: 45.2 s, 12.8 tps, lat 586.471 ms stddev 145.543
progress: 50.5 s, 6.9 tps, lat 1116.603 ms stddev 285.479
progress: 56.2 s, 6.3 tps, lat 1349.055 ms stddev 381.095
progress: 60.6 s, 7.9 tps, lat 1083.745 ms stddev 452.386
progress: 65.0 s, 9.6 tps, lat 805.981 ms stddev 273.845
progress: 71.1 s, 9.6 tps, lat 798.273 ms stddev 184.108
progress: 75.2 s, 9.3 tps, lat 950.131 ms stddev 150.870
progress: 80.8 s, 8.6 tps, lat 899.389 ms stddev 135.090
progress: 85.3 s, 8.8 tps, lat 928.183 ms stddev 152.056
progress: 90.9 s, 8.0 tps, lat 929.737 ms stddev 71.155
progress: 95.7 s, 9.0 tps, lat 968.070 ms stddev 127.824
progress: 100.3 s, 8.7 tps, lat 911.767 ms stddev 130.697

just by switching shared_buffers from 1 to 8GB. I haven't tried, but I hope that with an approach like yours this might become better.

psql -f /tmp/prepare.sql
pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
> >
> > I don't think OLTP really is the best test case for this. Especially not
> > pgbench with relatilvely small rows *and* a uniform distribution of
> > access.
> >
> > Try parallel COPY TO. Batch write loads is where I've seen this hurt
> > badly.
>
>
> just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
> hope that with an approach like your's this might become better.
>
> psql -f /tmp/prepare.sql
> pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100
RAM = 492GB
On 2014-10-10 12:28:13 +0530, Amit Kapila wrote: > On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund <andres@2ndquadrant.com> > wrote: > > On 2014-10-09 16:01:55 +0200, Andres Freund wrote: > > > > > > I don't think OLTP really is the best test case for this. Especially not > > > pgbench with relatilvely small rows *and* a uniform distribution of > > > access. > > > > > > Try parallel COPY TO. Batch write loads is where I've seen this hurt > > > badly. > > > > > > just by switching shared_buffers from 1 to 8GB. I haven't tried, but I > > hope that with an approach like your's this might become better. > > > > psql -f /tmp/prepare.sql > > pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100 > > Thanks for providing the scripts. You haven't specified how much data > is present in 'large' file used in tests. I don't think it matters much. It should be small enough that you get a couple TPS over all backends. > I have tried with different set of > rows, but I could not see the dip that is present in your data when you > increased shared buffers from 1GB to 8GB, also I don't see any difference > with patch. Interesting. I wonder whether that's because the concurrency wasn't high enough for that machine to show the problem. I ran the test on my workstation which has 8 actual cores. > BTW, why do you think that for such worklaods this patch can > be helpful, according to my understanding it can be helpful mainly for > read mostly workloads when all the data doesn't fit in shared buffers. The performance dip comes from all the backends performing the clock sweep. As the access is pretty uniform, all the buffers start with some usage count (IIRC 3 in this example. Much worse if 5). Due to the uniform usagecount the clock sweep frequently has to go several times through *all* the buffers. That leads to quite horrible performance in some cases. I had hoped that bgreclaimer can make that workload less horrible by funneling most of the accesses through the freelist. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
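To make that failure mode concrete, here is a toy model (not PostgreSQL code) of the sweep being described: when every buffer sits at the same non-zero usage count, a victim search has to decrement its way through the whole pool several full passes before anything becomes evictable.

#include <stdint.h>

#define NBUFFERS 1000000

static uint8_t usage_count[NBUFFERS];   /* imagine all of these start at 3 */
static int     next_victim = 0;

/* Simplified clock sweep: no locking, no pinned-buffer checks. */
static int
clock_sweep_get_victim(void)
{
    for (;;)
    {
        int buf = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;             /* found an evictable buffer */
        usage_count[buf]--;         /* otherwise pay for a decrement and move on */
    }
}

With a uniform usage count of 3, the first caller touches roughly 3 * NBUFFERS buffers before returning, which is the kind of work a background reclaimer could take off the foreground path by keeping the freelist populated.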
>
> On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On another point, I think it would be a good idea to rebase the
> > bgreclaimer patch over what I committed, so that we have a
> > clean patch against master to test with.
>
> Please find the rebased patch attached with this mail. I have taken
> some performance data as well and done some analysis based on
> the same.
>
>
>
> Below data is median of 3 runs.
>
> patch_ver/client_count 1 8 32 64 128 256
> HEAD 18884 118628 251093 216294 186625 177505
> PATCH 18743 122578 247243 205521 179712 175031
>
>
> Here we can see that the performance dips at higher client
> count(>=32) which was quite surprising for me, as I was expecting
> it to improve, because bgreclaimer reduces the contention by making
> buffers available on free list. So I tried to analyze the situation by
> using perf and found that in above configuration, there is a contention
> around freelist spinlock with HEAD and the same is removed by Patch,
> but still the performance goes down with Patch. On further analysis, I
> observed that actually after Patch there is an increase in contention
> around ProcArrayLock (shared LWlock) via GetSnapshotData which
> sounds bit odd, but that's what I can see in profiles. Based on analysis,
> few ideas which I would like to further investigate are:
> a. As there is an increase in spinlock contention, I would like to check
> with Andres's latest patch which reduces contention around shared
> lwlocks.
Attachment
On 2014-10-14 15:24:57 +0530, Amit Kapila wrote: > After that I observed that contention for LW_SHARED has reduced > for this load, but it didn't help much in terms of performance, so I again > rechecked the profile and this time most of the contention is moved > to spinlock used in dynahash for buf mapping tables, please refer > the profile (for 128 client count; Read only load) below: > > bgreclaimer patch + wait free lw_shared acquisition patches - > ------------------------------------------------------------------------------------------ > > 9.24% swapper [unknown] [H] 0x00000000011d9c10 > + 7.19% postgres postgres [.] s_lock > + 3.52% postgres postgres [.] GetSnapshotData > + 3.02% postgres postgres [.] calc_bucket > + 2.71% postgres postgres [.] > hash_search_with_hash_value > 2.32% postgres [unknown] [H] 0x00000000011e0d7c > + 2.17% postgres postgres [.] > pg_atomic_fetch_add_u32_impl > + 1.84% postgres postgres [.] AllocSetAlloc > + 1.57% postgres postgres [.] _bt_compare > + 1.05% postgres postgres [.] AllocSetFreeIndex > + 1.02% postgres [kernel.kallsyms] [k] > .__copy_tofrom_user_power7 > + 0.94% postgres postgres [.] tas > + 0.85% swapper [kernel.kallsyms] [k] .int_sqrt > + 0.80% postgres postgres [.] pg_encoding_mbcliplen > + 0.78% pgbench [kernel.kallsyms] [k] .find_busiest_group > 0.65% pgbench [unknown] [H] 0x00000000011d96e0 > + 0.59% postgres postgres [.] hash_any > + 0.54% postgres postgres [.] LWLockRelease This profile is without -O2 again. I really don't think it makes much sense to draw much inference from an unoptimized build. I realize that you said that the builds you use for benchmarking don't have that problem, but that doesn't make this profile meaningful... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
> > After that I observed that contention for LW_SHARED has reduced
> > for this load, but it didn't help much in terms of performance, so I again
> > rechecked the profile and this time most of the contention is moved
> > to spinlock used in dynahash for buf mapping tables, please refer
> > the profile (for 128 client count; Read only load) below:
> >
> > bgreclaimer patch + wait free lw_shared acquisition patches -
> > ------------------------------------------------------------------------------------------
>
> This profile is without -O2 again. I really don't think it makes much
> sense to draw much inference from an unoptimized build.
> On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On another point, I think it would be a good idea to rebase the
> > > bgreclaimer patch over what I committed, so that we have a
> > > clean patch against master to test with.
> >
> > Please find the rebased patch attached with this mail. I have taken
> > some performance data as well and done some analysis based on
> > the same.
> >
> >
> >
>
> To reduce the above contention, I tried to write a patch to replace the spin lock
> used in dynahash to manage the free list with atomic operations. Still there
> is work pending for this patch with respect to ensuring whether the
> approach used in patch is completely sane, however I am posting the
> patch so that others can have a look at it and give me feedback about
> the approach.
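The patch itself is only attached upthread. As a rough sketch of the general idea being described (managing a free list with atomic compare-and-swap instead of a spinlock), something like the following; this is purely illustrative, ignores dynahash's entry accounting, and glosses over the ABA problem that a real lock-free list has to address.

#include <stdatomic.h>
#include <stddef.h>

typedef struct FreeListElem
{
    struct FreeListElem *next;
} FreeListElem;

static _Atomic(FreeListElem *) freelist_head;

/* Pop the head of the free list with a CAS loop instead of a spinlock. */
static FreeListElem *
freelist_pop(void)
{
    FreeListElem *head = atomic_load(&freelist_head);

    while (head != NULL &&
           !atomic_compare_exchange_weak(&freelist_head, &head, head->next))
    {
        /* head was reloaded by the failed CAS; retry */
    }
    return head;
}

/* Push an element back by linking it in front of the current head. */
static void
freelist_push(FreeListElem *elem)
{
    FreeListElem *head = atomic_load(&freelist_head);

    do
    {
        elem->next = head;
    } while (!atomic_compare_exchange_weak(&freelist_head, &head, elem));
}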
On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> But this gets at another point: the way we're benchmarking this right
> now, we're really conflating the effects of three different things:
>
> 1. Changing the locking regimen around the freelist and clocksweep.
> 2. Adding a bgreclaimer process.
> 3. Raising the number of buffer locking partitions.

First of all, thanks for committing part-1 of these changes. It seems you are planning to commit part-3 based on the results of tests which Andres is planning to do, and for the remaining part (part-2), today
On 2014-11-11 09:29:22 +0000, Thom Brown wrote: > On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> > > wrote: > > > > > > But this gets at another point: the way we're benchmarking this right > > > now, we're really conflating the effects of three different things: > > > > > > 1. Changing the locking regimen around the freelist and clocksweep. > > > 2. Adding a bgreclaimer process. > > > 3. Raising the number of buffer locking partitions. > > > > First of all thanks for committing part-1 of this changes and it > > seems you are planing to commit part-3 based on results of tests > > which Andres is planing to do and for remaining part (part-2), today > > > > Were parts 2 and 3 committed in the end? 3 was committed. 2 wasn't because it's not yet clear how beneficial it is generally. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> On 2014-11-11 09:29:22 +0000, Thom Brown wrote:
> > On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com>
> > > wrote:
> > > >
> > > > But this gets at another point: the way we're benchmarking this right
> > > > now, we're really conflating the effects of three different things:
> > > >
> > > > 1. Changing the locking regimen around the freelist and clocksweep.
> > > > 2. Adding a bgreclaimer process.
> > > > 3. Raising the number of buffer locking partitions.
> > >
> > > First of all thanks for committing part-1 of this changes and it
> > > seems you are planing to commit part-3 based on results of tests
> > > which Andres is planing to do and for remaining part (part-2), today
> > >
> >
> > Were parts 2 and 3 committed in the end?
>
> 3 was committed. 2 wasn't because it's not yet clear whether how
> beneficial it is generally.
>
As shown upthread, this patch (as it stands today) is dependent on