Thread: Scaling shared buffer eviction

Scaling shared buffer eviction

From

Amit Kapila

Date:

15 May 2014, 05:41:58

As mentioned previously about my interest in improving shared
buffer eviction especially by reducing contention around
BufFreelistLock, I would like to share my progress about the
same.

The test used for this work is mainly the case when all the
data doesn't fit in shared buffers, but does fit in memory.
It is mainly based on previous comparison done by Robert
for similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html

To start with, I have taken LWLOCK_STATS report to confirm
the contention around BufFreelistLock and the data for HEAD
is as follows:

M/c details
IBM POWER-7 16 cores, 64 hardware threads
RAM - 64GB
Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5mins
./pgbench -c 64 -j 64 -T 300 -S  postgres

LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023

Here the high *blk* count for scale factor 3000 clearly shows
that to find a usable buffer when data doesn't fit in shared buffers
it has to wait.

To solve this issue, I have implemented a patch which makes
sure that there are always enough buffers on freelist such that
the need for backend to run clock-sweep is minimal, the
implementation idea is more or less same as discussed
previously in below thread, so I will explain it at end of mail.
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

LWLOCK_STATS data after Patch (test used is same as
used for HEAD):

BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

Here the low *exacq* and *blk* count shows that the need to
run clock sweep for backend has reduced significantly.

Performance Data
-------------------------------
shared_buffers= 8GB
number of threads - 64
sc - scale factor

          sc      tps
Head    3000    45569
Patch   3000    46457
Head    1000    93037
Patch   1000    92711

Above data shows that there is no significant change in
performance or scalability even after the contention is
reduced significantly around BufFreelistLock.

I have analyzed the patch both with perf record and
LWLOCK_STATS, both indicates that there is a high
contention around BufMappingLocks.

Data With perf record -a -g
-----------------------------------------

+  10.14%          swapper  [kernel.kallsyms]      [k]
.pseries_dedicated_idle_sleep
+   7.77%         postgres  [kernel.kallsyms]      [k] ._raw_spin_lock
+   6.88%         postgres  [kernel.kallsyms]      [k]
.function_trace_call
+   4.15%          pgbench  [kernel.kallsyms]      [k] .try_to_wake_up
+   3.20%          swapper  [kernel.kallsyms]      [k]
.function_trace_call
+   2.99%          pgbench  [kernel.kallsyms]      [k]
.function_trace_call
+   2.41%         postgres  postgres               [.] AllocSetAlloc
+   2.38%         postgres  [kernel.kallsyms]      [k] .try_to_wake_up
+   2.27%          pgbench  [kernel.kallsyms]      [k] ._raw_spin_lock
+   1.49%         postgres  [kernel.kallsyms]      [k]
._raw_spin_lock_irq
+   1.36%         postgres  postgres               [.]
AllocSetFreeIndex
+   1.09%          swapper  [kernel.kallsyms]      [k] ._raw_spin_lock
+   0.91%         postgres  postgres               [.] GetSnapshotData
+   0.90%         postgres  postgres               [.]
MemoryContextAllocZeroAligned


Expanded graph
------------------------------

-  10.14%          swapper  [kernel.kallsyms]      [k]
.pseries_dedicated_idle_sleep
   - .pseries_dedicated_idle_sleep
      - 10.13% .pseries_dedicated_idle_sleep
         - 10.13% .cpu_idle
            - 10.00% .start_secondary
                 .start_secondary_prolog
-   7.77%         postgres  [kernel.kallsyms]      [k] ._raw_spin_lock
   - ._raw_spin_lock
      - 6.63% ._raw_spin_lock
         - 5.95% .double_rq_lock
            - .load_balance
               - 5.95% .__schedule
                  - .schedule
                     - 3.27% .SyS_semtimedop
                          .SyS_ipc
                          syscall_exit
                          semop
                          PGSemaphoreLock
                          LWLockAcquireCommon
                        - LWLockAcquire
                           - 3.27% BufferAlloc
                                ReadBuffer_common
                             - ReadBufferExtended
                                 - 3.27% ReadBuffer
                                    - 2.73% ReleaseAndReadBuffer
                                       - 1.70% _bt_relandgetbuf
                                            _bt_search
                                            _bt_first
                                            btgettuple


It shows BufferAlloc->LWLOCK as top contributor and we use
BufMappingLocks in BufferAlloc, I have checked other expanded
calls as well, StrategyGetBuffer is not present in top contributors.

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after patch, there is no contention
for BufFreeListLock, rather there is a huge contention around
BufMappingLocks.  I have checked that HEAD also has contention
around BufMappingLocks.

As per my analysis till now, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability, we need
to work on reducing contention around BufMappingLocks as well.

Details of patch
------------------------
1. Changed bgwriter to move buffers (having usage_count as zero)
on free list based on  threshold (high_watermark) and decrement the
usage count if usage_count is greater than zero.
2. StrategyGetBuffer() will wakeup bgwriter when the number of
buffers in freelist drop under low_watermark.
Currently I am using hard-coded values, we can choose to make
them as configurable later on if required.
3. Work done to get a buffer from freelist is done under spin lock
and clock sweep still runs under BufFreelistLock.

This is still a WIP patch and some of the changes are just kind
of prototype to check the idea, like I have hacked bgwriter code
such that it continuously fills the freelist till it is able to put
enough buffers on freelist such that it reaches high_watermark
and commented some part of previous code.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

scalable_buffer_eviction_v1.patch

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

16 May 2014, 14:51:20

On Thu, May 15, 2014 at 11:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after patch, there is no contention
for BufFreeListLock, rather there is a huge contention around
BufMappingLocks. I have checked that HEAD also has contention
around BufMappingLocks.

As per my analysis till now, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability, we need
to work on reducing contention around BufMappingLocks as well.

To reduce the contention around BufMappingLocks, I have tried the patch

by just increasing the Number of Buffer Partitions, and it actually shows

a really significant increase in scalability both due to reduced contention

around BufFreeListLock and BufMappingLocks. The real effect of reducing

contention around BufFreeListLock was hidden because the whole contention

was shifted to BufMappingLocks. I have taken performance data for both

HEAD+increase_buf_part and Patch+increase_buf_part to clearly see the

benefit of reducing contention around BufFreeListLock. This data has been

taken using pgbench read only load (Select).

Performance Data

-------------------------------

HEAD + 64 = HEAD + (NUM_BUFFER_PARTITONS(64) +

LOG2_NUM_LOCK_PARTITIONS(6))

V1 + 64 = PATCH + (NUM_BUFFER_PARTITONS(64) +

LOG2_NUM_LOCK_PARTITIONS(6))

Similarly 128 means 128 buffer partitions

shared_buffers= 8GB

scale factor = 3000
RAM - 64GB

	Thrds (64)	Thrds (128)
HEAD	45562	17128
HEAD + 64	57904	32810
V1 + 64	105557	81011
HEAD + 128	58383	32997
V1 + 128	110705	114544

shared_buffers= 8GB

scale factor = 1000
RAM - 64GB

	Thrds (64)	Thrds (128)
HEAD	92142	31050
HEAD + 64	108120	86367
V1 + 64	117454	123429
HEAD + 128	107762	86902
V1 + 128	123641	124822

Observations

-------------------------

1. There is increase of upto 5 times in performance for data that can fit

in memory but not in shared buffers

2. Though there is a increase in performance by just increasing number

of buffer partitions, but it doesn't scale well (especially see the case

when partitions have increased to 128 from 64).

I have verified that contention has reduced around BufMappingLocks

by running the patch with LWLOCKS

BufFreeListLock

PID 17894 lwlock main 0: shacq 0 exacq 171 blk 27 spindelay 1

BufMappingLocks

PID 17902 lwlock main 38: shacq 12770 exacq 10104 blk 282 spindelay 0

PID 17924 lwlock main 39: shacq 11409 exacq 10257 blk 243 spindelay 0

PID 17929 lwlock main 40: shacq 13120 exacq 10739 blk 239 spindelay 0

PID 17940 lwlock main 41: shacq 11865 exacq 10373 blk 262 spindelay 0

PID 17831 lwlock main 162: shacq 12706 exacq 10267 blk 199 spindelay 0

PID 17826 lwlock main 163: shacq 11081 exacq 10256 blk 168 spindelay 0

PID 17903 lwlock main 164: shacq 11494 exacq 10375 blk 176 spindelay 0

PID 17899 lwlock main 165: shacq 12043 exacq 10485 blk 216 spindelay 0

We can clearly notice that the number for *blk* has reduced significantly

which shows that contention has reduced.

The patch is still in a shape to prove the merit of idea and I have just

changed the number of partitions so that if someone wants to verify

the performance for similar load, it can be done by just applying

the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

scalable_buffer_eviction_v2.patch

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

17 May 2014, 00:32:07

On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thrds (64) Thrds (128)
HEAD 45562 17128
HEAD + 64 57904 32810
V1 + 64 105557 81011
HEAD + 128 58383 32997
V1 + 128 110705 114544

I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on it's own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault.

In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different. I started working on that almost 2 years ago, building a hash-table that can be read without requiring any locks and written with, well, less locking than what we have right now:

http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash

I never got quite as far as trying to hook that up to the buffer mapping machinery, but maybe that would be worth doing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Peter Geoghegan

Date:

17 May 2014, 00:59:11

On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

Thrds (64) Thrds (128)
HEAD 45562 17128
HEAD + 64 57904 32810
V1 + 64 105557 81011
HEAD + 128 58383 32997
V1 + 128 110705 114544

shared_buffers= 8GB
scale factor = 1000
RAM - 64GB

Thrds (64) Thrds (128)
HEAD 92142 31050
HEAD + 64 108120 86367
V1 + 64 117454 123429
HEAD + 128 107762 86902
V1 + 128 123641 124822

I'm having a little trouble following this. These figure are transactions per second for a 300 second pgbench tpc-b run? What does "Thrds" denote?

--
Peter Geoghegan

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

17 May 2014, 03:31:55

On Sat, May 17, 2014 at 6:29 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
>
> On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> shared_buffers= 8GB
>> scale factor = 3000
>> RAM - 64GB
>
> I'm having a little trouble following this. These figure are transactions per second for a 300 second pgbench tpc-b run?

Yes, the figures are tps for a 300 second run.

It is for select-only transactions.

What does "Thrds" denote?

It denotes number of threads (-j in pgbench run)

I have used below statements to take data
./pgbench -c 64 -j 64 -T 300 -S postgres

./pgbench -c 128 -j 128 -T 300 -S postgres

The reason for posting the numbers for 64/128 threads is because we have

mainly concurrency bottleneck when the number of connections are higher

than CPU cores and I am using 16 cores, 64 hardware threads m/c.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

17 May 2014, 04:54:31

On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on it's own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault.
>
> In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different.

Is there anything bad except for may be increase in LWLocks with scaling

hash partitions w.r.t to shared buffers either by auto tuning or by having a

configuration knob. I understand that it would be bit difficult for users to

estimate the correct value of such a parameter, we can provide info about

its usage in docs such that if user increases shared buffers by 'X' (20 times)

of default value (128M), then consider increasing such partitions and it should

be always power of 2 or does something similar to above internally in code.

I agree that may be even by having a reasonably good estimate of number of

partitions w.r.t shared buffers, we might not be able to eliminate the contention

around BufMappingLocks, but I think the scalability we get by doing that is not

bad either.

> I started working on that almost 2 years ago, building a hash-table that can be read without requiring any locks and written with, well, less locking than what we have right now:

I have still not read the complete code, but by just going through initial file

header, it seems to me that it will be much better than current

implementation in terms of concurrency, by the way does such an

implementation can extend to reducing scalability for hash indexes as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

05 June 2014, 08:43:44

On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thrds (64) Thrds (128)
HEAD 45562 17128
HEAD + 64 57904 32810
V1 + 64 105557 81011
HEAD + 128 58383 32997
V1 + 128 110705 114544

I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on it's own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault.

I have improved the patch by making following changes:

a. Improved the bgwriter logic to log for xl_running_xacts info and

removed the hibernate logic as bgwriter will now work only when

there is scarcity of buffer's in free list. Basic idea is when the

number of buffers on freelist drops below the low threshold, the

allocating backend sets the latch and bgwriter wakesup and begin

adding buffer's to freelist until it reaches high threshold and then

again goes back to sleep.

b. New stats for number of buffers on freelist has been added, some

old one's like maxwritten_clean can be removed as new logic for

syncing buffers and moving them to free list doesn't use them.

However I think it's better to remove them once the new logic is

accepted. Added some new logs for info related to free list under

BGW_DEBUG

c. Used the already existing bgwriterLatch in BufferStrategyControl to

wake bgwriter when number of buffer's in freelist drops below

threshold.

d. Autotune the low and high threshold for freelist for various

configurations. Generally if keep small number (200~2000) of buffers

always available on freelist, then even for high shared buffers

like 15GB, it appears to be sufficient. However when the value

of shared buffer's is less, then we need much smaller number. I

think we can provide these as config knobs for user as well, but for

now based on LWLOCK_STATS result, I have chosen some hard

coded values for low and high threshold values for freelist.

Values for low and high threshold have been decided based on total

number of shared buffers, basically I have divided them into 5

categories (16~100, 100~1000, 1000~10000, 10000~100000,

100000 and above) and then ran tests(read-only pgbench) for various

configurations falling under these categories. The reason for keeping

lesser categories for larger shared buffers is that if there are small

number (200~2000) of buffers available on free list, then it seems to

be sufficient for quite high loads, however as the total number of shared

buffer's decreases we need to be more careful as if we keep the number as

too low then it will lead to more clock sweep by backends (which means

freelist lock contention) and if we keep number higher bgwriter will evict

many useful buffers. Results based on LWLOCK_STATS is at end of mail.

e. One reason why I think number of buf-partitions is hard-coded to 16 is that

minimum number of shared buffers allowed are 16 (128kb). However, there

is handling in code (in function init_htab()) which ensure that even if number

of partitions are more that shared buffers, it handles it safely.

I have checked the bgwriter CPU usage with and without patch

for various configurations and the observation is that for most of the

loads bgwriter's CPU usage after patch is between 8~20% and in

HEAD it is 0~2%. It shows that with patch when shared buffers

are under use by backends, bgwriter is constantly doing work to

ease the work of backends. Detailed data is provided later in the

mail.

Performance Data:

-------------------------------

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads

RAM = 64GB

Database Locale =C

checkpoint_segments=256

checkpoint_timeout =15min

shared_buffers=8GB

scale factor = 3000

Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)

Duration of each individual run = 5mins

Client Count/patch_ver (tps)	8	16	32	64	128
Head	26220	48686	70779	45232	17310
Patch	26402	50726	75574	111468	114521

Data is taken by using script (pert_buff_mgmt.sh) attached with mail.

This data is read-only pgbench data with different number of client

connections. All the numbers are in tps. This data is median of 3

5 min pgbench read-only runs. Please find the detailed data for 3 runs

in attached open office document (perf_read_scalability_data_v3.ods)

This data clearly shows that patch has improved improved performance

upto 5~6 times.

Results of BGwriter CPU usage:

--------------------------------------------------

Here sc is scale factor and sb is shared buffers and the data is

for read-only pgbench runs.

./pgbench -c 64 - j 64 -S -T 300 postgres

sc - 3000, sb - 8GB

HEAD

CPU usage - 0~2.3%

Patch v_3

CPU usage - 8.6%

sc - 100, sb - 128MB

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

CPU Usage - 1~2%

tps- 36199.047132

Patch v_3

CPU usage - 12~13%

tps = 109182.681827

sc - 50, sb - 75MB

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

CPU Usage - 0.7~2%

tps- 37760.575128

Patch v_3

CPU usage - 20~22%

tps = 106310.744198

./pgbench -c 16 - j 16 -S -T 300 postgres

sc - 100, sb - 128kb

--need to change pgbench for this.

HEAD

CPU Usage - 0~0.3%

tps- 40979.529254

Patch v_3

CPU usage - 35~40%

tps = 42956.785618

Results of LWLOCK_STATS based on low-high threshold values of freelist:

--------------------------------------------------------------------------------------------------------------

In the results, values of exacq and blk shows the contention on freelist lock.

sc is scale factor and sb is number of shared_buffers. Below results shows

that for all except one (1MB) of configuration the contention around buffreelist

lock is reduced significantly. For 1MB case also, it has reduced exacq count

which shows that it has performed clock sweep lesser number of times.

sc - 3000, sb - 15GB --(sb > 100000)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 4406 lwlock main 0: shacq 0 exacq 84482 blk 5139 spindelay 62

Patch v_3

PID 4864 lwlock main 0: shacq 0 exacq 34 blk 1 spindelay 0

sc - 3000, sb - 8GB --(sb > 100000)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 24124 lwlock main 0: shacq 0 exacq 285155 blk 33910 spindelay 548

Patch v_3

PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

sc - 100, sb - 768MB --(sb > 10000)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 9144 lwlock main 0: shacq 0 exacq 284636 blk 34091 spindelay 555

Patch v-3 (lw=100,hg=1000)

PID 9428 lwlock main 0: shacq 0 exacq 306 blk 59 spindelay 0

sc - 100, sb - 128MB --(sb > 10000)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 5405 lwlock main 0: shacq 0 exacq 285449 blk 32345 spindelay 714

Patch v-3

PID 8625 lwlock main 0: shacq 0 exacq 740 blk 178 spindelay 0

sc - 50, sb - 75MB --(sb > 1000)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 12681 lwlock main 0: shacq 0 exacq 289347 blk 34064 spindelay 773

Patch v3

PID 12800 lwlock main 0: shacq 0 exacq 76287 blk 15183 spindelay 28

sc - 50, sb - 10MB --(sb > 1000)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 10283 lwlock main 0: shacq 0 exacq 287500 blk 32177 spindelay 864

Patch v3 (for > 1000, lw = 50 hg =200)

PID 11629 lwlock main 0: shacq 0 exacq 60139 blk 12978 spindelay 40

sc - 1, sb - 7MB --(sb > 100)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 47127 lwlock main 0: shacq 0 exacq 289462 blk 37057 spindelay 119

Patch v3

PID 47283 lwlock main 0: shacq 0 exacq 9507 blk 1656 spindelay 0

sc - 1, sb - 1MB --(sb > 100)

./pgbench -c 64 - j 64 -S -T 300 postgres

HEAD

PID 43215 lwlock main 0: shacq 0 exacq 301384 blk 36740 spindelay 902

Patch v3

PID 46542 lwlock main 0: shacq 0 exacq 197231 blk 37532 spindelay 294

sc - 100, sb - 128kb focus(sb > 16)

./pgbench -c 16 - j 16 -S -T 300 postgres (for this, I need to reduce value

of naccounts to 2500, else it was always giving no unpinned buffers

available)

HEAD

PID 49751 lwlock main 0: shacq 0 exacq 1821276 blk 130119 spindelay 7

Patch v3

PID 50768 lwlock main 0: shacq 0 exacq 382610 blk 46543 spindelay 1

More Datapoints and work:

a. I have yet to take data by merging it with scalable lwlock patch of

Andres (https://commitfest.postgresql.org/action/patch_view?id=1313).

There are many conflicts in the patch, so waiting for an updated patch.

b. Read-only data for more configurations.

c. Data for Write work load (tpc-b of pgbench, Bulk insert (Copy))

d. Update docs and Remove unused code.

Suggestions?

With Regards,
Amit Kapila.

EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Scaling shared buffer eviction

From

Kevin Grittner

Date:

08 June 2014, 13:51:55

Amit Kapila <amit.kapila16@gmail.com> wrote:

> I have improved the patch  by making following changes:
> a. Improved the bgwriter logic to log for xl_running_xacts info and
>    removed the hibernate logic as bgwriter will now work only when
>    there is scarcity of buffer's in free list. Basic idea is when the
>    number of buffers on freelist drops below the low threshold, the
>    allocating backend sets the latch and bgwriter wakesup and begin
>    adding buffer's to freelist until it reaches high threshold and then
>    again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me.  My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls.  By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always push out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

09 June 2014, 04:03:20

On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I have improved the patch by making following changes:
> > a. Improved the bgwriter logic to log for xl_running_xacts info and
> > removed the hibernate logic as bgwriter will now work only when
> > there is scarcity of buffer's in free list. Basic idea is when the
> > number of buffers on freelist drops below the low threshold, the
> > allocating backend sets the latch and bgwriter wakesup and begin
> > adding buffer's to freelist until it reaches high threshold and then
> > again goes back to sleep.
>
> The numbers from your benchmarks are very exciting, but the above
> concerns me. My tuning of the bgwriter in production has generally
> *not* been aimed at keeping pages on the freelist, but toward
> preventing shared_buffers from accumulating a lot of dirty pages,
> which were leading to cascades of writes between caches and thus to
> write stalls. By pushing dirty pages into the (*much* larger) OS
> cache, and letting write combining happen there, where the OS could
> pace based on the total number of dirty pages instead of having
> some hidden and appearing rather suddenly, latency spikes were
> avoided while not causing any noticeable increase in the number of
> OS writes to the RAID controller's cache.

>
> Essentially I was able to tune the bgwriter so that a dirty page

> was always push out to the OS cache within three seconds, which led
> to a healthy balance of writes between the checkpoint process and
> the bgwriter.

I think it would have been better if bgwriter does writes based on

the amount of buffer's that get dirtied to achieve the balance of

writes.

> Backend processes related to user connections still
> performed about 30% of the writes, and this work shows promise
> toward bringing that down, which would be great; but please don't
> eliminate the ability to prevent write stalls in the process.

I agree that for some cases as explained by you, the current bgwriter

logic does satisfy the need, however there are other cases as well

where actually it doesn't help much, one of such cases I am trying to

improve (ease backend buffer allocations) and other may be when

there is constant write activity for which I am not sure how much it

really helps. Part of the reason for trying to make bgwriter respond

mainly to ease backend allocations is the previous discussion for

the same, refer below link:

http://www.postgresql.org/message-id/CA+TgmoZ7dvhC4h-ffJmZCff6VWyNfOEAPZ021VxW61uH46R3QA@mail.gmail.com

However if we want to retain current property of bgwriter, we can do

the same by one of below ways:

a. Have separate processes for writing dirty buffers and moving buffers

to freelist.

b. In the current bgwriter, separate the two works based on the need.

The need can be decided based on whether bgwriter has been waked

due to shortage of buffers on free list or if it has been waked due to

BgWriterDelay.

Now as populating freelist and balance writes by writing dirty buffers

are two separate responsibilities, so not sure if doing that by one

process is a good idea.

I am planing to take some more performance data, part of which will

be write load as well, but I am now sure if that can anyway show the

need as mentioned by you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

10 June 2014, 15:48:11

On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have improved the patch  by making following changes:
>> a. Improved the bgwriter logic to log for xl_running_xacts info and
>>    removed the hibernate logic as bgwriter will now work only when
>>    there is scarcity of buffer's in free list. Basic idea is when the
>>    number of buffers on freelist drops below the low threshold, the
>>    allocating backend sets the latch and bgwriter wakesup and begin
>>    adding buffer's to freelist until it reaches high threshold and then
>>    again goes back to sleep.
>
> The numbers from your benchmarks are very exciting, but the above
> concerns me.  My tuning of the bgwriter in production has generally
> *not* been aimed at keeping pages on the freelist,

Just to be clear, prior to this patch, the bgwriter has never been in
the business of putting pages on the freelist in the first place, so
it wouldn't have been possible for you to tune for that.

> Essentially I was able to tune the bgwriter so that a dirty page
> was always push out to the OS cache within three seconds, which led
> to a healthy balance of writes between the checkpoint process and
> the bgwriter. Backend processes related to user connections still
> performed about 30% of the writes, and this work shows promise
> toward bringing that down, which would be great; but please don't
> eliminate the ability to prevent write stalls in the process.

I think, as Amit says downthread, that the crucial design question
here is whether we need two processes, one to populate the freelist so
that regular backends don't need to run the clock sweep, and a second
to flush dirty buffers, or whether a single process can serve both
needs.  In favor of a single process, many people have commented that
the background writer doesn't seem to do much right now.  If the
process is mostly sitting around idle, then giving it more
responsibilities might be OK.  In favor of having a second process,
I'm a little concerned that if the background writer gets busy writing
a page, it might then be unavailable to populate the freelist until it
finishes, which might be a very long time relative to the buffer
allocation needs of other backends.  I'm not sure what the right
answer is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

26 June 2014, 04:31:04

On Mon, Jun 9, 2014 at 9:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > Backend processes related to user connections still
> > performed about 30% of the writes, and this work shows promise
> > toward bringing that down, which would be great; but please don't
> > eliminate the ability to prevent write stalls in the process.
>
>
> I am planing to take some more performance data, part of which will
> be write load as well, but I am now sure if that can anyway show the
> need as mentioned by you.

After taking the performance data for write load using tpc-b with the

patch, I found that there is a regression in it. So I went ahead and

tried to figure out the reason for same and found that after patch,

Bgwriter started flushing buffers which were required by backends

and reason was that *nextVictimBuffer* was not getting updated

properly while we are running clock sweep kind of logic (decrement

the usage count when number of buffers on freelist fall below low

threshhold value) in Bgwriter. In HEAD, I noticed that at default

settings, BGwriter was not at all flushing any buffers which is at least

better than what my patch was doing (flushing buffers required by

backend).

So I tried to fix the issue by updating *nextVictimBuffer* in new

BGWriter logic and results are positive.

sbe - scalable buffer eviction

Select only Data

Client count/TPS	64	128
Un-patched	45232	17310
sbe_v3	111468	114521
sbe_v4	153137	160752

TPC-B

Client count/TPS	64	128
Un-patched	825	784
sbe_v4	814	845

For Select Data, I am quite confident that it will improve if we introduce

nextVictimBuffer increments in BGwriter and rather it scales much better

with that change, however for TPC-B, I am getting fluctuation in data,

so not sure it has eliminated the problem. The main difference is that in

HEAD, BGwriter never increments nextVictimBuffer during syncing the

buffers, it just notes down the current setting before start and then

proceeds sequentially.

I think it will be good if we can have a new process for moving buffers to

free list due to below reasons:

a. while trying to move buffers to freelist, it should not block due

to in between write activity.

b. The writer should not increment nextVictimBuffer and maintain

the current logic.

One significant change in this version of patch is to use a separate

spin lock to protect nextVictimBuffer rather than using BufFreelistLock.

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

scalable_buffer_eviction_v4.patch

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

05 August 2014, 15:52:01

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffer's in free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakesup and begin
adding buffer's to freelist until it reaches high threshold and then
again goes back to sleep.

This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync(). Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere. Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed. I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting. Likewise, if you've rendered GUCs or statistics counters removed, you need to rip them out, so that the scope of the changes you've made is clear to reviewers.

A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least one behavior that some people (at least, me) will care about, which is the guarantee that the background writer will scan the entire buffer pool at least every couple of minutes. This is important because it guarantees that dirty data doesn't sit in memory forever. When the system becomes busy again after a long idle period, users will expect the system to have used the idle time to flush dirty buffers to disk. This also improves data recovery prospects if, for example, somebody loses their pg_xlog directory - there may be dirty buffers whose contents are lost, of course, but they won't be months old.

b. New stats for number of buffers on freelist has been added, some
old one's like maxwritten_clean can be removed as new logic for
syncing buffers and moving them to free list doesn't use them.
However I think it's better to remove them once the new logic is
accepted. Added some new logs for info related to free list under
BGW_DEBUG

If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist. That makes no sense. I think what you should be counting is the number of allocations that are being satisfied from the free-list. Then, by comparing the rate at which that value is incrementing to the rate at which buffers_alloc is incrementing, somebody can figure out what percentage of allocations are requiring a clock-sweep run. Actually, I think it's better to flip it around: count the number of allocations that require an individual backend to run the clock sweep (vs. being satisfied from the free-list); call it, say, buffers_backend_clocksweep. We can then try to tune the patch to make that number as small as possible under varying workloads.

c. Used the already existing bgwriterLatch in BufferStrategyControl to
wake bgwriter when number of buffer's in freelist drops below
threshold.

Seems like a good idea.

d. Autotune the low and high threshold for freelist for various
configurations. Generally if keep small number (200~2000) of buffers
always available on freelist, then even for high shared buffers
like 15GB, it appears to be sufficient. However when the value
of shared buffer's is less, then we need much smaller number. I
think we can provide these as config knobs for user as well, but for
now based on LWLOCK_STATS result, I have chosen some hard
coded values for low and high threshold values for freelist.
Values for low and high threshold have been decided based on total
number of shared buffers, basically I have divided them into 5
categories (16~100, 100~1000, 1000~10000, 10000~100000,
100000 and above) and then ran tests(read-only pgbench) for various
configurations falling under these categories. The reason for keeping
lesser categories for larger shared buffers is that if there are small
number (200~2000) of buffers available on free list, then it seems to
be sufficient for quite high loads, however as the total number of shared
buffer's decreases we need to be more careful as if we keep the number as
too low then it will lead to more clock sweep by backends (which means
freelist lock contention) and if we keep number higher bgwriter will evict
many useful buffers. Results based on LWLOCK_STATS is at end of mail.

I think we need to come up with some kind of formula here rather than just a list of hard-coded constants. And it definitely needs some comments explaining the logic behind the choices.

Aside from those specific remarks, I think the elephant in the room is the question of whether it really makes sense to have one process which is responsible both for populating the free list and for writing buffers to disk. One problem, which I alluded to above under point (1), is that we might sometimes want to ensure that dirty buffers are written out to disk without decrementing usage counts or adding anything to the free list. This is a potentially solvable problem, though, because we can figure out the number of buffers that we need to scan for freelist population and the number that we need to scan for minimum buffer pool cleaning (one cycle every 2 minutes). Once we've met the first goal, any further buffers we run into under the second goal get cleaned if appropriate but their usage counts don't get pushed down nor do they get added to the freelist. Once we meet the second goal, we can go back to sleep.

But the other problem, which I think is likely unsolvable, is that writing a dirty page can take a long time on a busy system (multiple seconds) and the freelist can be emptied much, much quicker than that (milliseconds). Although your benchmark results show great speed-ups on read-only workloads, we're not really going to get the benefit consistently on read-write workloads -- unless of course the background writer fails to actually write anything, which should be viewed as a bug, not a feature -- because the freelist will often be empty while the background writer is blocked on I/O.

I'm wondering if it would be a whole lot simpler and better to introduce a new background process, maybe with a name like bgreclaim. That process wouldn't write dirty buffers. Instead, it would just run the clock sweep (i.e. the last loop inside StrategyGetBuffer) and put the buffers onto the free list. Then, we could leave the bgwriter logic more or less intact. It certainly needs improvement, but that could be another patch.

Incidentally, while I generally think your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need significant cleanup. Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things that used to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not OK. I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save the values of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity. Release the spinlock. Then, set the bgwriterLatch if needed. In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock. Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.

Under this locking strategy, StrategyNotifyBgWriter would use freelist_lck. Right now, the patch removes the only caller, and should therefore remove the function as well, but if we go with the new-process idea listed above that part would get reverted, and then you'd need to make it use the correct spinlock. You should also go through this patch and remove all the commented-out bits and pieces that you haven't cleaned up; those are distracting and unhelpful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

06 August 2014, 10:12:17

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync(). Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere. Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed. I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting. Likewise, if you've rendered GUCs or statistics counters removed, you need to rip them out, so that the scope of the changes you've made is clear to reviewers.

I have kept it just for the reason that if the basic approach is

sounds reasonable/accepted, then I will clean it up. Sorry for

the inconvenience, I didn't realized that it can be annoying for

reviewer, I will remove all such code from patch in next version.

>
> A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least one behavior that some people (at least, me) will care about, which is the guarantee that the background writer will scan the entire buffer pool at least every couple of minutes.

Okay, I will take care of this based on the conclusion of

the other points in this mail.

>This is important because it guarantees that dirty data doesn't sit in memory forever. When the system becomes busy again after a long idle period, users will expect the system to have used the idle time to flush dirty buffers to disk. This also improves data recovery prospects if, for example, somebody loses their pg_xlog directory - there may be dirty buffers whose contents are lost, of course, but they won't be months old.

>
>> b. New stats for number of buffers on freelist has been added, some
>> old one's like maxwritten_clean can be removed as new logic for
>> syncing buffers and moving them to free list doesn't use them.
>> However I think it's better to remove them once the new logic is
>> accepted. Added some new logs for info related to free list under
>> BGW_DEBUG
>
>
> If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist. That makes no sense.

I think using 'number of buffers currently on the freelist' and

'number of recently allocated buffers' for consecutive cycles,

we can figure out approximately how many buffer allocations

needs clock sweep assuming low and high threshold water

marks are fixed. However there can be cases where it is not

easy to estimate that number.

> I think what you should be counting is the number of allocations that are being satisfied from the free-list. Then, by comparing the rate at which that value is incrementing to the rate at which buffers_alloc is incrementing, somebody can figure out what percentage of allocations are requiring a clock-sweep run. Actually, I think it's better to flip it around: count the number of allocations that require an individual backend to run the clock sweep (vs. being satisfied from the free-list); call it, say, buffers_backend_clocksweep. We can then try to tune the patch to make that number as small as possible under varying workloads.

This can give us clear idea to tune the patch, however we need

to maintain 3 counters for it in code(recent_alloc (needed for

current bgwriter logic) and other 2 suggested by you). Do you

want to retain such counters in code or it's for kind of debug info

for patch?

>
>>
>> d. Autotune the low and high threshold for freelist for various
>> configurations.
>
> I think we need to come up with some kind of formula here rather than just a list of hard-coded constants.

That was my initial intention as well and I have tried based

on number of shared buffers like keeping threshold values as

percentage of shared buffers but nothing could satisfy different

kind of workloads. The current values I have choosen are based

on experiments for various workloads at different thresholds. I have

shown the lwlock_stats data for various loads based on current

thresholds upthread. Another way could be to make them as config

knobs and use the values as given by user incase it is provided by

user else go with fixed values.

There are other instances in code as well (one of them I remember

offhand is in pglz_compress) where we use fixed values based on

different sizes.

>And it definitely needs some comments explaining the logic behind the choices.

Agreed, I shall improve them in next version of patch.

>
> Aside from those specific remarks, I think the elephant in the room is the question of whether it really makes sense to have one process which is responsible both for populating the free list and for writing buffers to disk. One problem, which I alluded to above under point (1), is that we might sometimes want to ensure that dirty buffers are written out to disk without decrementing usage counts or adding anything to the free list. This is a potentially solvable problem, though, because we can figure out the number of buffers that we need to scan for freelist population and the number that we need to scan for minimum buffer pool cleaning (one cycle every 2 minutes). Once we've met the first goal, any further buffers we run into under the second goal get cleaned if appropriate but their usage counts don't get pushed down nor do they get added to the freelist. Once we meet the second goal, we can go back to sleep.
>
> But the other problem, which I think is likely unsolvable, is that writing a dirty page can take a long time on a busy system (multiple seconds) and the freelist can be emptied much, much quicker than that (milliseconds). Although your benchmark results show great speed-ups on read-only workloads, we're not really going to get the benefit consistently on read-write workloads -- unless of course the background writer fails to actually write anything, which should be viewed as a bug, not a feature -- because the freelist will often be empty while the background writer is blocked on I/O.
>
> I'm wondering if it would be a whole lot simpler and better to introduce a new background process, maybe with a name like bgreclaim.

That will certainly help in retaining the current behaviour of

bgwriter and make the idea cleaner. I will modify the patch

to have a new background process unless somebody thinks

otherwise.

>That process wouldn't write dirty buffers.

If we go with this approach, one thing which we need to decide

is what to do incase buf which has usage_count as zero is *dirty*,

as I don't think it is good idea to put it in freelist. Few options to

handle such a case are:

a. Skip such a buffer; the downside is if we have to skip lot

of buffers due to this reason then having separate process

such as bgreclaim will be less advantageous.

b. Skip the buffer and notify bgwriter to flush buffers, now this

notification can be sent either as soon as we encounter one

such buffer or after few such buffers (incase of few, we need to decide

some useful number). In this option, there is a chance that bgwriter

decide not to flush buffer/'s which ideally should not happen because

I think bgwriter considers the number of recent allocations for

performing scan to flush dirt buffers.

c. Have some mechanism where bgreclaim can notify bgwriter

to flush some specific buffers. I think if we have such a mechanism

that can be later even used by backends if required.

d. Keep the logic as per current patch and improve such that it can

retain the behaviour of one cycle per two minutes as suggested above

by you on the basis that in anycase it is better than the current code.

I don't think option (d) is best way to handle this scenario,however I

kept it incase nothing else sounds reasonable. Option (c) might have

lot of work which I am not sure is justifiable to handle the current scenario,

though it can be useful for some other things. Option (a) should be okay

for most cases, but I think option (b) would be better.

> Instead, it would just run the clock sweep (i.e. the last loop inside StrategyGetBuffer) and put the buffers onto the free list.

Don't we need to do more than just last loop inside StrategyGetBuffer(),

as clock sweep in strategy get buffer is responsible for getting one

buffer with usage_count = 0 where as we need to run the loop till it

finds and moves enough such buffers so that it can populate freelist

with number of buffers equal to high water mark of freelist.

>Then, we could leave the bgwriter logic more or less intact. It certainly needs improvement, but that could be another patch.
>
> Incidentally, while I generally think your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need significant cleanup. Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things that used to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not OK.

I have kept them outside spinlock because as per patch the only

callsite for setting StrategyControl->bgwriterLatch is StrategyGetBuffer()

and StrategyControl->numBufferAllocs is used just for statistics purpose

(which I thought might be okay even if it is not accurate) whereas without

patch it is used by bgwriter for purpose other than stats as well.

However it certainly needs to be protected for separate bgreclaim process

idea or for retaining current bgwriter behaviour.

> I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else.
>
> Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save the values of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity. Release the spinlock. Then, set the bgwriterLatch if needed. In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock. Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.

I shall take care of doing this way in next version of patch.

>
> Under this locking strategy, StrategyNotifyBgWriter would use freelist_lck. Right now, the patch removes the only caller, and should therefore remove the function as well, but if we go with the new-process idea listed above that part would get reverted, and then you'd need to make it use the correct spinlock. You should also go through this patch and remove all the commented-out bits and pieces that you haven't cleaned up; those are distracting and unhelpful.

Sure.

Thank you for review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

06 August 2014, 15:15:08

On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> If I'm reading this right, the new statistic is an incrementing counter
>> where, every time you update it, you add the number of buffers currently on
>> the freelist.  That makes no sense.
>
> I think using 'number of buffers currently on the freelist' and
> 'number of recently allocated buffers' for consecutive cycles,
> we can figure out approximately how many buffer allocations
> needs clock sweep assuming low and high threshold water
> marks are fixed.  However there can be cases where it is not
> easy to estimate that number.

Counters should be design in such a way that you can read it, and then
read it again later, and make sense of it - you should not need to
read the counter on *consecutive* cycles to interpret it.

>> I think what you should be counting is the number of allocations that are
>> being satisfied from the free-list.  Then, by comparing the rate at which
>> that value is incrementing to the rate at which buffers_alloc is
>> incrementing, somebody can figure out what percentage of allocations are
>> requiring a clock-sweep run.  Actually, I think it's better to flip it
>> around: count the number of allocations that require an individual backend
>> to run the clock sweep (vs. being satisfied from the free-list); call it,
>> say, buffers_backend_clocksweep.  We can then try to tune the patch to make
>> that number as small as possible under varying workloads.
>
> This can give us clear idea to tune the patch, however we need
> to maintain 3 counters for it in code(recent_alloc (needed for
> current bgwriter logic) and other 2 suggested by you).  Do you
> want to retain such counters in code or it's for kind of debug info
> for patch?

I only mean to propose one new counter, and I'd imagine including that
in the final patch.  We already have a counter of total buffer
allocations; that's buffers_alloc.  I'm proposing to add an additional
counter for the number of those allocations not satisfied from the
free list, with a name like buffers_alloc_clocksweep (I said
buffers_backend_clocksweep above, but that's probably not best, as the
existing buffers_backend counts buffer *writes*, not allocations).  I
think we would definitely want to retain this counter in the final
patch, as an additional column in pg_stat_bgwriter.

>>> d.  Autotune the low and high threshold for freelist for various
>>>      configurations.
>>
>> I think we need to come up with some kind of formula here rather than just
>> a list of hard-coded constants.
>
> That was my initial intention as well and I have tried based
> on number of shared buffers like keeping threshold values as
> percentage of shared buffers but nothing could satisfy different
> kind of workloads.  The current values I have choosen are based
> on experiments for various workloads at different thresholds.  I have
> shown the lwlock_stats data for various loads based on current
> thresholds upthread.  Another way could be to make them as config
> knobs and use the values as given by user incase it is provided by
> user else go with fixed values.

How did you go about determining the optimal value for a particular workload?

When the list is kept short, it's less likely that a value on the list
will be referenced or dirtied again before the page is actually
recycled.  That's clearly good.  But when the list is long, it's less
likely to become completely empty and thereby force individual
backends to run the clock-sweep.  My suspicion is that, when the
number of buffers is small, the impact of the list being too short
isn't likely to be very significant, because running the clock-sweep
isn't all that expensive anyway - even if you have to scan through the
entire buffer pool multiple times, there aren't that many buffers.
But when the number of buffers is large, those repeated scans can
cause a major performance hit, so having an adequate pool of free
buffers becomes much more important.

I think your list of high-watermarks is far too generous for low
buffer counts.  With more than 100k shared buffers, you've got a
high-watermark of 2k buffers, which means that 2% or less of the
buffers will be on the freelist, which seems a little on the high side
to me, but probably in the ballpark of what we should be aiming for.
But at 10001 shared buffers, you can have 1000 of them on the
freelist, which is 10% of the buffer pool; that seems high.  At 101
shared buffers, 75% of the buffers in the system can be on the
freelist; that seems ridiculous.  The chances of a buffer still being
unused by the time it reaches the head of the freelist seem very
small.

Based on your existing list of thresholds, and taking the above into
account, I'd suggest something like this: let the high-watermark for
the freelist be 0.5% of the total number of buffers, with a maximum of
2000 and a minimum of 5.  Let the low-watermark be 20% of the
high-watermark.  That might not be best, but I think some kind of
formula like that can likely be made to work.  I would suggest
focusing your testing on configurations with *large* settings for
shared_buffers, say 1-64GB, rather than small configurations.  Anyone
who cares greatly about performance isn't going to be running with
only 8MB of shared_buffers anyway.  Arguably we shouldn't even run the
reclaim process on very small configurations; I think there should
probably a GUC (PGC_SIGHUP) to control whether it gets launched.

I think it would be a good idea to analyze how frequently the reclaim
process gets woken up.  In the worst case, this happens once per (high
watermark - low watermark) allocations; that is, the system reaches
the low watermark and then does no further allocations until the
reclaim process brings the freelist back up to the high watermark.
But if more allocations occur between the time the reclaim process is
woken and the time it reaches the high watermark, then it should run
for longer, until the high watermark is reached.  At least for
debugging purposes, I think it would be useful to have a counter of
reclaim wakeups.  I'm not sure whether that's worth including in the
final patch, but it might be.

> That will certainly help in retaining the current behaviour of
> bgwriter and make the idea cleaner.  I will modify the patch
> to have a new background process unless somebody thinks
> otherwise.
>
> If we go with this approach, one thing which we need to decide
> is what to do incase buf which has usage_count as zero is *dirty*,
> as I don't think it is good idea to put it in freelist.

I thought a bit about this yesterday.  I think the problem is that we
might be in a situation where buffers are being dirtied faster than
they can be cleaned. In that case, if we only put clean buffers on the
freelist, then every backend in the system will be fighting over the
ever-dwindling supply of clean buffers until, in the worst case,
there's maybe only 1 clean buffer which is getting evicted repeatedly
at top speed - or maybe even no clean buffers, and the reclaim process
just spins in an infinite loop looking for clean buffers that aren't
there.

To put that another way, the rate at which buffers are being dirtied
can't exceed the rate at which they are being cleaned forever.
Eventually, somebody is going to have to wait.  Having the backends
wait by being forced to write some dirty buffers does not seem like a
bad way to accomplish that.  So I favor just putting the buffers on
freelist without regard to whether they are clean or dirty.  If this
turns out not to work well we can look at other options (probably some
variant of (b) from your list).

>> Instead, it would just run the clock sweep (i.e. the last loop inside
>> StrategyGetBuffer) and put the buffers onto the free list.
>
> Don't we need to do more than just last loop inside StrategyGetBuffer(),
> as clock sweep in strategy get buffer is responsible for getting one
> buffer with usage_count = 0 where as we need to run the loop till it
> finds and moves enough such buffers so that it can populate freelist
> with number of buffers equal to high water mark of freelist.

Yeah, that's what I meant.  Of course, it should add each buffer to
the freelist individually, not batch them up and add them all at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

12 August 2014, 21:02:42

On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
> On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > This essentially removes BgWriterDelay, but it's still mentioned in
> BgBufferSync().  Looking further, I see that with the patch applied,
> BgBufferSync() is still present in the source code but is no longer called
> from anywhere.  Please don't submit patches that render things unused
> without actually removing them; it makes it much harder to see what you've
> changed.  I realize you probably left it that way for testing purposes, but
> you need to clean such things up before submitting.  Likewise, if you've
> rendered GUCs or statistics counters removed, you need to rip them out, so
> that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email amost unreadable because it misses quoting
signs after linebreaks in quoted content.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

13 August 2014, 04:22:05

On Wed, Aug 13, 2014 at 2:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
> > On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > This essentially removes BgWriterDelay, but it's still mentioned in
> > BgBufferSync(). Looking further, I see that with the patch applied,
> > BgBufferSync() is still present in the source code but is no longer called
> > from anywhere. Please don't submit patches that render things unused
> > without actually removing them; it makes it much harder to see what you've
> > changed. I realize you probably left it that way for testing purposes, but
> > you need to clean such things up before submitting. Likewise, if you've
> > rendered GUCs or statistics counters removed, you need to rip them out, so
> > that the scope of the changes you've made is clear to reviewers.
>
> FWIW, I found this email amost unreadable because it misses quoting
> signs after linebreaks in quoted content.

I think I have done something wrong while replying to Robert's

mail, the main point in that mail was trying to see if there is any

major problem incase we have separate process (bgreclaim) to

populate freelist. One thing which I thought could be problematic

is to put a buf in freelist which has usage_count as zero and is *dirty*.

Please do let me know if you want clarification for something in

particular.

Overall, the main changes required in patch as per above feedback

are:

1. add an additional counter for the number of those

allocations not satisfied from the free list, with a

name like buffers_alloc_clocksweep.

2. Autotune the low and high threshold values for buffers

in freelist. In the patch, I have kept them as hard-coded

values.

3. For populating freelist, have a separate process (bgreclaim)

instead of doing it by bgwriter.

There are other things also which I need to take care as per

feedback like some change in locking strategy and code.

With Regards,

Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

13 August 2014, 10:55:48

On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
> Overall, the main changes required in patch as per above feedback
> are:
> 1. add an additional counter for the number of those
> allocations not satisfied from the free list, with a
> name like buffers_alloc_clocksweep.
> 2. Autotune the low and high threshold values for buffers
> in freelist. In the patch, I have kept them as hard-coded
> values.
> 3. For populating freelist, have a separate process (bgreclaim)
> instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go to be honest. Seems
like a huge bandaid to me.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

14 August 2014, 03:26:55

On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
> > Overall, the main changes required in patch as per above feedback
> > are:
> > 1. add an additional counter for the number of those
> > allocations not satisfied from the free list, with a
> > name like buffers_alloc_clocksweep.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. In the patch, I have kept them as hard-coded
> > values.
> > 3. For populating freelist, have a separate process (bgreclaim)
> > instead of doing it by bgwriter.
>
> I'm not convinced that 3) is the right way to go to be honest. Seems
> like a huge bandaid to me.

Doing both (populating freelist and flushing dirty buffers) via bgwriter

isn't the best way either because it might not be able to perform

both the jobs as per need.

One example is it could take much longer time to flush a dirty buffer

than to move it into free list, so if there are few buffers which we need

to flush, then I think task of maintaining buffers in freelist will get hit

even though there are buffers in list which can be moved to

free list(non-dirty buffers).

Another is maintaining the current behaviour of bgwriter which is to scan

the entire buffer pool every few mins (assuming default configuration).

We can attempt to solve this problem as suggested by Robert upthread

but I am not completely sure if that can guarantee that the current

behaviour will be retained as it is.

I am not telling that having a separate process won't have any issues,

but I think we can tackle them without changing or complicating current

bgwriter behaviour.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

26 August 2014, 14:53:11

<div dir="ltr">On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br />><br />> Incidentally, while I
generallythink your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need
significantcleanup.  Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite
replaceBufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things
thatused to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch)without any locking.  That's not OK.  I think you should get rid of
BufFreelistLockcompletely and just decide that freelist_lck will protect everything the freeNext links, plus everything
inStrategyControl except for nextVictimBuffer.  victimbuf_lck will protect nextVictimBuffer and nothing else.<br />
><br/>> Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today.
 IncrementStrategyControl->numBufferAllocs; save the values of StrategyControl->bgwriterLatch; pop a buffer off
thefreelist if there is one, saving its identity.  Release the spinlock.  Then, set the bgwriterLatch if needed.  In
thefirst loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage
countand return it if not, holding the buffer header lock.  Otherwise, reacquire the spinlock just long enough to pop a
newpotential victim and then loop around.<br /> ><br /><br />Today, while working on updating the patch to improve
locking<br/>I found that as now we are going to have a new process, we need<br />a separate latch in StrategyControl to
wakeupthat process.<br />Another point is I think it will be better to protect<br /> StrategyControl->completePasses
withvictimbuf_lck rather than<br />freelist_lck, as when we are going to update it we will already be<br />holding the
victimbuf_lckand it doesn't make much sense to release<br />the victimbuf_lck and reacquire freelist_lck to update
it.<br/><br />I thought it is better to mention about above points so that if you have<br />any different thoughts
aboutit, then it is better to discuss them now<br />rather than after I take performance data with this locking
protocol.<br/><br />With Regards,<br />Amit Kapila.<br />EnterpriseDB: <a
href="http://www.enterprisedb.com">http://www.enterprisedb.com</a></div>

Re: Scaling shared buffer eviction

From

Tom Lane

Date:

26 August 2014, 15:11:03

Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think you should get rid of BufFreelistLock completely and just
>>> decide that freelist_lck will protect everything the freeNext links, plus
>>> everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
>>> will protect nextVictimBuffer and nothing else.

> Another point is I think it will be better to protect
> StrategyControl->completePasses with victimbuf_lck rather than
> freelist_lck, as when we are going to update it we will already be
> holding the victimbuf_lck and it doesn't make much sense to release
> the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.

A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently.  For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.

(And yes, I realize these issues already affect the freelist.  Perhaps
that's part of the reason we have performance issues with it.)
        regards, tom lane

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

26 August 2014, 15:35:23

On Tue, Aug 26, 2014 at 8:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Another point is I think it will be better to protect
> > StrategyControl->completePasses with victimbuf_lck rather than
> > freelist_lck, as when we are going to update it we will already be
> > holding the victimbuf_lck and it doesn't make much sense to release
> > the victimbuf_lck and reacquire freelist_lck to update it.
>
> I'm rather concerned by this cavalier assumption that we can protect
> fields a,b,c with one lock and fields x,y,z in the same struct with some
> other lock.

In some cases, it could be beneficial especially when a,b,c are

going to be more frequently accessed as compare to x,y,z.

> A minimum requirement for that to work safely at all is that the fields
> are of atomically fetchable/storable widths; which might be okay here
> but it's a restriction that bears thinking about (and documenting).
>
> But quite aside from safety, the fields are almost certainly going to
> be in the same cache line which means contention between processes that
> are trying to fetch or store them concurrently.

I think patch will reduce the contention for some of such variables

(which are accessed during clock sweep) as it will minimize the need

to perform clock sweep by backends.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

27 August 2014, 15:02:59

On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> I think you should get rid of BufFreelistLock completely and just
>>>> decide that freelist_lck will protect everything the freeNext links, plus
>>>> everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
>>>> will protect nextVictimBuffer and nothing else.
>
>> Another point is I think it will be better to protect
>> StrategyControl->completePasses with victimbuf_lck rather than
>> freelist_lck, as when we are going to update it we will already be
>> holding the victimbuf_lck and it doesn't make much sense to release
>> the victimbuf_lck and reacquire freelist_lck to update it.
>
> I'm rather concerned by this cavalier assumption that we can protect
> fields a,b,c with one lock and fields x,y,z in the same struct with some
> other lock.
>
> A minimum requirement for that to work safely at all is that the fields
> are of atomically fetchable/storable widths; which might be okay here
> but it's a restriction that bears thinking about (and documenting).
>
> But quite aside from safety, the fields are almost certainly going to
> be in the same cache line which means contention between processes that
> are trying to fetch or store them concurrently.  For a patch whose sole
> excuse for existence is to improve performance, that should be a very
> scary concern.
>
> (And yes, I realize these issues already affect the freelist.  Perhaps
> that's part of the reason we have performance issues with it.)

False sharing is certainly a concern that has crossed my mine while
looking at Amit's work, but the performance numbers he's posted
upthread are stellar.  Maybe we can squeeze some additional
performance out of this by padding out the cache lines, but it's
probably minor compared to the gains he's already seeing.  I think we
should focus on trying to lock in those gains, and then we can
consider what further things we may want to do after that.  If it
turns out that structure-padding is among those things, that's easy
enough to do as a separate patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

27 August 2014, 15:04:50

On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Today, while working on updating the patch to improve locking
> I found that as now we are going to have a new process, we need
> a separate latch in StrategyControl to wakeup that process.
> Another point is I think it will be better to protect
> StrategyControl->completePasses with victimbuf_lck rather than
> freelist_lck, as when we are going to update it we will already be
> holding the victimbuf_lck and it doesn't make much sense to release
> the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable.  I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer.  I don't see much point
in fine-tuning the locking regimen until that's done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

28 August 2014, 11:11:52

On Wed, Aug 27, 2014 at 8:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Today, while working on updating the patch to improve locking
> I found that as now we are going to have a new process, we need
> a separate latch in StrategyControl to wakeup that process.
> Another point is I think it will be better to protect
> StrategyControl->completePasses with victimbuf_lck rather than
> freelist_lck, as when we are going to update it we will already be
> holding the victimbuf_lck and it doesn't make much sense to release
> the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable. I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer. I don't see much point
in fine-tuning the locking regimen until that's done.

I have updated the patch to address the feedback. Main changes are:

1. For populating freelist, have a separate process (bgreclaimer)

instead of doing it by bgwriter.

2. Autotune the low and high threshold values for buffers

in freelist. I have used the formula as suggested by you upthread.

3. Cleanup of locking regimen as discussed upthread (completely

eliminated BufFreelist Lock).

4. Improved comments and general code cleanup.

I have not yet added statistics (buffers_backend_clocksweep) as

for that we need to add one more variable in BufferStrategyControl

structure where I have already added few variables for this patch.

I think it is important to have such a stat available via

pg_stat_bgwriter, but not sure if it is worth to make the structure

bit more bulky.

Another minor point is about changes in lwlock.h

lwlock.h

* if you remove a lock, consider leaving a gap in the numbering

* sequence for the benefit of DTrace and other external debugging

* scripts.

As I have removed BufFreelist lock, I have adjusted the numbering

as well in lwlock.h. There is a meesage on top of lock definitions

which suggest to leave gap if we remove any lock, however I was not

sure whether this case (removing the first element) can effect anything,

so for now, I have adjusted the numbering.

I have yet to collect data under varying loads, however I have

collected performance data for 8GB shared buffers which shows

reasonably good performance and scalability.

I think the main part left for this patch is more data for various loads

which I will share in next few days, however I think patch is ready for

next round of review, so I will mark it as Needs Review.

Performance Data:

-------------------------------

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads

RAM = 64GB

Database Locale =C

checkpoint_segments=256

checkpoint_timeout =15min

shared_buffers=8GB

scale factor = 3000

Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)

Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client Count/Patch_ver	8	16	32	64	128
HEAD	58614	107370	140717	104357	65010
Patch	60849	118701	165631	209226	213029

Note -

a. The numbers are slightly different than previously reported

numbers as earlier I was using debug mode of binaries to take

data and it seems some kind of trace was enabled on m/c.

However the improve in performance and scalability is almost

similar to previous.

b. Above data is median of 3 runs, for detailed data refer attached

document (perf_read_scalability_data_v5.ods)

CPU Usage

------------------

I have observed that CPU usage for new process (reclaimer) is

between 5~9%.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

02 September 2014, 20:15:35

On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have updated the patch to address the feedback.  Main changes are:
>
> 1. For populating freelist, have a separate process (bgreclaimer)
> instead of doing it by bgwriter.
> 2. Autotune the low and high threshold values for buffers
> in freelist. I have used the formula as suggested by you upthread.
> 3. Cleanup of locking regimen as discussed upthread (completely
> eliminated BufFreelist Lock).
> 4. Improved comments and general code cleanup.

Overall this looks quite promising to me.

I had thought to call the new process just "bgreclaim" rather than
"bgreclaimer", but perhaps your name is better after all.  At least,
it matches what we do elsewhere.  But I don't care for the use
"Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
else "bgreclaimer".

This is unclear:

+buffers for replacement.  Earlier to protect freelist, we use LWLOCK as that
+is needed to perform clock sweep which is a longer operation, however now we
+are using two spinklocks freelist_lck and victimbuf_lck to perform operations
+on freelist and run clock sweep respectively.

I would drop the discussion of what was done before and say something
like this: The data structures relating to buffer eviction are
protected by two spinlocks.  freelist_lck protects the freelist and
related data structures, while victimbuf_lck protects information
related to the current clock sweep condition.

+always in this list.  We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background process
+reclaimer.  The list is singly-linked using fields in the

I suggest: Allocating pages from this list is much cheaper than
running the "clock sweep" algorithm, which may encounter many buffers
that are poor candidates for eviction before finding a good candidate.
Therefore, we have a background process called bgreclaimer which works
to keep this list populated.

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with "bgreclaimer".

+and move the the unpinned and zero usage count buffers to freelist.  It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.

s/keep/keeps/
s/become equal/reaches the/
s/high threshold/high water mark/
s/of freelist//

Please change the other places that say threshold to use the "water
mark" terminology.

+                if (StrategyMoveBufferToFreeListEnd (bufHdr))

Extra space.

+                 * buffers in consecutive cycles.

s/consecutive/later/

+    /* Execute the LRU scan */

s/LRU scan/clock sweep/ ?

+    while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented.  Shouldn't we loop and do the whole
thing over once we reach the high watermark, only stopping when
StrategySyncStartAndEnd() says num_to_free is 0?

+        /* choose next victim buffer to clean. */

This process doesn't clean buffers; it puts them on the freelist.

+ * high threshold of freelsit), we drasticaly reduce the odds for

Two typos.

+ * of buffers in freelist fall below low threshold of freelist.

s/fall/falls/

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing.  If we merge them into a single spinlock,
does that hurt performance?  If we put them further apart, e.g. by
moving the freelist_lck to the start of the structure, followed by the
latches, and leaving victimbuf_lck where it is, does that help
performance?

+            /*
+             * If the buffer is pinned or has a nonzero usage_count,
we cannot use
+             * it; discard it and retry.  (This can only happen if VACUUM put a
+             * valid buffer in the freelist and then someone else
used it before
+             * we got to it.  It's probably impossible altogether as
of 8.3, but
+             * we'd better check anyway.)
+             */
+

This comment is clearly obsolete.

> I have not yet added statistics (buffers_backend_clocksweep) as
> for that we need to add one more variable in BufferStrategyControl
> structure where I have already added few variables for this patch.
> I think it is important to have such a stat available via
> pg_stat_bgwriter, but not sure if it is worth to make the structure
> bit more bulky.

I think it's worth it.

> Another minor point is about changes in lwlock.h
> lwlock.h
> * if you remove a lock, consider leaving a gap in the numbering
> * sequence for the benefit of DTrace and other external debugging
> * scripts.
>
> As I have removed BufFreelist lock, I have adjusted the numbering
> as well in lwlock.h.  There is a meesage on top of lock definitions
> which suggest to leave gap if we remove any lock, however I was not
> sure whether this case (removing the first element) can effect anything,
> so for now, I have adjusted the numbering.

Let's leave slot 0 unused, instead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

03 September 2014, 04:15:33

On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have yet to collect data under varying loads, however I have
> collected performance data for 8GB shared buffers which shows
> reasonably good performance and scalability.
>
> I think the main part left for this patch is more data for various loads
> which I will share in next few days, however I think patch is ready for
> next round of review, so I will mark it as Needs Review.

I have collected more data with the patch. I understand that you

have given more review comments due to which patch require

changes, however I think it will not effect the performance data

to a great extent and I have anyway taken the data, so sharing the

same.

> Performance Data:
> -------------------------------
>
> Configuration and Db Details
> IBM POWER-7 16 cores, 64 hardware threads
> RAM = 64GB
> Database Locale =C
> checkpoint_segments=256
> checkpoint_timeout =15min
> scale factor = 3000
> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> Duration of each individual run = 5mins
>
> All the data is in tps and taken using pgbench read-only load

Common configuration remains same as above.

Shared_Buffers = 500MB

Client Count/Patch_Ver	8	16	32	64	128
HEAD	56248	100112	121341	81128	56552
Patch	59389	112483	157034	185740	166725

Shared_Buffers = 1GB

Client Count/Patch_Ver	8	16	32	64	128
HEAD	56401	102557	121643	81686	57091
Patch	59361	114813	157528	188209	167752

Shared_Buffers = 14GB

Client Count/Patch_Ver	8	16	32	64	128
HEAD	60059	110582	152051	130718	97014
Patch	61462	117377	169767	225803	229083

Shared_Buffers = 15GB

Client Count/Patch_Ver	8	16	32	64	128
HEAD	60005	112928	153650	135203	36343
Patch	61345	115569	168767	226150	36985

Observations

---------------------

1. Performance improvement is upto 2~3 times for higher client

counts (64, 128).

2. For lower client count (8), we can see 2~5 % performance

improvement.

3. Overall, this improves the read scalability.

4. For lower number of shared buffers, we see that there is a minor

dip in tps even after patch (it might be that we can improve it by

tuning higher water mark for the number of buffers on freelist, I will

try this by varying high water mark).

5. For larger shared buffers (15GB), we can see that there is still a

dip at large client count, although situation is not bad as compare to

HEAD. The reason is that at such high shared buffers and client count,

I/O starts happening because all the data can't be contained in RAM.

I will try to take some data for tpc-b load as well. Kindly let me know

if you want to see data for some other configuration.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

03 September 2014, 04:22:40

On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have yet to collect data under varying loads, however I have
> > collected performance data for 8GB shared buffers which shows
> > reasonably good performance and scalability.
> >
> > I think the main part left for this patch is more data for various loads
> > which I will share in next few days, however I think patch is ready for
> > next round of review, so I will mark it as Needs Review.
>
> I have collected more data with the patch. I understand that you
> have given more review comments due to which patch require
> changes, however I think it will not effect the performance data
> to a great extent and I have anyway taken the data, so sharing the
> same.
>
>
> > Performance Data:
> > -------------------------------
> >
> > Configuration and Db Details
> > IBM POWER-7 16 cores, 64 hardware threads
> > RAM = 64GB
> > Database Locale =C
> > checkpoint_segments=256
> > checkpoint_timeout =15min
> > scale factor = 3000
> > Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> > Duration of each individual run = 5mins
> >
> > All the data is in tps and taken using pgbench read-only load
>
> Common configuration remains same as above.

Forgot to mention that data is a median of 3 runs and attached

sheet contains data for individual runs.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

perf_read_scalability_data_v5.ods

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

03 September 2014, 11:27:49

On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have updated the patch to address the feedback. Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".

I don't mind changing it, but currently used title is based on similar

title "Background Writer's Processing". It is used in previous

paragraph. Is there a reason to title this differently?

> +The background reclaimer is designed to move buffers to freelist that are
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".

Again what I have used is matching with BgWriter's explanation. I thought
it would be better if wording used is similar.

>
> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented.

It is based on the idea what bgwriter does for num_to_scan and
calling it once has advantage that we need to take freelist_lck
just once.

> Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?

Do you mean to say that for high water mark check, we should
always refer StrategySyncStartAndEnd() rather than getting the
value in begining?

Are you thinking that somebody else would have already put some
buffers onto freelist which would change initially identified high water

mark? I think that can be done only during very few and less used

operations. Do you think for that we start referring

StrategySyncStartAndEnd() in loop?
>
> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing. If we merge them into a single spinlock,
> does that hurt performance?

I have kept them separate so that backends searching for a buffer

in freelist doesn't contend with bgreclaimer (while doing clock sweep)

or clock sweep being done by other backends. I think it will be bit

tricky to devise a test where this can hurt, however it doesn't seem

too bad to have two separate locks in this case.

> If we put them further apart, e.g. by
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?

I can investigate.

> + /*
> + * If the buffer is pinned or has a nonzero usage_count,
> we cannot use
> + * it; discard it and retry. (This can only happen if VACUUM put a
> + * valid buffer in the freelist and then someone else
> used it before
> + * we got to it. It's probably impossible altogether as
> of 8.3, but
> + * we'd better check anyway.)
> + */
> +
>
> This comment is clearly obsolete.

Okay, but this patch hasn't changed anything w.r.t above comment,
so I haven't changed it. Do you want me to remove second part of

comment starting with "(This can only happen"?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

03 September 2014, 14:33:25

On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> +Background Reclaimer's Processing
>> +---------------------------------
>>
>> I suggest titling this section "Background Reclaim".
>
> I don't mind changing it, but currently used title is based on similar
> title "Background Writer's Processing".  It is used in previous
> paragraph.  Is there a reason to title this differently?

Oh, I didn't see that.  Seems like weird phrasing to me, but I guess
it's probably better to keep it consistent.

>> +The background reclaimer is designed to move buffers to freelist that are
>>
>> I suggest replacing the first three words of this sentence with
>> "bgreclaimer".
>
> Again what I have used is matching with BgWriter's explanation. I thought
> it would be better if wording used is similar.

OK.

>> +    while (tmp_num_to_free > 0)
>>
>> I am not sure it's a good idea for this value to be fixed at loop
>> start and then just decremented.
>
> It is based on the idea what bgwriter does for num_to_scan and
> calling it once has advantage that we need to take freelist_lck
> just once.

Right, we shouldn't call it every loop iteration.  However, consider
this scenario: there are no remaining buffers on the list and the high
watermark is 2000.  We add 2000 buffers to the list.  But by the time
we get done, other backends have already done 500 more allocations, so
now there are only 1500 buffers on the list.  If this should occur, we
should add an additional 500 buffers to the list before we consider
sleeping.  We want bgreclaimer to be able to run continuously if the
demand for buffers is high enough.

>> In freelist.c, it seems like a poor idea to have two spinlocks as
>> consecutive structure members; they'll be in the same cache line,
>> leading to false sharing.  If we merge them into a single spinlock,
>> does that hurt performance?
>
> I have kept them separate so that backends searching for a buffer
> in freelist doesn't contend with bgreclaimer (while doing clock sweep)
> or clock sweep being done by other backends.  I think it will be bit
> tricky to devise a test where this can hurt, however it doesn't seem
> too bad to have two separate locks in this case.

It's not.  But if they are in the same cache line, they will behave
almost like one lock, because the CPU will lock the entire cache line
for each atomic op.  See Tom's comments upthread.

> Okay, but this patch hasn't changed anything w.r.t above comment,
> so I haven't changed it. Do you want me to remove second part of
> comment starting with "(This can only happen"?

Right.  Clearly it can happen again once we have this patch: that's
the entire point of the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Mark Kirkwood

Date:

04 September 2014, 02:30:30

On 03/09/14 16:22, Amit Kapila wrote:
> On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>>
>>> I have yet to collect data under varying loads, however I have
>>> collected performance data for 8GB shared buffers which shows
>>> reasonably good performance and scalability.
>>>
>>> I think the main part left for this patch is more data for various loads
>>> which I will share in next few days, however I think patch is ready for
>>> next round of review, so I will mark it as Needs Review.
>>
>> I have collected more data with the patch.  I understand that you
>> have given more review comments due to which patch require
>> changes, however I think it will not effect the performance data
>> to a great extent and I have anyway taken the data, so sharing the
>> same.
>>
>>
>>> Performance Data:
>>> -------------------------------
>>>
>>> Configuration and Db Details
>>> IBM POWER-7 16 cores, 64 hardware threads
>>> RAM = 64GB
>>> Database Locale =C
>>> checkpoint_segments=256
>>> checkpoint_timeout    =15min
>>> scale factor = 3000
>>> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
>>> Duration of each individual run = 5mins
>>>
>>> All the data is in tps and taken using pgbench read-only load
>>
>> Common configuration remains same as above.
>
> Forgot to mention that data is a median of 3 runs and attached
> sheet contains data for individual runs.
>
>

Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Cheers

Mark

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

04 September 2014, 02:42:34

On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
>
> Hi Amit,
>
> Results look pretty good. Does it help in the read-write case too?

Last time I ran the tpc-b test of pgbench (results of which are

posted earlier in this thread), there doesn't seem to be any major

gain for that, however for cases where read is predominant, you

might see better gains.

I am again planing to take that data in next few days.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

04 September 2014, 11:26:04

On Wed, Sep 3, 2014 at 8:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >> + while (tmp_num_to_free > 0)
> >>
> >> I am not sure it's a good idea for this value to be fixed at loop
> >> start and then just decremented.
> >
> > It is based on the idea what bgwriter does for num_to_scan and
> > calling it once has advantage that we need to take freelist_lck
> > just once.
>
> Right, we shouldn't call it every loop iteration. However, consider
> this scenario: there are no remaining buffers on the list and the high
> watermark is 2000. We add 2000 buffers to the list. But by the time
> we get done, other backends have already done 500 more allocations, so
> now there are only 1500 buffers on the list. If this should occur, we
> should add an additional 500 buffers to the list before we consider
> sleeping. We want bgreclaimer to be able to run continuously if the
> demand for buffers is high enough.

Its not difficult to handle such cases, but it can have downside also

for the cases where demand from backends is not high.
Consider in above case if instead of 500 more allocations, it just
does 5 more allocations, then bgreclaimer will again have to go through
the list and move 5 buffers and same can happen again by the time
it moves 5 buffers. Another point to keep in mind here is that in this

loop we are reducing the usage_count of buffers as well incase we don't

find buffer with usage_count=0. OTOH if we let bgreclaimer to go for

sleep after it moves initially identified buffers, then the backend which
first finds that the buffers in freelist falls below low water mark can

wake bgreclaimer.

> >> In freelist.c, it seems like a poor idea to have two spinlocks as
> >> consecutive structure members; they'll be in the same cache line,
> >> leading to false sharing. If we merge them into a single spinlock,
> >> does that hurt performance?
> >
> > I have kept them separate so that backends searching for a buffer
> > in freelist doesn't contend with bgreclaimer (while doing clock sweep)
> > or clock sweep being done by other backends. I think it will be bit
> > tricky to devise a test where this can hurt, however it doesn't seem
> > too bad to have two separate locks in this case.
>
> It's not. But if they are in the same cache line, they will behave
> almost like one lock, because the CPU will lock the entire cache line
> for each atomic op. See Tom's comments upthread.

I think to avoid having them in same cache line, we might need to
add some padding (at least 72 bytes) as the structure size including both

the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.

I have taken performance data as well by keeping them further apart

as suggested by you upthread and by introducing padding, but the

difference in performance is less than 1.5% (on 64 and 128 client count)

which also might be due to variation of data across runs. So now to

proceed we have below options:

a. use two spinlocks as in patch, but keep them as far apart as possible.
This might not have an advantage as compare to what is used currently

in patch, but in future we can adding padding to take the advantage if

possible (currently on PPC64, it doesn't show any noticeable advantage,

however on some other m/c, it might show the advantage).

b. use only one spinlock, this can have disadvantage in certain cases
as mentioned upthread, however those might not be usual cases, so for
now we can consider them as lower priority and can choose this option.

Another point in this regard is that I have to make use of volatile
pointer to prevent code rearrangement in this case.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

04 September 2014, 12:02:16

On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Performance Data:
> -------------------------------
>
> Configuration and Db Details
> IBM POWER-7 16 cores, 64 hardware threads
> RAM = 64GB
> Database Locale =C
> checkpoint_segments=256
> checkpoint_timeout =15min
> scale factor = 3000
> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> Duration of each individual run = 5mins
>
> All the data is in tps and taken using pgbench read-only load

Common configuration remains same as above.

Shared_Buffers = 500MB
Client Count/Patch_Ver 8 16 32 64 128
HEAD 56248 100112 121341 81128 56552
Patch 59389 112483 157034 185740 166725

..

Observations
---------------------
1. Performance improvement is upto 2~3 times for higher client
counts (64, 128).
2. For lower client count (8), we can see 2~5 % performance
improvement.
3. Overall, this improves the read scalability.
4. For lower number of shared buffers, we see that there is a minor
dip in tps even after patch (it might be that we can improve it by
tuning higher water mark for the number of buffers on freelist, I will
try this by varying high water mark).

I have taken performance data by varying high and low mater marks

for lower value of shared buffers which is as below:

Shared_buffers = 500MB

Scale_factor = 3000

HM - High water mark, 0.5 means 0.5% of total shared buffers

LM - Low water mark, 20 means 20% of HM.

Client Count/Patch_Ver (Data in tps)	128
HM=0.5;LM=20	166725
HM=1;LM=20	166556
HM=2;LM=30	166463
HM=5;LM=30	166107
HM=10;LM=30	167231

Observation

--------------------

a. There is hardly any difference by varying High and Low water marks

as compared to default values currently used in patch.

b. I think this minor dip as compare to 64 client count is because one

this m/c has 64 hardware threads due which scaling beyond 64 client

count is difficult and second at relatively lower buffer count (500MB),

there is still minor contention around Buf Mapping locks.

In general, I think with patch the scaling is much better (2 times) than

HEAD, even when shared buffers are less and client count is high,

so this is not an issue.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

04 September 2014, 12:17:04

On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Its not difficult to handle such cases, but it can have downside also
> for the cases where demand from backends is not high.
> Consider in above case if instead of 500 more allocations, it just
> does 5 more allocations, then bgreclaimer will again have to go through
> the list and move 5 buffers and same can happen again by the time
> it moves 5 buffers.

That's exactly the scenario in which we *want* the looping behavior.
If that's happening, then it means it's taking us exactly as long to
find 5 buffers as it takes the rest of the system to use 5 buffers.
We need to run continuously to keep up.

>> It's not.  But if they are in the same cache line, they will behave
>> almost like one lock, because the CPU will lock the entire cache line
>> for each atomic op.  See Tom's comments upthread.
>
> I think to avoid having them in same cache line, we might need to
> add some padding (at least 72 bytes) as the structure size including both
> the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.
> I have taken performance data as well by keeping them further apart
> as suggested by you upthread and by introducing padding, but the
> difference in performance is less than 1.5% (on 64 and 128 client count)
> which also might be due to variation of data across runs.  So now to
> proceed we have below options:
>
> a. use two spinlocks as in patch, but keep them as far apart as possible.
> This might not have an advantage as compare to what is used currently
> in patch, but in future we can adding padding to take the advantage if
> possible (currently on PPC64, it doesn't show any noticeable advantage,
> however on some other m/c, it might show the advantage).
>
> b. use only one spinlock, this can have disadvantage in certain cases
> as mentioned upthread, however those might not be usual cases, so for
> now we can consider them as lower priority and can choose this option.

I guess I don't care that much.  I only mentioned it because Tom
brought it up; I don't really see a big problem with the way you're
doing it.

> Another point in this regard is that I have to make use of volatile
> pointer to prevent code rearrangement in this case.

Yep.  Or we need to get off our duff and fix it so that's not necessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Kevin Grittner

Date:

04 September 2014, 14:14:32

Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Its not difficult to handle such cases, but it can have downside also
>> for the cases where demand from backends is not high.
>> Consider in above case if instead of 500 more allocations, it just
>> does 5 more allocations, then bgreclaimer will again have to go through
>> the list and move 5 buffers and same can happen again by the time
>> it moves 5 buffers.
>
> That's exactly the scenario in which we *want* the looping behavior.
> If that's happening, then it means it's taking us exactly as long to
> find 5 buffers as it takes the rest of the system to use 5 buffers.
> We need to run continuously to keep up.

That's what I was thinking, as long as there isn't a lot of
overhead to starting and finishing a cycle.  If there is, my
inclination would be to try to fix that rather than to sleep and
hope things don't get out of hand before it wakes up again.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Alvaro Herrera

Date:

04 September 2014, 15:10:53

Robert Haas wrote:
> On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> +Background Reclaimer's Processing
> >> +---------------------------------
> >>
> >> I suggest titling this section "Background Reclaim".
> >
> > I don't mind changing it, but currently used title is based on similar
> > title "Background Writer's Processing".  It is used in previous
> > paragraph.  Is there a reason to title this differently?
> 
> Oh, I didn't see that.  Seems like weird phrasing to me, but I guess
> it's probably better to keep it consistent.

... or you can also change the other one.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Scaling shared buffer eviction

From

Mark Kirkwood

Date:

05 September 2014, 03:13:01

On 04/09/14 14:42, Amit Kapila wrote:
> On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
> wrote:
>>
>>
>> Hi Amit,
>>
>> Results look pretty good. Does it help in the read-write case too?
>
> Last time I ran the tpc-b test of pgbench (results of which are
> posted earlier in this thread), there doesn't seem to be any major
> gain for that, however for cases where read is predominant, you
> might see better gains.
>
> I am again planing to take that data in next few days.
>

FWIW below are some test results on the 60 core beast with this patch 
applied to 9.4. I'll need to do more runs to iron out the variation, but 
it looks like the patch helps the standard (write heavy) pgbench 
workload a little, and clearly helps the read only case.


4x E7-4890 15 cores each.
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD

Ubuntu 14.04  (Linux 3.13)
Postgres 9.4 beta2
+ buffer eviction patch v5

Pgbench

scale 2000

Non default params:

max_connections = 400;
shared_buffers = "10GB";
maintenance_work_mem = "1GB";
effective_io_concurrency = 10;
wal_buffers = "256MB";
checkpoint_segments = 1920;
checkpoint_completion_target = 0.8;
ssl = 'off';
wal_sync_method = 'open_datasync';

read write

elapsed 600s

Clients  | tps   | tps (unpatched)
---------+-------+----------------  6      |  8279 |  8328  12     | 16260 | 16381  24     | 23639 | 23451  48     |
31430| 31004  96     | 38516 | 34777 192     | 33535 | 32443 384     | 27978 | 25068 384     | 30589 | 28798
 


read only

elapsed 300s

Clients  | tps    | tps (unpatched)
---------+--------+----------------  6      |  57654 |  57255  12     | 111361 | 112360  24     | 220304 | 187967  48
 | 384567 | 230961  96     | 380309 | 241947 192     | 330865 | 214570 384     | 315516 | 207548
 


Regards

Mark

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

05 September 2014, 11:47:57

On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have updated the patch to address the feedback. Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> Overall this looks quite promising to me.
>
> I had thought to call the new process just "bgreclaim" rather than
> "bgreclaimer", but perhaps your name is better after all. At least,
> it matches what we do elsewhere. But I don't care for the use
> "Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
> else "bgreclaimer".

Changed it to bgreclaimer.

> This is unclear:
>
> +buffers for replacement. Earlier to protect freelist, we use LWLOCK as that
> +is needed to perform clock sweep which is a longer operation, however now we
> +are using two spinklocks freelist_lck and victimbuf_lck to perform operations
> +on freelist and run clock sweep respectively.
>
> I would drop the discussion of what was done before and say something
> like this: The data structures relating to buffer eviction are
> protected by two spinlocks. freelist_lck protects the freelist and
> related data structures, while victimbuf_lck protects information
> related to the current clock sweep condition.

Changed, but I have not used exact wording mentioned above, let me know

if new wording used is okay?

> +always in this list. We also throw buffers into this list if we consider
> +their pages unlikely to be needed soon; this is done by background process
> +reclaimer. The list is singly-linked using fields in the
>
> I suggest: Allocating pages from this list is much cheaper than
> running the "clock sweep" algorithm, which may encounter many buffers
> that are poor candidates for eviction before finding a good candidate.
> Therefore, we have a background process called bgreclaimer which works
> to keep this list populated.

Changed as per your suggestion.

> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".
>
> +The background reclaimer is designed to move buffers to freelist that are
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".

As per discussion in thread, I have kept it as it is.

> +and move the the unpinned and zero usage count buffers to freelist. It
> +keep on doing this until the number of buffers in freelist become equal
> +high threshold of freelist.
>
> s/keep/keeps/
> s/become equal/reaches the/
> s/high threshold/high water mark/
> s/of freelist//

Changed as per your suggestion.

> Please change the other places that say threshold to use the "water
> mark" terminology.
>
> + if (StrategyMoveBufferToFreeListEnd (bufHdr))
>
> Extra space.
>
> + * buffers in consecutive cycles.
>
> s/consecutive/later/
>
> + /* Execute the LRU scan */
>
> s/LRU scan/clock sweep/ ?

Changed as per your suggestion.

>
> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented. Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?

Okay, changed the loop as per your suggestion.

> + /* choose next victim buffer to clean. */
>
> This process doesn't clean buffers; it puts them on the freelist.

Right. Changed it to match what it does.

> + * high threshold of freelsit), we drasticaly reduce the odds for
>
> Two typos.

Fixed.

> + * of buffers in freelist fall below low threshold of freelist.
>
> s/fall/falls/

Changed as per your suggestion.

> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing. If we merge them into a single spinlock,
> does that hurt performance? If we put them further apart, e.g. by
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?

As per discussion, I have kept them as it is and added a comment
indicating that we can consider having both locks in separate
cache lines.

> + /*
> + * If the buffer is pinned or has a nonzero usage_count,
> we cannot use
> + * it; discard it and retry. (This can only happen if VACUUM put a
> + * valid buffer in the freelist and then someone else
> used it before
> + * we got to it. It's probably impossible altogether as
> of 8.3, but
> + * we'd better check anyway.)
> + */
> +
>
> This comment is clearly obsolete.

Removed.

> > I have not yet added statistics (buffers_backend_clocksweep) as
> > for that we need to add one more variable in BufferStrategyControl
> > structure where I have already added few variables for this patch.
> > I think it is important to have such a stat available via
> > pg_stat_bgwriter, but not sure if it is worth to make the structure
> > bit more bulky.
>
> I think it's worth it.

Okay added new statistic.

> > Another minor point is about changes in lwlock.h
> > lwlock.h
> > * if you remove a lock, consider leaving a gap in the numbering
> > * sequence for the benefit of DTrace and other external debugging
> > * scripts.
> >
> > As I have removed BufFreelist lock, I have adjusted the numbering
> > as well in lwlock.h. There is a meesage on top of lock definitions
> > which suggest to leave gap if we remove any lock, however I was not
> > sure whether this case (removing the first element) can effect anything,
> > so for now, I have adjusted the numbering.
>
> Let's leave slot 0 unused, instead.

Sure, that make sense.

Apart from above, I think for this patch, cat version bump is required

as I have modified system catalog. However I have not done the

same in patch as otherwise it will be bit difficult to take performance

data.

Performance Data with updated patch

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client Count/Patch_Ver (tps)	8	16	32	64	128
HEAD	58614	107370	140717	104357	65010
Patch	60092	113564	165014	213848	216065

This data is median of 3 runs, detailed report is attached with mail.

I have not repeated the test for all configurations, as there is no

major change in design/algorithm which can effect performance.

Mark has already taken tpc-b data which ensures that there is

no problem with it, however I will also take it once with latest version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

05 September 2014, 11:50:43

On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
> On 04/09/14 14:42, Amit Kapila wrote:
>>
>> On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
>> wrote:
>>>
>>>
>>>
>>> Hi Amit,
>>>
>>> Results look pretty good. Does it help in the read-write case too?
>>
>>
>> Last time I ran the tpc-b test of pgbench (results of which are
>> posted earlier in this thread), there doesn't seem to be any major
>> gain for that, however for cases where read is predominant, you
>> might see better gains.
>>
>> I am again planing to take that data in next few days.
>>
>
> FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation,

> but it looks like the patch helps the standard (write heavy) pgbench workload a little, and clearly helps the read only case.
>

Thanks for doing the test. I think if possible you can take

the performance data with higher scale factor (4000) as it

seems your m/c has 1TB of RAM. You might want to use

latest patch I have posted today.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

05 September 2014, 13:20:00

On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Apart from above, I think for this patch, cat version bump is required
> as I have modified system catalog. However I have not done the
> same in patch as otherwise it will be bit difficult to take performance
> data.

One regression failed on linux due to spacing issue which is

fixed in attached patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

scalable_buffer_eviction_v7.patch

Re: Scaling shared buffer eviction

From

Merlin Moncure

Date:

08 September 2014, 16:42:13

On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Client Count/Patch_Ver (tps) 8 16 32 64 128
> HEAD 58614 107370 140717 104357 65010
> Patch 60092 113564 165014 213848 216065
>
> This data is median of 3 runs, detailed report is attached with mail.
> I have not repeated the test for all configurations, as there is no
> major change in design/algorithm which can effect performance.
> Mark has already taken tpc-b data which ensures that there is
> no problem with it, however I will also take it once with latest version.

Well, these numbers are pretty much amazing.  Question: It seems
there's obviously quite a bit of contention left;  do you think
there's still a significant amount of time in the clock sweep, or is
the primary bottleneck the buffer mapping locks?

merlin

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

08 September 2014, 18:46:29

On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Apart from above, I think for this patch, cat version bump is required
>> as I have modified system catalog.  However I have not done the
>> same in patch as otherwise it will be bit difficult to take performance
>> data.
>
> One regression failed on linux due to spacing issue which is
> fixed in attached patch.

I took another read through this patch.  Here are some further review comments:

1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
both num_to_free and tmp_num_to_free.  I'd get rid of tmp_num_to_free
and move the declaration of num_to_free inside the outer loop.  I'd
also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
tmp_recent_backend_clocksweep into the innermost scope in which they
are used.

2. Also in that function, I think the innermost bit of logic could be
rewritten more compactly, and in such a way as to make it clearer for
what set of instructions the buffer header will be locked.
LockBufHdr(bufHdr); if (bufHdr->refcount == 0) { if
(bufHdr->usage_count > 0) bufHdr->usage_count--; else add_to_freelist
= true; } UnlockBufHdr(bufHdr); if (add_to_freelist &&
StrategyMoveBufferToFreeListEnd(bufHdr)) num_to_free--;

3. This comment is now obsolete:

+       /*
+        * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+        * not do so while holding freelist_lck; so set it after releasing the
+        * freelist_lck.  This is annoyingly tedious, but it happens
at most once
+        * per bgwriter cycle, so the performance hit is minimal.
+        */
+

We're not actually holding any lock in need of releasing at that point
in the code, so this can be shortened to "If bgwriterLatch is set, we
need to waken the bgwriter."
    * Ideally numFreeListBuffers should get called under freelist spinlock,

That doesn't make any sense.  numFreeListBuffers is a variable, so you
can't "call" it.  The value should be *read* under the spinlock, but
it is.  I think this whole comment can be deleted and replaced with
"If the number of free buffers has fallen below the low water mark,
awaken the bgreclaimer to repopulate it."

4. StrategySyncStartAndEnd() is kind of a mess.  One, it can return
the same victim buffer that's being handed out - at almost the same
time - to a backend running the clock sweep; if it does, they'll fight
over the buffer.  Two, the *end out parameter actually returns a
count, not an endpoint.  I think we should have
BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
top of the inner loop rather than the bottom, and change
StrategySyncStartAndEnd() so that it knows nothing about
victimbuf_lck.  Let's also change StrategyGetBuffer() to call
StrategySyncNextVictimBuffer() so that the logic is centralized in one
place, and rename StrategySyncStartAndEnd() to something that better
matches its revised purpose.  Maybe StrategyGetReclaimInfo().

5. Have you tested that this new bgwriter statistic is actually
working?  Because it looks to me like BgMoveBuffersToFreelist is
changing BgWriterStats but never calling pgstat_send_bgwriter(), which
I'm thinking will mean the counters accumulate forever inside the
reclaimer but never get forwarded to the stats collector.

6. StrategyInitialize() uses #defines for
HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
for clamping.  Let's have constants for all of those (and omit mention
of the specific values in the comments).

7. The line you've added to the definition of view pg_stat_bgwriter
doesn't seem to be indented the same way as all the others.  Tab vs.
space problem?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Thom Brown

Date:

08 September 2014, 21:41:37

On 5 September 2014 14:19, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Apart from above, I think for this patch, cat version bump is required
> as I have modified system catalog. However I have not done the
> same in patch as otherwise it will be bit difficult to take performance
> data.

One regression failed on linux due to spacing issue which is
fixed in attached patch.

Here's a set of test results against this patch:

8-core AMD FX-9590, 32GB RAM

Config:

checkpoint_segments = 256

checkpoint_timeout = 15min

shared_buffers = 8GB

pgbench scale factor = 3000

run duration time = 5min

HEAD

Client Count/No. Of Runs (tps) 8 16 32 64 128

Run-1 31568 41890 48638 49266 41845

Run-2 31613 41879 48597 49332 41736

Run-3 31674 41987 48647 49320 41745

Patch=scalable_buffer_eviction_v7.patch

Client Count/No. Of Runs (tps) 8 16 32 64 128

Run-1 31880 42943 49359 49901 43779

Run-2 32150 43078 48934 49828 43769

Run-3 32323 42894 49481 49600 43529

--
Thom

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

09 September 2014, 03:49:03

On Mon, Sep 8, 2014 at 10:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
> On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Client Count/Patch_Ver (tps) 8 16 32 64 128
> > HEAD 58614 107370 140717 104357 65010
> > Patch 60092 113564 165014 213848 216065
> >
> > This data is median of 3 runs, detailed report is attached with mail.
> > I have not repeated the test for all configurations, as there is no
> > major change in design/algorithm which can effect performance.
> > Mark has already taken tpc-b data which ensures that there is
> > no problem with it, however I will also take it once with latest version.
>
> Well, these numbers are pretty much amazing. Question: It seems
> there's obviously quite a bit of contention left; do you think
> there's still a significant amount of time in the clock sweep, or is
> the primary bottleneck the buffer mapping locks?

I think contention around clock sweep is very minimal and for buffer

mapping locks it has reduced significantly (you can refer upthread

some of LWLOCK stat data, I have posted after this patch), but

might be we can get more out of it by improving hash table

concurrency. However apart from both of the above, the next thing

I have seen in profiles was palloc (at least that is what I remember

when I had done some profiling few months back during the

development of this patch). It seems to me at that time a totally

different optimisation, so I left it for another patch.

Another point is that the m/c on which I am doing performance

test has 16 cores (64 hardware threads), so not sure how much

scaling we can expect.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

09 September 2014, 10:24:56

On Tue, Sep 9, 2014 at 3:11 AM, Thom Brown <thom@linux.com> wrote:
> On 5 September 2014 14:19, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > Apart from above, I think for this patch, cat version bump is required
>> > as I have modified system catalog. However I have not done the
>> > same in patch as otherwise it will be bit difficult to take performance
>> > data.
>>
>> One regression failed on linux due to spacing issue which is
>> fixed in attached patch.
>
>
> Here's a set of test results against this patch:

Many thanks for taking the performance data. This data clearly shows

that there is a performance improvement at even lower configuration,

however the real benefit of the patch can be seen with higher core

m/c and with larger RAM (can contain all the data).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Mark Kirkwood

Date:

10 September 2014, 00:16:23

On 05/09/14 23:50, Amit Kapila wrote:
> On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
> wrote:
>  >
>  > On 04/09/14 14:42, Amit Kapila wrote:
>  >>
>  >> On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
>  >> wrote:
>  >>>
>  >>>
>  >>>
>  >>> Hi Amit,
>  >>>
>  >>> Results look pretty good. Does it help in the read-write case too?
>  >>
>  >>
>  >> Last time I ran the tpc-b test of pgbench (results of which are
>  >> posted earlier in this thread), there doesn't seem to be any major
>  >> gain for that, however for cases where read is predominant, you
>  >> might see better gains.
>  >>
>  >> I am again planing to take that data in next few days.
>  >>
>  >
>  > FWIW below are some test results on the 60 core beast with this patch
> applied to 9.4. I'll need to do more runs to iron out the variation,
>  > but it looks like the patch helps the standard (write heavy) pgbench
> workload a little, and clearly helps the read only case.
>  >
>
> Thanks for doing the test.  I think if possible you can take
> the performance data with higher scale factor (4000) as it
> seems your m/c has 1TB of RAM.  You might want to use
> latest patch I have posted today.
>

Here's some fairly typical data from read-write and read-only runs at 
scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not 
seeing much variation between repeated read-write runs with the same 
config (which is nice - sleep 30 and explicit checkpoint call between 
each one seem to help there).

Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be 
better at higher client counts than beta1 was...

In terms of the effect of the patch - looks pretty similar to the scale 
2000 results for read-write, but read-only is a different and more 
interesting story - unpatched 9.4 is noticeably impacted in the higher 
client counts, whereas the patched version scales as well (or even 
better perhaps) than in the scale 2000 case.

read write (600s)

Clients  | tps    | tps (unpatched)
---------+--------+----------------  6      |   9395 |   9334  12     |  16605 |  16525  24     |  24634 |  24910  48
 |  32170 |  31275  96     |  35675 |  36533 192     |  35579 |  31137 384     |  30528 |  28308
 


read only (300s)

Clients  | tps    | tps (unpatched)
---------+--------+----------------  6      |  35743 |   35362  12     | 111019 |  106579  24     | 199746 |  160305
48    | 327026 |  198407  96     | 379184 |  171863 192     | 356623 |  152224 384     | 340878 |  128308
 


regards

Mark

Re: Scaling shared buffer eviction

From

Michael Paquier

Date:

10 September 2014, 03:42:06

On Tue, Sep 9, 2014 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> One regression failed on linux due to spacing issue which is
>> fixed in attached patch.
I just read the latest patch by curiosity, wouldn't it make more sense
to split the patch into two different patches for clarity: one for the
reclaimer worker centralized around BgMoveBuffersToFreelist and a
second for the pg_stat portion? Those seem two different features.
Regards,
-- 
Michael

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

10 September 2014, 06:47:41

On Tue, Sep 9, 2014 at 12:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Apart from above, I think for this patch, cat version bump is required
> >> as I have modified system catalog. However I have not done the
> >> same in patch as otherwise it will be bit difficult to take performance
> >> data.
> >
> > One regression failed on linux due to spacing issue which is
> > fixed in attached patch.
>
> I took another read through this patch. Here are some further review comments:
>
> 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
> both num_to_free and tmp_num_to_free.

num_to_free is used to accumulate total number of buffers that are
freed in one cycle of BgMoveBuffersToFreelist() which is reported
for debug info (BGW_DEBUG) and tmp_num_to_free is used as a temporary
number which is a count of number of buffers to be freed in one sub-cycle
(inner while loop)

> I'd get rid of tmp_num_to_free
> and move the declaration of num_to_free inside the outer loop. I'd
> also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
> tmp_recent_backend_clocksweep into the innermost scope in which they
> are used.

okay, I have moved the tmp_* variables in innermost scope.

> 2. Also in that function, I think the innermost bit of logic could be
> rewritten more compactly, and in such a way as to make it clearer for
> what set of instructions the buffer header will be locked.
> LockBufHdr(bufHdr); if (bufHdr->refcount == 0) { if
> (bufHdr->usage_count > 0) bufHdr->usage_count--; else add_to_freelist
> = true; } UnlockBufHdr(bufHdr); if (add_to_freelist &&
> StrategyMoveBufferToFreeListEnd(bufHdr)) num_to_free--;

Changed as per suggestion.

> 3. This comment is now obsolete:
>
> + /*
> + * If bgwriterLatch is set, we need to waken the bgwriter, but we should
> + * not do so while holding freelist_lck; so set it after releasing the
> + * freelist_lck. This is annoyingly tedious, but it happens
> at most once
> + * per bgwriter cycle, so the performance hit is minimal.
> + */
> +
>
> We're not actually holding any lock in need of releasing at that point
> in the code, so this can be shortened to "If bgwriterLatch is set, we
> need to waken the bgwriter."

Changed as per suggestion.

> * Ideally numFreeListBuffers should get called under freelist spinlock,
>
> That doesn't make any sense. numFreeListBuffers is a variable, so you
> can't "call" it. The value should be *read* under the spinlock, but
> it is. I think this whole comment can be deleted and replaced with
> "If the number of free buffers has fallen below the low water mark,
> awaken the bgreclaimer to repopulate it."

Changed as per suggestion.

> 4. StrategySyncStartAndEnd() is kind of a mess. One, it can return
> the same victim buffer that's being handed out - at almost the same
> time - to a backend running the clock sweep; if it does, they'll fight
> over the buffer. Two, the *end out parameter actually returns a
> count, not an endpoint. I think we should have
> BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
> top of the inner loop rather than the bottom, and change
> StrategySyncStartAndEnd() so that it knows nothing about
> victimbuf_lck. Let's also change StrategyGetBuffer() to call
> StrategySyncNextVictimBuffer() so that the logic is centralized in one
> place, and rename StrategySyncStartAndEnd() to something that better
> matches its revised purpose.

Changed as per suggestion. I have also updated

StrategySyncNextVictimBuffer() such that it increments completePasses

on completion of cycle as I think it is appropriate to update it, even when

clock sweep is done by bgreclaimer.

> Maybe StrategyGetReclaimInfo().

I have changed it to StrategyGetFreelistAccessInfo() as it seems most
other functions in freelist.c uses the names to sound something related
to buffers.

> 5. Have you tested that this new bgwriter statistic is actually
> working? Because it looks to me like BgMoveBuffersToFreelist is
> changing BgWriterStats but never calling pgstat_send_bgwriter(), which
> I'm thinking will mean the counters accumulate forever inside the
> reclaimer but never get forwarded to the stats collector.

pgstat_send_bgwriter() is called in bgreclaimer loop (caller of
BgMoveBuffersToFreelist, this is similar to how bgwriter does).
I have done few tests with it before sending the previous patch.

> 6. StrategyInitialize() uses #defines for
> HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
> LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
> for clamping. Let's have constants for all of those (and omit mention
> of the specific values in the comments).

Changed as per suggestion.

> 7. The line you've added to the definition of view pg_stat_bgwriter
> doesn't seem to be indented the same way as all the others. Tab vs.
> space problem?

Fixed.

Performance Data:
-------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client Count/Patch_ver	8	16	32	64	128
HEAD	58614	107370	140717	104357	65010
Patch	61825	115152	170952	217389	220994

Observation

--------------------

1. The scalability/performance is similar to previous patch, slightly

better at higher client count.

2. I have taken the performance data just for one set of configuration,

as there doesn't seem to be any fundamental change which can impact

performance.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

scalable_buffer_eviction_v8.patch

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

10 September 2014, 06:54:11

On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
> On 05/09/14 23:50, Amit Kapila wrote:
>>
>> On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
>> > FWIW below are some test results on the 60 core beast with this patch
>> applied to 9.4. I'll need to do more runs to iron out the variation,
>> > but it looks like the patch helps the standard (write heavy) pgbench
>> workload a little, and clearly helps the read only case.
>> >
>>
>> Thanks for doing the test. I think if possible you can take
>> the performance data with higher scale factor (4000) as it
>> seems your m/c has 1TB of RAM. You might want to use
>> latest patch I have posted today.
>>
>
> Here's some fairly typical data from read-write and read-only runs at scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not seeing much variation between repeated read-write runs with the same config (which is nice - sleep 30 and explicit checkpoint call between each one seem to help there).
>
> Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be better at higher client counts than beta1 was...
>
> In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case.

Yeah, that's what I was expecting, the benefit of this patch

will be more at higher client count when there is large data

and all the data can fit in RAM .

Many thanks for doing the performance test for patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Mark Kirkwood

Date:

10 September 2014, 08:09:27

On 10/09/14 18:54, Amit Kapila wrote:
> On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
> wrote:
>  >
>  > In terms of the effect of the patch - looks pretty similar to the
> scale 2000 results for read-write, but read-only is a different and more
> interesting story - unpatched 9.4 is noticeably impacted in the higher
> client counts, whereas the patched version scales as well (or even
> better perhaps) than in the scale 2000 case.
>
> Yeah, that's what I was expecting, the benefit of this patch
> will be more at higher client count when there is large data
> and all the data can fit in RAM .
>
> Many thanks for doing the performance test for patch.
>
>

No worries, this is looking like a patch we're going to apply to 9.4 to 
make the 60 core beast scale a bit better, so thanks very much for your 
work in this area.

If you would like more tests run at higher scales let me know (we have 
two of these machines at pre-production state currently so I can run 
benchmarks as reqd)!

Regards

Mark

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

11 September 2014, 11:02:03

Hi,

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
>  include $(top_srcdir)/src/backend/common.mk
> diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
> new file mode 100644
> index 0000000..3df2337
> --- /dev/null
> +++ b/src/backend/postmaster/bgreclaimer.c

A fair number of comments in that file refer to bgwriter...

> @@ -0,0 +1,302 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bgreclaimer.c
> + *
> + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
> + * attempts to keep regular backends from having to run clock sweep (which
> + * they would only do when they don't find a usable shared buffer from
> + * freelist to read in another page).

That's not really accurate. Freelist pages are often also needed to
write new pages, without reading anything in. I'd phrase it as "which
they only need to do if they don't find a victim buffer from the
freelist"

>  In the best scenario all requests
> + * for shared buffers will be fulfilled from freelist as the background
> + * reclaimer process always tries to maintain buffers on freelist.  However,
> + * regular backends are still empowered to run clock sweep to find a usable
> + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.

"empowered" sounds strange to me. 'still can run the clock sweep'?

> + * The bgwriter is started by the postmaster as soon as the startup subprocess
> + * finishes, or as soon as recovery begins if we are doing archive recovery.

Why only archive recovery? I guess (only read this far...) it's not just
during InArchiveRecoveryb recovery but also StandbyMode? But I don't see
why we shouldn't use it during normal crash recovery. That's also often
painfully slow and the reclaimer could help? Less, but still.


> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> +static void ReqShutdownHandler(SIGNAL_ARGS);
> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

> +    /*
> +     * If an exception is encountered, processing resumes here.
> +     *
> +     * See notes in postgres.c about the design of this coding.
> +     */
> +    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> +    {
> +        /* Since not using PG_TRY, must reset error stack by hand */
> +        error_context_stack = NULL;
> +
> +        /* Prevent interrupts while cleaning up */
> +        HOLD_INTERRUPTS();
> +
> +        /* Report the error to the server log */
> +        EmitErrorReport();
> +
> +        /*
> +         * These operations are really just a minimal subset of
> +         * AbortTransaction().  We don't have very many resources to worry
> +         * about in bgreclaim, but we do have buffers and file descriptors.
> +         */
> +        UnlockBuffers();
> +        AtEOXact_Buffers(false);
> +        AtEOXact_Files();

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

> +        /*
> +         * Now return to normal top-level context and clear ErrorContext for
> +         * next time.
> +         */
> +        MemoryContextSwitchTo(bgreclaim_context);
> +        FlushErrorState();
> +
> +        /* Flush any leaked data in the top-level context */
> +        MemoryContextResetAndDeleteChildren(bgreclaim_context);
> +
> +        /* Now we can allow interrupts again */
> +        RESUME_INTERRUPTS();

Other processes sleep for a second here, I think that's a good
idea. E.g. that bit:    /*     * Sleep at least 1 second after any error.  A write error is likely     * to be
repeated,and we don't want to be filling the error logs as     * fast as we can.     */    pg_usleep(1000000L);
 


> +    /*
> +     * Loop forever
> +     */
> +    for (;;)
> +    {
> +        int            rc;
> +
> +
> +        /*
> +         * Backend will signal bgreclaimer when the number of buffers in
> +         * freelist falls below than low water mark of freelist.
> +         */
> +        rc = WaitLatch(&MyProc->procLatch,
> +                       WL_LATCH_SET | WL_POSTMASTER_DEATH,
> +                       -1);

That's probably not going to work well directly after a (re)start of
bgreclaim (depending on how you handle the water mark, I'll see in a
bit). Maybe it should rather be

ResetLatch();
BgMoveBuffersToFreelist();
pgstat_send_bgwriter();
rc = WaitLatch()
if (rc & WL_POSTMASTER_DEATH)  exit(1)

> +Background Reclaimer's Processing
> +---------------------------------
> +
> +The background reclaimer is designed to move buffers to freelist that are
> +likely to be recycled soon, thereby offloading the need to perform
> +clock sweep work from active backends.  To do this, it runs the clock sweep
> +and move the the unpinned and zero usage count buffers to freelist.  It
> +keeps on doing this until the number of buffers in freelist reaches the
> +high water mark.
> +
> +Two water mark indicators are used to maintain sufficient number of buffers
> +on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
> +when the number of buffers in freelist falls below it.  High water mark
> +indicator is used by bgreclaimer to move buffers to freelist.

For me the description of the high water as stated here doesn't seem to
explain anything.

This section should have a description of how the reclaimer interacts
with the bgwriter logic. Do we put dirty buffers on the freelist that
are then cleaned by the bgwriter? Which buffers does the bgwriter write
out?

>  /*
> + * Move buffers with reference and usage_count as zero to freelist.
> + * By maintaining enough number of buffers on freelist (equal to
> + * high water mark of freelist), we drastically reduce the odds for
> + * backend's to perform clock sweep.

Move buffers with reference and a usage_count *of* zero to freelist. By
maintaining enough buffers in the freelist (up to the list's high water
mark), we drastically reduce the likelihood of individual backends
having to perform the clock sweep themselves.

> + * This is called by the background reclaim process when the number
> + * of buffers in freelist falls below low water mark of freelist.
> + */

The logic used here *definitely* needs to be documented in another form
somewhere in the source.

> +void
> +BgMoveBuffersToFreelist(void)
> +{
> +    uint32    num_to_free = 0;
> +    uint32    recent_alloc = 0;
> +    uint32  recent_backend_clocksweep = 0;
> +    volatile uint32    next_victim = 0;
> +
> +    /* Execute the clock sweep */
> +    for (;;)
> +    {
> +        uint32    tmp_num_to_free;
> +        uint32    tmp_recent_alloc;
> +        uint32  tmp_recent_backend_clocksweep;
> +
> +        StrategyGetFreelistAccessInfo(&tmp_num_to_free,
> +                                      &tmp_recent_alloc,
> +                                      &tmp_recent_backend_clocksweep);
> +
> +        num_to_free += tmp_num_to_free;
> +        recent_alloc += tmp_recent_alloc;
> +        recent_backend_clocksweep += tmp_recent_backend_clocksweep;
> +
> +        if (tmp_num_to_free == 0)
> +            break;

num_to_free isn't a convincing name if I understand what this is doing
correctly. Maybe 'move_to_freelist', 'freelist_needed',
'needed_on_freelist' or something like that?


I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?

> +        while (tmp_num_to_free > 0)
> +        {
> +            volatile BufferDesc *bufHdr;
> +            bool    add_to_freelist = false;
> +
> +            /*
> +             * Choose next victim buffer to look if that can be moved
> +             * to freelist.
> +             */
> +            StrategySyncNextVictimBuffer(&next_victim);
> +
> +            bufHdr = &BufferDescriptors[next_victim];
> +
> +            /*
> +             * If the buffer is pinned or has a nonzero usage_count, we cannot
> +             * move it to freelist; decrement the usage_count (unless pinned)
> +             * and keep scanning.
> +             */
> +            LockBufHdr(bufHdr);

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

> +/*
> + * Water mark indicators for maintaining buffers on freelist.  When the
> + * number of buffers on freelist drops below the low water mark, the
> + * allocating backend sets the latch and bgreclaimer wakesup and begin
> + * adding buffer's to freelist until it reaches high water mark and then
> + * again goes back to sleep.
> + */

s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
/to freelist/to the freelist/; s/reaches high water/reaches the high water/

> +int freelistLowWaterMark;
> +int freelistHighWaterMark;
> +
> +/* Percentage indicators for maintaining buffers on freelist */
> +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT    0.005
> +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT    0.2
> +#define MIN_HIGH_WATER_MARK    5
> +#define MAX_HIGH_WATER_MARK    2000

I'm confused. The high water mark percentage is smaller than the low
water mark?

What's the reasoning for these numbers? What's the justification for the
max of 2k buffers for the high watermark? That's not much on a busy
database with large s_b?

> -    *lock_held = true;
> -    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> +    SpinLockAcquire(&StrategyControl->freelist_lck);
>  
>      /*
> -     * We count buffer allocation requests so that the bgwriter can estimate
> -     * the rate of buffer consumption.  Note that buffers recycled by a
> -     * strategy object are intentionally not counted here.
> +     * We count buffer allocation requests so that the bgwriter or bgreclaimer
> +     * can know the rate of buffer consumption and report it as stats.  Note
> +     * that buffers recycled by a strategy object are intentionally not counted
> +     * here.
>       */
>      StrategyControl->numBufferAllocs++;
>  
>      /*
> -     * If bgwriterLatch is set, we need to waken the bgwriter, but we should
> -     * not do so while holding BufFreelistLock; so release and re-grab.  This
> -     * is annoyingly tedious, but it happens at most once per bgwriter cycle,
> -     * so the performance hit is minimal.
> +     * Remember the values of bgwriter and bgreclaimer latch so that they can
> +     * be set outside spin lock and try to get a buffer from the freelist.
>       */
> +    bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
>      bgwriterLatch = StrategyControl->bgwriterLatch;

I don't understand why these need to be grabbed under the spinlock?

>      if (bgwriterLatch)
> -    {
>          StrategyControl->bgwriterLatch = NULL;
> -        LWLockRelease(BufFreelistLock);
> -        SetLatch(bgwriterLatch);
> -        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> -    }
>  
> -    /*
> -     * Try to get a buffer from the freelist.  Note that the freeNext fields
> -     * are considered to be protected by the BufFreelistLock not the
> -     * individual buffer spinlocks, so it's OK to manipulate them without
> -     * holding the spinlock.
> -     */
> -    while (StrategyControl->firstFreeBuffer >= 0)
> +    numFreeListBuffers = StrategyControl->numFreeListBuffers;
> +
> +    if (StrategyControl->firstFreeBuffer >= 0)
>      {
>          buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
>          Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
> @@ -169,35 +198,82 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
>          /* Unconditionally remove buffer from freelist */
>          StrategyControl->firstFreeBuffer = buf->freeNext;
>          buf->freeNext = FREENEXT_NOT_IN_LIST;
> +        --StrategyControl->numFreeListBuffers;
> +    }
> +    else
> +        StrategyControl->numBufferBackendClocksweep++;
> +
> +    SpinLockRelease(&StrategyControl->freelist_lck);

> +    /* If bgwriterLatch is set, we need to waken the bgwriter */
> +    if (bgwriterLatch)
> +        SetLatch(bgwriterLatch);
> +
> +    /*
> +     * If the number of free buffers has fallen below the low water mark,
> +     * awaken the bgreclaimer to repopulate it.  bgreclaimerLatch is initialized in
> +     * early phase of BgReclaimer startup, however we still check before using
> +     * it to avoid any problem incase we reach here before its initializion.
> +     */
> +    if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
> +        SetLatch(StrategyControl->bgreclaimerLatch);
> +
> +    if (buf != NULL)
> +    {
>          /*
> -         * If the buffer is pinned or has a nonzero usage_count, we cannot use
> -         * it; discard it and retry.  (This can only happen if VACUUM put a
> -         * valid buffer in the freelist and then someone else used it before
> -         * we got to it.  It's probably impossible altogether as of 8.3, but
> -         * we'd better check anyway.)
> +         * Try to get a buffer from the freelist.  Note that the freeNext fields
> +         * are considered to be protected by the freelist_lck not the
> +         * individual buffer spinlocks, so it's OK to manipulate them without
> +         * holding the buffer spinlock.
>           */
> -        LockBufHdr(buf);
> -        if (buf->refcount == 0 && buf->usage_count == 0)
> +        for(;;)
>          {
> -            if (strategy != NULL)
> -                AddBufferToRing(strategy, buf);
> -            return buf;
> +            /*
> +             * If the buffer is pinned or has a nonzero usage_count, we cannot use
> +             * it; discard it and retry.
> +             */
> +            LockBufHdr(buf);
> +            if (buf->refcount == 0 && buf->usage_count == 0)
> +            {
> +                if (strategy != NULL)
> +                    AddBufferToRing(strategy, buf);
> +                return buf;
> +            }
> +            UnlockBufHdr(buf);
> +
> +            SpinLockAcquire(&StrategyControl->freelist_lck);
> +
> +            if (StrategyControl->firstFreeBuffer >= 0)
> +            {
> +                buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
> +                Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
> +
> +                /* Unconditionally remove buffer from freelist */
> +                StrategyControl->firstFreeBuffer = buf->freeNext;
> +                buf->freeNext = FREENEXT_NOT_IN_LIST;
> +                --StrategyControl->numFreeListBuffers;
> +
> +                SpinLockRelease(&StrategyControl->freelist_lck);
> +            }
> +            else
> +            {
> +                StrategyControl->numBufferBackendClocksweep++;
> +                SpinLockRelease(&StrategyControl->freelist_lck);
> +                break;
> +            }
>          }
> -        UnlockBufHdr(buf);
>      }

I think it makes sense to break out this bit into its own
function. That'll make StrategyGetBuffer() a good bit easier to read.

>      /* Nothing on the freelist, so run the "clock sweep" algorithm */
>      trycounter = NBuffers;
> +
>      for (;;)
>      {
> -        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
> +        volatile uint32    next_victim;
>  
> -        if (++StrategyControl->nextVictimBuffer >= NBuffers)
> -        {
> -            StrategyControl->nextVictimBuffer = 0;
> -            StrategyControl->completePasses++;
> -        }
> +        StrategySyncNextVictimBuffer(&next_victim);
> +
> +        buf = &BufferDescriptors[next_victim];

I'd also move this into its own function, but thats's more debatable.

>          /*
>           * If the buffer is pinned or has a nonzero usage_count, we cannot use
> @@ -241,7 +317,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
>  void
>  StrategyFreeBuffer(volatile BufferDesc *buf)
>  {
> -    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> +    SpinLockAcquire(&StrategyControl->freelist_lck);
>  
>      /*
>       * It is possible that we are told to put something in the freelist that
> @@ -253,11 +329,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
>          if (buf->freeNext < 0)
>              StrategyControl->lastFreeBuffer = buf->buf_id;
>          StrategyControl->firstFreeBuffer = buf->buf_id;
> +        ++StrategyControl->numFreeListBuffers;
> +    }
> +
> +    SpinLockRelease(&StrategyControl->freelist_lck);
> +}
> +
> +/*
> + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> + */
> +bool
> +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> +{

Should maybe rather be named *Tail?

> +    bool        freed = false;
> +    SpinLockAcquire(&StrategyControl->freelist_lck);
> +
> +    /*
> +     * It is possible that we are told to put something in the freelist that
> +     * is already in it; don't screw up the list if so.
> +     */

When/Why is that possible?

>  /*
> + * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer
> + * to move unused buffers to freelist.
> + *
> + * The result is the number of buffers that are required to be moved to
> + * freelist and count of recent buffer allocs and buffer allocs not
> + * satisfied from freelist.
> + */
> +void
> +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
> +                              uint32 *num_buf_backend_clocksweep)
> +{
> +    int            curfreebuffers;
> +
> +    SpinLockAcquire(&StrategyControl->freelist_lck);
> +    curfreebuffers = StrategyControl->numFreeListBuffers;
> +    if (curfreebuffers < freelistHighWaterMark)
> +        *num_buf_to_free = freelistHighWaterMark - curfreebuffers;
> +    else
> +        *num_buf_to_free = 0;
> +
> +    if (num_buf_alloc)
> +    {
> +        *num_buf_alloc = StrategyControl->numBufferAllocs;
> +        StrategyControl->numBufferAllocs = 0;
> +    }
> +    if (num_buf_backend_clocksweep)
> +    {
> +        *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
> +        StrategyControl->numBufferBackendClocksweep = 0;
> +    }
> +    SpinLockRelease(&StrategyControl->freelist_lck);
> +
> +    return;
> +}

Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?


Some more general remarks:
* I think there's a fair amount of unexplained heuristics in here
* Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes
busilyspinning on it... I think it's a acceptable risk, but we should think about it.
 
* I'm not convinced that these changes can be made without also changing the bgwriter logic. Have you measured whether
thereare differences in how effective the bgwriter is? Not that it's very effective right now :)
 

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

11 September 2014, 13:02:44

Thanks for reviewing, Andres.

On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> +static void bgreclaim_quickdie(SIGNAL_ARGS);
>> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
>> +static void ReqShutdownHandler(SIGNAL_ARGS);
>> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
>
> This looks inconsistent.

It's exactly the same as what bgwriter.c does.

> No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> good idea, regardless of it possibly being true today (which I'm not
> sure about yet).

We really need a more centralized way to handle error cleanup in
auxiliary processes.  The current state of affairs is really pretty
helter-skelter.  But for this patch, I think we should aim to mimic
the existing style, as ugly as it is.  I'm not sure whether Amit's got
the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
is probably a good idea.

>> +Two water mark indicators are used to maintain sufficient number of buffers
>> +on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
>> +when the number of buffers in freelist falls below it.  High water mark
>> +indicator is used by bgreclaimer to move buffers to freelist.
>
> For me the description of the high water as stated here doesn't seem to
> explain anything.

Yeah, let me try to revise and expand on that a bit:

Background Reclaimer's Processing
---------------------------------

The background reclaimer runs the clock sweep to identify buffers that
are good candidates for eviction and puts them on the freelist.  This
makes buffer allocation much faster, since removing a buffer from the
head of a linked list is much cheaper than linearly scanning the whole
buffer pool until a promising candidate is found.  It's possible that
a buffer we add to the freelist may be accessed or even pinned before
it's evicted; if that happens, the backend that would have evicted it
will simply disregard it and take the next buffer instead (or run the
clock sweep itself, if necessary).  However, to make sure that doesn't
happen too often, we need to keep the freelist as short as possible,
so that there won't be many other buffer accesses between when the
time a buffer is added to the freelist and the time when it's actually
evicted.

We use two water marks to control the activity of the bgreclaimer
process.  Each time bgreclaimer is awoken, it will move buffers to the
freelist until the length of the free list reaches the high water
mark.  It will then sleep.  When the number of buffers on the freelist
reaches the low water mark, backends attempting to allocate new
buffers will set the bgreclaimer's latch, waking it up again.  While
it's important for the high water mark to be small (for the reasons
described above), we also need to ensure adequate separation between
the low and high water marks, so that the bgreclaimer isn't constantly
being awoken to find just a handful of additional candidate buffers,
and we need to ensure that the low watermark is adequate to keep the
freelist from becoming completely empty before bgreclaimer has time to
wake up and beginning filling it again.

> This section should have a description of how the reclaimer interacts
> with the bgwriter logic. Do we put dirty buffers on the freelist that
> are then cleaned by the bgwriter? Which buffers does the bgwriter write
> out?

The bgwriter is cleaning ahead of the strategy point, and the
bgreclaimer is advancing the strategy point.  I think we should
consider having the bgreclaimer wake the bgwriter if it comes across a
dirty buffer, because while the bgwriter only estimates the rate of
buffer allocation, bgreclaimer *knows* the rate of allocation, because
its own activity is tied to the allocation rate.  I think there's the
potential for this kind of thing to make the background writer
significantly more effective than it is today, but I'm heavily in
favor of leaving it for a separate patch.

> I wonder if we don't want to increase the high watermark when
> tmp_recent_backend_clocksweep > 0?

That doesn't really work unless there's some countervailing force to
eventually reduce it again; otherwise, it'd just converge to infinity.
And it doesn't really seem necessary at the moment.

> Hm. Perhaps we should do a bufHdr->refcount != zero check without
> locking here? The atomic op will transfer the cacheline exclusively to
> the reclaimer's CPU. Even though it very shortly afterwards will be
> touched afterwards by the pinning backend.

Meh.  I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

> * Are we sure that the freelist_lck spinlock won't cause pain? Right now
>   there will possibly be dozens of processes busily spinning on it... I
>   think it's a acceptable risk, but we should think about it.

As you and I have talked about before, we could reduce contention here
by partitioning the freelist, or by using a CAS loop to pop items off
of it.  But I am not convinced either is necessary; I think it's hard
for the system to accumulate enough people hitting the freelist
simultaneously to matter, because the stuff they've got to do between
one freelist access and the next is generally going to be something
much more expensive, like reading 8kB from the OS.

One question in my mind is whether we ought to separate this patch
into two - one for the changes to the locking regime, and another for
the addition of the bgreclaimer process.  Those things are really two
different features, although they are tightly enough coupled that
maybe it's OK to keep them together.  I also think it would be good to
get some statistics on how often regular backends are running the
clocksweep vs. how often bgreclaimer is satisfying their needs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

11 September 2014, 13:22:37

On 2014-09-11 09:02:34 -0400, Robert Haas wrote:
> Thanks for reviewing, Andres.
> 
> On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> >> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> >> +static void ReqShutdownHandler(SIGNAL_ARGS);
> >> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
> 
> It's exactly the same as what bgwriter.c does.

So what? There's no code in common, so I see no reason to have one
signal handler using underscores and the next one camelcase names.

> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
> 
> We really need a more centralized way to handle error cleanup in
> auxiliary processes.  The current state of affairs is really pretty
> helter-skelter.

Agreed. There really should be three variants:
* full abort including support for transactions
* full abort without transactions being used (most background processes)
* abort without shared memory interactions

> But for this patch, I think we should aim to mimic
> the existing style, as ugly as it is.

Agreed.

> Background Reclaimer's Processing
> ---------------------------------
> 
> The background reclaimer runs the clock sweep to identify buffers that
> are good candidates for eviction and puts them on the freelist.  This
> makes buffer allocation much faster, since removing a buffer from the
> head of a linked list is much cheaper than linearly scanning the whole
> buffer pool until a promising candidate is found.  It's possible that
> a buffer we add to the freelist may be accessed or even pinned before
> it's evicted; if that happens, the backend that would have evicted it
> will simply disregard it and take the next buffer instead (or run the
> clock sweep itself, if necessary).  However, to make sure that doesn't
> happen too often, we need to keep the freelist as short as possible,
> so that there won't be many other buffer accesses between when the
> time a buffer is added to the freelist and the time when it's actually
> evicted.
> 
> We use two water marks to control the activity of the bgreclaimer
> process.  Each time bgreclaimer is awoken, it will move buffers to the
> freelist until the length of the free list reaches the high water
> mark.  It will then sleep.

I wonder if we should recheck the number of freelist items before
sleeping. As the latch currently is reset before sleeping (IIRC) we
might miss being woken up soon. It very well might be that bgreclaim
needs to run for more than one cycle in a row to keep up...

> > This section should have a description of how the reclaimer interacts
> > with the bgwriter logic. Do we put dirty buffers on the freelist that
> > are then cleaned by the bgwriter? Which buffers does the bgwriter write
> > out?
> 
> The bgwriter is cleaning ahead of the strategy point, and the
> bgreclaimer is advancing the strategy point.

That sentence, in some form, should be in the above paragraph.

> I think we should
> consider having the bgreclaimer wake the bgwriter if it comes across a
> dirty buffer, because while the bgwriter only estimates the rate of
> buffer allocation, bgreclaimer *knows* the rate of allocation, because
> its own activity is tied to the allocation rate.  I think there's the
> potential for this kind of thing to make the background writer
> significantly more effective than it is today, but I'm heavily in
> favor of leaving it for a separate patch.

Yes, doing that sounds like a good plan. I'm happy with that being done
in a separate patch.

> > I wonder if we don't want to increase the high watermark when
> > tmp_recent_backend_clocksweep > 0?
> 
> That doesn't really work unless there's some countervailing force to
> eventually reduce it again; otherwise, it'd just converge to infinity.
> And it doesn't really seem necessary at the moment.

Right, it obviously needs to go both ways. I'm a bit sceptic about
untunable, fixed, numbers proving to be accurate for widely varied
workloads.

> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> > locking here? The atomic op will transfer the cacheline exclusively to
> > the reclaimer's CPU. Even though it very shortly afterwards will be
> > touched afterwards by the pinning backend.
> 
> Meh.  I'm not in favor of adding more funny games with locking unless
> we can prove they're necessary for performance.

Well, this in theory increases the number of processes touching buffer
headers regularly. Currently, if you have one read IO intensive backend,
there's pretty much only process touching the cachelines. This will make
it two. I don't think it's unreasonable to try to reduce the cacheline
pingpong caused by that...

> > * Are we sure that the freelist_lck spinlock won't cause pain? Right now
> >   there will possibly be dozens of processes busily spinning on it... I
> >   think it's a acceptable risk, but we should think about it.
> 
> As you and I have talked about before, we could reduce contention here
> by partitioning the freelist, or by using a CAS loop to pop items off
> of it.  But I am not convinced either is necessary; I think it's hard
> for the system to accumulate enough people hitting the freelist
> simultaneously to matter, because the stuff they've got to do between
> one freelist access and the next is generally going to be something
> much more expensive, like reading 8kB from the OS.
> 
> One question in my mind is whether we ought to separate this patch
> into two - one for the changes to the locking regime, and another for
> the addition of the bgreclaimer process.  Those things are really two
> different features, although they are tightly enough coupled that
> maybe it's OK to keep them together.

I think it's ok to commit them together. Hm, although: It'd actually be
highly interesting to see the effect of replacing the freelist lock with
a spinlock without the rest of these changes. I think that's really a
number we want to see at least once.

> I also think it would be good to
> get some statistics on how often regular backends are running the
> clocksweep vs. how often bgreclaimer is satisfying their needs.

I think that's necessary. The patch added buf_backend_clocksweep. Maybe
we just also need buf_backend_from_freelist?

Greetings,

Andres Freund

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

11 September 2014, 13:22:44

On Thu, Sep 11, 2014 at 6:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Thanks for reviewing, Andres.
>
> On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> >> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> >> +static void ReqShutdownHandler(SIGNAL_ARGS);
> >> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> It's exactly the same as what bgwriter.c does.
>
> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> We really need a more centralized way to handle error cleanup in
> auxiliary processes. The current state of affairs is really pretty
> helter-skelter. But for this patch, I think we should aim to mimic
> the existing style, as ugly as it is. I'm not sure whether Amit's got
> the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,

> is probably a good idea.

Code related to bgreclaimer logic itself doesn't take any LWLock, do

you suspect the same might be required due to some Signal/Interrupt

handling?

From myside, I have thought about what to keep for error cleanup based

on the working of bgreclaimer. However there is a chance that I have

missed something.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

11 September 2014, 13:30:16

> > We really need a more centralized way to handle error cleanup in
> > auxiliary processes.  The current state of affairs is really pretty
> > helter-skelter.  But for this patch, I think we should aim to mimic
> > the existing style, as ugly as it is.  I'm not sure whether Amit's got
> > the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
> > is probably a good idea.
> 
> Code related to bgreclaimer logic itself doesn't take any LWLock, do
> you suspect the same might be required due to some Signal/Interrupt
> handling?

I suspect it might creep in at some point at some unrelated place. Which
will only ever break in production scenarios. Say, a lwlock in in config
file processing. I seem to recall somebody seing a version of a patching
adding a lwlock there... :). Or a logging hook. Or ...

The savings from not doing LWLockReleaseAll() are nonexistant, so ...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

11 September 2014, 13:48:08

On Thu, Sep 11, 2014 at 6:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > > We really need a more centralized way to handle error cleanup in
> > > auxiliary processes. The current state of affairs is really pretty
> > > helter-skelter. But for this patch, I think we should aim to mimic
> > > the existing style, as ugly as it is. I'm not sure whether Amit's got
> > > the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
> > > is probably a good idea.
> >
> > Code related to bgreclaimer logic itself doesn't take any LWLock, do
> > you suspect the same might be required due to some Signal/Interrupt
> > handling?
>
> I suspect it might creep in at some point at some unrelated place. Which
> will only ever break in production scenarios. Say, a lwlock in in config
> file processing.

Yeah, I suspected the same and checked that path, but couldn't find but

may be in some path it is there as the code has many flows.

> I seem to recall somebody seing a version of a patching
> adding a lwlock there... :). Or a logging hook. Or ...
>
> The savings from not doing LWLockReleaseAll() are nonexistant, so ...

Okay, I shall add it in next version of patch and mention in comments

the reasons quoted by you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

11 September 2014, 13:48:23

On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> It's exactly the same as what bgwriter.c does.
>
> So what? There's no code in common, so I see no reason to have one
> signal handler using underscores and the next one camelcase names.

/me shrugs.

It's not always possible to have things be consistent with each other
within a file and also with what gets done in other files.  I'm not
sure we should fault patch authors for choosing a different one than
we would have chosen.  FWIW, I probably would have done it the way
Amit did it.  I don't actually care, though.

> I wonder if we should recheck the number of freelist items before
> sleeping. As the latch currently is reset before sleeping (IIRC) we
> might miss being woken up soon. It very well might be that bgreclaim
> needs to run for more than one cycle in a row to keep up...

The outer loop in BgMoveBuffersToFreelist() was added to address
precisely this point, which I raised in a previous review.

>> > I wonder if we don't want to increase the high watermark when
>> > tmp_recent_backend_clocksweep > 0?
>>
>> That doesn't really work unless there's some countervailing force to
>> eventually reduce it again; otherwise, it'd just converge to infinity.
>> And it doesn't really seem necessary at the moment.
>
> Right, it obviously needs to go both ways. I'm a bit sceptic about
> untunable, fixed, numbers proving to be accurate for widely varied
> workloads.

Me, too, but I'm *even more* skeptical about making things complicated
on the pure theory that a simple solution can't be correct.  I'm not
blind to the possibility that the current logic is inadequate, but
testing proves that it works well enough to produce a massive
performance boost over where we are now.  When, and if, we develop a
theory about specifically how it falls short then, sure, let's adjust
it.  But I think it would be a serious error to try to engineer a
perfect algorithm here based on the amount of testing that we can
reasonably do pre-commit.  We have no chance of getting that right,
and I'd rather have an algorithm that is simple and imperfect than an
algorithm that is complex and still imperfect.  No matter what, it's
better than what we have now.

>> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
>> > locking here? The atomic op will transfer the cacheline exclusively to
>> > the reclaimer's CPU. Even though it very shortly afterwards will be
>> > touched afterwards by the pinning backend.
>>
>> Meh.  I'm not in favor of adding more funny games with locking unless
>> we can prove they're necessary for performance.
>
> Well, this in theory increases the number of processes touching buffer
> headers regularly. Currently, if you have one read IO intensive backend,
> there's pretty much only process touching the cachelines. This will make
> it two. I don't think it's unreasonable to try to reduce the cacheline
> pingpong caused by that...

It's not unreasonable, but this is a good place to apply Knuth's first
law of optimization.  There's no proof we need to do this, so let's
not until there is.

>> I also think it would be good to
>> get some statistics on how often regular backends are running the
>> clocksweep vs. how often bgreclaimer is satisfying their needs.
>
> I think that's necessary. The patch added buf_backend_clocksweep. Maybe
> we just also need buf_backend_from_freelist?

That's just (or should be just) buf_alloc - buf_backend_clocksweep.

I think buf_backend_clocksweep should really be called
buf_alloc_clocksweep, and should be added (in all relevant places)
right next to buf_alloc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

11 September 2014, 14:04:08

On 2014-09-11 09:48:10 -0400, Robert Haas wrote:
> On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I wonder if we should recheck the number of freelist items before
> > sleeping. As the latch currently is reset before sleeping (IIRC) we
> > might miss being woken up soon. It very well might be that bgreclaim
> > needs to run for more than one cycle in a row to keep up...
> 
> The outer loop in BgMoveBuffersToFreelist() was added to address
> precisely this point, which I raised in a previous review.

Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
up being continously busy without being visible.

> >> > I wonder if we don't want to increase the high watermark when
> >> > tmp_recent_backend_clocksweep > 0?
> >>
> >> That doesn't really work unless there's some countervailing force to
> >> eventually reduce it again; otherwise, it'd just converge to infinity.
> >> And it doesn't really seem necessary at the moment.
> >
> > Right, it obviously needs to go both ways. I'm a bit sceptic about
> > untunable, fixed, numbers proving to be accurate for widely varied
> > workloads.
> 
> Me, too, but I'm *even more* skeptical about making things complicated
> on the pure theory that a simple solution can't be correct.

Fair enough.

> I'm not
> blind to the possibility that the current logic is inadequate, but
> testing proves that it works well enough to produce a massive
> performance boost over where we are now.

But, to be honest, the testing so far was pretty "narrow" in the kind of
workloads that were run if I crossread things accurately. Don't get me
wrong, I'm *really* happy about having this patch, that just doesn't
mean every detail is right ;)

> >> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> >> > locking here? The atomic op will transfer the cacheline exclusively to
> >> > the reclaimer's CPU. Even though it very shortly afterwards will be
> >> > touched afterwards by the pinning backend.
> >>
> >> Meh.  I'm not in favor of adding more funny games with locking unless
> >> we can prove they're necessary for performance.
> >
> > Well, this in theory increases the number of processes touching buffer
> > headers regularly. Currently, if you have one read IO intensive backend,
> > there's pretty much only process touching the cachelines. This will make
> > it two. I don't think it's unreasonable to try to reduce the cacheline
> > pingpong caused by that...
> 
> It's not unreasonable, but this is a good place to apply Knuth's first
> law of optimization.  There's no proof we need to do this, so let's
> not until there is.

That's true for new (pieces of) software; less so, when working with a
installed base that you might regress... But whatever.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

11 September 2014, 14:16:07

On Thu, Sep 11, 2014 at 10:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-11 09:48:10 -0400, Robert Haas wrote:
>> On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > I wonder if we should recheck the number of freelist items before
>> > sleeping. As the latch currently is reset before sleeping (IIRC) we
>> > might miss being woken up soon. It very well might be that bgreclaim
>> > needs to run for more than one cycle in a row to keep up...
>>
>> The outer loop in BgMoveBuffersToFreelist() was added to address
>> precisely this point, which I raised in a previous review.
>
> Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
> ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
> up being continously busy without being visible.

Good idea.

>> I'm not
>> blind to the possibility that the current logic is inadequate, but
>> testing proves that it works well enough to produce a massive
>> performance boost over where we are now.
>
> But, to be honest, the testing so far was pretty "narrow" in the kind of
> workloads that were run if I crossread things accurately. Don't get me
> wrong, I'm *really* happy about having this patch, that just doesn't
> mean every detail is right ;)

Oh, sure.  Totally agreed.  And, to the extent that we're improving
things based on actual testing, I'm A-OK with that.  I just don't want
to start speculating, or we'll never get this thing off the ground.

Some possibly-interesting test cases would be:

(1) A read-only pgbench workload that is just a tiny bit larger than
shared_buffers, say size of shared_buffers plus 0.01%.  Such workloads
tend to stress buffer eviction heavily.

(2) A workload that maximizes the rate of concurrent buffer eviction
relative to other tasks.  Read-only pgbench is not bad for this, but
maybe somebody's got a better idea.

As I sort of mentioned in what I was writing for the bufmgr README,
there are, more or less, three ways this can fall down, at least that
I can see: (1) if the high water mark is too high, then we'll start
finding buffers in the freelist that have already been touched since
we added them: (2) if the low water mark is too low, the freelist will
run dry; and (3) if the low and high water marks are too close
together, the bgreclaimer will be constantly getting woken up and
going to sleep again.  I can't personally think of a workload that
will enable us to get a better handle on those cases than
high-concurrency pgbench, but you're known to be ingenious at coming
up with destruction workloads, so if you have an idea, by all means
fire away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

12 September 2014, 06:25:13

On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > +++ b/src/backend/postmaster/bgreclaimer.c
>
> A fair number of comments in that file refer to bgwriter...

Will fix.

> > @@ -0,0 +1,302 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * bgreclaimer.c
> > + *
> > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It
> > + * attempts to keep regular backends from having to run clock sweep (which
> > + * they would only do when they don't find a usable shared buffer from
> > + * freelist to read in another page).
>
> That's not really accurate. Freelist pages are often also needed to
> write new pages, without reading anything in.

Agreed, but the same is used in bgwriter file as well; so if we

change here, we might want to change bgwriter file header as well.

> I'd phrase it as "which
> they only need to do if they don't find a victim buffer from the
> freelist"

victim buffer sounds more like a buffer which it will get from

clock sweep, how about next candidate (same is used in function

header of StrategyGetBuffer()).

> > In the best scenario all requests
> > + * for shared buffers will be fulfilled from freelist as the background
> > + * reclaimer process always tries to maintain buffers on freelist. However,
> > + * regular backends are still empowered to run clock sweep to find a usable
> > + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
>
> "empowered" sounds strange to me. 'still can run the clock sweep'?

No harm in changing like what you are suggesting, however the same is

used in file header of bgwriter.c as well, so I think lets keep this usage as

it is because there is no correctness issue here.

> > + * The bgwriter is started by the postmaster as soon as the startup subprocess
> > + * finishes, or as soon as recovery begins if we are doing archive recovery.
>
> Why only archive recovery? I guess (only read this far...) it's not just
> during InArchiveRecoveryb recovery but also StandbyMode?

It will be for both.

> But I don't see
> why we shouldn't use it during normal crash recovery. That's also often
> painfully slow and the reclaimer could help? Less, but still.

Yes, it might improve a bit, however the main benefit with this patch is

under heavy load which means that it mainly addresses contention and

in case of crash recovery there will not be any contention because there

will be no backend processes.

Also I think to enable bgreclaimer during crash recovery, we need to have

one new signal which is not a problem, but it will make more sense to enable

it later based on benefit if it is there.

>
> > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
>
> This looks inconsistent.

I have kept based on bgwriter, so not sure if it's good to change.

However I we want consistent in naming, I would like to keep

something like:

ReclaimShutdownHandler

ReclaimQuickDieHandler

..

> > + /*
> > + * If an exception is encountered, processing resumes here.
> > + *
> > + * See notes in postgres.c about the design of this coding.
> > + */
> > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > + {
> > + /* Since not using PG_TRY, must reset error stack by hand */
> > + error_context_stack = NULL;
..
>
> No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> good idea, regardless of it possibly being true today (which I'm not
> sure about yet).

I will add LWLockReleaseAll() in exception handling as discussed

elsewhere in thread.

> > +
> > + /* Now we can allow interrupts again */
> > + RESUME_INTERRUPTS();
>
> Other processes sleep for a second here, I think that's a good
> idea. E.g. that bit:

Agreed, will make change as per suggestion.

>
> > + /*
> > + * Loop forever
> > + */
> > + for (;;)
> > + {
> > + int rc;
> > +
> > +
> > + /*
> > + * Backend will signal bgreclaimer when the number of buffers in
> > + * freelist falls below than low water mark of freelist.
> > + */
> > + rc = WaitLatch(&MyProc->procLatch,
> > + WL_LATCH_SET | WL_POSTMASTER_DEATH,
> > + -1);
>
> That's probably not going to work well directly after a (re)start of
> bgreclaim (depending on how you handle the water mark, I'll see in a
> bit).

Could you please be more specific here?

>
> > +Background Reclaimer's Processing
> > +---------------------------------
..
> > +Two water mark indicators are used to maintain sufficient number of buffers
> > +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
> > +when the number of buffers in freelist falls below it. High water mark
> > +indicator is used by bgreclaimer to move buffers to freelist.
>
> For me the description of the high water as stated here doesn't seem to
> explain anything.
>
> This section should have a description of how the reclaimer interacts
> with the bgwriter logic. Do we put dirty buffers on the freelist that
> are then cleaned by the bgwriter? Which buffers does the bgwriter write
> out?

As discussed in thread, I will change this accordingly.

> > /*
> > + * Move buffers with reference and usage_count as zero to freelist.
> > + * By maintaining enough number of buffers on freelist (equal to
> > + * high water mark of freelist), we drastically reduce the odds for
> > + * backend's to perform clock sweep.
>
> Move buffers with reference and a usage_count *of* zero to freelist. By
> maintaining enough buffers in the freelist (up to the list's high water
> mark), we drastically reduce the likelihood of individual backends
> having to perform the clock sweep themselves.

Okay will rephrase it as per your suggestion.

> > + * This is called by the background reclaim process when the number
> > + * of buffers in freelist falls below low water mark of freelist.
> > + */
>
> The logic used here *definitely* needs to be documented in another form
> somewhere in the source.

I think the way Robert has suggested to modify Readme adresses

this to an extent, however if you think it is better to go in more

detail, then I can expand function header of function

BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?

> > +
> > + if (tmp_num_to_free == 0)
> > + break;
>
> num_to_free isn't a convincing name if I understand what this is doing
> correctly. Maybe 'move_to_freelist', 'freelist_needed',
> 'needed_on_freelist' or something like that?

I think keeping num or count in name could be more helpful as it is used

to loop for finding usable buffers. How about 'num_needed_on_freelist',

without any count or num it sounds more like a boolean variable.

>
> I wonder if we don't want to increase the high watermark when
> tmp_recent_backend_clocksweep > 0?
>
> > + while (tmp_num_to_free > 0)
> > + {
..
> > + /*
> > + * If the buffer is pinned or has a nonzero usage_count, we cannot
> > + * move it to freelist; decrement the usage_count (unless pinned)
> > + * and keep scanning.
> > + */
> > + LockBufHdr(bufHdr);
>
> Hm. Perhaps we should do a bufHdr->refcount != zero check without
> locking here? The atomic op will transfer the cacheline exclusively to
> the reclaimer's CPU. Even though it very shortly afterwards will be
> touched afterwards by the pinning backend.

As per discussion, the conclusion seems to be that we can do some

more tests to see if we need such a change, I will do more tests on

the lines suggested by Robert in below mails and then we can decide

if any thing is required.

> > +/*
> > + * Water mark indicators for maintaining buffers on freelist. When the
> > + * number of buffers on freelist drops below the low water mark, the
> > + * allocating backend sets the latch and bgreclaimer wakesup and begin
> > + * adding buffer's to freelist until it reaches high water mark and then
> > + * again goes back to sleep.
> > + */
>
> s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
> /to freelist/to the freelist/; s/reaches high water/reaches the high water/

Will change as per suggestions.

> > +int freelistLowWaterMark;
> > +int freelistHighWaterMark;
> > +
> > +/* Percentage indicators for maintaining buffers on freelist */
> > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> > +#define MIN_HIGH_WATER_MARK 5
> > +#define MAX_HIGH_WATER_MARK 2000
>
> I'm confused. The high water mark percentage is smaller than the low
> water mark?

High water mark is percentage of NBuffers (shared_buffers).

Low water mark is percentage of High water mark.

I will add a comment.

> What's the reasoning for these numbers?

Based on experiments with different amount of shared buffers

> What's the justification for the
> max of 2k buffers for the high watermark? That's not much on a busy
> database with large s_b?

I have done tests at various loads upto 15GB of shared_buffers

and having data size as 44GB, 2000 turns out to be good enough

number of buffers on freelist(based on performance data).

Another thing I have mentioned in begining was to have guc for

these thresholds, but I think it will be much more difficult for

user to decide these values.

I think for now it is okay to fix these numbers as in patch

and we can anyway go back to add guc's or change them based

on more tests if anything turns out that needs further tuning.

> > - *lock_held = true;
> > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> > + SpinLockAcquire(&StrategyControl->freelist_lck);
> >
..
> > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> > bgwriterLatch = StrategyControl->bgwriterLatch;
>
> I don't understand why these need to be grabbed under the spinlock?

In earlier version of patch, it was done without spinklock,

however Robert has given comment that as it was previously done

with BufFreelist lock, these should be under spinlock (atleast

that is what I understood) and I tested the patch again by having

them under spinlock and didn't find any difference, so I have moved

them under spinlock.

..
> > + if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch)
> > + SetLatch(StrategyControl->bgreclaimerLatch);
> > +
..
> > }
> > - UnlockBufHdr(buf);
> > }
>
> I think it makes sense to break out this bit into its own
> function. That'll make StrategyGetBuffer() a good bit easier to read.

I will try to move it into a new function GetBufferFromFreelist().

> > /* Nothing on the freelist, so run the "clock sweep" algorithm */
> > trycounter = NBuffers;
> > +
...
> > + buf = &BufferDescriptors[next_victim];
>
> I'd also move this into its own function, but thats's more debatable.

As we have not touched much of this part of code, so lets refactor

this code separately if required.

> > +/*
> > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> > + */
> > +bool
> > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> > +{
>
> Should maybe rather be named *Tail?

Tail is better suited than End, I will change it.

> > + bool freed = false;
> > + SpinLockAcquire(&StrategyControl->freelist_lck);
> > +
> > + /*
> > + * It is possible that we are told to put something in the freelist that
> > + * is already in it; don't screw up the list if so.
> > + */
>
> When/Why is that possible?

As we are doing clocksweep which can come across the same buffer

again incase it didn't find sufficient buffers to put on freelist in one

cycle, so this function needs to ensure the same.

> > +void
> > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
> > + uint32 *num_buf_backend_clocksweep)
> > +
> > + if (num_buf_alloc)
> > + {
> > + *num_buf_alloc = StrategyControl->numBufferAllocs;
> > + StrategyControl->numBufferAllocs = 0;
> > + }
> > + if (num_buf_backend_clocksweep)
> > + {
> > + *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
> > + StrategyControl->numBufferBackendClocksweep = 0;
> > + }
> > + SpinLockRelease(&StrategyControl->freelist_lck);
> > +
> > + return;
> > +}
>
> Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?

This API is more or less designed on lines of StrategySyncStart().

Currently there is no case where we can't do it unconditional (same is

true for num_buf_backend_clocksweep), however there is some merit

to keep it in sync with existing API. Having said that if you feel we should

go about doing it unconditionally, I will change it in next patch.

>
> Some more general remarks:
> * I think there's a fair amount of unexplained heuristics in here

I think after addressing the comments given by you above,

the situation will be better.

> * Are we sure that the freelist_lck spinlock won't cause pain? Right now
> there will possibly be dozens of processes busily spinning on it... I
> think it's a acceptable risk, but we should think about it.

The situation seems to be better by having spinlock, rather than

by using LWLock, however if there is anything that causes pain,

I think we might want to consider partitioning the free list, but lets

keep that for other day.

> * I'm not convinced that these changes can be made without also changing
> the bgwriter logic. Have you measured whether there are differences in
> how effective the bgwriter is? Not that it's very effective right now :)

Current changes doesn't do anything to make bgwriter more or

less effective, however as discussed elsewhere in thread that it

makes sense for bgreclaimer to notify bgwriter in some situtions,

that can make bgwriter more effective than now, however as concluded

there lets do it in separate patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Ants Aasma

Date:

12 September 2014, 09:39:07

On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
>> > locking here? The atomic op will transfer the cacheline exclusively to
>> > the reclaimer's CPU. Even though it very shortly afterwards will be
>> > touched afterwards by the pinning backend.
>>
>> Meh.  I'm not in favor of adding more funny games with locking unless
>> we can prove they're necessary for performance.
>
> Well, this in theory increases the number of processes touching buffer
> headers regularly. Currently, if you have one read IO intensive backend,
> there's pretty much only process touching the cachelines. This will make
> it two. I don't think it's unreasonable to try to reduce the cacheline
> pingpong caused by that...

I don't think it will help much. A pinned buffer is pretty likely to
be in modified state in the cache of the cpu of the pinning backend.
Even taking a look at the refcount will trigger a writeback and
demotion to shared state. When time comes to unpin the buffer the
cacheline must again be promoted to exclusive state introducing
coherency traffic. Not locking the buffer only saves transfering the
cacheline back to the pinning backend, not a huge amount of savings.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

12 September 2014, 09:52:04

On 2014-09-12 12:38:48 +0300, Ants Aasma wrote:
> On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> >> > locking here? The atomic op will transfer the cacheline exclusively to
> >> > the reclaimer's CPU. Even though it very shortly afterwards will be
> >> > touched afterwards by the pinning backend.
> >>
> >> Meh.  I'm not in favor of adding more funny games with locking unless
> >> we can prove they're necessary for performance.
> >
> > Well, this in theory increases the number of processes touching buffer
> > headers regularly. Currently, if you have one read IO intensive backend,
> > there's pretty much only process touching the cachelines. This will make
> > it two. I don't think it's unreasonable to try to reduce the cacheline
> > pingpong caused by that...
> 
> I don't think it will help much. A pinned buffer is pretty likely to
> be in modified state in the cache of the cpu of the pinning backend.

Right. Unless you're on a MOESI platforms. I'd really like to know why
that's not more widely used.

> Even taking a look at the refcount will trigger a writeback and
> demotion to shared state.  When time comes to unpin the buffer the
> cacheline must again be promoted to exclusive state introducing
> coherency traffic. Not locking the buffer only saves transfering the
> cacheline back to the pinning backend, not a huge amount of savings.

Yes. But: In many, if not most, cases the cacheline will be read a
couple times before modifying it via the spinlock.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Gregory Smith

Date:

12 September 2014, 17:39:29

On 9/11/14, 7:01 AM, Andres Freund wrote:
> I'm not convinced that these changes can be made without also changing 
> the bgwriter logic. Have you measured whether there are differences in 
> how effective the bgwriter is? Not that it's very effective right now :)

The current background writer tuning went out of its way to do nothing 
when it wasn't clear there was something that always worked.  What 
happened with all of the really clever schemes was that they worked on 
some workloads, and just trashed others.  Most of the gain from the 8.3 
rewrite came from looking at well theorized ideas for how to handle 
things like pre-emptive LRU scanning for writes, and just throwing them 
out altogether in favor of ignoring the problem.  The magic numbers left 
in or added to the code  were tuned to do very little work, 
intentionally.  If anything, since then the pressure to do nothing has 
gone up in the default install, because now people are very concerned 
about extra wakeups using power.

To find bad cases before, I was running about 30 different test 
combinations by the end, Heikki was running another set in the EDB lab, 
I believe there was a lab at NTT running their own set too. What went in 
was the combination that didn't break any of them badly--not the one 
that performed best on the good workloads.

This looks like it's squashed one of the very fundamental buffer scaling 
issues though; well done Amit.  What I'd like to see is preserving the 
heart of that while touching as little as possible. When in doubt, do 
nothing; let the backends suck it up and do the work themselves.

I had to take a health break from community development for a while, and 
I'm hoping to jump back into review again for the rest of the current 
development cycle.  I'll go back to my notes and try to recreate the 
pathological cases that plagued both the 8.3 BGW rewrite and the aborted 
9.2 fsync spreading effort I did; get those running again and see how 
they do on this new approach.  I have a decent sized 24 core server that 
should be good enough for this job.  I'll see what I can do.

-- 
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

14 September 2014, 06:53:22

On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > > +++ b/src/backend/postmaster/bgreclaimer.c
> >
> > A fair number of comments in that file refer to bgwriter...
>
> Will fix.

Fixed.

> > > @@ -0,0 +1,302 @@
> > > +/*-------------------------------------------------------------------------
> > > + *
> > > + * bgreclaimer.c
> > > + *
> > > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It
> > > + * attempts to keep regular backends from having to run clock sweep (which
> > > + * they would only do when they don't find a usable shared buffer from
> > > + * freelist to read in another page).
> >
> > That's not really accurate. Freelist pages are often also needed to
> > write new pages, without reading anything in.
>
> Agreed, but the same is used in bgwriter file as well; so if we
> change here, we might want to change bgwriter file header as well.

I have just changed bgreclaimer for this comment, if the same

is required for bgwriter, I can create a separate patch as that

change is not related to this patch, so I thought it is better

to keep it separate.

> > I'd phrase it as "which
> > they only need to do if they don't find a victim buffer from the
> > freelist"
>
> victim buffer sounds more like a buffer which it will get from
> clock sweep, how about next candidate (same is used in function
> header of StrategyGetBuffer()).

Fixed.

> > > In the best scenario all requests
> > > + * for shared buffers will be fulfilled from freelist as the background
> > > + * reclaimer process always tries to maintain buffers on freelist. However,
> > > + * regular backends are still empowered to run clock sweep to find a usable
> > > + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
> >
> > "empowered" sounds strange to me. 'still can run the clock sweep'?
>
> No harm in changing like what you are suggesting, however the same is
> used in file header of bgwriter.c as well, so I think lets keep this usage as
> it is because there is no correctness issue here.

Not changed anything for this comment.

> >
> > > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> I have kept based on bgwriter, so not sure if it's good to change.
> However I we want consistent in naming, I would like to keep
> something like:
>
> ReclaimShutdownHandler
> ReclaimQuickDieHandler
> ..
> ..

Changed function names to make them consistent.

> > > + /*
> > > + * If an exception is encountered, processing resumes here.
> > > + *
> > > + * See notes in postgres.c about the design of this coding.
> > > + */
> > > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > > + {
> > > + /* Since not using PG_TRY, must reset error stack by hand */
> > > + error_context_stack = NULL;
> ..
> >
> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> I will add LWLockReleaseAll() in exception handling as discussed
> elsewhere in thread.

Done.

>
> > > +
> > > + /* Now we can allow interrupts again */
> > > + RESUME_INTERRUPTS();
> >
> > Other processes sleep for a second here, I think that's a good
> > idea. E.g. that bit:
>
> Agreed, will make change as per suggestion.

Done.

> >
> > > + /*
> > > + * Loop forever
> > > + */
> > > + for (;;)
> > > + {
> > > + int rc;
> > > +
> > > +
> > > + /*
> > > + * Backend will signal bgreclaimer when the number of buffers in
> > > + * freelist falls below than low water mark of freelist.
> > > + */
> > > + rc = WaitLatch(&MyProc->procLatch,
> > > + WL_LATCH_SET | WL_POSTMASTER_DEATH,
> > > + -1);
> >
> > That's probably not going to work well directly after a (re)start of
> > bgreclaim (depending on how you handle the water mark, I'll see in a
> > bit).
>
> Could you please be more specific here?

I wasn't sure if any change is required here, so kept the code

as it is.

> >
> > > +Background Reclaimer's Processing
> > > +---------------------------------
> ..
> > > +Two water mark indicators are used to maintain sufficient number of buffers
> > > +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
> > > +when the number of buffers in freelist falls below it. High water mark
> > > +indicator is used by bgreclaimer to move buffers to freelist.
> >
> > For me the description of the high water as stated here doesn't seem to
> > explain anything.
> >
> > This section should have a description of how the reclaimer interacts
> > with the bgwriter logic. Do we put dirty buffers on the freelist that
> > are then cleaned by the bgwriter? Which buffers does the bgwriter write
> > out?
>
> As discussed in thread, I will change this accordingly.

Done

> > Move buffers with reference and a usage_count *of* zero to freelist. By
> > maintaining enough buffers in the freelist (up to the list's high water
> > mark), we drastically reduce the likelihood of individual backends
> > having to perform the clock sweep themselves.
>
> Okay will rephrase it as per your suggestion.

Done

> > > + * This is called by the background reclaim process when the number
> > > + * of buffers in freelist falls below low water mark of freelist.
> > > + */
> >
> > The logic used here *definitely* needs to be documented in another form
> > somewhere in the source.
>
> I think the way Robert has suggested to modify Readme adresses
> this to an extent, however if you think it is better to go in more
> detail, then I can expand function header of function
> BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?
>
> > > +
> > > + if (tmp_num_to_free == 0)
> > > + break;
> >
> > num_to_free isn't a convincing name if I understand what this is doing
> > correctly. Maybe 'move_to_freelist', 'freelist_needed',
> > 'needed_on_freelist' or something like that?
>
> I think keeping num or count in name could be more helpful as it is used
> to loop for finding usable buffers. How about 'num_needed_on_freelist',
> without any count or num it sounds more like a boolean variable.

Changed.

> >
> > I wonder if we don't want to increase the high watermark when
> > tmp_recent_backend_clocksweep > 0?
> >
> > > + while (tmp_num_to_free > 0)
> > > + {
> ..
> > > + /*
> > > + * If the buffer is pinned or has a nonzero usage_count, we cannot
> > > + * move it to freelist; decrement the usage_count (unless pinned)
> > > + * and keep scanning.
> > > + */
> > > + LockBufHdr(bufHdr);
> >
> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> > locking here? The atomic op will transfer the cacheline exclusively to
> > the reclaimer's CPU. Even though it very shortly afterwards will be
> > touched afterwards by the pinning backend.
>
> As per discussion, the conclusion seems to be that we can do some
> more tests to see if we need such a change, I will do more tests on
> the lines suggested by Robert in below mails and then we can decide
> if any thing is required.

Still performance data related to this needs to be collected.

> > > +/*
> > > + * Water mark indicators for maintaining buffers on freelist. When the
> > > + * number of buffers on freelist drops below the low water mark, the
> > > + * allocating backend sets the latch and bgreclaimer wakesup and begin
> > > + * adding buffer's to freelist until it reaches high water mark and then
> > > + * again goes back to sleep.
> > > + */
> >
> > s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
> > /to freelist/to the freelist/; s/reaches high water/reaches the high water/
>
> Will change as per suggestions.

Done

> > > +int freelistLowWaterMark;
> > > +int freelistHighWaterMark;
> > > +
> > > +/* Percentage indicators for maintaining buffers on freelist */
> > > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> > > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> > > +#define MIN_HIGH_WATER_MARK 5
> > > +#define MAX_HIGH_WATER_MARK 2000
> >
> > I'm confused. The high water mark percentage is smaller than the low
> > water mark?
>
> High water mark is percentage of NBuffers (shared_buffers).
> Low water mark is percentage of High water mark.
>
> I will add a comment.

Added a comment.

>
>
> > > - *lock_held = true;
> > > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> > > + SpinLockAcquire(&StrategyControl->freelist_lck);
> > >
> ..
> > > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> > > bgwriterLatch = StrategyControl->bgwriterLatch;
> >
> > I don't understand why these need to be grabbed under the spinlock?
>
> In earlier version of patch, it was done without spinklock,
> however Robert has given comment that as it was previously done
> with BufFreelist lock, these should be under spinlock (atleast
> that is what I understood) and I tested the patch again by having
> them under spinlock and didn't find any difference, so I have moved
> them under spinlock.

Not changed anything related to this.

> ..
> > > + if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch)
> > > + SetLatch(StrategyControl->bgreclaimerLatch);
> > > +
> ..
> > > }
> > > - UnlockBufHdr(buf);
> > > }
> >
> > I think it makes sense to break out this bit into its own
> > function. That'll make StrategyGetBuffer() a good bit easier to read.
>
> I will try to move it into a new function GetBufferFromFreelist().

Moved this part of code into new function.

>
> > > +/*
> > > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> > > + */
> > > +bool
> > > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> > > +{
> >
> > Should maybe rather be named *Tail?
>
> Tail is better suited than End, I will change it.

Changed.

> > > +void
> > > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
> > > + uint32 *num_buf_backend_clocksweep)
> > > +
> > > + if (num_buf_alloc)
> > > + {
> > > + *num_buf_alloc = StrategyControl->numBufferAllocs;
> > > + StrategyControl->numBufferAllocs = 0;
> > > + }
> > > + if (num_buf_backend_clocksweep)
> > > + {
> > > + *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
> > > + StrategyControl->numBufferBackendClocksweep = 0;
> > > + }
> > > + SpinLockRelease(&StrategyControl->freelist_lck);
> > > +
> > > + return;
> > > +}
> >
> > Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?
>
> This API is more or less designed on lines of StrategySyncStart().
> Currently there is no case where we can't do it unconditional (same is
> true for num_buf_backend_clocksweep), however there is some merit
> to keep it in sync with existing API. Having said that if you feel we should
> go about doing it unconditionally, I will change it in next patch.

I have made it unconditional.

> Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
> ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
> up being continously busy without being visible.

Done and I have removed pgstat_send_bgwriter() call from

bgreclaimer loop, as after this change calling it there becomes

redundant.

Apart from this, I have changed kid of newly added function as

due to recent commit, the oid I was using is no longer available.

I have taken tpc-b data as well, it is with previous version of patch,

however the recent version hasn't changed much to impact performance

data. It takes long time to get tpc-b data, thats why I could not report

it previously.

I will post the data with the latest patch separately (where I will focus

on new cases discussed between Robert and Andres).

TPC-B Performance Data

----------------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
wal_buffers = 256MB
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 30mins

Client_count/Patch_ver	32	64	128
HEAD	420	535	556
Patch (sbe_v8)	431	537	568

About performance data

-------------------------------------

1. This data is a median of 3 runs, individual run data can

be found in attached document (perf_read_scalability_data_v8.ods)

2. There is not much difference in performance between Head and patch

which shows that this patch hasn't regressed tpc-b case.

Steps to take tpc-b data, for each run:

1. start server

2. drop db

3. create db

4. initialize db

5. run tpc-b pgbench (used prepared statement)

6. checkpoint

7. stop server

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

14 September 2014, 07:00:39

On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith <gregsmithpgsql@gmail.com> wrote:

> This looks like it's squashed one of the very fundamental buffer

> scaling issues though; well done Amit.

Thanks.

> I'll go back to my notes and try to recreate the pathological cases

> that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync

> spreading effort I did; get those running again and see how they

> do on this new approach. I have a decent sized 24 core server

> that should be good enough for this job. I'll see what I can do.

It will be really helpful if you can try out those cases.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Mark Kirkwood

Date:

14 September 2014, 07:09:10

On 14/09/14 19:00, Amit Kapila wrote:
> On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith
> <gregsmithpgsql@gmail.com <mailto:gregsmithpgsql@gmail.com>> wrote:
>
>  > This looks like it's squashed one of the very fundamental buffer
>  >  scaling issues though; well done Amit.
>
> Thanks.
>
>  > I'll go back to my notes and try to recreate the pathological cases
>  > that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync
>  > spreading effort I did; get those running again and see how they
>  > do on this new approach.  I have a decent sized 24 core server
>  > that should be good enough for this job.  I'll see what I can do.
>
> It will be really helpful if you can try out those cases.
>
>

And if you want 'em run on the 60 core beast, just let me know the 
details and I'll do some runs for you.

Cheers

Mark

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

16 September 2014, 12:18:44

On Sun, Sep 14, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 12, 2014 at 11:55 AM, Amit Chapel <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:

I will post the data with the latest patch separately (where I will focus
on new cases discussed between Robert and Andres).

Performance Data with latest version of patch.

All the data shown below is a median of 3 runs, for each

individual run data, refer attached document

(perf_read_scalability_data_v9.ods)

Performance Data for Read-only test

-----------------------------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client_Count/Patch_Ver	8	16	32	64	128
HEAD	58614	107370	140717	104357	65010
sbe_v9	62943	119064	172246	220174	220904

Observations

---------------------

1. It scales well as with previous versions of patch, but

it seems the performance is slightly better in few cases,

may be because I have removed a statement (if check)

or 2 in bgreclaimer (those were done under spinlock) or it

could be just run-to-run difference.

> (1) A read-only pgbench workload that is just a tiny bit larger than
> shared_buffers, say size of shared_buffers plus 0.01%. Such workloads
> tend to stress buffer eviction heavily.

When the data is just tiny bit larger than shared buffers, actually

there is no problem in scalability even in HEAD, because I think

most of the requests will be satisfied from existing buffer pool.

I have taken data for some of the loads where database size is

bit larger than shared buffers and it is as follows:

Scale Factor - 800

Shared_Buffers - 12286MB (Total db size is 12288MB)

Client_Count/Patch_Ver	1	8	16	32	64	128
HEAD	8406	68712	132222	198481	290340	289828
sbe_v9	8504	68546	131926	195789	289959	289021

Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)

Client_Count/Patch_Ver	1	8	16	32	64	128
HEAD	8428	68609	128092	196596	292066	293812
sbe_v9	8386	68546	126926	197126	289959	287621

Observations

---------------------

In most cases performance with patch is slightly less as compare

to HEAD and the difference is generally less than 1% and in a case

or 2 close to 2%. I think the main reason for slight difference is that

when the size of shared buffers is almost same as data size, the number

of buffers it needs from clock sweep are very less, as an example in first

case (when size of shared buffers is 12286MB), it actually needs at most

256 additional buffers (2MB) via clock sweep, where as bgreclaimer

will put 2000 (high water mark) additional buffers (0.5% of shared buffers

is greater than 2000 ) in free list, so bgreclaimer does some extra work

when it is not required and it also leads to condition you mentioned

down (freelist will contain buffers that have already been touched since

we added them). Now for case 2 (12166MB), we need buffers more

than 2000 additional buffers, but not too many, so it can also have

similar effect.

I think we have below options related to this observation

a. Some further tuning in bgreclaimer, so that instead of putting

the buffers up to high water mark in freelist, it puts just 1/4th or

1/2 of high water mark and then check if the free list still contains

lesser than equal to low water mark, if yes it continues and if not

then it can wait (or may be some other way).

b. Instead of waking bgreclaimer when the number of buffers fall

below low water mark, wake when the number of times backends

does clock sweep crosses certain threshold

c. Give low and high water mark as config knobs, so that in some

rare cases users can use them to do tuning.

d. Lets not do anything as if user does such a configuration, he should

be educated to configure shared buffers in a better way and or the

performance hit doesn't seem to be justified to do any further

work.

Now if we do either of 'a' or 'b', then I think there is a chance

that the gain might not be same for cases where users can

easily get benefit from this patch and there is a chance that

it degrades the performance in some other case.

> (2) A workload that maximizes the rate of concurrent buffer eviction
> relative to other tasks. Read-only pgbench is not bad for this, but
> maybe somebody's got a better idea.

I think the first test of pgbench (scale_factor-3000;shared_buffers-8GB)

addresses this case.

> As I sort of mentioned in what I was writing for the bufmgr README,
> there are, more or less, three ways this can fall down, at least that
> I can see: (1) if the high water mark is too high, then we'll start
> finding buffers in the freelist that have already been touched since
> we added them:

I think I am able to see this effect (though mild) in one of above tests.

With Regards,
Amit Kapila.

EnterpriseDB: http://www.enterprisedb.com

Attachment

perf_read_scalability_data_v9.ods

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

16 September 2014, 16:51:30

On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In most cases performance with patch is slightly less as compare
to HEAD and the difference is generally less than 1% and in a case
or 2 close to 2%. I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required and it also leads to condition you mentioned
down (freelist will contain buffers that have already been touched since
we added them). Now for case 2 (12166MB), we need buffers more
than 2000 additional buffers, but not too many, so it can also have
similar effect.

So there are two suboptimal things that can happen and they pull in opposite directions. I think you should instrument the server how often each is happening. #1 is that we can pop a buffer from the freelist and find that it's been touched. That means we wasted the effort of putting it on the freelist in the first place. #2 is that we can want to pop a buffer from the freelist and find it empty and thus be forced to run the clock sweep ourselves. If we're having problem #1, we could improve things by reducing the water marks. If we're having problem #2, we could improve things by increasing the water marks. If we're having both problems, then I dunno. But let's get some numbers on the frequency of these specific things, rather than just overall tps numbers.

I think we have below options related to this observation
a. Some further tuning in bgreclaimer, so that instead of putting
the buffers up to high water mark in freelist, it puts just 1/4th or
1/2 of high water mark and then check if the free list still contains
lesser than equal to low water mark, if yes it continues and if not
then it can wait (or may be some other way).

That sounds suspiciously like just reducing the high water mark.

b. Instead of waking bgreclaimer when the number of buffers fall
below low water mark, wake when the number of times backends
does clock sweep crosses certain threshold

That doesn't sound helpful.

c. Give low and high water mark as config knobs, so that in some
rare cases users can use them to do tuning.

Yuck.

d. Lets not do anything as if user does such a configuration, he should
be educated to configure shared buffers in a better way and or the
performance hit doesn't seem to be justified to do any further
work.

At least worth entertaining.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

19 September 2014, 11:21:57

On Tue, Sep 16, 2014 at 10:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
In most cases performance with patch is slightly less as compare
to HEAD and the difference is generally less than 1% and in a case
or 2 close to 2%. I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required and it also leads to condition you mentioned
down (freelist will contain buffers that have already been touched since
we added them). Now for case 2 (12166MB), we need buffers more
than 2000 additional buffers, but not too many, so it can also have
similar effect.

So there are two suboptimal things that can happen and they pull in opposite directions. I think you should instrument the server how often each is happening. #1 is that we can pop a buffer from the freelist and find that it's been touched. That means we wasted the effort of putting it on the freelist in the first place. #2 is that we can want to pop a buffer from the freelist and find it empty and thus be forced to run the clock sweep ourselves. If we're having problem #1, we could improve things by reducing the water marks. If we're having problem #2, we could improve things by increasing the water marks. If we're having both problems, then I dunno. But let's get some numbers on the frequency of these specific things, rather than just overall tps numbers.

Specific numbers of both the configurations for which I have

posted data in previous mail are as follows:

Scale Factor - 800

Shared_Buffers - 12286MB (Total db size is 12288MB)

Client and Thread Count = 64

buffers_touched_freelist - count of buffers that backends found touched after

popping from freelist.

buffers_backend_clocksweep - count of buffer allocations not satisfied from freelist

buffers_alloc	1531023
buffers_backend_clocksweep	0
buffers_touched_freelist	0

Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)

Client and Thread Count = 64

buffers_alloc	1531010
buffers_backend_clocksweep	0
buffers_touched_freelist	0

In both the above cases, I have taken data multiple times to ensure

correctness. From the above data, it is evident that in both the above

configurations all the requests are satisfied from the initial freelist.

Basically the amount of shared buffers configured

(12286MB = 1572608 buffers and 12166MB = 1557248 buffers) are

sufficient to contain all the work load for pgbench run.

So now the question is why we are seeing small variation (<1%) in data

in case all the data fits in shared buffers and the reason could be that

we have added few extra instructions (due to increase in StrategyControl

structure size, additional function call, one or two new assignments) in the

Buffer Allocation path (the extra instructions will also be only till all the data

pages gets associated with buffers, after that the control won't even reach

StrategyGetBuffer()) or it may be due to variation across different runs with

different binaries.

I have went ahead to take the data in cases shared buffers are tiny bit (0.1%

and .05%) less than workload (based on buffer allocations done in above cases).

Performance Data

-------------------------------

Scale Factor - 800

Shared_Buffers - 11950MB

Client_Count/Patch_Ver	8	16	32	64	128
HEAD	68424	132540	195496	279511	283280
sbe_v9	68565	132709	194631	284351	289333

Scale Factor - 800

Shared_Buffers - 11955MB

Client_Count/Patch_Ver	8	16	32	64	128
HEAD	68331	127752	196385	274387	281753
sbe_v9	68922	131314	194452	284292	287221

The above data indicates that performance is better with patch

in almost all cases and especially at high concurrency (64 and

128 client count).

The overall conclusion is that with patch

a. when the data can fit in RAM and not completely in shared buffers,

the performance/scalability is quite good even if shared buffers are just

tiny bit less that all the data.

b. when shared buffers are sufficient to contain all the data, then there is

a slight difference (<1%) in performance.

d. Lets not do anything as if user does such a configuration, he should
be educated to configure shared buffers in a better way and or the
performance hit doesn't seem to be justified to do any further
work.

At least worth entertaining.

Based on further analysis, I think this is the way to go.

Attached find the patch for new stat (buffers_touched_freelist) just in

case you want to run the patch with it and detailed (individual run)

performance data.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

20 September 2014, 08:14:47

On Sun, Sep 14, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Apart from this, I have changed kid of newly added function as
> due to recent commit, the oid I was using is no longer available.

After sending last patch I have realized that oid used in patch for

new function is wrong; fixed the same in attached patch.

I have fixed all the comments raised till now for patch and

provided the latest performance data as well, so I will mark it

as "Needs Review" to proceed further.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

scalable_buffer_eviction_v10.patch

Re: Scaling shared buffer eviction

From

Gregory Smith

Date:

22 September 2014, 05:13:56

On 9/16/14, 8:18 AM, Amit Kapila wrote:
> I think the main reason for slight difference is that
> when the size of shared buffers is almost same as data size, the number
> of buffers it needs from clock sweep are very less, as an example in first
> case (when size of shared buffers is 12286MB), it actually needs at most
> 256 additional buffers (2MB) via clock sweep, where as bgreclaimer
> will put 2000 (high water mark) additional buffers (0.5% of shared buffers
> is greater than 2000 ) in free list, so bgreclaimer does some extra work
> when it is not required
This is exactly what I was warning about, as the sort of lesson learned 
from the last round of such tuning.  There are going to be spots where 
trying to tune the code to be aggressive on the hard cases will work 
great.  But you need to make that dynamic to some degree, such that the 
code doesn't waste a lot of time sweeping buffers when the demand for 
them is actually weak.  That will make all sorts of cases that look like 
this slower.

We should be able to tell these apart if there's enough instrumentation 
and solid logic inside of the program itself though.  The 8.3 era BGW 
coped with a lot of these issues using a particular style of moving 
average with fast reaction time, plus instrumenting the buffer 
allocation rate as accurately as it could. So before getting into 
high/low water note questions, are you comfortable that there's a clear, 
accurate number that measures the activity level that's important here?  
And have you considered ways it might be averaging over time or have a 
history that's analyzed? The exact fast approach / slow decay weighted 
moving average approach of the 8.3 BGW, the thing that tried to smooth 
the erratic data set possible here, was a pretty critical part of 
getting itself auto-tuning to workload size.  It ended up being much 
more important than the work of setting the arbitrary watermark levels.

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

22 September 2014, 06:55:22

On Mon, Sep 22, 2014 at 10:43 AM, Gregory Smith <gregsmithpgsql@gmail.com> wrote:

On 9/16/14, 8:18 AM, Amit Kapila wrote:
I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required
This is exactly what I was warning about, as the sort of lesson learned from the last round of such tuning. There are going to be spots where trying to tune the code to be aggressive on the hard cases will work great. But you need to make that dynamic to some degree, such that the code doesn't waste a lot of time sweeping buffers when the demand for them is actually weak. That will make all sorts of cases that look like this slower.

To verify whether above can lead to any kind of regression, I have

checked the cases (workload is 0.05 or 0.1 percent larger than shared

buffers) where we need few extra buffers and bgreclaimer might put

some additional buffers and it turns out that in those cases also, there

is a win especially at high concurrency and results of the same are posted

upthread

(http://www.postgresql.org/message-id/CAA4eK1LFGcvzMdcD5NZx7B2gCbP1G7vWK7w32EZk=VOOLUds-A@mail.gmail.com).

We should be able to tell these apart if there's enough instrumentation and solid logic inside of the program itself though. The 8.3 era BGW coped with a lot of these issues using a particular style of moving average with fast reaction time, plus instrumenting the buffer allocation rate as accurately as it could. So before getting into high/low water note questions, are you comfortable that there's a clear, accurate number that measures the activity level that's important here?

Very Good Question. This was exactly the thing which was

missing in my initial versions (about 2 years back when I tried to

solve this problem) but based on Robert's and Andres's feedback

I realized that we need an accurate number to measure the activity

level (in this case it is consumption of buffers from freelist), so

I have introduced the logic to calculate the same (it is stored in new

variable numFreeListBuffers in BufferStrategyControl structure).

And have you considered ways it might be averaging over time or have a history that's analyzed?

The current logic of bgreclaimer is such that even if it does

some extra activity (extra is very much controlled) in one cycle,

it will not start another cycle unless backends consume all the

buffers that were made available by bgreclaimer in one cycle.

I think the algorithm designed for bgreclaimer automatically

averages out based on activity. Do you see any cases where it

will not do so?

The exact fast approach / slow decay weighted moving average approach of the 8.3 BGW, the thing that tried to smooth the erratic data set possible here, was a pretty critical part of getting itself auto-tuning to workload size. It ended up being much more important than the work of setting the arbitrary watermark levels.

Agreed, but the logic with which bgwriter works is pretty different

and thats why it needs different kind of logic to handle auto-tuning.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

23 September 2014, 14:31:31

On Fri, Sep 19, 2014 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Specific numbers of both the configurations for which I have
posted data in previous mail are as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)
Client and Thread Count = 64
buffers_touched_freelist - count of buffers that backends found touched after
popping from freelist.
buffers_backend_clocksweep - count of buffer allocations not satisfied from freelist

buffers_alloc 1531023
buffers_backend_clocksweep 0
buffers_touched_freelist 0

I didn't believe these numbers, so I did some testing. I used the same configuration you mention here, scale factor = 800, shared_buffers = 12286 MB, and I also saw buffers_backend_clocksweep = 0. I didn't see buffers_touched_freelist showing up anywhere, so I don't know whether that would have been zero or not. Then I tried reducing the high watermark for the freelist from 2000 buffers to 25 buffers, and buffers_backend_clocksweep was *still* 0. At that point I started to smell a rat. It turns out that, with this test configuration, there's no buffer allocation going on at all. Everything fits in shared_buffers, or it did on my test. I had to reduce shared_buffers down to 10491800kB before I got any significant buffer eviction.

At that level, a 100-buffer high watermark wasn't sufficient to prevent the freelist from occasionally going empty. A 2000-buffer high water mark was by and large sufficient, although I was able to see small numbers of buffers being allocated via clocksweep right at the very beginning of the test, I guess before the reclaimer really got cranking. So the watermarks seem to be broadly in the right ballpark, but I think the statistics reporting needs improving. We need an easy way to measure the amount of work that bgreclaimer is actually doing.

I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.

2. The number of times a backend has run the clocksweep.

3. The number of buffers past which the reclaimer has advanced the clock sweep (i.e. the number of buffers it had to examine in order to reclaim the number counted by #1).

4. The number of buffers past which a backend has advanced the clocksweep (i.e. the number of buffers it had to examine in order to allocate the number of buffers count by #3).

5. The number of buffers allocated from the freelist which the backend did not use because they'd been touched (what you're calling buffers_touched_freelist).

It's hard to come up with good names for all of these things that are consistent with the somewhat wonky existing names. Here's an attempt:

1. bgreclaim_freelist

2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I think we want to make it more parallel with buffers_alloc, which is the number of buffers allocated, not buffers_backend, the number of buffers *written* by a backend)

3. clocksweep_bgreclaim

4. clocksweep_backend

5. freelist_touched

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

23 September 2014, 14:43:00

Hi,

On 2014-09-23 10:31:24 -0400, Robert Haas wrote:
> I suggest we count these things:
> 
> 1. The number of buffers the reclaimer has put back on the free list.
> 2. The number of times a backend has run the clocksweep.
> 3. The number of buffers past which the reclaimer has advanced the clock
> sweep (i.e. the number of buffers it had to examine in order to reclaim the
> number counted by #1).

> 4. The number of buffers past which a backend has advanced the clocksweep
> (i.e. the number of buffers it had to examine in order to allocate the
> number of buffers count by #3).

> 5. The number of buffers allocated from the freelist which the backend did
> not use because they'd been touched (what you're calling
> buffers_touched_freelist).

Sounds good.

> It's hard to come up with good names for all of these things that are
> consistent with the somewhat wonky existing names.  Here's an attempt:
> 
> 1. bgreclaim_freelist

bgreclaim_alloc_clocksweep?

> 2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I
> think we want to make it more parallel with buffers_alloc, which is the
> number of buffers allocated, not buffers_backend, the number of buffers
> *written* by a backend)
> 3. clocksweep_bgreclaim
> 4. clocksweep_backend

I think bgreclaim/backend should always be either be a prefix or a
postfix. But not one in some variables and some in another.

> 5. freelist_touched

I wonder if we shouldn't move all this to a new view, instead of
stuffing it somewhere where it really doesn't belong. pg_stat_buffers or
something like it.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

23 September 2014, 14:55:38

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ review ]

Oh, by the way, I noticed that this patch breaks pg_buffercache.  If
we're going to have 128 lock partitions, we need to bump
MAX_SIMUL_LWLOCKS.

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

I think it's pretty clear that #1 and #2 are a good idea.  #3 is a
mixed bag, and it might account for the regressions you saw on some
test cases.  Increasing the number of buffer mapping locks means that
those locks take up more cache lines, which could slow things down in
cases where there's no reduction in contention.  It also means that
the chances of an allocated buffer ending up in the same buffer
mapping lock partition are 1/128 instead of 1/16, which means about
5.4 additional lwlock acquire/release cycles per 100 allocations.
That's not a ton, but it's not free either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

23 September 2014, 20:29:26

On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> But this gets at another point: the way we're benchmarking this right
> now, we're really conflating the effects of three different things:
>
> 1. Changing the locking regimen around the freelist and clocksweep.
> 2. Adding a bgreclaimer process.
> 3. Raising the number of buffer locking partitions.

I did some more experimentation on this.  Attached is a patch that
JUST does #1, and, as previously suggested, it uses a single spinlock
instead of using two of them that are probably in the same cacheline.
Without this patch, on a 32-client, read-only pgbench at scale factor
1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
LWLock-inspired preemption:

         - LWLockAcquire
            + 68.41% ReadBuffer_common
            + 31.59% StrategyGetBuffer

With the patch, unsurprisingly, StrategyGetBuffer disappears and the
only lwlocks that are causing context-switches are the individual
buffer locks.  But are we suffering spinlock contention instead as a
result?   Yes, but not much.  s_lock is at 0.41% in the corresponding
profile, and only 6.83% of those calls are from the patched
StrategyGetBuffer.  In a similar profile of master, s_lock is at
0.31%.  So there's a bit of additional s_lock contention, but I think
it's considerably less than the contention over the lwlock it's
replacing, because the spinlock is only held for the minimal amount of
time needed, whereas the lwlock could be held across taking and
releasing many individual buffer locks.

TPS results are a little higher with the patch - these are alternating
5 minute runs:

master tps = 176010.647944 (including connections establishing)
master tps = 176615.291149 (including connections establishing)
master tps = 175648.370487 (including connections establishing)
reduce-replacement-locking tps = 177888.734320 (including connections
establishing)
reduce-replacement-locking tps = 177797.842410 (including connections
establishing)
reduce-replacement-locking tps = 177894.822656 (including connections
establishing)

The picture is similar at 64 clients, but the benefit is a little more:

master tps = 179037.231597 (including connections establishing)
master tps = 180500.937068 (including connections establishing)
master tps = 181565.706514 (including connections establishing)
reduce-replacement-locking tps = 185741.503425 (including connections
establishing)
reduce-replacement-locking tps = 188598.728062 (including connections
establishing)
reduce-replacement-locking tps = 187340.977277 (including connections
establishing)

What's interesting is that I can't see in the perf output any real
sign that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up:

reduce-replacement-locking-128 tps = 251001.812843 (including
connections establishing)
reduce-replacement-locking-128 tps = 247368.925927 (including
connections establishing)
reduce-replacement-locking-128 tps = 250775.304177 (including
connections establishing)

The performance also goes up if I do that on master, but the effect is
substantially less:

master-128 tps = 219301.492902 (including connections establishing)
master-128 tps = 219786.249076 (including connections establishing)
master-128 tps = 219821.220271 (including connections establishing)

I think this shows pretty clearly that, even without the bgreclaimer,
there's a lot of merit in getting rid of BufFreelistLock and using a
spinlock held for the absolutely minimal number of instructions
instead.  There's already some benefit without doing anything about
the buffer mapping locks, and we'll get a lot more benefit once that
issue is addressed.  I think we need to do some serious study to see
whether bgreclaimer is even necessary, because it looks like this
change alone, which is much simpler, takes a huge bite out of the
scalability problem.

I welcome further testing and comments, but my current inclination is
to go ahead and push the attached patch.  To my knowledge, nobody has
at any point objected to this aspect of what's being proposed, and it
looks safe to me and seems to be a clear win.  We can then deal with
the other issues on their own merits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

reduce-replacement-locking.patch

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

23 September 2014, 20:53:54

On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I did some more experimentation on this.  Attached is a patch that
> JUST does #1, and, ...

...and that was the wrong patch.  Thanks to Heikki for point that out.

Second try.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

reduce-replacement-locking.patch

Re: Scaling shared buffer eviction

From

Tom Lane

Date:

23 September 2014, 21:44:08

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I did some more experimentation on this.  Attached is a patch that
>> JUST does #1, and, ...

> ...and that was the wrong patch.  Thanks to Heikki for point that out.
> Second try.

But the results you gave in the previous message were correctly
attributed?
        regards, tom lane

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

23 September 2014, 21:50:24

On Tue, Sep 23, 2014 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I did some more experimentation on this.  Attached is a patch that
>>> JUST does #1, and, ...
>
>> ...and that was the wrong patch.  Thanks to Heikki for point that out.
>> Second try.
>
> But the results you gave in the previous message were correctly
> attributed?

The patch I attached the first time was just the last commit in the
git repository where I wrote the patch, rather than the changes that I
made on top of that commit.  So, yes, the results from the previous
message are with the patch attached to the follow-up.  I just typed
the wrong git command when attempting to extract that patch to attach
it to the email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Gregory Smith

Date:

23 September 2014, 22:03:08

On 9/23/14, 10:31 AM, Robert Haas wrote:
> I suggest we count these things:
>
> 1. The number of buffers the reclaimer has put back on the free list.
> 2. The number of times a backend has run the clocksweep.
> 3. The number of buffers past which the reclaimer has advanced the 
> clock sweep (i.e. the number of buffers it had to examine in order to 
> reclaim the number counted by #1).
> 4. The number of buffers past which a backend has advanced the 
> clocksweep (i.e. the number of buffers it had to examine in order to 
> allocate the number of buffers count by #3).
> 5. The number of buffers allocated from the freelist which the backend 
> did not use because they'd been touched (what you're calling 
> buffers_touched_freelist).

All sound reasonable.  To avoid wasting time here, I think it's only 
worth doing all of these as DEBUG level messages for now.  Then only go 
through the overhead of exposing the ones that actually seem relevant.  
That's what I did for the 8.3 work, and most of that data at this level 
was barely relevant to anyone but me then or since. We don't want the 
system views to include so much irrelevant trivia that finding the 
important parts becomes overwhelming.

I'd like to see that level of instrumentation--just the debug level 
messages--used to quantify the benchmarks that people are running 
already, to prove they are testing what they think they are.  That would 
have caught the test error you already stumbled on for example.  Simple 
paranoia says there may be more issues like that hidden in here 
somewhere, and this set you've identified should find any more of them 
around.

If all that matches up so the numbers for the new counters seem sane, I 
think that's enough to commit the first basic improvement here.  Then 
make a second pass over exposing the useful internals that let everyone 
quantify either the improvement or things that may need to be tunable.

-- 
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

23 September 2014, 22:54:26

On 2014-09-23 16:29:16 -0400, Robert Haas wrote:
> On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > But this gets at another point: the way we're benchmarking this right
> > now, we're really conflating the effects of three different things:
> >
> > 1. Changing the locking regimen around the freelist and clocksweep.
> > 2. Adding a bgreclaimer process.
> > 3. Raising the number of buffer locking partitions.
> 
> I did some more experimentation on this.  Attached is a patch that
> JUST does #1, and, as previously suggested, it uses a single spinlock
> instead of using two of them that are probably in the same cacheline.
> Without this patch, on a 32-client, read-only pgbench at scale factor
> 1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
> LWLock-inspired preemption:
> 
>          - LWLockAcquire
>             + 68.41% ReadBuffer_common
>             + 31.59% StrategyGetBuffer
> 
> With the patch, unsurprisingly, StrategyGetBuffer disappears and the
> only lwlocks that are causing context-switches are the individual
> buffer locks.  But are we suffering spinlock contention instead as a
> result?   Yes, but not much.  s_lock is at 0.41% in the corresponding
> profile, and only 6.83% of those calls are from the patched
> StrategyGetBuffer.  In a similar profile of master, s_lock is at
> 0.31%.  So there's a bit of additional s_lock contention, but I think
> it's considerably less than the contention over the lwlock it's
> replacing, because the spinlock is only held for the minimal amount of
> time needed, whereas the lwlock could be held across taking and
> releasing many individual buffer locks.

Am I understanding you correctly that you also measured context switches
for spinlocks? If so, I don't think that's a valid comparison. LWLocks
explicitly yield the CPU as soon as there's any contention while
spinlocks will, well, spin. Sure they also go to sleep, but it'll take
longer.

It's also worthwile to remember in such comparisons that lots of the
cost of spinlocks will be in the calling routines, not s_lock() - the
first TAS() is inlined into it. And that's the one that'll incur cache
misses and such...

Note that I'm explicitly *not* doubting the use of a spinlock
itself. Given the short acquiration times and the exclusive use of
exlusive acquiration a spinlock makes more sense. The lwlock's spinlock
alone will have about as much contention.

I think it might be possible to construct some cases where the spinlock
performs worse than the lwlock. But I think those will be clearly in the
minority. And at least some of those will be fixed by bgwriter. As an
example consider a single backend COPY ... FROM of large files with a
relatively large s_b. That's a seriously bad workload for the current
victim buffer selection because frequently most of the needs to be
searched for a buffer with usagecount 0. And this patch will double the
amount of atomic ops during that.

Let me try to quantify that.

> What's interesting is that I can't see in the perf output any real
> sign that the buffer mapping locks are slowing things down, but they
> clearly are, because when I take this patch and also boost
> NUM_BUFFER_PARTITIONS to 128, the performance goes up:

What did events did you profile?

> I welcome further testing and comments, but my current inclination is
> to go ahead and push the attached patch.  To my knowledge, nobody has
> at any point objected to this aspect of what's being proposed, and it
> looks safe to me and seems to be a clear win.  We can then deal with
> the other issues on their own merits.

I've took a look at this, and all the stuff I saw that I disliked were
there before this patch. So +1 for going ahead.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

23 September 2014, 23:13:38

On Tue, Sep 23, 2014 at 6:02 PM, Gregory Smith <gregsmithpgsql@gmail.com> wrote:
> On 9/23/14, 10:31 AM, Robert Haas wrote:
>> I suggest we count these things:
>>
>> 1. The number of buffers the reclaimer has put back on the free list.
>> 2. The number of times a backend has run the clocksweep.
>> 3. The number of buffers past which the reclaimer has advanced the clock
>> sweep (i.e. the number of buffers it had to examine in order to reclaim the
>> number counted by #1).
>> 4. The number of buffers past which a backend has advanced the clocksweep
>> (i.e. the number of buffers it had to examine in order to allocate the
>> number of buffers count by #3).
>> 5. The number of buffers allocated from the freelist which the backend did
>> not use because they'd been touched (what you're calling
>> buffers_touched_freelist).
>
> All sound reasonable.  To avoid wasting time here, I think it's only worth
> doing all of these as DEBUG level messages for now.  Then only go through
> the overhead of exposing the ones that actually seem relevant.  That's what
> I did for the 8.3 work, and most of that data at this level was barely
> relevant to anyone but me then or since. We don't want the system views to
> include so much irrelevant trivia that finding the important parts becomes
> overwhelming.

I think we expose far too little information in our system views.
Just to take one example, we expose no useful information about lwlock
acquire or release, but a lot of real-world performance problems are
caused by lwlock contention.  There are of course difficulties in
exposing huge numbers of counters, but we're not talking about many
here, so I'd lean toward exposing them in the final patch if they seem
at all useful.

> I'd like to see that level of instrumentation--just the debug level
> messages--used to quantify the benchmarks that people are running already,
> to prove they are testing what they think they are.  That would have caught
> the test error you already stumbled on for example.  Simple paranoia says
> there may be more issues like that hidden in here somewhere, and this set
> you've identified should find any more of them around.

Right.

> If all that matches up so the numbers for the new counters seem sane, I
> think that's enough to commit the first basic improvement here.  Then make a
> second pass over exposing the useful internals that let everyone quantify
> either the improvement or things that may need to be tunable.

Well, I posted a patch a bit ago that I think is the first basic
improvement - and none of these counters are relevant to that.  It
doesn't add a new background process or anything; it just does pretty
much the same thing we do now with less-painful locking.  There are no
water marks to worry about, or tunable thresholds, or anything; and
because it's so much simpler, it's far easier to reason about than the
full patch, which is why I feel quite confident pressing on toward a
commit.

Once that is in, I think we should revisit the idea of a bgreclaimer
process, and see how much more that improves things - if at all - on
top of what that basic patch already does.  For that we'll need these
counters, and maybe others.  But let's make that phase two.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

23 September 2014, 23:21:19

On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Am I understanding you correctly that you also measured context switches
> for spinlocks? If so, I don't think that's a valid comparison. LWLocks
> explicitly yield the CPU as soon as there's any contention while
> spinlocks will, well, spin. Sure they also go to sleep, but it'll take
> longer.

No, I measured CPU consumption attributable to s_lock.

(I checked context-switches, too.)

> It's also worthwile to remember in such comparisons that lots of the
> cost of spinlocks will be in the calling routines, not s_lock() - the
> first TAS() is inlined into it. And that's the one that'll incur cache
> misses and such...

True.  I can check that - I did not.

> Note that I'm explicitly *not* doubting the use of a spinlock
> itself. Given the short acquiration times and the exclusive use of
> exlusive acquiration a spinlock makes more sense. The lwlock's spinlock
> alone will have about as much contention.

Right.

> I think it might be possible to construct some cases where the spinlock
> performs worse than the lwlock. But I think those will be clearly in the
> minority. And at least some of those will be fixed by bgwriter. As an
> example consider a single backend COPY ... FROM of large files with a
> relatively large s_b. That's a seriously bad workload for the current
> victim buffer selection because frequently most of the needs to be
> searched for a buffer with usagecount 0. And this patch will double the
> amount of atomic ops during that.

It will actually be far worse than that, because we'll acquire and
release the spinlock for every buffer over which we advance the clock
sweep, instead of just once for the whole thing.  That's reason to
hope that a smart bgreclaimer process may be a good improvement on top
of this.  It can do things like advance the clock sweep hand 16
buffers at a time and then sweep them all after-the-fact, reducing
contention on the mutex by an order-of-magnitude, if that turns out to
be an important consideration.

But I think it's right to view that as something we need to test vs.
the baseline established by this patch.  What's clear today is that
workloads which stress buffer-eviction fall to pieces, because the
entire buffer-eviction process is essentially single-threaded.  One
process can't begin evicting a buffer until another has finished doing
so.  This lets multiple backends do that at the same time.  We may
find cases where that leads to an unpleasant amount of contention, but
since we have several ideas for how to mitigate that as needs be, I
think it's OK to go ahead.  The testing we're doing on the combined
patch is conflating the effects the new locking regimen with however
the bgreclaimer affects things, and it's very clear to me now that we
need to make sure those are clearly separate.

> Let me try to quantify that.

Please do.

>> What's interesting is that I can't see in the perf output any real
>> sign that the buffer mapping locks are slowing things down, but they
>> clearly are, because when I take this patch and also boost
>> NUM_BUFFER_PARTITIONS to 128, the performance goes up:
>
> What did events did you profile?

cs.

>> I welcome further testing and comments, but my current inclination is
>> to go ahead and push the attached patch.  To my knowledge, nobody has
>> at any point objected to this aspect of what's being proposed, and it
>> looks safe to me and seems to be a clear win.  We can then deal with
>> the other issues on their own merits.
>
> I've took a look at this, and all the stuff I saw that I disliked were
> there before this patch. So +1 for going ahead.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

23 September 2014, 23:43:00

On 2014-09-23 19:21:10 -0400, Robert Haas wrote:
> On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think it might be possible to construct some cases where the spinlock
> > performs worse than the lwlock. But I think those will be clearly in the
> > minority. And at least some of those will be fixed by bgwriter.

Err, this should read bgreclaimer, not bgwriter.

> > As an
> > example consider a single backend COPY ... FROM of large files with a
> > relatively large s_b. That's a seriously bad workload for the current
> > victim buffer selection because frequently most of the needs to be
> > searched for a buffer with usagecount 0. And this patch will double the
> > amount of atomic ops during that.
> 
> It will actually be far worse than that, because we'll acquire and
> release the spinlock for every buffer over which we advance the clock
> sweep, instead of just once for the whole thing.

I said double, because we already acquire the buffer header's spinlock
every tick.

> That's reason to hope that a smart bgreclaimer process may be a good
> improvement on top of this.

Right. That's what I was trying to say with bgreclaimer above.

> But I think it's right to view that as something we need to test vs.
> the baseline established by this patch.

Agreed. I think the possible downsides are at the very fringe of already
bad cases. That's why I agreed that you should go ahead.

> > Let me try to quantify that.
> 
> Please do.

I've managed to find a ~1.5% performance regression. But the setup was
plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into
a fillfactor 10 table with the column set to PLAIN storage. With the
resulting table size chosen so it's considerably bigger than s_b, but
smaller than the dirty writeback limit of the kernel.

That's perfectly reasonable.

I can think of a couple other cases, but they're all similarly absurd.

> >> What's interesting is that I can't see in the perf output any real
> >> sign that the buffer mapping locks are slowing things down, but they
> >> clearly are, because when I take this patch and also boost
> >> NUM_BUFFER_PARTITIONS to 128, the performance goes up:
> >
> > What did events did you profile?
> 
> cs.

Ah. My guess is that most of the time will probably actually be spent in
the lwlock's spinlock, not the the lwlock putting itself to sleep.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

24 September 2014, 02:08:11

On Tue, Sep 23, 2014 at 7:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> It will actually be far worse than that, because we'll acquire and
>> release the spinlock for every buffer over which we advance the clock
>> sweep, instead of just once for the whole thing.
>
> I said double, because we already acquire the buffer header's spinlock
> every tick.

Oh, good point.

>> > Let me try to quantify that.
>>
>> Please do.
>
> I've managed to find a ~1.5% performance regression. But the setup was
> plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into
> a fillfactor 10 table with the column set to PLAIN storage. With the
> resulting table size chosen so it's considerably bigger than s_b, but
> smaller than the dirty writeback limit of the kernel.
>
> That's perfectly reasonable.
>
> I can think of a couple other cases, but they're all similarly absurd.

Well, it's not insane to worry about such things, but if you can only
manage 1.5% on such an extreme case, I'm encouraged.  This is killing
us on OLTP workloads, and fixing that is a lot more important than a
couple percent on an extreme case.

> Ah. My guess is that most of the time will probably actually be spent in
> the lwlock's spinlock, not the the lwlock putting itself to sleep.

Ah, OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Gregory Smith

Date:

24 September 2014, 04:02:19

On 9/23/14, 7:13 PM, Robert Haas wrote:
> I think we expose far too little information in our system views. Just 
> to take one example, we expose no useful information about lwlock 
> acquire or release, but a lot of real-world performance problems are 
> caused by lwlock contention.
I sent over a proposal for what I was calling Performance Events about a 
year ago.  The idea was to provide a place to save data about lock 
contention, weird checkpoint sync events, that sort of thing.  Replacing 
log parsing to get at log_lock_waits data was my top priority.  Once 
that's there, lwlocks was an obvious next target.  Presumably we just 
needed collection to be low enough overhead, and then we can go down to 
whatever shorter locks we want; lower the overhead, faster the event we 
can measure.

Sometimes the database will never be able to instrument some of its 
fastest events without blowing away the event itself.  We'll still have 
perf / dtrace / systemtap / etc. for those jobs.  But those are not the 
problems of the average Postgres DBA's typical day.

The data people need to solve this sort of thing in production can't 
always show up in counters.  You'll get evidence the problem is there, 
but you need more details to actually find the culprit.  Some info about 
the type of lock, tables and processes involved, maybe the query that's 
running, that sort of thing.  You can kind of half-ass the job if you 
make per-tables counter for everything, but we really need more, both to 
serve our users and to compare well against what other databases provide 
for tools.  That's why I was trying to get the infrastructure to capture 
all that lock detail, without going through the existing logging system 
first.

Actually building Performance Events fell apart on the storage side:  
figuring out where to put it all without waiting for a log file to hit 
disk.  I wanted in-memory storage so clients don't wait for anything, 
then a potentially lossy persistence writer.  I thought I could get away 
with a fixed size buffer like pg_stat_statements uses.  That was 
optimistic.  Trying to do better got me lost in memory management land 
without making much progress.

I think the work you've now done on dynamic shared memory gives the 
right shape of infrastructure that I could pull this off now.  I even 
have funding to work on it again, and it's actually the #2 thing I'd 
like to take on as I get energy for new feature development.  (#1 is the 
simple but time consuming job of adding block write counters, the lack 
of which which is just killing me on some fast growing installs)

I have a lot of unread messages on this list to sort through right now.  
I know I saw someone try to revive the idea of saving new sorts of 
performance log data again recently; can't seem to find it again right 
now.  That didn't seem like it went any farther than thinking about the 
specifications though.  The last time I jumped right over that and hit a 
wall with this one hard part of the implementation instead, low overhead 
memory management for saving everything.

-- 
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

25 September 2014, 13:51:30

On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The patch I attached the first time was just the last commit in the
> git repository where I wrote the patch, rather than the changes that I
> made on top of that commit.  So, yes, the results from the previous
> message are with the patch attached to the follow-up.  I just typed
> the wrong git command when attempting to extract that patch to attach
> it to the email.

Here are some more results.  TL;DR: The patch still looks good, but we
should raise the number of buffer mapping partitions as well.

On the IBM POWER7 machine, I ran read-only pgbench tests with 1
client, 8 clients, and all multiples of 16 up to 96.  I ran these
tests at scale factor 1000 and scale factor 3000. I tested four
builds: master as of commit df4077cda2eae3eb4a5cf387da0c1e7616e73204,
that same commit with the number of buffer mapping partitions raised
to 128, that commit with reduce-replacement-locking.patch applied, and
that commit with reduce-replacement-locking.patch applied AND the
number of buffer mapping partitions raised to 128.  The results from
each configuration are reported in that order on each of the lines
below; each is the median of three results.  shared_buffers=8GB for
all tests.

scale factor 1000
1 8119.907618 8230.853237 8153.515217 8145.045004
8 65457.006762 65826.439701 65851.010116 65703.168020
16 125263.858855 125723.441853 125020.598728 129506.037997
32 176696.288187 182376.232631 178278.917581 186440.340283
48 193251.602743 214243.417591 197958.562641 226782.327868
64 182264.276909 218655.105894 190364.759052 256863.652885
80 171719.210488 203104.673512 179861.241080 274065.020956
96 162525.883898 190960.622943 169759.271356 277820.128782

scale factor 3000
1 7690.357053 7723.925932 7772.207513 7684.079850
8 60789.325087 61688.547446 61485.398967 62546.166411
16 112509.423777 115138.385501 115606.858594 120015.350112
32 147881.211900 161359.994902 153302.501020 173063.463752
48 129748.929652 153986.160920 136164.387103 204935.207578
64 114364.542340 132174.970721 116705.371890 224636.957891
80 101375.265389 117279.931095 102374.794412 232966.076908
96 93144.724830 106676.309224 92787.650325 233862.872939

Analysis:

1. To see the effect of reduce-replacement-locking.patch, compare the
first TPS number in each line to the third, or the second to the
fourth.  At scale factor 1000, the patch wins in all of the cases with
32 or more clients and exactly half of the cases with 1, 8, or 16
clients.  The variations at low client counts are quite small, and the
patch isn't expected to do much at low concurrency levels, so that's
probably just random variation.  At scale factor 3000, the situation
is more complicated.  With only 16 bufmappinglocks, the patch gets its
biggest win at 48 clients, and by 96 clients it's actually losing to
unpatched master.  But with 128 bufmappinglocks, it wins - often
massively - on everything but the single-client test, which is a small
loss, hopefully within experimental variation.

2. To see the effect of increasing the number of buffer mapping locks
to 128, compare the first TPS number in each line to the second, or
the third to the fourth.  Without reduce-replacement-locking.patch,
that's a win at every concurrency level at both scale factors.  With
that patch, the 1 and 8 client tests are small losses at scale factor
1000, and the 1 client test is a small loss at scale factor 3000.

The single-client results, which are often a concern for scalability
patches, bear a bit of further comment.  In this round of testing,
either patch alone improved things slightly, and both patches together
made them slightly worse.  Even if that is reproducible, I don't think
it should be cause for concern, because it tends to indicate (at least
to me) that the shifting around is just the result of slightly
different placement of code across cache lines, or other minor factors
we can't really pin down.  So I'm inclined to (a) push
reduce-replacement-locking.patch and then also (b) bump up the number
of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
accordingly so that pg_buffercache doesn't get unhappy).

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Merlin Moncure

Date:

25 September 2014, 14:02:36

On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. To see the effect of reduce-replacement-locking.patch, compare the
> first TPS number in each line to the third, or the second to the
> fourth.  At scale factor 1000, the patch wins in all of the cases with
> 32 or more clients and exactly half of the cases with 1, 8, or 16
> clients.  The variations at low client counts are quite small, and the
> patch isn't expected to do much at low concurrency levels, so that's
> probably just random variation.  At scale factor 3000, the situation
> is more complicated.  With only 16 bufmappinglocks, the patch gets its
> biggest win at 48 clients, and by 96 clients it's actually losing to
> unpatched master.  But with 128 bufmappinglocks, it wins - often
> massively - on everything but the single-client test, which is a small
> loss, hopefully within experimental variation.
>
> Comments?

Why stop at 128 mapping locks?   Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured?  I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared bufffers...

merlin

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

25 September 2014, 14:09:44

On Thu, Sep 25, 2014 at 10:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. To see the effect of reduce-replacement-locking.patch, compare the
>> first TPS number in each line to the third, or the second to the
>> fourth.  At scale factor 1000, the patch wins in all of the cases with
>> 32 or more clients and exactly half of the cases with 1, 8, or 16
>> clients.  The variations at low client counts are quite small, and the
>> patch isn't expected to do much at low concurrency levels, so that's
>> probably just random variation.  At scale factor 3000, the situation
>> is more complicated.  With only 16 bufmappinglocks, the patch gets its
>> biggest win at 48 clients, and by 96 clients it's actually losing to
>> unpatched master.  But with 128 bufmappinglocks, it wins - often
>> massively - on everything but the single-client test, which is a small
>> loss, hopefully within experimental variation.
>>
>> Comments?
>
> Why stop at 128 mapping locks?   Theoretical downsides to having more
> mapping locks have been mentioned a few times but has this ever been
> measured?  I'm starting to wonder if the # mapping locks should be
> dependent on some other value, perhaps the # of shared bufffers...

Good question.  My belief is that the number of buffer mapping locks
required to avoid serious contention will be roughly proportional to
the number of hardware threads.  At the time the value 16 was chosen,
there were probably not more than 8-core CPUs in common use; but now
we've got a machine with 64 hardware threads and, what do you know but
it wants 128 locks.

I think the long-term solution here is that we need a lock-free hash
table implementation for our buffer mapping tables, because I'm pretty
sure that just cranking the number of locks up and up is going to
start to have unpleasant side effects at some point.  We may be able
to buy a few more years by just cranking it up, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

25 September 2014, 14:14:16

On 2014-09-25 09:51:17 -0400, Robert Haas wrote:
> On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > The patch I attached the first time was just the last commit in the
> > git repository where I wrote the patch, rather than the changes that I
> > made on top of that commit.  So, yes, the results from the previous
> > message are with the patch attached to the follow-up.  I just typed
> > the wrong git command when attempting to extract that patch to attach
> > it to the email.
> 
> Here are some more results.  TL;DR: The patch still looks good, but we
> should raise the number of buffer mapping partitions as well.

>  So I'm inclined to (a) push
> reduce-replacement-locking.patch and then also (b) bump up the number
> of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
> accordingly so that pg_buffercache doesn't get unhappy).

I'm happy with that. I don't think it's likely that a moderate increase
in the number of mapping lwlocks will be noticeably bad for any
workload.

One difference is that the total number of lwlock acquirations will be a
bit higher because currently it's more likely for the old and new to
fall into different partitions. But that's not really significant.

The other difference is the number of cachelines touched. Currently, in
concurrent workloads, there's already lots of L1 cache misses around the
buffer mapping locks because they're exclusively owned by a different
core/socket. So, to be effectively worse, the increase would need to
lead to lower overall cache hit rates by them not being in *any* cache
or displacing other content.

That leads me to wonder: Have you measured different, lower, number of
buffer mapping locks? 128 locks is, if we'd as we should align them
properly, 8KB of memory. Common L1 cache sizes are around 32k...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

25 September 2014, 14:15:06

On 2014-09-25 09:02:25 -0500, Merlin Moncure wrote:
> On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > 1. To see the effect of reduce-replacement-locking.patch, compare the
> > first TPS number in each line to the third, or the second to the
> > fourth.  At scale factor 1000, the patch wins in all of the cases with
> > 32 or more clients and exactly half of the cases with 1, 8, or 16
> > clients.  The variations at low client counts are quite small, and the
> > patch isn't expected to do much at low concurrency levels, so that's
> > probably just random variation.  At scale factor 3000, the situation
> > is more complicated.  With only 16 bufmappinglocks, the patch gets its
> > biggest win at 48 clients, and by 96 clients it's actually losing to
> > unpatched master.  But with 128 bufmappinglocks, it wins - often
> > massively - on everything but the single-client test, which is a small
> > loss, hopefully within experimental variation.
> >
> > Comments?
> 
> Why stop at 128 mapping locks?   Theoretical downsides to having more
> mapping locks have been mentioned a few times but has this ever been
> measured?  I'm starting to wonder if the # mapping locks should be
> dependent on some other value, perhaps the # of shared bufffers...

Wrong way round. You need to prove the upside of increasing it further,
not the contrary. The primary downside is cache hit ratio and displacing
other cache entries...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

25 September 2014, 14:24:50

On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> That leads me to wonder: Have you measured different, lower, number of
> buffer mapping locks? 128 locks is, if we'd as we should align them
> properly, 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128.  I haven't verified that myself, but have no reason to doubt
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

25 September 2014, 14:24:59

On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > That leads me to wonder: Have you measured different, lower, number of
> > buffer mapping locks? 128 locks is, if we'd as we should align them
> > properly, 8KB of memory. Common L1 cache sizes are around 32k...
> 
> Amit has some results upthread showing 64 being good, but not as good
> as 128.  I haven't verified that myself, but have no reason to doubt
> it.

How about you push the spinlock change and I crosscheck the partition
number on a multi socket x86 machine? Seems worthwile to make sure that
it doesn't cause problems on x86. I seriously doubt it'll, but ...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Merlin Moncure

Date:

25 September 2014, 14:35:07

On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Why stop at 128 mapping locks?   Theoretical downsides to having more
>> mapping locks have been mentioned a few times but has this ever been
>> measured?  I'm starting to wonder if the # mapping locks should be
>> dependent on some other value, perhaps the # of shared bufffers...
>
> Wrong way round. You need to prove the upside of increasing it further,
> not the contrary. The primary downside is cache hit ratio and displacing
> other cache entries...

I can't do that because I don't have the hardware.  I wasn't
suggesting to just set it but to measure the affects of setting it.
But the benefits from going from 16 to 128 are pretty significant at
least on this hardware; I'm curious how much further it can be
pushed...what's wrong with trying it out?

merlin

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

25 September 2014, 14:40:32

On 2014-09-25 10:09:30 -0400, Robert Haas wrote:
> I think the long-term solution here is that we need a lock-free hash
> table implementation for our buffer mapping tables, because I'm pretty
> sure that just cranking the number of locks up and up is going to
> start to have unpleasant side effects at some point.  We may be able
> to buy a few more years by just cranking it up, though.

I think mid to long term we actually need something else than a
hashtable. Capable of efficiently looking for the existance of
'neighboring' buffers so we can intelligently prefetch far enough that
the read actually completes when we get there. Also I'm pretty sure that
we'll need a way to efficiently remove all buffers for a relfilenode
from shared buffers - linearly scanning for that isn't a good
solution. So I think we need a different data structure.

I've played a bit around with just replacing buf_table.c with a custom
handrolled hashtable because I've seen more than one production workload
where hash_search_with_hash_value() is both cpu and cache miss wise
top#1 of profiles. With most calls coming from the buffer mapping and
then from the lock manager.

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.

The aforementioned replacement hash table was a good bit faster for
fully cached workloads - but at the time I wrote I could still make it
crash in very high cache pressure workloads, so that should be taken
with a fair bit of salt.

I think we can comparatively easily get rid of the tablespace in buffer
tags. Getting rid of the database already would be a fair bit harder. I
haven't really managed to get an idea how to remove the fork number
without making the catalog much more complicated.  I don't think we can
go too long without at least some of these steps :(.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

25 September 2014, 14:42:37

On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
>> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > That leads me to wonder: Have you measured different, lower, number of
>> > buffer mapping locks? 128 locks is, if we'd as we should align them
>> > properly, 8KB of memory. Common L1 cache sizes are around 32k...
>>
>> Amit has some results upthread showing 64 being good, but not as good
>> as 128.  I haven't verified that myself, but have no reason to doubt
>> it.
>
> How about you push the spinlock change and I crosscheck the partition
> number on a multi socket x86 machine? Seems worthwile to make sure that
> it doesn't cause problems on x86. I seriously doubt it'll, but ...

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

25 September 2014, 14:48:06

On 2014-09-25 09:34:57 -0500, Merlin Moncure wrote:
> On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Why stop at 128 mapping locks?   Theoretical downsides to having more
> >> mapping locks have been mentioned a few times but has this ever been
> >> measured?  I'm starting to wonder if the # mapping locks should be
> >> dependent on some other value, perhaps the # of shared bufffers...
> >
> > Wrong way round. You need to prove the upside of increasing it further,
> > not the contrary. The primary downside is cache hit ratio and displacing
> > other cache entries...
> 
> I can't do that because I don't have the hardware.

One interesting part of this is making sure it doesn't regress
older/smaller machines. So at least that side you could check...

> what's wrong with trying it out?

If somebody is willing to do it: nothing. I'd just much rather do the,
by now proven, simple change before starting with more complex
solutions.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

26 September 2014, 11:40:24

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> But this gets at another point: the way we're benchmarking this right
> now, we're really conflating the effects of three different things:
>
> 1. Changing the locking regimen around the freelist and clocksweep.
> 2. Adding a bgreclaimer process.
> 3. Raising the number of buffer locking partitions.

First of all thanks for committing part-1 of this changes and it

seems you are planing to commit part-3 based on results of tests

which Andres is planing to do and for remaining part (part-2), today

I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count	16	32	64	128
reduce-replacement-locking.patch + 128 Buf Partitions	157732	229547	271536	245295
scalable_buffer_eviction_v9.patch	163762	230753	275147	248309

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count	16	32	64	128
reduce-replacement-locking.patch + 128 Buf Partitions	157781	212134	202209	171176
scalable_buffer_eviction_v9.patch	160301	213922	208680	172720

The results indicates that in all cases there is benefit by doing

part-2 (bgreclaimer). Though the benefit at this configuration is

not high, but might be at some higher configurations

(scale factor - 10000) there is more benefit. Do you see any merit

in pursuing further to accomplish part-2 as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Heikki Linnakangas

Date:

26 September 2014, 12:05:37

On 09/25/2014 05:40 PM, Andres Freund wrote:
> There's two reasons for that: a) dynahash just isn't very good and it
> does a lot of things that will never be necessary for these hashes. b)
> the key into the hash table is*far*  too wide. A significant portion of
> the time is spent comparing buffer/lock tags.

Hmm. Is it the comparing, or calculating the hash? We could precalculate 
the hash for RelFileNode+ForkNumber, and store it RelationData. At a 
lookup, you'd only need to mix in the block number.

- Heikki

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

26 September 2014, 12:26:53

On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
> On 09/25/2014 05:40 PM, Andres Freund wrote:
> >There's two reasons for that: a) dynahash just isn't very good and it
> >does a lot of things that will never be necessary for these hashes. b)
> >the key into the hash table is*far*  too wide. A significant portion of
> >the time is spent comparing buffer/lock tags.
> 
> Hmm. Is it the comparing, or calculating the hash?

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Besides making the hashtable more efficent, a smaller key (say, 4 byte
relfilenode, 4 byte blocknumber) would also make using a radix tree or
similar more realistic. I've prototyped that once and it has nice
properties, but the tree is too deep.  Obviousy it'd also help making
buffer descriptors smaller, which is also good from a cache efficiency
perspective...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

26 September 2014, 13:34:26

On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today
I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128
reduce-replacement-locking.patch + 128 Buf Partitions 157732 229547 271536 245295
scalable_buffer_eviction_v9.patch 163762 230753 275147 248309

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128
reduce-replacement-locking.patch + 128 Buf Partitions 157781 212134 202209 171176
scalable_buffer_eviction_v9.patch 160301 213922 208680 172720

The results indicates that in all cases there is benefit by doing
part-2 (bgreclaimer). Though the benefit at this configuration is
not high, but might be at some higher configurations
(scale factor - 10000) there is more benefit. Do you see any merit
in pursuing further to accomplish part-2 as well?

Interesting results. Thanks for gathering this data.

If this is the best we can do with the bgreclaimer, I think the case for pursuing it is somewhat marginal. The biggest jump you've got seems to be at scale factor 3000 with 64 clients, where you picked up about 4%. 4% isn't nothing, but it's not a lot, either. On the other hand, this might not be the best we can do. There may be further improvements to bgreclaimer that make the benefit larger.

Backing up a it, to what extent have we actually solved the problem here? If we had perfectly removed all of the scalability bottlenecks, what would we expect to see? You didn't say which machine this testing was done on, or how many cores it had, but for example on the IBM POWER7 machine, we probably wouldn't expect the throughput at 64 clients to be 4 times the throughput at 16 cores because up to 16 clients each one can have a full CPU core, whereas after that and out to 64 each is getting a hardware thread, which is not quite as good. Still, we'd expect performance to go up, or at least not go down. Your data shows a characteristic performance knee: between 16 and 32 clients we go up, but then between 32 and 64 we go down, and between 64 and 128 we go down more. You haven't got enough data points there to show very precisely where the knee is, but unless you tested this on a smaller box than what you have been using, we're certainly hitting the knee sometime before we run out of physical cores. That implies a remaining contention bottleneck.

My results from yesterday were a bit different. I tested 1 client, 8 clients, and multiples of 16 clients out to 96. With reduce-replacement-locking.patch + 128 buffer mapping partitions, performance continued to rise all the way out to 96 clients. It definitely wasn't linearly, but it went up, not down. I don't know why this is different from what you are seeing. Anyway there's a little more ambiguity there about how much contention remains, but my bet is that there is at least some contention that we could still hope to remove. We need to understand where that contention is. Are the buffer mapping locks still contended? Is the new spinlock contended? Are there other contention points? I won't be surprised if it turns out that the contention is on the new spinlock and that a proper design for bgreclaimer is the best way to remove that contention .... but I also won't be surprised if it turns out that there are bigger wins elsewhere. So I think you should try to figure out where the remaining contention is first, and then we can discuss what to do about it.

On another point, I think it would be a good idea to rebase the bgreclaimer patch over what I committed, so that we have a clean patch against master to test with.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Heikki Linnakangas

Date:

26 September 2014, 13:48:14

On 09/26/2014 03:26 PM, Andres Freund wrote:
> On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
>> On 09/25/2014 05:40 PM, Andres Freund wrote:
>>> There's two reasons for that: a) dynahash just isn't very good and it
>>> does a lot of things that will never be necessary for these hashes. b)
>>> the key into the hash table is*far*  too wide. A significant portion of
>>> the time is spent comparing buffer/lock tags.
>>
>> Hmm. Is it the comparing, or calculating the hash?
>
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

Hmm. We could provide a custom compare function instead of relying on 
memcmp. We can do somewhat better than generic memcmo when we know that 
the BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and 
it's always exactly 20 bytes. I wonder if you're actually just seeing a 
cache miss showing up in the profile, though.

- Heikki

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

26 September 2014, 13:59:46

On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

Would it be faster to test the relfilenode first, and then test the
rest only if that matches?

One idea for improving this is to get rid of relation forks.  Like all
true PostgreSQL patriots, I love the free space map and the visibility
map passionately, and I believe that those features are one of the
biggest contributors to the success of the project over the last
half-decade.  But I'd love them even more if they didn't triple our
rate of inode consumption and bloat our buffer tags.  More, it's just
not an extensible mechanism: too many things have to loop over all
forks, and it just doesn't scale to keep adding more of them.  If we
added a metapage to each heap, we could have the FSM and VM have their
own relfilenode and just have the heap point at them.  Or (maybe
better still) we could store the data in the heap itself.

It would be a lot of work, though.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Ants Aasma

Date:

26 September 2014, 14:02:07

On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple
win would be to have a buffer tag specific comparison function. I mean
something like the attached patch. This way we avoid the loop counter
overhead, can check the blocknum first and possibly have better branch
prediction.

Do you have a workload where I could test if this helps alleviate the
comparison overhead?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Attachment

optimized-buffer-tag-cmp.patch

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

26 September 2014, 14:31:55

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today
I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128
reduce-replacement-locking.patch + 128 Buf Partitions 157732 229547 271536 245295
scalable_buffer_eviction_v9.patch 163762 230753 275147 248309

Scale Factor - 3000, Shared_buffer - 8GB

Typo here Scale Factor - 3000, Shared_buffer - 2GB

Patch_Ver/Client_Count 16 32 64 128
reduce-replacement-locking.patch + 128 Buf Partitions 157781 212134 202209 171176
scalable_buffer_eviction_v9.patch 160301 213922 208680 172720

The results indicates that in all cases there is benefit by doing
part-2 (bgreclaimer). Though the benefit at this configuration is
not high, but might be at some higher configurations
(scale factor - 10000) there is more benefit. Do you see any merit
in pursuing further to accomplish part-2 as well?

Interesting results. Thanks for gathering this data.

One more point I have missed is that above data is using

"-M prepared" option of pgbench.

If this is the best we can do with the bgreclaimer, I think the case for pursuing it is somewhat marginal. The biggest jump you've got seems to be at scale factor 3000 with 64 clients, where you picked up about 4%. 4% isn't nothing, but it's not a lot, either. On the other hand, this might not be the best we can do. There may be further improvements to bgreclaimer that make the benefit larger.

Backing up a it, to what extent have we actually solved the problem here? If we had perfectly removed all of the scalability bottlenecks, what would we expect to see? You didn't say which machine this testing was done on

It was IBM POWER7

Sorry, I should have mentioned it.

, or how many cores it had, but for example on the IBM POWER7 machine, we probably wouldn't expect the throughput at 64 clients to be 4 times the throughput at 16 cores because up to 16 clients each one can have a full CPU core, whereas after that and out to 64 each is getting a hardware thread, which is not quite as good. Still, we'd expect performance to go up, or at least not go down. Your data shows a characteristic performance knee: between 16 and 32 clients we go up, but then between 32 and 64 we go down,

Here another point worth noting is that it goes down between

32 and 64 when shared_buffers is 2GB, however when

shared_buffers are 8GB it doesn't go down between 32 and 64.

and between 64 and 128 we go down more. You haven't got enough data points there to show very precisely where the knee is, but unless you tested this on a smaller box than what you have been using, we're certainly hitting the knee sometime before we run out of physical cores. That implies a remaining contention bottleneck.

My results from yesterday were a bit different. I tested 1 client, 8 clients, and multiples of 16 clients out to 96. With reduce-replacement-locking.patch + 128 buffer mapping partitions, performance continued to rise all the way out to 96 clients. It definitely wasn't linearly, but it went up, not down. I don't know why this is different from what you are seeing.

I think it is almost same if we consider same configuration

(scale_factor - 3000, shared_buffer - 8GB).

Anyway there's a little more ambiguity there about how much contention remains, but my bet is that there is at least some contention that we could still hope to remove. We need to understand where that contention is. Are the buffer mapping locks still contended? Is the new spinlock contended? Are there other contention points? I won't be surprised if it turns out that the contention is on the new spinlock and that a proper design for bgreclaimer is the best way to remove that contention .... but I also won't be surprised if it turns out that there are bigger wins elsewhere. So I think you should try to figure out where the remaining contention is first, and then we can discuss what to do about it.

Make sense.

On another point, I think it would be a good idea to rebase the bgreclaimer patch over what I committed, so that we have a clean patch against master to test with.

I think this also makes sense, however I think it is better to first

see where is the bottleneck.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

26 September 2014, 14:37:35

On 2014-09-26 17:01:52 +0300, Ants Aasma wrote:
> On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Neither, really. The hash calculation is visible in the profile, but not
> > that pronounced yet. The primary thing noticeable in profiles (besides
> > cache efficiency) is the comparison of the full tag after locating a
> > possible match in a bucket. 20 byte memcmp's aren't free.
> 
> I'm not arguing against a more efficient hash table, but one simple
> win would be to have a buffer tag specific comparison function. I mean
> something like the attached patch. This way we avoid the loop counter
> overhead, can check the blocknum first and possibly have better branch
> prediction.

Heh. Yea. As I wrote to Heikki, 64bit compares were the thing showing
most benefits - at least with my own hashtable implementation.

> Do you have a workload where I could test if this helps alleviate the
> comparison overhead?

Fully cached readonly pgbench workloads with -M prepared already show
it. But it gets more pronounced for workload that access buffers at a
higher frequency.

At two customers I've seen this really badly in the profile because they
have OLTP statements that some index nested loops. Often looking the
same buffer up *over and over*. There often isn't a better plan (with
current pg join support at least) for joining a somewhat small number of
rows out of a large table to small/mid sized table.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

26 September 2014, 14:37:36

On 2014-09-26 16:47:55 +0300, Heikki Linnakangas wrote:
> On 09/26/2014 03:26 PM, Andres Freund wrote:
> >On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
> >>On 09/25/2014 05:40 PM, Andres Freund wrote:
> >>>There's two reasons for that: a) dynahash just isn't very good and it
> >>>does a lot of things that will never be necessary for these hashes. b)
> >>>the key into the hash table is*far*  too wide. A significant portion of
> >>>the time is spent comparing buffer/lock tags.
> >>
> >>Hmm. Is it the comparing, or calculating the hash?
> >
> >Neither, really. The hash calculation is visible in the profile, but not
> >that pronounced yet. The primary thing noticeable in profiles (besides
> >cache efficiency) is the comparison of the full tag after locating a
> >possible match in a bucket. 20 byte memcmp's aren't free.
> 
> Hmm. We could provide a custom compare function instead of relying on
> memcmp. We can do somewhat better than generic memcmo when we know that the
> BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and it's
> always exactly 20 bytes.

That might give a little benefit. I haven't experimented with that with
dynahash.c. I've compared memcmp() and custom comparison with my own
hashtable and there were some differences, but neglegible. The biggest
was using 64bit compares. Either way, it all ends up being rather branch
heavy with high misprediction rates.

> I wonder if you're actually just seeing a cache miss showing up in the
> profile, though.

I don't think so. I hacked (by moving it to the end of
RelfileNode/BufferTag and comparing only the front port) the tablespace
out of buffer tags and that produced measurable benefits.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

26 September 2014, 14:41:55

On 2014-09-26 09:59:41 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Neither, really. The hash calculation is visible in the profile, but not
> > that pronounced yet. The primary thing noticeable in profiles (besides
> > cache efficiency) is the comparison of the full tag after locating a
> > possible match in a bucket. 20 byte memcmp's aren't free.
> 
> Would it be faster to test the relfilenode first, and then test the
> rest only if that matches?

I tried that and I couldn't see that much benefit. Check my message to
Heikki.

> One idea for improving this is to get rid of relation forks.

I think that's the hardest end to start from. A cool goal, but
hard. Getting rid of the tablespace sound comparatively simple. And even
getting rid of the database in the buffer tag seems to be simpler
although already pretty hard.

> Like all
> true PostgreSQL patriots, I love the free space map and the visibility
> map passionately, and I believe that those features are one of the
> biggest contributors to the success of the project over the last
> half-decade.  But I'd love them even more if they didn't triple our
> rate of inode consumption and bloat our buffer tags.  More, it's just
> not an extensible mechanism: too many things have to loop over all
> forks, and it just doesn't scale to keep adding more of them.  If we
> added a metapage to each heap, we could have the FSM and VM have their
> own relfilenode and just have the heap point at them.  Or (maybe
> better still) we could store the data in the heap itself.
> 
> It would be a lot of work, though.  :-(

Yea, it's really hard. And nearly impossible to do without breaking
binary compatibility.

What I've been wondering about was to give individual forks their own
relfilenodes and manage them via columns in pg_class. But that's also a
heck of a lot of work and gets complicated for unlogged relations
because we need to access those during recovery when we don't have
catalog access yet and can't access all databases anyway.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Heikki Linnakangas

Date:

27 September 2014, 07:29:50

Part of this patch was already committed, and the overall patch has had 
its fair share of review for this commitfest, so I'm marking this as 
"Returned with feedback". The benchmark results for the bgreclaimer 
showed a fairly small improvement, so it doesn't seem like anyone's 
going to commit the rest of the patch soon, not without more discussion 
and testing anyway. Let's not hold up the commitfest for that.

- Heikki

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

01 October 2014, 18:54:49

On 2014-09-25 16:50:44 +0200, Andres Freund wrote:
> On 2014-09-25 10:44:40 -0400, Robert Haas wrote:
> > On Thu, Sep 25, 2014 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > >> On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
> > >>> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > >>> > That leads me to wonder: Have you measured different, lower, number of
> > >>> > buffer mapping locks? 128 locks is, if we'd as we should align them
> > >>> > properly, 8KB of memory. Common L1 cache sizes are around 32k...
> > >>>
> > >>> Amit has some results upthread showing 64 being good, but not as good
> > >>> as 128.  I haven't verified that myself, but have no reason to doubt
> > >>> it.
> > >>
> > >> How about you push the spinlock change and I crosscheck the partition
> > >> number on a multi socket x86 machine? Seems worthwile to make sure that
> > >> it doesn't cause problems on x86. I seriously doubt it'll, but ...
> > >
> > > OK.
> > 
> > Another thought is that we should test what impact your atomics-based
> > lwlocks have on this.
> 
> Yes, I'd planned to test that as well. I think that it will noticeably
> reduce the need to increase the number of partitions for workloads that
> fit into shared_buffers. But it won't do much about exclusive
> acquirations of the buffer mapping locks. So I think there's independent
> benefit of increasing the number.

Here we go.

Postgres was configured with.-c shared_buffers=8GB \-c log_line_prefix="[%m %p] " \-c log_min_messages=debug1 \-p 5440
\-ccheckpoint_segments=600-c max_connections=200
 

Each individual measurement (#TPS) is the result of a
pgbench -h /tmp/ -p 5440 postgres -n -M prepared -c $clients -j $clients -S -T 10
run.

Master is as of ef8863844bb0b0dab7b92c5f278302a42b4bf05a.

First, a scale 200 run. That fits entirely into shared_buffers:

#scale          #client         #partitions             #TPS
200             1               16                      8353.547724     8145.296655     8263.295459
200             16              16                      171014.763118   193971.091518   133992.128348
200             32              16                      259119.988034   234619.421322   201879.618322
200             64              16                      178909.038670   179425.091562   181391.354613
200             96              16                      141402.895201   138392.705402   137216.416951
200             128             16                      125643.089677   124465.288860   122527.209125

(other runs here stricken, they were contorted due some concurrent
activity. But nothing interesting).

So, there's quite some variation in here. Not very surprising given the
short runtimes, but still.

Looking at a profile nearly all the contention is around
GetSnapshotData(). That might hide the interesting scalability effects
of the partition number. So I next tried my rwlock-contention branch.

#scale          #client         #partitions             #TPS
200             1               1                       8540.390223     8285.628397     8497.022656
200             16              1                       136875.484896   164302.769380   172053.413980
200             32              1                       308624.650724   240502.019046   260825.231470
200             64              1                       453004.188676   406226.943046   406973.325822
200             96              1                       442608.459701   450185.431848   445549.710907
200             128             1                       487138.077973   496233.594356   457877.992783

200             1               16                      9477.217454     8181.098317     8457.276961
200             16              16                      154224.573476   170238.637315   182941.035416
200             32              16                      302230.215403   285124.708236   265917.729628
200             64              16                      405151.647136   443473.797835   456072.782722
200             96              16                      443360.377281   457164.981119   474049.685940
200             128             16                      490616.257063   458273.380238   466429.948417

200             1               64                      8410.981874     11554.708966    8359.294710
200             16              64                      139378.312883   168398.919590   166184.744944
200             32              64                      288657.701012   283588.901083   302241.706222
200             64              64                      424838.919754   416926.779367   436848.292520
200             96              64                      462352.017671   446384.114441   483332.592663
200             128             64                      471578.594596   488862.395621   466692.726385

200             1               128                     8350.274549     8140.699687     8305.975703
200             16              128                     144553.966808   154711.927715   202437.837908
200             32              128                     290193.349170   213242.292597   261016.779185
200             64              128                     413792.389493   431267.716855   456587.450294
200             96              128                     490459.212833   456375.442210   496430.996055
200             128             128                     470067.179360   464513.801884   483485.000502

Not much there either.

So, on to the next scale, 1000. That doesn't fit into s_b anymore.

master:
#scale          #client         #partitions             #TPS
1000            1               1                       7378.370717     7110.988121     7164.977746
1000            16              1                       66439.037413    85151.814130    85047.296626
1000            32              1                       71505.487093    75687.291060    69803.895496
1000            64              1                       42148.071099    41934.631603    43253.528849
1000            96              1                       33760.812746    33969.800564    33598.640121
1000            128             1                       30382.414165    30047.284982    30144.576494

1000            1               16                      7228.883843     9479.793813     7217.657145
1000            16              16                      105203.710528   112375.187471   110919.986283
1000            32              16                      146294.286762   145391.938025   144620.709764
1000            64              16                      134411.772164   134536.943367   136196.793573
1000            96              16                      107626.878208   105289.783922   96480.468107
1000            128             16                      92597.909379    86128.040557    92417.727720

1000            1               64                      7130.392436     12801.641683    7019.999330
1000            16              64                      120180.196384   125319.373819   126137.930478
1000            32              64                      181876.697461   190578.106760   189412.973015
1000            64              64                      216233.590299   222561.774501   225802.194056
1000            96              64                      171928.358031   165922.395721   168283.712990
1000            128             64                      139303.139631   137564.877450   141534.449640

1000            1               128                     8215.702354     7209.520152     7026.888706
1000            16              128                     116196.740200   123018.284948   127045.761518
1000            32              128                     183391.488566   185428.757458   185732.926794
1000            64              128                     218547.133675   218096.002473   208679.436158
1000            96              128                     155209.830821   156327.200412   157542.582637
1000            128             128                     131127.769076   132084.933955   124706.336737

rwlock:
#scale          #client         #partitions             #TPS
1000            1               1                       7377.270393     7494.260136     7207.898866
1000            16              1                       79289.755569    88032.480145    86810.772569
1000            32              1                       83006.336151    88961.964680    88508.832253
1000            64              1                       44135.036648    46582.727314    45119.421278
1000            96              1                       35036.174438    35687.025568    35469.127697
1000            128             1                       30597.870830    30782.335225    30342.454439

1000            1               16                      7114.602838     7265.863826     7205.225737
1000            16              16                      128507.292054   131868.678603   124507.097065
1000            32              16                      212779.122153   185666.608338   210714.373254
1000            64              16                      239776.079534   239923.393293   242476.922423
1000            96              16                      169240.934839   166021.430680   169187.643644
1000            128             16                      136601.409985   139340.961857   141731.068752

1000            1               64                      13271.722885    11348.028311    12531.188689
1000            16              64                      129074.053482   125334.720264   125140.499619
1000            32              64                      198405.463848   196605.923684   198354.818005
1000            64              64                      250463.474112   249543.622897   251517.159399
1000            96              64                      251715.751133   254168.028451   251502.783058
1000            128             64                      243596.368933   234671.592026   239123.259642

1000            1               128                     7376.371403     7301.077478     7240.526379
1000            16              128                     127992.070372   133537.637394   123382.418747
1000            32              128                     185807.703422   194303.674428   184919.586634
1000            64              128                     270233.496350   271576.483715   262281.662510
1000            96              128                     266023.529574   272484.352878   271921.597420
1000            128             128                     260004.301457   266710.469926   263713.245868


Based on this I think we can fairly conclude that increasing the number
of partitions is quite the win on larger x86 machines too. Independent
of the rwlock patch, although it moves the contention points to some
degree.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

01 October 2014, 19:04:37

On 2014-10-01 20:54:39 +0200, Andres Freund wrote:
> Here we go.
> 
> Postgres was configured with.
>  -c shared_buffers=8GB \
>  -c log_line_prefix="[%m %p] " \
>  -c log_min_messages=debug1 \
>  -p 5440 \
>  -c checkpoint_segments=600
>  -c max_connections=200

Robert reminded me that I missed to report the hardware aspect of
this...

4x E5-4620  @ 2.20GHz: 32 cores, 64 threads
256GB RAM

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

02 October 2014, 14:36:32

On 2014-09-25 10:42:29 -0400, Robert Haas wrote:
> On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
> >> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > That leads me to wonder: Have you measured different, lower, number of
> >> > buffer mapping locks? 128 locks is, if we'd as we should align them
> >> > properly, 8KB of memory. Common L1 cache sizes are around 32k...
> >>
> >> Amit has some results upthread showing 64 being good, but not as good
> >> as 128.  I haven't verified that myself, but have no reason to doubt
> >> it.
> >
> > How about you push the spinlock change and I crosscheck the partition
> > number on a multi socket x86 machine? Seems worthwile to make sure that
> > it doesn't cause problems on x86. I seriously doubt it'll, but ...
> 
> OK.

Given that the results look good, do you plan to push this?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

02 October 2014, 14:40:38

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> OK.
>
> Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

If so, and if you don't have any reservations, yeah I'll go change it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

02 October 2014, 14:44:34

On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> OK.
> >
> > Given that the results look good, do you plan to push this?
> 
> By "this", you mean the increase in the number of buffer mapping
> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like
#define MAX_SIMUL_LWLOCKS    (NUM_BUFFER_PARTITIONS + 64)
or something like that?

> If so, and if you don't have any reservations, yeah I'll go change it.

Yes, I'm happy with going forward.


Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

02 October 2014, 14:56:16

On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
>> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> OK.
>> >
>> > Given that the results look good, do you plan to push this?
>>
>> By "this", you mean the increase in the number of buffer mapping
>> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?
>
> Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like
> #define MAX_SIMUL_LWLOCKS       (NUM_BUFFER_PARTITIONS + 64)
> or something like that?

Nah.  That assumes NUM_BUFFER_PARTITIONS will always be the biggest
thing, and I don't see any reason to assume that, even if we're making
it true for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

02 October 2014, 15:21:07

On 2014-10-02 10:56:05 -0400, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
> >> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> >> OK.
> >> >
> >> > Given that the results look good, do you plan to push this?
> >>
> >> By "this", you mean the increase in the number of buffer mapping
> >> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?
> >
> > Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like
> > #define MAX_SIMUL_LWLOCKS       (NUM_BUFFER_PARTITIONS + 64)
> > or something like that?
> 
> Nah.  That assumes NUM_BUFFER_PARTITIONS will always be the biggest
> thing, and I don't see any reason to assume that, even if we're making
> it true for now.

The reason I'm suggesting is that NUM_BUFFER_PARTITIONS (and
NUM_LOCK_PARTITIONS) are the cases where we can expect many lwlocks to
be held at the same time. It doesn't seem friendly to users
experimenting with changing this to know about a define that's private
to lwlock.c.
But I'm fine with doing this not at all or separately - there's no need
to actually do it together. 

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Heikki Linnakangas

Date:

02 October 2014, 17:05:44

On 10/02/2014 05:40 PM, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> OK.
>>
>> Given that the results look good, do you plan to push this?
>
> By "this", you mean the increase in the number of buffer mapping
> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Hmm, do we actually ever need to hold all the buffer partition locks at 
the same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, 
I couldn't find any place where we'd do that. I bumped up 
NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did 
"make check". It passed.

- Heikki

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

02 October 2014, 17:07:59

On 2014-10-02 20:04:58 +0300, Heikki Linnakangas wrote:
> On 10/02/2014 05:40 PM, Robert Haas wrote:
> >On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >>>OK.
> >>
> >>Given that the results look good, do you plan to push this?
> >
> >By "this", you mean the increase in the number of buffer mapping
> >partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?
> 
> Hmm, do we actually ever need to hold all the buffer partition locks at the
> same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, I
> couldn't find any place where we'd do that. I bumped up
> NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did
> "make check". It passed.

Do a make check-world and it'll hopefully fail ;). Check
pg_buffercache_pages.c.

I'd actually quite like to have a pg_buffercache version that, at least
optionally, doesn't do this, but that's a separate thing.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Robert Haas

Date:

02 October 2014, 19:22:22

On Thu, Oct 2, 2014 at 1:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Do a make check-world and it'll hopefully fail ;). Check
> pg_buffercache_pages.c.

Yep.  Committed, with an update to the comments in lwlock.c to allude
to the pg_buffercache issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

09 October 2014, 12:47:20

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On another point, I think it would be a good idea to rebase the
> bgreclaimer patch over what I committed, so that we have a
> clean patch against master to test with.

Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.

Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

max_connections =300
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 5000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

Below data is median of 3 runs.

patch_ver/client_count	1	8	32	64	128	256
HEAD	18884	118628	251093	216294	186625	177505
PATCH	18743	122578	247243	205521	179712	175031

Here we can see that the performance dips at higher client

count(>=32) which was quite surprising for me, as I was expecting

it to improve, because bgreclaimer reduces the contention by making

buffers available on free list. So I tried to analyze the situation by

using perf and found that in above configuration, there is a contention

around freelist spinlock with HEAD and the same is removed by Patch,

but still the performance goes down with Patch. On further analysis, I

observed that actually after Patch there is an increase in contention

around ProcArrayLock (shared LWlock) via GetSnapshotData which

sounds bit odd, but that's what I can see in profiles. Based on analysis,

few ideas which I would like to further investigate are:

a. As there is an increase in spinlock contention, I would like to check

with Andres's latest patch which reduces contention around shared

lwlocks.

b. Reduce some instructions added by patch in StrategyGetBuffer(),

like instead of awakening bgreclaimer at low threshold, awaken when

it tries to do clock sweep.

Thoughts?

Below is the profile data for 64 and 128 client count:

Head - 64 client count

-------------------------------------

+ 8.93% postgres postgres [.] s_lock

7.83% swapper [unknown] [H] 0x00000000011bc5ac

+ 3.09% postgres postgres [.] GetSnapshotData

+ 3.06% postgres postgres [.] tas

+ 2.49% postgres postgres [.] AllocSetAlloc

+ 2.43% postgres postgres [.] hash_search_with_hash_value

+ 2.13% postgres postgres [.] _bt_compare

Detailed Data

------------------------

- 8.93% postgres postgres [.] s_lock

- s_lock

- 4.97% s_lock

- 1.63% StrategyGetBuffer

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

- 1.63% ReleaseAndReadBuffer

- 0.93% index_fetch_heap

- index_getnext

- 0.93% IndexNext

ExecScanFetch

- 0.69% _bt_relandgetbuf

_bt_search

_bt_first

btgettuple

- index_getnext

- 0.69% IndexNext

ExecScanFetch

- 1.39% LWLockAcquireCommon

- LWLockAcquire

- 1.38% GetSnapshotData

- GetTransactionSnapshot

- 0.70% exec_bind_message

- 0.68% PortalStart

exec_bind_message

- 1.37% LWLockRelease

- 1.37% GetSnapshotData

- GetTransactionSnapshot

- 0.69% exec_bind_message

- 0.68% PortalStart

exec_bind_message

PostgresMain

- 1.07% StrategyGetBuffer

- 1.06% StrategyGetBuffer

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

- 1.06% ReleaseAndReadBuffer

- 0.62% index_fetch_heap

index_getnext

- 0.95% LWLockAcquireCommon

- LWLockAcquire

- 0.90% GetSnapshotData

GetTransactionSnapshot

- 0.94% LWLockRelease

- 0.90% GetSnapshotData

GetTransactionSnapshot

7.83% swapper [unknown] [H] 0x00000000011bc5ac

- 3.09% postgres postgres [.] GetSnapshotData

- GetSnapshotData

- 3.06% GetSnapshotData

- 3.06% GetTransactionSnapshot

- 1.54% PortalStart

exec_bind_message

move_buffers_to_freelist_by_bgreclaimer_v1 - 64 Client count

----------------------------------------------------------------------------------------------

+ 11.52% postgres postgres [.] s_lock

7.57% swapper [unknown] [H] 0x00000000011d9034

+ 3.54% postgres postgres [.] tas

+ 3.02% postgres postgres [.] GetSnapshotData

+ 2.47% postgres postgres [.] hash_search_with_hash_value

+ 2.33% postgres postgres [.] AllocSetAlloc

+ 2.03% postgres postgres [.] _bt_compare

+ 1.89% postgres postgres [.] calc_bucket

Detailed Data

---------------------

- 11.52% postgres postgres [.] s_lock

- s_lock

- 6.57% s_lock

- 2.72% LWLockAcquireCommon

- LWLockAcquire

- 2.71% GetSnapshotData

- GetTransactionSnapshot

- 1.38% exec_bind_message

- 1.33% PortalStart

exec_bind_message

- 2.69% LWLockRelease

- 2.69% GetSnapshotData

- GetTransactionSnapshot

- 1.35% exec_bind_message

PostgresMain

- 1.34% PortalStart

exec_bind_message

- 1.65% LWLockAcquireCommon

- LWLockAcquire

- 1.59% GetSnapshotData

- GetTransactionSnapshot

- 0.80% exec_bind_message

PostgresMain

- 0.79% PortalStart

exec_bind_message

- 0.79% PortalStart

exec_bind_message

- 1.62% LWLockRelease

- 1.58% GetSnapshotData

- GetTransactionSnapshot

- 0.79% exec_bind_message

PostgresMain

- 0.79% PortalStart

exec_bind_message

PostgresMain

- 0.63% hash_search_with_hash_value

BufTableDelete

BufferAlloc

- 0.59% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

Head - 128 Client count

---------------------------------------

+ 18.39% postgres postgres [.] s_lock

6.72% swapper [unknown] [H] 0x00000000011bc390

+ 3.37% postgres postgres [.] GetSnapshotData

+ 2.11% postgres postgres [.] tas

+ 2.05% postgres postgres [.] tas

+ 1.82% postgres postgres [.] hash_search_with_hash_value

+ 1.77% postgres postgres [.] AllocSetAlloc

1.52% postgres [unknown] [H] 0x00000000012fdc00

+ 1.45% postgres postgres [.] tas

+ 1.42% postgres postgres [.] _bt_compare

- 18.39% postgres postgres [.] s_lock

- s_lock

- 12.35% s_lock

- 7.52% StrategyGetBuffer

BufferAlloc

ReadBuffer_common

- 1.86% LWLockAcquireCommon

- LWLockAcquire

- 1.83% GetSnapshotData

- GetTransactionSnapshot

- 0.95% exec_bind_message

- 0.88% PortalStart

exec_bind_message

- 1.78% LWLockRelease

- 1.76% GetSnapshotData

- GetTransactionSnapshot

- 0.91% exec_bind_message

- 0.86% PortalStart

exec_bind_message

- 0.60% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

- 0.58% hash_search_with_hash_value

- 0.58% BufTableDelete

BufferAlloc

- 3.18% StrategyGetBuffer

BufferAlloc

- 0.88% LWLockAcquireCommon

- 0.87% LWLockAcquireCommon

- LWLockAcquire

- 0.81% GetSnapshotData

GetTransactionSnapshot

- 0.84% LWLockRelease

- 0.83% LWLockRelease

- 0.79% GetSnapshotData

GetTransactionSnapshot

- 0.55% hash_search_with_hash_value

BufTableDelete

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

0.55% ReleaseAndReadBuffer

- 0.54% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

0.54% ReleaseAndReadBuffer

move_buffers_to_freelist_by_bgreclaimer_v1 - 128 Client count

----------------------------------------------------------------------------------------------

+ 13.64% postgres postgres [.] s_lock

8.19% swapper [unknown] [H] 0x0000000000000c04

+ 3.62% postgres postgres [.] GetSnapshotData

+ 2.40% postgres postgres [.] calc_bucket

+ 2.38% postgres postgres [.] tas

+ 2.38% postgres postgres [.] hash_search_with_hash_value

2.02% postgres [unknown] [H] 0x0000000000000f80

+ 1.73% postgres postgres [.] AllocSetAlloc

+ 1.68% postgres postgres [.] tas

Detailed Data

-----------------------

- 13.64% postgres postgres [.] s_lock

- s_lock

- 8.76% s_lock

- 3.03% LWLockAcquireCommon

- LWLockAcquire

- 2.97% GetSnapshotData

- GetTransactionSnapshot

- 1.55% exec_bind_message

- 1.42% PortalStart

exec_bind_message

- 2.87% LWLockRelease

- 2.82% GetSnapshotData

- GetTransactionSnapshot

- 1.46% exec_bind_message

- 1.36% PortalStart

exec_bind_message

- 1.35% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

- 1.29% hash_search_with_hash_value

- 1.29% BufTableDelete

BufferAlloc

- 1.19% LWLockAcquireCommon

- LWLockAcquire

- 1.11% GetSnapshotData

- GetTransactionSnapshot

- 0.56% exec_bind_message

PostgresMain

- 0.55% PortalStart

exec_bind_message

- 1.15% LWLockRelease

- 1.08% GetSnapshotData

- GetTransactionSnapshot

- 0.55% exec_bind_message

- 0.53% PortalStart

exec_bind_message

- 1.12% hash_search_with_hash_value

BufTableDelete

BufferAlloc

- 1.10% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

move_buffers_to_freelist_by_bgreclaimer_v1.patch

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

09 October 2014, 14:02:05

On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
> On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On another point, I think it would be a good idea to rebase the
> > bgreclaimer patch over what I committed, so that we have a
> > clean patch against master to test with.
> 
> Please find the rebased patch attached with this mail.  I have taken
> some performance data as well and done some analysis based on
> the same.
> 
> Performance Data
> ----------------------------
> IBM POWER-8 24 cores, 192 hardware threads
> RAM = 492GB
> max_connections =300
> Database Locale =C
> checkpoint_segments=256
> checkpoint_timeout    =15min
> shared_buffers=8GB
> scale factor = 5000
> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> Duration of each individual run = 5mins

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatilvely small rows *and* a uniform distribution of
access.

Try parallel COPY TO. Batch write loads is where I've seen this hurt
badly.

> patch_ver/client_count 1 8 32 64 128 256
> HEAD 18884 118628 251093 216294 186625 177505
> PATCH 18743 122578 247243 205521 179712 175031

So, pretty much no benefits on any scale, right?


> Here we can see that the performance dips at higher client
> count(>=32) which was quite surprising for me, as I was expecting
> it to improve, because bgreclaimer reduces the contention by making
> buffers available on free list.  So I tried to analyze the situation by
> using perf and found that in above configuration, there is a contention
> around freelist spinlock with HEAD and the same is removed by Patch,
> but still the performance goes down with Patch.  On further analysis, I
> observed that actually after Patch there is an increase in contention
> around ProcArrayLock (shared LWlock) via GetSnapshotData which
> sounds bit odd, but that's what I can see in profiles.  Based on analysis,
> few ideas which I would like to further investigate are:
> a.  As there is an increase in spinlock contention, I would like to check
> with Andres's latest patch which reduces contention around shared
> lwlocks.
> b.  Reduce some instructions added by patch in StrategyGetBuffer(),
> like instead of awakening bgreclaimer at low threshold, awaken when
> it tries to do clock sweep.
> 

Are you sure you didn't mix up the profiles here? The head vs. patched
look more like profiles from different client counts than different
versions of the code.


Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

09 October 2014, 14:22:13

On Thu, Oct 9, 2014 at 7:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
> > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On another point, I think it would be a good idea to rebase the
> > > bgreclaimer patch over what I committed, so that we have a
> > > clean patch against master to test with.
> >
> > Please find the rebased patch attached with this mail. I have taken
> > some performance data as well and done some analysis based on
> > the same.
> >
> > Performance Data
> > ----------------------------
> > IBM POWER-8 24 cores, 192 hardware threads
> > RAM = 492GB
> > max_connections =300
> > Database Locale =C
> > checkpoint_segments=256
> > checkpoint_timeout =15min
> > shared_buffers=8GB
> > scale factor = 5000
> > Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> > Duration of each individual run = 5mins
>
> I don't think OLTP really is the best test case for this. Especially not
> pgbench with relatilvely small rows *and* a uniform distribution of
> access.
>
> Try parallel COPY TO. Batch write loads is where I've seen this hurt
> badly.
>

> > patch_ver/client_count 1 8 32 64 128 256
> > HEAD 18884 118628 251093 216294 186625 177505
> > PATCH 18743 122578 247243 205521 179712 175031
>
> So, pretty much no benefits on any scale, right?

Almost Right, there seem to be slight benefit at client count

8, however that can be due to variation as well.

> > Here we can see that the performance dips at higher client
> > count(>=32) which was quite surprising for me, as I was expecting
> > it to improve, because bgreclaimer reduces the contention by making
> > buffers available on free list. So I tried to analyze the situation by
> > using perf and found that in above configuration, there is a contention
> > around freelist spinlock with HEAD and the same is removed by Patch,
> > but still the performance goes down with Patch. On further analysis, I
> > observed that actually after Patch there is an increase in contention
> > around ProcArrayLock (shared LWlock) via GetSnapshotData which
> > sounds bit odd, but that's what I can see in profiles. Based on analysis,
> > few ideas which I would like to further investigate are:
> > a. As there is an increase in spinlock contention, I would like to check
> > with Andres's latest patch which reduces contention around shared
> > lwlocks.
> > b. Reduce some instructions added by patch in StrategyGetBuffer(),
> > like instead of awakening bgreclaimer at low threshold, awaken when
> > it tries to do clock sweep.
> >
>
> Are you sure you didn't mix up the profiles here?

I have tried this 2 times, basically I am quite confident from myside,

but human errors can't be ruled out. I have used below statements:

Steps used for profiling

during configure, use CFLAS="-fno-omit-frame-pointer"

Terminal -1

Start Server

Terminal -2

./pgbench -c 64 -j 64 -T 300 -S -M prepared postgres

Terminal-3

perf record -a -g sleep 60 --This command is run after a minute or so of

-- test start

After test is finished - perf report -g graph,0.5,callee

Do you see any problem in the way I am collecting perf reports?

In any case, I can try once more if you still doubt the profiles.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

09 October 2014, 19:38:36

On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
> On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
> > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On another point, I think it would be a good idea to rebase the
> > > bgreclaimer patch over what I committed, so that we have a
> > > clean patch against master to test with.
> >
> > Please find the rebased patch attached with this mail.  I have taken
> > some performance data as well and done some analysis based on
> > the same.
> >
> > Performance Data
> > ----------------------------
> > IBM POWER-8 24 cores, 192 hardware threads
> > RAM = 492GB
> > max_connections =300
> > Database Locale =C
> > checkpoint_segments=256
> > checkpoint_timeout    =15min
> > shared_buffers=8GB
> > scale factor = 5000
> > Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> > Duration of each individual run = 5mins
>
> I don't think OLTP really is the best test case for this. Especially not
> pgbench with relatilvely small rows *and* a uniform distribution of
> access.
>
> Try parallel COPY TO. Batch write loads is where I've seen this hurt
> badly.

As an example. The attached scripts go from:
progress: 5.3 s, 20.9 tps, lat 368.917 ms stddev 49.655
progress: 10.1 s, 21.0 tps, lat 380.326 ms stddev 64.525
progress: 15.1 s, 14.1 tps, lat 568.108 ms stddev 226.040
progress: 20.4 s, 12.0 tps, lat 634.557 ms stddev 300.519
progress: 25.2 s, 17.5 tps, lat 461.738 ms stddev 136.257
progress: 30.2 s, 9.8 tps, lat 850.766 ms stddev 305.454
progress: 35.3 s, 12.2 tps, lat 670.473 ms stddev 271.075
progress: 40.2 s, 7.9 tps, lat 972.617 ms stddev 313.152
progress: 45.3 s, 14.9 tps, lat 546.056 ms stddev 211.987
progress: 50.2 s, 13.2 tps, lat 610.608 ms stddev 271.780
progress: 55.5 s, 16.9 tps, lat 468.757 ms stddev 156.516
progress: 60.5 s, 14.3 tps, lat 548.913 ms stddev 190.414
progress: 65.7 s, 9.3 tps, lat 821.293 ms stddev 353.665
progress: 70.1 s, 16.0 tps, lat 524.240 ms stddev 174.903
progress: 75.2 s, 17.0 tps, lat 485.692 ms stddev 194.273
progress: 80.2 s, 19.9 tps, lat 396.295 ms stddev 78.891
progress: 85.3 s, 18.3 tps, lat 423.744 ms stddev 105.798
progress: 90.1 s, 14.5 tps, lat 577.373 ms stddev 270.914
progress: 95.3 s, 12.0 tps, lat 649.434 ms stddev 247.001
progress: 100.3 s, 14.6 tps, lat 563.693 ms stddev 275.236
tps = 14.812222 (including connections establishing)

to:
progress: 5.1 s, 18.9 tps, lat 409.766 ms stddev 75.032
progress: 10.3 s, 20.2 tps, lat 396.781 ms stddev 67.593
progress: 15.1 s, 19.1 tps, lat 418.545 ms stddev 109.431
progress: 20.3 s, 20.6 tps, lat 388.606 ms stddev 74.259
progress: 25.1 s, 19.5 tps, lat 406.591 ms stddev 109.050
progress: 30.0 s, 19.1 tps, lat 420.199 ms stddev 157.005
progress: 35.0 s, 18.4 tps, lat 421.102 ms stddev 124.019
progress: 40.3 s, 12.3 tps, lat 640.640 ms stddev 88.409
progress: 45.2 s, 12.8 tps, lat 586.471 ms stddev 145.543
progress: 50.5 s, 6.9 tps, lat 1116.603 ms stddev 285.479
progress: 56.2 s, 6.3 tps, lat 1349.055 ms stddev 381.095
progress: 60.6 s, 7.9 tps, lat 1083.745 ms stddev 452.386
progress: 65.0 s, 9.6 tps, lat 805.981 ms stddev 273.845
progress: 71.1 s, 9.6 tps, lat 798.273 ms stddev 184.108
progress: 75.2 s, 9.3 tps, lat 950.131 ms stddev 150.870
progress: 80.8 s, 8.6 tps, lat 899.389 ms stddev 135.090
progress: 85.3 s, 8.8 tps, lat 928.183 ms stddev 152.056
progress: 90.9 s, 8.0 tps, lat 929.737 ms stddev 71.155
progress: 95.7 s, 9.0 tps, lat 968.070 ms stddev 127.824
progress: 100.3 s, 8.7 tps, lat 911.767 ms stddev 130.697

just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
hope that with an approach like your's this might become better.

psql -f /tmp/prepare.sql
pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

10 October 2014, 06:58:20

On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
> >
> > I don't think OLTP really is the best test case for this. Especially not
> > pgbench with relatilvely small rows *and* a uniform distribution of
> > access.
> >
> > Try parallel COPY TO. Batch write loads is where I've seen this hurt
> > badly.
>
>
> just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
> hope that with an approach like your's this might become better.
>
> psql -f /tmp/prepare.sql
> pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Thanks for providing the scripts. You haven't specified how much data

is present in 'large' file used in tests. I have tried with different set of

rows, but I could not see the dip that is present in your data when you

increased shared buffers from 1GB to 8GB, also I don't see any difference

with patch. BTW, why do you think that for such worklaods this patch can

be helpful, according to my understanding it can be helpful mainly for

read mostly workloads when all the data doesn't fit in shared buffers.

Performance Data

-----------------------------------

IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

For 500000 rows

----------------------------

Data populated using below statement:

insert into largedata_64 ('aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbb',generate_series(1,500000));

copy largedata_64 to '/tmp/large' binary;

pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

shared_buffers - 1GB

---------------------------------

progress: 7.0 s, 2.7 tps, lat 2326.645 ms stddev 173.506

progress: 11.5 s, 3.5 tps, lat 2295.577 ms stddev 78.949

progress: 15.8 s, 3.7 tps, lat 2298.217 ms stddev 223.346

progress: 20.4 s, 3.5 tps, lat 2350.187 ms stddev 192.312

progress: 25.1 s, 3.4 tps, lat 2280.206 ms stddev 54.580

progress: 31.9 s, 3.4 tps, lat 2408.593 ms stddev 243.230

progress: 45.2 s, 1.1 tps, lat 5120.151 ms stddev 3913.561

progress: 50.5 s, 1.3 tps, lat 8967.954 ms stddev 3384.229

progress: 52.7 s, 2.7 tps, lat 3883.788 ms stddev 1733.293

progress: 55.6 s, 3.2 tps, lat 2684.282 ms stddev 348.615

progress: 58.2 s, 3.4 tps, lat 2602.355 ms stddev 268.718

progress: 60.8 s, 3.1 tps, lat 2361.937 ms stddev 302.643

progress: 65.3 s, 3.5 tps, lat 2341.903 ms stddev 162.338

progress: 74.1 s, 2.6 tps, lat 2720.182 ms stddev 716.425

progress: 76.4 s, 3.5 tps, lat 3023.234 ms stddev 670.473

progress: 80.4 s, 2.0 tps, lat 2795.323 ms stddev 820.429

progress: 85.6 s, 1.9 tps, lat 4756.217 ms stddev 844.284

progress: 91.9 s, 2.2 tps, lat 3996.001 ms stddev 1301.143

progress: 96.6 s, 3.5 tps, lat 2284.419 ms stddev 85.013

progress: 101.1 s, 3.5 tps, lat 2282.848 ms stddev 71.388

transaction type: Custom query

scaling factor: 1

query mode: simple

number of clients: 8

number of threads: 8

duration: 100 s

number of transactions actually processed: 275

latency average: 2939.784 ms

latency stddev: 1739.974 ms

tps = 2.710138 (including connections establishing)

tps = 2.710208 (excluding connections establishing)

shared_buffers - 8GB

------------------------------------

progress: 6.7 s, 2.7 tps, lat 2349.816 ms stddev 212.889

progress: 11.0 s, 3.5 tps, lat 2257.364 ms stddev 141.148

progress: 15.2 s, 3.8 tps, lat 2209.669 ms stddev 127.101

progress: 21.7 s, 3.7 tps, lat 2159.838 ms stddev 92.205

progress: 25.8 s, 3.9 tps, lat 2221.072 ms stddev 283.362

progress: 30.1 s, 3.5 tps, lat 2179.611 ms stddev 152.741

progress: 39.3 s, 2.1 tps, lat 2768.609 ms stddev 1265.508

progress: 50.9 s, 1.1 tps, lat 9361.388 ms stddev 2657.885

progress: 52.9 s, 1.0 tps, lat 2036.098 ms stddev 3.599

progress: 55.2 s, 4.3 tps, lat 2167.688 ms stddev 91.183

progress: 57.6 s, 3.0 tps, lat 2399.219 ms stddev 173.535

progress: 60.2 s, 4.1 tps, lat 2427.273 ms stddev 198.698

progress: 65.2 s, 3.4 tps, lat 2441.630 ms stddev 123.773

progress: 72.4 s, 2.9 tps, lat 2534.631 ms stddev 254.162

progress: 75.0 s, 3.9 tps, lat 2468.266 ms stddev 221.969

progress: 82.3 s, 3.0 tps, lat 2548.690 ms stddev 404.852

progress: 86.7 s, 1.4 tps, lat 3980.576 ms stddev 1205.743

progress: 92.5 s, 1.4 tps, lat 5174.340 ms stddev 643.415

progress: 97.1 s, 3.7 tps, lat 3252.847 ms stddev 1689.268

progress: 101.8 s, 3.4 tps, lat 2346.690 ms stddev 138.251

transaction type: Custom query

scaling factor: 1

query mode: simple

number of clients: 8

number of threads: 8

duration: 100 s

number of transactions actually processed: 284

latency average: 2856.195 ms

latency stddev: 1740.699 ms

tps = 2.781603 (including connections establishing)

tps = 2.781682 (excluding connections establishing)

For 5000 rows

------------------------

shared_buffers - 1GB

-----------------------------------

progress: 5.0 s, 357.7 tps, lat 22.295 ms stddev 3.511

progress: 10.0 s, 339.0 tps, lat 23.606 ms stddev 4.388

progress: 15.0 s, 323.4 tps, lat 24.733 ms stddev 5.001

progress: 20.0 s, 329.6 tps, lat 24.258 ms stddev 4.407

progress: 25.0 s, 334.3 tps, lat 23.963 ms stddev 4.126

progress: 30.0 s, 337.5 tps, lat 23.699 ms stddev 3.492

progress: 35.2 s, 158.6 tps, lat 37.182 ms stddev 189.946

progress: 40.3 s, 3.9 tps, lat 2587.129 ms stddev 762.231

progress: 45.4 s, 2.4 tps, lat 2525.946 ms stddev 942.428

progress: 50.0 s, 303.7 tps, lat 33.719 ms stddev 137.524

progress: 55.0 s, 331.5 tps, lat 24.122 ms stddev 3.806

progress: 60.0 s, 333.2 tps, lat 24.028 ms stddev 3.340

progress: 65.0 s, 336.1 tps, lat 23.802 ms stddev 3.601

progress: 70.0 s, 209.0 tps, lat 38.263 ms stddev 120.198

progress: 75.2 s, 141.2 tps, lat 54.350 ms stddev 168.274

progress: 80.0 s, 331.0 tps, lat 25.262 ms stddev 31.637

progress: 86.0 s, 10.9 tps, lat 721.991 ms stddev 750.484

progress: 90.0 s, 85.3 tps, lat 95.531 ms stddev 411.560

progress: 95.0 s, 318.0 tps, lat 25.152 ms stddev 3.985

progress: 100.5 s, 241.1 tps, lat 33.061 ms stddev 83.705

transaction type: Custom query

scaling factor: 1

query mode: simple

number of clients: 8

number of threads: 8

duration: 100 s

number of transactions actually processed: 24068

latency average: 33.400 ms

latency stddev: 138.159 ms

tps = 239.444193 (including connections establishing)

tps = 239.450511 (excluding connections establishing)

shared_buffers-8GB

---------------------------------

progress: 5.0 s, 339.3 tps, lat 23.514 ms stddev 3.853

progress: 10.0 s, 332.7 tps, lat 24.033 ms stddev 3.850

progress: 15.0 s, 329.7 tps, lat 24.290 ms stddev 3.236

progress: 20.0 s, 323.7 tps, lat 24.718 ms stddev 3.639

progress: 25.0 s, 338.0 tps, lat 23.650 ms stddev 2.916

progress: 30.0 s, 324.0 tps, lat 24.721 ms stddev 3.365

progress: 36.1 s, 56.7 tps, lat 127.433 ms stddev 530.344

progress: 41.0 s, 3.2 tps, lat 2393.639 ms stddev 469.533

progress: 45.0 s, 91.8 tps, lat 104.049 ms stddev 418.744

progress: 50.0 s, 331.4 tps, lat 24.143 ms stddev 2.398

progress: 55.0 s, 332.7 tps, lat 24.067 ms stddev 2.810

progress: 60.0 s, 331.1 tps, lat 24.136 ms stddev 3.449

progress: 65.0 s, 304.0 tps, lat 26.332 ms stddev 33.693

progress: 70.9 s, 227.6 tps, lat 34.153 ms stddev 121.504

progress: 75.0 s, 295.1 tps, lat 28.236 ms stddev 52.897

progress: 82.2 s, 44.0 tps, lat 160.993 ms stddev 632.587

progress: 85.0 s, 85.3 tps, lat 121.065 ms stddev 432.011

progress: 90.0 s, 325.9 tps, lat 24.581 ms stddev 2.545

progress: 95.0 s, 333.4 tps, lat 23.989 ms stddev 1.709

progress: 100.0 s, 292.2 tps, lat 27.330 ms stddev 41.678

transaction type: Custom query

scaling factor: 1

query mode: simple

number of clients: 8

number of threads: 8

duration: 100 s

number of transactions actually processed: 25039

latency average: 31.955 ms

latency stddev: 136.304 ms

tps = 250.328882 (including connections establishing)

tps = 250.335432 (excluding connections establishing)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

10 October 2014, 10:15:08

On 2014-10-10 12:28:13 +0530, Amit Kapila wrote:
> On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> > On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
> > >
> > > I don't think OLTP really is the best test case for this. Especially not
> > > pgbench with relatilvely small rows *and* a uniform distribution of
> > > access.
> > >
> > > Try parallel COPY TO. Batch write loads is where I've seen this hurt
> > > badly.
> >
> >
> > just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
> > hope that with an approach like your's this might become better.
> >
> > psql -f /tmp/prepare.sql
> > pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100
> 
> Thanks for providing the scripts.  You haven't specified how much data
> is present in 'large' file used in tests.

I don't think it matters much. It should be small enough that you get a
couple TPS over all backends.

> I have tried with different set of
> rows, but I could not see the dip that is present in your data when you
> increased shared buffers from 1GB to 8GB, also I don't see any difference
> with patch.

Interesting. I wonder whether that's because the concurrency wasn't high
enough for that machine to show the problem. I ran the test on my
workstation which has 8 actual cores.

> BTW, why do you think that for such worklaods this patch can
> be helpful, according to my understanding it can be helpful mainly for
> read mostly workloads when all the data doesn't fit in shared buffers.

The performance dip comes from all the backends performing the clock
sweep. As the access is pretty uniform all the buffers start with some
usage count (IIRC 3 in this example. Much worse if 5). Due to the
uniform usagecount the clock sweep frequently has to go several times
through *all* the buffers. That leads to quite horrible performance in
some cases.
I had hoped that bgreclaimer can make that workload les horrible by
funneling most of the accesses through the freelist.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

14 October 2014, 09:55:58

On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On another point, I think it would be a good idea to rebase the
> > bgreclaimer patch over what I committed, so that we have a
> > clean patch against master to test with.
>
> Please find the rebased patch attached with this mail. I have taken
> some performance data as well and done some analysis based on
> the same.
>
>
>
> Below data is median of 3 runs.
>
> patch_ver/client_count 1 8 32 64 128 256
> HEAD 18884 118628 251093 216294 186625 177505
> PATCH 18743 122578 247243 205521 179712 175031
>
>
> Here we can see that the performance dips at higher client
> count(>=32) which was quite surprising for me, as I was expecting
> it to improve, because bgreclaimer reduces the contention by making
> buffers available on free list. So I tried to analyze the situation by
> using perf and found that in above configuration, there is a contention
> around freelist spinlock with HEAD and the same is removed by Patch,
> but still the performance goes down with Patch. On further analysis, I
> observed that actually after Patch there is an increase in contention
> around ProcArrayLock (shared LWlock) via GetSnapshotData which
> sounds bit odd, but that's what I can see in profiles. Based on analysis,
> few ideas which I would like to further investigate are:
> a. As there is an increase in spinlock contention, I would like to check
> with Andres's latest patch which reduces contention around shared
> lwlocks.

I have tried by applying Andres's Wait free LW_SHARED acquisition

patch posted by him at below link along with bgreclaimer patch:

http://www.postgresql.org/message-id/20141008133533.GA5053@alap3.anarazel.de

After that I observed that contention for LW_SHARED has reduced

for this load, but it didn't help much in terms of performance, so I again

rechecked the profile and this time most of the contention is moved

to spinlock used in dynahash for buf mapping tables, please refer

the profile (for 128 client count; Read only load) below:

bgreclaimer patch + wait free lw_shared acquisition patches -

------------------------------------------------------------------------------------------

9.24% swapper [unknown] [H] 0x00000000011d9c10

+ 7.19% postgres postgres [.] s_lock

+ 3.52% postgres postgres [.] GetSnapshotData

+ 3.02% postgres postgres [.] calc_bucket

+ 2.71% postgres postgres [.] hash_search_with_hash_value

2.32% postgres [unknown] [H] 0x00000000011e0d7c

+ 2.17% postgres postgres [.] pg_atomic_fetch_add_u32_impl

+ 1.84% postgres postgres [.] AllocSetAlloc

+ 1.57% postgres postgres [.] _bt_compare

+ 1.05% postgres postgres [.] AllocSetFreeIndex

+ 1.02% postgres [kernel.kallsyms] [k] .__copy_tofrom_user_power7

+ 0.94% postgres postgres [.] tas

+ 0.85% swapper [kernel.kallsyms] [k] .int_sqrt

+ 0.80% postgres postgres [.] pg_encoding_mbcliplen

+ 0.78% pgbench [kernel.kallsyms] [k] .find_busiest_group

0.65% pgbench [unknown] [H] 0x00000000011d96e0

+ 0.59% postgres postgres [.] hash_any

+ 0.54% postgres postgres [.] LWLockRelease

Detailed Profile Data

-----------------------------------------

- 7.19% postgres postgres [.] s_lock

- s_lock

- 3.79% s_lock

- 1.69% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

- 1.63% hash_search_with_hash_value

- 1.63% BufTableDelete

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- 1.45% hash_search_with_hash_value

BufTableDelete

BufferAlloc

ReadBuffer_common

- 1.43% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

ReadBuffer_common

- 3.52% postgres postgres [.] GetSnapshotData

- GetSnapshotData

- 3.50% GetSnapshotData

- 3.49% GetTransactionSnapshot

- 1.75% exec_bind_message

PostgresMain

- 1.74% PortalStart

exec_bind_message

PostgresMain

To reduce above contention, I tried to write a patch to replace spin lock

used in dynahash to manage free list by atomic operations. Still there

is work pending for this patch with respect to ensuring whether the

approach used in patch is completely sane, however I am posting the

patch so that others can have a look at it and give me feedback about

the approach.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

change_dynahash_table_impl_using_atomics_v1.patch

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

14 October 2014, 10:02:26

On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
> After that I observed that contention for LW_SHARED has reduced
> for this load, but it didn't help much in terms of performance, so I again
> rechecked the profile and this time most of the contention is moved
> to spinlock used in dynahash for buf mapping tables, please refer
> the profile (for 128 client count; Read only load) below:
> 
> bgreclaimer patch + wait free lw_shared acquisition patches -
> ------------------------------------------------------------------------------------------
> 
>     9.24%          swapper  [unknown]           [H] 0x00000000011d9c10
> +   7.19%         postgres  postgres            [.] s_lock
> +   3.52%         postgres  postgres            [.] GetSnapshotData
> +   3.02%         postgres  postgres            [.] calc_bucket
> +   2.71%         postgres  postgres            [.]
> hash_search_with_hash_value
>     2.32%         postgres  [unknown]           [H] 0x00000000011e0d7c
> +   2.17%         postgres  postgres            [.]
> pg_atomic_fetch_add_u32_impl
> +   1.84%         postgres  postgres            [.] AllocSetAlloc
> +   1.57%         postgres  postgres            [.] _bt_compare
> +   1.05%         postgres  postgres            [.] AllocSetFreeIndex
> +   1.02%         postgres  [kernel.kallsyms]   [k]
> .__copy_tofrom_user_power7
> +   0.94%         postgres  postgres            [.] tas
> +   0.85%          swapper  [kernel.kallsyms]   [k] .int_sqrt
> +   0.80%         postgres  postgres            [.] pg_encoding_mbcliplen
> +   0.78%          pgbench  [kernel.kallsyms]   [k] .find_busiest_group
>     0.65%          pgbench  [unknown]           [H] 0x00000000011d96e0
> +   0.59%         postgres  postgres            [.] hash_any
> +   0.54%         postgres  postgres            [.] LWLockRelease

This profile is without -O2 again. I really don't think it makes much
sense to draw much inference from an unoptimized build. I realize that
you said that the builds you use for benchmarking don't have that
problem, but that doesn't make this profile meaningful...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

15 October 2014, 07:48:00

On Tue, Oct 14, 2014 at 3:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
> > After that I observed that contention for LW_SHARED has reduced
> > for this load, but it didn't help much in terms of performance, so I again
> > rechecked the profile and this time most of the contention is moved
> > to spinlock used in dynahash for buf mapping tables, please refer
> > the profile (for 128 client count; Read only load) below:
> >
> > bgreclaimer patch + wait free lw_shared acquisition patches -
> > ------------------------------------------------------------------------------------------
>
> This profile is without -O2 again. I really don't think it makes much
> sense to draw much inference from an unoptimized build.

Profile data with -O2 is below. This shows that top

contributors are calls for BufTableLookup and spin lock caused

by BufTableInsert and BufTableDelete. To resolve spin lock

contention, patch like above might prove to be useful (although

I have to still evaluate the same). I would like to once take

LWLOCK_STATS data as well before proceeding further.

Do you have any other ideas?

11.17% swapper [unknown] [H] 0x00000000011e0328

+ 4.62% postgres postgres [.] hash_search_with_hash_value

+ 4.35% pgbench [kernel.kallsyms] [k] .find_busiest_group

+ 3.71% postgres postgres [.] s_lock

2.56% postgres [unknown] [H] 0x0000000001500120

+ 2.23% pgbench [kernel.kallsyms] [k] .idle_cpu

+ 1.97% postgres postgres [.] LWLockAttemptLock

+ 1.73% postgres postgres [.] LWLockRelease

+ 1.47% postgres [kernel.kallsyms] [k] .__copy_tofrom_user_power7

+ 1.44% postgres postgres [.] GetSnapshotData

+ 1.28% postgres postgres [.] _bt_compare

+ 1.04% swapper [kernel.kallsyms] [k] .int_sqrt

+ 1.04% postgres postgres [.] AllocSetAlloc

+ 0.97% pgbench [kernel.kallsyms] [k] .default_wake_function

Detailed Data

----------------

- 4.62% postgres postgres [.] hash_search_with_hash_value

- hash_search_with_hash_value

- 2.19% BufTableLookup

- 2.15% BufTableLookup

ReadBuffer_common

- ReadBufferExtended

- 1.32% _bt_relandgetbuf

- 0.73% BufTableDelete

- 0.71% BufTableDelete

ReadBuffer_common

ReadBufferExtended

- 0.69% BufTableInsert

- 0.68% BufTableInsert

ReadBuffer_common

ReadBufferExtended

0.66% hash_search_with_hash_value

- 4.35% pgbench [kernel.kallsyms] [k] .find_busiest_group

- .find_busiest_group

- 4.28% .find_busiest_group

- 4.26% .load_balance

- 4.26% .idle_balance

- .__schedule

- 4.26% .schedule_hrtimeout_range_clock

.do_select

.core_sys_select

- 3.71% postgres postgres [.] s_lock

- s_lock

- 3.19% hash_search_with_hash_value

- 3.18% hash_search_with_hash_value

- 1.60% BufTableInsert

ReadBuffer_common

- ReadBufferExtended

- 1.57% BufTableDelete

ReadBuffer_common

- ReadBufferExtended

- 0.93% index_fetch_heap

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

28 October 2014, 04:14:59

On Tue, Oct 14, 2014 at 3:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On another point, I think it would be a good idea to rebase the
> > > bgreclaimer patch over what I committed, so that we have a
> > > clean patch against master to test with.
> >
> > Please find the rebased patch attached with this mail. I have taken
> > some performance data as well and done some analysis based on
> > the same.
> >
> >
> >
>
> To reduce above contention, I tried to write a patch to replace spin lock
> used in dynahash to manage free list by atomic operations. Still there
> is work pending for this patch with respect to ensuring whether the
> approach used in patch is completely sane, however I am posting the
> patch so that others can have a look at it and give me feedback about
> the approach.

After further working on this patch (replacement of spinlock in dynahash),

I found that I need to solve A-B-A problem for lockless structure as

described in section 5.1 of paper [1]. I could have further pursued to solve

it by using some additional variables as described in section 5 of paper [1],

but it seems to me that Robert has already solved it by using some other

technique as proposed by him in patch [2], so it would be waste of effort

for me to pursue on this problem. So I am not planning to continue on this

patch and marking it as Rejected in CF app.

[1] http://www.liblfds.org/mediawiki/images/1/1d/Valois_-_Lock-Free_Linked_Lists_Using_Compare-and-Swap.pdf

[2] https://commitfest.postgresql.org/action/patch_view?id=1613

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Scaling shared buffer eviction

From

Thom Brown

Date:

11 November 2014, 09:29:59

On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> But this gets at another point: the way we're benchmarking this right
> now, we're really conflating the effects of three different things:
>
> 1. Changing the locking regimen around the freelist and clocksweep.
> 2. Adding a bgreclaimer process.
> 3. Raising the number of buffer locking partitions.

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today

Were parts 2 and 3 committed in the end?

Thom

Re: Scaling shared buffer eviction

From

Andres Freund

Date:

11 November 2014, 09:31:01

On 2014-11-11 09:29:22 +0000, Thom Brown wrote:
> On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> >  On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com>
> > wrote:
> > >
> > > But this gets at another point: the way we're benchmarking this right
> > > now, we're really conflating the effects of three different things:
> > >
> > > 1. Changing the locking regimen around the freelist and clocksweep.
> > > 2. Adding a bgreclaimer process.
> > > 3. Raising the number of buffer locking partitions.
> >
> > First of all thanks for committing part-1 of this changes and it
> > seems you are planing to commit part-3 based on results of tests
> > which Andres is planing to do and for remaining part (part-2), today
> >
> 
> Were parts 2 and 3 committed in the end?

3 was committed. 2 wasn't because it's not yet clear whether how
beneficial it is generally.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Scaling shared buffer eviction

From

Amit Kapila

Date:

09 December 2014, 08:52:43

On Tue, Nov 11, 2014 at 3:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-11-11 09:29:22 +0000, Thom Brown wrote:
> > On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com>
> > > wrote:
> > > >
> > > > But this gets at another point: the way we're benchmarking this right
> > > > now, we're really conflating the effects of three different things:
> > > >
> > > > 1. Changing the locking regimen around the freelist and clocksweep.
> > > > 2. Adding a bgreclaimer process.
> > > > 3. Raising the number of buffer locking partitions.
> > >
> > > First of all thanks for committing part-1 of this changes and it
> > > seems you are planing to commit part-3 based on results of tests
> > > which Andres is planing to do and for remaining part (part-2), today
> > >
> >
> > Were parts 2 and 3 committed in the end?
>
> 3 was committed. 2 wasn't because it's not yet clear whether how
> beneficial it is generally.
>

As shown upthread that this patch (as it stands today) is dependent on

another patch (wait free LW_SHARED acquisition) which is still not

committed and still some more work is needed to clearly show the

gain by this patch, so I have marked it as "Returned with Feedback".

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com