Re: Scaling shared buffer eviction - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Scaling shared buffer eviction
Date
Msg-id CAA4eK1+5bQh3KyO14Pqn+VuLex41V8cwt0kw6hRJASdcbaabtg@mail.gmail.com
In response to Re: Scaling shared buffer eviction  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Scaling shared buffer eviction  (Kevin Grittner <kgrittn@ymail.com>)
Re: Scaling shared buffer eviction  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>                 Thrds (64)    Thrds (128)
>> HEAD                 45562          17128
>> HEAD + 64            57904          32810
>> V1 + 64             105557          81011
>> HEAD + 128           58383          32997
>> V1 + 128            110705         114544
>
> I haven't actually reviewed the code, but this sort of thing seems like
> good evidence that we need your patch, or something like it.  The fact that
> the patch produces little performance improvement on its own (though it
> does produce some) shouldn't be held against it - the fact that the
> contention shifts elsewhere when the first bottleneck is removed is not
> your patch's fault.



I have improved the patch by making the following changes:
a.  Improved the bgwriter logic to log the xl_running_xacts info and
     removed the hibernate logic, as bgwriter will now work only when
     there is a scarcity of buffers on the freelist.  The basic idea is that when
     the number of buffers on the freelist drops below the low threshold, the
     allocating backend sets the latch; bgwriter wakes up and begins
     adding buffers to the freelist until it reaches the high threshold, and
     then goes back to sleep.  (A minimal sketch of this scheme is given
     just after point e below.)


b.  New stats for the number of buffers on the freelist have been added; some
     old ones like maxwritten_clean can be removed, as the new logic for
     syncing buffers and moving them to the freelist doesn't use them.
     However, I think it's better to remove them once the new logic is
     accepted.  Added some new logs for info related to the freelist under
     BGW_DEBUG.

c.  Used the already existing bgwriterLatch in BufferStrategyControl to
     wake bgwriter when the number of buffers on the freelist drops below the
     low threshold.

d.  Autotuned the low and high thresholds for the freelist for various
     configurations.  Generally, if we keep a small number (200~2000) of buffers
     always available on the freelist, it appears to be sufficient even for
     high shared_buffers settings like 15GB.  However, when shared_buffers
     is small, we need a much smaller number.  I think we can provide these
     as config knobs for the user as well, but for now, based on LWLOCK_STATS
     results, I have chosen some hard-coded values for the low and high
     thresholds of the freelist.
     The values for the low and high thresholds have been decided based on the
     total number of shared buffers; basically I have divided them into 5
     categories (16~100, 100~1000, 1000~10000, 10000~100000,
     100000 and above) and then ran tests (read-only pgbench) for various
     configurations falling under these categories.  The reason for keeping
     fewer categories for larger shared_buffers is that a small number
     (200~2000) of buffers available on the freelist seems to be sufficient for
     quite high loads; however, as the total number of shared buffers decreases
     we need to be more careful, because if we keep the number too low it leads
     to more clock sweeps by backends (which means freelist lock contention),
     and if we keep the number too high, bgwriter will evict many useful
     buffers.  Results based on LWLOCK_STATS are at the end of the mail.

e.  One reason why I think the number of buffer partitions is hard-coded to 16
     is that the minimum number of shared buffers allowed is 16 (128kB).
     However, there is handling in the code (in function init_htab()) which
     ensures that even if the number of partitions is more than the number of
     shared buffers, it is handled safely.
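
To make the interaction described in points (a), (c) and (d) concrete, below is
a minimal, self-contained C sketch of the low/high watermark idea.  This is not
the patch code: the names (FREELIST_LOW_WATERMARK, backend_alloc_buffer, etc.)
and the hard-coded numbers are placeholders chosen only for illustration; the
actual patch sets the existing bgwriterLatch in BufferStrategyControl and works
on the shared freelist rather than a plain counter.

/*
 * Minimal model of the low/high watermark scheme (illustration only,
 * all names and numbers are placeholders, not the patch code).
 */
#include <stdio.h>

#define FREELIST_LOW_WATERMARK   200    /* backend wakes bgwriter below this */
#define FREELIST_HIGH_WATERMARK 2000    /* bgwriter refills up to this */

static int freelist_len = FREELIST_LOW_WATERMARK;   /* pretend freelist length */
static int bgwriter_awake = 0;

/* Backend side: called from the buffer-allocation path. */
static void
backend_alloc_buffer(void)
{
    if (freelist_len > 0)
        freelist_len--;                 /* pop a buffer from the freelist */

    if (freelist_len < FREELIST_LOW_WATERMARK && !bgwriter_awake)
    {
        /* In the patch, this is where the backend sets bgwriterLatch. */
        bgwriter_awake = 1;
        printf("backend: freelist at %d, below low watermark; waking bgwriter\n",
               freelist_len);
    }
}

/* Bgwriter side: runs when woken, refills the freelist, then sleeps again. */
static void
bgwriter_refill(void)
{
    while (freelist_len < FREELIST_HIGH_WATERMARK)
    {
        /* The real code would clean (sync) a dirty buffer before reusing it. */
        freelist_len++;
    }
    bgwriter_awake = 0;
    printf("bgwriter: refilled freelist to %d; sleeping again\n", freelist_len);
}

int
main(void)
{
    backend_alloc_buffer();             /* drops below the low watermark */
    if (bgwriter_awake)
        bgwriter_refill();
    return 0;
}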

I have checked bgwriter's CPU usage with and without the patch
for various configurations, and the observation is that for most of the
loads bgwriter's CPU usage with the patch is between 8~20%, whereas on
HEAD it is 0~2%.  This shows that with the patch, when shared buffers
are in use by backends, bgwriter is constantly doing work to
ease the work of the backends.  Detailed data is provided later in the
mail.

Performance Data:
-------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

Client Count/patch_ver (tps)      8       16       32       64      128
HEAD                          26220    48686    70779    45232    17310
Patch                         26402    50726    75574   111468   114521
 
Data is taken by using the script (pert_buff_mgmt.sh) attached with this mail.
This data is read-only pgbench data with different numbers of client
connections.  All the numbers are in tps.  Each data point is the median of 3
5-min pgbench read-only runs.  Please find the detailed data for the 3 runs
in the attached OpenOffice document (perf_read_scalability_data_v3.ods).

This data clearly shows that the patch has improved performance by
up to 5~6 times at higher client counts.


Results of BGwriter CPU usage:
--------------------------------------------------

Here sc is scale factor and sb is shared buffers and the data is
for read-only pgbench runs.

./pgbench -c 64 -j 64 -S -T 300 postgres
sc - 3000, sb - 8GB
HEAD
CPU usage - 0~2.3%
Patch v_3
CPU usage - 8.6%

sc - 100, sb - 128MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 1~2%
tps- 36199.047132
Patch v_3
CPU usage - 12~13%
tps = 109182.681827

sc - 50, sb - 75MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 0.7~2%
tps- 37760.575128
Patch v_3
CPU usage - 20~22%
tps = 106310.744198

./pgbench -c 16 -j 16 -S -T 300 postgres
sc - 100, sb - 128kB
--need to change pgbench for this.
HEAD
CPU Usage - 0~0.3%
tps- 40979.529254
Patch v_3
CPU usage - 35~40%
tps = 42956.785618


Results of LWLOCK_STATS based on low-high threshold values of freelist:
--------------------------------------------------------------------------------------------------------------

In the results, the values of exacq and blk show the contention on the freelist
lock (BufFreelistLock).  sc is scale factor and sb is the number of shared_buffers.
The results below show that for all but one (1MB) of the configurations, the
contention around BufFreelistLock is reduced significantly.  For the 1MB case as
well, the exacq count is reduced, which shows that the clock sweep was performed
fewer times.
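
As an aside on reading these numbers: LWLOCK_STATS is a compile-time
instrumentation option (the server has to be built with the LWLOCK_STATS macro
defined), and each backend dumps its per-lock counters when it exits, which is
where the lines below come from.  The small sketch that follows is only a
reading aid I added, not part of the patch; the struct and comments reflect my
understanding of what the fields count, and the sample numbers are copied from
the sc - 3000, sb - 8GB case below.

/*
 * Reading aid for the LWLOCK_STATS lines below (not part of the patch).
 *   shacq     - shared-mode acquisitions of the lock
 *   exacq     - exclusive-mode acquisitions (each clock-sweep pass by a
 *               backend takes BufFreelistLock exclusively, so this tracks
 *               freelist traffic)
 *   blk       - number of times the acquirer had to block and sleep
 *   spindelay - delays while spinning on the lock's internal mutex
 */
#include <stdio.h>

typedef struct LWLockCounts
{
    unsigned int shacq;
    unsigned int exacq;
    unsigned int blk;
    unsigned int spindelay;
} LWLockCounts;

int
main(void)
{
    /* Numbers copied from the sc - 3000, sb - 8GB case below. */
    LWLockCounts head  = {0, 285155, 33910, 548};   /* HEAD */
    LWLockCounts patch = {0, 165, 18, 0};           /* Patch v_3 */

    printf("exacq reduced ~%ux, blk reduced ~%ux\n",
           head.exacq / patch.exacq,
           head.blk / patch.blk);
    return 0;
}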

sc - 3000, sb - 15GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 4406 lwlock main 0: shacq 0 exacq 84482 blk 5139 spindelay 62
Patch v_3
PID 4864 lwlock main 0: shacq 0 exacq 34 blk 1 spindelay 0

sc - 3000, sb - 8GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 24124 lwlock main 0: shacq 0 exacq 285155 blk 33910 spindelay 548
Patch v_3
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

sc - 100, sb - 768MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 9144 lwlock main 0: shacq 0 exacq 284636 blk 34091 spindelay 555
Patch v-3 (lw=100,hg=1000)
PID 9428 lwlock main 0: shacq 0 exacq 306 blk 59 spindelay 0

sc - 100, sb - 128MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 5405 lwlock main 0: shacq 0 exacq 285449 blk 32345 spindelay 714
Patch v-3
PID 8625 lwlock main 0: shacq 0 exacq 740 blk 178 spindelay 0

sc - 50, sb - 75MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 12681 lwlock main 0: shacq 0 exacq 289347 blk 34064 spindelay 773
Patch v3
PID 12800 lwlock main 0: shacq 0 exacq 76287 blk 15183 spindelay 28

sc - 50, sb - 10MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 10283 lwlock main 0: shacq 0 exacq 287500 blk 32177 spindelay 864
Patch v3 (for > 1000, lw = 50 hg =200)
PID 11629 lwlock main 0: shacq 0 exacq 60139 blk 12978 spindelay 40

sc - 1, sb - 7MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 47127 lwlock main 0: shacq 0 exacq 289462 blk 37057 spindelay 119
Patch v3
PID 47283 lwlock main 0: shacq 0 exacq 9507 blk 1656 spindelay 0

sc - 1, sb - 1MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 43215 lwlock main 0: shacq 0 exacq 301384 blk 36740 spindelay 902
Patch v3
PID 46542 lwlock main 0: shacq 0 exacq 197231 blk 37532 spindelay 294

sc - 100, sb - 128kB --(sb > 16)
./pgbench -c 16 -j 16 -S -T 300 postgres (for this, I needed to reduce the
value of naccounts to 2500, else it was always giving "no unpinned buffers
available")
HEAD
PID 49751 lwlock main 0: shacq 0 exacq 1821276 blk 130119 spindelay 7
Patch v3
PID 50768 lwlock main 0: shacq 0 exacq 382610 blk 46543 spindelay 1


More data points and work:
a. I have yet to take data after merging this with the scalable lwlock patch;
    there are many conflicts between the patches, so I am waiting for an
    updated patch.
b. Read-only data for more configurations.
c. Data for write workloads (tpc-b of pgbench, bulk insert (COPY)).
d. Update docs and remove unused code.


Suggestions?

With Regards,
Amit Kapila.
