Re: Scaling shared buffer eviction - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Scaling shared buffer eviction
Date
Msg-id CAA4eK1+5bQh3KyO14Pqn+VuLex41V8cwt0kw6hRJASdcbaabtg@mail.gmail.com
In response to Re: Scaling shared buffer eviction  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Scaling shared buffer eviction  (Kevin Grittner <kgrittn@ymail.com>)
Re: Scaling shared buffer eviction  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>                 Thrds (64)    Thrds (128)
>> HEAD                 45562          17128
>> HEAD + 64            57904          32810
>> V1 + 64             105557          81011
>> HEAD + 128           58383          32997
>> V1 + 128            110705         114544
>
> I haven't actually reviewed the code, but this sort of thing seems like
> good evidence that we need your patch, or something like it.  The fact that
> the patch produces little performance improvement on its own (though it
> does produce some) shouldn't be held against it - the fact that the
> contention shifts elsewhere when the first bottleneck is removed is not
> your patch's fault.



I have improved the patch by making the following changes:
a.  Improved the bgwriter logic to log the xl_running_xacts info and
     removed the hibernate logic, as bgwriter will now work only when
     there is a scarcity of buffers on the freelist.  The basic idea is that when
     the number of buffers on the freelist drops below the low threshold, the
     allocating backend sets the latch; bgwriter wakes up and begins
     adding buffers to the freelist until it reaches the high threshold, and
     then goes back to sleep.  (A minimal sketch of this scheme is given
     just after point e below.)


b.  New stats for the number of buffers on the freelist have been added; some
     old ones like maxwritten_clean can be removed, as the new logic for
     syncing buffers and moving them to the freelist doesn't use them.
     However, I think it's better to remove them once the new logic is
     accepted.  Added some new logs for info related to the freelist under
     BGW_DEBUG.

c.  Used the already existing bgwriterLatch in BufferStrategyControl to
     wake bgwriter when the number of buffers on the freelist drops below the
     low threshold.

d.  Autotuned the low and high thresholds for the freelist for various
     configurations.  Generally, if we keep a small number (200~2000) of buffers
     always available on the freelist, it appears to be sufficient even for
     high shared_buffers settings like 15GB.  However, when shared_buffers
     is small, we need a much smaller number.  I think we can provide these
     as config knobs for the user as well, but for now, based on LWLOCK_STATS
     results, I have chosen some hard-coded values for the low and high
     thresholds of the freelist.
     The values for the low and high thresholds have been decided based on the
     total number of shared buffers; basically I have divided them into 5
     categories (16~100, 100~1000, 1000~10000, 10000~100000,
     100000 and above) and then ran tests (read-only pgbench) for various
     configurations falling under these categories.  The reason for keeping
     fewer categories for larger shared_buffers is that a small number
     (200~2000) of buffers available on the freelist seems to be sufficient for
     quite high loads; however, as the total number of shared buffers decreases
     we need to be more careful, because if we keep the number too low it leads
     to more clock sweeps by backends (which means freelist lock contention),
     and if we keep the number too high, bgwriter will evict many useful
     buffers.  Results based on LWLOCK_STATS are at the end of the mail.

e.  One reason why I think the number of buffer partitions is hard-coded to 16
     is that the minimum number of shared buffers allowed is 16 (128kB).
     However, there is handling in the code (in function init_htab()) which
     ensures that even if the number of partitions is more than the number of
     shared buffers, it is handled safely.
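
To make the interaction described in points (a), (c) and (d) concrete, below is
a minimal, self-contained C sketch of the low/high watermark idea.  This is not
the patch code: the names (FREELIST_LOW_WATERMARK, backend_alloc_buffer, etc.)
and the hard-coded numbers are placeholders chosen only for illustration; the
actual patch sets the existing bgwriterLatch in BufferStrategyControl and works
on the shared freelist rather than a plain counter.

/*
 * Minimal model of the low/high watermark scheme (illustration only,
 * all names and numbers are placeholders, not the patch code).
 */
#include <stdio.h>

#define FREELIST_LOW_WATERMARK   200    /* backend wakes bgwriter below this */
#define FREELIST_HIGH_WATERMARK 2000    /* bgwriter refills up to this */

static int freelist_len = FREELIST_LOW_WATERMARK;   /* pretend freelist length */
static int bgwriter_awake = 0;

/* Backend side: called from the buffer-allocation path. */
static void
backend_alloc_buffer(void)
{
    if (freelist_len > 0)
        freelist_len--;                 /* pop a buffer from the freelist */

    if (freelist_len < FREELIST_LOW_WATERMARK && !bgwriter_awake)
    {
        /* In the patch, this is where the backend sets bgwriterLatch. */
        bgwriter_awake = 1;
        printf("backend: freelist at %d, below low watermark; waking bgwriter\n",
               freelist_len);
    }
}

/* Bgwriter side: runs when woken, refills the freelist, then sleeps again. */
static void
bgwriter_refill(void)
{
    while (freelist_len < FREELIST_HIGH_WATERMARK)
    {
        /* The real code would clean (sync) a dirty buffer before reusing it. */
        freelist_len++;
    }
    bgwriter_awake = 0;
    printf("bgwriter: refilled freelist to %d; sleeping again\n", freelist_len);
}

int
main(void)
{
    backend_alloc_buffer();             /* drops below the low watermark */
    if (bgwriter_awake)
        bgwriter_refill();
    return 0;
}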

I have checked bgwriter's CPU usage with and without the patch
for various configurations, and the observation is that for most of the
loads bgwriter's CPU usage with the patch is between 8~20%, whereas on
HEAD it is 0~2%.  This shows that with the patch, when shared buffers
are in use by backends, bgwriter is constantly doing work to
ease the work of the backends.  Detailed data is provided later in the
mail.

Performance Data:
-------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

Client Count/patch_ver (tps)      8       16       32       64      128
HEAD                          26220    48686    70779    45232    17310
Patch                         26402    50726    75574   111468   114521
 
Data is taken by using the script (pert_buff_mgmt.sh) attached with this mail.
This data is read-only pgbench data with different numbers of client
connections.  All the numbers are in tps.  Each data point is the median of 3
5-min pgbench read-only runs.  Please find the detailed data for the 3 runs
in the attached OpenOffice document (perf_read_scalability_data_v3.ods).

This data clearly shows that the patch has improved performance by
up to 5~6 times at higher client counts.


Results of BGwriter CPU usage:
--------------------------------------------------

Here sc is scale factor and sb is shared buffers and the data is
for read-only pgbench runs.

./pgbench -c 64 -j 64 -S -T 300 postgres
sc - 3000, sb - 8GB
HEAD
CPU usage - 0~2.3%
Patch v_3
CPU usage - 8.6%

sc - 100, sb - 128MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 1~2%
tps- 36199.047132
Patch v_3
CPU usage - 12~13%
tps = 109182.681827

sc - 50, sb - 75MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 0.7~2%
tps- 37760.575128
Patch v_3
CPU usage - 20~22%
tps = 106310.744198

./pgbench -c 16 -j 16 -S -T 300 postgres
sc - 100, sb - 128kB
--need to change pgbench for this.
HEAD
CPU Usage - 0~0.3%
tps- 40979.529254
Patch v_3
CPU usage - 35~40%
tps = 42956.785618


Results of LWLOCK_STATS based on low-high threshold values of freelist:
--------------------------------------------------------------------------------------------------------------

In the results, the values of exacq and blk show the contention on the freelist
lock (BufFreelistLock).  sc is scale factor and sb is the number of shared_buffers.
The results below show that for all but one (1MB) of the configurations, the
contention around BufFreelistLock is reduced significantly.  For the 1MB case as
well, the exacq count is reduced, which shows that the clock sweep was performed
fewer times.
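
As an aside on reading these numbers: LWLOCK_STATS is a compile-time
instrumentation option (the server has to be built with the LWLOCK_STATS macro
defined), and each backend dumps its per-lock counters when it exits, which is
where the lines below come from.  The small sketch that follows is only a
reading aid I added, not part of the patch; the struct and comments reflect my
understanding of what the fields count, and the sample numbers are copied from
the sc - 3000, sb - 8GB case below.

/*
 * Reading aid for the LWLOCK_STATS lines below (not part of the patch).
 *   shacq     - shared-mode acquisitions of the lock
 *   exacq     - exclusive-mode acquisitions (each clock-sweep pass by a
 *               backend takes BufFreelistLock exclusively, so this tracks
 *               freelist traffic)
 *   blk       - number of times the acquirer had to block and sleep
 *   spindelay - delays while spinning on the lock's internal mutex
 */
#include <stdio.h>

typedef struct LWLockCounts
{
    unsigned int shacq;
    unsigned int exacq;
    unsigned int blk;
    unsigned int spindelay;
} LWLockCounts;

int
main(void)
{
    /* Numbers copied from the sc - 3000, sb - 8GB case below. */
    LWLockCounts head  = {0, 285155, 33910, 548};   /* HEAD */
    LWLockCounts patch = {0, 165, 18, 0};           /* Patch v_3 */

    printf("exacq reduced ~%ux, blk reduced ~%ux\n",
           head.exacq / patch.exacq,
           head.blk / patch.blk);
    return 0;
}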

sc - 3000, sb - 15GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 4406 lwlock main 0: shacq 0 exacq 84482 blk 5139 spindelay 62
Patch v_3
PID 4864 lwlock main 0: shacq 0 exacq 34 blk 1 spindelay 0

sc - 3000, sb - 8GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 24124 lwlock main 0: shacq 0 exacq 285155 blk 33910 spindelay 548
Patch v_3
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

sc - 100, sb - 768MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 9144 lwlock main 0: shacq 0 exacq 284636 blk 34091 spindelay 555
Patch v-3 (lw=100,hg=1000)
PID 9428 lwlock main 0: shacq 0 exacq 306 blk 59 spindelay 0

sc - 100, sb - 128MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 5405 lwlock main 0: shacq 0 exacq 285449 blk 32345 spindelay 714
Patch v-3
PID 8625 lwlock main 0: shacq 0 exacq 740 blk 178 spindelay 0

sc - 50, sb - 75MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 12681 lwlock main 0: shacq 0 exacq 289347 blk 34064 spindelay 773
Patch v3
PID 12800 lwlock main 0: shacq 0 exacq 76287 blk 15183 spindelay 28

sc - 50, sb - 10MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 10283 lwlock main 0: shacq 0 exacq 287500 blk 32177 spindelay 864
Patch v3 (for > 1000, lw = 50 hg =200)
PID 11629 lwlock main 0: shacq 0 exacq 60139 blk 12978 spindelay 40

sc - 1, sb - 7MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 47127 lwlock main 0: shacq 0 exacq 289462 blk 37057 spindelay 119
Patch v3
PID 47283 lwlock main 0: shacq 0 exacq 9507 blk 1656 spindelay 0

sc - 1, sb - 1MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 43215 lwlock main 0: shacq 0 exacq 301384 blk 36740 spindelay 902
Patch v3
PID 46542 lwlock main 0: shacq 0 exacq 197231 blk 37532 spindelay 294

sc - 100, sb - 128kB --(sb > 16)
./pgbench -c 16 -j 16 -S -T 300 postgres (for this, I needed to reduce the
value of naccounts to 2500, else it was always giving "no unpinned buffers
available")
HEAD
PID 49751 lwlock main 0: shacq 0 exacq 1821276 blk 130119 spindelay 7
Patch v3
PID 50768 lwlock main 0: shacq 0 exacq 382610 blk 46543 spindelay 1


More data points and work:
a. I have yet to take data after merging this with the scalable lwlock patch;
    there are many conflicts between the patches, so I am waiting for an
    updated patch.
b. Read-only data for more configurations.
c. Data for write workloads (tpc-b of pgbench, bulk insert (COPY)).
d. Update docs and remove unused code.


Suggestions?

With Regards,
Amit Kapila.
