Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Zhihong Yu
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id CALNJ-vQF2tPW_rRkV-pBw0BNgBkKvUunt1vYj-5WHLkQV4aOSA@mail.gmail.com
Whole thread Raw
In response to BufferAlloc: don't take two simultaneous locks  (Yura Sokolov <y.sokolov@postgrespro.ru>)
Responses Re: BufferAlloc: don't take two simultaneous locks  (Yura Sokolov <y.sokolov@postgrespro.ru>)
List pgsql-hackers


On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
Good day.

I found some opportunity in Buffer Manager code in BufferAlloc
function:
- When valid buffer is evicted, BufferAlloc acquires two partition
lwlocks: for partition for evicted block is in and partition for new
block placement.

It doesn't matter if there is small number of concurrent replacements.
But if there are a lot of concurrent backends replacing buffers,
complex dependency net quickly arose.

It could be easily seen with select-only pgbench with scale 100 and
shared buffers 128MB: scale 100 produces 1.5GB tables, and it certainly
doesn't fit shared buffers. This way performance starts to degrade at
~100 connections. Even with shared buffers 1GB it slowly degrades after
150 connections.

But strictly speaking, there is no need to hold both lock
simultaneously. Buffer is pinned so other processes could not select it
for eviction. If tag is cleared and buffer removed from old partition
then other processes will not find it. Therefore it is safe to release
old partition lock before acquiring new partition lock.

If other process concurrently inserts same new buffer, then old buffer
is placed to bufmanager's freelist.

Additional optimisation: in case of old buffer is reused, there is no
need to put its BufferLookupEnt into dynahash's freelist. That reduces
lock contention a bit more. To acomplish this FreeListData.nentries is
changed to pg_atomic_u32/pg_atomic_u64 and atomic increment/decrement
is used.

Remark: there were bug in the `hash_update_hash_key`: nentries were not
kept in sync if freelist partitions differ. This bug were never
triggered because single use of `hash_update_hash_key` doesn't move
entry between partitions.

There is some tests results.

- pgbench with scale 100 were tested with --select-only (since we want
to test buffer manager alone). It produces 1.5GB table.
- two shared_buffers values were tested: 128MB and 1GB.
- second best result were taken among five runs

Test were made in three system configurations:
- notebook with i7-1165G7 (limited to 2.8GHz to not overheat)
- Xeon X5675 6 core 2 socket NUMA system (12 cores/24 threads).
- same Xeon X5675 but restricted to single socket
  (with numactl -m 0 -N 0)

Results for i7-1165G7:

  conns |     master |    patched |  master 1G | patched 1G
--------+------------+------------+------------+------------
      1 |      29667 |      29079 |      29425 |      29411
      2 |      55577 |      55553 |      57974 |      57223
      3 |      87393 |      87924 |      87246 |      89210
      5 |     136222 |     136879 |     133775 |     133949
      7 |     179865 |     176734 |     178297 |     175559
     17 |     215953 |     214708 |     222908 |     223651
     27 |     211162 |     213014 |     220506 |     219752
     53 |     211620 |     218702 |     220906 |     225218
     83 |     213488 |     221799 |     219075 |     228096
    107 |     212018 |     222110 |     222502 |     227825
    139 |     207068 |     220812 |     218191 |     226712
    163 |     203716 |     220793 |     213498 |     226493
    191 |     199248 |     217486 |     210994 |     221026
    211 |     195887 |     217356 |     209601 |     219397
    239 |     193133 |     215695 |     209023 |     218773
    271 |     190686 |     213668 |     207181 |     219137
    307 |     188066 |     214120 |     205392 |     218782
    353 |     185449 |     213570 |     202120 |     217786
    397 |     182173 |     212168 |     201285 |     216489

Results for 1 socket X5675

  conns |     master |    patched |  master 1G | patched 1G
--------+------------+------------+------------+------------
      1 |      16864 |      16584 |      17419 |      17630
      2 |      32764 |      32735 |      34593 |      34000
      3 |      47258 |      46022 |      49570 |      47432
      5 |      64487 |      64929 |      68369 |      68885
      7 |      81932 |      82034 |      87543 |      87538
     17 |     114502 |     114218 |     127347 |     127448
     27 |     116030 |     115758 |     130003 |     128890
     53 |     116814 |     117197 |     131142 |     131080
     83 |     114438 |     116704 |     130198 |     130985
    107 |     113255 |     116910 |     129932 |     131468
    139 |     111577 |     116929 |     129012 |     131782
    163 |     110477 |     116818 |     128628 |     131697
    191 |     109237 |     116672 |     127833 |     131586
    211 |     108248 |     116396 |     127474 |     131650
    239 |     107443 |     116237 |     126731 |     131760
    271 |     106434 |     115813 |     126009 |     131526
    307 |     105077 |     115542 |     125279 |     131421
    353 |     104516 |     115277 |     124491 |     131276
    397 |     103016 |     114842 |     123624 |     131019

Results for 2 socket x5675

  conns |     master |    patched |  master 1G | patched 1G
--------+------------+------------+------------+------------
      1 |      16323 |      16280 |      16959 |      17598
      2 |      30510 |      31431 |      33763 |      31690
      3 |      45051 |      45834 |      48896 |      47991
      5 |      71800 |      73208 |      78077 |      74714
      7 |      89792 |      89980 |      95986 |      96662
     17 |     178319 |     177979 |     195566 |     196143
     27 |     210475 |     205209 |     226966 |     235249
     53 |     222857 |     220256 |     252673 |     251041
     83 |     219652 |     219938 |     250309 |     250464
    107 |     218468 |     219849 |     251312 |     251425
    139 |     210486 |     217003 |     250029 |     250695
    163 |     204068 |     218424 |     248234 |     252940
    191 |     200014 |     218224 |     246622 |     253331
    211 |     197608 |     218033 |     245331 |     253055
    239 |     195036 |     218398 |     243306 |     253394
    271 |     192780 |     217747 |     241406 |     253148
    307 |     189490 |     217607 |     239246 |     253373
    353 |     186104 |     216697 |     236952 |     253034
    397 |     183507 |     216324 |     234764 |     252872

As can be seen, patched version degrades much slower than master.
(Or even doesn't degrade with 1G shared buffer on older processor).

PS.

There is a room for further improvements:
- buffer manager's freelist could be partitioned
- dynahash's freelist could be sized/aligned to CPU cache line
- in fact, there is no need in dynahash at all. It is better to make
  custom hash-table using BufferDesc as entries. BufferDesc has spare
  space for link to next and hashvalue.

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Hi,
Improvement is impressive.

For BufTableFreeDeleted(), since it only has one call, maybe its caller can invoke hash_return_to_freelist() directly.

For free_list_decrement_nentries():

+   Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);

Is the assertion necessary ? There is similar assertion in free_list_increment_nentries() which would maintain hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

Cheers

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Adding CI to our tree
Next
From: Andres Freund
Date:
Subject: Re: PATH manipulation in 001_libpq_pipeline.pl fails on windows