BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Yura Sokolov
Subject BufferAlloc: don't take two simultaneous locks
Date
Msg-id 1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel@postgrespro.ru
Whole thread Raw
Responses Re: BufferAlloc: don't take two simultaneous locks
List pgsql-hackers
Good day.

I found some opportunity in Buffer Manager code in BufferAlloc
function:
- When valid buffer is evicted, BufferAlloc acquires two partition
lwlocks: for partition for evicted block is in and partition for new
block placement.

It doesn't matter if there is small number of concurrent replacements.
But if there are a lot of concurrent backends replacing buffers,
complex dependency net quickly arose.

It could be easily seen with select-only pgbench with scale 100 and
shared buffers 128MB: scale 100 produces 1.5GB tables, and it certainly
doesn't fit shared buffers. This way performance starts to degrade at
~100 connections. Even with shared buffers 1GB it slowly degrades after
150 connections. 

But strictly speaking, there is no need to hold both lock
simultaneously. Buffer is pinned so other processes could not select it
for eviction. If tag is cleared and buffer removed from old partition
then other processes will not find it. Therefore it is safe to release
old partition lock before acquiring new partition lock.

If other process concurrently inserts same new buffer, then old buffer
is placed to bufmanager's freelist.

Additional optimisation: in case of old buffer is reused, there is no
need to put its BufferLookupEnt into dynahash's freelist. That reduces
lock contention a bit more. To acomplish this FreeListData.nentries is
changed to pg_atomic_u32/pg_atomic_u64 and atomic increment/decrement
is used.

Remark: there were bug in the `hash_update_hash_key`: nentries were not
kept in sync if freelist partitions differ. This bug were never
triggered because single use of `hash_update_hash_key` doesn't move
entry between partitions.

There is some tests results.

- pgbench with scale 100 were tested with --select-only (since we want
to test buffer manager alone). It produces 1.5GB table.
- two shared_buffers values were tested: 128MB and 1GB.
- second best result were taken among five runs

Test were made in three system configurations:
- notebook with i7-1165G7 (limited to 2.8GHz to not overheat)
- Xeon X5675 6 core 2 socket NUMA system (12 cores/24 threads).
- same Xeon X5675 but restricted to single socket
  (with numactl -m 0 -N 0)

Results for i7-1165G7:

  conns |     master |    patched |  master 1G | patched 1G 
--------+------------+------------+------------+------------
      1 |      29667 |      29079 |      29425 |      29411 
      2 |      55577 |      55553 |      57974 |      57223 
      3 |      87393 |      87924 |      87246 |      89210 
      5 |     136222 |     136879 |     133775 |     133949 
      7 |     179865 |     176734 |     178297 |     175559 
     17 |     215953 |     214708 |     222908 |     223651 
     27 |     211162 |     213014 |     220506 |     219752 
     53 |     211620 |     218702 |     220906 |     225218 
     83 |     213488 |     221799 |     219075 |     228096 
    107 |     212018 |     222110 |     222502 |     227825 
    139 |     207068 |     220812 |     218191 |     226712 
    163 |     203716 |     220793 |     213498 |     226493 
    191 |     199248 |     217486 |     210994 |     221026 
    211 |     195887 |     217356 |     209601 |     219397 
    239 |     193133 |     215695 |     209023 |     218773 
    271 |     190686 |     213668 |     207181 |     219137 
    307 |     188066 |     214120 |     205392 |     218782 
    353 |     185449 |     213570 |     202120 |     217786 
    397 |     182173 |     212168 |     201285 |     216489 

Results for 1 socket X5675

  conns |     master |    patched |  master 1G | patched 1G 
--------+------------+------------+------------+------------
      1 |      16864 |      16584 |      17419 |      17630 
      2 |      32764 |      32735 |      34593 |      34000 
      3 |      47258 |      46022 |      49570 |      47432 
      5 |      64487 |      64929 |      68369 |      68885 
      7 |      81932 |      82034 |      87543 |      87538 
     17 |     114502 |     114218 |     127347 |     127448 
     27 |     116030 |     115758 |     130003 |     128890 
     53 |     116814 |     117197 |     131142 |     131080 
     83 |     114438 |     116704 |     130198 |     130985 
    107 |     113255 |     116910 |     129932 |     131468 
    139 |     111577 |     116929 |     129012 |     131782 
    163 |     110477 |     116818 |     128628 |     131697 
    191 |     109237 |     116672 |     127833 |     131586 
    211 |     108248 |     116396 |     127474 |     131650 
    239 |     107443 |     116237 |     126731 |     131760 
    271 |     106434 |     115813 |     126009 |     131526 
    307 |     105077 |     115542 |     125279 |     131421 
    353 |     104516 |     115277 |     124491 |     131276 
    397 |     103016 |     114842 |     123624 |     131019 

Results for 2 socket x5675

  conns |     master |    patched |  master 1G | patched 1G 
--------+------------+------------+------------+------------
      1 |      16323 |      16280 |      16959 |      17598 
      2 |      30510 |      31431 |      33763 |      31690 
      3 |      45051 |      45834 |      48896 |      47991 
      5 |      71800 |      73208 |      78077 |      74714 
      7 |      89792 |      89980 |      95986 |      96662 
     17 |     178319 |     177979 |     195566 |     196143 
     27 |     210475 |     205209 |     226966 |     235249 
     53 |     222857 |     220256 |     252673 |     251041 
     83 |     219652 |     219938 |     250309 |     250464 
    107 |     218468 |     219849 |     251312 |     251425 
    139 |     210486 |     217003 |     250029 |     250695 
    163 |     204068 |     218424 |     248234 |     252940 
    191 |     200014 |     218224 |     246622 |     253331 
    211 |     197608 |     218033 |     245331 |     253055 
    239 |     195036 |     218398 |     243306 |     253394 
    271 |     192780 |     217747 |     241406 |     253148 
    307 |     189490 |     217607 |     239246 |     253373 
    353 |     186104 |     216697 |     236952 |     253034 
    397 |     183507 |     216324 |     234764 |     252872 

As can be seen, patched version degrades much slower than master.
(Or even doesn't degrade with 1G shared buffer on older processor).

PS.

There is a room for further improvements:
- buffer manager's freelist could be partitioned
- dynahash's freelist could be sized/aligned to CPU cache line
- in fact, there is no need in dynahash at all. It is better to make
  custom hash-table using BufferDesc as entries. BufferDesc has spare
  space for link to next and hashvalue.

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachment

pgsql-hackers by date:

Previous
From: "Bossart, Nathan"
Date:
Subject: Re: Enabling deduplication with system catalog indexes
Next
From: Andres Freund
Date:
Subject: Adding CI to our tree