Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id ae8b7f99f8ee9583bd96e20e271b5969056ae44a.camel@postgrespro.ru
In response to Re: BufferAlloc: don't take two simultaneous locks  (Zhihong Yu <zyu@yugabyte.com>)
List pgsql-hackers
On Fri, 01/10/2021 at 15:46 -0700, Zhihong Yu wrote:
> 
> 
> On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y.sokolov@postgrespro.ru>
> wrote:
> > Good day.
> > 
> > I found an optimization opportunity in the Buffer Manager code, in
> > the BufferAlloc function:
> > - When a valid buffer is evicted, BufferAlloc acquires two partition
> > lwlocks: one for the partition the evicted block is in and one for
> > the partition of the new block's placement.
> > 
> > That doesn't matter when there is only a small number of concurrent
> > replacements. But when a lot of concurrent backends are replacing
> > buffers, a complex net of lock dependencies quickly arises.
> > 
> > It can easily be seen with select-only pgbench at scale 100 and
> > 128MB shared buffers: scale 100 produces a 1.5GB table, which
> > certainly doesn't fit in shared buffers. In that setup performance
> > starts to degrade at ~100 connections. Even with 1GB shared buffers
> > it slowly degrades after 150 connections.
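The setup described above can be reproduced roughly as follows. The exact flags, database name, and run duration are assumptions on my part; the post doesn't give the literal commands:

```shell
# Rough reproduction of the described workload (names and flags assumed).
# Scale 100 creates a ~1.5GB pgbench_accounts table.
pgbench -i -s 100 bench

# Read-only runs at increasing connection counts, as in the result tables.
for c in 1 17 53 107 211 397; do
    pgbench -S -c "$c" -j 4 -T 30 bench
done
```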
> > 
> > But strictly speaking, there is no need to hold both locks
> > simultaneously. The buffer is pinned, so other processes cannot
> > select it for eviction. Once its tag is cleared and the buffer is
> > removed from the old partition, other processes will not find it.
> > Therefore it is safe to release the old partition lock before
> > acquiring the new partition lock.
> > 
> > If another process concurrently inserts the same new buffer, the old
> > buffer is placed on the buffer manager's freelist.
> > 
> > Additional optimisation: when the old buffer is reused, there is no
> > need to put its BufferLookupEnt into dynahash's freelist. That
> > reduces lock contention a bit more. To accomplish this,
> > FreeListData.nentries is changed to pg_atomic_u32/pg_atomic_u64 and
> > atomic increment/decrement is used.
> > 
> > Remark: there was a bug in `hash_update_hash_key`: nentries were not
> > kept in sync if the freelist partitions differed. This bug was never
> > triggered because the single existing use of `hash_update_hash_key`
> > doesn't move an entry between partitions.
> > 
> > Here are some test results.
> > 
> > - pgbench at scale 100 was run with --select-only (since we want to
> > test the buffer manager alone). It produces a 1.5GB table.
> > - two shared_buffers values were tested: 128MB and 1GB.
> > - the second-best result among five runs was taken
> > 
> > Tests were run on three system configurations:
> > - notebook with i7-1165G7 (limited to 2.8GHz to not overheat)
> > - Xeon X5675 6 core 2 socket NUMA system (12 cores/24 threads).
> > - same Xeon X5675 but restricted to single socket
> >   (with numactl -m 0 -N 0)
> > 
> > Results for i7-1165G7:
> > 
> >   conns |     master |    patched |  master 1G | patched 1G 
> > --------+------------+------------+------------+------------
> >       1 |      29667 |      29079 |      29425 |      29411 
> >       2 |      55577 |      55553 |      57974 |      57223 
> >       3 |      87393 |      87924 |      87246 |      89210 
> >       5 |     136222 |     136879 |     133775 |     133949 
> >       7 |     179865 |     176734 |     178297 |     175559 
> >      17 |     215953 |     214708 |     222908 |     223651 
> >      27 |     211162 |     213014 |     220506 |     219752 
> >      53 |     211620 |     218702 |     220906 |     225218 
> >      83 |     213488 |     221799 |     219075 |     228096 
> >     107 |     212018 |     222110 |     222502 |     227825 
> >     139 |     207068 |     220812 |     218191 |     226712 
> >     163 |     203716 |     220793 |     213498 |     226493 
> >     191 |     199248 |     217486 |     210994 |     221026 
> >     211 |     195887 |     217356 |     209601 |     219397 
> >     239 |     193133 |     215695 |     209023 |     218773 
> >     271 |     190686 |     213668 |     207181 |     219137 
> >     307 |     188066 |     214120 |     205392 |     218782 
> >     353 |     185449 |     213570 |     202120 |     217786 
> >     397 |     182173 |     212168 |     201285 |     216489 
> > 
> > Results for 1 socket X5675
> > 
> >   conns |     master |    patched |  master 1G | patched 1G 
> > --------+------------+------------+------------+------------
> >       1 |      16864 |      16584 |      17419 |      17630 
> >       2 |      32764 |      32735 |      34593 |      34000 
> >       3 |      47258 |      46022 |      49570 |      47432 
> >       5 |      64487 |      64929 |      68369 |      68885 
> >       7 |      81932 |      82034 |      87543 |      87538 
> >      17 |     114502 |     114218 |     127347 |     127448 
> >      27 |     116030 |     115758 |     130003 |     128890 
> >      53 |     116814 |     117197 |     131142 |     131080 
> >      83 |     114438 |     116704 |     130198 |     130985 
> >     107 |     113255 |     116910 |     129932 |     131468 
> >     139 |     111577 |     116929 |     129012 |     131782 
> >     163 |     110477 |     116818 |     128628 |     131697 
> >     191 |     109237 |     116672 |     127833 |     131586 
> >     211 |     108248 |     116396 |     127474 |     131650 
> >     239 |     107443 |     116237 |     126731 |     131760 
> >     271 |     106434 |     115813 |     126009 |     131526 
> >     307 |     105077 |     115542 |     125279 |     131421 
> >     353 |     104516 |     115277 |     124491 |     131276 
> >     397 |     103016 |     114842 |     123624 |     131019 
> > 
> > Results for 2 socket x5675
> > 
> >   conns |     master |    patched |  master 1G | patched 1G 
> > --------+------------+------------+------------+------------
> >       1 |      16323 |      16280 |      16959 |      17598 
> >       2 |      30510 |      31431 |      33763 |      31690 
> >       3 |      45051 |      45834 |      48896 |      47991 
> >       5 |      71800 |      73208 |      78077 |      74714 
> >       7 |      89792 |      89980 |      95986 |      96662 
> >      17 |     178319 |     177979 |     195566 |     196143 
> >      27 |     210475 |     205209 |     226966 |     235249 
> >      53 |     222857 |     220256 |     252673 |     251041 
> >      83 |     219652 |     219938 |     250309 |     250464 
> >     107 |     218468 |     219849 |     251312 |     251425 
> >     139 |     210486 |     217003 |     250029 |     250695 
> >     163 |     204068 |     218424 |     248234 |     252940 
> >     191 |     200014 |     218224 |     246622 |     253331 
> >     211 |     197608 |     218033 |     245331 |     253055 
> >     239 |     195036 |     218398 |     243306 |     253394 
> >     271 |     192780 |     217747 |     241406 |     253148 
> >     307 |     189490 |     217607 |     239246 |     253373 
> >     353 |     186104 |     216697 |     236952 |     253034 
> >     397 |     183507 |     216324 |     234764 |     252872 
> > 
> > As can be seen, the patched version degrades much more slowly than
> > master. (Or doesn't degrade at all with 1GB shared buffers on the
> > older processor.)
> > 
> > PS.
> > 
> > There is room for further improvement:
> > - the buffer manager's freelist could be partitioned
> > - dynahash's freelists could be sized/aligned to the CPU cache line
> > - in fact, there is no need for dynahash at all. It would be better
> >   to build a custom hash table using BufferDesc as entries;
> >   BufferDesc has spare space for a next link and a hashvalue.
> > 
> > regards,
> > Yura Sokolov
> > y.sokolov@postgrespro.ru
> > funny.falcon@gmail.com
> 
> Hi,
> The improvement is impressive.

Thank you!

> For BufTableFreeDeleted(), since it only has one call, maybe its
> caller can invoke hash_return_to_freelist() directly.

It would be a dirty break of the abstraction. Everywhere else we talk
through BufTable, and here there would suddenly be hash ... eugh.

> For free_list_decrement_nentries():
> 
> +   Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
> 
> Is the assertion necessary? There is a similar assertion in
> free_list_increment_nentries() which would maintain
> hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

The assertion in free_list_decrement_nentries is absolutely necessary:
it is a direct translation of Assert(nentries >= 0) from signed to
unsigned types. (Since there are no signed atomics in pg, I had to
convert the signed `long nentries` to an unsigned `pg_atomic_uXX
nentries`.)

The assertion in free_list_increment_nentries is not necessary. But it
doesn't hurt either: it is just an Assert, which isn't compiled into
production code.


regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com



