Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From           | Yura Sokolov
Subject        | Re: BufferAlloc: don't take two simultaneous locks
Date           |
Msg-id         | ae8b7f99f8ee9583bd96e20e271b5969056ae44a.camel@postgrespro.ru
In response to | Re: BufferAlloc: don't take two simultaneous locks (Zhihong Yu <zyu@yugabyte.com>)
List           | pgsql-hackers
On Fri, 01/10/2021 at 15:46 -0700, Zhihong Yu wrote:
> On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> > Good day.
> >
> > I found an opportunity in the Buffer Manager code, in the BufferAlloc
> > function: when a valid buffer is evicted, BufferAlloc acquires two
> > partition lwlocks: one for the partition the evicted block is in, and
> > one for the new block's placement.
> >
> > It doesn't matter if there is a small number of concurrent
> > replacements. But if many concurrent backends are replacing buffers,
> > a complex dependency net quickly arises.
> >
> > It can easily be seen with select-only pgbench at scale 100 and
> > 128MB shared buffers: scale 100 produces a 1.5GB table, which
> > certainly doesn't fit in shared buffers. This way performance starts
> > to degrade at ~100 connections. Even with 1GB shared buffers it
> > slowly degrades after 150 connections.
> >
> > But strictly speaking, there is no need to hold both locks
> > simultaneously. The buffer is pinned, so other processes cannot
> > select it for eviction. If the tag is cleared and the buffer removed
> > from the old partition, other processes will not find it. Therefore
> > it is safe to release the old partition lock before acquiring the
> > new partition lock.
> >
> > If another process concurrently inserts the same new buffer, the old
> > buffer is placed on the buffer manager's freelist.
> >
> > Additional optimisation: when the old buffer is reused, there is no
> > need to put its BufferLookupEnt into dynahash's freelist. That
> > reduces lock contention a bit more. To accomplish this,
> > FreeListData.nentries is changed to pg_atomic_u32/pg_atomic_u64 and
> > atomic increment/decrement is used.
> >
> > Remark: there was a bug in `hash_update_hash_key`: nentries was not
> > kept in sync if the freelist partitions differed. This bug was never
> > triggered because the single caller of `hash_update_hash_key` doesn't
> > move an entry between partitions.
> >
> > Here are some test results.
> >
> > - pgbench at scale 100 was run with --select-only (since we want to
> >   test the buffer manager alone). It produces a 1.5GB table.
> > - two shared_buffers values were tested: 128MB and 1GB.
> > - the second-best result was taken among five runs.
> >
> > Tests were made in three system configurations:
> > - notebook with i7-1165G7 (limited to 2.8GHz to not overheat)
> > - Xeon X5675 6-core 2-socket NUMA system (12 cores/24 threads)
> > - same Xeon X5675 but restricted to a single socket
> >   (with numactl -m 0 -N 0)
> >
> > Results for i7-1165G7:
> >
> >  conns |  master  | patched  | master 1G | patched 1G
> > -------+----------+----------+-----------+-----------
> >      1 |    29667 |    29079 |     29425 |      29411
> >      2 |    55577 |    55553 |     57974 |      57223
> >      3 |    87393 |    87924 |     87246 |      89210
> >      5 |   136222 |   136879 |    133775 |     133949
> >      7 |   179865 |   176734 |    178297 |     175559
> >     17 |   215953 |   214708 |    222908 |     223651
> >     27 |   211162 |   213014 |    220506 |     219752
> >     53 |   211620 |   218702 |    220906 |     225218
> >     83 |   213488 |   221799 |    219075 |     228096
> >    107 |   212018 |   222110 |    222502 |     227825
> >    139 |   207068 |   220812 |    218191 |     226712
> >    163 |   203716 |   220793 |    213498 |     226493
> >    191 |   199248 |   217486 |    210994 |     221026
> >    211 |   195887 |   217356 |    209601 |     219397
> >    239 |   193133 |   215695 |    209023 |     218773
> >    271 |   190686 |   213668 |    207181 |     219137
> >    307 |   188066 |   214120 |    205392 |     218782
> >    353 |   185449 |   213570 |    202120 |     217786
> >    397 |   182173 |   212168 |    201285 |     216489
> >
> > Results for 1-socket X5675:
> >
> >  conns |  master  | patched  | master 1G | patched 1G
> > -------+----------+----------+-----------+-----------
> >      1 |    16864 |    16584 |     17419 |      17630
> >      2 |    32764 |    32735 |     34593 |      34000
> >      3 |    47258 |    46022 |     49570 |      47432
> >      5 |    64487 |    64929 |     68369 |      68885
> >      7 |    81932 |    82034 |     87543 |      87538
> >     17 |   114502 |   114218 |    127347 |     127448
> >     27 |   116030 |   115758 |    130003 |     128890
> >     53 |   116814 |   117197 |    131142 |     131080
> >     83 |   114438 |   116704 |    130198 |     130985
> >    107 |   113255 |   116910 |    129932 |     131468
> >    139 |   111577 |   116929 |    129012 |     131782
> >    163 |   110477 |   116818 |    128628 |     131697
> >    191 |   109237 |   116672 |    127833 |     131586
> >    211 |   108248 |   116396 |    127474 |     131650
> >    239 |   107443 |   116237 |    126731 |     131760
> >    271 |   106434 |   115813 |    126009 |     131526
> >    307 |   105077 |   115542 |    125279 |     131421
> >    353 |   104516 |   115277 |    124491 |     131276
> >    397 |   103016 |   114842 |    123624 |     131019
> >
> > Results for 2-socket X5675:
> >
> >  conns |  master  | patched  | master 1G | patched 1G
> > -------+----------+----------+-----------+-----------
> >      1 |    16323 |    16280 |     16959 |      17598
> >      2 |    30510 |    31431 |     33763 |      31690
> >      3 |    45051 |    45834 |     48896 |      47991
> >      5 |    71800 |    73208 |     78077 |      74714
> >      7 |    89792 |    89980 |     95986 |      96662
> >     17 |   178319 |   177979 |    195566 |     196143
> >     27 |   210475 |   205209 |    226966 |     235249
> >     53 |   222857 |   220256 |    252673 |     251041
> >     83 |   219652 |   219938 |    250309 |     250464
> >    107 |   218468 |   219849 |    251312 |     251425
> >    139 |   210486 |   217003 |    250029 |     250695
> >    163 |   204068 |   218424 |    248234 |     252940
> >    191 |   200014 |   218224 |    246622 |     253331
> >    211 |   197608 |   218033 |    245331 |     253055
> >    239 |   195036 |   218398 |    243306 |     253394
> >    271 |   192780 |   217747 |    241406 |     253148
> >    307 |   189490 |   217607 |    239246 |     253373
> >    353 |   186104 |   216697 |    236952 |     253034
> >    397 |   183507 |   216324 |    234764 |     252872
> >
> > As can be seen, the patched version degrades much more slowly than
> > master (or even doesn't degrade at all with 1GB shared buffers on
> > the older processor).
> >
> > PS.
> >
> > There is room for further improvement:
> > - the buffer manager's freelist could be partitioned
> > - dynahash's freelist could be sized/aligned to a CPU cache line
> > - in fact, there is no need for dynahash at all. It is better to
> >   make a custom hash table using BufferDesc as entries: BufferDesc
> >   has spare space for a link to the next entry and the hash value.
> >
> > regards,
> > Yura Sokolov
> > y.sokolov@postgrespro.ru
> > funny.falcon@gmail.com
>
> Hi,
> Improvement is impressive. Thank you!
>
> For BufTableFreeDeleted(), since it only has one call, maybe its
> caller can invoke hash_return_to_freelist() directly.

It would be a dirty break of abstraction: everywhere else we talk to
BufTable, and here there would suddenly be hash_... eugh.

> For free_list_decrement_nentries():
>
> +       Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
>
> Is the assertion necessary? There is a similar assertion in
> free_list_increment_nentries() which would maintain
> hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

The assertion in free_list_decrement_nentries is absolutely necessary:
it is the direct translation of Assert(nentries >= 0) from signed to
unsigned types. (Since there are no signed atomics in pg, I had to
convert the signed `long nentries` to an unsigned `pg_atomic_uXX
nentries`.)

The assertion in free_list_increment_nentries is not necessary, but it
doesn't hurt either: it is just an Assert that doesn't compile into
production code.

regards
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com