BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers
From | Yura Sokolov |
---|---|
Subject | BufferAlloc: don't take two simultaneous locks |
Date | |
Msg-id | 1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel@postgrespro.ru Whole thread Raw |
Responses |
Re: BufferAlloc: don't take two simultaneous locks
|
List | pgsql-hackers |
Good day. I found some opportunity in Buffer Manager code in BufferAlloc function: - When valid buffer is evicted, BufferAlloc acquires two partition lwlocks: for partition for evicted block is in and partition for new block placement. It doesn't matter if there is small number of concurrent replacements. But if there are a lot of concurrent backends replacing buffers, complex dependency net quickly arose. It could be easily seen with select-only pgbench with scale 100 and shared buffers 128MB: scale 100 produces 1.5GB tables, and it certainly doesn't fit shared buffers. This way performance starts to degrade at ~100 connections. Even with shared buffers 1GB it slowly degrades after 150 connections. But strictly speaking, there is no need to hold both lock simultaneously. Buffer is pinned so other processes could not select it for eviction. If tag is cleared and buffer removed from old partition then other processes will not find it. Therefore it is safe to release old partition lock before acquiring new partition lock. If other process concurrently inserts same new buffer, then old buffer is placed to bufmanager's freelist. Additional optimisation: in case of old buffer is reused, there is no need to put its BufferLookupEnt into dynahash's freelist. That reduces lock contention a bit more. To acomplish this FreeListData.nentries is changed to pg_atomic_u32/pg_atomic_u64 and atomic increment/decrement is used. Remark: there were bug in the `hash_update_hash_key`: nentries were not kept in sync if freelist partitions differ. This bug were never triggered because single use of `hash_update_hash_key` doesn't move entry between partitions. There is some tests results. - pgbench with scale 100 were tested with --select-only (since we want to test buffer manager alone). It produces 1.5GB table. - two shared_buffers values were tested: 128MB and 1GB. - second best result were taken among five runs Test were made in three system configurations: - notebook with i7-1165G7 (limited to 2.8GHz to not overheat) - Xeon X5675 6 core 2 socket NUMA system (12 cores/24 threads). - same Xeon X5675 but restricted to single socket (with numactl -m 0 -N 0) Results for i7-1165G7: conns | master | patched | master 1G | patched 1G --------+------------+------------+------------+------------ 1 | 29667 | 29079 | 29425 | 29411 2 | 55577 | 55553 | 57974 | 57223 3 | 87393 | 87924 | 87246 | 89210 5 | 136222 | 136879 | 133775 | 133949 7 | 179865 | 176734 | 178297 | 175559 17 | 215953 | 214708 | 222908 | 223651 27 | 211162 | 213014 | 220506 | 219752 53 | 211620 | 218702 | 220906 | 225218 83 | 213488 | 221799 | 219075 | 228096 107 | 212018 | 222110 | 222502 | 227825 139 | 207068 | 220812 | 218191 | 226712 163 | 203716 | 220793 | 213498 | 226493 191 | 199248 | 217486 | 210994 | 221026 211 | 195887 | 217356 | 209601 | 219397 239 | 193133 | 215695 | 209023 | 218773 271 | 190686 | 213668 | 207181 | 219137 307 | 188066 | 214120 | 205392 | 218782 353 | 185449 | 213570 | 202120 | 217786 397 | 182173 | 212168 | 201285 | 216489 Results for 1 socket X5675 conns | master | patched | master 1G | patched 1G --------+------------+------------+------------+------------ 1 | 16864 | 16584 | 17419 | 17630 2 | 32764 | 32735 | 34593 | 34000 3 | 47258 | 46022 | 49570 | 47432 5 | 64487 | 64929 | 68369 | 68885 7 | 81932 | 82034 | 87543 | 87538 17 | 114502 | 114218 | 127347 | 127448 27 | 116030 | 115758 | 130003 | 128890 53 | 116814 | 117197 | 131142 | 131080 83 | 114438 | 116704 | 130198 | 130985 107 | 113255 | 116910 | 129932 | 131468 139 | 111577 | 116929 | 129012 | 131782 163 | 110477 | 116818 | 128628 | 131697 191 | 109237 | 116672 | 127833 | 131586 211 | 108248 | 116396 | 127474 | 131650 239 | 107443 | 116237 | 126731 | 131760 271 | 106434 | 115813 | 126009 | 131526 307 | 105077 | 115542 | 125279 | 131421 353 | 104516 | 115277 | 124491 | 131276 397 | 103016 | 114842 | 123624 | 131019 Results for 2 socket x5675 conns | master | patched | master 1G | patched 1G --------+------------+------------+------------+------------ 1 | 16323 | 16280 | 16959 | 17598 2 | 30510 | 31431 | 33763 | 31690 3 | 45051 | 45834 | 48896 | 47991 5 | 71800 | 73208 | 78077 | 74714 7 | 89792 | 89980 | 95986 | 96662 17 | 178319 | 177979 | 195566 | 196143 27 | 210475 | 205209 | 226966 | 235249 53 | 222857 | 220256 | 252673 | 251041 83 | 219652 | 219938 | 250309 | 250464 107 | 218468 | 219849 | 251312 | 251425 139 | 210486 | 217003 | 250029 | 250695 163 | 204068 | 218424 | 248234 | 252940 191 | 200014 | 218224 | 246622 | 253331 211 | 197608 | 218033 | 245331 | 253055 239 | 195036 | 218398 | 243306 | 253394 271 | 192780 | 217747 | 241406 | 253148 307 | 189490 | 217607 | 239246 | 253373 353 | 186104 | 216697 | 236952 | 253034 397 | 183507 | 216324 | 234764 | 252872 As can be seen, patched version degrades much slower than master. (Or even doesn't degrade with 1G shared buffer on older processor). PS. There is a room for further improvements: - buffer manager's freelist could be partitioned - dynahash's freelist could be sized/aligned to CPU cache line - in fact, there is no need in dynahash at all. It is better to make custom hash-table using BufferDesc as entries. BufferDesc has spare space for link to next and hashvalue. regards, Yura Sokolov y.sokolov@postgrespro.ru funny.falcon@gmail.com
Attachment
pgsql-hackers by date: