Re: WAL Insertion Lock Improvements - Mailing list pgsql-hackers
From: Bharath Rupireddy
Subject: Re: WAL Insertion Lock Improvements
Date:
Msg-id: CALj2ACWGHGJ0pCsuVqJ4QkEfv8D_69rZL0FUK7TFfukaOcf7YQ@mail.gmail.com
In response to: Re: WAL Insertion Lock Improvements (Michael Paquier <michael@paquier.xyz>)
Responses: Re: WAL Insertion Lock Improvements
List: pgsql-hackers
On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> I have been trying to get some reproducible numbers, but I think that
> I am going to need a bigger maching than what I have been using for
> the last few days, up to 400 connections. It is worth noting that
> 00d1e02b may influence a bit the results, so we may want to have more
> numbers with that in place particularly with INSERTs, and one of the
> tests used upthread uses single row INSERTs.

I ran performance tests on the patch with different use-cases. The patch clearly reduces the burden on LWLock's waitlist lock (evident from the perf reports [1]). However, to see a visible impact on TPS, the transactions must generate small amounts of WAL (between 16 bytes and 2 KB) in a highly concurrent manner; check the results below (FWIW, I've zipped and attached the perf images along with the test setup for better illustration).

When each transaction generates between 16 bytes and 2 KB of WAL under high concurrency, the benefit is clearly visible in TPS, with more than a 2.3x improvement at higher client counts. As the amount of WAL per transaction grows beyond 2 KB, the gain from the reduced burden on the waitlist lock is offset by the increased wait/release work for the WAL insertion locks themselves, and no visible benefit remains.

Note that I've used pg_logical_emit_message() so that the amount of WAL each transaction generates is easy to control and reason about, but the pattern is the same if the WAL comes from, say, INSERTs.
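To make "reducing the burden on the waitlist lock" concrete before the numbers: on HEAD, every read of a WAL insertion lock's insertingAt position goes through the LWLock wait-list lock, whereas with the patch the value is a pg_atomic_uint64 that is read and written lock-free. A rough sketch of the difference (simplified, with illustrative function names; not the actual patch hunks):

/*
 * Simplified sketch, not the actual hunks.  read_insertingAt_head() and
 * read_insertingAt_patched() are illustrative names only;
 * LWLockWaitListLock()/LWLockWaitListUnlock() are the lwlock.c-internal
 * helpers that show up in the perf profiles below.
 */

/* HEAD: 64-bit reads aren't assumed atomic, so serialize through the lock. */
uint64
read_insertingAt_head(LWLock *lock, uint64 *valptr)
{
    uint64      value;

    LWLockWaitListLock(lock);
    value = *valptr;
    LWLockWaitListUnlock(lock);

    return value;
}

/* PATCHED: the variable is a pg_atomic_uint64, so no wait-list lock at all. */
uint64
read_insertingAt_patched(pg_atomic_uint64 *valptr)
{
    return pg_atomic_read_u64(valptr);
}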
test-case 1: -T5, WAL ~16 bytes
clients   HEAD     PATCHED
1         1437     1352
2         1376     1419
4         2919     2774
8         5875     6371
16        11148    12242
32        22108    23532
64        41414    46478
128       85304    85235
256       83771    152901
512       61970    141021
768       56514    118899
1024      51784    110960
2048      39141    84150
4096      16901    45759

test-case 1: -t1000, WAL ~16 bytes
clients   HEAD     PATCHED
1         1417     1333
2         1363     1791
4         2978     2970
8         5954     6198
16        11179    11164
32        23742    24043
64        45537    44103
128       84683    91762
256       80369    146293
512       61080    132079
768       57236    118046
1024      53497    114574
2048      46423    93588
4096      42067    85790

test-case 2: -T5, WAL ~256 bytes
clients   HEAD     PATCHED
1         1521     1386
2         1647     1637
4         3088     3270
8         6011     5631
16        12778    10317
32        24117    20006
64        43966    38199
128       72660    67936
256       93096    121261
512       57247    142418
768       53782    126218
1024      50279    109153
2048      35109    91602
4096      21184    39848

test-case 2: -t1000, WAL ~256 bytes
clients   HEAD     PATCHED
1         1265     1389
2         1522     1258
4         2802     2775
8         5875     5422
16        11664    10853
32        21961    22145
64        44304    40851
128       73278    80494
256       91172    122287
512       60966    136734
768       56590    125050
1024      52481    124341
2048      47878    104760
4096      42838    94121

test-case 3: -T5, WAL 512 bytes
clients   HEAD     PATCHED
1         1464     1284
2         1520     1381
4         2985     2877
8         6237     5261
16        11296    10621
32        22257    20789
64        40548    37243
128       66507    59891
256       92516    97506
512       56404    119716
768       51127    112482
1024      48463    103484
2048      38079    81424
4096      18977    40942

test-case 3: -t1000, WAL 512 bytes
clients   HEAD     PATCHED
1         1452     1434
2         1604     1649
4         3051     2971
8         5967     5650
16        10471    10702
32        20257    20899
64        39412    36750
128       62767    61110
256       81050    89768
512       56888    122786
768       51238    114444
1024      48972    106867
2048      43451    98847
4096      40018    111079

test-case 4: -T5, WAL 1024 bytes
clients   HEAD     PATCHED
1         1405     1395
2         1638     1607
4         3176     3207
8         6271     6024
16        11653    11103
32        20530    20260
64        34313    32367
128       55939    52079
256       74355    76420
512       56506    90983
768       50088    100410
1024      44589    99025
2048      39640    90931
4096      20942    36035

test-case 4: -t1000, WAL 1024 bytes
clients   HEAD     PATCHED
1         1330     1304
2         1615     1366
4         3117     2667
8         6179     5390
16        10524    10426
32        19819    18620
64        34844    29731
128       52180    48869
256       73284    71396
512       55714    96014
768       49336    108100
1024      46113    102789
2048      44627    104721
4096      44979    106189

test-case 5: -T5, WAL 2048 bytes
clients   HEAD     PATCHED
1         1407     1377
2         1518     1559
4         2589     2870
8         4883     5493
16        9075     9201
32        15957    16295
64        27471    25029
128       37493    38642
256       46369    45787
512       61755    62836
768       59144    68419
1024      52495    68933
2048      48608    72500
4096      26463    61252

test-case 5: -t1000, WAL 2048 bytes
clients   HEAD     PATCHED
1         1289     1366
2         1489     1628
4         2960     3036
8         5536     5965
16        9248     10399
32        15770    18140
64        27626    27800
128       36817    39483
256       48533    52105
512       64453    64007
768       59146    64160
1024      57637    61756
2048      59063    62109
4096      58268    61206

test-case 6: -T5, WAL 4096 bytes
clients   HEAD     PATCHED
1         1322     1325
2         1504     1551
4         2811     2880
8         5330     5159
16        8625     8315
32        12820    13534
64        19737    19965
128       26298    24633
256       34630    29939
512       34382    36669
768       33421    33316
1024      33525    32821
2048      37053    37752
4096      37334    39114

test-case 6: -t1000, WAL 4096 bytes
clients   HEAD     PATCHED
1         1212     1371
2         1383     1566
4         2858     2967
8         5092     5035
16        8233     8486
32        13353    13678
64        19052    20072
128       24803    24726
256       34065    33139
512       31590    32029
768       31432    31404
1024      31357    31366
2048      31465    31508
4096      32157    32180

test-case 7: -T5, WAL 8192 bytes
clients   HEAD     PATCHED
1         1287     1233
2         1552     1521
4         2658     2617
8         4680     4532
16        6732     7110
32        9649     9198
64        13276    12042
128       17100    17187
256       17408    17448
512       16595    16358
768       16599    16500
1024      16975    17300
2048      19073    19137
4096      21368    21735

test-case 7: -t1000, WAL 8192 bytes
clients   HEAD     PATCHED
1         1144     1190
2         1414     1395
4         2618     2438
8         4645     4485
16        6766     7001
32        9620     9804
64        12943    13023
128       15904    17148
256       16645    16035
512       15800    15796
768       15788    15810
1024      15814    15817
2048      17775    17771
4096      31715    31682

> Looking at the patch.. LWLockConflictsWithVar() and
> LWLockReleaseClearVar() are the trivial bits. These are OK.

Hm, the crux of the patch is avoiding LWLock's waitlist lock for reading/writing the lock variable, so these are actually the important bits.

> + * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
> + * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
> + * a full barrier, we're guaranteed that the subsequent shared memory
> + * reads/writes, if any, happen after we reset the lock variable.
>
> This mentions that the subsequent read/write operations are safe, so
> this refers to anything happening after the variable is reset. As
> a full barrier, should be also mention that this is also ordered with
> respect to anything that the caller did before clearing the variable?
> From this perspective using pg_atomic_exchange_u64() makes sense to me
> in LWLockReleaseClearVar().

Wordsmithed that comment a bit.
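To spell out what that comment is getting at, the patched LWLockReleaseClearVar() boils down to roughly the following (a simplified sketch, not the exact hunk):

void
LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
    /*
     * pg_atomic_exchange_u64 is a full memory barrier: stores made while
     * holding the lock cannot be reordered past this reset, and reads/writes
     * issued after it cannot be reordered before it.  A plain
     * pg_atomic_write_u64 would make the 64-bit store atomic but would give
     * no ordering guarantee in either direction.
     */
    pg_atomic_exchange_u64(valptr, val);

    LWLockRelease(lock);
}

The exchange orders the variable reset against both what the caller did while holding the lock and whatever it does afterwards, which is the two-way ordering the comment should probably mention.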
> + * XXX: Use of a spinlock at the beginning of this function to read
> + * current insert position implies memory ordering. That means that
> + * the immediate loads and stores to shared memory (for instance,
> + * in LWLockUpdateVar called via LWLockWaitForVar) don't need an
> + * explicit memory barrier as far as the current usage is
> + * concerned. But that might not be safe in general.
> + */
>
> What's the part where this is not safe? Based on what I see, this
> code path is safe because of the previous spinlock. This is the same
> comment as at the beginning of LWLockConflictsWithVar(). Is that
> something that we ought to document at the top of LWLockWaitForVar()
> as well? We have one caller of this function currently, but there may
> be more in the future.

'But that might not be safe in general' applies only to LWLockWaitForVar, not to WaitXLogInsertionsToFinish, for sure. My bad. If another caller of LWLockWaitForVar shows up that doesn't hold a spinlock beforehand, that is when LWLockWaitForVar will need an explicit memory barrier. Per a comment upthread (https://www.postgresql.org/message-id/20221205183007.s72oygp63s43dqyz%40awork3.anarazel.de), I had a note in WaitXLogInsertionsToFinish just before the LWLockWaitForVar call; I have now modified that comment.

> - * you're about to write out.
> + * you're about to write out. Using an atomic variable for insertingAt avoids
> + * taking any explicit lock for reads and writes.
>
> Hmm. Not sure that we need to comment at all.

Removed; I was being verbose. Anyone who understands pg_atomic_uint64 can get to that point easily.

> -LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
> +LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
> [...]
>  Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
>
> - /* Update the lock's value */
> - *valptr = val;
>
> The sensitive change is in LWLockUpdateVar(). I am not completely
> sure to understand this removal, though. Does that influence the case
> where there are waiters?

I'll answer this in a follow-up email so as not to overload this response with too much data.

> Another thing I was wondering about: how much does the fast-path used
> in LWLockUpdateVar() influence the performance numbers? Am I right to
> guess that it counts for most of the gain seen?

I'll cover this in the follow-up email as well.

> Or could it be that
> the removal of the spin lock in
> LWLockConflictsWithVar()/LWLockWaitForVar() the point that has the
> highest effect?

Same here -- follow-up email.

I've addressed the above review comments and attached the v6 patch.
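For anyone skimming the thread, the "fast path" being asked about is along the following lines. This is a hypothetical sketch only (the wakeup logic is elided), not the actual v6 hunk: the value is updated atomically, and the wait-list lock is only taken when LW_FLAG_HAS_WAITERS says there is someone to wake up.

void
LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
    Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);

    /* Full barrier: the new value is visible before we look for waiters. */
    pg_atomic_exchange_u64(valptr, val);

    /* Fast path: no waiters queued, so there is nothing to wake up. */
    if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
        return;

    /* Slow path: take the wait-list lock and wake up waiters, as before. */
    LWLockWaitListLock(lock);
    /* ... walk the wait list and wake LW_WAIT_UNTIL_FREE waiters ... */
    LWLockWaitListUnlock(lock);
}

Whether the gain comes mostly from skipping the wait-list lock here or from dropping the spinlocked read in LWLockConflictsWithVar()/LWLockWaitForVar() is exactly what the follow-up measurements are meant to show.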
[1] test-case 1: -T5, WAL ~16 bytes, HEAD:
+  81.52%   0.03%  postgres  [.] __vstrfmon_l_internal
+  81.52%   0.00%  postgres  [.] startup_hacks
+  81.52%   0.00%  postgres  [.] PostmasterMain
+  63.95%   1.01%  postgres  [.] LWLockWaitListLock
+  61.93%   0.02%  postgres  [.] WaitXLogInsertionsToFinish
+  61.89%   0.05%  postgres  [.] LWLockWaitForVar
+  48.83%  48.33%  postgres  [.] pg_atomic_read_u32_impl
+  48.78%   0.40%  postgres  [.] pg_atomic_read_u32
+  43.19%   0.12%  postgres  [.] LWLockConflictsWithVar
+  19.81%   0.01%  postgres  [.] LWLockQueueSelf
+   7.86%   2.46%  postgres  [.] perform_spin_delay
+   6.14%   6.06%  postgres  [.] spin_delay
+   5.82%   0.01%  postgres  [.] pg_atomic_fetch_or_u32
+   5.81%   5.76%  postgres  [.] pg_atomic_fetch_or_u32_impl
+   4.00%   0.01%  postgres  [.] XLogInsert
+   3.93%   0.03%  postgres  [.] XLogInsertRecord
+   2.13%   0.02%  postgres  [.] LWLockRelease
+   2.10%   0.03%  postgres  [.] LWLockAcquire
+   1.92%   0.00%  postgres  [.] LWLockDequeueSelf
+   1.87%   0.01%  postgres  [.] WALInsertLockAcquire
+   1.68%   0.04%  postgres  [.] LWLockAcquireOrWait
+   1.64%   0.01%  postgres  [.] pg_analyze_and_rewrite_fixedparams
+   1.62%   0.00%  postgres  [.] WALInsertLockRelease
+   1.62%   0.00%  postgres  [.] LWLockReleaseClearVar
+   1.55%   0.01%  postgres  [.] parse_analyze_fixedparams
+   1.51%   0.00%  postgres  [.] transformTopLevelStmt
+   1.50%   0.00%  postgres  [.] transformOptionalSelectInto
+   1.50%   0.01%  postgres  [.] transformStmt
+   1.47%   0.02%  postgres  [.] transformSelectStmt
+   1.29%   0.01%  postgres  [.] XactLogCommitRecord

test-case 1: -T5, WAL ~16 bytes, PATCHED:
+  74.49%   0.04%  postgres  [.] __vstrfmon_l_internal
+  74.49%   0.00%  postgres  [.] startup_hacks
+  74.49%   0.00%  postgres  [.] PostmasterMain
+  51.60%   0.01%  postgres  [.] finish_xact_command
+  51.60%   0.02%  postgres  [.] CommitTransactionCommand
+  51.37%   0.03%  postgres  [.] CommitTransaction
+  49.43%   0.05%  postgres  [.] RecordTransactionCommit
+  46.55%   0.05%  postgres  [.] XLogFlush
+  46.37%   0.85%  postgres  [.] LWLockWaitListLock
+  43.79%   0.02%  postgres  [.] LWLockQueueSelf
+  38.87%   0.03%  postgres  [.] WaitXLogInsertionsToFinish
+  38.79%   0.11%  postgres  [.] LWLockWaitForVar
+  34.99%  34.49%  postgres  [.] pg_atomic_read_u32_impl
+  34.93%   0.35%  postgres  [.] pg_atomic_read_u32
+   6.99%   2.12%  postgres  [.] perform_spin_delay
+   6.64%   0.01%  postgres  [.] XLogInsert
+   6.54%   0.06%  postgres  [.] XLogInsertRecord
+   6.26%   0.08%  postgres  [.] LWLockAcquireOrWait
+   5.31%   5.22%  postgres  [.] spin_delay
+   4.23%   0.04%  postgres  [.] LWLockRelease
+   3.74%   0.01%  postgres  [.] pg_atomic_fetch_or_u32
+   3.73%   3.68%  postgres  [.] pg_atomic_fetch_or_u32_impl
+   3.33%   0.06%  postgres  [.] LWLockAcquire
+   2.97%   0.01%  postgres  [.] pg_plan_queries
+   2.95%   0.01%  postgres  [.] WALInsertLockAcquire
+   2.94%   0.02%  postgres  [.] planner
+   2.94%   0.01%  postgres  [.] pg_plan_query
+   2.92%   0.01%  postgres  [.] LWLockDequeueSelf
+   2.89%   0.04%  postgres  [.] standard_planner
+   2.81%   0.00%  postgres  [.] WALInsertLockRelease
+   2.80%   0.00%  postgres  [.] LWLockReleaseClearVar
+   2.38%   0.07%  postgres  [.] subquery_planner
+   2.35%   0.01%  postgres  [.] XactLogCommitRecord

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com