Re: WAL Insertion Lock Improvements - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: WAL Insertion Lock Improvements
Date
Msg-id CALj2ACWGHGJ0pCsuVqJ4QkEfv8D_69rZL0FUK7TFfukaOcf7YQ@mail.gmail.com
In response to Re: WAL Insertion Lock Improvements  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> I have been trying to get some reproducible numbers, but I think that
> I am going to need a bigger machine than what I have been using for
> the last few days, up to 400 connections.  It is worth noting that
> 00d1e02b may influence a bit the results, so we may want to have more
> numbers with that in place particularly with INSERTs, and one of the
> tests used upthread uses single row INSERTs.

I ran performance tests on the patch for different use-cases. The patch
clearly reduces the burden on the LWLock's waitlist lock (evident from
the perf reports [1]). However, for that to translate into a visible
TPS improvement, the txns must generate small amounts of WAL (between
16 bytes and 2 KB) in a highly concurrent manner; see the results
below. (FWIW, I've zipped and attached the perf images along with the
test setup for better illustration.)

When the txns generate a small amount of WAL, i.e. between 16 bytes
and 2 KB, in a highly concurrent manner, the benefit is clearly
visible in the TPS: more than a 2.3X improvement. When the txns
generate more WAL, i.e. more than 2 KB, the gain from the reduced
burden on the waitlist lock is offset by the increased wait/release
activity on the WAL insertion locks, and no visible benefit is seen.

In other words, as the amount of WAL each txn generates increases, the
benefit gained from the reduced burden on the waitlist lock is
progressively offset by the longer waits on the WAL insertion locks.

Note that I've used pg_logical_emit_message() to make it easy to
control and reason about how much WAL each txn generates, but the
pattern is the same when txns generate comparable amounts of WAL by
other means, say with INSERTs.

test-case 1: -T5, WAL ~16 bytes
clients    HEAD    PATCHED
1    1437    1352
2    1376    1419
4    2919    2774
8    5875    6371
16    11148    12242
32    22108    23532
64    41414    46478
128    85304    85235
256    83771    152901
512    61970    141021
768    56514    118899
1024    51784    110960
2048    39141    84150
4096    16901    45759

test-case 1: -t1000, WAL ~16 bytes
clients    HEAD    PATCHED
1    1417    1333
2    1363    1791
4    2978    2970
8    5954    6198
16    11179    11164
32    23742    24043
64    45537    44103
128    84683    91762
256    80369    146293
512    61080    132079
768    57236    118046
1024    53497    114574
2048    46423    93588
4096    42067    85790

test-case 2: -T5, WAL ~256 bytes
clients    HEAD    PATCHED
1    1521    1386
2    1647    1637
4    3088    3270
8    6011    5631
16    12778    10317
32    24117    20006
64    43966    38199
128    72660    67936
256    93096    121261
512    57247    142418
768    53782    126218
1024    50279    109153
2048    35109    91602
4096    21184    39848

test-case 2: -t1000, WAL ~256 bytes
clients    HEAD    PATCHED
1    1265    1389
2    1522    1258
4    2802    2775
8    5875    5422
16    11664    10853
32    21961    22145
64    44304    40851
128    73278    80494
256    91172    122287
512    60966    136734
768    56590    125050
1024    52481    124341
2048    47878    104760
4096    42838    94121

test-case 3: -T5, WAL 512 bytes
clients    HEAD    PATCHED
1    1464    1284
2    1520    1381
4    2985    2877
8    6237    5261
16    11296    10621
32    22257    20789
64    40548    37243
128    66507    59891
256    92516    97506
512    56404    119716
768    51127    112482
1024    48463    103484
2048    38079    81424
4096    18977    40942

test-case 3: -t1000, WAL 512 bytes
clients    HEAD    PATCHED
1    1452    1434
2    1604    1649
4    3051    2971
8    5967    5650
16    10471    10702
32    20257    20899
64    39412    36750
128    62767    61110
256    81050    89768
512    56888    122786
768    51238    114444
1024    48972    106867
2048    43451    98847
4096    40018    111079

test-case 4: -T5, WAL 1024 bytes
clients    HEAD    PATCHED
1    1405    1395
2    1638    1607
4    3176    3207
8    6271    6024
16    11653    11103
32    20530    20260
64    34313    32367
128    55939    52079
256    74355    76420
512    56506    90983
768    50088    100410
1024    44589    99025
2048    39640    90931
4096    20942    36035

test-case 4: -t1000, WAL 1024 bytes
clients    HEAD    PATCHED
1    1330    1304
2    1615    1366
4    3117    2667
8    6179    5390
16    10524    10426
32    19819    18620
64    34844    29731
128    52180    48869
256    73284    71396
512    55714    96014
768    49336    108100
1024    46113    102789
2048    44627    104721
4096    44979    106189

test-case 5: -T5, WAL 2048 bytes
clients    HEAD    PATCHED
1    1407    1377
2    1518    1559
4    2589    2870
8    4883    5493
16    9075    9201
32    15957    16295
64    27471    25029
128    37493    38642
256    46369    45787
512    61755    62836
768    59144    68419
1024    52495    68933
2048    48608    72500
4096    26463    61252

test-case 5: -t1000, WAL 2048 bytes
clients    HEAD    PATCHED
1    1289    1366
2    1489    1628
4    2960    3036
8    5536    5965
16    9248    10399
32    15770    18140
64    27626    27800
128    36817    39483
256    48533    52105
512    64453    64007
768    59146    64160
1024    57637    61756
2048    59063    62109
4096    58268    61206

test-case 6: -T5, WAL 4096 bytes
clients    HEAD    PATCHED
1    1322    1325
2    1504    1551
4    2811    2880
8    5330    5159
16    8625    8315
32    12820    13534
64    19737    19965
128    26298    24633
256    34630    29939
512    34382    36669
768    33421    33316
1024    33525    32821
2048    37053    37752
4096    37334    39114

test-case 6: -t1000, WAL 4096 bytes
clients    HEAD    PATCHED
1    1212    1371
2    1383    1566
4    2858    2967
8    5092    5035
16    8233    8486
32    13353    13678
64    19052    20072
128    24803    24726
256    34065    33139
512    31590    32029
768    31432    31404
1024    31357    31366
2048    31465    31508
4096    32157    32180

test-case 7: -T5, WAL 8192 bytes
clients    HEAD    PATCHED
1    1287    1233
2    1552    1521
4    2658    2617
8    4680    4532
16    6732    7110
32    9649    9198
64    13276    12042
128    17100    17187
256    17408    17448
512    16595    16358
768    16599    16500
1024    16975    17300
2048    19073    19137
4096    21368    21735

test-case 7: -t1000, WAL 8192 bytes
clients    HEAD    PATCHED
1    1144    1190
2    1414    1395
4    2618    2438
8    4645    4485
16    6766    7001
32    9620    9804
64    12943    13023
128    15904    17148
256    16645    16035
512    15800    15796
768    15788    15810
1024    15814    15817
2048    17775    17771
4096    31715    31682

> Looking at the patch..  LWLockConflictsWithVar() and
> LWLockReleaseClearVar() are the trivial bits.  These are OK.

Hm, the crux of the patch is avoiding the LWLock's waitlist lock for
reading/writing the lock variable, so those are in fact the important
bits.

> +    * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
> +    * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
> +    * a full barrier, we're guaranteed that the subsequent shared memory
> +    * reads/writes, if any, happen after we reset the lock variable.
>
> This mentions that the subsequent read/write operations are safe, so
> this refers to anything happening after the variable is reset.  As
> a full barrier, should be also mention that this is also ordered with
> respect to anything that the caller did before clearing the variable?
> From this perspective using pg_atomic_exchange_u64() makes sense to me
> in LWLockReleaseClearVar().

Wordsmithed that comment a bit.
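
For illustration, the release path with that comment reads roughly as
below (a sketch, not the exact v6 wording):

void
LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
    /*
     * Reset the variable with pg_atomic_exchange_u64 rather than
     * pg_atomic_write_u64: the exchange is a full barrier, so the reset is
     * ordered both against the shared memory accesses the caller did while
     * holding the lock and against anything done after the release.
     */
    pg_atomic_exchange_u64(valptr, val);

    LWLockRelease(lock);
}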

> +            * XXX: Use of a spinlock at the beginning of this function to read
> +            * current insert position implies memory ordering. That means that
> +            * the immediate loads and stores to shared memory (for instance,
> +            * in LWLockUpdateVar called via LWLockWaitForVar) don't need an
> +            * explicit memory barrier as far as the current usage is
> +            * concerned. But that might not be safe in general.
>              */
> What's the part where this is not safe?  Based on what I see, this
> code path is safe because of the previous spinlock.  This is the same
> comment as at the beginning of LWLockConflictsWithVar().  Is that
> something that we ought to document at the top of LWLockWaitForVar()
> as well?  We have one caller of this function currently, but there may
> be more in the future.

'But that might not be safe in general' applies only to
LWLockWaitForVar, not to WaitXLogInsertionsToFinish, for sure. My bad.

If another caller of LWLockWaitForVar shows up that doesn't hold a
spinlock beforehand, that's when LWLockWaitForVar would need an
explicit memory barrier.

Per a comment upthread
(https://www.postgresql.org/message-id/20221205183007.s72oygp63s43dqyz%40awork3.anarazel.de),
I had added a note in WaitXLogInsertionsToFinish before the
LWLockWaitForVar call; I've now modified that comment.
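
To spell out why the current caller is fine: WaitXLogInsertionsToFinish()
reads the current insert position under the insertpos_lck spinlock just
before it loops over the WAL insertion locks with LWLockWaitForVar(),
roughly like this (paraphrased from xlog.c, not the exact code):

/* Paraphrase of the start of WaitXLogInsertionsToFinish() in xlog.c. */
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint64      bytepos;
XLogRecPtr  reservedUpto;

/* Read the current insert position under the spinlock... */
SpinLockAcquire(&Insert->insertpos_lck);
bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);
reservedUpto = XLogBytePosToEndRecPtr(bytepos);

/*
 * ...and only then wait on the per-lock insertingAt variables.  The
 * spinlock acquire/release above already implies the memory ordering the
 * subsequent LWLockWaitForVar() calls rely on, so no explicit barrier is
 * needed for this caller.
 */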

> - * you're about to write out.
> + * you're about to write out. Using an atomic variable for insertingAt avoids
> + * taking any explicit lock for reads and writes.
>
> Hmm.  Not sure that we need to comment at all.

Removed. I was being verbose; anyone who understands pg_atomic_uint64
can get to that point easily.

> -LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
> +LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
> [...]
>     Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
>
> -   /* Update the lock's value */
> -   *valptr = val;
>
> The sensitive change is in LWLockUpdateVar().  I am not completely
> sure to understand this removal, though.  Does that influence the case
> where there are waiters?

I'll cover this in a follow-up email so as not to overload this
response with too much data.

> Another thing I was wondering about: how much does the fast-path used
> in LWLockUpdateVar() influence the performance numbers? Am I right to
> guess that it counts for most of the gain seen?

I'll cover this in a follow-up email so as not to overload this
response with too much data.

> Or could it be that
> the removal of the spin lock in
> LWLockConflictsWithVar()/LWLockWaitForVar() the point that has the
> highest effect?

I'll cover this in a follow-up email so as not to overload this
response with too much data.

I've addressed the above review comments and attached the v6 patch.

[1]
test-case 1: -T5, WAL ~16 bytes HEAD:
+   81.52%     0.03%  postgres  [.] __vstrfmon_l_internal
+   81.52%     0.00%  postgres  [.] startup_hacks
+   81.52%     0.00%  postgres  [.] PostmasterMain
+   63.95%     1.01%  postgres  [.] LWLockWaitListLock
+   61.93%     0.02%  postgres  [.] WaitXLogInsertionsToFinish
+   61.89%     0.05%  postgres  [.] LWLockWaitForVar
+   48.83%    48.33%  postgres  [.] pg_atomic_read_u32_impl
+   48.78%     0.40%  postgres  [.] pg_atomic_read_u32
+   43.19%     0.12%  postgres  [.] LWLockConflictsWithVar
+   19.81%     0.01%  postgres  [.] LWLockQueueSelf
+    7.86%     2.46%  postgres  [.] perform_spin_delay
+    6.14%     6.06%  postgres  [.] spin_delay
+    5.82%     0.01%  postgres  [.] pg_atomic_fetch_or_u32
+    5.81%     5.76%  postgres  [.] pg_atomic_fetch_or_u32_impl
+    4.00%     0.01%  postgres  [.] XLogInsert
+    3.93%     0.03%  postgres  [.] XLogInsertRecord
+    2.13%     0.02%  postgres  [.] LWLockRelease
+    2.10%     0.03%  postgres  [.] LWLockAcquire
+    1.92%     0.00%  postgres  [.] LWLockDequeueSelf
+    1.87%     0.01%  postgres  [.] WALInsertLockAcquire
+    1.68%     0.04%  postgres  [.] LWLockAcquireOrWait
+    1.64%     0.01%  postgres  [.] pg_analyze_and_rewrite_fixedparams
+    1.62%     0.00%  postgres  [.] WALInsertLockRelease
+    1.62%     0.00%  postgres  [.] LWLockReleaseClearVar
+    1.55%     0.01%  postgres  [.] parse_analyze_fixedparams
+    1.51%     0.00%  postgres  [.] transformTopLevelStmt
+    1.50%     0.00%  postgres  [.] transformOptionalSelectInto
+    1.50%     0.01%  postgres  [.] transformStmt
+    1.47%     0.02%  postgres  [.] transformSelectStmt
+    1.29%     0.01%  postgres  [.] XactLogCommitRecord

test-case 1: -T5, WAL ~16 bytes PATCHED:
+   74.49%     0.04%  postgres  [.] __vstrfmon_l_internal
+   74.49%     0.00%  postgres  [.] startup_hacks
+   74.49%     0.00%  postgres  [.] PostmasterMain
+   51.60%     0.01%  postgres  [.] finish_xact_command
+   51.60%     0.02%  postgres  [.] CommitTransactionCommand
+   51.37%     0.03%  postgres  [.] CommitTransaction
+   49.43%     0.05%  postgres  [.] RecordTransactionCommit
+   46.55%     0.05%  postgres  [.] XLogFlush
+   46.37%     0.85%  postgres  [.] LWLockWaitListLock
+   43.79%     0.02%  postgres  [.] LWLockQueueSelf
+   38.87%     0.03%  postgres  [.] WaitXLogInsertionsToFinish
+   38.79%     0.11%  postgres  [.] LWLockWaitForVar
+   34.99%    34.49%  postgres  [.] pg_atomic_read_u32_impl
+   34.93%     0.35%  postgres  [.] pg_atomic_read_u32
+    6.99%     2.12%  postgres  [.] perform_spin_delay
+    6.64%     0.01%  postgres  [.] XLogInsert
+    6.54%     0.06%  postgres  [.] XLogInsertRecord
+    6.26%     0.08%  postgres  [.] LWLockAcquireOrWait
+    5.31%     5.22%  postgres  [.] spin_delay
+    4.23%     0.04%  postgres  [.] LWLockRelease
+    3.74%     0.01%  postgres  [.] pg_atomic_fetch_or_u32
+    3.73%     3.68%  postgres  [.] pg_atomic_fetch_or_u32_impl
+    3.33%     0.06%  postgres  [.] LWLockAcquire
+    2.97%     0.01%  postgres  [.] pg_plan_queries
+    2.95%     0.01%  postgres  [.] WALInsertLockAcquire
+    2.94%     0.02%  postgres  [.] planner
+    2.94%     0.01%  postgres  [.] pg_plan_query
+    2.92%     0.01%  postgres  [.] LWLockDequeueSelf
+    2.89%     0.04%  postgres  [.] standard_planner
+    2.81%     0.00%  postgres  [.] WALInsertLockRelease
+    2.80%     0.00%  postgres  [.] LWLockReleaseClearVar
+    2.38%     0.07%  postgres  [.] subquery_planner
+    2.35%     0.01%  postgres  [.] XactLogCommitRecord

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

