Re: Handing off SLRU fsyncs to the checkpointer - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Handing off SLRU fsyncs to the checkpointer
Date
Msg-id VI1PR0701MB696023DA7815207237196DC8F6570@VI1PR0701MB6960.eurprd07.prod.outlook.com
In response to Re: Handing off SLRU fsyncs to the checkpointer  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Handing off SLRU fsyncs to the checkpointer
Re: Handing off SLRU fsyncs to the checkpointer
List pgsql-hackers
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch
[2], as both are related to WAL recovery performance, which I'm interested in. This is my first message to this
mailing list, so please let me know if I should adjust my testing style or formatting.

With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.
Review of the patches themselves: I'm not qualified to review the PostgreSQL internals.
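
For completeness, the checklist above boils down to roughly the following steps (a sketch from memory, not an exact
transcript; the parallelism is arbitrary):

    git apply v2-0001-Defer-flushing-of-SLRU-files.patch   # plus the compactify_tuples() patch from [2]
    ./configure && make -j16
    make check
    make check-world
    # "Testing Hot Standby" procedure from the docs, against a manually set up primary/standby pair:
    cd src/test/regress && make standbycheck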

I've used the redo-bench scripts [1] by Thomas to measure the performance effect (this approach simplifies testing and
excludes network jitter): the 1st column is the redo start -> redo end timing, the 2nd is redo end -> end of
checkpointing before opening the DB for reads. I've conducted 2-3 separate tests that show the benefits of those
patches depending on the workload:
- Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower /
higher-latency storage by ~20%
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates, also by
~20%
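
For anyone who wants to reproduce this without the scripts, the two timings can also be read off the server log by
hand; a rough sketch of what redo-bench automates (assuming timestamps in log_line_prefix and log_checkpoints=on):

    pg_ctl -D $PGDATA -l logfile start     # kick off crash recovery on a copied-back data directory
    grep 'redo starts at' logfile          # recovery begins
    grep 'redo done at' logfile            # 1st column: redo start -> redo end
    grep 'checkpoint complete' logfile     # 2nd column: redo end -> end-of-recovery checkpoint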

TEST1: workload profile test as per standard TPC-B pgbench -c8 -j8, with the majority of records in the WAL stream
being Heap/HOT_UPDATE:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
69.686, 0.101
70.601, 0.102
70.042, 0.101

As Thomas stated, with the standard pgbench workload profile there is visible compactify_tuples()->pg_qsort()
overhead on the recovery side. This is where the 2nd patch helps:

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance
improvement
58.85, 0.296
58.605, 0.269
58.603, 0.277

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance
improvement, fsync=off
58.618, 0.103
57.693, 0.101
58.779, 0.111
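
The call-graph profiles quoted in this mail were gathered with perf against the single startup/recovering process;
something along these lines (the exact sampling options are not important):

    perf record -g -p $(pgrep -f 'startup recovering') -- sleep 30
    perf report --stdio -g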

In this last case the top of the profile is still related to sorting, but as I understand it in a much more optimal
way:

    26.68%  postgres  postgres            [.] qsort_itemoff
            ---qsort_itemoff
               |--14.17%--qsort_itemoff
               |          |--10.96%--compactify_tuples
               |          |          PageRepairFragmentation
               |          |          heap2_redo
               |          |          StartupXLOG
               |           --3.21%--qsort_itemoff
               |                      --3.10%--compactify_tuples
               |                                PageRepairFragmentation
               |                                heap2_redo
               |                                StartupXLOG
                --12.51%--compactify_tuples
                          PageRepairFragmentation
                          heap2_redo
                          StartupXLOG

     8.38%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               compactify_tuples
               PageRepairFragmentation
               heap2_redo
               StartupXLOG

     6.51%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--3.62%--smgropen
               |          |--2.17%--XLogReadBufferExtended
               |          |           --1.76%--XLogReadBufferForRedoExtended
               |          |                      --0.93%--heap_xlog_update
               |           --1.45%--ReadBufferWithoutRelcache
               |                     XLogReadBufferExtended
               |                      --1.34%--XLogReadBufferForRedoExtended
               |                                 --0.72%--heap_xlog_update
                --2.69%--BufTableLookup
                          ReadBuffer_common
                          ReadBufferWithoutRelcache
                          XLogReadBufferExtended
                           --2.48%--XLogReadBufferForRedoExtended
                                     |--1.34%--heap2_redo
                                     |          StartupXLOG
                                      --0.83%--heap_xlog_update


So to sum up, a HOT update-like workload profile tends to be CPU-bound on the single-process recovery side. Even the
slow storage (like xvda) was not the bottleneck here, as the data was already hot in the VFS cache.

TEST2: The second suite of tests used an append-only workload profile (the same number of transactions, using
pgbench -c8 -j8 -f insert_only.sql -n -t 1000000), however with a simpler structure:
CREATE TABLE t (c1 uuid NOT NULL, c2 uuid NOT NULL, c3 integer NOT NULL, c4 integer NOT NULL);
CREATE INDEX i1 ON t USING btree (c1);
CREATE INDEX i2 ON t USING btree (c2);
CREATE INDEX i3 ON t USING btree (c3);
CREATE EXTENSION "uuid-ossp";

and a customized script for pgbench, just this: BEGIN; INSERT INTO t (c1,c2,c3,c4) values ( uuid_generate_v4(),
uuid_generate_v4(), round(5000*random()), 500); END;
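
The WAL record distribution mentioned below can be cross-checked with pg_waldump's statistics mode; a sketch (replace
the segment names with the first/last WAL files of the run):

    cd $PGDATA/pg_wal && pg_waldump --stats=record <first-segment> <last-segment>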

The majority of WAL records were Btree/INSERT_LEAF (60%); the results are as follows:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
120.732, 5.275
120.608, 5.902
120.685, 5.872

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
99.061, 7.071
99.472, 6.386
98.036, 5.994

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
102.838, 0.136
102.711, 0.099
103.093, 0.0970001

NVMe, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
96.46, 0.405
96.121, 0.405
95.951, 0.402

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
94.124, 0.387
96.61, 0.416
94.624, 0.451

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
95.401, 0.0969999
95.028, 0.099
94.632, 0.0970001

So apparently v2-0001-Defer-flushing-of-SLRU-files helps in my case on higher-latency storage.

The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in
the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):

    29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--17.98%--copyin
[..]
               |          __pwrite_nocancel
               |          FileWrite
               |          mdwrite
               |          FlushBuffer
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |           --17.57%--btree_xlog_insert
               |                     btree_redo
               |                     StartupXLOG
               |
                --11.64%--copyout
[..]
                          __pread_nocancel
                           --11.44%--FileRead
                                     mdread
                                     ReadBuffer_common
                                     ReadBufferWithoutRelcache
                                     XLogReadBufferExtended
                                     XLogReadBufferForRedoExtended
                                      --11.34%--btree_xlog_insert
                                                btree_redo
                                                StartupXLOG

     5.62%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--1.74%--smgropen
[..]

The number of syscalls/s topped out at ~100k/s @ 8kB for each of read & write respectively (it's logical I/O, as
everything fits in the VFS cache; nearly no physical I/O). It was also visible during the test that the
startup/recovering process spent 60% of its time in %sys under such conditions. As there was no sorting visible in the
profiler, I did not test the workload with the compactify_tuples()->pg_qsort() performance improvement here, although
from basic runs it appears it did not introduce any degradation.
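
If anyone wants to double-check the syscall observation, one simple way (a sketch, not necessarily exactly what I
used) is to sample the startup process with strace for a few seconds and look at the pread/pwrite counts:

    timeout 10 strace -c -p $(pgrep -f 'startup recovering')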

TEST2b: Quite frankly, at 1st glance I did not understand why btree_xlog_insert()->ReadBuffer_common() would require a
FlushBuffer() that writes, until I bumped shared_buffers from 128MB to 24GB - it might have been flushing dirty
buffers, which caused those pwrite()s. It's not evident as a direct call from ReadBuffer*() in the code (inlining? any
idea how to monitor on a standby the cycling of dirty buffers during recovery, when there is no bgwriter yet?). Still,
I wanted to eliminate storage and the VFS cache as bottlenecks. This led me to results like the below (with fsync=on,
also the defer patch, shared_buffers=24GB to eliminate VFS cache lookups):
27.341, 0.858
27.26, 0.869
26.959, 0.86
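
Regarding my own question above about watching the dirty-buffer cycling during recovery: one idea I might try,
assuming debug symbols are installed so that perf can see the static FlushBuffer(), would be a dynamic probe - just a
sketch:

    perf probe -x /path/to/bin/postgres FlushBuffer
    perf stat -e probe_postgres:FlushBuffer -p $(pgrep -f 'startup recovering') -- sleep 10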

Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious, the
next set of append-only bottlenecks is as below:

    14.69%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--9.80%--BufTableLookup
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |          |--7.76%--btree_xlog_insert
               |          |          btree_redo
               |          |          StartupXLOG
               |           --1.63%--heap_xlog_insert
                --4.90%--smgropen
                          |--2.86%--ReadBufferWithoutRelcache
                          |          XLogReadBufferExtended
                          |          |--2.04%--XLogReadBufferForRedoExtended
                          |          |          |--1.22%--btree_xlog_insert
                          |          |          |          btree_redo
                          |          |          |          StartupXLOG
                          |          |           --0.82%--heap_xlog_insert
                          |           --0.82%--XLogRecordPageWithFreeSpace
                          |                     heap_xlog_insert
                           --2.04%--XLogReadBufferExtended
                                     XLogReadBufferForRedoExtended
                                      --1.63%--btree_xlog_insert
                                                btree_redo
                                                StartupXLOG

     7.76%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               PageAddItemExtended
               btree_xlog_insert
               btree_redo
               StartupXLOG

Still, the result seems nice, as it is 10^6 (trx) / 27s = ~34k TPS on the recovery side (~3000MB of WAL / ~27s =
~111MB/s without FPW), provided there are no I/O overheads and my assumptions hold. For the full picture, compare this
with the rate at which I was able to generate the "primary" data with INSERTs, without any special tuning:
- ~29k TPS of INSERTs-only/COMMITs with pgbench -c8
- ~49k TPS of INSERTs-only/COMMITs with pgbench -c16
so the WAL (single-process) recovery code in 14master seems to have something like 70% of the performance of a fairly
low-end 16 vCPU primary DB under the above append-only conditions.

Specs: Linux 4.14 kernel, ext4 filesystems (data=ordered, noatime), 1s8c16t Xeon CPU E5-2686 v4 @ 2.30GHz, 128GB RAM,
gcc 7.2.1, CFLAGS set by ./configure to "-O2", tests on default/tiny shared_buffers except for the last test.

xvda AKA slow storage: root file system, single thread tests:
    dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==>  157 MB/s
    dd if=/dev/zero of=test bs=8k count=10000 oflag=direct =>  17.2 MB/s

NVMe: striped VG consisting of 2x NVMe devices with much lower latency
    dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==>  1.9 GB/s or maybe even more
    dd if=/dev/zero of=test bs=8k count=10000 oflag=direct =>  141 MB/s

-Jakub Wartak.

[1] - https://github.com/macdice/redo-bench/
[2] - https://commitfest.postgresql.org/29/2687/


