Re: Handing off SLRU fsyncs to the checkpointer - Mailing list pgsql-hackers
From:            Jakub Wartak
Subject:         Re: Handing off SLRU fsyncs to the checkpointer
Date:
Msg-id:          VI1PR0701MB696023DA7815207237196DC8F6570@VI1PR0701MB6960.eurprd07.prod.outlook.com
In response to:  Re: Handing off SLRU fsyncs to the checkpointer (Thomas Munro <thomas.munro@gmail.com>)
Responses:       Re: Handing off SLRU fsyncs to the checkpointer
                 Re: Handing off SLRU fsyncs to the checkpointer
List:            pgsql-hackers
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch [2], as both are related to WAL recovery performance, which I'm interested in. This is my first message to this mailing list, so please let me know if I should adjust my testing style or formatting.

With both of those patches applied:
  make check        -> passes
  make check-world  -> passes
  make standbycheck (section "Testing Hot Standby" from the docs) -> passes
There wasn't a single segfault or postmaster crash during the tests.

Review of the patches themselves: I'm not qualified to review PostgreSQL internals.

I've used Thomas's redo-bench scripts [1] to measure the performance effect (this approach simplifies testing and excludes network jitter): the 1st column is the redo start -> end timing, the 2nd is the redo end -> end-of-checkpoint timing before the DB is opened for reads. I've conducted 2-3 separate tests that show the benefits of those patches depending on the workload:

- Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery on slower / higher-latency storage by ~20%.
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery for HOT updates also by ~20%.

TEST1: workload profile as per standard TPC-B pgbench -c8 -j8, with the majority of records in the WAL stream being Heap/HOT_UPDATE:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch:
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch:
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
69.686, 0.101
70.601, 0.102
70.042, 0.101

As Thomas stated, in the standard pgbench workload profile there is compactify_tuples()->pg_qsort() overhead visible on the recovery side. This is where the 2nd patch helps:

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance improvement:
58.85, 0.296
58.605, 0.269
58.603, 0.277

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance improvement, fsync=off:
58.618, 0.103
57.693, 0.101
58.779, 0.111

In the final case the top of the profile is as follows, still related to the sorting, but as I understand it in a much more optimal way:

    26.68%  postgres  postgres           [.] qsort_itemoff
            ---qsort_itemoff
               |--14.17%--qsort_itemoff
               |          |--10.96%--compactify_tuples
               |          |          PageRepairFragmentation
               |          |          heap2_redo
               |          |          StartupXLOG
               |           --3.21%--qsort_itemoff
               |                      --3.10%--compactify_tuples
               |                               PageRepairFragmentation
               |                               heap2_redo
               |                               StartupXLOG
                --12.51%--compactify_tuples
                          PageRepairFragmentation
                          heap2_redo
                          StartupXLOG

     8.38%  postgres  libc-2.17.so       [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               compactify_tuples
               PageRepairFragmentation
               heap2_redo
               StartupXLOG

     6.51%  postgres  postgres           [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--3.62%--smgropen
               |          |--2.17%--XLogReadBufferExtended
               |          |           --1.76%--XLogReadBufferForRedoExtended
               |          |                      --0.93%--heap_xlog_update
               |           --1.45%--ReadBufferWithoutRelcache
               |                     XLogReadBufferExtended
               |                      --1.34%--XLogReadBufferForRedoExtended
               |                                 --0.72%--heap_xlog_update
                --2.69%--BufTableLookup
                         ReadBuffer_common
                         ReadBufferWithoutRelcache
                         XLogReadBufferExtended
                          --2.48%--XLogReadBufferForRedoExtended
                                    |--1.34%--heap2_redo
                                    |          StartupXLOG
                                     --0.83%--heap_xlog_update

To sum up: a HOT-update-like workload profile tends to be CPU bound on the single-process recovery side. Even slow storage (like xvda) was not the bottleneck here, as everything was already hot in the VFS cache.

TEST2: the second suite of tests used an append-only workload profile (the same number of transactions, using pgbench -c8 -j8 -f insert_only.sql -n -t 1000000), however with a simpler structure:

CREATE TABLE t (c1 uuid NOT NULL, c2 uuid NOT NULL, c3 integer NOT NULL, c4 integer NOT NULL);
CREATE INDEX i1 ON t USING btree (c1);
CREATE INDEX i2 ON t USING btree (c2);
CREATE INDEX i3 ON t USING btree (c3);
CREATE EXTENSION "uuid-ossp";

and a customized script for pgbench, just this:

BEGIN;
INSERT INTO t (c1,c2,c3,c4) VALUES (uuid_generate_v4(), uuid_generate_v4(), round(5000*random()), 500);
END;

The majority of WAL records are Btree/INSERT_LEAF (60%); the results are as follows:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
120.732, 5.275
120.608, 5.902
120.685, 5.872

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch:
99.061, 7.071
99.472, 6.386
98.036, 5.994

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
102.838, 0.136
102.711, 0.099
103.093, 0.0970001

NVMe, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
96.46, 0.405
96.121, 0.405
95.951, 0.402

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch:
94.124, 0.387
96.61, 0.416
94.624, 0.451

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
95.401, 0.0969999
95.028, 0.099
94.632, 0.0970001

So apparently v2-0001-Defer-flushing-of-SLRU-files.patch helps in my case on higher-latency storage. The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):

    29.62%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--17.98%--copyin
               |          [..]
               |          __pwrite_nocancel
               |          FileWrite
               |          mdwrite
               |          FlushBuffer
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |           --17.57%--btree_xlog_insert
               |                     btree_redo
               |                     StartupXLOG
                --11.64%--copyout
                          [..]
                          __pread_nocancel
                           --11.44%--FileRead
                                     mdread
                                     ReadBuffer_common
                                     ReadBufferWithoutRelcache
                                     XLogReadBufferExtended
                                     XLogReadBufferForRedoExtended
                                      --11.34%--btree_xlog_insert
                                                btree_redo
                                                StartupXLOG

     5.62%  postgres  postgres           [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--1.74%--smgropen
               [..]

The number of syscalls/s topped out at ~100k/s @ 8kB for reads and writes respectively (it's logical I/O, as everything fits in the VFS cache; nearly no physical I/O). It was also visible during the test that the startup/recovering process spent 60% of its time in %sys in such conditions.
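In case anyone wants to reproduce the syscall-rate / %sys observation, a minimal sketch of the kind of OS-level commands I mean is below (assumptions: sysstat's pidstat and strace are installed, a single cluster is running, and the startup process carries the usual "postgres: startup" title):

    # find the startup/recovering process (assumes one cluster and the usual process title)
    PID=$(pgrep -f 'postgres: startup')

    # per-second %usr / %system split for the startup process
    pidstat -p "$PID" 1

    # count syscalls for a while; stop with Ctrl-C to print the per-syscall summary
    strace -c -p "$PID"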
As there was no sorting visible in the profiler, I did not test this workload with the compactify_tuples()->pg_qsort() performance improvement here, although from basic runs it appears it did not introduce any degradation.

TEST2b: Quite frankly, at first glance I did not understand why btree_xlog_insert()->ReadBuffer_common() would require FlushBuffer() and therefore writes, until I bumped shared_buffers from 128MB to 24GB, as it might have been flushing dirty buffers on eviction, which caused those pwrite()s. It's not evident as a direct call from ReadBuffer*() in the code (inlining? any idea how to monitor the cycling of dirty buffers on a standby during recovery, when there is no bgwriter yet? one rough monitoring idea is sketched in the PS at the end of this mail). Still, I wanted to eliminate storage and the VFS cache as bottlenecks. This led me to results like the below (fsync=on, defer patch applied, shared_buffers=24GB to eliminate VFS cache lookups):

27.341, 0.858
27.26, 0.869
26.959, 0.86

Turning the defer-SLRU patch and/or fsync on/off doesn't seem to make any difference here, so if anyone is curious, the next set of append-only bottlenecks looks like this:

    14.69%  postgres  postgres      [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--9.80%--BufTableLookup
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |          |--7.76%--btree_xlog_insert
               |          |          btree_redo
               |          |          StartupXLOG
               |           --1.63%--heap_xlog_insert
                --4.90%--smgropen
                         |--2.86%--ReadBufferWithoutRelcache
                         |          XLogReadBufferExtended
                         |          |--2.04%--XLogReadBufferForRedoExtended
                         |          |          |--1.22%--btree_xlog_insert
                         |          |          |          btree_redo
                         |          |          |          StartupXLOG
                         |          |           --0.82%--heap_xlog_insert
                         |           --0.82%--XLogRecordPageWithFreeSpace
                         |                     heap_xlog_insert
                          --2.04%--XLogReadBufferExtended
                                    XLogReadBufferForRedoExtended
                                     --1.63%--btree_xlog_insert
                                              btree_redo
                                              StartupXLOG

     7.76%  postgres  libc-2.17.so  [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               PageAddItemExtended
               btree_xlog_insert
               btree_redo
               StartupXLOG

Still, the result seems nice, as it is 10^6 (trx) / 27s = ~34k TPS on the recovery side (~3000MB of WAL / ~27s = ~111 MB/s without FPW), provided there are no I/O overheads and my assumptions hold. For the full picture, compare with the rate at which I was able to generate the "primary" data with INSERTs, without any special tuning:
- ~29k TPS of INSERTs-only/COMMITs with pgbench -c8
- ~49k TPS of INSERTs-only/COMMITs with pgbench -c16

so the WAL (single-process) recovery code in 14master seems to have about 70% of the performance of a fairly low-end 16-vCPU primary DB in the above append-only conditions.

Specs: Linux 4.14 kernel, ext4 filesystems (data=ordered,noatime), 1s8c16t Xeon CPU E5-2686 v4 @ 2.30GHz, 128GB RAM, gcc 7.2.1, CFLAGS set by ./configure to "-O2", tests on default/tiny shared_buffers until the last test.

xvda AKA slow storage: root file system, single-thread tests:
  dd if=/dev/zero of=test bs=1M count=1000 oflag=direct  => 157 MB/s
  dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 17.2 MB/s

NVMe: striped VG consisting of 2x NVMe devices with much lower latency:
  dd if=/dev/zero of=test bs=1M count=1000 oflag=direct  => 1.9 GB/s or maybe even more
  dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 141 MB/s

-Jakub Wartak

[1] - https://github.com/macdice/redo-bench/
[2] - https://commitfest.postgresql.org/29/2687/
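PS. Regarding my own TEST2b question about monitoring the cycling of dirty buffers during recovery, a rough sketch of the kind of thing I have in mind is below. It assumes a hot standby that accepts read-only connections during replay (so not the crash-recovery setup used above), the pg_buffercache extension already created via the primary, and "bench" as a placeholder database name:

    # sample dirty vs. total shared buffers once per second while WAL is being replayed
    while sleep 1; do
        psql -X -At -d bench -c "
            SELECT now(),
                   count(*) FILTER (WHERE isdirty) AS dirty_buffers,
                   count(*)                        AS total_buffers
            FROM pg_buffercache;"
    done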