Thread: Handing off SLRU fsyncs to the checkpointer
Hi,

In commit 3eb77eba we made it possible for any subsystem that wants a file to be flushed as part of the next checkpoint to ask the checkpointer to do that, as previously only md.c could do. In the past, foreground CLOG flush stalls were a problem, but then commit 33aaa139 cranked up the number of buffers, and then commit 5364b357 cranked it right up, until the flushes mostly disappeared from some benchmark workload but not so high that the resulting linear searches through the buffer array destroyed the gains.

I know there is interest in moving that stuff into regular shared buffers, so it can be found via the buffer mapping system (and improve as that improves), written back by the background writer (and improve as that improves), managed with a proper replacement algorithm (and improve as that improves), etc etc. That sounds like a great idea to me, but it's a big project. In the meantime, one thing we could do is hand off the fsyncs, but I'm not sure if it's still considered a real problem in the field given the new parameters.

Anyway, I had a patch for that, which I used while testing commit 3eb77eba. While reading old threads about SLRU today I found that several people had wished for a thing exactly like that, so I dusted it off and rebased it.
Attachment
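For readers who haven't looked at commit 3eb77eba: the hand-off being proposed means that, instead of fsyncing an SLRU segment itself, the writing backend queues a sync request for the checkpointer. A minimal sketch of that shape, using the FileTag/RegisterSyncRequest API from that commit (the SYNC_HANDLER_CLOG constant is the per-SLRU handler this patch adds; the fallback path and function name here are illustrative assumptions, not the patch's actual code):

    #include "postgres.h"
    #include "storage/sync.h"

    /*
     * Sketch only: queue an fsync of one SLRU segment for the checkpointer
     * to perform before the next checkpoint completes, instead of calling
     * fsync() in the recovery/foreground process.
     */
    static void
    slru_request_segment_sync(int segno)
    {
        FileTag     tag;

        MemSet(&tag, 0, sizeof(tag));
        tag.handler = SYNC_HANDLER_CLOG;    /* assumption: handler added by the patch */
        tag.segno = segno;

        if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
        {
            /*
             * Request queue full (or no checkpointer): fall back to fsyncing
             * the segment file directly, as the pre-patch code always did.
             * (Details omitted in this sketch.)
             */
        }
    }

The win described in the next message comes from the checkpointer absorbing many such requests for the same segment and issuing a single fsync for it at checkpoint time.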
On Wed, Feb 12, 2020 at 9:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> In commit 3eb77eba we made it possible for any subsystem that wants a
> file to be flushed as part of the next checkpoint to ask the
> checkpointer to do that, as previously only md.c could do.

Hello,

While working on recovery performance, I found my way back to this idea and rebased the patch.

Problem statement: Every time we have to write out a page of pg_commit_ts, pg_multixact or pg_xact due to cache pressure, we immediately call fsync(). This runs serially in the recovery process, and it's quite bad for pg_commit_ts, because we need to dump out a page for every ~800 transactions (though track_commit_timestamp is not enabled by default). If we ask the checkpointer to do it, it collapses the 2048 fsync calls for each SLRU segment into one, and the kernel can write out the data with larger I/Os, maybe even ahead of time, and update the inode only once.

Experiment: Run crash recovery for 1 million pgbench transactions:

  postgres -D pgdata \
    -c synchronous_commit=off \
    -c track_commit_timestamp=on \
    -c max_wal_size=10GB \
    -c checkpoint_timeout=60min

  # in another shell
  pgbench -i -s10 postgres
  psql postgres -c checkpoint
  pgbench -t1000000 -Mprepared postgres
  killall -9 postgres

  # save the crashed pgdata dir for repeated experiments
  mv pgdata pgdata-save

  # now run experiments like this and see how long recovery takes
  rm -fr pgdata
  cp -r pgdata-save pgdata
  postgres -D pgdata

What I see on a system that has around 2.5ms latency for fsync:

  master:  16.83 seconds
  patched:  4.00 seconds

It's harder to see the effect without commit timestamps enabled, since we only need to flush a pg_xact page every 32k transactions (and multixacts are more complicated to test), but you can still see it. With 8x more transactions to make it clearer what's going on, I could measure a speedup of around 6% from this patch, which I suppose scales up fairly obviously with storage latency (every million transactions = at least 30 fsync stalls, so you can multiply that by your fsync latency and work out how much time your recovery process will be asleep at the wheel instead of applying your records).

From a syscall overhead point of view, it's a bit unfortunate that we open and close SLRU segments every time we write, but it's probably not really enough to complain about... except for the (small) risk of an inode dropping out of kernel caches in the time between closing it and the checkpointer opening it. Hmm.
Attachment
On Tue, Aug 4, 2020 at 6:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... speedup of around 6% ...

I did some better testing. OS: Linux, storage: consumer SSD. I repeatedly ran crash recovery on 3.3GB worth of WAL generated with 8M pgbench transactions. I tested 3 different builds 7 times each and used "ministat" to compare the recovery times. It told me that:

* Master is around 11% faster than last week before commit c5315f4f "Cache smgrnblocks() results in recovery."
* This patch gives a similar speedup, bringing the total to around 25% faster than last week (the time is ~20% less, the WAL processing speed is ~1.25x).

My test fit in RAM and was all cached. With the patch, the recovery process used 100% of a single core the whole time, stayed on that core, and showed low variance; in the other builds it hovered around 90%, hopped around as it kept getting rescheduled, and the variance was higher.

Of course, SLRU fsyncs aren't the only I/O stalls in a real system; among others, there are also reads from faulting in referenced pages that don't have full page images in the WAL. I'm working on that separately, but that's a tad more complicated than this stuff.

Added to commitfest.

=== ministat output showing recovery times in seconds ===
[ASCII box plot omitted; x = patched.dat, + = master.dat, * = lastweek.dat]

    N      Min      Max   Median        Avg      Stddev
x   7   38.655   39.406   39.218  39.134857  0.25188849
+   7   42.128   45.068   43.958  43.815286  0.91387758
Difference at 95.0% confidence
        4.68043 +/- 0.780722
        11.9597% +/- 1.99495%
        (Student's t, pooled s = 0.670306)
*   7   47.187   49.404   49.203  48.904286  0.76793483
Difference at 95.0% confidence
        9.76943 +/- 0.665613
        24.9635% +/- 1.70082%
        (Student's t, pooled s = 0.571477)
On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> * Master is around 11% faster than last week before commit c5315f4f
> "Cache smgrnblocks() results in recovery."
> * This patch gives a similar speedup, bringing the total to around 25%
> faster than last week (the time is ~20% less, the WAL processing speed
> is ~1.25x).

Dang, that's pretty nice, especially for the relatively small amount of code that it seems to require.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 8, 2020 at 2:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > * Master is around 11% faster than last week before commit c5315f4f
> > "Cache smgrnblocks() results in recovery."
> > * This patch gives a similar speedup, bringing the total to around 25%
> > faster than last week (the time is ~20% less, the WAL processing speed
> > is ~1.25x).
>
> Dang, that's pretty nice, especially for the relatively small amount
> of code that it seems to require.

Yeah, the combined effect of these two patches is better than I expected. To be clear though, I was only measuring the time between the "redo starts at ..." and "redo done at ..." messages, since I've been staring at the main recovery code, but there are also some more fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles()) that are unaffected. I think it's probably possible to do something about those too, but that's another topic.

I spotted a small problem: if the transaction ID wraps all the way around between checkpoints, then you might have cancelled requests for a removed SLRU segment from the previous epoch, so we'd better uncancel them if we see that. That's a one-line fix, done in the attached. I also adjusted the commit message to be a little clearer (this work deferment/collapsing scheme works in crash recovery too, not just when there is a checkpointer process to hand the work to).
Attachment
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [patch]

Bitrot, rebased, no changes.

> Yeah, the combined effect of these two patches is better than I
> expected. To be clear though, I was only measuring the time between
> the "redo starts at ..." and "redo done at ..." messages, since I've
> been staring at the main recovery code, but there are also some more
> fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
> that are unaffected. I think it's probably possible to do something
> about those too, but that's another topic.

... and of course the end-of-recovery checkpoint; in my tests this wasn't materially changed since there isn't actually very much CLOG, it's just that we avoided syncing it block at a time and getting rescheduled. FWIW I put a very simple test here: https://github.com/macdice/redo-bench, YMMV.
Attachment
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch [2], as both are related to the WAL recovery performance I'm interested in. This is my first message to this mailing group, so please let me know if I should adjust testing style or formatting.

With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.

Review of the patches themselves: I'm not qualified to review the PostgreSQL internals.

I've used the redo-bench scripts [1] by Thomas to measure the performance effect (this approach simplifies testing and excludes network jitter effects): the 1st column is redo start -> end timing, the 2nd is redo end -> end of checkpointing timing before opening the DB for reads. I've conducted 2-3 separate tests that show the benefits of those patches depending on the workload:

- Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower / higher-latency storage by ~20%
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates also by ~20%

TEST1: workload profile test as per standard TPC-B pgbench -c8 -j8, with the majority of records in the WAL stream being Heap/HOT_UPDATE:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch:
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch:
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
69.686, 0.101
70.601, 0.102
70.042, 0.101

As Thomas stated, in the standard pgbench workload profile there is compactify_tuples()->pg_qsort() overhead visible on the recovery side. So this is where the 2nd patch helps:

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance improvement:
58.85, 0.296
58.605, 0.269
58.603, 0.277

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance improvement, fsync=off:
58.618, 0.103
57.693, 0.101
58.779, 0.111

In this final case the top profile is as follows, still related to the sorting but, as I understand it, in a much more optimal way:

26.68% postgres postgres [.] qsort_itemoff
        ---qsort_itemoff
           |--14.17%--qsort_itemoff
           |          |--10.96%--compactify_tuples
           |          |          PageRepairFragmentation
           |          |          heap2_redo
           |          |          StartupXLOG
           |          --3.21%--qsort_itemoff
           |                    --3.10%--compactify_tuples
           |                              PageRepairFragmentation
           |                              heap2_redo
           |                              StartupXLOG
           --12.51%--compactify_tuples
                     PageRepairFragmentation
                     heap2_redo
                     StartupXLOG

8.38% postgres libc-2.17.so [.] __memmove_ssse3_back
        ---__memmove_ssse3_back
           compactify_tuples
           PageRepairFragmentation
           heap2_redo
           StartupXLOG

6.51% postgres postgres [.] hash_search_with_hash_value
        ---hash_search_with_hash_value
           |--3.62%--smgropen
           |          |--2.17%--XLogReadBufferExtended
           |          |          --1.76%--XLogReadBufferForRedoExtended
           |          |                    --0.93%--heap_xlog_update
           |          --1.45%--ReadBufferWithoutRelcache
           |                    XLogReadBufferExtended
           |                    --1.34%--XLogReadBufferForRedoExtended
           |                              --0.72%--heap_xlog_update
           --2.69%--BufTableLookup
                     ReadBuffer_common
                     ReadBufferWithoutRelcache
                     XLogReadBufferExtended
                     --2.48%--XLogReadBufferForRedoExtended
                               |--1.34%--heap2_redo
                               |          StartupXLOG
                               --0.83%--heap_xlog_update

So to sum up, a HOT update-like workload profile tends to be CPU bound on the single-process recovery side. Even the slow storage (like xvda) was not the bottleneck here, as I used already-hot data from the VFS cache.

TEST2: The second suite of tests used an append-only workload profile (the same amount of transactions, using pgbench -c8 -j8 -f insert_only.sql -n -t 1000000), however with a simpler structure:

CREATE TABLE t (c1 uuid NOT NULL, c2 uuid NOT NULL, c3 integer NOT NULL, c4 integer NOT NULL);
CREATE INDEX i1 ON t USING btree (c1);
CREATE INDEX i2 ON t USING btree (c2);
CREATE INDEX i3 ON t USING btree (c3);
CREATE EXTENSION "uuid-ossp";

and a customized script for pgbench, just this:

BEGIN;
INSERT INTO t (c1,c2,c3,c4) values ( uuid_generate_v4(), uuid_generate_v4(), round(5000*random()), 500);
END;

With the majority of WAL records being Btree/INSERT_LEAF (60%), the results are the following:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
120.732, 5.275
120.608, 5.902
120.685, 5.872

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch:
99.061, 7.071
99.472, 6.386
98.036, 5.994

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
102.838, 0.136
102.711, 0.099
103.093, 0.0970001

NVMe, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
96.46, 0.405
96.121, 0.405
95.951, 0.402

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch:
94.124, 0.387
96.61, 0.416
94.624, 0.451

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off:
95.401, 0.0969999
95.028, 0.099
94.632, 0.0970001

So apparently v2-0001-Defer-flushing-of-SLRU-files helps in my case on higher-latency storage.

The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):

29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
        ---copy_user_enhanced_fast_string
           |--17.98%--copyin
           [..]
           |          __pwrite_nocancel
           |          FileWrite
           |          mdwrite
           |          FlushBuffer
           |          ReadBuffer_common
           |          ReadBufferWithoutRelcache
           |          XLogReadBufferExtended
           |          XLogReadBufferForRedoExtended
           |          --17.57%--btree_xlog_insert
           |                    btree_redo
           |                    StartupXLOG
           --11.64%--copyout
                     [..]
                     __pread_nocancel
                     --11.44%--FileRead
                               mdread
                               ReadBuffer_common
                               ReadBufferWithoutRelcache
                               XLogReadBufferExtended
                               XLogReadBufferForRedoExtended
                               --11.34%--btree_xlog_insert
                                         btree_redo
                                         StartupXLOG

5.62% postgres postgres [.] hash_search_with_hash_value
        ---hash_search_with_hash_value
           |--1.74%--smgropen
           [..]

The number of syscalls/s topped out at ~100k/s @ 8kB for each of read & write respectively (it's logical I/O, as everything fits in the VFS cache; nearly no physical I/O). It was also visible during the test that the startup/recovering process spent 60% of its time in %sys in such conditions.

As there was no sorting visible in the profiler, I did not test the workload with the compactify_tuples()->pg_qsort() performance improvement here, although from basic runs it appears it did not introduce any degradation.

TEST2b: Quite frankly, at first glance I did not understand why btree_xlog_insert()->ReadBuffer_common() would require FlushBuffer(), which would write, until I bumped shared buffers 128MB -> 24GB, as it might have been flushing dirty buffers, which caused those pwrite()s - it's not evident as a direct call in the code from ReadBuffer*() (inlining? any idea how to monitor on a standby the cycling of dirty buffers during recovery when there is no bgwriter yet?). Still, I wanted to eliminate storage and the VFS cache as the bottleneck. This led me to results like below (with fsync=on, also the defer patch, shared_buffers=24GB to eliminate VFS cache lookups):

27.341, 0.858
27.26, 0.869
26.959, 0.86

Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious the next set of append-only bottlenecks is like below:

14.69% postgres postgres [.] hash_search_with_hash_value
        ---hash_search_with_hash_value
           |--9.80%--BufTableLookup
           |          ReadBuffer_common
           |          ReadBufferWithoutRelcache
           |          XLogReadBufferExtended
           |          XLogReadBufferForRedoExtended
           |          |--7.76%--btree_xlog_insert
           |          |          btree_redo
           |          |          StartupXLOG
           |          --1.63%--heap_xlog_insert
           --4.90%--smgropen
                     |--2.86%--ReadBufferWithoutRelcache
                     |          XLogReadBufferExtended
                     |          |--2.04%--XLogReadBufferForRedoExtended
                     |          |          |--1.22%--btree_xlog_insert
                     |          |          |          btree_redo
                     |          |          |          StartupXLOG
                     |          |          --0.82%--heap_xlog_insert
                     |          --0.82%--XLogRecordPageWithFreeSpace
                     |                    heap_xlog_insert
                     --2.04%--XLogReadBufferExtended
                               XLogReadBufferForRedoExtended
                               --1.63%--btree_xlog_insert
                                         btree_redo
                                         StartupXLOG

7.76% postgres libc-2.17.so [.] __memmove_ssse3_back
        ---__memmove_ssse3_back
           PageAddItemExtended
           btree_xlog_insert
           btree_redo
           StartupXLOG

Still, the result seems to be nice, as it is 10^6 (trx) / 27s = ~34k TPS on the recovery side (~3000MB of WAL / ~27s = ~111MB/s without FPW), provided there would be no I/O overheads and my assumptions hold. For the full picture, compare with the rate of generating the "primary" data with INSERTs that I was able to get without any special tuning:
- ~29k TPS of INSERTs-only/COMMITs with pgbench -c8
- ~49k TPS of INSERTs-only/COMMITs with pgbench -c16
so the WAL (single process) recovery code in 14master seems to have something like 70% of the performance of a fairly low-end 16 vCPU primary DB in the above append-only conditions.

Specs: Linux 4.14 kernel, ext4 filesystems (data=ordered,noatime), 1s8c16t Xeon CPU E5-2686 v4 @ 2.30GHz, 128GB RAM, gcc 7.2.1, CFLAGS set by ./configure to "-O2", tests on default/tiny shared_buffers until the last test.

xvda AKA slow storage: root file system, single thread tests:
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==> 157 MB/s
dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 17.2 MB/s

NVMe: striped VG consisting of 2x NVMe devices with much lower latency:
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==> 1.9 GB/s or maybe even more
dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 141 MB/s

-Jakub Wartak.

[1] - https://github.com/macdice/redo-bench/
[2] - https://commitfest.postgresql.org/29/2687/
On Tue, Aug 25, 2020 at 9:16 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch [2], as both are related to the WAL recovery performance I'm interested in. This is my first message to this mailing group, so please let me know if I should adjust testing style or formatting.

Hi Jakub,

Thanks very much for these results!

> - Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower / higher-latency storage by ~20%

Wow. Those fsyncs must have had fairly high latency (presumably due to queuing behind other write back activity).

> - Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates also by ~20%

Nice.

> In this final case the top profile is as follows, still related to the sorting but, as I understand it, in a much more optimal way:
>
> 26.68% postgres postgres [.] qsort_itemoff
> ---qsort_itemoff
> |--14.17%--qsort_itemoff
> | |--10.96%--compactify_tuples
> | | PageRepairFragmentation
> | | heap2_redo
> | | StartupXLOG
> | --3.21%--qsort_itemoff
> | --3.10%--compactify_tuples
> | PageRepairFragmentation
> | heap2_redo
> | StartupXLOG
> --12.51%--compactify_tuples
> PageRepairFragmentation
> heap2_redo
> StartupXLOG

I wonder if there is something higher level that could be done to reduce the amount of compaction work required in the first place, but in the meantime I'm very happy if we can improve the situation so much with such a microscopic improvement that might eventually benefit other sorting stuff...

> 8.38% postgres libc-2.17.so [.] __memmove_ssse3_back
> ---__memmove_ssse3_back
> compactify_tuples
> PageRepairFragmentation
> heap2_redo

Hmm, I wonder if this bit could go a teensy bit faster by moving as many adjacent tuples as you can in one go rather than moving them one at a time...

> The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):
>
> 29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> ---copy_user_enhanced_fast_string
> |--17.98%--copyin
> [..]
> | __pwrite_nocancel
> | FileWrite
> | mdwrite
> | FlushBuffer
> | ReadBuffer_common
> | ReadBufferWithoutRelcache
> | XLogReadBufferExtended
> | XLogReadBufferForRedoExtended
> | --17.57%--btree_xlog_insert

To move these writes out of recovery's way, we should probably just run the bgwriter process during crash recovery. I'm going to look into that.

The other thing is of course the checkpointer process, and our end-of-recovery checkpoint. I was going to suggest it should be optional and not done by the recovery process itself, which is why some earlier numbers I shared didn't include the end-of-recovery checkpoint, but then I realised it complicated the numbers for this little patch and, anyway, it'd be a good idea to open that can of worms separately...

> | btree_redo
> | StartupXLOG
> |
> --11.64%--copyout
> [..]
> __pread_nocancel
> --11.44%--FileRead
> mdread
> ReadBuffer_common
> ReadBufferWithoutRelcache
> XLogReadBufferExtended
> XLogReadBufferForRedoExtended

For these reads, the solution should be WAL prefetching, but the patch I shared for that (and will be updating soon) is just one piece of the puzzle, and as it stands it actually *increases* the number of syscalls by adding some posix_fadvise() calls, so ... erm, for an all-in-kernel-cache-already workload like the one you profiled there, it can only make things worse on that front. But... when combined with Andres's work-in-progress AIO stuff, a whole bunch of reads can be submitted with a single system call ahead of time and then the results are delivered directly into our buffer pool by kernel threads or hardware DMA, so we'll not only avoid going off CPU during recovery but we'll also reduce the system call count.

> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious the next set of append-only bottlenecks is like below:
>
> 14.69% postgres postgres [.] hash_search_with_hash_value
> ---hash_search_with_hash_value
> |--9.80%--BufTableLookup
> | ReadBuffer_common
> | ReadBufferWithoutRelcache
> | XLogReadBufferExtended
> | XLogReadBufferForRedoExtended

Hypothesis: Your 24GB buffer pool requires somewhere near 70MB of buffer mapping table (huh, pg_shmem_allocations doesn't show that correctly), so it doesn't fit into any level of your memory cache hierarchy, and it's super random access, so every buffer lookup is costing you a ~60-100ns memory stall. Maybe? If that's the reason for this showing up in your profile, I think I could probably add a little cache line prefetch phase to the WAL prefetch patch to fix it. I've actually tried prefetching the buffer mapping cache lines before, without success, but never in recovery. I'll make a note to look into that.
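The kind of "prefetch phase" being contemplated here could look something like the sketch below. Only BufTableHashCode() is a real buf_internals.h function; BufTableBucketAddress() is a hypothetical helper invented purely for this illustration, and nothing here is from any posted patch:

    #include "postgres.h"
    #include "storage/buf_internals.h"

    /*
     * Illustrative sketch only: compute the buffer-mapping hash for a batch
     * of upcoming block references and issue a cache-line prefetch for each
     * bucket before doing the real lookups, so the memory stalls overlap
     * instead of serialising in the recovery loop.
     */
    static void
    prefetch_buffer_mapping_entries(BufferTag *tags, int ntags)
    {
        for (int i = 0; i < ntags; i++)
        {
            uint32      hash = BufTableHashCode(&tags[i]);

            /* BufTableBucketAddress() is assumed for the sketch */
            __builtin_prefetch(BufTableBucketAddress(hash));
        }
    }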
Hi,

On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > --12.51%--compactify_tuples
> > PageRepairFragmentation
> > heap2_redo
> > StartupXLOG
>
> I wonder if there is something higher level that could be done to
> reduce the amount of compaction work required in the first place, but
> in the meantime I'm very happy if we can improve the situation so much
> with such a microscopic improvement that might eventually benefit
> other sorting stuff...

Another approach could be to not perform any sorting during recovery, instead including enough information in the WAL record to avoid doing a full blown PageRepairFragmentation during recovery.

Greetings,

Andres Freund
On 2020-Aug-25, Andres Freund wrote:
> Hi,
>
> On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > > --12.51%--compactify_tuples
> > > PageRepairFragmentation
> > > heap2_redo
> > > StartupXLOG
> >
> > I wonder if there is something higher level that could be done to
> > reduce the amount of compaction work required in the first place, but
> > in the meantime I'm very happy if we can improve the situation so much
> > with such a microscopic improvement that might eventually benefit
> > other sorting stuff...
>
> Another approach could be to not perform any sorting during recovery,
> instead including enough information in the WAL record to avoid doing a
> full blown PageRepairFragmentation during recovery.

Hmm, including the sorted ItemId array in the WAL record might make sense to alleviate the qsort work ...

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Aug-25, Jakub Wartak wrote:
> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make
> any difference, so if anyone is curious the next set of append-only
> bottlenecks is like below:
>
> 14.69% postgres postgres [.] hash_search_with_hash_value
> ---hash_search_with_hash_value
> |--9.80%--BufTableLookup
> | ReadBuffer_common
> | ReadBufferWithoutRelcache
> | XLogReadBufferExtended
> | XLogReadBufferForRedoExtended
> | |--7.76%--btree_xlog_insert
> | | btree_redo
> | | StartupXLOG
> | --1.63%--heap_xlog_insert
> --4.90%--smgropen
> |--2.86%--ReadBufferWithoutRelcache

Looking at an earlier report of this problem I was thinking whether it'd make sense to replace SMgrRelationHash with a simplehash table; I have a half-written patch for that, but I haven't completed that work. However, in the older profile things were looking different, as hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74% of it. BufTableLookup was also there but only 1.51%. So I'm not so sure now that that'll pay off as clearly as I had hoped.

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 27, 2020 at 6:15 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > --4.90%--smgropen
> > |--2.86%--ReadBufferWithoutRelcache
>
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it. BufTableLookup was also there but only 1.51%. So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Right, my hypothesis requires an uncacheably large buffer mapping table, and I think smgropen() needs a different explanation because it's not expected to be as large or as random, at least not with a pgbench workload. I think the reasons for a profile with smgropen() showing up so high, and in particular higher than BufTableLookup(), must be:

1. We call smgropen() twice for every call to BufTableLookup(). Once in XLogReadBufferExtended(), and then again in ReadBufferWithoutRelcache().
2. We also call it for every block forced out of the buffer pool, and in recovery that has to be done by the recovery loop.
3. We also call it for every block in the buffer pool during the end-of-recovery checkpoint.

Not sure, but the last two might perform worse due to proximity to interleaving pwrite() system calls (just a thought, not investigated). In any case, I'm going to propose we move those things out of the recovery loop, in a new thread.
Hi Thomas / hackers, >> The append-only bottleneck appears to be limited by syscalls/s due to small block size even with everything in FS cache(but not in shared buffers, please compare with TEST1 as there was no such bottleneck at all): >> >> 29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string >> ---copy_user_enhanced_fast_string >> |--17.98%--copyin >> [..] >> | __pwrite_nocancel >> | FileWrite >> | mdwrite >> | FlushBuffer >> | ReadBuffer_common >> | ReadBufferWithoutRelcache >> | XLogReadBufferExtended >> | XLogReadBufferForRedoExtended >> | --17.57%--btree_xlog_insert > > To move these writes out of recovery's way, we should probably just > run the bgwriter process during crash recovery. I'm going to look > into that. Sounds awesome. Also as this thread is starting to derail the SLRU fsync topic - maybe we should change subject? However,to add some data to the separate bgwriter: on 14master (already with lseek() caching, SLRU fsyncs out of way, bettersorting), I've measured the same configuration as last time with still the same append-only WAL workload on NVMe andcompared with various shared_buffers settings (and buffers description sizing from pg_shmem_allocations which as You statedis wrongly reported(?) which I'm stating only for reference just in case): shared_buffers=128MB buffers_desc=1024kB 96.778, 0.438 [a] shared_buffers=256MB buffers_desc=2048kB 62.755, 0.577 [a] shared_buffers=512MB buffers_desc=4096kB 33.167, 0.62 [a] shared_buffers=1GB buffers_desc=8192kB 27.303, 0.929 shared_buffers=4GB buffers_desc=32MB 27.185, 1.166 shared_buffers=8GB buffers_desc=64MB 27.649, 1.088 shared_buffers=16GB buffers_desc=128MB 27.584, 1.201 [b] shared_buffers=32GB buffers_desc=256MB 32.314, 1.171 [b] shared_buffers=48GB buffers_desc=384 MB 31.95, 1.217 shared_buffers=64GB buffers_desc=512 MB 31.276, 1.349 shared_buffers=72GB buffers_desc=576 MB 31.925, 1.284 shared_buffers=80GB buffers_desc=640 MB 31.809, 1.413 The amount of WAL to be replayed was ~2.8GB. To me it looks like that a) flushing dirty buffers by StartupXLog might be a real problem but please read-on. b) there is very low impact by this L2/L3 hypothesis you mention (?), but it's not that big and it's not degrading linearlyas one would expect (??) L1d:L1d:L2:L3 cache sizes on this machine are as follows on this machine: 32K/32K/256K/46080K.Maybe the DB size is too small. I've double-checked that in condition [a] (shared_buffers=128MB) there was a lot of FlushBuffer() invocations per second(perf stat -e probe_postgres:FlushBuffer -I 1000), e.g: # time counts unit events 1.000485217 79,494 probe_postgres:FlushBuffer 2.000861366 75,305 probe_postgres:FlushBuffer 3.001263163 79,662 probe_postgres:FlushBuffer 4.001650795 80,317 probe_postgres:FlushBuffer 5.002033273 79,951 probe_postgres:FlushBuffer 6.002418646 79,993 probe_postgres:FlushBuffer while at 1GB shared_buffers it sits nearly at zero all the time. So there is like 3x speed-up possible when StartupXLogwouldn't have to care too much about dirty buffers in the critical path (bgwriter would as you say?) at leastin append-only scenarios. But ... I've checked some real systems (even older versions of PostgreSQL not doing thatmuch of replication, and indeed it's e.g. 
happening 8-12k/s FlushBuffer's() and shared buffers are pretty huge (>100GB,0.5TB RAM) but they are *system-wide* numbers, work is really performed by bgwriter not by startup/recovering asin this redo-bench case when DB is open for reads and replicating-- so it appears that this isn't optimization for hotstandby case , but for the DB-closed startup recovery/restart/disaster/PITR scenario). As for the 24GB shared_buffers scenario where dirty buffers are not at all a problem with given top profile (output trimmed),again as expected: 13.41% postgres postgres [.] hash_search_with_hash_value |--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache --5.11%--smgropen |--2.77%--XLogReadBufferExtended --2.34%--ReadBufferWithoutRelcache 7.88% postgres postgres [.] MarkBufferDirty I've tried to get cache misses ratio via PMCs, apparently on EC2 they are (even on bigger) reporting as not-supported orzeros. However interestingly the workload has IPC of 1.40 (instruction bound) which to me is strange as I would expectBufTableLookup() to be actually heavy memory bound (?) Maybe I'll try on some different hardware one day. >> __pread_nocancel >> --11.44%--FileRead >> mdread >> ReadBuffer_common >> ReadBufferWithoutRelcache >> XLogReadBufferExtended >> XLogReadBufferForRedoExtended > > For these reads, the solution should be WAL prefetching,(..) But... when combined with Andres's work-in-progress AIO stuff(..) Yes, I've heard a thing or two about those :) I hope I'll be able to deliver some measurements sooner or later of those twotogether (AIO+WALprefetch). -Jakub Wartak.
Hi Alvaro, Thomas, hackers,

>> 14.69% postgres postgres [.] hash_search_with_hash_value
>> ---hash_search_with_hash_value
>> |--9.80%--BufTableLookup
>> [..]
>> --4.90%--smgropen
>> |--2.86%--ReadBufferWithoutRelcache
>
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it. BufTableLookup was also there but only 1.51%. So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Yes, quite frankly my expectation was to see the hash_search_with_hash_value() <- smgropen() outcome as the first one, but in the simplified redo-bench script that's not the case. The original scenario was much more complex, with plenty of differences (in no particular order: TB-sized DB vs ~500GB RAM -> thousands of forks, multiple tables, huge btrees, multiple INSERTs with plenty of data in VALUES() thrown in as one commit, real primary -> hot-standby replication [not a closed DB in recovery], sorted not random UUIDs) - I'm going to try to nail down these differences and maybe I'll manage to produce a more realistic "pgbench reproducer" (this may take some time though).

-Jakub Wartak.
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> I've tried to get the cache miss ratio via PMCs; apparently on EC2 they are (even on bigger instances) reported as not-supported or zeros.

I heard some of the counters are only allowed on their dedicated instance types.

> However, interestingly, the workload has an IPC of 1.40 (instruction bound), which to me is strange, as I would expect BufTableLookup() to be heavily memory bound (?). Maybe I'll try on some different hardware one day.

Hmm, OK now you've made me go and read a bunch of Brendan Gregg blogs and try some experiments of my own to get a feel for this number and what it might be telling us about the cache miss counters you can't see. Since I know how to generate arbitrary cache miss workloads for quick experiments using hash joins of different sizes, I tried that and noticed that when LLC misses were at 76% (bad), IPC was at 1.69, which is still higher than what you're seeing. When the hash table was much smaller and LLC misses were down to 15% (much better), IPC was at 2.83. I know Gregg said[1] "An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound", but that's not what I'm seeing here, and in his comments section that is disputed. So I'm not sure your IPC of 1.40 is evidence against the hypothesis on its own.

[1] http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> >> 29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> >> ---copy_user_enhanced_fast_string
> >> |--17.98%--copyin
> >> [..]
> >> | __pwrite_nocancel
> >> | FileWrite
> >> | mdwrite
> >> | FlushBuffer
> >> | ReadBuffer_common
> >> | ReadBufferWithoutRelcache
> >> | XLogReadBufferExtended
> >> | XLogReadBufferForRedoExtended
> >> | --17.57%--btree_xlog_insert
> >
> > To move these writes out of recovery's way, we should probably just
> > run the bgwriter process during crash recovery. I'm going to look
> > into that.
>
> Sounds awesome.

I wrote a quick and dirty experimental patch to try that. I can't see any benefit from it on pgbench with default shared buffers, but maybe it would do better with your append test due to locality, especially if you can figure out how to tune bgwriter to pace itself optimally.

https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery
Hi Thomas, hackers,

>> > To move these writes out of recovery's way, we should probably just
>> > run the bgwriter process during crash recovery. I'm going to look
>> > into that.
>>
>> Sounds awesome.
>
> I wrote a quick and dirty experimental patch to try that. I can't see
> any benefit from it on pgbench with default shared buffers, but maybe
> it would do better with your append test due to locality, especially
> if you can figure out how to tune bgwriter to pace itself optimally.
> https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery

OK, so I've quickly tested those two PoC patches together, in conditions like these:
- a similar append-only workload generated by pgbench (to eliminate other already-known WAL recovery bottlenecks, e.g. sorting),
- 4.3GB of WAL to be applied (mostly Btree/INSERT_LEAF)
- on the same system as last time (ext4 on NVMe, 1s8c16t, 4.14 kernel)
- 14master already with the SLRU-fsync-to-checkpointer and pg_qsort patches applied

TEST bgwriterPOC1:
- in severe dirty-memory conditions (artificially simulated via small s_b here) --> so for workloads with very high FlushBuffer activity in StartupXLOG
- with fsync=off/fpw=off by default and on NVMe (e.g. scenario: I want to perform some PITR as fast as I can to see how the production data looked in the past, before some user deleted some data)

baseline s_b@128MB: 140.404, 0.123 (2nd number small as there is only a small region to checkpoint)

22.49% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
        ---copy_user_enhanced_fast_string
           |--14.72%--copyin
           |          __pwrite_nocancel
           |          FileWrite
           |          mdwrite
           |          FlushBuffer
           |          ReadBuffer_common
           |          --14.52%--btree_xlog_insert
           --7.77%--copyout
                     __pread_nocancel
                     --7.57%--FileRead
                               mdread
                               ReadBuffer_common
6.13% postgres [kernel.kallsyms] [k] do_syscall_64
        |--1.64%--__pwrite_nocancel
        --1.23%--__pread_nocancel
3.68% postgres postgres [.] hash_search_with_hash_value
        ---hash_search_with_hash_value
           |--1.02%--smgropen

After applying:
patch -p1 < ../0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch
patch -p1 < ../0002-Optionally-don-t-wait-for-end-of-recovery-checkpoint.patch

0001+0002 s_b@128MB: similar result to above
0001+0002 s_b@128MB: 108.871, 0.114 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 1000
0001+0002 s_b@128MB: 85.392, 0.103 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 50000 #~390MB max?

18.40% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
        ---copy_user_enhanced_fast_string
           |--17.79%--copyout
           |          __pread_nocancel
           |          |--16.56%--FileRead
           |          |          mdread
           |          |          ReadBuffer_common
           --0.61%--copyin // WOW
                     __pwrite_nocancel
                     FileWrite
                     mdwrite
                     FlushBuffer
                     ReadBuffer_common
9.20% postgres postgres [.] hash_search_with_hash_value
        ---hash_search_with_hash_value
           |--4.70%--smgropen

and of course there is another WOW moment during recovery ("61.9%"):

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 120935 0.9 0.0 866052 3824 ? Ss 09:47 0:00 postgres: checkpointer
postgres 120936 61.9 0.0 865796 3824 ? Rs 09:47 0:22 postgres: background writer
postgres 120937 97.4 0.0 865940 5228 ? Rs 09:47 0:36 postgres: startup recovering 000000010000000000000089

That's a speedup of 1.647x when dirty memory is in the way. When it's not:

baseline s_b@24000MB: 39.199, 1.448 (2x patches off)
0001+0002 s_b@24000MB: 39.383, 1.442 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 50000 #~390MB/s max

yay, there's no regression.

I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages should maybe be called standby_bgwriter_lru_maxpages in this scenario, or even better, there shouldn't be a maximum at all during the closed-DB recovery scenario.

TEST bgwriterPOC2a, to showcase the 2nd patch, which opens the DB for read-write users before the final checkpoint finishes after redo recovery. The DBA may make that decision via the parameter end_of_recovery_checkpoint_wait=off.

- on slow storage (xvda, fsync=on) and even with high memory:

s_b@24000MB: 39.043, 15.639 -- even with WAL recovery being 100% CPU bound (mostly on hash_search_with_hash_value() for Buffers/__memmove_ssse3_back), it took an additional 15s to perform the checkpoint before the DB was open for users (it had to write 269462 buffers =~ 2GB =~ 140MB/s, which is close to the xvda device speed). The complete output in 14master looks similar to this:

1598609928.620 startup 22543 LOG: redo done at 1/12201C88
1598609928.624 checkpointer 22541 LOG: checkpoint starting: end-of-recovery immediate wait
1598609944.908 checkpointer 22541 LOG: checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.145 s, sync=0.138 s, total=16.285 s; sync files=11, longest=0.133 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB
1598609944.912 postmaster 22538 LOG: database system is ready to accept connections

s_b@24000MB: 39.96, 0 , with end_of_recovery_checkpoint_wait = off, the DB is open 15s earlier:

1598610331.556 startup 29499 LOG: redo done at 1/12201C88
1598610331.559 checkpointer 29497 LOG: checkpoint starting: immediate force
1598610331.562 postmaster 29473 LOG: database system is ready to accept connections
1598610347.202 checkpointer 29497 LOG: checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.092 s, sync=0.149 s, total=15.643 s; sync files=12, longest=0.142 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB

I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a lot of time, and this 0002 patch bypasses that. I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage snapshots / storage replication, maybe with DWH-after-crash too).

TEST bgwriterPOC2b: FYI, I was also testing the hot_standby code path -- to see if it would reduce the time to start / open a fresh standby for read-only queries, but this parameter doesn't seem to influence that in my tests. As I've learned, it's apparently much more complex to reproduce what I'm after and involves a lot of reading about LogStandbySnapshot() / standby recovery points on my side. Now, back to the smgropen() / hash_search_with_hash_value() reproducer...

-Jakub Wartak.
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> postgres 120935 0.9 0.0 866052 3824 ? Ss 09:47 0:00 postgres: checkpointer
> postgres 120936 61.9 0.0 865796 3824 ? Rs 09:47 0:22 postgres: background writer
> postgres 120937 97.4 0.0 865940 5228 ? Rs 09:47 0:36 postgres: startup recovering 000000010000000000000089
>
> That's a speedup of 1.647x when dirty memory is in the way.

Thanks for testing! That's better than I expected. I guess it wasn't quite so good with default bgwriter settings.

> I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages should maybe be called standby_bgwriter_lru_maxpages in this scenario, or even better, there shouldn't be a maximum at all during the closed-DB recovery scenario.

I wish bgwriter could auto-tune itself better, so we wouldn't need to contemplate adding more settings.

As for the second patch ("Optionally, don't wait for end-of-recovery checkpoint."), that also looked quite useful in your test scenario:

> s_b@24000MB: 39.96, 0 , with end_of_recovery_checkpoint_wait = off, the DB is open 15s earlier

> I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a lot of time, and this 0002 patch bypasses that. I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage snapshots / storage replication, maybe with DWH-after-crash too).

I suppose a third option that you might want is no checkpoint at all (meaning leave it to regular checkpoint scheduling), like fast promotion. One thing missing from the patch is that we probably need to log an end-of-recovery *record*, like fast promotion does. I'm a little fuzzy on the timeline stuff. I wonder if any recovery experts would like to weigh in on theoretical problems that might be lurking here...
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> ... %CPU ... COMMAND
> ... 97.4 ... postgres: startup recovering 000000010000000000000089

So, what else is pushing this thing off CPU, anyway? For one thing, I guess it might be stalling while reading the WAL itself, because (1) we only read it 8KB at a time, relying on kernel read-ahead, which typically defaults to 128KB I/Os unless you cranked it up, but for example we know that's not enough to saturate a sequential scan on an NVMe system, so maybe it hurts here too, and (2) we keep having to switch segment files every 16MB. Increasing the WAL segment size and kernel readahead size presumably help with that, if indeed it is a problem, but we could also experiment with a big POSIX_FADV_WILLNEED hint for a future segment every time we cross a boundary, and also maybe increase the size of our reads.
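The POSIX_FADV_WILLNEED experiment suggested above could be as simple as the following sketch (the helper name and where it would be called from are assumptions for illustration, not an actual patch):

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Sketch: when redo crosses into a new WAL segment, hint the *next*
     * segment to the kernel so read-ahead can run ahead of the recovery
     * loop.  "path" would be the next segment's file name, computed by the
     * caller.
     */
    static void
    hint_next_wal_segment(const char *path)
    {
    #ifdef POSIX_FADV_WILLNEED
        int         fd = open(path, O_RDONLY);

        if (fd >= 0)
        {
            /* offset = 0, len = 0 means "the whole file" */
            (void) posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
            close(fd);
        }
    #endif
    }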
Hi Thomas, hackers,

>> ... %CPU ... COMMAND
>> ... 97.4 ... postgres: startup recovering 000000010000000000000089
>
> So, what else is pushing this thing off CPU, anyway? For one thing, I
> guess it might be stalling while reading the WAL itself, because (1)
> we only read it 8KB at a time, relying on kernel read-ahead, which
> typically defaults to 128KB I/Os unless you cranked it up, but for
> example we know that's not enough to saturate a sequential scan on
> an NVMe system, so maybe it hurts here too, and (2) we keep having to
> switch segment files every 16MB. Increasing the WAL segment size and
> kernel readahead size presumably help with that, if indeed it is a
> problem, but we could also experiment with a big POSIX_FADV_WILLNEED
> hint for a future segment every time we cross a boundary, and also
> maybe increase the size of our reads.

All of the above (1, 2) would make sense, and the effects are IMHO partially achievable via ./configure compile options, but from previous correspondence [1], in this particular workload it looked like it was not WAL reading, but reading random DB blocks into shared buffers: in that case I suppose it was the price of too many syscalls to the OS/VFS cache itself, as the DB was small and fully cached there - so problem (3): copy_user_enhanced_fast_string <- 17.79%--copyout (!) <- __pread_nocancel <- 16.56%--FileRead / mdread / ReadBuffer_common (!). Without some micro-optimization or some form of vectorized [batching] I/O in recovery it's a dead end when it comes to small changes. Things that come to my mind for enhancing recovery:

- preadv() - works only for 1 fd, while the WAL stream might require reading a lot of random pages into s_b (many relations/fds; even btree inserting into a single relation might put data into many 1GB [default] forks). This would only micro-optimize INSERT (pk) SELECT nextval(seq) kind of processing on the recovery side, I suppose. Of course, provided that StartupXLOG would work in a more batched way: (a) read a lot of blocks from the WAL at once, (b) then issue preadv() to get all the DB blocks into s_b, going from the same rel/fd, (c) apply the WAL. Sounds like a major refactor just to save syscalls :(
- mmap() - even more unrealistic
- IO_URING - gives a lot of promise here I think; is it even planned to be shown for the PgSQL 14 cycle? Or is it more like PgSQL 15?

-Jakub Wartak

[1] - https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com please see the profile after "0001+0002 s_b(at)128MB: 85.392"
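To make the preadv() idea above concrete, a minimal sketch of the batched-read step (b) for the simple case of consecutive blocks in one segment file might look like this; nothing here is proposed PostgreSQL code, and scattered block offsets would need a different interface (e.g. io_uring):

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLCKSZ 8192             /* assumed PostgreSQL block size */
    #define MAX_BATCH 16            /* arbitrary batch size for the sketch */

    /*
     * Read up to MAX_BATCH consecutive blocks starting at start_offset into
     * the caller's buffers with a single system call instead of N pread()s.
     * Returns the number of bytes read, or -1 on error (like preadv()).
     */
    static ssize_t
    read_blocks_batched(int fd, off_t start_offset, char *buffers, int nblocks)
    {
        struct iovec iov[MAX_BATCH];
        int         n = (nblocks < MAX_BATCH) ? nblocks : MAX_BATCH;

        for (int i = 0; i < n; i++)
        {
            iov[i].iov_base = buffers + (size_t) i * BLCKSZ;
            iov[i].iov_len = BLCKSZ;
        }

        return preadv(fd, iov, n, start_offset);
    }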
On Mon, Aug 31, 2020 at 8:50 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> - IO_URING - gives a lot of promise here I think; is it even planned to be shown for the PgSQL 14 cycle? Or is it more like PgSQL 15?

I can't answer that, but I've played around with the prototype quite a bit, and thought quite a lot about how to port it to systems without IO_URING, and I'm just as keen to see this happen as you are.

In the meantime, from the low-hanging-fruit department, here's a new version of the SLRU-fsync-offload patch. The only changes are a tweaked commit message, and adoption of C99 designated initialisers for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of relying on humans to make the array order match the enum values. If there are no comments or objections, I'm planning to commit this quite soon.
Attachment
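For anyone unfamiliar with the style being described, a designated-initialiser handler table looks roughly like this (the struct layout and callback names are simplified/assumed for illustration; see the attached patch for the real table):

    #include "storage/sync.h"

    /*
     * Illustrative only: each slot is tied to its SyncRequestHandler enum
     * value explicitly, so the array cannot silently get out of order with
     * the enum.  Field and function names here are assumptions, not the
     * patch's exact identifiers.
     */
    typedef struct SyncOps
    {
        int         (*sync_syncfiletag) (const FileTag *ftag, char *path);
    } SyncOps;

    static const SyncOps syncsw[] = {
        [SYNC_HANDLER_MD] = {.sync_syncfiletag = mdsyncfiletag},
        [SYNC_HANDLER_CLOG] = {.sync_syncfiletag = clogsyncfiletag},
        [SYNC_HANDLER_COMMIT_TS] = {.sync_syncfiletag = committssyncfiletag},
        [SYNC_HANDLER_MULTIXACT_OFFSET] = {.sync_syncfiletag = multixactoffsetssyncfiletag},
        [SYNC_HANDLER_MULTIXACT_MEMBER] = {.sync_syncfiletag = multixactmemberssyncfiletag}
    };

Note the absence of a trailing comma after the last element, which is the syntax detail the Windows CI complained about in the follow-up message below.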
On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> In the meantime, from the low-hanging-fruit department, here's a new
> version of the SLRU-fsync-offload patch. The only changes are a
> tweaked commit message, and adoption of C99 designated initialisers
> for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
> relying on humans to make the array order match the enum values. If
> there are no comments or objections, I'm planning to commit this quite
> soon.

... and CI told me that Windows didn't like my array syntax with the extra trailing comma. Here's one without.
Attachment
On Sun, Sep 20, 2020 at 12:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > In the meantime, from the low-hanging-fruit department, here's a new
> > version of the SLRU-fsync-offload patch. The only changes are a
> > tweaked commit message, and adoption of C99 designated initialisers
> > for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
> > relying on humans to make the array order match the enum values. If
> > there are no comments or objections, I'm planning to commit this quite
> > soon.
>
> ... and CI told me that Windows didn't like my array syntax with the
> extra trailing comma. Here's one without.

While scanning for comments and identifier names that needed updating, I realised that this patch changed the behaviour of the ShutdownXXX() functions, since they currently flush the SLRUs but are not followed by a checkpoint. I'm not entirely sure I understand the logic of that, but it wasn't my intention to change it. So here's a version that converts the existing fsync_fname() to fsync_fname_recurse() to fix that.

Strangely, the fsync calls that ensure that directory entries are on disk seemed to be missing from CheckPointMultiXact(), so I added them. Isn't that a live bug?

I decided it was a little too magical that CheckPointCLOG() etc depended on the later call to CheckPointBuffers() to perform their fsyncs. I started writing comments about that, but then I realised that the right thing to do was to hoist ProcessSyncRequests() out of there into CheckPointGuts() and make it all more explicit.

I also realised that it would be inconsistent to count SLRU sync activity as buffer sync time, but not count SLRU write activity as buffer write time, or count its buffers as written in the reported statistics. In other words, SLRU buffers *are* buffers for checkpoint reporting purposes (or should at least be consistently in or out of the stats, and with this patch they have to be in). Does that make sense? Is there a problem I'm not seeing with reordering CheckPointGuts() as I have?

One comment change that seems worth highlighting is this code reached by VACUUM, which seems like a strict improvement (it wasn't flushing for crash recovery):

        /*
-        * Flush out dirty data, so PhysicalPageExists can work correctly.
-        * SimpleLruFlush() is a pretty big hammer for that. Alternatively we
-        * could add an in-memory version of page exists, but find_multixact_start
-        * is called infrequently, and it doesn't seem bad to flush buffers to
-        * disk before truncation.
+        * Write out dirty data, so PhysicalPageExists can work correctly.
         */
-       SimpleLruFlush(MultiXactOffsetCtl, true);
-       SimpleLruFlush(MultiXactMemberCtl, true);
+       SimpleLruWriteAll(MultiXactOffsetCtl, true);
+       SimpleLruWriteAll(MultiXactMemberCtl, true);
Attachment
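To make the reordering question concrete, the shape being described is roughly the following (heavily abbreviated, with several callees omitted; this is a sketch of the idea under discussion, not the committed code):

    /*
     * Abbreviated sketch: each CheckPointXXX() now only writes dirty pages
     * and queues sync requests; the fsyncs happen once, explicitly, in
     * ProcessSyncRequests(), instead of implicitly inside CheckPointBuffers().
     */
    static void
    CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
    {
        CheckPointRelationMap();
        CheckPointReplicationSlots();
        /* ... other subsystems omitted ... */
        CheckPointCLOG();           /* SimpleLruWriteAll() + queued syncs */
        CheckPointCommitTs();
        CheckPointSUBTRANS();
        CheckPointMultiXact();
        CheckPointPredicate();
        CheckPointBuffers(flags);   /* BufferSync() queues md.c sync requests */
        ProcessSyncRequests();      /* one pass of fsyncs for all of the above */
        CheckPointTwoPhase(checkPointRedo);     /* deliberately last */
    }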
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> While scanning for comments and identifier names that needed updating,
> I realised that this patch changed the behaviour of the ShutdownXXX()
> functions, since they currently flush the SLRUs but are not followed
> by a checkpoint. I'm not entirely sure I understand the logic of
> that, but it wasn't my intention to change it. So here's a version
> that converts the existing fsync_fname() to fsync_fname_recurse() to

Bleugh, that was probably a bad idea, it's too expensive. But it forces me to ask the question: *why* do we need to call Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a shutdown checkpoint? I wondered if this might date from before the WAL, but I see that the pattern was introduced when the CLOG was moved out of shared buffers into a proto-SLRU in ancient commit 2589735da08, but even in that commit the preceding CreateCheckPoint() call included a call to CheckPointCLOG().
On Tue, Sep 22, 2020 at 9:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > While scanning for comments and identifier names that needed updating,
> > I realised that this patch changed the behaviour of the ShutdownXXX()
> > functions, since they currently flush the SLRUs but are not followed
> > by a checkpoint. I'm not entirely sure I understand the logic of
> > that, but it wasn't my intention to change it. So here's a version
> > that converts the existing fsync_fname() to fsync_fname_recurse() to
>
> Bleugh, that was probably a bad idea, it's too expensive. But it
> forces me to ask the question: *why* do we need to call
> Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
> shutdown checkpoint? I wondered if this might date from before the
> WAL, but I see that the pattern was introduced when the CLOG was moved
> out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
> but even in that commit the preceding CreateCheckPoint() call included
> a call to CheckPointCLOG().

I complained about the apparently missing multixact fsync in a new thread, because if I'm right about that it requires a back-patchable fix. As for the ShutdownXXX() functions, I haven't yet come up with any reason for this code to exist. Emboldened by a colleague's inability to explain to me what that code is doing for us, here is a new version that just rips it all out.
Attachment
On Wed, Sep 23, 2020 at 1:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for the ShutdownXXX() functions, I haven't yet come up with any
> reason for this code to exist. Emboldened by a colleague's inability
> to explain to me what that code is doing for us, here is a new version
> that just rips it all out.

Rebased. Tom, do you have any thoughts on ShutdownCLOG() etc?
Attachment
Thomas Munro <thomas.munro@gmail.com> writes:
> Tom, do you have any thoughts on ShutdownCLOG() etc?

Hm, if we cannot reach that without first completing a shutdown checkpoint, it does seem a little pointless.

It'd likely be a good idea to add a comment to CheckPointCLOG et al explaining that we expect $what-exactly to fsync the data we are writing before the checkpoint is considered done.

regards, tom lane
On Fri, Sep 25, 2020 at 12:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Tom, do you have any thoughts on ShutdownCLOG() etc?
>
> Hm, if we cannot reach that without first completing a shutdown checkpoint,
> it does seem a little pointless.

Thanks for the sanity check.

> It'd likely be a good idea to add a comment to CheckPointCLOG et al
> explaining that we expect $what-exactly to fsync the data we are writing
> before the checkpoint is considered done.

Good point. Done like this:

+       /*
+        * Write dirty CLOG pages to disk. This may result in sync requests queued
+        * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+        */
        TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-       SimpleLruFlush(XactCtl, true);
+       SimpleLruWriteAll(XactCtl, true);
        TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);

Here's a new version. The final thing I'm contemplating before pushing this is whether there may be hidden magical dependencies in the order of operations in CheckPointGuts(), which I've changed around. Andres, any comments?
Attachment
On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new version. The final thing I'm contemplating before
> pushing this is whether there may be hidden magical dependencies in
> the order of operations in CheckPointGuts(), which I've changed
> around. Andres, any comments?

I nagged Andres off-list and he opined that it might be better to reorder it a bit so that ProcessSyncRequests() comes after almost everything else, so that if we ever teach more things to offload their fsync work it'll be in the right order. I reordered it like that; now only CheckPointTwoPhase() comes later, based on the comment that accompanies it. In any case, we can always reconsider the ordering of this function in later commits as required. Pushed like that.
On 9/25/20 9:09 AM, Thomas Munro wrote:
> On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Here's a new version. The final thing I'm contemplating before
>> pushing this is whether there may be hidden magical dependencies in
>> the order of operations in CheckPointGuts(), which I've changed
>> around. Andres, any comments?
>
> I nagged Andres off-list and he opined that it might be better to
> reorder it a bit so that ProcessSyncRequests() comes after almost
> everything else, so that if we ever teach more things to offload their
> fsync work it'll be in the right order. I reordered it like that; now
> only CheckPointTwoPhase() comes later, based on the comment that
> accompanies it. In any case, we can always reconsider the ordering of
> this function in later commits as required. Pushed like that.
>

Seems this commit left behind a couple unnecessary prototypes in a bunch of header files. In particular, it removed these functions

- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();

but we still have

$ git grep ShutdownCLOG
src/include/access/clog.h:extern void ShutdownCLOG(void);

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 4, 2021 at 3:35 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Seems this commit left behind a couple unnecessary prototypes in a bunch
> of header files. In particular, it removed these functions
>
> - ShutdownCLOG();
> - ShutdownCommitTs();
> - ShutdownSUBTRANS();
> - ShutdownMultiXact();

Thanks. Fixed.