Re: Handing off SLRU fsyncs to the checkpointer - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Msg-id: CA+hUKGKAch2h0s75yOLWj+ZqCAxaeezWNpOW+oZLnwkjvw4NOA@mail.gmail.com
In response to: Re: Handing off SLRU fsyncs to the checkpointer (Jakub Wartak <Jakub.Wartak@tomtom.com>)
Responses:
  Re: Handing off SLRU fsyncs to the checkpointer
  Re: Handing off SLRU fsyncs to the checkpointer
List: pgsql-hackers
On Tue, Aug 25, 2020 at 9:16 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> I just wanted to help testing this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch[2], as both are related to the WAL recovery performance which I'm interested in. This is my first message to this mailing group so please let me know also if I should adjust testing style or formatting.

Hi Jakub,

Thanks very much for these results!

> - Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower / higher latency storage by ~20%

Wow.  Those fsyncs must have had fairly high latency (presumably due to queuing behind other write back activity).

> - Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates also by ~20%

Nice.

> In the last final case the top profile is as follows, still related to the sorting but, as I understand it, in a much more optimal way:
>
>   26.68%  postgres  postgres           [.] qsort_itemoff
>           ---qsort_itemoff
>              |--14.17%--qsort_itemoff
>              |          |--10.96%--compactify_tuples
>              |          |          PageRepairFragmentation
>              |          |          heap2_redo
>              |          |          StartupXLOG
>              |           --3.21%--qsort_itemoff
>              |                      --3.10%--compactify_tuples
>              |                               PageRepairFragmentation
>              |                               heap2_redo
>              |                               StartupXLOG
>               --12.51%--compactify_tuples
>                         PageRepairFragmentation
>                         heap2_redo
>                         StartupXLOG

I wonder if there is something higher level that could be done to reduce the amount of compaction work required in the first place, but in the meantime I'm very happy if we can improve the situation so much with such a microscopic improvement that might eventually benefit other sorting stuff...

>   8.38%  postgres  libc-2.17.so       [.] __memmove_ssse3_back
>          ---__memmove_ssse3_back
>             compactify_tuples
>             PageRepairFragmentation
>             heap2_redo

Hmm, I wonder if this bit could go a teensy bit faster by moving as many adjacent tuples as you can in one go, rather than moving them one at a time...

> The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):
>
>   29.62%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>           ---copy_user_enhanced_fast_string
>              |--17.98%--copyin
> [..]
>              |          __pwrite_nocancel
>              |          FileWrite
>              |          mdwrite
>              |          FlushBuffer
>              |          ReadBuffer_common
>              |          ReadBufferWithoutRelcache
>              |          XLogReadBufferExtended
>              |          XLogReadBufferForRedoExtended
>              |           --17.57%--btree_xlog_insert

To move these writes out of recovery's way, we should probably just run the bgwriter process during crash recovery.  I'm going to look into that.

The other thing is of course the checkpointer process, and our end-of-recovery checkpoint.  I was going to suggest it should be optional and not done by the recovery process itself, which is why some earlier numbers I shared didn't include the end-of-recovery checkpoint, but then I realised it complicated the numbers for this little patch and, anyway, it'd be a good idea to open that can of worms separately...

>              |          btree_redo
>              |          StartupXLOG
>              |
>               --11.64%--copyout
> [..]
>                         __pread_nocancel
>                          --11.44%--FileRead
>                                    mdread
>                                    ReadBuffer_common
>                                    ReadBufferWithoutRelcache
>                                    XLogReadBufferExtended
>                                    XLogReadBufferForRedoExtended

For these reads, the solution should be WAL prefetching, but the patch I shared for that (and will be updating soon) is just one piece of the puzzle, and as it stands it actually *increases* the number of syscalls by adding some posix_fadvise() calls, so... erm, for an all-in-kernel-cache-already workload like the one you profiled, it can only make things worse on that front.
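To make the extra syscall concrete, the hint in question boils down to something of this shape (a minimal sketch, not the patch's actual code; hint_block() and the file descriptor handling are invented for illustration):

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>

    #define BLCKSZ 8192          /* PostgreSQL's default block size */

    /*
     * Ask the kernel to start reading one referenced block ahead of time.
     * It's one posix_fadvise() call per hinted block, which is why hinting
     * adds syscalls without helping when the block is already in page cache.
     */
    static int
    hint_block(int fd, unsigned int blockno)
    {
        return posix_fadvise(fd, (off_t) blockno * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }

So on a hot-cache workload every hinted block is a pure extra round trip into the kernel, which fits the syscalls/s ceiling in your profile.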
But... when combined with Andres's work-in-progress AIO stuff, a whole bunch of reads can be submitted with a single system call ahead of time, and then the results are delivered directly into our buffer pool by kernel threads or hardware DMA, so we'll not only avoid going off CPU during recovery but we'll also reduce the system call count.

> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious, the next sets of append-only bottlenecks are like below:
>
>   14.69%  postgres  postgres           [.] hash_search_with_hash_value
>           ---hash_search_with_hash_value
>              |--9.80%--BufTableLookup
>              |         ReadBuffer_common
>              |         ReadBufferWithoutRelcache
>              |         XLogReadBufferExtended
>              |         XLogReadBufferForRedoExtended

Hypothesis: your 24GB buffer pool requires somewhere near 70MB of buffer mapping table (huh, pg_shmem_allocations doesn't show that correctly), so it doesn't fit into any level of your memory cache hierarchy, and it's super random access, so every buffer lookup is costing you a ~60-100ns memory stall.  Maybe?  If that's the reason for this showing up in your profile, I think I could probably add a little cache line prefetch phase to the WAL prefetch patch to fix it.  I've actually tried prefetching the buffer mapping cache lines before, without success, but never in recovery.  I'll make a note to look into that.
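For anyone curious what a "cache line prefetch phase" could look like, here's a minimal standalone sketch of the general technique using the GCC/Clang __builtin_prefetch builtin. None of these names come from PostgreSQL (the real lookup goes through BufTableLookup and dynahash); it's a simplified, collision-free table that just shows how issuing the prefetches for a batch of future lookups before probing lets the memory stalls overlap instead of being paid one at a time:

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for the buffer mapping table: one slot per bucket, no chaining. */
    typedef struct Slot { uint64_t tag; int buffer_id; } Slot;

    extern Slot    *map;            /* large, randomly accessed array */
    extern size_t   nslots;
    extern uint64_t hash_tag(uint64_t tag);

    void
    lookup_batch(const uint64_t *tags, int *results, int n)
    {
        /* Phase 1: touch the cache lines we're about to need. */
        for (int i = 0; i < n; i++)
            __builtin_prefetch(&map[hash_tag(tags[i]) % nslots]);

        /* Phase 2: by the time we probe, the lines are (hopefully) in cache. */
        for (int i = 0; i < n; i++)
        {
            Slot *s = &map[hash_tag(tags[i]) % nslots];
            results[i] = (s->tag == tags[i]) ? s->buffer_id : -1;
        }
    }

In recovery, the batch would presumably come from the block references in WAL records the prefetcher has already decoded ahead of the redo position.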