Re: Handing off SLRU fsyncs to the checkpointer - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Msg-id: CA+hUKGKAch2h0s75yOLWj+ZqCAxaeezWNpOW+oZLnwkjvw4NOA@mail.gmail.com
In response to: Re: Handing off SLRU fsyncs to the checkpointer (Jakub Wartak <Jakub.Wartak@tomtom.com>)
Responses: Re: Handing off SLRU fsyncs to the checkpointer (Andres Freund <andres@anarazel.de>)
           Re: Handing off SLRU fsyncs to the checkpointer (Jakub Wartak <Jakub.Wartak@tomtom.com>)
List: pgsql-hackers
On Tue, Aug 25, 2020 at 9:16 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples()
> patch[2], as both are related to WAL recovery performance, which I'm interested in. This is my first message to this
> mailing list, so please also let me know if I should adjust my testing style or formatting.

Hi Jakub,

Thanks very much for these results!

> - Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower /
> higher-latency storage by ~20%

Wow.  Those fsyncs must have had fairly high latency (presumably due
to queuing behind other write back activity).

> - Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates also by
> ~20%

Nice.

> In the final case the top profile is as follows, still related to the sorting, but as I understand it in a much more
> optimal way:
>
>     26.68%  postgres  postgres            [.] qsort_itemoff
>             ---qsort_itemoff
>                |--14.17%--qsort_itemoff
>                |          |--10.96%--compactify_tuples
>                |          |          PageRepairFragmentation
>                |          |          heap2_redo
>                |          |          StartupXLOG
>                |           --3.21%--qsort_itemoff
>                |                      --3.10%--compactify_tuples
>                |                                PageRepairFragmentation
>                |                                heap2_redo
>                |                                StartupXLOG
>                 --12.51%--compactify_tuples
>                           PageRepairFragmentation
>                           heap2_redo
>                           StartupXLOG

I wonder if there is something higher level that could be done to
reduce the amount of compaction work required in the first place, but
in the meantime I'm very happy if we can improve the situation so much
with such a microscopic improvement that might eventually benefit
other sorting stuff...

>      8.38%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
>             ---__memmove_ssse3_back
>                compactify_tuples
>                PageRepairFragmentation
>                heap2_redo

Hmm, I wonder if this bit could go a teensy bit faster by moving as
many adjacent tuples as you can in one go rather than moving them one
at a time...
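
Just to make that concrete, here's a rough sketch of the idea (not code
from any patch; item_t and its fields are simplified stand-ins for the
bookkeeping compactify_tuples() actually keeps, and it assumes the array
is already sorted by offset, descending):

#include <stdint.h>
#include <string.h>

typedef struct
{
    uint16_t    off;        /* current offset of the tuple data in the page */
    uint16_t    len;        /* length of the tuple data */
    uint16_t    newoff;     /* computed below: offset after compaction */
} item_t;

/*
 * Move tuple data to the end of a page.  'items' must already be sorted
 * by 'off' in descending order and the tuples must not overlap.  Runs of
 * tuples that are contiguous in the source stay contiguous in the
 * destination, so each run needs only one memmove() instead of one call
 * per tuple.
 */
static void
compact_with_batched_moves(char *page, item_t *items, int nitems, size_t pagesize)
{
    size_t      upper = pagesize;   /* free space counts down from page end */
    int         i = 0;

    while (i < nitems)
    {
        int         j = i;
        size_t      runlen = items[i].len;

        /* Grow the run while the next tuple sits directly below this one. */
        while (j + 1 < nitems &&
               (size_t) items[j + 1].off + items[j + 1].len == items[j].off)
        {
            j++;
            runlen += items[j].len;
        }

        /* Relocate the whole contiguous run in one call. */
        upper -= runlen;
        memmove(page + upper, page + items[j].off, runlen);

        /* Record where each tuple in the run ended up. */
        for (int k = i; k <= j; k++)
            items[k].newoff = (uint16_t) (upper + (items[k].off - items[j].off));

        i = j + 1;
    }
}

The win, if any, would come from replacing many small memmove() calls
with a few larger ones, giving libc's bulk copy path more to chew on; a
badly fragmented page with no adjacent tuples degenerates back to one
call per tuple.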

> The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in the FS
> cache (but not in shared buffers; please compare with TEST1, as there was no such bottleneck at all):
>
>     29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>             ---copy_user_enhanced_fast_string
>                |--17.98%--copyin
> [..]
>                |          __pwrite_nocancel
>                |          FileWrite
>                |          mdwrite
>                |          FlushBuffer
>                |          ReadBuffer_common
>                |          ReadBufferWithoutRelcache
>                |          XLogReadBufferExtended
>                |          XLogReadBufferForRedoExtended
>                |           --17.57%--btree_xlog_insert

To move these writes out of recovery's way, we should probably just
run the bgwriter process during crash recovery.  I'm going to look
into that.

The other thing is of course the checkpointer process, and our
end-of-recovery checkpoint.  I was going to suggest it should be
optional and not done by the recovery process itself, which is why
some earlier numbers I shared didn't include the end-of-recovery
checkpoint, but then I realised it complicated the numbers for this
little patch and, anyway, it'd be a good idea to open that can of
worms separately...

>                |                     btree_redo
>                |                     StartupXLOG
>                |
>                 --11.64%--copyout
> [..]
>                           __pread_nocancel
>                            --11.44%--FileRead
>                                      mdread
>                                      ReadBuffer_common
>                                      ReadBufferWithoutRelcache
>                                      XLogReadBufferExtended
>                                      XLogReadBufferForRedoExtended

For these reads, the solution should be WAL prefetching, but the patch
I shared for that (and will be updating soon) is just one piece of the
puzzle, and as it stands it actually *increases* the number of
syscalls by adding some posix_fadvise() calls, so ... erm, for an
all-in-kernel-cache-already workload like what you profiled there it
can only make things worse on that front.  But... when combined with
Andres's work-in-progress AIO stuff, a whole bunch of reads can be
submitted with a single system call ahead of time and then the results
are delivered directly into our buffer pool by kernel threads or
hardware DMA, so we'll not only avoid going off CPU during recovery
but we'll also reduce the system call count.
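
For reference, the extra system call in question is just a
POSIX_FADV_WILLNEED hint on the block an upcoming record will need; a
bare-bones sketch (fd and blockno stand in for however the real code
identifies the segment file and block):

#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Ask the kernel to start reading a relation block that an upcoming WAL
 * record will need, so that the later pread() hopefully finds the data
 * already in the page cache instead of going off-CPU.
 */
static int
prefetch_block(int fd, unsigned int blockno)
{
    return posix_fadvise(fd, (off_t) blockno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

On a workload that is already fully in the kernel page cache, that hint
is pure overhead, which is the "can only make things worse" part above.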

> Turning the defer-SLRU patch and/or fsync on/off doesn't seem to make any difference, so if anyone is curious, the
> next set of append-only bottlenecks is as below:
>
>     14.69%  postgres  postgres            [.] hash_search_with_hash_value
>             ---hash_search_with_hash_value
>                |--9.80%--BufTableLookup
>                |          ReadBuffer_common
>                |          ReadBufferWithoutRelcache
>                |          XLogReadBufferExtended
>                |          XLogReadBufferForRedoExtended

Hypothesis:  Your 24GB buffer pool requires somewhere near 70MB of
buffer mapping table (huh, pg_shmem_allocations doesn't show that
correctly), so it doesn't fit into any level of your memory cache
hierarchy and it's super random access, so every buffer lookup is
costing you a ~60-100ns memory stall.   Maybe?
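
Rough arithmetic behind that guess, assuming 8kB pages and ~24 bytes per
mapping entry (a BufferTag plus a buffer number), and ignoring the hash
table's own overhead:

    24GB / 8kB           = 3,145,728 buffers
    3,145,728 * 24 bytes ≈ 72MB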

If that's the reason for this showing up in your profile, I think I
could probably add a little cache line prefetch phase to the WAL
prefetch patch to fix it.  I've actually tried prefetching the buffer
mapping cache lines before, without success, but never in recovery.
I'll make a note to look into that.
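
Roughly speaking, the prefetch phase would boil down to something like
this per upcoming block reference (hypothetical sketch, not from any
patch; deriving the bucket address from the BufferTag's hash value is
hand-waved into a parameter):

/*
 * While decoding WAL a few records ahead of the one being replayed,
 * touch the cache line of the buffer mapping hash bucket that the later
 * BufTableLookup() will hit, so the lookup itself doesn't stall on a
 * cold cache line.
 */
static inline void
prefetch_buffer_mapping_line(const void *bucket)
{
    __builtin_prefetch(bucket, 0, 1);   /* read, low temporal locality */
}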


