Thread: Handing off SLRU fsyncs to the checkpointer

Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
Hi,

In commit 3eb77eba we made it possible for any subsystem that wants a
file to be flushed as part of the next checkpoint to ask the
checkpointer to do that, as previously only md.c could do.
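
As a toy illustration of that division of labour (invented names and structures below, not the real sync.c
interface): writers register files that must be durable by the next checkpoint, duplicate requests collapse into
one, and the checkpointer flushes each file once when the checkpoint runs.

/*
 * Toy model of the "hand the fsync to the checkpointer" pattern (invented
 * names; not the real sync.c interface).  Callers register files that must
 * be durable by the next checkpoint; duplicates collapse; the checkpoint
 * fsyncs each file exactly once.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_REQUESTS 1024

static char pending[MAX_REQUESTS][64];
static int  npending = 0;

/* Remember that this file needs flushing; repeated requests are absorbed. */
static void register_sync_request(const char *path)
{
    for (int i = 0; i < npending; i++)
        if (strcmp(pending[i], path) == 0)
            return;
    if (npending < MAX_REQUESTS)
        snprintf(pending[npending++], sizeof(pending[0]), "%s", path);
}

/* At checkpoint time, flush every pending file exactly once. */
static void checkpoint_sync_pending(void)
{
    for (int i = 0; i < npending; i++)
    {
        int fd = open(pending[i], O_RDONLY);

        if (fd >= 0)
        {
            fsync(fd);
            close(fd);
        }
    }
    npending = 0;
}

int main(void)
{
    /* Paths are illustrative; they likely don't exist when run standalone.
     * Many page writes to the same segment end up as one deferred fsync. */
    register_sync_request("pg_xact/0000");
    register_sync_request("pg_xact/0000");
    register_sync_request("pg_commit_ts/0001");
    checkpoint_sync_pending();
    return 0;
}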

In the past, foreground CLOG flush stalls were a problem, but then
commit 33aaa139 cranked up the number of buffers, and then commit
5364b357 cranked it right up until the flushes mostly disappeared from
some benchmark workload but not so high that the resulting linear
searches through the buffer array destroyed the gains.  I know there
is interest in moving that stuff into regular shared buffers, so it
can be found via the buffer mapping system (and improve as that
improves), written back by the background writer (and improve as that
improves), managed with a proper replacement algorithm (and improve as
that improves), etc etc.  That sounds like a great idea to me, but
it's a big project.

In the meantime, one thing we could do is hand off the fsyncs, but I'm
not sure if it's still considered a real problem in the field given
the new parameters.

Anyway, I had a patch for that, which I used while testing commit
3eb77eba.  While reading old threads about SLRU today I found that
several people had wished for a thing exactly like that, so I dusted
it off and rebased it.

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Wed, Feb 12, 2020 at 9:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> In commit 3eb77eba we made it possible for any subsystem that wants a
> file to be flushed as part of the next checkpoint to ask the
> checkpointer to do that, as previously only md.c could do.

Hello,

While working on recovery performance, I found my way back to this
idea and rebased the patch.

Problem statement:

Every time we have to write out a page of pg_commit_ts, pg_multixact
or pg_xact due to cache pressure, we immediately call fsync().  This
runs serially in the recovery process, and it's quite bad for
pg_commit_ts, because we need to dump out a page for every ~800
transactions (track_commit_timestamp is not enabled by default).  If
we ask the checkpointer to do it, it collapses the 2048 fsync calls
for each SLRU segment into one, and the kernel can write out the data
with larger I/Os, maybe even ahead of time, and update the inode only
once.
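
For a feel for what that collapsing buys, here's a small standalone C sketch (not the patch; the file name, page
size and page count are arbitrary) comparing an fsync after every page write with a single fsync after a segment's
worth of writes:

/*
 * Standalone sketch: fsync-per-page-write vs one fsync at the end of a
 * segment's worth of writes.  File name and sizes are arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define PAGES 256
#define PAGE_SIZE 8192

static double run(int fsync_every_write)
{
    char page[PAGE_SIZE];
    struct timespec t0, t1;
    int fd = open("slru-sketch.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);

    if (fd < 0) { perror("open"); exit(1); }
    memset(page, 0, sizeof(page));
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < PAGES; i++)
    {
        if (pwrite(fd, page, PAGE_SIZE, (off_t) i * PAGE_SIZE) != PAGE_SIZE)
        { perror("pwrite"); exit(1); }
        if (fsync_every_write)
            fsync(fd);          /* like flushing eagerly on every page write */
    }
    if (!fsync_every_write)
        fsync(fd);              /* like one deferred flush at checkpoint time */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    unlink("slru-sketch.tmp");
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("fsync per write: %.3f s\n", run(1));
    printf("single fsync:    %.3f s\n", run(0));
    return 0;
}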

Experiment:

Run crash recovery for 1 million pgbench transactions:

  postgres -D pgdata \
    -c synchronous_commit=off \
    -c track_commit_timestamp=on \
    -c max_wal_size=10GB \
    -c checkpoint_timeout=60min

  # in another shell
  pgbench -i -s10 postgres
  psql postgres -c checkpoint
  pgbench -t1000000 -Mprepared postgres
  killall -9 postgres

  # save the crashed pgdata dir for repeated experiments
  mv pgdata pgdata-save

  # now run experiments like this and see how long recovery takes
  rm -fr pgdata
  cp -r pgdata-save pgdata
  postgres -D pgdata

What I see on a system that has around 2.5ms latency for fsync:

  master: 16.83 seconds
  patched: 4.00 seconds

It's harder to see it without commit timestamps enabled, since we only
need to flush a pg_xact page every 32k transactions (and multixacts
are more complicated to test), but you can still see the effect.  With
8x more transactions to make it clearer what's going on, I could
measure a speedup of around 6% from this patch, which I suppose scales
up fairly obviously with storage latency (every million transactions =
at least 30 fsync stalls, so you can multiply that by your fsync
latency and work out how much time your recovery process will be
asleep at the wheel instead of applying your records).

From a syscall overhead point of view, it's a bit unfortunate that we
open and close SLRU segments every time we write, but it's probably
not really enough to complain about... except for the (small) risk of
an inode dropping out of kernel caches in the time between closing it
and the checkpointer opening it.  Hmm.

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Tue, Aug 4, 2020 at 6:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... speedup of around 6% ...

I did some better testing.  OS: Linux, storage: consumer SSD.  I
repeatedly ran crash recovery on 3.3GB worth of WAL generated with 8M
pgbench transactions.  I tested 3 different builds 7 times each and
used "ministat" to compare the recovery times.  It told me that:

* Master is around 11% faster than last week before commit c5315f4f
"Cache smgrnblocks() results in recovery."
* This patch gives a similar speedup, bringing the total to around 25%
faster than last week (the time is ~20% less, the WAL processing speed
is ~1.25x).

My test fit in RAM and was all cached.  With the patch, the recovery
process used 100% of a single core the whole time, stayed on that
core, and the variance was low; in the other builds it hovered
around 90% and hopped around as it kept getting rescheduled, and the
variance was higher.

Of course, SLRU fsyncs aren't the only I/O stalls in a real system;
among others, there are also reads from faulting in referenced pages
that don't have full page images in the WAL.  I'm working on that
separately, but that's a tad more complicated than this stuff.

Added to commit fest.

=== ministat output showing recovery times in seconds ===

x patched.dat
+ master.dat
* lastweek.dat
+------------------------------------------------------------------------------+
|                                                                          *   |
|    x                             +                                       *   |
|x x xx                  +         +  ++ +    +              *           ****  |
|  |AM|                        |_____AM____|                       |_____A_M__||
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7        38.655        39.406        39.218     39.134857    0.25188849
+   7        42.128        45.068        43.958     43.815286    0.91387758
Difference at 95.0% confidence
    4.68043 +/- 0.780722
    11.9597% +/- 1.99495%
    (Student's t, pooled s = 0.670306)
*   7        47.187        49.404        49.203     48.904286    0.76793483
Difference at 95.0% confidence
    9.76943 +/- 0.665613
    24.9635% +/- 1.70082%
    (Student's t, pooled s = 0.571477)



Re: Handing off SLRU fsyncs to the checkpointer

From
Robert Haas
Date:
On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> * Master is around 11% faster than last week before commit c5315f4f
> "Cache smgrnblocks() results in recovery."
> * This patch gives a similar speedup, bringing the total to around 25%
> faster than last week (the time is ~20% less, the WAL processing speed
> is ~1.25x).

Dang, that's pretty nice, especially for the relatively small amount
of code that it seems to require.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Sat, Aug 8, 2020 at 2:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > * Master is around 11% faster than last week before commit c5315f4f
> > "Cache smgrnblocks() results in recovery."
> > * This patch gives a similar speedup, bringing the total to around 25%
> > faster than last week (the time is ~20% less, the WAL processing speed
> > is ~1.25x).
>
> Dang, that's pretty nice, especially for the relatively small amount
> of code that it seems to require.

Yeah, the combined effect of these two patches is better than I
expected.  To be clear though, I was only measuring the time between
the "redo starts at ..." and "redo done at ..." messages, since I've
been staring at the main recovery code, but there are also some more
fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
that are unaffected.  I think it's probably possible to do something
about those too, but that's another topic.

I spotted a small problem: if the transaction ID wraps all the way
around between checkpoints, then you might have cancelled requests for
a removed SLRU segment from the previous epoch, so we'd better
uncancel them if we see that.  That's a one line fix, done in the
attached.  I also adjusted the commit message to be a little clearer
(this work deferment/collapsing scheme works in crash recovery too,
not just when there is a checkpointer process to hand the work to).

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [patch]

Bitrot, rebased, no changes.

> Yeah, the combined effect of these two patches is better than I
> expected.  To be clear though, I was only measuring the time between
> the "redo starts at ..." and "redo done at ..." messages, since I've
> been staring at the main recovery code, but there are also some more
> fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
> that are unaffected.  I think it's probably possible to do something
> about those too, but that's another topic.

... and of course the end-of-recovery checkpoint; in my tests this
wasn't materially changed since there isn't actually very much CLOG,
it's just that we avoided syncing it a block at a time and getting
rescheduled.  FWIW I put a very simple test here:
https://github.com/macdice/redo-bench, YMMV.

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Jakub Wartak
Date:
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples()
patch [2], as both are related to the WAL recovery performance in which I'm interested. This is my first message to
this mailing group, so please let me know if I should adjust testing style or formatting.

With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.
Review of the patches itself: I'm not qualified to review the PostgreSQL internals.

I've used the redo-bench scripts [1] by Thomas to measure the performance effect (this approach simplifies testing and
excludes network jitter effects): the 1st column is the redo start -> end timing, the 2nd is the redo end -> end of
checkpointing timing before opening the DB for reads. I've conducted 2-3 separate tests that show the benefits of those
patches depending on the workload:
- Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower /
higher latency storage by ~20%
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates also by
~20%

TEST1: workload profile test as per standard TPC-B pgbench -c8 -j8, with the majority of records in the WAL stream
being Heap/HOT_UPDATE:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
69.686, 0.101
70.601, 0.102
70.042, 0.101

As Thomas stated, in the standard pgbench workload profile there is compactify_tuples()->pg_qsort() overhead visible
on the recovery side. So this is where the 2nd patch helps:

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance
improvement
58.85, 0.296
58.605, 0.269
58.603, 0.277

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance
improvement, fsync=off
58.618, 0.103
57.693, 0.101
58.779, 0.111

In the last case the top profile is as follows, still related to the sorting but, as I understand it, in a much more
optimal way:

    26.68%  postgres  postgres            [.] qsort_itemoff
            ---qsort_itemoff
               |--14.17%--qsort_itemoff
               |          |--10.96%--compactify_tuples
               |          |          PageRepairFragmentation
               |          |          heap2_redo
               |          |          StartupXLOG
               |           --3.21%--qsort_itemoff
               |                      --3.10%--compactify_tuples
               |                                PageRepairFragmentation
               |                                heap2_redo
               |                                StartupXLOG
                --12.51%--compactify_tuples
                          PageRepairFragmentation
                          heap2_redo
                          StartupXLOG

     8.38%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               compactify_tuples
               PageRepairFragmentation
               heap2_redo
               StartupXLOG

     6.51%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--3.62%--smgropen
               |          |--2.17%--XLogReadBufferExtended
               |          |           --1.76%--XLogReadBufferForRedoExtended
               |          |                      --0.93%--heap_xlog_update
               |           --1.45%--ReadBufferWithoutRelcache
               |                     XLogReadBufferExtended
               |                      --1.34%--XLogReadBufferForRedoExtended
               |                                 --0.72%--heap_xlog_update
                --2.69%--BufTableLookup
                          ReadBuffer_common
                          ReadBufferWithoutRelcache
                          XLogReadBufferExtended
                           --2.48%--XLogReadBufferForRedoExtended
                                     |--1.34%--heap2_redo
                                     |          StartupXLOG
                                      --0.83%--heap_xlog_update


So to sum up, a HOT update-like workload profile tends to be CPU bound on the single-process recovery side. Even slow
storage (like xvda) was not the bottleneck here, as I used already-hot data from the VFS cache.

TEST2: The second suite of tests used an append-only workload profile (the same amount of transactions, using pgbench
-c8 -j8 -f insert_only.sql -n -t 1000000), however with a simpler structure:
CREATE TABLE t (c1 uuid NOT NULL, c2 uuid NOT NULL, c3 integer NOT NULL, c4 integer NOT NULL);
CREATE INDEX i1 ON t USING btree (c1);
CREATE INDEX i2 ON t USING btree (c2);
CREATE INDEX i3 ON t USING btree (c3);
CREATE EXTENSION "uuid-ossp";

and a customized script for pgbench, just this:

BEGIN;
INSERT INTO t (c1, c2, c3, c4) VALUES (uuid_generate_v4(), uuid_generate_v4(), round(5000*random()), 500);
END;

The majority of WAL records being Btree/INSERT_LEAF (60%), the results are as follows:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
120.732, 5.275
120.608, 5.902
120.685, 5.872

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
99.061, 7.071
99.472, 6.386
98.036, 5.994

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
102.838, 0.136
102.711, 0.099
103.093, 0.0970001

NVMe, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
96.46, 0.405
96.121, 0.405
95.951, 0.402

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
94.124, 0.387
96.61, 0.416
94.624, 0.451

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
95.401, 0.0969999
95.028, 0.099
94.632, 0.0970001

So apparently the v2-0001-Defer-flushing-of-SLRU-files helps in my case on higher latency storage.

The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in
the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):

    29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--17.98%--copyin
[..]
               |          __pwrite_nocancel
               |          FileWrite
               |          mdwrite
               |          FlushBuffer
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |           --17.57%--btree_xlog_insert
               |                     btree_redo
               |                     StartupXLOG
               |
                --11.64%--copyout
[..]
                          __pread_nocancel
                           --11.44%--FileRead
                                     mdread
                                     ReadBuffer_common
                                     ReadBufferWithoutRelcache
                                     XLogReadBufferExtended
                                     XLogReadBufferForRedoExtended
                                      --11.34%--btree_xlog_insert
                                                btree_redo
                                                StartupXLOG

     5.62%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--1.74%--smgropen
[..]

The number of syscalls/s topped out at ~100k/s @ 8kB for reads and writes respectively (it's logical I/O, as
everything fits in the VFS cache; nearly no physical I/O). It was also visible during the test that the
startup/recovering process spent 60% of its time in %sys in such conditions. As there was no sorting visible in the
profiler, I did not test the workload with the compactify_tuples()->pg_qsort() performance improvement here, although
from basic runs it appears it did not introduce any degradation.

TEST2b: Quite frankly, at 1st glance I did not understand why btree_xlog_insert()->ReadBuffer_common() would require a
FlushBuffer() that would write, until I bumped shared buffers 128MB -> 24GB: it might have been flushing dirty buffers,
which caused those pwrite()s - it's not evident as a direct call from ReadBuffer*() in the code (inlining? any idea how
to monitor the cycling of dirty buffers on a standby during recovery when there is no bgwriter yet?). Still, I wanted
to eliminate storage and the VFS cache as bottlenecks. This led me to results like below (with fsync=on, also the defer
patch, shared_buffers=24GB to eliminate VFS cache lookups):
27.341, 0.858
27.26, 0.869
26.959, 0.86

Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious, the
next set of append-only bottlenecks looks like below:

    14.69%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--9.80%--BufTableLookup
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |          |--7.76%--btree_xlog_insert
               |          |          btree_redo
               |          |          StartupXLOG
               |           --1.63%--heap_xlog_insert
                --4.90%--smgropen
                          |--2.86%--ReadBufferWithoutRelcache
                          |          XLogReadBufferExtended
                          |          |--2.04%--XLogReadBufferForRedoExtended
                          |          |          |--1.22%--btree_xlog_insert
                          |          |          |          btree_redo
                          |          |          |          StartupXLOG
                          |          |           --0.82%--heap_xlog_insert
                          |           --0.82%--XLogRecordPageWithFreeSpace
                          |                     heap_xlog_insert
                           --2.04%--XLogReadBufferExtended
                                     XLogReadBufferForRedoExtended
                                      --1.63%--btree_xlog_insert
                                                btree_redo
                                                StartupXLOG

     7.76%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               PageAddItemExtended
               btree_xlog_insert
               btree_redo
               StartupXLOG

Still the result seems nice, as it is 10^6 (trx) / 27s = ~34k TPS on the recovery side (~3000MB of WAL / ~27s =
~111MB/s without FPW), given my assumptions and provided there were no I/O overheads. For the full picture, compare
this with the rate of generating the "primary" data with INSERTs that I was able to get without any special tuning:
- ~29k TPS of INSERTs-only/COMMITs with pgbench -c8
- ~49k TPS of INSERTs-only/COMMITs with pgbench -c16
so the (single process) WAL recovery code in 14master seems to have something like 70% of the performance of a fairly
low-end 16 vCPU primary DB in the above append-only conditions.

Specs: Linux 4.14 kernel, ext4 filesystems (data=ordered, noatime), 1s8c16t Xeon CPU E5-2686 v4 @ 2.30GHz, 128GB RAM,
gcc 7.2.1, CFLAGS set by ./configure to "-O2", tests on default/tiny shared_buffers until the last test.

xvda AKA slow storage: root file system, single thread tests:
    dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==>  157 MB/s
    dd if=/dev/zero of=test bs=8k count=10000 oflag=direct =>  17.2 MB/s

NVMe:  striped VG consisting of 2x NVMe devices with much lower latency
    dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==>  1.9 GB/s or maybe even more
    dd if=/dev/zero of=test bs=8k count=10000 oflag=direct =>  141 MB/s

-Jakub Wartak.

[1] - https://github.com/macdice/redo-bench/
[2] - https://commitfest.postgresql.org/29/2687/



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Tue, Aug 25, 2020 at 9:16 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples()
> patch [2], as both are related to the WAL recovery performance in which I'm interested. This is my first message to
> this mailing group, so please let me know if I should adjust testing style or formatting.

Hi Jakub,

Thanks very much for these results!

> - Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower /
> higher latency storage by ~20%

Wow.  Those fsyncs must have had fairly high latency (presumably due
to queuing behind other write back activity).

> - Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates also by
> ~20%

Nice.

> In the last case the top profile is as follows, still related to the sorting but, as I understand it, in a much more
> optimal way:
>
>     26.68%  postgres  postgres            [.] qsort_itemoff
>             ---qsort_itemoff
>                |--14.17%--qsort_itemoff
>                |          |--10.96%--compactify_tuples
>                |          |          PageRepairFragmentation
>                |          |          heap2_redo
>                |          |          StartupXLOG
>                |           --3.21%--qsort_itemoff
>                |                      --3.10%--compactify_tuples
>                |                                PageRepairFragmentation
>                |                                heap2_redo
>                |                                StartupXLOG
>                 --12.51%--compactify_tuples
>                           PageRepairFragmentation
>                           heap2_redo
>                           StartupXLOG

I wonder if there is something higher level that could be done to
reduce the amount of compaction work required in the first place, but
in the meantime I'm very happy if we can improve the situation so much
with such a microscopic improvement that might eventually benefit
other sorting stuff...

>      8.38%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
>             ---__memmove_ssse3_back
>                compactify_tuples
>                PageRepairFragmentation
>                heap2_redo

Hmm, I wonder if this bit could go a teensy bit faster by moving as many
adjacent tuples as you can in one go rather than moving them one at a
time...
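
Something like this standalone sketch of the idea (the Move struct and the sample data are made up; it's not
compactify_tuples() itself): runs of items that are adjacent in both source and destination get merged into a
single memmove():

/*
 * Standalone sketch: coalesce moves of adjacent items into one memmove()
 * (made-up Move struct and data, not compactify_tuples() itself).
 */
#include <stdio.h>
#include <string.h>

typedef struct { size_t src; size_t dst; size_t len; } Move;

/* Perform moves, merging runs that are contiguous in both src and dst. */
static void move_coalesced(char *buf, const Move *moves, int n)
{
    int i = 0;

    while (i < n)
    {
        size_t src = moves[i].src, dst = moves[i].dst, len = moves[i].len;
        int j = i + 1;

        while (j < n &&
               moves[j].src == src + len &&
               moves[j].dst == dst + len)
            len += moves[j++].len;
        memmove(buf + dst, buf + src, len);   /* one call covers the run */
        i = j;
    }
}

int main(void)
{
    char buf[64] = "....ABCDEF......................";
    /* Three "tuples" that happen to be adjacent in both src and dst. */
    Move moves[] = { {4, 0, 2}, {6, 2, 2}, {8, 4, 2} };

    move_coalesced(buf, moves, 3);
    printf("%.6s\n", buf);   /* prints ABCDEF */
    return 0;
}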

> The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in
> the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):
>
>     29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>             ---copy_user_enhanced_fast_string
>                |--17.98%--copyin
> [..]
>                |          __pwrite_nocancel
>                |          FileWrite
>                |          mdwrite
>                |          FlushBuffer
>                |          ReadBuffer_common
>                |          ReadBufferWithoutRelcache
>                |          XLogReadBufferExtended
>                |          XLogReadBufferForRedoExtended
>                |           --17.57%--btree_xlog_insert

To move these writes out of recovery's way, we should probably just
run the bgwriter process during crash recovery.  I'm going to look
into that.

The other thing is of course the checkpointer process, and our
end-of-recovery checkpoint.  I was going to suggest it should be
optional and not done by the recovery process itself, which is why
some earlier numbers I shared didn't include the end-of-recovery
checkpoint, but then I realised it complicated the numbers for this
little patch and, anyway, it'd be a good idea to open that can of
worms separately...

>                |                     btree_redo
>                |                     StartupXLOG
>                |
>                 --11.64%--copyout
> [..]
>                           __pread_nocancel
>                            --11.44%--FileRead
>                                      mdread
>                                      ReadBuffer_common
>                                      ReadBufferWithoutRelcache
>                                      XLogReadBufferExtended
>                                      XLogReadBufferForRedoExtended

For these reads, the solution should be WAL prefetching, but the patch
I shared for that (and will be updating soon) is just one piece of the
puzzle, and as it stands it actually *increases* the number of
syscalls by adding some posix_fadvise() calls, so ... erm, for an
all-in-kernel-cache-already workload like what you profiled there it
can only make things worse on that front.  But... when combined with
Andres's work-in-progress AIO stuff, a whole bunch of reads can be
submitted with a single system call ahead of time and then the results
are delivered directly into our buffer pool by kernel threads or
hardware DMA, so we'll not only avoid going off CPU during recovery
but we'll also reduce the system call count.
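
To make the batched-submission point concrete, here's a standalone liburing sketch (just the underlying io_uring
mechanism, not Andres's AIO patches; it assumes liburing is installed and takes a data file on the command line):

/*
 * Standalone liburing sketch of "submit a batch of reads with one syscall".
 * Build with: cc sketch.c -luring
 */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

#define NREADS 8
#define BLCKSZ 8192

int main(int argc, char **argv)
{
    struct io_uring ring;
    static char bufs[NREADS][BLCKSZ];
    int fd = open(argc > 1 ? argv[1] : "some-data-file", O_RDONLY);

    if (fd < 0 || io_uring_queue_init(NREADS, &ring, 0) < 0)
    { perror("setup"); return 1; }

    /* Queue NREADS block reads ... */
    for (int i = 0; i < NREADS; i++)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_read(sqe, fd, bufs[i], BLCKSZ, (off_t) i * BLCKSZ);
    }
    /* ... and hand them all to the kernel with a single system call. */
    io_uring_submit(&ring);

    for (int i = 0; i < NREADS; i++)
    {
        struct io_uring_cqe *cqe;

        if (io_uring_wait_cqe(&ring, &cqe) == 0)
        {
            printf("read %d returned %d\n", i, cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }
    }
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}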

> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious, the
> next set of append-only bottlenecks looks like below:
>
>     14.69%  postgres  postgres            [.] hash_search_with_hash_value
>             ---hash_search_with_hash_value
>                |--9.80%--BufTableLookup
>                |          ReadBuffer_common
>                |          ReadBufferWithoutRelcache
>                |          XLogReadBufferExtended
>                |          XLogReadBufferForRedoExtended

Hypothesis:  Your 24GB buffer pool requires somewhere near 70MB of
buffer mapping table (huh, pg_shmem_allocations doesn't show that
correctly), so it doesn't fit into any level of your memory cache
hierarchy and it's super random access, so every buffer lookup is
costing you a ~60-100ns memory stall.   Maybe?

If that's the reason for this showing up in your profile, I think I
could probably add a little cache line prefetch phase to the WAL
prefetch patch to fix it.  I've actually tried prefetching the buffer
mapping cache lines before, without success, but never in recovery.
I'll make a note to look into that.
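
The sort of thing I have in mind, as a standalone toy (a made-up flat table and hash, not the real buffer mapping
table; assumes GCC/Clang for __builtin_prefetch): compute the slots for a batch of lookups and prefetch a few
elements ahead of the probe:

/*
 * Toy illustration of prefetching hash-table cache lines a few lookups
 * ahead (flat made-up table, not PostgreSQL's buffer mapping table).
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 22)        /* 4M entries, ~64MB: bigger than LLC */
#define PREFETCH_DISTANCE 8

typedef struct { uint64_t key; uint64_t value; } Entry;

static inline uint32_t slot_for(uint64_t key)
{
    return (uint32_t) ((key * 0x9E3779B97F4A7C15ULL) >> 32) & (TABLE_SIZE - 1);
}

/* Look up a batch of keys, prefetching ahead to hide memory stalls. */
static uint64_t lookup_batch(const Entry *table, const uint64_t *keys, int n)
{
    uint64_t sum = 0;

    for (int i = 0; i < n; i++)
    {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&table[slot_for(keys[i + PREFETCH_DISTANCE])]);
        sum += table[slot_for(keys[i])].value;
    }
    return sum;
}

int main(void)
{
    Entry *table = calloc(TABLE_SIZE, sizeof(Entry));
    uint64_t keys[4096];

    if (table == NULL)
        return 1;
    for (int i = 0; i < 4096; i++)
        keys[i] = ((uint64_t) rand() << 32) | (uint64_t) rand();
    printf("%llu\n", (unsigned long long) lookup_batch(table, keys, 4096));
    free(table);
    return 0;
}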



Re: Handing off SLRU fsyncs to the checkpointer

From
Andres Freund
Date:
Hi,

On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> >                 --12.51%--compactify_tuples
> >                           PageRepairFragmentation
> >                           heap2_redo
> >                           StartupXLOG
> 
> I wonder if there is something higher level that could be done to
> reduce the amount of compaction work required in the first place, but
> in the meantime I'm very happy if we can improve the situation so much
> with such a microscopic improvement that might eventually benefit
> other sorting stuff...

Another approach could be to not perform any sorting during recovery,
instead including enough information in the WAL record to avoid doing a
full blown PageRepairFragmentation during recovery.

Greetings,

Andres Freund



Re: Handing off SLRU fsyncs to the checkpointer

From
Alvaro Herrera
Date:
On 2020-Aug-25, Andres Freund wrote:

> Hi,
> 
> On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > >                 --12.51%--compactify_tuples
> > >                           PageRepairFragmentation
> > >                           heap2_redo
> > >                           StartupXLOG
> > 
> > I wonder if there is something higher level that could be done to
> > reduce the amount of compaction work required in the first place, but
> > in the meantime I'm very happy if we can improve the situation so much
> > with such a microscopic improvement that might eventually benefit
> > other sorting stuff...
> 
> Another approach could be to not perform any sorting during recovery,
> instead including enough information in the WAL record to avoid doing a
> full blown PageRepairFragmentation during recovery.

Hmm, including the sorted ItemId array in the WAL record might make
sense to alleviate the qsort work ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Handing off SLRU fsyncs to the checkpointer

From
Alvaro Herrera
Date:
On 2020-Aug-25, Jakub Wartak wrote:

> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make
> any difference, so if anyone is curious, the next set of append-only
> bottlenecks looks like below:
> 
>     14.69%  postgres  postgres            [.] hash_search_with_hash_value
>             ---hash_search_with_hash_value
>                |--9.80%--BufTableLookup
>                |          ReadBuffer_common
>                |          ReadBufferWithoutRelcache
>                |          XLogReadBufferExtended
>                |          XLogReadBufferForRedoExtended
>                |          |--7.76%--btree_xlog_insert
>                |          |          btree_redo
>                |          |          StartupXLOG
>                |           --1.63%--heap_xlog_insert
>                 --4.90%--smgropen
>                           |--2.86%--ReadBufferWithoutRelcache

Looking at an earlier report of this problem I was thinking whether it'd
make sense to replace SMgrRelationHash with a simplehash table; I have a
half-written patch for that, but I haven't completed that work.
However, in the older profile things were looking different, as
hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
sure now that that'll pay off as clearly as I had hoped.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Thu, Aug 27, 2020 at 6:15 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >                 --4.90%--smgropen
> >                           |--2.86%--ReadBufferWithoutRelcache
>
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Right, my hypothesis requires an uncacheably large buffer mapping
table, and I think smgropen() needs a different explanation because
it's not expected to be as large or as random, at least not with a
pgbench workload.  I think the reasons for a profile with a smgropen()
showing up so high, and in particular higher than BufTableLookup(),
must be:

1.  We call smgropen() twice for every call to BufTableLookup().  Once
in XLogReadBufferExtended(), and then again in
ReadBufferWithoutRelcache().
2.  We also call it for every block forced out of the buffer pool, and
in recovery that has to be done by the recovery loop.
3.  We also call it for every block in the buffer pool during the
end-of-recovery checkpoint.

Not sure but the last two might perform worse due to proximity to
interleaving pwrite() system calls (just a thought, not investigated).
In any case, I'm going to propose we move those things out of the
recovery loop, in a new thread.



Re: Handing off SLRU fsyncs to the checkpointer

From
Jakub Wartak
Date:
Hi Thomas / hackers,

>> The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in
>> the FS cache (but not in shared buffers; please compare with TEST1, where there was no such bottleneck at all):
>>
>>     29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>>             ---copy_user_enhanced_fast_string
>>                |--17.98%--copyin
>> [..]
>>                |          __pwrite_nocancel
>>                |          FileWrite
>>                |          mdwrite
>>                |          FlushBuffer
>>                |          ReadBuffer_common
>>                |          ReadBufferWithoutRelcache
>>                |          XLogReadBufferExtended
>>                |          XLogReadBufferForRedoExtended
>>                |           --17.57%--btree_xlog_insert
>
> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery.  I'm going to look
> into that.

Sounds awesome. Also, as this thread is starting to derail from the SLRU fsync topic - maybe we should change the
subject? However, to add some data on the separate bgwriter: on 14master (already with lseek() caching, SLRU fsyncs out
of the way, better sorting), I've measured the same configuration as last time, with still the same append-only WAL
workload on NVMe, and compared various shared_buffers settings (with the buffer descriptor sizing from
pg_shmem_allocations which, as you stated, is wrongly reported(?) - included only for reference just in case):

shared_buffers=128MB buffers_desc=1024kB 96.778, 0.438 [a]
shared_buffers=256MB buffers_desc=2048kB 62.755, 0.577 [a]
shared_buffers=512MB buffers_desc=4096kB 33.167, 0.62 [a]
shared_buffers=1GB buffers_desc=8192kB 27.303, 0.929
shared_buffers=4GB buffers_desc=32MB 27.185, 1.166
shared_buffers=8GB buffers_desc=64MB 27.649, 1.088
shared_buffers=16GB buffers_desc=128MB 27.584, 1.201 [b]
shared_buffers=32GB buffers_desc=256MB 32.314, 1.171 [b]
shared_buffers=48GB buffers_desc=384 MB 31.95, 1.217
shared_buffers=64GB buffers_desc=512 MB 31.276, 1.349
shared_buffers=72GB buffers_desc=576 MB 31.925, 1.284
shared_buffers=80GB buffers_desc=640 MB 31.809, 1.413

The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers from StartupXLOG might be a real problem, but please read on.
b) there is very low impact from the L2/L3 hypothesis you mention (?) - it's not that big and it doesn't degrade
linearly as one would expect (??). L1d/L1i/L2/L3 cache sizes on this machine are 32K/32K/256K/46080K. Maybe the DB size
is too small.

I've double-checked that in condition [a] (shared_buffers=128MB) there were a lot of FlushBuffer() invocations per
second (perf stat -e probe_postgres:FlushBuffer -I 1000), e.g.:
#           time             counts unit events
     1.000485217             79,494      probe_postgres:FlushBuffer
     2.000861366             75,305      probe_postgres:FlushBuffer
     3.001263163             79,662      probe_postgres:FlushBuffer
     4.001650795             80,317      probe_postgres:FlushBuffer
     5.002033273             79,951      probe_postgres:FlushBuffer
     6.002418646             79,993      probe_postgres:FlushBuffer
while at 1GB shared_buffers it sits nearly at zero all the time.  So there is something like a 3x speed-up possible
when StartupXLOG wouldn't have to care too much about dirty buffers in the critical path (bgwriter would instead, as
you say?), at least in append-only scenarios.  But ... I've checked some real systems (even older versions of
PostgreSQL, not doing that much replication), and indeed FlushBuffer() is happening there at e.g. 8-12k/s and shared
buffers are pretty huge (>100GB, 0.5TB RAM), but those are *system-wide* numbers; the work is really performed by
bgwriter, not by startup/recovering as in this redo-bench case, because the DB is open for reads while replicating --
so it appears that this isn't an optimization for the hot standby case, but for the DB-closed startup
recovery/restart/disaster/PITR scenario.

As for the 24GB shared_buffers scenario, where dirty buffers are not a problem at all, the top profile (output
trimmed) is again as expected:

    13.41%  postgres  postgres           [.] hash_search_with_hash_value
               |--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache
                --5.11%--smgropen
                          |--2.77%--XLogReadBufferExtended
                           --2.34%--ReadBufferWithoutRelcache
     7.88%  postgres  postgres           [.] MarkBufferDirty

I've tried to get the cache miss ratio via PMCs; apparently on EC2 they are reported as not-supported or zeros (even
on bigger instances). However, interestingly, the workload has an IPC of 1.40 (instruction bound), which is strange to
me as I would expect BufTableLookup() to actually be heavily memory bound (?). Maybe I'll try on some different
hardware one day.

>>                           __pread_nocancel
>>                            --11.44%--FileRead
>>                                      mdread
>>                                      ReadBuffer_common
>>                                      ReadBufferWithoutRelcache
>>                                      XLogReadBufferExtended
>>                                      XLogReadBufferForRedoExtended
>
> For these reads, the solution should be WAL prefetching (..) But... when combined with Andres's work-in-progress AIO
> stuff (..)

Yes, I've heard a thing or two about those :) I hope I'll be able to deliver some measurements of those two together
(AIO + WAL prefetch) sooner or later.

-Jakub Wartak.


Re: Handing off SLRU fsyncs to the checkpointer

From
Jakub Wartak
Date:
Hi Alvaro, Thomas, hackers

>>     14.69%  postgres  postgres            [.] hash_search_with_hash_value
>>             ---hash_search_with_hash_value
>>                |--9.80%--BufTableLookup
[..]
>>                 --4.90%--smgropen
>>                           |--2.86%--ReadBufferWithoutRelcache
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Yes, quite frankly my expectation was to see the hash_search_with_hash_value() <- smgropen() path come out first, but
in the simplified redo-bench script that's not the case. The original scenario was much more complex, with plenty of
differences (in no particular order: TB-sized DB vs ~500GB RAM -> thousands of forks, multiple tables, huge btrees,
multiple INSERTs with plenty of data in VALUES() thrown in as one commit, real primary->hot-standby replication [not a
closed DB in recovery], sorted rather than random UUIDs) - I'm going to try to nail down these differences and maybe
I'll manage to produce a more realistic "pgbench reproducer" (this may take some time though).

-Jakub Wartak.


Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> I've tried to get the cache miss ratio via PMCs; apparently on EC2 they are reported as not-supported or zeros (even
> on bigger instances).
 

I heard some of the counters are only allowed on their dedicated instance types.

> However, interestingly, the workload has an IPC of 1.40 (instruction bound), which is strange to me as I would
> expect BufTableLookup() to actually be heavily memory bound (?). Maybe I'll try on some different hardware one day.
 

Hmm, OK now you've made me go and read a bunch of Brendan Gregg blogs
and try some experiments of my own to get a feel for this number and
what it might be telling us about the cache miss counters you can't
see.  Since I know how to generate arbitrary cache miss workloads for
quick experiments using hash joins of different sizes, I tried that
and noticed that when LLC misses were at 76% (bad), IPC was at 1.69
which is still higher than what you're seeing.  When the hash table
was much smaller and LLC misses were down to 15% (much better), IPC
was at 2.83.  I know Gregg said[1] "An IPC < 1.0 likely means memory
bound, and an IPC > 1.0 likely means instruction bound", but that's
not what I'm seeing here, and in his comments section that is
disputed.  So I'm not sure your IPC of 1.40 is evidence against the
hypothesis on its own.

[1] http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> >>     29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
> >>             ---copy_user_enhanced_fast_string
> >>                |--17.98%--copyin
> >> [..]
> >>                |          __pwrite_nocancel
> >>                |          FileWrite
> >>                |          mdwrite
> >>                |          FlushBuffer
> >>                |          ReadBuffer_common
> >>                |          ReadBufferWithoutRelcache
> >>                |          XLogReadBufferExtended
> >>                |          XLogReadBufferForRedoExtended
> >>                |           --17.57%--btree_xlog_insert
> >
> > To move these writes out of recovery's way, we should probably just
> > run the bgwriter process during crash recovery.  I'm going to look
> > into that.
>
> Sounds awesome.

I wrote a quick and dirty experimental patch to try that.  I can't see
any benefit from it on pgbench with default shared buffers, but maybe
it would do better with your append test due to locality, especially
if you can figure out how to tune bgwriter to pace itself optimally.
https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery



Re: Handing off SLRU fsyncs to the checkpointer

From
Jakub Wartak
Date:
Hi Thomas, hackers,

>> > To move these writes out of recovery's way, we should probably just
>> > run the bgwriter process during crash recovery.  I'm going to look
>> > into that.
>>
>> Sounds awesome.
>
>I wrote a quick and dirty experimental patch to try that.  I can't see
>any benefit from it on pgbench with default shared buffers, but maybe
>it would do better with your append test due to locality, especially
>if you can figure out how to tune bgwriter to pace itself optimally.
>https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery

OK, so I've quickly tested those two PoC patches together, under the conditions below:
- similar append-only workload by pgbench (to eliminate other already known WAL bottlenecks: e.g. sorting),
- 4.3GB of WAL to be applied (mostly Btree/INSERT_LEAF)
- on the same system as last time (ext4 on NVMe, 1s8c16t, 4.14 kernel)
- 14master already with the SLRU-fsync-to-checkpointer and pg_qsort patches applied

TEST bgwriterPOC1:
- in severe dirty memory conditions (artificially simulated via small s_b here) --> so for workloads with very high
FlushBuffer activity in StartupXLOG
- with fsync=off/fpw=off by default and on NVMe (e.g. scenario: I want to perform some PITR as fast as I can, to see
how the production data looked in the past before some user deleted some data)

baseline s_b@128MB: 140.404, 0.123 (2nd number small as there is only a small region to checkpoint)

    22.49%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--14.72%--copyin
               |          __pwrite_nocancel
               |          FileWrite
               |          mdwrite
               |          FlushBuffer
               |          ReadBuffer_common
               |           --14.52%--btree_xlog_insert
                --7.77%--copyout
                          __pread_nocancel
                           --7.57%--FileRead
                                     mdread
                                     ReadBuffer_common
     6.13%  postgres  [kernel.kallsyms]   [k] do_syscall_64
               |--1.64%--__pwrite_nocancel
                --1.23%--__pread_nocancel
     3.68%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--1.02%--smgropen

After applying:
patch -p1 < ../0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch
patch -p1 < ../0002-Optionally-don-t-wait-for-end-of-recovery-checkpoint.patch

0001+0002 s_b@128MB: similar result to above
0001+0002 s_b@128MB: 108.871, 0.114 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 1000
0001+0002 s_b@128MB: 85.392, 0.103 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 50000 #~390MB max?

    18.40%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--17.79%--copyout
               |          __pread_nocancel
               |          |--16.56%--FileRead
               |          |          mdread
               |          |          ReadBuffer_common
                --0.61%--copyin // WOW
                          __pwrite_nocancel
                          FileWrite
                          mdwrite
                          FlushBuffer
                          ReadBuffer_common
     9.20%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--4.70%--smgropen

of course there is another WOW moment during recovery ("61.9%")

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres 120935  0.9  0.0 866052  3824 ?        Ss   09:47   0:00 postgres: checkpointer
postgres 120936 61.9  0.0 865796  3824 ?        Rs   09:47   0:22 postgres: background writer
postgres 120937 97.4  0.0 865940  5228 ?        Rs   09:47   0:36 postgres: startup recovering 000000010000000000000089

speedup of 1.647x when dirty memory is in the way. When it's not:

baseline  s_b@24000MB: 39.199, 1.448 (2x patches off)
0001+0002 s_b@24000MB: 39.383, 1.442 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 50000 #~390MB/s max, yay

There's no regression. I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages should maybe be called
standby_bgwriter_lru_maxpages in this scenario, or, even better, there shouldn't be a maximum set during the closed-DB
recovery scenario.

TEST bgwriterPOC2a to showcase the 2nd patch, which opens the DB for read-write users before the final checkpoint
finishes after redo recovery. The DBA may make that decision via the parameter end_of_recovery_checkpoint_wait=off.
- on slow storage (xvda, fsync=on) and even with high memory:

s_b@24000MB: 39.043, 15.639 -- even with WAL recovery being 100% CPU bound (mostly on hash_search_with_hash_value()
for buffers / __memmove_ssse3_back), it took an additional 15s to perform the checkpoint before the DB was open for
users (it had to write 269462 buffers =~ 2GB =~ 140MB/s, which is close to the xvda device speed); the complete output
on 14master looks similar to this:

1598609928.620 startup 22543 LOG:  redo done at 1/12201C88
1598609928.624 checkpointer 22541 LOG:  checkpoint starting: end-of-recovery immediate wait
1598609944.908 checkpointer 22541 LOG:  checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.145 s, sync=0.138 s, total=16.285 s; sync files=11, longest=0.133 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB
1598609944.912 postmaster 22538 LOG:  database system is ready to accept connections

s_b@24000MB: 39.96, 0, with end_of_recovery_checkpoint_wait = off - the DB is open 15s faster

1598610331.556 startup 29499 LOG:  redo done at 1/12201C88
1598610331.559 checkpointer 29497 LOG:  checkpoint starting: immediate force
1598610331.562 postmaster 29473 LOG:  database system is ready to accept connections
1598610347.202 checkpointer 29497 LOG:  checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.092 s, sync=0.149 s, total=15.643 s; sync files=12, longest=0.142 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB

I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a lot of time and this 0002 patch bypasses
that. I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage
snapshots / storage replication, maybe with DWH-after-crash too).

TEST bgwriterPOC2b: FYI, I was also testing the hot_standby code path -- to test whether it would reduce the time of
starting / opening a fresh standby for read-only queries, but this parameter doesn't seem to influence that in my
tests. As I've learned, it's apparently much more complex to reproduce what I'm after and involves a lot of reading
about LogStandbySnapshot() / standby recovery points on my side.

Now, back to the smgropen() / hash_search_with_hash_value() reproducer...

-Jakub Wartak.



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> postgres 120935  0.9  0.0 866052  3824 ?        Ss   09:47   0:00 postgres: checkpointer
> postgres 120936 61.9  0.0 865796  3824 ?        Rs   09:47   0:22 postgres: background writer
> postgres 120937 97.4  0.0 865940  5228 ?        Rs   09:47   0:36 postgres: startup recovering 000000010000000000000089
>
> speedup of 1.647x

Thanks for testing!  That's better than I expected.  I guess it wasn't
quite so good with default bgwriter settings.

> I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages should maybe be called
> standby_bgwriter_lru_maxpages in this scenario, or, even better, there shouldn't be a maximum set during the
> closed-DB recovery scenario.

I wish bgwriter could auto-tune itself better, so we wouldn't need to
contemplate adding more settings.

As for the second patch ("Optionally, don't wait for end-of-recovery
checkpoint."), that also looked quite useful in your test scenario:

> end_of_recovery_checkpoint_wait = off - the DB is open 15s faster

> I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a lot of time and this 0002 patch
> bypasses that. I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB
> from storage snapshots / storage replication, maybe with DWH-after-crash too).

I suppose a third option that you might want is no checkpoint at all
(meaning leave it to regular checkpoint scheduling), like fast
promotion.  One thing missing from the patch is that we probably need
to log an end-of-recovery *record*, like fast promotion does.  I'm a
little fuzzy on the timeline stuff.  I wonder if any recovery experts
would like to weigh in on theoretical problems that might be lurking
here...



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> ... %CPU ... COMMAND
> ... 97.4 ... postgres: startup recovering 000000010000000000000089

So, what else is pushing this thing off CPU, anyway?  For one thing, I
guess it might be stalling while reading the WAL itself, because (1)
we only read it 8KB at a time, relying on kernel read-ahead, which
typically defaults to 128KB I/Os unless you cranked it up, but for
example we know that's not enough to saturate a sequential scan on
NVME system, so maybe it hurts here too (2) we keep having to switch
segment files every 16MB.  Increasing WAL segment size and kernel
readahead size presumably help with that, if indeed it is a problem,
but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
future segment every time we cross a boundary, and also maybe increase
the size of our reads.
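
The hint itself would be cheap; as a standalone sketch (segment path taken from the command line, 16MB matching the
default WAL segment size):

/*
 * Standalone sketch: ask the kernel to start reading a future WAL segment
 * in the background (illustrative only, not xlog.c).
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

/* Hint the kernel that we'll soon want the whole segment. */
static void hint_next_segment(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return;                 /* segment may not exist yet; that's fine */
    (void) posix_fadvise(fd, 0, WAL_SEGMENT_SIZE, POSIX_FADV_WILLNEED);
    close(fd);
}

int main(int argc, char **argv)
{
    if (argc > 1)
        hint_next_segment(argv[1]);
    return 0;
}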



Re: Handing off SLRU fsyncs to the checkpointer

From
Jakub Wartak
Date:
Hi Thomas, hackers,

>> ... %CPU ... COMMAND
>> ... 97.4 ... postgres: startup recovering 000000010000000000000089
> So, what else is pushing this thing off CPU, anyway?  For one thing, I
> guess it might be stalling while reading the WAL itself, because (1)
> we only read it 8KB at a time, relying on kernel read-ahead, which
> typically defaults to 128KB I/Os unless you cranked it up, but for
> example we know that's not enough to saturate a sequential scan on
> NVME system, so maybe it hurts here too (2) we keep having to switch
> segment files every 16MB.  Increasing WAL segment size and kernel
> readahead size presumably help with that, if indeed it is a problem,
> but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
> future segment every time we cross a boundary, and also maybe increase
> the size of our reads.

All of the above (1, 2) would make sense, and the effects are IMHO partially achievable via ./configure compile
options, but from previous correspondence [1], in this particular workload it looked like it was not WAL reading but
reading random DB blocks into shared buffers: in that case I suppose it was the price of too many syscalls to the
OS/VFS cache itself, as the DB was small and fully cached there - so problem (3): copy_user_enhanced_fast_string <-
17.79%--copyout(!) <- __pread_nocancel <- 16.56%--FileRead / mdread / ReadBuffer_common (!). Without some
micro-optimization or some form of vectorized [batching] I/O in recovery, it's a dead end when it comes to small
changes. Things that come to my mind for enhancing recovery:
- preadv() - works only for 1 fd, while the WAL stream might require reading a lot of random pages into s_b (many
relations/fds; even a btree inserting into a single relation might put data into many 1GB [default] forks). This would
only micro-optimize INSERT (pk) SELECT nextval(seq) kind of processing on the recovery side, I suppose (a minimal
sketch of the syscall itself follows below). Of course, that's provided StartupXLOG would work in a more batched way:
(a) reading a lot of blocks from the WAL at once, (b) then issuing preadv() to get all the DB blocks into s_b coming
from the same rel/fd, (c) applying the WAL. Sounds like a major refactor just to save syscalls :(
- mmap() - even more unrealistic
- IO_URING - gives a lot of promise here I think; is it even planned to be shown for the PgSQL 14 cycle? Or is it more
like PgSQL 15?
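
Here's the minimal preadv() sketch mentioned above (standalone, with a placeholder file name): several 8kB blocks of
one contiguous segment range pulled in with a single syscall:

/*
 * Standalone preadv() sketch: read NBLOCKS contiguous 8kB blocks from one
 * fd into separate buffers with a single system call.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#define NBLOCKS 4
#define BLCKSZ 8192

int main(int argc, char **argv)
{
    static char blocks[NBLOCKS][BLCKSZ];
    struct iovec iov[NBLOCKS];
    int fd = open(argc > 1 ? argv[1] : "relation-segment.dat", O_RDONLY);

    if (fd < 0) { perror("open"); return 1; }
    for (int i = 0; i < NBLOCKS; i++)
    {
        iov[i].iov_base = blocks[i];
        iov[i].iov_len = BLCKSZ;
    }
    /* One syscall instead of NBLOCKS pread() calls, for contiguous blocks. */
    ssize_t nread = preadv(fd, iov, NBLOCKS, 0);

    printf("read %zd bytes in one call\n", nread);
    close(fd);
    return 0;
}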

-Jakub Wartak

[1] -
https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com
please see the profile after "0001+0002 s_b(at)128MB: 85.392"


Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Mon, Aug 31, 2020 at 8:50 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> - IO_URING - gives a lot of promise here I think; is it even planned to be shown for the PgSQL 14 cycle? Or is it
> more like PgSQL 15?
 

I can't answer that, but I've played around with the prototype quite a
bit, and thought quite a lot about how to port it to systems without
IO_URING, and I'm just as keen to see this happen as you are.

In the meantime, from the low-hanging-fruit department, here's a new
version of the SLRU-fsync-offload patch.  The only changes are a
tweaked commit message, and adoption of C99 designated initialisers
for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
relying on humans to make the array order match the enum values.  If
there are no comments or objections, I'm planning to commit this quite
soon.
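
As a minimal standalone illustration of that style (the enum values and
handler functions below are made up for the example, not the patch's
actual sync handler table):

    #include <stdio.h>

    typedef enum SyncHandler
    {
        SYNC_HANDLER_MD = 0,
        SYNC_HANDLER_CLOG,
        SYNC_HANDLER_COMMIT_TS,
        NUM_SYNC_HANDLERS
    } SyncHandler;

    static int sync_md(void)   { puts("fsync an md segment"); return 0; }
    static int sync_clog(void) { puts("fsync a clog segment"); return 0; }
    static int sync_cts(void)  { puts("fsync a commit_ts segment"); return 0; }

    /*
     * C99 designated initialisers: each slot names its enum value, so
     * the table stays correct even if the enum or the entries are
     * reordered.  No trailing comma after the last entry; the follow-up
     * below mentions that Windows CI rejected the version with one.
     */
    static int (*const sync_table[NUM_SYNC_HANDLERS]) (void) = {
        [SYNC_HANDLER_MD] = sync_md,
        [SYNC_HANDLER_CLOG] = sync_clog,
        [SYNC_HANDLER_COMMIT_TS] = sync_cts
    };

    int
    main(void)
    {
        return sync_table[SYNC_HANDLER_CLOG]();
    }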

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> In the meantime, from the low-hanging-fruit department, here's a new
> version of the SLRU-fsync-offload patch.  The only changes are a
> tweaked commit message, and adoption of C99 designated initialisers
> for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
> relying on humans to make the array order match the enum values.  If
> there are no comments or objections, I'm planning to commit this quite
> soon.

... and CI told me that Windows didn't like my array syntax with the
extra trailing comma.  Here's one without.

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Sun, Sep 20, 2020 at 12:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > In the meantime, from the low-hanging-fruit department, here's a new
> > version of the SLRU-fsync-offload patch.  The only changes are a
> > tweaked commit message, and adoption of C99 designated initialisers
> > for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
> > relying on humans to make the array order match the enum values.  If
> > there are no comments or objections, I'm planning to commit this quite
> > soon.
>
> ... and CI told me that Windows didn't like my array syntax with the
> extra trailing comma.  Here's one without.

While scanning for comments and identifier names that needed updating,
I realised that this patch changed the behaviour of the ShutdownXXX()
functions, since they currently flush the SLRUs but are not followed
by a checkpoint.  I'm not entirely sure I understand the logic of
that, but it wasn't my intention to change it.  So here's a version
that converts the existing fsync_fname() to fsync_fname_recurse() to
fix that.

Strangely, the fsync calls that ensure that directory entries are on
disk seemed to be missing from CheckPointMultixact(), so I added them.
Isn't that a live bug?
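
The kind of call being added looks roughly like this (a standalone
sketch rather than PostgreSQL's own helper; error handling omitted):

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Fsync the directory itself, so that the directory entries for
     * newly created SLRU segment files are durable, not just the file
     * contents.
     */
    static void
    fsync_directory(const char *path)
    {
        int         fd = open(path, O_RDONLY);

        if (fd >= 0)
        {
            (void) fsync(fd);
            close(fd);
        }
    }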

I decided it was a little too magical that CheckPointCLOG() etc
depended on the later call to CheckPointBuffers() to perform their
fsyncs.  I started writing comments about that, but then I realised
that the right thing to do was to hoist ProcessSyncRequests() out of
there into CheckPointGuts() and make it all more explicit.

I also realised that it would be inconsistent to count SLRU sync
activity as buffer sync time, but not count SLRU write activity as
buffer write time, or count its buffers as written in the reported
statistics.  In other words, SLRU buffers *are* buffers for checkpoint
reporting purposes (or should at least be consistently in or out of
the stats, and with this patch they have to be in).

Does that make sense?   Is there a problem I'm not seeing with
reordering CheckPointGuts() as I have?

One comment change that seems worth highlighting is this code reached
by VACUUM, which seems like a strict improvement (it wasn't flushing
for crash recovery):

        /*
-        * Flush out dirty data, so PhysicalPageExists can work correctly.
-        * SimpleLruFlush() is a pretty big hammer for that.  Alternatively we
-        * could add an in-memory version of page exists, but find_multixact_start
-        * is called infrequently, and it doesn't seem bad to flush buffers to
-        * disk before truncation.
+        * Write out dirty data, so PhysicalPageExists can work correctly.
         */
-       SimpleLruFlush(MultiXactOffsetCtl, true);
-       SimpleLruFlush(MultiXactMemberCtl, true);
+       SimpleLruWriteAll(MultiXactOffsetCtl, true);
+       SimpleLruWriteAll(MultiXactMemberCtl, true);

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> While scanning for comments and identifier names that needed updating,
> I realised that this patch changed the behaviour of the ShutdownXXX()
> functions, since they currently flush the SLRUs but are not followed
> by a checkpoint.  I'm not entirely sure I understand the logic of
> that, but it wasn't my intention to change it.  So here's a version
> that converts the existing fsync_fname() to fsync_fname_recurse() to

Bleugh, that was probably a bad idea; it's too expensive.  But it
forces me to ask the question: *why* do we need to call
Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
shutdown checkpoint?  I wondered if this might date from before the
WAL, but I see that the pattern was introduced when the CLOG was moved
out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
yet even in that commit the preceding CreateCheckPoint() call included
a call to CheckPointCLOG().



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Tue, Sep 22, 2020 at 9:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > While scanning for comments and identifier names that needed updating,
> > I realised that this patch changed the behaviour of the ShutdownXXX()
> > functions, since they currently flush the SLRUs but are not followed
> > by a checkpoint.  I'm not entirely sure I understand the logic of
> > that, but it wasn't my intention to change it.  So here's a version
> > that converts the existing fsync_fname() to fsync_fname_recurse() to
>
> Bleugh, that was probably a bad idea; it's too expensive.  But it
> forces me to ask the question: *why* do we need to call
> Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
> shutdown checkpoint?  I wondered if this might date from before the
> WAL, but I see that the pattern was introduced when the CLOG was moved
> out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
> yet even in that commit the preceding CreateCheckPoint() call included
> a call to CheckPointCLOG().

I complained about the apparently missing multixact fsync in a new
thread, because if I'm right about that it requires a back-patchable
fix.

As for the ShutdownXXX() functions, I haven't yet come up with any
reason for this code to exist.  Emboldened by a colleague's inability
to explain to me what that code is doing for us, here is a new version
that just rips it all out.

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Wed, Sep 23, 2020 at 1:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for the ShutdownXXX() functions, I haven't yet come up with any
> reason for this code to exist.  Emboldened by a colleague's inability
> to explain to me what that code is doing for us, here is a new version
> that just rips it all out.

Rebased.

Tom, do you have any thoughts on ShutdownCLOG() etc?

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Tom, do you have any thoughts on ShutdownCLOG() etc?

Hm, if we cannot reach that without first completing a shutdown checkpoint,
it does seem a little pointless.

It'd likely be a good idea to add a comment to CheckPointCLOG et al
explaining that we expect $what-exactly to fsync the data we are writing
before the checkpoint is considered done.

            regards, tom lane



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Fri, Sep 25, 2020 at 12:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Tom, do you have any thoughts on ShutdownCLOG() etc?
>
> Hm, if we cannot reach that without first completing a shutdown checkpoint,
> it does seem a little pointless.

Thanks for the sanity check.

> It'd likely be a good idea to add a comment to CheckPointCLOG et al
> explaining that we expect $what-exactly to fsync the data we are writing
> before the checkpoint is considered done.

Good point.  Done like this:

+       /*
+        * Write dirty CLOG pages to disk.  This may result in sync requests
+        * queued for later handling by ProcessSyncRequests(), as part of the
+        * checkpoint.
+        */
        TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-       SimpleLruFlush(XactCtl, true);
+       SimpleLruWriteAll(XactCtl, true);
        TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);

Here's a new version.  The final thing I'm contemplating before
pushing this is whether there may be hidden magical dependencies in
the order of operations in CheckPointGuts(), which I've changed
around.  Andres, any comments?

Attachment

Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new version.  The final thing I'm contemplating before
> pushing this is whether there may be hidden magical dependencies in
> the order of operations in CheckPointGuts(), which I've changed
> around.  Andres, any comments?

I nagged Andres off-list and he opined that it might be better to
reorder it a bit so that ProcessSyncRequests() comes after almost
everything else, so that if we ever teach more things to offload their
fsync work it'll be in the right order.  I reordered it like that; now
only CheckPointTwoPhase() comes later, based on the comment that
accompanies it.  In any case, we can always reconsider the ordering of
this function in later commits as required.  Pushed like that.
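
For readers following along, here is a self-contained sketch of the
ordering just described; the lowercase stubs stand in for the real
CheckPointGuts() callees and are not the PostgreSQL functions:

    #include <stdio.h>

    static void checkpoint_clog(void)       { puts("write CLOG pages, queue fsyncs"); }
    static void checkpoint_commit_ts(void)  { puts("write commit_ts pages, queue fsyncs"); }
    static void checkpoint_multixact(void)  { puts("write multixact pages, queue fsyncs"); }
    static void checkpoint_buffers(void)    { puts("write shared buffers, queue fsyncs"); }
    static void process_sync_requests(void) { puts("absorb the queue, fsync everything"); }
    static void checkpoint_two_phase(void)  { puts("flush two-phase state files"); }

    static void
    checkpoint_guts_sketch(void)
    {
        /* The writers come first; they only queue sync requests now. */
        checkpoint_clog();
        checkpoint_commit_ts();
        checkpoint_multixact();
        checkpoint_buffers();

        /* One pass over the sync request queue performs all the fsyncs. */
        process_sync_requests();

        /* Two-phase state is still handled last, per its existing comment. */
        checkpoint_two_phase();
    }

    int
    main(void)
    {
        checkpoint_guts_sketch();
        return 0;
    }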



Re: Handing off SLRU fsyncs to the checkpointer

From
Tomas Vondra
Date:
On 9/25/20 9:09 AM, Thomas Munro wrote:
> On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Here's a new version.  The final thing I'm contemplating before
>> pushing this is whether there may be hidden magical dependencies in
>> the order of operations in CheckPointGuts(), which I've changed
>> around.  Andres, any comments?
> 
> I nagged Andres off-list and he opined that it might be better to
> reorder it a bit so that ProcessSyncRequests() comes after almost
> everything else, so that if we ever teach more things to offload their
> fsync work it'll be in the right order.  I reordered it like that; now
> only CheckPointTwoPhase() comes later, based on the comment that
> accompanies it.  In any case, we can always reconsider the ordering of
> this function in later commits as required.  Pushed like that.
> 

Seems this commit left behind a couple unnecessary prototypes in a bunch 
of header files. In particular, it removed these functions

- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();

but we still have

$ git grep ShutdownCLOG
src/include/access/clog.h:extern void ShutdownCLOG(void);


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Handing off SLRU fsyncs to the checkpointer

From
Thomas Munro
Date:
On Mon, Jan 4, 2021 at 3:35 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Seems this commit left behind a couple unnecessary prototypes in a bunch
> of header files. In particular, it removed these functions
>
> - ShutdownCLOG();
> - ShutdownCommitTs();
> - ShutdownSUBTRANS();
> - ShutdownMultiXact();

Thanks.  Fixed.