Re: A few new options for CHECKPOINT - Mailing list pgsql-hackers

From Bossart, Nathan
Subject Re: A few new options for CHECKPOINT
Date
Msg-id A07F8329-4870-40B2-9991-5016491AA11A@amazon.com
Whole thread Raw
In response to Re: A few new options for CHECKPOINT  (Stephen Frost <sfrost@snowman.net>)
Responses Re: A few new options for CHECKPOINT  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On 11/27/20, 10:58 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
> If you'd like to show that I'm wrong, and it's entirely possible that I
> am, then retry the above with actual load on the system, and also
> actually look at how much outstanding WAL you end up with given the
> different scenarios which has to be replayed during crash recovery.

I did a little experiment to show the behavior I'm referring to.  I
used these settings:

        checkpoint_completion_target = 0.9
        checkpoint_timeout = 30s
        max_wal_size = 20GB
        WAL segment size is 64MB

I ran the following pgbench command for a few minutes before each
test:

        pgbench postgres -T 3600 -c 64 -j 64 -N

For the first test, I killed Postgres just before an automatic, non-
immediate checkpoint completed.

        2020-11-28 00:31:57 UTC::@:[51770]:LOG:  checkpoint complete...
        2020-11-28 00:32:00 UTC::@:[51770]:LOG:  checkpoint starting: time

        Killed Postgres at 00:32:26 UTC, 29 seconds after latest
        checkpoint completed.

        2020-11-28 00:32:42 UTC::@:[77256]:LOG:  redo starts at 3CF/FD6B8BD0
        2020-11-28 00:32:56 UTC::@:[77256]:LOG:  redo done at 3D0/C94D1D00

        Recovery took 14 seconds and replayed ~3.2 GB of WAL.

        postgres=> SELECT pg_wal_lsn_diff('3D0/C94D1D00', '3CF/FD6B8BD0');
         pg_wal_lsn_diff
        -----------------
              3420557616
        (1 row)

For the second test, I killed Postgres just after an automatic, non-
immediate checkpoint completed.

        2020-11-28 00:41:26 UTC::@:[77475]:LOG:  checkpoint complete...

        Killed Postgres at 00:41:26 UTC, just after latest checkpoint
        completed.

        2020-11-28 00:41:42 UTC::@:[8599]:LOG:  redo starts at 3D3/152EDD78
        2020-11-28 00:41:49 UTC::@:[8599]:LOG:  redo done at 3D3/78358A40

        Recovery took 7 seconds and replayed ~1.5 GB of WAL.

        postgres=> SELECT pg_wal_lsn_diff('3D3/78358A40', '3D3/152EDD78');
         pg_wal_lsn_diff
        -----------------
              1661381832
        (1 row)

Granted, I used a rather aggressive checkpoint_timeout, but I think
this demonstrates that waiting for a non-immediate checkpoint to
complete can lower the amount of WAL needed for recovery, even though
it might not lower it as much as waiting for an immediate checkpoint
would.

Nathan


pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [PoC] Non-volatile WAL buffer
Next
From: Tomas Vondra
Date:
Subject: Re: POC: postgres_fdw insert batching