Thread: checkpoints taking much longer than expected

checkpoints taking much longer than expected

From: Tiemen Ruiten
Hello,

I set up a new 3-node cluster with the following specifications:

2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2*20 cores)
128 GB RAM
8x Crucial MX500 1TB SSDs

FS is ZFS, the dataset with the PGDATA directory on it has the following properties (only non-default listed):

NAME        PROPERTY              VALUE                  SOURCE
tank/pgsql  used                  234G                   -
tank/pgsql  available             3.28T                  -
tank/pgsql  referenced            234G                   -
tank/pgsql  compressratio         2.68x                  -
tank/pgsql  compression           lz4                    inherited from tank
tank/pgsql  atime                 off                    inherited from tank
tank/pgsql  canmount              on                     local
tank/pgsql  xattr                 sa                     inherited from tank

My postgresql.conf (only changed from default listed):

hba_file = '/var/lib/pgsql/pg_hba.conf' 
listen_addresses = '*'
max_connections = 800
shared_buffers = 8GB
work_mem = 64MB
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB
dynamic_shared_memory_type = posix
effective_io_concurrency = 200
max_worker_processes = 50
max_parallel_maintenance_workers = 8
max_parallel_workers_per_gather = 8
max_parallel_workers = 40
wal_level = replica
synchronous_commit = off
full_page_writes = on
wal_log_hints = on
wal_buffers = 128MB
checkpoint_timeout = 60min
max_wal_size = 8GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'pgbackrest --stanza=pgdb2 archive-push %p'
max_wal_senders = 10
wal_keep_segments = 20
hot_standby = on
hot_standby_feedback = on
random_page_cost = 1.5
effective_cache_size = 48GB
default_statistics_target = 500         # range 1-10000
idle_in_transaction_session_timeout = 30min # in milliseconds, 0 is disabled
shared_preload_libraries = 'timescaledb, pg_cron'
max_locks_per_transaction = 512
timescaledb.max_background_workers = 8

My problem is that checkpoints are taking a long time. Even when I run a few manual checkpoints one after the other, they keep taking very long, up to 10 minutes:

2019-06-14 15:21:10.351 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:25:57.655 CEST [23657] LOG:  checkpoint complete: wrote 139831 buffers (13.3%); 0 WAL file(s) added, 148 removed, 40 recycled; write=284.751 s, sync=0.532 s, total=287.304 s; sync files=537, longest=0.010 s, average=0.000 s; distance=2745065 kB, estimate=2841407 kB
2019-06-14 15:26:01.988 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:30:30.430 CEST [23657] LOG:  checkpoint complete: wrote 238322 buffers (22.7%); 0 WAL file(s) added, 0 removed, 172 recycled; write=264.794 s, sync=0.415 s, total=268.441 s; sync files=378, longest=0.011 s, average=0.001 s; distance=2834018 kB, estimate=2840668 kB
2019-06-14 15:30:44.097 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:37:01.438 CEST [23657] LOG:  checkpoint complete: wrote 132286 buffers (12.6%); 0 WAL file(s) added, 54 removed, 96 recycled; write=366.614 s, sync=2.975 s, total=377.341 s; sync files=467, longest=0.095 s, average=0.006 s; distance=2444291 kB, estimate=2801030 kB

What is going on? This doesn't seem like normal behaviour.

Re: checkpoints taking much longer than expected

From: Ravi Krishna
On 6/14/19 10:01 AM, Tiemen Ruiten wrote:
> LOG:  checkpoint starting: immediate force wait

Does it mean that the DB is blocked until the completion of the checkpoint?
Years ago Informix used to have this issue, until they fixed it around 2006.



Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Ravi Krishna (ravikrishna@mail.com) wrote:
> On 6/14/19 10:01 AM, Tiemen Ruiten wrote:
> >LOG:  checkpoint starting: immediate force wait
>
> Does it mean that the DB is blocked until the completion of the checkpoint?
> Years ago Informix used to have this issue, until they fixed it around 2006.

No.

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Tiemen Ruiten (t.ruiten@tech-lab.io) wrote:
> checkpoint_timeout = 60min

That seems like a pretty long timeout.

> My problem is that checkpoints are taking a long time. Even when I run a
> few manual checkpoints one after the other, they keep taking very long, up
> to 10 minutes:

You haven't said *why* this is an issue...  Why are you concerned with
how long it takes to do a checkpoint?

> 2019-06-14 15:21:10.351 CEST [23657] LOG:  checkpoint starting: immediate
> force wait
> 2019-06-14 15:25:57.655 CEST [23657] LOG:  checkpoint complete: wrote
> 139831 buffers (13.3%); 0 WAL file(s) added, 148 removed, 40 recycled;
> write=284.751 s, sync=0.532 s, total=287.304 s; sync files=537,
> longest=0.010 s, average=0.000 s; distance=2745065 kB, estimate=2841407 kB
> 2019-06-14 15:26:01.988 CEST [23657] LOG:  checkpoint starting: immediate
> force wait
> 2019-06-14 15:30:30.430 CEST [23657] LOG:  checkpoint complete: wrote
> 238322 buffers (22.7%); 0 WAL file(s) added, 0 removed, 172 recycled;
> write=264.794 s, sync=0.415 s, total=268.441 s; sync files=378,
> longest=0.011 s, average=0.001 s; distance=2834018 kB, estimate=2840668 kB
> 2019-06-14 15:30:44.097 CEST [23657] LOG:  checkpoint starting: immediate
> force wait
> 2019-06-14 15:37:01.438 CEST [23657] LOG:  checkpoint complete: wrote
> 132286 buffers (12.6%); 0 WAL file(s) added, 54 removed, 96 recycled;
> write=366.614 s, sync=2.975 s, total=377.341 s; sync files=467,
> longest=0.095 s, average=0.006 s; distance=2444291 kB, estimate=2801030 kB
>
> What is going on? It doesn't seem like normal behaviour?

The time information is all there and it tells you what it's doing and
how much had to be done... If you're unhappy with how long it takes to
write out gigabytes of data and fsync hundreds of files, talk to your
storage people...

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: "Peter J. Holzer"
On 2019-06-14 16:01:40 +0200, Tiemen Ruiten wrote:
> FS is ZFS, the dataset with the PGDATA directory on it has the following
> properties (only non-default listed):
[...]
> My problem is that checkpoints are taking a long time. Even when I run a few
> manual checkpoints one after the other, they keep taking very long, up to 10
> minutes:

There was a discussion about ZFS' COW behaviour and PostgreSQL reusing
WAL files not being a good combination about a year ago:
https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com

Maybe you have the same problem?

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: checkpoints taking much longer than expected

From: Tiemen Ruiten

On Fri, Jun 14, 2019 at 5:43 PM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Tiemen Ruiten (t.ruiten@tech-lab.io) wrote:
> checkpoint_timeout = 60min

That seems like a pretty long timeout.

My reasoning was that a longer recovery time would be an acceptable trade-off for fewer checkpoint writes, because there are two more nodes in the cluster to fall back on in case of emergency.
 

> My problem is that checkpoints are taking a long time. Even when I run a
> few manual checkpoints one after the other, they keep taking very long, up
> to 10 minutes:

You haven't said *why* this is an issue...  Why are you concerned with
how long it takes to do a checkpoint?

During normal operation I don't mind that it takes a long time, but when performing maintenance I want to be able to gracefully bring down the master without long delays to promote one of the standbys.
 

The time information is all there and it tells you what it's doing and
how much had to be done... If you're unhappy with how long it takes to
write out gigabytes of data and fsync hundreds of files, talk to your
storage people...

I am the storage people too :)
 

Thanks,

Stephen

Re: checkpoints taking much longer than expected

From: Peter Geoghegan
On Sat, Jun 15, 2019 at 1:50 AM Tiemen Ruiten <t.ruiten@tech-lab.io> wrote:
> During normal operation I don't mind that it takes a long time, but when
> performing maintenance I want to be able to gracefully bring down the master
> without long delays to promote one of the standbys.
 

Maybe an "immediate" mode shutdown is appropriate, then. That will
mean that the primary will need to go through crash recovery if and
when you bring it back up, though.
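Something along these lines (the data directory path is just an example):

pg_ctl -D /var/lib/pgsql/data stop -m immediate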

-- 
Peter Geoghegan



Re: checkpoints taking much longer than expected

From: Michael Loftis


On Fri, Jun 14, 2019 at 08:02 Tiemen Ruiten <t.ruiten@tech-lab.io> wrote:
Hello,

I setup a new 3-node cluster with the following specifications:

2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2*20 cores)
128 GB RAM
8x Crucial MX500 1TB SSD's

FS is ZFS, the dataset with the PGDATA directory on it has the following properties (only non-default listed):

NAME        PROPERTY              VALUE                  SOURCE
tank/pgsql  used                  234G                   -
tank/pgsql  available             3.28T                  -
tank/pgsql  referenced            234G                   -
tank/pgsql  compressratio         2.68x                  -
tank/pgsql  compression           lz4                    inherited from tank

You're possibly slowing yourself down a lot by using compression here. Turning off compression though is multiple steps. You basically have to turn it off in the FS and then rewrite/copy the data. I'd check in a scratch area how long it takes to write ~3G of random data with compression on vs off.
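A rough sketch of such a test (dataset names are only examples, and note that /dev/urandom output is essentially incompressible, so also try input that looks like your real data):

zfs create -o compression=lz4 tank/scratch-lz4
zfs create -o compression=off tank/scratch-off
dd if=/dev/urandom of=/tmp/random.bin bs=1M count=3072
time sh -c 'cp /tmp/random.bin /tank/scratch-lz4/ && sync'
time sh -c 'cp /tmp/random.bin /tank/scratch-off/ && sync'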

tank/pgsql  atime                 off                    inherited from tank
tank/pgsql  canmount              on                     local
tank/pgsql  xattr                 sa                     inherited from tank

My postgresql.conf (only changed from default listed):

hba_file = '/var/lib/pgsql/pg_hba.conf' 
listen_addresses = '*'
max_connections = 800
shared_buffers = 8GB
work_mem = 64MB
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB
dynamic_shared_memory_type = posix
effective_io_concurrency = 200
max_worker_processes = 50
max_parallel_maintenance_workers = 8
max_parallel_workers_per_gather = 8
max_parallel_workers = 40
wal_level = replica
synchronous_commit = off
full_page_writes = on
wal_log_hints = on
wal_buffers = 128MB
checkpoint_timeout = 60min
max_wal_size = 8GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'pgbackrest --stanza=pgdb2 archive-push %p'
max_wal_senders = 10
wal_keep_segments = 20
hot_standby = on
hot_standby_feedback = on
random_page_cost = 1.5
effective_cache_size = 48GB
default_statistics_target = 500         # range 1-10000
idle_in_transaction_session_timeout = 30min # in milliseconds, 0 is disabled
shared_preload_libraries = 'timescaledb, pg_cron'
max_locks_per_transaction = 512
timescaledb.max_background_workers = 8

My problem is that checkpoints are taking a long time. Even when I run a few manual checkpoints one after the other, they keep taking very long, up to 10 minutes:

2019-06-14 15:21:10.351 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:25:57.655 CEST [23657] LOG:  checkpoint complete: wrote 139831 buffers (13.3%); 0 WAL file(s) added, 148 removed, 40 recycled; write=284.751 s, sync=0.532 s, total=287.304 s; sync files=537, longest=0.010 s, average=0.000 s; distance=2745065 kB, estimate=2841407 kB
2019-06-14 15:26:01.988 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:30:30.430 CEST [23657] LOG:  checkpoint complete: wrote 238322 buffers (22.7%); 0 WAL file(s) added, 0 removed, 172 recycled; write=264.794 s, sync=0.415 s, total=268.441 s; sync files=378, longest=0.011 s, average=0.001 s; distance=2834018 kB, estimate=2840668 kB
2019-06-14 15:30:44.097 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:37:01.438 CEST [23657] LOG:  checkpoint complete: wrote 132286 buffers (12.6%); 0 WAL file(s) added, 54 removed, 96 recycled; write=366.614 s, sync=2.975 s, total=377.341 s; sync files=467, longest=0.095 s, average=0.006 s; distance=2444291 kB, estimate=2801030 kB

What is going on? It doesn't seem like normal behaviour?
--

"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler

Re: checkpoints taking much longer than expected

From: Jeff Janes
On Sat, Jun 15, 2019 at 4:50 AM Tiemen Ruiten <t.ruiten@tech-lab.io> wrote:

On Fri, Jun 14, 2019 at 5:43 PM Stephen Frost <sfrost@snowman.net> wrote:

The time information is all there and it tells you what it's doing and
how much had to be done... If you're unhappy with how long it takes to
write out gigabytes of data and fsync hundreds of files, talk to your
storage people...

Right, but true only because they were "checkpoint starting: immediate".  Otherwise the reported write time includes intentional sleeps added to honor the checkpoint_completion_target.  A bit confusing to report it that way, I think.

I am the storage people too :)

So, what did you say to yourself?  Have you done fundamental benchmarking at the OS level, to see how fast you can write out 1-2 GB of data changes spread out randomly over a few hundred GB of data, when it is not coming from PostgreSQL?
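For example, a rough fio run along these lines (sizes, paths and flags are only illustrative) would show what the pool itself can do for small random writes with periodic syncs, independent of PostgreSQL:

fio --name=pgsim --directory=/tank/pgsql-bench --rw=randwrite --bs=8k \
    --size=50G --io_size=2G --ioengine=psync --fdatasync=32 --numjobs=1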

Cheers,

Jeff

Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Tiemen Ruiten (t.ruiten@tech-lab.io) wrote:
> On Fri, Jun 14, 2019 at 5:43 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Tiemen Ruiten (t.ruiten@tech-lab.io) wrote:
> > > checkpoint_timeout = 60min
> >
> > That seems like a pretty long timeout.
>
> My reasoning was that a longer recovery time to avoid writes would be
> acceptable because there are two more nodes in the cluster to fall back on
> in case of emergency.

Ok, so you want fewer checkpoints because you expect to failover to a
replica rather than recover the primary on a failure.  If you're doing
synchronous replication, then that certainly makes sense.  If you
aren't, then you're deciding that you're alright with losing some number
of writes by failing over rather than recovering the primary, which can
also be acceptable but it's certainly much more questionable.

> > > My problem is that checkpoints are taking a long time. Even when I run a
> > > few manual checkpoints one after the other, they keep taking very long,
> > up
> > > to 10 minutes:
> >
> > You haven't said *why* this is an issue...  Why are you concerned with
> > how long it takes to do a checkpoint?
>
> During normal operation I don't mind that it takes a long time, but when
> performing maintenance I want to be able to gracefully bring down the
> master without long delays to promote one of the standby's.

I'm getting the feeling that your replicas are async, but it sounds like
you'd be better off with having at least one sync replica, so that you
can flip to it quickly.  Alternatively, having a way to more easily make
the primary stop accepting new writes, flush everything to the replicas,
report that it's completed doing so, to allow you to promote a replica
without losing anything, and *then* go through the process on the
primary of doing a checkpoint, would be kind of nice.
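For the sync-replica option, a minimal sketch (standby names are hypothetical, and synchronous_commit would have to be raised from 'off', e.g. to 'remote_write' or 'on', for commits to actually wait on a standby):

psql -c "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (node2, node3)';"
psql -c "ALTER SYSTEM SET synchronous_commit = 'remote_write';"
psql -c "SELECT pg_reload_conf();"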

Then again, you run into the issue that if your async replicas are very
far behind then you're still going to have a long period of time between
the "stop accepting new writes" and "finished flushing everything to the
replicas".

> > The time information is all there and it tells you what it's doing and
> > how much had to be done... If you're unhappy with how long it takes to
> > write out gigabytes of data and fsync hundreds of files, talk to your
> > storage people...
>
> I am the storage people too :)

Great!  Make it go faster. :)

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Jeff Janes (jeff.janes@gmail.com) wrote:
> On Sat, Jun 15, 2019 at 4:50 AM Tiemen Ruiten <t.ruiten@tech-lab.io> wrote:
> > On Fri, Jun 14, 2019 at 5:43 PM Stephen Frost <sfrost@snowman.net> wrote:
> >> The time information is all there and it tells you what it's doing and
> >> how much had to be done... If you're unhappy with how long it takes to
> >> write out gigabytes of data and fsync hundreds of files, talk to your
> >> storage people...
>
> Right, but true only because they were "checkpoint starting: immediate".
> Otherwise the reported write time includes intentional sleeps added to
> honor the checkpoint_completion_target.  A bit confusing to report it that
> way, I think.

Hrmpf, yeah, we could probably improve on that, I'd certainly like to
see that broken out.

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: Alvaro Herrera
On 2019-Jun-14, Peter J. Holzer wrote:

> There was a discussion about ZFS' COW behaviour and PostgreSQL reusing
> WAL files not being a good combination about a year ago:
> https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com
> 
> Maybe you have the same problem?

Note that Joyent ended up proposing patches to fix their performance
problem (and got them committed).  Maybe it would be useful for Tiemen
to try that code?  (That commit cherry-picks cleanly on REL_11_STABLE.)


commit 475861b2615dd63ae8431d811749a6f9a15bbfd6
Author:     Thomas Munro <tmunro@postgresql.org>
AuthorDate: Tue Apr 2 14:37:14 2019 +1300
CommitDate: Tue Apr 2 14:37:14 2019 +1300

    Add wal_recycle and wal_init_zero GUCs.
    
    On at least ZFS, it can be beneficial to create new WAL files every
    time and not to bother zero-filling them.  Since it's not clear which
    other filesystems might benefit from one or both of those things,
    add individual GUCs to control those two behaviors independently and
    make only very general statements in the docs.
    
    Author: Jerry Jelinek, with some adjustments by Thomas Munro
    Reviewed-by: Alvaro Herrera, Andres Freund, Tomas Vondra, Robert Haas and others
    Discussion: https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com
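If someone wants to experiment: on a build that has those GUCs (12devel, or 11 with the patch applied), the settings could be flipped roughly like this; whether it helps this particular workload is untested:

psql -c "ALTER SYSTEM SET wal_recycle = off;"
psql -c "ALTER SYSTEM SET wal_init_zero = off;"
psql -c "SELECT pg_reload_conf();"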

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpoints taking much longer than expected

From: Tiemen Ruiten
On Sun, Jun 16, 2019 at 8:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Note that Joyent ended up proposing patches to fix their performance
problem (and got them committed).  Maybe it would be useful for Tiemen
to try that code?  (That commit cherry-picks cleanly on REL_11_STABLE.)

Interesting! The performance improvements in that thread make it look worth trying. Is this patch likely to make it into an 11.x release?

Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Tiemen Ruiten (t.ruiten@tech-lab.io) wrote:
> On Sun, Jun 16, 2019 at 8:57 PM Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
> > Note that Joyent ended up proposing patches to fix their performance
> > problem (and got them committed).  Maybe it would be useful for Tiemen
> > to try that code?  (That commit cherry-picks cleanly on REL_11_STABLE.)
>
> Interesting! The performance improvements in that thread look like it's
> worth trying. Is this patch likely to make it to an 11.x release?

No, as it's a feature and we don't back-patch those.

Not likely to help with what you're experiencing anyway though...

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: Alvaro Herrera
On 2019-Jun-16, Stephen Frost wrote:

> Not likely to help with what you're experiencing anyway though...

My gut feeling is that you're wrong, since (as I understand) the
symptoms are the same.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpoints taking much longer than expected

From: Tiemen Ruiten


On Sun, Jun 16, 2019 at 7:30 PM Stephen Frost <sfrost@snowman.net> wrote:
Ok, so you want fewer checkpoints because you expect to failover to a
replica rather than recover the primary on a failure.  If you're doing
synchronous replication, then that certainly makes sense.  If you
aren't, then you're deciding that you're alright with losing some number
of writes by failing over rather than recovering the primary, which can
also be acceptable but it's certainly much more questionable.
 
Yes, in our setup that's the case: a few lost transactions will have a negligible impact on the business.


I'm getting the feeling that your replicas are async, but it sounds like
you'd be better off with having at least one sync replica, so that you
can flip to it quickly. 

They are indeed async; we traded durability for performance here because we can accept some lost transactions.
 
Alternatively, having a way to more easily make
the primary stop accepting new writes, flush everything to the replicas,
report that it's completed doing so, to allow you to promote a replica
without losing anything, and *then* go through the process on the
primary of doing a checkpoint, would be kind of nice.

I suppose that would require being able to demote a master to a slave at runtime.
That would definitely be nice to have.
 


Thanks,

Stephen

Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> On 2019-Jun-16, Stephen Frost wrote:
> > Not likely to help with what you're experiencing anyway though...
>
> My gut feeling is that you're wrong, since (as I understand) the
> symptoms are the same.

The issue in the linked-to thread was that of a synch replica getting
stuck doing a random read of a WAL page in order to append to it, since
we were reusing WAL.

The issue being discussed here is writing out to the heap files during a
checkpoint...

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Tiemen Ruiten (t.ruiten@tech-lab.io) wrote:
> On Sun, Jun 16, 2019 at 7:30 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Ok, so you want fewer checkpoints because you expect to failover to a
> > replica rather than recover the primary on a failure.  If you're doing
> > synchronous replication, then that certainly makes sense.  If you
> > aren't, then you're deciding that you're alright with losing some number
> > of writes by failing over rather than recovering the primary, which can
> > also be acceptable but it's certainly much more questionable.
>
> Yes, in our setup that's the case: a few lost transactions will have a
> negligible impact to the business.

There's a risk here that the impact could be much higher than just 'a
few'.  Hopefully you're monitoring it carefully and have some plan for
what to do if the difference grows very quickly.
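Something along these lines, run on the primary, would show how far behind each standby is (just a sketch; the value is bytes of WAL not yet replayed):

psql -x -c "SELECT application_name, state, sync_state,
            pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
            FROM pg_stat_replication;"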

> > Alternatively, having a way to more easily make
> > the primary stop accepting new writes, flush everything to the replicas,
> > report that it's completed doing so, to allow you to promote a replica
> > without losing anything, and *then* go through the process on the
> > primary of doing a checkpoint, would be kind of nice.
>
> I suppose that would require being able to demote a master to a slave
> during runtime.

That's.. not the case.

If you're really not worried about losing transactions, then an
immediate shutdown as suggested by Peter G would let you fail over to
one of your not-completely-current replicas...  Or you could just do that
as soon as you start shutting down the primary.  Of course, you'll end
up in a situation where the replica will have forked off the timeline
from a point prior to where the primary is at, meaning you'd have to
use pg_rewind (or do a restore from an earlier backup) to rebuild the
former primary as a replica, if that's the goal.
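Roughly, with hypothetical hosts and paths (and note pg_rewind needs wal_log_hints = on or data checksums, which you already have):

pg_ctl -D /var/lib/pgsql/data stop -m immediate     # on the old primary
pg_ctl -D /var/lib/pgsql/data promote               # on the chosen standby
pg_rewind --target-pgdata=/var/lib/pgsql/data \
    --source-server='host=node2 user=postgres dbname=postgres'   # later, on the old primary
# then point its recovery.conf at the new primary and start it up as a replica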

If the idea is that you don't want to lose any transactions during this
process, and you want to be able to cleanly bring the former primary
back, then I don't think we've really got a great solution where you can
make sure that all the WAL has been flushed out to the replicas and the
primary just finishes its checkpoint (and thinking about that a bit
more, I'm not sure we could actually do it anyway since we're going to
want to write out that shutdown record, making it a catch-22, but maybe
there's something clever that could be done there..).

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: Alvaro Herrera
On 2019-Jun-16, Stephen Frost wrote:

> The issue being discussed here is writing out to the heap files during a
> checkpoint...

We don't really know, as it was already established that the log line is
misattributing time spent ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpoints taking much longer than expected

From: Stephen Frost
Greetings,

* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> On 2019-Jun-16, Stephen Frost wrote:
>
> > The issue being discussed here is writing out to the heap files during a
> > checkpoint...
>
> We don't really know, as it was already established that the log line is
> misattributing time spent ...

That's in the case of a regular checkpoint; this was from someone
actually issuing 'CHECKPOINT;', so I don't think that's accurate...

Thanks,

Stephen


Re: checkpoints taking much longer than expected

From: Tiemen Ruiten
On Sun, Jun 16, 2019 at 8:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jun-14, Peter J. Holzer wrote:

> There was a discussion about ZFS' COW behaviour and PostgreSQL reusing
> WAL files not being a good combination about a year ago:
> https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com
>
> Maybe you have the same problem?

Note that Joyent ended up proposing patches to fix their performance
problem (and got them committed).  Maybe it would be useful for Tiemen
to try that code?  (That commit cherry-picks cleanly on REL_11_STABLE.)

Unfortunately it doesn't compile:

access/transam/xlog.o: In function `XLogFileInit':
/home/ter/rpmbuild/BUILD/postgresql-11.3/src/backend/access/transam/xlog.c:3299: undefined reference to `pg_pwrite'
collect2: error: ld returned 1 exit status
 




Re: checkpoints taking much longer than expected

From: Andres Freund
On 2019-06-16 12:25:58 -0400, Jeff Janes wrote:
> Right, but true only because they were "checkpoint starting: immediate".
> Otherwise the reported write time includes intentional sleeps added to
> honor the checkpoint_completion_target.  A bit confusing to report it that
> way, I think.

+1

It's even worse than that, actually. We also don't sleep if the *next*
requested checkpoint is an immediate one:

    /*
     * Perform the usual duties and take a nap, unless we're behind schedule,
     * in which case we just try to catch up as quickly as possible.
     */
    if (!(flags & CHECKPOINT_IMMEDIATE) &&
        !shutdown_requested &&
        !ImmediateCheckpointRequested() &&
        IsCheckpointOnSchedule(progress))

/*
 * Returns true if an immediate checkpoint request is pending.  (Note that
 * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
 * there is one pending behind it.)
 */
static bool
ImmediateCheckpointRequested(void)

We ought to do better.

- Andres