Home > mailing lists

Re: Checkpoint throttling issues - Mailing list pgsql-hackers

From	Fabien COELHO
Subject	Re: Checkpoint throttling issues
Date	October 19, 2015 19:27:55
Msg-id	alpine.DEB.2.10.1510192115010.4181@sto Whole thread
In response to	Checkpoint throttling issues (Andres Freund <andres@anarazel.de>)
Responses	Re: Checkpoint throttling issues
List	pgsql-hackers

Tree view

> preexisting issues:
>
> 1) The progress passed to CheckpointWriteDelay() will often be wrong -
>   it's calculated as num_written / num_to_write, but num_written is only
>   incremented if the buffer hasn't since independently been written
>   out. That's bad because it mean's we'll think we're further and
>   further behind if there's independent writeout activity.
>
>   Simple enough to fix, we gotta split num_written into num_written
>   (for stats purposes) and num_processed (for progress).
>
>   This is pretty much a bug, but I'm a slightly worried about
>   backpatching a fix because it can have a rather noticeable
>   behavioural impact.

I noticed this discrepancy, but decided agains doing anything about it, 
because it meant more arguing and explaning. In the pgbench run I used the 
case does not seem to arise often because most I/O is performed through 
the checkpointer, so it does not matter much.

Another discrepancy is that there are two progress computations, one 
against time and one against wal buffer, and when one is used and why it 
is better than the other is not too clear to me.

> 2) CheckpointWriteDelay()'s handling of config file changes et al is
>   pretty weird: The config is reloaded during a checkpoing iff it's not
>   an immediate checkpoint and we're on schedule. I see very little
>   justification for having the got_SIGHUP block inside that if block.

No opinion.

> 3) The static pg_usleep(100000L); in CheckpointWriteDelay() is a bad
>   idea.
>
>   On a system with low write activity (< 10 writes sec) this will lead
>   to checkpoints finishing too early as there'll always be around ~10
>   writes a second. On slow IO, say a sdcard, that's bad.
>
>   On system with a very high write throughput (say 1k+ buffers/sec) the
>   unconditional 100ms sleep while on schedule will mean we're far
>   behind after the sleep. This leads to rather bursty IO, and makes it
>   more likely to run into dirty buffer limits and such.  Just reducing
>   the sleep from 100ms to 10ms leads to significant latency
>   improvements. On pgbench rate limited to 5k tps:
>
>   before:
>   number of transactions skipped: 70352 (1.408 %)
>   number of transactions above the 100.0 ms latency limit: 2817 (0.056 %)
>   latency average: 5.266 ms
>   latency stddev: 11.723 ms
>   rate limit schedule lag: avg 3.138 (max 0.000) ms
>
>   after:
>   number of transactions skipped: 41696 (0.834 %)
>   number of transactions above the 100.0 ms latency limit: 1736 (0.035 %)
>   latency average: 4.929 ms
>   latency stddev: 8.929 ms
>   rate limit schedule lag: avg 2.835 (max 0.000) ms

Do these figures include sorting & flushing ?

>   I think the sleep time should be computed adaptively based on the
>   number of buffers remaining and the remaining time. There's probably
>   better formulations, but that seems like an easy enough improvement
>   and considerably better than now.

One this run. If there are few I/Os then the 1 0ms will result in more 
random/discontinuous I/Os, because there would be only a few writes in 
each bucket, so I'm not sure whether this would be an unconditionnal 
improvement. I would be interested to have figures *with* sorting and 
flushing on and several kind of run.

> 4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

No opinion!

-- 
Fabien.

pgsql-hackers by date:

From: Fabien COELHO
Date: 19 October 2015, 19:15:09
Subject: Re: checkpointer continuous flushing

From: Alvaro Herrera
Date: 19 October 2015, 19:36:02
Subject: Re: Checkpoint throttling issues

Re: Checkpoint throttling issues - Mailing list pgsql-hackers

Previous

Next