Checkpoint throttling issues - Mailing list pgsql-hackers

From Andres Freund
Subject Checkpoint throttling issues
Date
Msg-id 20151019101044.GD3485@alap3.anarazel.de
Whole thread Raw
Responses Re: Checkpoint throttling issues  (Fabien COELHO <coelho@cri.ensmp.fr>)
Re: Checkpoint throttling issues  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

working on the checkpoint sorting/flushing patch I noticed a number of
preexisting issues:

1) The progress passed to CheckpointWriteDelay() will often be wrong -  it's calculated as num_written / num_to_write,
butnum_written is only  incremented if the buffer hasn't since independently been written  out. That's bad because it
mean'swe'll think we're further and  further behind if there's independent writeout activity.
 
  Simple enough to fix, we gotta split num_written into num_written  (for stats purposes) and num_processed (for
progress).
  This is pretty much a bug, but I'm a slightly worried about  backpatching a fix because it can have a rather
noticeable behavioural impact.
 

2) CheckpointWriteDelay()'s handling of config file changes et al is  pretty weird: The config is reloaded during a
checkpoingiff it's not  an immediate checkpoint and we're on schedule. I see very little  justification for having the
got_SIGHUPblock inside that if block.
 

3) The static pg_usleep(100000L); in CheckpointWriteDelay() is a bad  idea.
  On a system with low write activity (< 10 writes sec) this will lead  to checkpoints finishing too early as there'll
alwaysbe around ~10  writes a second. On slow IO, say a sdcard, that's bad.
 
  On system with a very high write throughput (say 1k+ buffers/sec) the  unconditional 100ms sleep while on schedule
willmean we're far  behind after the sleep. This leads to rather bursty IO, and makes it  more likely to run into dirty
bufferlimits and such.  Just reducing  the sleep from 100ms to 10ms leads to significant latency  improvements. On
pgbenchrate limited to 5k tps:
 
  before:  number of transactions skipped: 70352 (1.408 %)  number of transactions above the 100.0 ms latency limit:
2817(0.056 %)  latency average: 5.266 ms  latency stddev: 11.723 ms  rate limit schedule lag: avg 3.138 (max 0.000) ms
 
  after:  number of transactions skipped: 41696 (0.834 %)  number of transactions above the 100.0 ms latency limit:
1736(0.035 %)  latency average: 4.929 ms  latency stddev: 8.929 ms  rate limit schedule lag: avg 2.835 (max 0.000) ms
 

  I think the sleep time should be computed adaptively based on the  number of buffers remaining and the remaining
time.There's probably  better formulations, but that seems like an easy enough improvement  and considerably better
thannow.
 

4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: YUriy Zhuravlev
Date:
Subject: Re: Some questions about the array.
Next
From: Etsuro Fujita
Date:
Subject: Re: Foreign join pushdown vs EvalPlanQual