Thread: bgwriter autotuning might be unnecessarily penalizing bursty workloads

From: Josh Berkus

Greg, All:

So, I've been running some stats on the workloads of some customers who
are having issues with checkpoint spikes, and I'm finding that the
bgwriter is almost entirely ineffective for them:

pct_checkpoints_req     | 33.0
avg_frequency_min       | 2.30
avg_write_time_s        | 112.17
avg_sync_time_s         | 1.33
mb_written              | 2387369.6
mb_written_per_min      | 61.01
mb_per_checkpoint       | 82.27
pct_checkpoint_buffers  | 58.6
pct_bgwriter_buffers    | 0.3
pct_backend_buffers     | 41.1
bgwriter_halt_freq      | 0.06
bgwriter_halt_potential | 70.11
buffer_allocation_ratio | 1.466

(query for the above is below)

The key metric I'm looking at is that the bgwriter only took care of
0.3% of buffers.  Yet average write throughput is around 1 MB/s, and the
bgwriter is capable of flushing about 4 MB/s if it's waking up every 200ms.
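
(That 4 MB/s is just lru_maxpages * block size * rounds per second;
something like this computes the ceiling from the live settings:

-- rough ceiling on bgwriter LRU writes: lru_maxpages pages per round,
-- one round every bgwriter_delay ms (ignores the longer power-saving sleeps)
select round(lru.setting::numeric * blk.setting::numeric
             * (1000 / delay.setting::numeric)
             / (1024 * 1024), 2) as bgwriter_max_mb_per_sec
from pg_settings lru, pg_settings delay, pg_settings blk
where lru.name = 'bgwriter_lru_maxpages'
  and delay.name = 'bgwriter_delay'
  and blk.name = 'block_size';
)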

Of course, our first conclusion is that the writes are very bursty and
the bgwriter is frequently hitting lru_maxpages.  In fact, that seems to
be the case, per bgwriter_halt_potential above (that metric compares the
number of rounds in which the bgwriter halted at lru_maxpages against
the minimum number of full rounds it would have needed to write that
many buffers, i.e. how close it is to not keeping up at all).  And from
the raw pg_stat_bgwriter:

maxwritten_clean      | 6950

So we're halting because we hit lru_maxpages a *lot*, which is keeping
the bgwriter from keeping up.  What gives?

Well, digging into the docs, one thing I noticed was this note:

"It then sleeps for bgwriter_delay milliseconds, and repeats. When there
are no dirty buffers in the buffer pool, though, it goes into a longer
sleep regardless of bgwriter_delay."

combined with:

"The number of dirty buffers written in each round is based on the
number of new buffers that have been needed by server processes during
recent rounds"

... so Greg built in the bgwriter autotuner with ramp-up/down behavior,
where it sleeps longer and writes less if it hasn't been busy lately.
But given the stats I'm looking at, I'm wondering if that isn't too much
of a double-whammy for people with bursty workloads.

That is, if you have several seconds of inactivity followed by a big
write, then the bgwriter will wake up slowly (which, btw, is not
manually tunable), and then write very little when it does wake up, at
least in the first round.

Of course, I may be misinterpreting the data in front of me ... I'm
currently running a week-long test of raising lru_maxpages and
decreasing bgwriter_delay to see how it affects things ... but I wanted
to discuss it on-list.

Bgwriter stats query follows:

with bgstats as (
    select checkpoints_timed,
        checkpoints_req,
        checkpoints_timed + checkpoints_req as checkpoints,
        checkpoint_sync_time,
        checkpoint_write_time,
        buffers_checkpoint,
        buffers_clean,
        maxwritten_clean,
        buffers_backend,
        buffers_backend_fsync,
        buffers_alloc,
        buffers_checkpoint + buffers_clean + buffers_backend as total_buffers,
        round(extract('epoch' from now() - stats_reset)/60)::numeric as min_since_reset,
        lru.setting::numeric as bgwriter_maxpages,
        delay.setting::numeric as bgwriter_delay
    from pg_stat_bgwriter
        cross join pg_settings as lru
        cross join pg_settings as delay
    where lru.name = 'bgwriter_lru_maxpages'
        and delay.name = 'bgwriter_delay'
)
select
    round(checkpoints_req*100::numeric/checkpoints,1) as pct_checkpoints_req,
    round(min_since_reset/checkpoints,2) as avg_frequency_min,
    round(checkpoint_write_time::numeric/(checkpoints*1000),2) as avg_write_time_s,
    round(checkpoint_sync_time::numeric/(checkpoints*1000),2) as avg_sync_time_s,
    round(total_buffers/128::numeric,1) as mb_written,
    round(total_buffers/(128 * min_since_reset),2) as mb_written_per_min,
    round(buffers_checkpoint/(checkpoints*128::numeric),2) as mb_per_checkpoint,
    round(buffers_checkpoint*100/total_buffers::numeric,1) as pct_checkpoint_buffers,
    round(buffers_clean*100/total_buffers::numeric,1) as pct_bgwriter_buffers,
    round(buffers_backend*100/total_buffers::numeric,1) as pct_backend_buffers,
    -- % of all bgwriter rounds that halted at lru_maxpages
    round(maxwritten_clean*100::numeric/(min_since_reset*60000/bgwriter_delay),2) as bgwriter_halt_freq,
    -- halted rounds as a % of the minimum full rounds needed to write buffers_clean
    round(maxwritten_clean*100::numeric/(buffers_clean/bgwriter_maxpages),2) as bgwriter_halt_potential,
    round(buffers_alloc::numeric/total_buffers,3) as buffer_allocation_ratio
from bgstats;
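
(Note: checkpoint_write_time and checkpoint_sync_time were only added in
9.2, so the query as written needs at least 9.2.)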

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


On 7/17/13 6:46 PM, Josh Berkus wrote:
> The key metric I'm looking at is that the bgwriter only took care of
> 0.3% of buffers.

There look to be a good number of buffers on this server that are only
being written at checkpoint time.  The background writer will only deal
with buffers when their usage count is low.  Fast servers can cycle over
shared_buffers such that as soon as their usage counts get low, they're
immediately reallocated by a hungry backend.  You might try to quantify
how many buffers the BGW can possibly do something with using
pg_buffercache.
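
Something like this gives the lay of the land, if the pg_buffercache
extension is installed (the view grabs buffer locks while it runs, so
don't hammer it on a busy production box):

-- dirty buffers broken down by usage count; only the dirty,
-- low-usagecount ones are candidates for the background writer
select usagecount,
       count(*) as buffers,
       sum(case when isdirty then 1 else 0 end) as dirty
from pg_buffercache
group by usagecount
order by usagecount;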

> So we're halting because we hit lru_maxpages a *lot*, which is keeping
> the bgwriter from keeping up.  What gives?

2007's defaults can be a bummer in 2013.  I don't hesitate to bump
bgwriter_lru_maxpages up to 500 on a server with decent hardware.

If it weren't for power savings concerns, I would set bgwriter_delay to
10ms by default.  With that change lru_maxpages should drop to 10 to
have the same behavior as the existing default.

> "It then sleeps for bgwriter_delay milliseconds, and repeats. When there
> are no dirty buffers in the buffer pool, though, it goes into a longer
> sleep regardless of bgwriter_delay."

That wording is from some power savings code added after I poked at
things.  Look at src/backend/postmaster/bgwriter.c where it mentions
"electricity" to see more about it.  The longer sleeps are supposed to
be interrupted when backends do work, to respond to bursts.  Maybe it
can get confused; I haven't looked at it that carefully.

The latching mechanism shouldn't need to be tunable if it works
correctly, which is why there are no exposed knobs for it.

Since I saw an idea I'm going to steal from your pg_stat_bgwriter query,
I'll trade you one with a completely different spin on the data to
harvest from:

SELECT
    block_size::numeric * buffers_alloc / (1024 * 1024 * seconds) AS alloc_mbps,
    block_size::numeric * buffers_checkpoint / (1024 * 1024 * seconds) AS checkpoint_mbps,
    block_size::numeric * buffers_clean / (1024 * 1024 * seconds) AS clean_mbps,
    block_size::numeric * buffers_backend / (1024 * 1024 * seconds) AS backend_mbps,
    block_size::numeric * (buffers_checkpoint + buffers_clean + buffers_backend)
        / (1024 * 1024 * seconds) AS write_mbps
FROM
(
    SELECT now() AS sample,
        now() - stats_reset AS uptime,
        EXTRACT(EPOCH FROM now()) - EXTRACT(EPOCH FROM stats_reset) AS seconds,
        b.*,
        p.setting::integer AS block_size
    FROM pg_stat_bgwriter b, pg_settings p
    WHERE p.name = 'block_size'
) bgw;

That only works on 9.1 and later where there is a stats_reset time
available on pg_stat_bgwriter.  Sample from a busy system with
moderately tuned BGW and checkpoint_timeout at 15 minutes:

-[ RECORD 1 ]---+-------------------
alloc_mbps      | 246.019686474412
checkpoint_mbps | 0.0621780475463596
clean_mbps      | 2.38631188442859
backend_mbps    | 0.777490109599045
write_mbps      | 3.22598004157399

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Greg,

> There look to be a good number of buffers on this server that are only
> being written at checkpoint time.  The background writer will only deal
> with buffers when their usage count is low.  Fast servers can cycle over
> shared_buffers such that as soon as their usage counts get low, they're
> immediately reallocated by a hungry backend.  You might try to quantify
> how many buffers the BGW can possibly do something with using
> pg_buffercache.

Yeah, that's going to be the next step.

> 2007's defaults can be a bummer in 2013.  I don't hesitate to bump that
> up to 500 on a server with decent hardware.

Right, that's what I just tested.  The results are interesting.  I
changed the defaults as follows:

bgwriter_delay = 100ms
bgwriter_lru_maxpages = 512
bgwriter_lru_multiplier = 3.0

... and the number of buffers being written by the bgwriter went *down*,
almost to zero.  Mind you, I wanna gather a full week of data, but there
seems to be something counterintuitive going on here.

One potential factor is that they have their shared_buffers set
unusually high (5GB out of 16GB).

Here's the stats:

postgres=# select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+------------------------------
checkpoints_timed     | 330
checkpoints_req       | 47
checkpoint_write_time | 55504727
checkpoint_sync_time  | 286743
buffers_checkpoint    | 2809031
buffers_clean         | 789
maxwritten_clean      | 0
buffers_backend       | 457456
buffers_backend_fsync | 0
buffers_alloc         | 943734
stats_reset           | 2013-07-17 17:09:18.945194-07

So we're not hitting maxpages anymore, at all.  So why isn't the
bgwriter doing any work?

-[ RECORD 1 ]-----------+--------
pct_checkpoints_req     | 12.0
avg_frequency_min       | 2.78
avg_write_time_s        | 146.91
avg_sync_time_s         | 0.76
mb_written              | 25617.8
mb_written_per_min      | 24.42
mb_per_checkpoint       | 58.27
pct_checkpoint_buffers  | 86.0
pct_bgwriter_buffers    | 0.0
pct_backend_buffers     | 14.0
bgwriter_halt_freq      | 0.00
bgwriter_halt_potential | 0.00
buffer_allocation_ratio | 0.288


And your query, with some rounding added:

-[ RECORD 1 ]---+------
alloc_mbps      | 0.116
checkpoint_mbps | 0.340
clean_mbps      | 0.000
backend_mbps    | 0.056
write_mbps      | 0.396



--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


On Thu, Jul 18, 2013 at 10:40 AM, Josh Berkus <josh@agliodbs.com> wrote:
>
> Right, that's what I just tested.  The results are interesting.  I
> changed the defaults as follows:
>
> bgwriter_delay = 100ms
> bgwriter_lru_maxpages = 512
> bgwriter_lru_multiplier = 3.0
>
> ... and the number of buffers being written by the bgwriter went *down*,
> almost to zero.  Mind you, I wanna gather a full week of data, but there
> seems to be something counterintuitive going on here.
>
> One potential factor is that they have their shared_buffers set
> unusually high (5GB out of 16GB).
>
> Here's the stats:
>
> postgres=# select * from pg_stat_bgwriter;
> -[ RECORD 1 ]---------+------------------------------
> checkpoints_timed     | 330
> checkpoints_req       | 47
> checkpoint_write_time | 55504727
> checkpoint_sync_time  | 286743
> buffers_checkpoint    | 2809031
> buffers_clean         | 789
> maxwritten_clean      | 0
> buffers_backend       | 457456
> buffers_backend_fsync | 0
> buffers_alloc         | 943734
> stats_reset           | 2013-07-17 17:09:18.945194-07
>
> So we're not hitting maxpages anymore, at all.  So why isn't the
> bgwriter doing any work?


Does their workload have a lot of bulk operations, which use a
ring-buffer strategy and so intentionally evict their own buffers?
Those writes would show up as buffers_backend rather than
buffers_clean, so the bgwriter never really gets a shot at them.

Do you have a simple select * from pg_stat_bgwriter from the period
before the change?  You posted the query that does averaging and
aggregation, but I couldn't figure out how to backtrack from that to
the original numbers.

Cheers,

Jeff