Thread: checkpoints/bgwriter tuning verification
On my primary DB I'm observing random slowness which just doesn't make
sense to me. The I/O system can easily do 40MB/sec writes, but I'm only
seeing a sustained 5MB/sec, even as the application is stalling waiting on
the DB.

My only guess is that I'm getting hit by checkpoints too often, and this is
causing some delays. I did a lot of reading and came across this most
excellent article by Greg Smith
(http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm).
Reading that and the Pg manual and looking at the statistics, here is what
I think I need to do. (Greg, do you have a performance tuning book? If so,
I really want to buy it! Your articles are awesome.)

Current settings for checkpoints:

checkpoint_segments = 32
checkpoint_timeout = 900
checkpoint_completion_target = 0.9
log_checkpoints = on

In my logs I see this pretty consistently during times of slowness:

Oct 29 14:17:38 d04 postgres[54902]: [10990-1] LOG: checkpoint starting: xlog
Oct 29 14:22:04 d04 postgres[54902]: [10991-1] LOG: checkpoint complete: wrote 52828 buffers (24.3%); 0 transaction log file(s) added, 0 removed, 32 recycled; write=265.881
Oct 29 14:22:57 d04 postgres[54902]: [10992-1] LOG: checkpoint starting: xlog
Oct 29 14:26:56 d04 postgres[54902]: [10993-1] LOG: checkpoint complete: wrote 52773 buffers (24.3%); 0 transaction log file(s) added, 0 removed, 32 recycled; write=234.846
Oct 29 14:28:32 d04 postgres[54902]: [10994-1] LOG: checkpoint starting: xlog
Oct 29 14:33:32 d04 postgres[54902]: [10995-1] LOG: checkpoint complete: wrote 53807 buffers (24.7%); 0 transaction log file(s) added, 0 removed, 32 recycled; write=299.170

Basically, the next checkpoint starts within a few seconds of the prior one
completing.

The stats show this:

# select * from pg_stat_bgwriter;
-[ RECORD 1 ]------+-----------
checkpoints_timed  | 3236
checkpoints_req    | 83044
buffers_checkpoint | 1376460896
buffers_clean      | 59124159
maxwritten_clean   | 304410
buffers_backend    | 285595787
buffers_alloc      | 6643047623

Based on Greg's article and the above numbers showing that most checkpoints
are triggered by running out of WAL segments, I should increase my
checkpoint_segments. Also, based on the article, I should increase the
bgwriter_lru_maxpages (currently at default 100).

My plan is to bump checkpoint_segments to 48, reduce
checkpoint_completion_target to 0.7, and bump bgwriter_lru_maxpages to 500.

Can the checkpoint operation actually cause the DB to stop responding for a
few seconds at a time? That seems to be what I observe. Sometimes for 5 or
more seconds one transaction will just stall.

Thanks for any ideas.
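For reference, a rough way to turn those cumulative counters into something
easier to read; the column names are just the ones from pg_stat_bgwriter
above, and the 8192-byte block size is an assumption based on the default
build:

select checkpoints_timed, checkpoints_req,
       round(100.0 * checkpoints_req
             / nullif(checkpoints_timed + checkpoints_req, 0), 1)
           as pct_requested,
       round(buffers_checkpoint * 8192.0 / 1024 / 1024
             / nullif(checkpoints_timed + checkpoints_req, 0), 1)
           as avg_mb_per_checkpoint
from pg_stat_bgwriter;

With the counters above, that works out to roughly 96% of checkpoints being
requested ("xlog") rather than timed, and an average of about 125MB written
per checkpoint over the life of the counters; the recent checkpoints in the
log are closer to 410MB each.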
On Thu, 2009-10-29 at 14:46 -0400, Vick Khera wrote:
> On my primary DB I'm observing random slowness which just doesn't make
> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
> only seeing a sustained 5MB/sec, even as the application is stalling
> waiting on the DB.
>
> My only guess is that I'm getting hit by checkpoints too often, and
> this is causing some delays. I did a lot of reading and came across
> this most excellent article by Greg Smith
> (http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm).
> Reading that and the Pg manual and looking at the statistics, here is
> what I think I need to do. (Greg, do you have a performance tuning
> book? If so, I really want to buy it! Your articles are awesome.)
>
> current settings for checkpoints:
> checkpoint_segments = 32
> checkpoint_timeout = 900
> checkpoint_completion_target = 0.9
> log_checkpoints = on
>
> In my logs I see this pretty consistently during times of slowness:
>
> Oct 29 14:17:38 d04 postgres[54902]: [10990-1] LOG: checkpoint starting: xlog
> Oct 29 14:22:04 d04 postgres[54902]: [10991-1] LOG: checkpoint
> complete: wrote 52828 buffers (24.3%); 0 transaction log file(s)
> added, 0 removed, 32 recycled; write=265.881
> Oct 29 14:22:57 d04 postgres[54902]: [10992-1] LOG: checkpoint starting: xlog
> Oct 29 14:26:56 d04 postgres[54902]: [10993-1] LOG: checkpoint
> complete: wrote 52773 buffers (24.3%); 0 transaction log file(s)
> added, 0 removed, 32 recycled; write=234.846
> Oct 29 14:28:32 d04 postgres[54902]: [10994-1] LOG: checkpoint starting: xlog
> Oct 29 14:33:32 d04 postgres[54902]: [10995-1] LOG: checkpoint
> complete: wrote 53807 buffers (24.7%); 0 transaction log file(s)
> added, 0 removed, 32 recycled; write=299.170
>
> basically, the next checkpoint starts within a few seconds of the
> prior one completing.

That's most likely due to having checkpoint_completion_target set near the
maximum. You are telling it to stretch the checkpoint out as long as
possible.

> The stats show this:
>
> # select * from pg_stat_bgwriter;
> -[ RECORD 1 ]------+-----------
> checkpoints_timed | 3236
> checkpoints_req | 83044
> buffers_checkpoint | 1376460896
> buffers_clean | 59124159
> maxwritten_clean | 304410
> buffers_backend | 285595787
> buffers_alloc | 6643047623
>
> Based on Greg's article and the above numbers showing that most
> checkpoints are triggered by running out of WAL segments, I should
> increase my checkpoint_segments. Also, based on the article, I should
> increase the bgwriter_lru_maxpages (currently at default 100).
>
> My plan is to bump checkpoint_segments to 48 and reduce
> checkpoint_completion_target to 0.7,

That's what I would do. I think you're getting hit by checkpointing too
frequently or stretching the checkpoints out too long (or a combo of both).
I'd also keep bumping checkpoint_segments up until you are checkpointing on
the timeout. This will give you a longer window between checkpoints, which
gives more time to smooth the checkpoint out.

> and bump the
> bgwriter_lru_maxpages to 500.

I would probably start by increasing the number of WAL segments and
decreasing the checkpoint_completion_target and see if that solves the
problem. I wouldn't change bgwriter_lru_maxpages at the same time, and
wouldn't necessarily increase it by a factor of 5 when doing so. It does
look like you need to increase it, though.

> Can the checkpoint operation actually cause the DB to stop responding
> for a few seconds at a time? That seems to be what I observe.
> Sometimes for 5 or more seconds one transaction will just stall.

Absolutely. The smoothing is supposed to alleviate that, but I'm not sure
how well it works with aggressive, activity-driven checkpoints like you
have. Keep us posted, I'm curious.

> Thanks for any ideas.

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
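A low-tech way to check whether a checkpoint_segments bump is enough to get
back to timed checkpoints is to snapshot pg_stat_bgwriter and diff it after
a while. This is only a sketch; the snapshot table name is made up and
anything works:

create table bgwriter_snapshot as
    select now() as snapped_at, * from pg_stat_bgwriter;

-- let the system run under normal load for an hour or so, then:

select c.checkpoints_timed - s.checkpoints_timed as timed_since_snapshot,
       c.checkpoints_req   - s.checkpoints_req   as req_since_snapshot,
       now() - s.snapped_at                      as elapsed
from pg_stat_bgwriter c, bgwriter_snapshot s;

drop table bgwriter_snapshot;

If the checkpoints since the snapshot are still almost all requested, either
keep raising checkpoint_segments or accept that the write volume is simply
too high for the current timeout.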
On Thu, Oct 29, 2009 at 12:46 PM, Vick Khera <vivek@khera.org> wrote:
> On my primary DB I'm observing random slowness which just doesn't make
> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
> only seeing a sustained 5MB/sec, even as the application is stalling
> waiting on the DB.

Just one point on top of everything else you'll hear. 40 MB/sec sequential
throughput does not equal 40MB/sec random PLUS checkpoint throughput.
Random access is gonna lower that 40MB/sec way down real fast.

First step to speed things up is putting pg_xlog on its own disk(s). Since
pg_xlog is mostly sequentially accessed, it's much faster when it's not
competing with the rest of the db. THEN if you need faster disks you can
buy them and throw them at path/data/base/. However, often just splitting
things out like this will make a world of difference on a write-heavy
database.
On Thu, 29 Oct 2009, Vick Khera wrote:

> Greg, do you have a performance tuning book? If so, I really want to
> buy it! Your articles are awesome.

Give me a few more months...

> basically, the next checkpoint starts within a few seconds of the prior
> one completing.

That's the expected behavior with a high setting for the completion target.
Each checkpoint finishes just before the next one needs to start. That's
the most you can spread them out, and more spreading generally equals
better behavior. Usually; there are always exceptions, of course.

> Based on Greg's article and the above numbers showing that most
> checkpoints are triggered by running out of WAL segments, I should
> increase my checkpoint_segments. Also, based on the article, I should
> increase the bgwriter_lru_maxpages (currently at default 100).

You might actually want to decrease bgwriter_lru_maxpages to 0. The current
background writer design presumes you have some spare I/O capacity and want
to trade off some write overhead for possibly lower latency in client
backends. If you're at the point where the disks aren't keeping up with the
checkpoint disk I/O, this could turn counter-productive.

> My plan is to bump checkpoint_segments to 48 and reduce
> checkpoint_completion_target to 0.7, and bump the
> bgwriter_lru_maxpages to 500.

In your situation, I would increase checkpoint_segments to 64, keep the
completion target at 0.9, and decrease bgwriter_lru_maxpages to 0 so that
feature is turned off altogether. If you're still not happy, you can try
dropping the completion target too; I've heard one report of that logic not
working out so well in practice, where lower settings actually resulted in
less of an I/O spike. I would worry about the OS tuning first though, which
brings us to:

> Can the checkpoint operation actually cause the DB to stop responding
> for a few seconds at a time? That seems to be what I observe.
> Sometimes for 5 or more seconds one transaction will just stall.

There are two ways that writes can hang:

1) You've gotten to the point in the checkpoint cycle where it's calling
fsync to flush everything out of the filesystem. At this point you could
potentially have a big chunk of data that needs to go out to disk, and any
other client that needs to write something (or needs a resource locked by a
writer) is stuck behind that.

2) The entire OS write cache is filled. Once that happens, if a client
tries to write something else, the OS will force it to nap until there's
space again to hold that write.

How that all works really depends on the OS. I know that under Linux, the
way this is all handled is really bad, and can easily lead to multi-second
hangs. Generally you can watch the writes accumulate by looking at
/proc/meminfo. I wrote an example showing how the problem appears and what
I did to improve long hangs on one server at
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html and
it refers to a longer theory page on Linux's pdflush on my web site.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
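To put some numbers behind the bgwriter_lru_maxpages question, here is a
rough sketch of what share of buffer writes each path is doing, again using
only the cumulative pg_stat_bgwriter counters:

select round(100.0 * buffers_checkpoint / total, 1) as pct_checkpoint,
       round(100.0 * buffers_clean      / total, 1) as pct_bgwriter_lru,
       round(100.0 * buffers_backend    / total, 1) as pct_backend
from (select buffers_checkpoint, buffers_clean, buffers_backend,
             buffers_checkpoint + buffers_clean + buffers_backend as total
      from pg_stat_bgwriter) s;

With the counters posted earlier that comes out to roughly 80% written at
checkpoint time, 3% by the background writer's LRU scan, and 17% by the
backends themselves, so the LRU scan is doing a fairly small share of the
work here.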
On 10/29/2009 04:42 PM, Scott Marlowe wrote:
> On Thu, Oct 29, 2009 at 12:46 PM, Vick Khera<vivek@khera.org> wrote:
>> On my primary DB I'm observing random slowness which just doesn't make
>> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
>> only seeing a sustained 5MB/sec, even as the application is stalling
>> waiting on the DB.
>
> Just one point on top of everything else you'll hear. 40 MB/sec
> sequential throughput does not equal 40MB/sec random PLUS checkpoint
> throughput. Random access is gonna lower that 40MB/sec way down real
> fast.
>
> First step to speed things up is putting pg_xlog on its own disk(s).

Hi Scott,

How exactly do you do this? By creating a link to the new location, or is
there a config option somewhere that says where the pg_xlog resides?

Thanks,
Steve

> Since pg_xlog is mostly sequentially accessed, it's much faster when
> it's not competing with the rest of the db. THEN if you need faster
> disks you can buy them and throw them at path/data/base/. However,
> often just splitting things out like this will make a world of
> difference on a write-heavy database.
On Fri, 2009-10-30 at 07:15 -0400, Steve Clark wrote:
> On 10/29/2009 04:42 PM, Scott Marlowe wrote:
> > On Thu, Oct 29, 2009 at 12:46 PM, Vick Khera<vivek@khera.org> wrote:
> >> On my primary DB I'm observing random slowness which just doesn't make
> >> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
> >> only seeing a sustained 5MB/sec, even as the application is stalling
> >> waiting on the DB.
> >
> > Just one point on top of everything else you'll hear. 40 MB/sec
> > sequential throughput does not equal 40MB/sec random PLUS checkpoint
> > throughput. Random access is gonna lower that 40MB/sec way down real
> > fast.
> >
> > First step to speed things up is putting pg_xlog on its own disk(s).
>
> Hi Scott,
>
> How exactly do you do this? By creating a link to the new location or
> is there a config option somewhere that says where the pg_xlog resides?

There is an option to do this during initdb. If you want to do it after the
DB is created, move the contents of pg_xlog/ (when the DB is shut down) and
make a symlink to the new directory.

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
On Friday 30 October 2009 at 03:24:05, Greg Smith wrote:
> On Thu, 29 Oct 2009, Vick Khera wrote:
> > Greg, do you have a performance tuning book? If so, I really want to
> > buy it! Your articles are awesome.
>
> Give me a few more months...

Are you kidding, or is it a great teaser? I would love to know that it's
the latter. Consider that you have another reader of your book, if that's
really what you meant to say :)

--
Guillaume.
http://www.postgresqlfr.org
http://dalibo.com
On Thu, Oct 29, 2009 at 4:42 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> First step to speed things up is putting pg_xlog on its own disk(s).
> Since pg_xlog is mostly sequentially accessed, it's much faster when
>

Did that with a symlink. I always do that.

My frustration is that I have this really fast RAID subsystem connected via
fibre channel to a Sun X4100 with 20GB RAM and AMD Opterons. I don't seem
to be able to squeeze the update performance I really need out of it. I'm
wondering if it is all the FK checks that are slowing me down...

Thanks for responding to help me out.
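One way to test the FK theory directly is EXPLAIN ANALYZE on a
representative statement: it reports the time spent in each constraint
trigger separately from the statement itself. A sketch; the table and
column names here are made up, and the ROLLBACK keeps it from changing
anything:

begin;
explain analyze
    update orders set status = 'shipped' where order_id = 12345;
rollback;

If the per-trigger times dominate the total runtime, the FK checks (and
whether the referenced columns are indexed) are worth a closer look.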
On Thu, Oct 29, 2009 at 10:24 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> There are two ways that writes can hang:
>
> 1) You've gotten to the point in the checkpoint cycle where it's calling
> fsync to flush everything out of the filesystem. At this point you could
> potentially have a big chunk of data that needs to go out to disk, and any
> other client that needs to write something (or needs a resource locked by a
> writer) is stuck behind that.
>
> 2) The entire OS write cache is filled. Once that happens, if a client
> tries to write something else, the OS will force it to nap until there's
> space again to hold that write.

I'm on FreeBSD. I've got my shared_buffers set to 1200MB on a box with 20GB
of RAM, and fast external RAID on a fibre channel. Is that perhaps too low
to keep the "hot" bits of my DB in memory? I hear so much conflicting
advice on how big to set shared_buffers...

I've tweaked the checkpoint target to 0.8 and segments to 64. Is 15 minutes
reasonable for the timeout value?

I really don't think I'm saturating the disk bandwidth. I see spikes of
18MB/s or more, but it usually hovers at under 3MB/s according to iostat
output.
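On the shared_buffers question: if the contrib pg_buffercache module is
installed (it isn't by default, so treat this as an optional sketch), the
buffer usage counts give a rough feel for how much of the cache is actually
hot; the 8192-byte block size is again assumed from the default build:

select usagecount, count(*) as buffers,
       pg_size_pretty(count(*) * 8192) as approx_size
from pg_buffercache
group by usagecount
order by usagecount;

Roughly speaking, lots of buffers sitting at the highest usage counts
suggests the cache is working hard and might benefit from being bigger,
while mostly low counts suggests shared_buffers isn't the bottleneck.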
On Friday 30 October 2009 15:30:34 Vick Khera wrote:
> I really don't think I'm saturating the disk bandwidth. I see spikes
> of 18MB/s or more, but it usually hovers at under 3MB/s according to
> iostat output.

Looking at the iostat -x output is often more interesting - especially the
utilization and the average request latency.

Andres