Thread: Checkpoint Tuning Question

Checkpoint Tuning Question

From: Dan Armbrust
I'm running a steady-state test where I am pushing about 600 queries
per second through a Postgres 8.3 system on an 8-CPU Linux system.
It's a mix of inserts, updates, and deletes on a few tables - the two
biggest ones probably have about 200,000 rows.

The hard drive is just a simple, run-of-the-mill desktop drive.

Here are parameters that I have changed from defaults:

shared_buffers = 100MB
synchronous_commit = off

And, after noting complaints in the log file about checkpoint intervals, I set:

checkpoint_segments = 10
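
For reference, the complaint in question is the one along these lines
(the seconds value is illustrative):

LOG:  checkpoints are occurring too frequently (25 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".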

Then I turned on slow query logging for queries that take more than 1
second, and checkpoint logging.
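
For reference, that logging setup in postgresql.conf is roughly (the
threshold is in milliseconds):

log_min_duration_statement = 1000
log_checkpoints = on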

Typically, I see no slow queries.  The system handles the load just fine.
Once in a while, I'll see a query that takes 3 - 5 seconds.

However, once the checkpoint process begins, I get a whole flood of
queries that take between 1 and 10 seconds to complete.  My throughput
crashes to near nothing.  The checkpoint takes between 45 seconds and
a minute to complete.

After the checkpoint completes, the system returns to having very few
slow queries and keeps up with the load fine.

Is there anything I can do to prevent the occasional slow query?

Is there anything I can do to prevent (or minimize) the performance
impact of the checkpoint?

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Tom Lane
Dan Armbrust <daniel.armbrust.list@gmail.com> writes:
> However, once the checkpoint process begins, I get a whole flood of
> queries that take between 1 and 10 seconds to complete.  My throughput
> crashes to near nothing.  The checkpoint takes between 45 seconds and
> a minute to complete.

You sure this is 8.3?  It should spread out checkpoints over a couple of
minutes by default.  [thinks...]  Maybe you need to increase
checkpoint_segments some more.  If it's forcing the checkpoint in order
to hold down the number of WAL segments used up, that would explain a
fast checkpoint.

            regards, tom lane

Re: Checkpoint Tuning Question

From: Dan Armbrust
On Wed, Jul 8, 2009 at 12:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dan Armbrust <daniel.armbrust.list@gmail.com> writes:
>> However, once the checkpoint process begins, I get a whole flood of
>> queries that take between 1 and 10 seconds to complete.  My throughput
>> crashes to near nothing.  The checkpoint takes between 45 seconds and
>> a minute to complete.
>
> You sure this is 8.3?  It should spread out checkpoints over a couple of
> minutes by default.  [thinks...]  Maybe you need to increase
> checkpoint_segments some more.  If it's forcing the checkpoint in order
> to hold down the number of WAL segments used up, that would explain a
> fast checkpoint.
>
>                        regards, tom lane
>

Just checked - currently running 8.3.4 on the system I'm testing on.

With checkpoint_segments set to 10, the checkpoints appear to be
happening due to checkpoint_timeout - which I've left at the default
of 5 minutes.

If I double my test load, I end up with checkpoints happening about
every 4 minutes, with a log message that I read as indicating it used
up all 10 segments.
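
With log_checkpoints on, the trigger shows up in the checkpoint start
line - something like:

LOG:  checkpoint starting: xlog

for a segment-driven checkpoint, versus "checkpoint starting: time" for
a timeout-driven one.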

So there's not much I can do to keep the checkpoint process from
causing a burst of slow queries?

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Dan Armbrust
On Wed, Jul 8, 2009 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dan Armbrust <daniel.armbrust.list@gmail.com> writes:
>> With checkpoint_segments set to 10, the checkpoints appear to be
>> happening due to checkpoint_timeout - which I've left at the default
>> of 5 minutes.
>
> Well, you could increase both those settings so as to put the
> checkpoints further apart, and/or increase checkpoint_completion_target
> to spread the checkpoint I/O over a larger fraction of the cycle.
>
>                        regards, tom lane
>

Sorry, didn't mean to get off list.

Wouldn't increasing the length between checkpoints result in the
checkpoint process taking even longer to complete?

The way my system processes and buffers incoming data, having
infrequent (but long and disruptive) checkpoints is bad, since it
causes the throughput to suffer so badly - my buffers can't hold the
flood, and I have to drop data.  If I can reduce the impact of the
checkpoints and have them occur more frequently, then my buffers
should be able to hold the incoming data during the short durations
when I have slow queries.

I'll go experiment with checkpoint_completion_target.

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Tom Lane
Dan Armbrust <daniel.armbrust.list@gmail.com> writes:
> On Wed, Jul 8, 2009 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, you could increase both those settings so as to put the
>> checkpoints further apart, and/or increase checkpoint_completion_target
>> to spread the checkpoint I/O over a larger fraction of the cycle.

> Wouldn't increasing the length between checkpoints result in the
> checkpoint process taking even longer to complete?

You don't really care how long it takes.  What you want is for it not to
be chewing a bigger fraction of your I/O bandwidth than you can spare.
Hence, you want it to take longer.  Trying to shorten it is just going
to make the spike worse.

            regards, tom lane

Re: Checkpoint Tuning Question

From: Dan Armbrust
>> Wouldn't increasing the length between checkpoints result in the
>> checkpoint process taking even longer to complete?
>
> You don't really care how long it takes.  What you want is for it not to
> be chewing a bigger fraction of your I/O bandwidth than you can spare.
> Hence, you want it to take longer.  Trying to shorten it is just going
> to make the spike worse.
>
>                        regards, tom lane
>

I bumped the segments up to 15 and the timeout up to 10 minutes, and
changed the completion target to 0.7.
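
In postgresql.conf terms:

checkpoint_segments = 15
checkpoint_timeout = 10min
checkpoint_completion_target = 0.7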

What I observe now is that I get a short (1-2 second) period where I
get slow queries - I'm running about 30 queries in parallel at any
given time - it appears that all 30 queries get paused for a couple of
seconds at the moment that a checkpoint begins.  However, after the
initial slowdown, I don't get any additional slow queries logged while
the checkpoint process runs.

My takeaway is that starting the checkpoint process is really
expensive - so I don't want to start it very frequently.  And the only
downside to longer intervals between checkpoints is a longer recovery
time if the system crashes?

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Greg Smith
On Wed, 8 Jul 2009, Dan Armbrust wrote:

> With checkpoint_segments set to 10, the checkpoints appear to be
> happening due to checkpoint_timeout - which I've left at the default
> of 5 minutes.

OK, then that's as far upwards as you probably need to tweak that for your
workload, even though most systems tuned for performance need 30+ instead.

If what you're getting nailed with is the pause during the initial
checkpoint processing, the only real option you have on the database side
is to lower shared_buffers.  You might get some improvement playing with
operating system tunables too, to make it buffer less data and write more
aggressively instead.  The idea is that when the checkpoint starts, there
should be little else buffered already.  On Linux this is best
accomplished by tuning dirty_background_ratio downward, for example.
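
A rough sketch of that kind of OS-level tuning on Linux (the value here
is illustrative, not a recommendation):

# show the current threshold - the percentage of RAM that may be dirty
# before the kernel starts background writeback
cat /proc/sys/vm/dirty_background_ratio

# lower it so the kernel starts flushing sooner; add the setting to
# /etc/sysctl.conf to make it persistent across reboots
sysctl -w vm.dirty_background_ratio=1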

There's more on this subject than you probably ever wanted to know at
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm if you
haven't seen that already.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Checkpoint Tuning Question

From: Greg Smith
On Wed, 8 Jul 2009, Dan Armbrust wrote:

> My takeaway is that starting the checkpoint process is really
> expensive - so I don't want to start it very frequently.  And the only
> downside to longer intervals between checkpoints is a longer recovery
> time if the system crashes?

And additional disk space wasted in holding the write-ahead logs.  You're
moving in the right direction here; the fewer checkpoints the better, as
long as you can stand the recovery time.  What you'll discover if you bump
checkpoint_segments up high enough is that you have to lengthen your test
runs, because eventually you'll reach a point where no checkpoints at all
happen during some runs.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Checkpoint Tuning Question

From: Tom Lane
Greg Smith <gsmith@gregsmith.com> writes:
> On Wed, 8 Jul 2009, Dan Armbrust wrote:
>> What I observe now is that I get a short (1-2 second) period where I
>> get slow queries - I'm running about 30 queries in parallel at any
>> given time - it appears that all 30 queries get paused for a couple of
>> seconds at the moment that a checkpoint begins.  However, after the
>> initial slowdown, I don't get any additional slow queries logged while
>> the checkpoint process runs.

> If what you're getting nailed with is the pause during the initial
> checkpoint processing, the only real option you have on the database side
> is to lower shared_buffers.

He's only got 100MB of shared buffers, which doesn't seem like much
considering it's apparently a fairly beefy system.  I definitely
don't see how one CPU spinning over the buffer headers in BufferSync
is going to create the sort of hiccup he's describing.

Dan, are you sure that this hiccup is happening at the *start* of a
checkpoint?  Do you have log_checkpoints turned on, and if so what
does it show?

            regards, tom lane

Re: Checkpoint Tuning Question

From: John R Pierce
Tom Lane wrote:
> He's only got 100MB of shared buffers, which doesn't seem like much
> considering it's apparently a fairly beefy system.


a beefy system with...

> The hard drive is just a simple, run-of-the-mill desktop drive.


which is going to severely limit random write throughput....




Re: Checkpoint Tuning Question

From: Tom Lane
John R Pierce <pierce@hogranch.com> writes:
> a beefy system with...
>> The hard drive is just a simple, run-of-the-mill desktop drive.
> which is going to severely limit random write throughput....

True, which is why he's having to flail so hard to keep the checkpoint
from saturating his I/O.  However, the latest report says that he
managed that, and yet there's still a one-or-two-second transient of
some sort.  I'm wondering what's causing that.  If it were at the *end*
of the checkpoint, it might be the disk again (failing to handle a bunch
of fsyncs, perhaps).  But if it really is at the *start* then there's
something else odd happening.

            regards, tom lane

Re: Checkpoint Tuning Question

From: Greg Smith
On Wed, 8 Jul 2009, Tom Lane wrote:

> He's only got 100MB of shared buffers, which doesn't seem like much
> considering it's apparently a fairly beefy system.  I definitely
> don't see how one CPU spinning over the buffer headers in BufferSync
> is going to create the sort of hiccup he's describing.

Agreed, it doesn't seem like a likely cause.  If the problem reduces in
magnitude in proportion with the size of the buffer cache, we might have
to accept that it's true regardless; that's why I was curious to see
what impact that had on the test results.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Checkpoint Tuning Question

From: Dan Armbrust
> However, the latest report says that he
> managed that, and yet there's still a one-or-two-second transient of
> some sort.  I'm wondering what's causing that.  If it were at the *end*
> of the checkpoint, it might be the disk again (failing to handle a bunch
> of fsyncs, perhaps).  But if it really is at the *start* then there's
> something else odd happening.
>
>                        regards, tom lane
>

Log output during heavy load - checkpoint logging on, and slow-query
logging on for queries that take longer than 1 second.
Blank lines inserted to separate the interesting bits.

Almost all of the slow query log messages are logged within about 3
seconds of the checkpoint starting message.


LOG:  checkpoint starting: xlog

LOG:  duration: 1101.419 ms  execute <unnamed>: select
dynamichos0_.ethernetmacaddr as ethernet1_0_, dynamichos0_.ipaddr as
ipaddr0_, dynamichos0_.cpemac as cpemac0_, dynamichos0_.last_updated
as last3_0_, dynamichos0_.leasetime as leasetime0_,
dynamichos0_.regtime as regtime0_ from iphost dynamichos0_, cpe cpe1_
where dynamichos0_.cpemac=cpe1_.cpemac and 1=1 and
dynamichos0_.ethernetmacaddr=$1 and dynamichos0_.cpemac=$2 and
cpe1_.regBaseId=$3 and dynamichos0_.ipaddr<>$4
DETAIL:  parameters: $1 = '01:01:01:56:01:7F', $2 = '00:55:00:82', $3
= '01:01:01:56', $4 = '000.000.000.000'
LOG:  duration: 1101.422 ms  execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL:  parameters: $1 = '01:AA:00:A0', $2 = '2009-07-08
15:33:20.673', $3 = '2009-07-08 21:06:40.67', $4 = '2009-07-08
15:33:20.67', $5 = '01:01:02:AB:01:9D', $6 = '2.171.156.0'

<snip about 200 lines of similar messages>

LOG:  duration: 1501.905 ms  execute <unnamed>: select
dynamichos0_.ethernetmacaddr as ethernet1_0_, dynamichos0_.ipaddr as
ipaddr0_, dynamichos0_.cpemac as cpemac0_, dynamichos0_.last_updated
as last3_0_, dynamichos0_.leasetime as leasetime0_,
dynamichos0_.regtime as regtime0_ from iphost dynamichos0_, cpe cpe1_
where dynamichos0_.cpemac=cpe1_.cpemac and 1=1 and
dynamichos0_.ethernetmacaddr=$1 and dynamichos0_.cpemac=$2 and
cpe1_.regBaseId=$3 and dynamichos0_.ipaddr<>$4
DETAIL:  parameters: $1 = '01:01:01:C3:01:8B', $2 = '00:C2:00:8E', $3
= '01:01:01:C3', $4 = '000.000.000.000'

LOG:  checkpoint complete: wrote 9975 buffers (77.9%); 0 transaction
log file(s) added, 0 removed, 15 recycled; write=156.576 s, sync=0.065
s, total=156.662 s
LOG:  checkpoint starting: xlog

LOG:  duration: 1104.780 ms  execute <unnamed>: delete from iphost
where ethernetmacaddr=$1 and ipaddr=$2 and last_updated=$3
DETAIL:  parameters: $1 = '01:01:01:33:01:AA', $2 = '1.50.169.0', $3 =
'2009-07-08 15:32:57.131'
LOG:  duration: 1106.499 ms  execute <unnamed>: select cpe0_.cpemac as
cpemac2_0_, cpe0_.changeTime as changeTime2_0_, cpe0_.comment as
comment2_0_, cpe0_.configuration as configur4_2_0_, cpe0_.cpeconfigid
as cpeconf17_2_0_, cpe0_.cpefilterid as cpefilt18_2_0_, cpe0_.endTime
as endTime2_0_, cpe0_.ispId as ispId2_0_, cpe0_.last_updated as
last7_2_0_, cpe0_.reglocationid as regloca19_2_0_, cpe0_.modelId as
modelId2_0_, cpe0_.numberOfHosts as numberOf9_2_0_, cpe0_.regBaseId as
regBaseId2_0_, cpe0_.regTime as regTime2_0_, cpe0_.roamAllowed as
roamAll12_2_0_, cpe0_.serialNumber as serialN13_2_0_, cpe0_.slaid as
slaid2_0_, cpe0_.enable as enable2_0_, cpe0_.staticip as staticip2_0_,
cpe0_.subscriberid as subscri21_2_0_, cpe0_.swVersion as
swVersion2_0_, cpe0_.vlanid as vlanid2_0_, cpe0_.voipid as voipid2_0_
from cpe cpe0_ where cpe0_.cpemac=$1
DETAIL:  parameters: $1 = '00:84:00:37'

<snip about 300 lines of similar messages>

LOG:  duration: 1205.828 ms  execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL:  parameters: $1 = '02:31:00:25', $2 = '2009-07-08
15:39:53.718', $3 = '2009-07-08 21:13:13.715', $4 = '2009-07-08
15:39:53.715', $5 = '01:01:03:32:01:22', $6 = '3.51.33.0'
LOG:  duration: 1203.287 ms  execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL:  parameters: $1 = '03:47:00:81', $2 = '2009-07-08
15:39:53.72', $3 = '2009-07-08 21:13:13.717', $4 = '2009-07-08
15:39:53.717', $5 = '01:01:04:48:01:7E', $6 = '4.74.125.0'
LOG:  duration: 1201.480 ms  execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL:  parameters: $1 = '01:12:00:7F', $2 = '2009-07-08
15:39:53.725', $3 = '2009-07-08 21:13:13.721', $4 = '2009-07-08
15:39:53.721', $5 = '01:01:02:13:01:7C', $6 = '2.19.123.0'

LOG:  checkpoint complete: wrote 9794 buffers (76.5%); 0 transaction
log file(s) added, 0 removed, 15 recycled; write=148.084 s, sync=0.062
s, total=148.172 s

Re: Checkpoint Tuning Question

From: Tom Lane
Dan Armbrust <daniel.armbrust.list@gmail.com> writes:
> Almost all of the slow query log messages are logged within about 3
> seconds of the checkpoint starting message.

> LOG:  checkpoint complete: wrote 9975 buffers (77.9%); 0 transaction
> log file(s) added, 0 removed, 15 recycled; write=156.576 s, sync=0.065
> s, total=156.662 s

Huh.  And there's just about no daylight between the total checkpoint
time and the write+sync time (barely more than 20msec in both examples).
So that seems to wipe out the thought I had that maybe we'd
underestimated the work involved in one of the other steps of
checkpoint.

As Greg commented upthread, we seem to be getting forced to the
conclusion that the initial buffer scan in BufferSync() is somehow
causing this.  There are a couple of things it'd be useful to try
here:

* see how the size of the hiccup varies with shared_buffers;

* try inserting a delay into that scan loop, as per attached
  quick-and-dirty patch.  (Numbers pulled from the air, but
  we can worry about tuning after we see if this is really
  where the problem is.)

            regards, tom lane

*** src/backend/storage/buffer/bufmgr.c~    Tue Jan  1 14:45:51 2008
--- src/backend/storage/buffer/bufmgr.c    Wed Jul  8 18:12:49 2009
***************
*** 1037,1042 ****
--- 1037,1045 ----
          }

          UnlockBufHdr(bufHdr);
+
+         if ((buf_id % 256) == 0)
+             pg_usleep(1000L);
      }

      if (num_to_write == 0)
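
To try a patch like this against a source tree, the usual drill is
roughly the following (the patch file name is illustrative):

cd postgresql-8.3.4
patch -p0 < bufsync-delay.patch
./configure && make && make install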

Re: Checkpoint Tuning Question

From: Dan Armbrust
> As Greg commented upthread, we seem to be getting forced to the
> conclusion that the initial buffer scan in BufferSync() is somehow
> causing this.  There are a couple of things it'd be useful to try
> here:
>
> * see how the size of the hiccup varies with shared_buffers;

I tried decreasing shared buffers - both 25MB and 50MB were too small
for my load - I had slow queries at all times.

So then I increased it from what I was using - 100MB - to 500MB - and
the hiccup roughly doubles in length.

At 100MB, the hiccup is about 2-3 seconds long.
At 500MB, the hiccup is about 6 seconds long.

>
> * try inserting a delay into that scan loop, as per attached
>  quick-and-dirty patch.  (Numbers pulled from the air, but
>  we can worry about tuning after we see if this is really
>  where the problem is.)
>

After finally getting this particular system into a state where I
could build Postgres (I was using the binary install), I built an
8.3.4 with your patch - but I didn't see any change in the behaviour.
I see hiccups that appear to be the same length as I saw on the
binary build of 8.3.4.

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Simon Riggs
On Wed, 2009-07-08 at 18:22 -0400, Tom Lane wrote:

> As Greg commented upthread, we seem to be getting forced to the
> conclusion that the initial buffer scan in BufferSync() is somehow
> causing this.  There are a couple of things it'd be useful to try
> here:

Not sure why you're forced to that conclusion?

ISTM more likely to be a problem with checkpointing clog or subtrans.
That would block everybody and the scale of the problem is about right.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Checkpoint Tuning Question

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> ISTM more likely to be a problem with checkpointing clog or subtrans.
> That would block everybody and the scale of the problem is about right.

That's what I had been thinking too, but the log_checkpoints output
conclusively disproves it: those steps are taking less than 20msec.

            regards, tom lane

Re: Checkpoint Tuning Question

From: Simon Riggs
On Fri, 2009-07-10 at 10:27 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > ISTM more likely to be a problem with checkpointing clog or subtrans.
> > That would block everybody and the scale of the problem is about right.
>
> That's what I had been thinking too, but the log_checkpoints output
> conclusively disproves it: those steps are taking less than 20msec.

OK, I was looking at total - write, not total - write - sync.

I think it's a traffic jam.

After a checkpoint, in XLogInsert() we discover that we now have to back
up a block that we didn't previously think needed it. So we have to drop
the lock and then re-acquire WALInsertLock. So every backend has to go
through the queue twice the first time it tries to write WAL immediately
after a checkpoint. Also, suddenly every block needs to be copied to WAL,
so the CRC checks make each lock holder take longer than normal, and the
whole queue begins to back up. Then, because wal_buffers is small, we
find that the increased volume of WAL being written causes WALInsertLock
to be held across I/O.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Checkpoint Tuning Question

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> I think it's a traffic jam.

> After a checkpoint, in XLogInsert() we discover that we now have to back
> up a block that we didn't previously think needed it. So we have to drop
> the lock and then re-acquire WALInsertLock. So every backend has to go
> through the queue twice the first time it tries to write WAL immediately
> after a checkpoint. Also, suddenly every block needs to be copied to WAL,
> so the CRC checks make each lock holder take longer than normal, and the
> whole queue begins to back up. Then, because wal_buffers is small, we
> find that the increased volume of WAL being written causes WALInsertLock
> to be held across I/O.

Hm, I'm not sure I believe any of that except the last bit, seeing that
he's got plenty of excess CPU capability.  But the last bit fits with
the wimpy-I/O problem, and it also offers something we could test.
Dan, please see what happens when you vary the wal_buffers setting.
(Note you need a postmaster restart to change that.)
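
For example (the value is illustrative):

# in postgresql.conf - the 8.3 default is 64kB
wal_buffers = 2MB

# then restart, e.g.:
pg_ctl -D $PGDATA restart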

            regards, tom lane

Re: Checkpoint Tuning Question

From: Dan Armbrust
> Hm, I'm not sure I believe any of that except the last bit, seeing that
> he's got plenty of excess CPU capability.  But the last bit fits with
> the wimpy-I/O problem, and it also offers something we could test.
> Dan, please see what happens when you vary the wal_buffers setting.
> (Note you need a postmaster restart to change that.)
>
>                        regards, tom lane
>

Ok, I tried a few different values - 32kB, 64kB, 512kB, 2MB, and 10MB.

I'm not seeing any highly noticeable change in behaviour with any
setting - it wasn't a scientific test, but I seem to have about the
same size hiccup with each setting.  The hiccup may be slightly
shorter with the 10MB setting, but barely, if at all.

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Simon Riggs
On Fri, 2009-07-10 at 14:25 -0500, Dan Armbrust wrote:
> > Hm, I'm not sure I believe any of that except the last bit, seeing that
> > he's got plenty of excess CPU capability.  But the last bit fits with
> > the wimpy-I/O problem, and it also offers something we could test.
> > Dan, please see what happens when you vary the wal_buffers setting.
> > (Note you need a postmaster restart to change that.)
> >

> Ok, I tried a few different values - 32kB, 64kB, 512kB, 2MB, and 10MB.
>
> I'm not seeing any highly noticeable change in behaviour with any
> setting - it wasn't a scientific test, but I seem to have about the
> same size hiccup with each setting.  The hiccup may be slightly
> shorter with the 10MB setting, but barely, if at all.

OK, so it's not the "last bit". Let's look at the first idea again:

In more detail, the explanation relates to a section of code in
XLogInsert() in xlog.c that has this comment:

/*
 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
 * back and recompute everything.  This can only happen just after a
 * checkpoint, so it's better to be slow in this case and fast otherwise.
 *
 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
 * affect the contents of the XLOG record, so we'll update our local copy
 * but not force a recomputation.
 */

This causes us to queue for the WALInsertLock twice at exactly the time
when every caller needs to calculate the CRC for complete blocks. So we
queue twice when the lock-hold-time is consistently high, causing queue
lengths to go ballistic. That also means that XLogFlush never achieves a
piggy-back flush, so WALWriteLock hold times go up also. I bet that
takes a significant amount of time to settle down once initiated,
probably a function of the write transaction ratio and number of
frequently accessed blocks.

We haven't got *any* measurements that show how long traffic jams take
to clear and I think we need them.

Also, I think piggy-back writes work OK, but stop working when work
begins to queue, which will cause a "stall" catastrophe, like in
aerodynamics.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Checkpoint Tuning Question

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> This causes us to queue for the WALInsertLock twice at exactly the time
> when every caller needs to calculate the CRC for complete blocks. So we
> queue twice when the lock-hold-time is consistently high, causing queue
> lengths to go ballistic.

You keep saying that, and it keeps not being true, because the CRC
calculation is *not* done while holding the lock.

It is true that the very first XLogInsert call in each backend after
a checkpoint starts will have to go back and redo its CRC calculation,
but that's a one-time waste of CPU.  It's hard to see how it could have
continuing effects over several seconds, especially in a system that
has CPU to spare.

What I think might be the cause is that just after a checkpoint starts,
quite a large proportion of XLogInserts will include full-page buffer
copies, thus leading to an overall higher rate of WAL creation.  That
means longer hold times for WALInsertLock due to spending more time
copying data into the WAL buffers, and it also means more WAL that has
to be synced to disk before a transaction can commit.  I'm still
convinced that Dan's problem ultimately comes down to inadequate disk
bandwidth, so I think the latter point is probably the key.

So this thought leads to a couple of other things Dan could test.
First, see if turning off full_page_writes makes the hiccup go away.
If so, we know the problem is in this area (though still not exactly
which reason); if not we need another idea.  That's not a good permanent
fix though, since it reduces crash safety.  The other knobs to
experiment with are synchronous_commit and wal_sync_method.  If the
stalls are due to commits waiting for additional xlog to get written,
then async commit should stop them.  I'm not sure if changing
wal_sync_method can help, but it'd be worth experimenting with.
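
In postgresql.conf terms, the knobs in question are roughly:

full_page_writes = off        # diagnostic only - reduces crash safety
synchronous_commit = off      # commits don't wait for the WAL flush
wal_sync_method = fdatasync   # also try fsync, open_sync, open_datasync
                              # (available choices are platform-dependent)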

            regards, tom lane

Re: Checkpoint Tuning Question

From: Simon Riggs
On Sun, 2009-07-12 at 13:10 -0400, Tom Lane wrote:

> It's hard to see how it could have continuing effects over several
> seconds, especially in a system that has CPU to spare.

Any queueing situation takes a while to resolve, and over-damped systems
can take a long time to resolve themselves.  We simply don't know
anything at all about how well damped the system is.  Regrettably,
physicists spend a lot of time looking at oscillating systems
(http://en.wikipedia.org/wiki/Vibration), so my viewpoint is that the
system is much more dynamic than we have yet considered it to be.

You've not commented on the double-take on the WALInsertLock, which
seems to be the initiating factor here.  I think multiple-CPU systems
may simply exacerbate the queueing problem, and spare CPU is usually an
indication of contention.  As ever, this is conjecture to attempt to
explain the problem, not firm black-and-white knowledge.

Anyway, I agree that turning synchronous_commit = off and setting
full_page_writes = off will get us closer here.

Also, I'm very happy that the delay is only 1 sec. We had much greater
effects two years ago, so we've definitely moved forwards.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Checkpoint Tuning Question

From: Dan Armbrust
> So this thought leads to a couple of other things Dan could test.
> First, see if turning off full_page_writes makes the hiccup go away.
> If so, we know the problem is in this area (though still not exactly
> which reason); if not we need another idea.  That's not a good permanent
> fix though, since it reduces crash safety.  The other knobs to
> experiment with are synchronous_commit and wal_sync_method.  If the
> stalls are due to commits waiting for additional xlog to get written,
> then async commit should stop them.  I'm not sure if changing
> wal_sync_method can help, but it'd be worth experimenting with.
>
>                        regards, tom lane
>


All of my testing to date has been done with synchronous_commit=off

I just tried setting full_page_writes=off - and like magic, the entire
hiccup went away.

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Simon Riggs
On Mon, 2009-07-13 at 15:53 -0500, Dan Armbrust wrote:
> > So this thought leads to a couple of other things Dan could test.
> > First, see if turning off full_page_writes makes the hiccup go away.
> > If so, we know the problem is in this area (though still not exactly
> > which reason); if not we need another idea.  That's not a good permanent
> > fix though, since it reduces crash safety.  The other knobs to
> > experiment with are synchronous_commit and wal_sync_method.  If the
> > stalls are due to commits waiting for additional xlog to get written,
> > then async commit should stop them.  I'm not sure if changing
> > wal_sync_method can help, but it'd be worth experimenting with.
> >

> All of my testing to date has been done with synchronous_commit=off
>
> I just tried setting full_page_writes=off - and like magic, the entire
> hiccup went away.

OK, that seems clear.

I mistakenly referred to the CRC calculation happening while the lock
was held, which confused the discussion. The lock *is* held for longer
when we have backup blocks and the lock does need to be acquired twice
immediately after a checkpoint.

Neither of the above two effects appear, on their own, sufficient to
explain the delay. We should conjecture that a traffic jam exists and go
looking for it.

I propose a DTrace probe immediately after the "goto begin" at line 740
of xlog.c, so we can start tracing from the first backend following a
checkpoint, and turn off tracing once all backends have completed a
transaction.
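
A rough sketch of what that might look like (the probe name is
hypothetical; 8.3's DTrace support uses the PG_TRACE macros and needs a
build configured with --enable-dtrace):

/* in src/backend/utils/probes.d, inside the postgresql provider: */
probe xlog__insert__recompute();

/* in XLogInsert(), at the point where we loop back via "goto begin": */
PG_TRACE(xlog__insert__recompute);

A D script along these lines could then count how often backends hit
the recompute path:

/* recompute.d - run as: dtrace -qs recompute.d */
postgresql*:::xlog-insert-recompute
{
        @hits[pid] = count();
}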

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Checkpoint Tuning Question

From: Dan Armbrust
>
> I propose a DTrace probe immediately after the "goto begin" at line 740
> of xlog.c, so we can start tracing from the first backend following a
> checkpoint, and turn off tracing once all backends have completed a
> transaction.
>

That's Greek to me.  But I'm happy to test things if you send me
patches or custom code.

Thanks,

Dan

Re: Checkpoint Tuning Question

From: Dan Armbrust
On Mon, Jul 13, 2009 at 3:53 PM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
>> So this thought leads to a couple of other things Dan could test.
>> First, see if turning off full_page_writes makes the hiccup go away.
>> If so, we know the problem is in this area (though still not exactly
>> which reason); if not we need another idea.  That's not a good permanent
>> fix though, since it reduces crash safety.  The other knobs to
>> experiment with are synchronous_commit and wal_sync_method.  If the
>> stalls are due to commits waiting for additional xlog to get written,
>> then async commit should stop them.  I'm not sure if changing
>> wal_sync_method can help, but it'd be worth experimenting with.
>>
>>                        regards, tom lane
>>
>
>
> All of my testing to date has been done with synchronous_commit=off
>
> I just tried setting full_page_writes=off - and like magic, the entire
> hiccup went away.
>
> Thanks,
>
> Dan

I haven't done any other testing for this issue since discovering that
setting "full_page_writes=off" makes my performance hiccup go away.
I'm about to move on to some different tests and reset this setup that
I was using to test out this issue - is there anything else that
anyone would like to have gathered before I axe this system?

I can redo it again later too, it will just take a fair bit longer for
me to reconfigure the test environment.

Thanks,

Dan

Re: Checkpoint Tuning Question

From: tomrevam


Dan Armbrust wrote:
>
> All of my testing to date has been done with synchronous_commit=off
>
> I just tried setting full_page_writes=off - and like magic, the entire
> hiccup went away.
>

Why is the full-page write happening before the commit returns when
synchronous_commit is set to off?  Is there a way to fix this?

Thanks,
Tomer Amiaz