Analyse without locking?

From
Richard Neill
Date:
Dear All,

I'm wondering whether Vacuum/analyse (notably by the autovacuum daemon)
is responsible for some deadlocks/dropouts I'm seeing.

One particular table gets hit about 5 times a second (for single row
updates and inserts) + associated index changes. This is a very light
load for the hardware; we have 7 CPU cores idling, and very little disk
activity. The query normally runs in about 20 ms.

However, the query must always respond within 200ms, or userspace gets
nasty errors.  [we're routing books on a sorter machine, and the book
misses its exit opportunity]. Although this is a low load, it's a bit
like a heartbeat.

The question is, could the autovacuum daemon (running either in vacuum
or in analyse mode) be taking out locks on this table that sometimes
cause the query response time to go way up (exceeding 10 seconds)?

I think I've set up autovacuum to do "little and often", using
   autovacuum_vacuum_cost_delay = 20ms
   autovacuum_vacuum_cost_limit = 20
but I'm not sure this is doing exactly what I think it is. In
particular, the system-wide I/O (and CPU) limit of autovacuum is
negligible, but it's possible that queries may be waiting on locks.

In particular, I want to make sure that the autovacuum daemon never
holds any lock for more than about 50ms at a time (or will release it
immediately if something else wants it).

Or am I barking up the wrong tree entirely?

Thanks,

Richard

Re: Analyse without locking?

From
Grzegorz Jaśkiewicz
Date:


On Thu, Nov 26, 2009 at 4:20 PM, Richard Neill <rn214@cam.ac.uk> wrote:
Dear All,

I'm wondering whether Vacuum/analyse (notably by the autovacuum daemon) is responsible for some deadlocks/dropouts I'm seeing.

One particular table gets hit about 5 times a second (for single row updates and inserts) + associated index changes. This is a very light load for the hardware; we have 7 CPU cores idling, and very little disk activity. The query normally runs in about 20 ms.

However, the query must always respond within 200ms, or userspace gets nasty errors.  [we're routing books on a sorter machine, and the book misses its exit opportunity]. Although this is a low load, it's a bit like a heartbeat.

The question is, could the autovacuum daemon (running either in vacuum or in analyse mode) be taking out locks on this table that sometimes cause the query response time to go way up (exceeding 10 seconds)?

I think I've set up autovacuum to do "little and often", using
 autovacuum_vacuum_cost_delay = 20ms
 autovacuum_vacuum_cost_limit = 20

Those are basically thresholds, so in essence you are forcing your autovacuum to be active pretty often.

From what I can read here, you are looking for completely the opposite behaviour. Unless you think the statistical image of your table will be completely invalid after 20 modifications to it, which I am sure is not true.




--
GJ

Re: Analyse without locking?

From
Tom Lane
Date:
Richard Neill <rn214@cam.ac.uk> writes:
> I'm wondering whether Vacuum/analyse (notably by the autovacuum daemon)
> is responsible for some deadlocks/dropouts I'm seeing.

> One particular table gets hit about 5 times a second (for single row
> updates and inserts) + associated index changes. This is a very light
> load for the hardware; we have 7 CPU cores idling, and very little disk
> activity. The query normally runs in about 20 ms.

> However, the query must always respond within 200ms, or userspace gets
> nasty errors.  [we're routing books on a sorter machine, and the book
> misses its exit opportunity]. Although this is a low load, it's a bit
> like a heartbeat.

> The question is, could the autovacuum daemon (running either in vacuum
> or in analyse mode) be taking out locks on this table that sometimes
> cause the query response time to go way up (exceeding 10 seconds)?

Hmm.  Autovacuum does sometimes take an exclusive lock.  It is supposed
to release it "on demand" but if I recall the details correctly, that
could involve a delay of about deadlock_timeout, or 1s by default.
It would be reasonable to reduce deadlock_timeout to 100ms to ensure
your external constraint is met.
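In postgresql.conf terms, that suggestion would look like this (the 100ms figure is from the note above, not a universal recommendation):

```
# postgresql.conf
deadlock_timeout = 100ms    # default is 1s; this also bounds how long
                            # autovacuum can sit on its lock before it
                            # notices a waiter and releases
```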

Delays of up to 10s would not be explained by that though.  Do you have
usage spikes of other types?  I wonder in particular if you've got
checkpoints smoothed out enough.

            regards, tom lane

Re: Analyse without locking?

From
Andres Freund
Date:
On Thursday 26 November 2009 17:20:35 Richard Neill wrote:
> Dear All,
>
> I'm wondering whether Vacuum/analyse (notably by the autovacuum daemon)
> is responsible for some deadlocks/dropouts I'm seeing.
>
> One particular table gets hit about 5 times a second (for single row
> updates and inserts) + associated index changes. This is a very light
> load for the hardware; we have 7 CPU cores idling, and very little disk
> activity. The query normally runs in about 20 ms.
>
> However, the query must always respond within 200ms, or userspace gets
> nasty errors.  [we're routing books on a sorter machine, and the book
> misses its exit opportunity]. Although this is a low load, it's a bit
> like a heartbeat.
>
> The question is, could the autovacuum daemon (running either in vacuum
> or in analyse mode) be taking out locks on this table that sometimes
> cause the query response time to go way up (exceeding 10 seconds)?
>
> I think I've set up autovacuum to do "little and often", using
>    autovacuum_vacuum_cost_delay = 20ms
>    autovacuum_vacuum_cost_limit = 20
> but I'm not sure this is doing exactly what I think it is. In
> particular, the system-wide I/O (and CPU) limit of autovacuum is
> negligible, but it's possible that queries may be waiting on locks.
>
> In particular, I want to make sure that the autovacuum daemon never
> holds any lock for more than about 50ms at a time. (or will release it
> immediately if something else wants it)
>
> Or am I barking up the wrong tree entirely?
I would suggest enabling log_lock_waits and setting deadlock_timeout to a low
value - that should give you more information.
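Concretely, that would be something like the following (log_lock_waits is the parameter name as of 8.3; values illustrative):

```
# postgresql.conf
log_lock_waits = on         # log any lock wait that exceeds deadlock_timeout
deadlock_timeout = 100ms    # low enough to catch the 200ms stalls described
```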

Andres

Re: Analyse without locking?

From
Greg Smith
Date:
Richard Neill wrote:
> Or am I barking up the wrong tree entirely?
If you haven't already tuned checkpoint behavior, it's more likely
that's causing a dropout than autovacuum.  See the checkpoint_segments
section of http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
for an intro.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Analyse without locking?

From
Richard Neill
Date:
Greg Smith wrote:
> Richard Neill wrote:
>> Or am I barking up the wrong tree entirely?
> If you haven't already tuned checkpoint behavior, it's more likely
> that's causing a dropout than autovacuum.  See the checkpoint_segments
> section of http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
> for an intro.
>

Thanks - I did that already - it's currently
    checkpoint_segments = 64

Now, I understand that increasing checkpoint_segments is generally a
good thing (subject to some limit), but doesn't that just mean that
instead of say a 1 second outage every minute, it's a 10 second outage
every 10 minutes?

Also, correct me if I'm wrong, but mere selects shouldn't cause any
addition to the WAL. I'd expect that a simple row insert might require
perhaps 1kB of disk writes(*), in which case we're looking at only a few
kB/sec at most of writes in normal use?

Is it possible (or even sensible) to do a manual vacuum analyze with
nice/ionice?

Richard



(*)A typical write should be about 80 Bytes of data, in terms of how
much is actually being stored. I'm using the engineers' "rule of 10"
approximation to call that 1kB, based on indexes, and incomplete pages.


Re: Analyse without locking?

From
Tom Lane
Date:
Richard Neill <rn214@cam.ac.uk> writes:
> Now, I understand that increasing checkpoint_segments is generally a
> good thing (subject to some limit), but doesn't that just mean that
> instead of say a 1 second outage every minute, it's a 10 second outage
> every 10 minutes?

In recent PG versions you can spread the checkpoint I/O out over a
period of time, so it shouldn't be an "outage" at all, just background
load.  Other things being equal, a longer checkpoint cycle is better
since it improves the odds of being able to coalesce multiple changes
to the same page into a single write.  The limiting factor is your
threshold of pain on how much WAL-replay work would be needed to recover
after a crash.

> Is it possible (or even sensible) to do a manual vacuum analyze with
> nice/ionice?

There's no support for that in PG.  You could try manually renice'ing
the backend that's running your VACUUM but I'm not sure how well it
would work; there are a number of reasons why it might be
counterproductive.  Fooling with the vacuum_cost_delay parameters is the
recommended way to make a vacuum run slower and use less of the machine.
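For reference, the cost-delay knobs being discussed look like this in postgresql.conf (values are illustrative starting points, not recommendations):

```
# postgresql.conf
vacuum_cost_delay = 20ms             # manual VACUUM sleeps this long...
vacuum_cost_limit = 200              # ...after accruing this much page cost
autovacuum_vacuum_cost_delay = 20ms  # same throttle for autovacuum workers
```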

            regards, tom lane

Re: Analyse without locking?

From
Greg Smith
Date:
Richard Neill wrote:
> Now, I understand that increasing checkpoint_segments is generally a
> good thing (subject to some limit), but doesn't that just mean that
> instead of say a 1 second outage every minute, it's a 10 second outage
> every 10 minutes?
That was the case in versions before 8.3.  Now, the I/O is spread out
over most of the next checkpoint's time period.  So what actually
happens is that all the I/O that happens over 10 minutes will be spread
out over the next five minutes of time.  With the defaults, there's so
little time between checkpoints under heavy writes that the spreading
doesn't have enough room to work, leading to higher write bursts.
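As a back-of-envelope sketch of that spreading: the write phase lasts roughly checkpoint_timeout * checkpoint_completion_target (0.5 is the 8.3/8.4 default target; the numbers here are illustrative):

```shell
# Estimate how long the checkpoint write phase is spread over.
timeout_s=600   # checkpoint_timeout = 10min
target=0.5      # checkpoint_completion_target (default)
write_phase=$(awk -v t="$timeout_s" -v c="$target" 'BEGIN{print t*c}')
echo "writes spread over ~${write_phase}s of each ${timeout_s}s cycle"
```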

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Analyse without locking?

From
Richard Neill
Date:
Thanks for your explanations.

Tom Lane wrote:
> Richard Neill <rn214@cam.ac.uk> writes:
>> Now, I understand that increasing checkpoint_segments is generally a
>> good thing (subject to some limit), but doesn't that just mean that
>> instead of say a 1 second outage every minute, it's a 10 second outage
>> every 10 minutes?
>
> In recent PG versions you can spread the checkpoint I/O out over a
> period of time, so it shouldn't be an "outage" at all, just background
> load.  Other things being equal, a longer checkpoint cycle is better
> since it improves the odds of being able to coalesce multiple changes
> to the same page into a single write.  The limiting factor is your
> threshold of pain on how much WAL-replay work would be needed to recover
> after a crash.

That makes sense. I think that 64 is sane - it means crash-recovery
takes less than 1 minute, yet we aren't seeing the warning that
checkpoints are too frequent.
>
>> Is it possible (or even sensible) to do a manual vacuum analyze with
>> nice/ionice?
>
> There's no support for that in PG.  You could try manually renice'ing
> the backend that's running your VACUUM but I'm not sure how well it
> would work; there are a number of reasons why it might be
> counterproductive.  Fooling with the vacuum_cost_delay parameters is the
> recommended way to make a vacuum run slower and use less of the machine.

I see why it might not work well - priority inversion etc.

What I was trying to achieve is to say that vacuum can have all the
spare idle CPU/IO that's available, but must *immediately* back off when
something else needs the CPU/IO/Locks.

For example,
   nice -n 20 yes > /dev/null
   ionice -c 3 dd if=/dev/zero of=tmp.del

will both get quite a lot of work done on a medium-loaded system (try
this on your own laptop), but have zero impact on other processes.

On the other hand, changing vacuum_cost_delay means that vacuum runs
slowly even if the CPU is otherwise idle; yet it still impacts on the
responsiveness of some queries.


Richard

Re: Analyse without locking?

From
Richard Neill
Date:
Dear All,

I'm still puzzled by this one - it looks like it's causing about 5% of
queries to rise in duration from ~300ms to 2-6 seconds.

On the other hand, the system never seems to be I/O bound. (we have at
least 25 MB/sec of write bandwidth, and use a small fraction of that
normally).

Here's the typical checkpoint logs:

2009-12-03 06:21:21 GMT LOG:  checkpoint complete: wrote 12400 buffers
(2.2%); 0 transaction log file(s) added, 0 removed, 12 recycled;
write=149.883 s, sync=5.143 s, total=155.040 s

We're using 8.4.1, on ext4 with SSD. Is it possible that something
exotic is occurring to do with write barriers (on by default in ext4,
and we haven't changed this)?

Perhaps a low priority IO process for writing the previous WAL to disk
is blocking a high-priority transaction (which is trying to write to the
new WAL). If the latter is trying to sync, could the large amount of
lower priority IO be getting in the way thanks to write barriers?

If so, can I safely turn off write barriers?

Thanks,

Richard


P.S. Should I rename this thread?




Richard Neill wrote:
> Dear All,
>
> It definitely looks checkpoint-related - the checkpoint timeout is set
> to 5 minutes, and here is a graph of our response time (in ms) over a 1
> hour period. The query is pretty much identical each time.
>
> Any ideas what I could do to make checkpoints not hurt performance like
> this?
>
> Thanks,
>
> Richard
>
>
>
> Tom Lane wrote:
>> Richard Neill <rn214@cam.ac.uk> writes:
>>> Now, I understand that increasing checkpoint_segments is generally a
>>> good thing (subject to some limit), but doesn't that just mean that
>>> instead of say a 1 second outage every minute, it's a 10 second
>>> outage every 10 minutes?
>>
>> In recent PG versions you can spread the checkpoint I/O out over a
>> period of time, so it shouldn't be an "outage" at all, just background
>> load.  Other things being equal, a longer checkpoint cycle is better
>> since it improves the odds of being able to coalesce multiple changes
>> to the same page into a single write.  The limiting factor is your
>> threshold of pain on how much WAL-replay work would be needed to recover
>> after a crash.
>
>
> ------------------------------------------------------------------------
>

Re: Checkpoint spikes

From
Greg Smith
Date:
Richard Neill wrote:
> On the other hand, the system never seems to be I/O bound. (we have at
> least 25 MB/sec of write bandwidth, and use a small fraction of that
> normally).
I would bet that if you sampled vmstat or iostat every single second,
you'd discover there's a large burst in write speed for the same few
seconds that queries are stuck.  If you're averaging out the data over a
5 second or longer period, you'll never see it--the spike will get lost
in the average.  You just can't monitor checkpoint spikes unless you're
watching I/O with an extremely tight time resolution.  Watching the
"Writeback" figure in /proc/meminfo is helpful too, that is where I
normally see everything jammed up.
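A minimal way to do that sampling (Linux-only, since it reads /proc/meminfo; three samples shown here, run it for longer while reproducing a stall):

```shell
# Sample dirty/writeback totals once per second; a checkpoint spike shows
# up as a sudden jump in Writeback that a 5-second average would hide.
for i in 1 2 3; do
    date +%H:%M:%S
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done
```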

> Here's the typical checkpoint logs:
> 2009-12-03 06:21:21 GMT LOG:  checkpoint complete: wrote 12400 buffers
> (2.2%); 0 transaction log file(s) added, 0 removed, 12 recycled;
> write=149.883 s, sync=5.143 s, total=155.040 s
See that "sync" number there?  That's your problem; while that sync
operation is going on, everybody else is grinding to a halt waiting for
it.  Not a coincidence that the duration is about the same amount of
time that your queries are getting stuck.  This example shows 12400
buffers = 97MB of total data written.  Since those writes are pretty
random I/O, it's easily possible to get stuck for a few seconds waiting
for that much data to make it out to disk.  You only gave the write
phase a couple of minutes to spread things out over; meanwhile, Linux
may not even bother starting to write things out until 30 seconds into
that, so the effective time between when writes to disk start and when
the matching sync happens on your system is extremely small.  That's not
good--you have to give that several minutes of breathing room if you
want to avoid checkpoint spikes.

> We're using 8.4.1, on ext4 with SSD. Is it possible that something
> exotic is occurring to do with write barriers (on by default in ext4,
> and we haven't changed this).
> Perhaps a low priority IO process for writing the previous WAL to disk
> is blocking a high-priority transaction (which is trying to write to
> the new WAL). If the latter is trying to sync, could the large amount
> of lower priority IO be getting in the way thanks to write barriers?
> If so, can I safely turn off write barriers?
Linux is pretty dumb in general here.  fsync operations will usually end
up writing out way more of the OS buffer cache than they need to.  And
the write cache can get quite big before pdflush decides it should
actually do some work, the whole thing is optimized for throughput
rather than latency.  I don't really trust barriers at all, so I don't
know if there's some specific tuning you can do with those to improve
things.  Your whole system is bleeding edge craziness IMHO--SSD, ext4,
write barriers, all stuff that just doesn't work reliably yet far as I'm
concerned.

...but that's not what you want to hear.  What I can suggest should
help is increasing checkpoint_segments (>32), checkpoint_timeout (>=10
minutes), checkpoint_completion_target (0.9), and lowering the amount of
writes Linux will cache before it gets more aggressive about flushing
them.  Those things will fight the root cause of the problem, by giving
more time between the "write" and "sync" phases of the checkpoint.  It's
ok if "write" takes a long while; decreasing the "sync" number is the
goal you need to keep your eye on.
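Pulled together as a postgresql.conf starting point (these are the figures suggested above; tune them against your own observed sync times):

```
# postgresql.conf
checkpoint_segments = 64             # well above the default 3
checkpoint_timeout = 10min           # or longer
checkpoint_completion_target = 0.9   # stretch the write phase out
```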

I've written a couple of articles on this specific topic if you want
more background on the underlying issues, it's kind of heavy reading:

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Analyse without locking?

From
Laurent Laborde
Date:
On Sat, Nov 28, 2009 at 6:57 PM, Richard Neill <rn214@cam.ac.uk> wrote:
> Greg Smith wrote:
>>
>> Richard Neill wrote:
>>>
>>> Or am I barking up the wrong tree entirely?
>>
>> If you haven't already tuned checkpoint behavior, it's more likely that's
>> causing a dropout than autovacuum.  See the checkpoint_segments section of
>> http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server for an intro.
>>
>
> Thanks - I did that already - it's currently
>   checkpoint_segments = 64
>
> Now, I understand that increasing checkpoint_segments is generally a good
> thing (subject to some limit), but doesn't that just mean that instead of
> say a 1 second outage every minute, it's a 10 second outage every 10
> minutes?
>
> Also, correct me if I'm wrong, but mere selects shouldn't cause any addition
> to the WAL. I'd expect that a simple row insert might require perhaps 1kB of
> disk writes(*), in which case we're looking at only a few kB/sec at most of
> writes in normal use.?
>
> Is it possible (or even sensible) to do a manual vacuum analyze with
> nice/ionice?

this is the job of autovacuum_vacuum_cost_delay and vacuum_cost_delay.

About checkpoint, you may eventually set :
synchronous_commit = off

Please note that you may lose some recently committed transactions if
the server crashes badly (but that shouldn't cause database corruption
the way fsync = off can).

If you are running on Linux, you could try to monitor (rrd is your
friend) /proc/meminfo, specifically the "Dirty" field.

Read your server log to see if the checkpoint is a problem.
Here is a sample of mine (cleaned):
checkpoint complete: wrote 3117 buffers (1.2%); 0 transaction log
file(s) added, 0 removed, 3 recycled;
write=280.213 s, sync=0.579 s, total=280.797 s

The more Dirty pages (/proc/meminfo), the longer your sync time.
A high sync time can easily "lock" your server.

To reduce the dirty page, tune /proc/sys/vm/dirty_background_ratio
I have set it to "1" on my 32GB servers.

You should also be careful about all the other
/proc/sys/vm/dirty_*
And specifically /proc/sys/vm/dirty_ratio :
Maximum percentage of total memory that can be filled with dirty pages
before processes are forced to write dirty buffers themselves during
their time slice instead of being allowed to do more writes.
Note that all processes are blocked for writes when this happens, not
just the one that filled the write buffers.
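The corresponding sysctl fragment (the value of 1 is what's used on the 32GB servers mentioned above; pick a value to suit your RAM):

```
# /etc/sysctl.conf -- start background writeback almost immediately
vm.dirty_background_ratio = 1
```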

About "ionice" : it only works with the CFQ I/O scheduler.
And CFQ is a very bad idea when using PostgreSQL.

--
Laurent "ker2x" Laborde
Sysadmin & DBA at http://www.over-blog.com/

Re: Checkpoint spikes

From
Heikki Linnakangas
Date:
Greg Smith wrote:
> Richard Neill wrote:
>> Here's the typical checkpoint logs:
>> 2009-12-03 06:21:21 GMT LOG:  checkpoint complete: wrote 12400 buffers
>> (2.2%); 0 transaction log file(s) added, 0 removed, 12 recycled;
>> write=149.883 s, sync=5.143 s, total=155.040 s
> See that "sync" number there?  That's your problem; while that sync
> operation is going on, everybody else is grinding to a halt waiting for
> it.  Not a coincidence that the duration is about the same amount of
> time that your queries are getting stuck.  This example shows 12400
> buffers = 97MB of total data written.  Since those writes are pretty
> random I/O, it's easily possible to get stuck for a few seconds waiting
> for that much data to make it out to disk.  You only gave the write
> phase a couple of minutes to spread things out over; meanwhile, Linux
> may not even bother starting to write things out until 30 seconds into
> that, so the effective time between when writes to disk start and when
> the matching sync happens on your system is extremely small.  That's not
> good--you have to give that several minutes of breathing room if you
> want to avoid checkpoint spikes.

I wonder how common this issue is? When we implemented spreading of the
write phase, we had long discussions about spreading out the fsyncs too,
but in the end it wasn't done. Perhaps it is time to revisit that now
that 8.3 has been out for some time and people have experience with the
load-distributed checkpoints.

I'm not sure how the spreading of the fsync()s should work, it's hard to
estimate how long each fsync() is going to take, for example, but surely
something would be better than nothing.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: Checkpoint spikes

From
Greg Smith
Date:
Heikki Linnakangas wrote:
> I wonder how common this issue is? When we implemented spreading of the
> write phase, we had long discussions about spreading out the fsyncs too,
> but in the end it wasn't done. Perhaps it is time to revisit that now
> that 8.3 has been out for some time and people have experience with the
> load-distributed checkpoints.
>
Circa 8.2, I ran into checkpoint problems all the time.  With the
spreading logic in 8.3, properly set up, the worst case is so improved
that I usually find something else more pressing to tune, rather than
worry about the exact details of the sync process.  It seems to have hit
the "good enough" point where it's hard to justify time for further
improvements when there are other things to work on.  I'd still like to
see spread fsync happen one day, just hasn't been a priority for any
systems I have to improve lately.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Checkpoint spikes

From
Richard Neill
Date:
Dear All,

Thanks for all your help so far. This page was particularly helpful:
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
(does the advice for 8.3 apply unchanged to 8.4?)

I'm still hitting issues with this though: sync is taking 7-10 seconds
and I need to get it down to nearer 3.

We're running a lightly-loaded system which has a realtime response
requirement of 0.5 seconds all the time (with a few seconds permissible
occasionally, but never exceeding 10).

So far, I've set checkpoint_segments to 128, timeout to 10min, and
completion_target to 0.8. This helps, but not as much as I'd hoped.

But I haven't touched any of the other WAL or BG Writer settings.

Where should I look next?
   Should I be looking at the BG Writer settings,
   or should I look at the Linux VM configuration?
     (eg changing /proc/sys/vm/dirty_background_ratio from 5 to 1)

   Or would it be most useful to try to move the WAL to a different disk?


Latest messages:

# tail -f /var/log/postgresql/postgresql-8.4-main.log | grep check

2009-12-08 09:12:00 GMT LOG:  checkpoint starting: time
2009-12-08 09:20:09 GMT LOG:  checkpoint complete: wrote 51151 buffers
(8.9%); 0 transaction log file(s) added, 0 removed, 23 recycled;
write=479.669 s, sync=9.852 s, total=489.553 s

2009-12-08 09:22:00 GMT LOG:  checkpoint starting: time
2009-12-08 09:30:07 GMT LOG:  checkpoint complete: wrote 45772 buffers
(7.9%); 0 transaction log file(s) added, 0 removed, 24 recycled;
write=479.706 s, sync=7.337 s, total=487.120 s

2009-12-08 09:32:00 GMT LOG:  checkpoint starting: time
2009-12-08 09:40:09 GMT LOG:  checkpoint complete: wrote 47043 buffers
(8.2%); 0 transaction log file(s) added, 0 removed, 22 recycled;
write=479.744 s, sync=9.300 s, total=489.122 s

2009-12-08 09:42:00 GMT LOG:  checkpoint starting: time
2009-12-08 09:50:07 GMT LOG:  checkpoint complete: wrote 48210 buffers
(8.4%); 0 transaction log file(s) added, 0 removed, 23 recycled;
write=479.689 s, sync=7.707 s, total=487.416 s


Thanks a lot,

Richard



Re: Checkpoint spikes

From
"Kevin Grittner"
Date:
Richard Neill <rn214@cam.ac.uk> wrote:

> So far, I've set checkpoint_segments to 128, timeout to 10min, and
> completion_target to 0.8. This helps, but not as much as I'd
> hoped.
>
> But I haven't touched any of the other WAL or BG Writer settings.
>
> Where should I look next?

On our web servers, where we had similar issues, we seem to be doing
OK using:

bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4.0

The other thing which can help this problem is keeping
shared_buffers smaller than most people recommend.  We use 512MB on
our larger web server and 256MB on other servers.  (Be sure to test
with your actual load, as this might or might not degrade overall
performance.)

-Kevin

Re: Checkpoint spikes

From
Greg Smith
Date:
Richard Neill wrote:
> (does the advice for 8.3 apply unchanged to 8.4?)
Yes; no changes in this area for 8.4.  The main things performance
related that changed between 8.3 and 8.4 are:
1) VACUUM free space management reimplemented so that the max_fsm_*
parameters aren't needed anymore
2) default_statistics_target now starts at 100 instead of 10

> So far, I've set checkpoint_segments to 128, timeout to 10min, and
> completion_target to 0.8. This helps, but not as much as I'd hoped.
Good, if the problem is moving in the right direction you're making
progress.

> But I haven't touched any of the other WAL or BG Writer settings.
> Where should I look next?
>   Should I be looking at the BG Writer settings,
>   or should I look at the Linux VM configuration?
>     (eg changing /proc/sys/vm/dirty_background_ratio from 5 to 1)
I would start by reducing dirty_background_ratio; as RAM sizes climb,
this keeps becoming a bigger issue.  The whole disk flushing code
finally got a major overhaul in the 2.6.32 Linux kernel, I'm hoping this
whole class of problem was improved from the changes made.

Changes to the background writer behavior will probably not work as
you'd expect.  The first thing I'd try in your situation is turning it
off altogether; it can be slightly counterproductive for reducing
checkpoint issues if they're really bad, which yours are.  If that goes
in the wrong direction, experimenting with increasing the maximum pages
and the multiplier might be useful, though I wouldn't bet on it helping.

As Kevin already mentioned, reducing the size of the buffer cache can
help too.  That's worth trying if you've exhausted the other obvious
possibilities.

>   Or would it be most useful to try to move the WAL to a different disk?
On Linux having the WAL on a separate disk can improve things much more
than you might expect, simply because of how brain-dead the filesystem
fsync implementation is.  Reducing the seeks for WAL traffic can help a
lot too.
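A hypothetical sketch of that WAL relocation (the directory is named pg_xlog in 8.x; scratch directories stand in for the real paths here, and the server must be stopped before moving anything):

```shell
# Move pg_xlog onto a dedicated disk and leave a symlink in its place.
# mktemp stand-ins keep the sketch self-contained; use your real paths.
PGDATA=$(mktemp -d)       # stand-in for e.g. /var/lib/postgresql/8.4/main
WALDISK=$(mktemp -d)      # stand-in for the dedicated WAL disk's mount point
mkdir "$PGDATA/pg_xlog"   # pretend this is the live WAL directory

# (stop the server first: pg_ctl stop -D "$PGDATA")
mv "$PGDATA/pg_xlog" "$WALDISK/pg_xlog"
ln -s "$WALDISK/pg_xlog" "$PGDATA/pg_xlog"
# (then restart: pg_ctl start -D "$PGDATA")

ls -ld "$PGDATA/pg_xlog"  # now a symlink pointing at the WAL disk
```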

If you've lowered Linux's caching, tried some BGW tweaks, and moved the
WAL somewhere else, and latency is still high, you may be facing a
hardware upgrade to improve things.  Sometimes these problems just
require more burst write throughput (regardless of how good average
performance looks) and nothing else will substitute.  Hopefully you'll
find a tuning solution before that, though.


--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Checkpoint spikes

From
Andres Freund
Date:
On Wednesday 09 December 2009 03:05:40 Greg Smith wrote:
> On Linux having the WAL on a separate disk can improve things much more
> than you might expect, simply because of how brain-dead the filesystem
> fsync implementation is.  Reducing the seeks for WAL traffic can help a
> lot too.
Not using ext3's data=ordered helps massively already.

Andres