Thread: Occasional spike in query response time (jumps from 200ms to 45 seconds)

Occasional spike in query response time (jumps from 200ms to 45 seconds)

From
Robert Voinea
Date:
Hi list

From time to time (every 1-2 months) one of the production systems I manage
starts acting crazy... and this is starting to become a problem.

Every query I send to the server has a very long running time (sometimes it
reaches 45+ seconds). Even the simplest queries, like "SELECT NOW()", run in
4-5 seconds... or more.

This goes on for a few minutes. After that, everything is back to normal.
I must say that this is not a big system -- 2 databases: one with a few
hundred tables but not much data; the other with only a few tables but around
15GB of data (the tables are partitioned by month). There is no heavy load.
Queries normally run in less than 20ms; only a few take longer, and those
peak at around 200ms.

The servers are HP ProLiant DL360 G7 with 8GB of RAM plus 1GB of cache on the
RAID controller (HP Smart Array G6). The system uses two disks in a RAID 1
mirror; there is no dedicated disk for the database.
I am running CentOS 6.3 (kernel 2.6.32-279.9.1.el6.i686).
PostgreSQL 9.1.6 was compiled from source.

I'm guessing that, somehow, there is heavy I/O usage at that time (and I am
starting to suspect the autovacuum daemon).
I have tried monitoring the server with iotop and iostat, but I couldn't find
anything useful.
Yes, the database server does a lot of I/O, especially for the clients that
connect to the second database (the 15GB one), but that is to be expected.
The other processes running on the server (there are a few other clients)
don't use much I/O bandwidth (they use the CPU more, but not enough to stall
the whole system), so I have excluded them.
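
In case it helps, the kind of background sampling I have in mind looks
roughly like this (just a sketch; the log file paths are only examples):

    # extended per-device I/O stats, every 5 seconds, with timestamps
    iostat -dxk -t 5 >> /var/log/iostat-capture.log &

    # per-process I/O, batch mode, only processes actually doing I/O
    iotop -b -o -d 5 >> /var/log/iotop-capture.log &

The idea is to leave these running so the data is already there the next time
an episode hits.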

At this moment, an upgrade is not desired, but if there are hints that an
upgrade would solve this issue, then I'll insist on it.
I've seen that the PostgreSQL 9.1.8 changelog contains some bug fixes related
to the performance of autovacuum:

// BEGIN QUOTE
Fix performance problems with autovacuum truncation in busy workloads (Jan
Wieck)

Truncation of empty pages at the end of a table requires exclusive lock, but
autovacuum was coded to fail (and release the table lock) when there are
conflicting lock requests. Under load, it is easily possible that truncation
would never occur, resulting in table bloat. Fix by performing a partial
truncation, releasing the lock, then attempting to re-acquire the lock and
continue. This fix also greatly reduces the average time before autovacuum
releases the lock after a conflicting request arrives.

Fix error in vacuum_freeze_table_age implementation (Andres Freund)

In installations that have existed for more than vacuum_freeze_min_age
transactions, this mistake prevented autovacuum from using partial-table
scans, so that a full-table scan would always happen instead.
// END QUOTE
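
(For reference, a quick way to see whether autovacuum is keeping up with dead
tuples would be something along these lines -- "mydb" is just a placeholder
for the database name:)

    # tables with the most dead tuples, and when autovacuum last touched them
    psql -d mydb -c "SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
                     FROM pg_stat_user_tables
                     ORDER BY n_dead_tup DESC
                     LIMIT 10;"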

Could you give me some hints on what to look for?
What might be causing this behavior?
Any idea is greatly appreciated!

Thank you!
--
Robert Voinea
Software Engineer
+4 0740 467 262

Don't take life too seriously. You'll never get out of it alive.
(Elbert Hubbard)



Re: Occasional spike in query response time (jumps from 200ms to 45 seconds)

From
bricklen
Date:
On Thu, Nov 7, 2013 at 2:44 AM, Robert Voinea <rvoinea@gmail.com> wrote:

> Could you give me some hints on what to look for?
> What might be causing this behavior?

Possibly checkpointing is the problem? Is "log_checkpoints" enabled in your
postgresql.conf?  What is "checkpoint_completion_target" set to?
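
You can check the current values from psql, e.g. (any database in the cluster
will do; "mydb" is just a placeholder):

    psql -d mydb -c "SHOW log_checkpoints;"              # default: off
    psql -d mydb -c "SHOW checkpoint_completion_target;" # default: 0.5
    psql -d mydb -c "SHOW checkpoint_segments;"          # default: 3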

Re: Occasional spike in query response time (jumps from 200ms to 45 seconds)

From
Kevin Grittner
Date:
Robert Voinea <rvoinea@gmail.com> wrote:

> From time to time (1-2 months) one of the production systems I
> manage starts acting crazy... and this is starting to become a
> problem.
>
> Every query I send to the server has a very long running time
> (sometimes it reaches 45+ seconds). Even the simplest queries, like
> "SELECT NOW()" run in 4-5 seconds... or more.
>
> This goes on for a few minutes. After that, everything is back to
> normal.

A likely cause for this is transparent huge page defrag.  If you
run `vmstat 1` during an episode and see a lot of system CPU time,
that tends to confirm this.  If you run `perf top` during an
episode and see the kernel spending a lot of time in spinlock
functions, that pretty much nails it.  You might want to turn off
transparent huge page support.
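
Roughly, something like this (a sketch -- the sysfs path varies by kernel;
RHEL/CentOS 6 kernels usually use the redhat_* variant, and if neither path
exists then THP is not available on your kernel and can be ruled out):

    # during an episode: watch the "sy" (system) and "wa" (I/O wait) columns
    vmstat 1

    # during an episode: kernel spinlock functions near the top point at THP defrag
    perf top

    # check whether THP and its defrag are currently enabled
    cat /sys/kernel/mm/redhat_transparent_hugepage/enabled 2>/dev/null \
        || cat /sys/kernel/mm/transparent_hugepage/enabled

    # disable both until the next reboot
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag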

> I'm guessing that, somehow, there is heavy I/O usage at that time

Which might be another possibility if the `vmstat 1` shows a lot of
I/O wait time.

> I've seen that postgres 9.1.8 Changelog contains some bug fixes
> related to performance of autovacuum:
>
> // BEGIN QUOTE
> Fix performance problems with autovacuum truncation in busy
> workloads (Jan Wieck)
>
> Truncation of empty pages at the end of a table requires
> exclusive lock, but autovacuum was coded to fail (and release the
> table lock) when there are conflicting lock requests. Under load,
> it is easily possible that truncation would never occur,
> resulting in table bloat. Fix by performing a partial truncation,
> releasing the lock, then attempting to re-acquire the lock and
> continue. This fix also greatly reduces the average time before
> autovacuum releases the lock after a conflicting request arrives.
>
> Fix error in vacuum_freeze_table_age implementation (Andres
> Freund)
>
> In installations that have existed for more than
> vacuum_freeze_min_age transactions, this mistake prevented
> autovacuum from using partial-table scans, so that a full-table
> scan would always happen instead.
> // END QUOTE

If there is high I/O and you see autovacuum running wraparound
prevention vacuums or vacuuming tables which have recently had a
large number of rows deleted, you might solve the problem by
updating.  I would strongly recommend updating anyway, because
otherwise you are running with known bugs and security
vulnerabilities, including one very serious one.
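
For example, during an episode you can check whether autovacuum workers are
active and on which tables (column names as of 9.1; "mydb" is a placeholder):

    psql -d mydb -c "SELECT procpid, query_start, current_query
                     FROM pg_stat_activity
                     WHERE current_query LIKE 'autovacuum:%';"

Wraparound-prevention vacuums show up with "(to prevent wraparound)" appended
to the query text.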

Why knowingly put yourself through dealing with bugs that others
have already encountered, reported, diagnosed, and fixed?

http://www.postgresql.org/support/versioning/

http://www.postgresql.org/support/security/faq/2013-04-04/

Note that in PostgreSQL, a minor release does not contain any new features
or behavior changes, apart from fixes for what are deemed to be clear bugs.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Occasional spike in query response time (jumps from 200ms to 45 seconds)

From
bricklen
Date:
On Fri, Nov 8, 2013 at 5:25 AM, Robert Voinea <rvoinea@gmail.com> wrote:

> Hi
>
> On Thursday 07 November 2013 06:37:57 bricklen wrote:
> > On Thu, Nov 7, 2013 at 2:44 AM, Robert Voinea <rvoinea@gmail.com> wrote:
> > > Could you give me some hints on what to look for?
> > > What might be causing this behavior?
> >
> > Possibly checkpointing is the problem? Is "log_checkpoints" enabled in your
> > postgresql.conf?  What is "checkpoint_completion_target" set to?
>
> The parameters are commented out, so they are at their default values.
>
> Thank you for replying.
> I'll investigate this further to see if this is the problem.

Checkpointing may or may not be the problem, but setting log_checkpoints = on
and checkpoint_completion_target = 0.8 would both be useful things to do. The
former will help in your investigation; the latter, known as "checkpoint
smoothing", might help alleviate the problem if it is indeed related to
checkpointing.
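
Both settings can be changed with just a reload, no restart needed, e.g.
(the data directory path is only an example, since you built from source):

    # in postgresql.conf:
    #   log_checkpoints = on
    #   checkpoint_completion_target = 0.8

    # then reload the configuration
    pg_ctl -D /path/to/data reload
    # or, from psql:  SELECT pg_reload_conf();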

Re: Occasional spike in query response time (jumps from 200ms to 45 seconds)

From
Robert Voinea
Date:
Hi

On Thursday 07 November 2013 06:37:57 bricklen wrote:
> On Thu, Nov 7, 2013 at 2:44 AM, Robert Voinea <rvoinea@gmail.com> wrote:
> > Could you give me some hints on what to look for?
> > What might be causing this behavior?
>
> Possibly checkpointing is the problem? Is "log_checkpoints" enabled in your
> postgresql.conf?  What is "checkpoint_completion_target" set to?

The parameters are commented out, so they are at their default values.

Thank you for replying.
I'll investigate this further to see if this is the problem.

--
Robert Voinea
Software Engineer
+4 0740 467 262

Don't take life too seriously. You'll never get out of it alive.
(Elbert Hubbard)




Re: Occasional spike in query response time (jumps from 200ms to 45 seconds)

From
Robert Voinea
Date:
Hi

On Thursday 07 November 2013 06:40:00 Kevin Grittner wrote:
> Robert Voinea <rvoinea@gmail.com> wrote:
> > From time to time (1-2 months) one of the production systems I
> > manage starts acting crazy... and this is starting to become a
> > problem.
> >
> > Every query I send to the server has a very long running time
> > (sometimes it reaches 45+ seconds). Even the simplest queries, like
> > "SELECT NOW()" run in 4-5 seconds... or more.
> >
> > This goes on for a few minutes. After that, everything is back to
> > normal.
>
> A likely cause for this is transparent huge page defrag.  If you
> run `vmstat 1` during an episode and see a lot of system CPU time,
> that tends to confirm this.  If you run `perf top` during an
> episode and see the kernel spending a lot of time in spinlock
> functions, that pretty much nails it.  You might want to turn off
> transparent huge page support.
I need to read about this first.

I forgot to mention that this is a 32-bit system... and due to the known
limitations we are forced to use a PAE kernel...
Unfortunately we cannot use a 64-bit operating system.
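
(A quick check of whether THP is even present on this kernel, before I dig
further -- just a sketch:)

    uname -m    # reports i686 here
    ls -d /sys/kernel/mm/*transparent_hugepage 2>/dev/null \
        || echo "no THP support on this kernel"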

> > I'm guessing that, somehow, there is heavy I/O usage at that time
>
> Which might be another possibility if the `vmstat 1` shows a lot of
> I/O wait time.
Another colleague of mine managed to somehow reproduce the issue.
It seems that the disk + network interface could be the bottleneck (iostat &
iotop show very high average wait times), with the disk's peak throughput at
around 70MB/s.
This is not the actual production system, but a similar one (in terms of
hardware).
We'll try to redo the tests with the clients and the server connected via a
Gigabit network and see if this helps.
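
To tell the disk apart from the network, something along these lines should
help (a sketch; /dev/sda stands in for the mirrored array device and may
differ):

    # while the test runs: %util near 100% and large await values point at the disk
    iostat -dxk 5

    # rough sequential read throughput of the array, bypassing the page cache
    dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct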

> > I've seen that postgres 9.1.8 Changelog contains some bug fixes
> > related to performance of autovacuum:
> If there is high I/O and you see autovacuum running wraparound
> prevention vacuums or vacuuming tables which have recently had a
> large number of rows deleted, you might solve the problem by
> updating.  I would strongly recommend updating anyway, because
> otherwise you are running with known bugs and security
> vulnerabilities, including one very serious one.
>
> Why knowingly put yourself through dealing with bugs that others
> have already encountered, reported, diagnosed, and fixed?
I agree with you on this, but the update requires downtime and has to be
scheduled in advance.

Thank you for replying.
I will investigate this even further and see what I may come up with.
I'll keep you informed with what I find.

--
Robert Voinea
Software Engineer
+4 0740 467 262

Don't take life too seriously. You'll never get out of it alive.
(Elbert Hubbard)