Thread: How to find the culprit in server load spikes?

How to find the culprit in server load spikes?

From
Moreno Andreo
Date:
Hi everyone,
     I host a Postgresql server on Ubuntu 12.04 and I am facing server
load spikes (if I run top, it goes up to 25-30 on a 4-core system)...
In some cases, I have to restart potgresql service because users call us
complaining of the slowness, but in some cases I can leave things on
their way and I see that after a bunch of minutes (about 5-10) the
situations drops to the normality (0.50-2 load).
The problem is, as in the most cases, the I/O, but I need a small hand
to know some methods or tools that can help me to investigate who or
what is causing me these spikes.

Any help would be appreciated.
Best regards,
Moreno.



Re: How to find the culprit in server load spikes?

From
Kevin Grittner
Date:
Moreno Andreo <moreno.andreo@evolu-s.it> wrote:

>     I host a Postgresql server on Ubuntu 12.04 and I am facing server
> load spikes (if I run top, it goes up to 25-30 on a 4-core system)...
> In some cases, I have to restart potgresql service because users call us
> complaining of the slowness, but in some cases I can leave things on
> their way and I see that after a bunch of minutes (about 5-10) the
> situations drops to the normality (0.50-2 load).
>
> The problem is, as in the most cases, the I/O,

If you have confirmed that there is an I/O glut during these
episodes, it is probably that a cascade of dirty cache pages
caused the OS dirty pages to hit the vm.dirty_ratio percentage.  If
you were seeing a high number for system CPU time the below would
probably not help.

Make sure you are using a storage system that has a persistent
cache configured for write-back (rather than write-through).
Reduce the OS vm.dirty_background_bytes setting to less than the
size of the persistent write cache.  Make sure that vm.dirty_ratio
is at least 20, possibly higher.  Configure the PostgreSQL
background writer to be more aggressive.  If those don't do it,
reduce the size of shared_buffers.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: How to find the culprit in server load spikes?

From
Jeff Janes
Date:
On Wed, Jul 22, 2015 at 5:50 AM, Moreno Andreo <moreno.andreo@evolu-s.it> wrote:
Hi everyone,
    I host a Postgresql server on Ubuntu 12.04 and I am facing server load spikes (if I run top, it goes up to 25-30 on a 4-core system)...
In some cases, I have to restart potgresql service because users call us complaining of the slowness, but in some cases I can leave things on their way and I see that after a bunch of minutes (about 5-10) the situations drops to the normality (0.50-2 load).
The problem is, as in the most cases, the I/O, but I need a small hand to know some methods or tools that can help me to investigate who or what is causing me these spikes.

I always run systems starting out with logging cranked up to at least these settings:

log_checkpoints = on
log_lock_waits = on
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = '10s'
track_io_timing = on
log_autovacuum_min_duration = 1000
log_min_duration_statement = 1000 ## or less

In particular, you would want to see what the reported "sync" time is for the checkpoint, and whether the slowness (as shown by the frequency of statement min duration log events) is occurring in a pattern around the beginning and end of a checkpoint.

I'd also set up vmstat to run continuously capturing output to a logfile with a timestamp, which can later be correlated to the postgres log file entries.

Cheers,

Jeff