Re: Spurious Stalls - Mailing list pgsql-general

From Scott Marlowe
Subject Re: Spurious Stalls
Date
Msg-id CAOR=d=3ogM2GXqqHRVs8XNguQXWCEXJFcOkQGQodkzGReLqSpQ@mail.gmail.com
Whole thread Raw
In response to Spurious Stalls  (Christopher Nielsen <cnielsen@atlassian.com>)
Responses Re: Spurious Stalls
List pgsql-general
On Thu, Jun 12, 2014 at 1:57 PM, Christopher Nielsen
<cnielsen@atlassian.com> wrote:

> Lately, though, about once a day now, for about a week, we have been experiencing periods of stalling.  When Postgres
stalls,we haven't been able to recover, without restarting the database, unfortunately. 

Just the db and not the server right? Have you tried killing any
individual queries that seem to be problem / long running queries?

> Here is some background, about the issue.  We have found the following symptoms.
>
> During this performance issue, we found the following symptoms.
> Running queries do not return.

Even simple ones like "select 1;" ?
Or ones that are something like "select * from smalltable where id=123; ?

> The application sometimes can no longer get new connections.
> The CPU load increases
> There is no I/O wait.
> There is no swapping.

> Also, our database configuration, is attached to this email, as postgresql.conf, for reference, along with a profile
ofour hardware and tuning, as pg_db_profile.txt. 

How much memory does your machine have? How much of that is shared buf
then? If it's got 8G and 6G shared_buffers I'd lower shared_buffers.
etc
What is your IO subsystem like? Is it a single 7200 RPM SATA drive, an
array of 16 15krpm disks under a caching raid controller? A 1TB
FusionIO card?
Just because linux SAYS there's wait doesn't mean there isn't one.
Sometimes the io wait numbers are zero while some kernel daemon uses
up all the bandwidth and you really are waiting.
Your charts show 35MB/s write. If that's all random, that's a LOT of
writes on anything but SSDs and even then it's a fair bit for a busy
db server if it's being done by 100 or more processes, which is what
this looks like.

> While the database was unavailable, we also collected a lot of data.  Looking through this info, a few things pop-out
tous, that may be problematic, or useful to notice. 
>
> Disk I/O appears to be all write, and little read.
> In previous incidents, with the same symptoms, we have seen pg processes spending much time in s_lock
> That info is attached to this email also, as files named perf_*.
>
> Additionally, monitoring graphs show the following performance profile.
>
> Problem
>
> As you can probably see below, at 11:54, the DB stops returning rows.
>
> Also, transactions stop returning, causing the active transaction time to trend up to the sky.
>
>
> Consequences of Problem
>
> Once transactions stop returning, we see connections pile-up.  Eventually, we reach a max, and clients can no longer
connect.

You need to lower your max connections and institute connection
pooling yesterday. 500 connections and 256MB work mem are a recipe for
disaster. While a machine with a db pool in front of it can survive
such scenarios, lack of a pool and high work mem are killing your
machine.

I recommend pgbouncer. Dirt simple, fast, and will keep your incoming
connections limited to something your db can handle.


pgsql-general by date:

Previous
From: Si Chen
Date:
Subject: Re: what does pg_activity mean when the database is stuck?
Next
From: John R Pierce
Date:
Subject: Re: what does pg_activity mean when the database is stuck?