
From Bill Moran
Subject Re: Spurious Stalls
Date
Msg-id 20140613153541.b642295fc0d97acdf9faead2@potentialtech.com
In response to Re: Spurious Stalls  (Jaco Engelbrecht <jengelbrecht@atlassian.com>)
List pgsql-general
On Fri, 13 Jun 2014 20:02:01 +0100 Jaco Engelbrecht <jengelbrecht@atlassian.com> wrote:
>
> This coincides with a checkpoint:

There's a lot going on here (as always with a busy server), but I suspected
a checkpoint problem earlier, and this statement and your additional
information make me suspect it even more strongly.

In your earlier email with the .conf file, I seem to remember that
you had checkpoint_segments set to 256 and checkpoint_timeout set to
30m.  It's obvious from reading this thread that you've already put
quite a bit of effort into resolving this.  My question: have you
tried _lowering_ the checkpoint settings?  If we assume that the
stall is related to checkpoints, and I'm remembering the settings
correctly, then PostgreSQL might have as much as 4GB of WAL to
grind through to complete a checkpoint.  That's not huge, but if a
checkpoint of that size is trying to complete at the same time a lot
of other work is going on, it could cause stalls.  If you lower
checkpoint_segments and checkpoint_timeout, it will cause _more_ disk
activity overall, but that activity will be spread out more evenly.
Whether or not this helps with your
particular situation is dependent on whether your incidents are
caused by a spike in activity (in which case it might help) or
a cumulative effect of a lot of activity (in which case it will
probably make the situation worse).
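
For concreteness, here is a minimal postgresql.conf sketch of the kind of
change I mean.  The "current" values are only what I recall from your
earlier email, and the lower numbers are a starting point to experiment
with, not something tuned to your workload:

    # what I believe you are running now
    #checkpoint_segments = 256
    #checkpoint_timeout = 30min

    # smaller, more frequent checkpoints (experimental starting point)
    checkpoint_segments = 64             # 64 x 16MB segments, so roughly 1GB
                                         # of WAL between forced checkpoints
    checkpoint_timeout = 10min
    checkpoint_completion_target = 0.9   # spread each checkpoint's writes out

If you do try something like this, watching checkpoints_timed vs.
checkpoints_req in pg_stat_bgwriter will tell you whether checkpoints are
being triggered by the timeout or by running out of segments.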

Another thing that I may be misremembering from your earlier
email: did you say that the load on the database was mostly write
(or was it that the OS graphs were showing mostly writes)?  The
reason I'm asking is that we've seen problems
like you describe when trying to implement a high volume queue
in PostgreSQL: the continuous INSERT/SELECT/DELETE grind on the
single queue table was just more than PostgreSQL could keep up
with.  We moved that one portion of the application to Redis and
everything else just fell in line.  I'm stretching a bit to suppose
that you have a similar problem, but it's another data point for
you to consider.
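
In case it helps you compare, the pattern that hurt us was roughly the
following, hammered concurrently by many workers (the table and column
names here are made up for illustration, not anything from your schema):

    -- producers: every event becomes a row
    INSERT INTO job_queue (payload) VALUES ('...');

    -- consumers: grab the oldest job, do the work, then delete the row
    BEGIN;
    SELECT id, payload FROM job_queue ORDER BY id LIMIT 1 FOR UPDATE;
    -- ... process the job ...
    DELETE FROM job_queue WHERE id = <id fetched above>;
    COMMIT;

The constant insert/delete churn on one small, hot table produces dead
tuples faster than vacuum can clean them up, and the table and its
indexes bloat until everything touching them slows down.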

>
> 2014-06-13 08:13:49 GMT [81383]: [224-1] LOG:  checkpoint complete:
> wrote 11065 buffers (1.4%); 0 transaction log file(s) added, 0
> removed, 56 recycled; write=789.974 s, sync=9.996 s, total=799.987 s;
> sync files=644, longest=2.055 s, average=0.015 s
>
> iostat -mx 1 2 output:
>
> Fri Jun 13 08:13:44 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           20.31    0.00    3.13    2.21    0.00   74.36
>
> Device:    rrqm/s    wrqm/s     r/s       w/s    rMB/s    wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sdb          0.00    501.00    0.00    275.00     0.00     2.68     19.99      1.30    4.72   3.30  90.80
> sdc          0.00   9275.00    0.00  12141.00     0.00    83.52     14.09    142.56   11.75   0.08  99.70
>
> ..
>
> Fri Jun 13 08:13:52 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           19.10    0.00    2.99    0.04    0.00   77.87
>
> Device:    rrqm/s    wrqm/s     r/s       w/s    rMB/s    wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sdb          0.00    226.00    0.00    264.00     0.00     1.40     10.85      0.01    0.05   0.05   1.40
> sdc          0.00    744.00    0.00    302.00     0.00     4.09     27.71      0.03    0.10   0.10   3.00
> sda          0.00    623.00    0.00    508.00     0.00     4.02     16.22      0.04    0.08   0.06   3.10
> dm-0         0.00      0.00    0.00   1030.00     0.00     4.02      8.00      0.11    0.11   0.03   3.30
> dm-1         0.00      0.00    0.00      0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00
>
> Any thoughts on that?
>
> Also, at the moment we're using ext4 as the FS for PostgreSQL.  We
> were looking to switch to XFS as part of our upgrade to PostgreSQL
> 9.3.  What are your thoughts on this?
>
> >> Once transactions stop returning, we see connections pile-up.  Eventually, we reach a max, and clients can no longer connect.
> >
> > You need to lower your max connections and institute connection
> > pooling yesterday. 500 connections and 256MB work mem are a recipe for
> > disaster. While a machine with a db pool in front of it can survive
> > such scenarios, lack of a pool and high work mem are killing your
> > machine.
>
> We have only 360 (max potential) connections coming to the database
> from our application servers (see below for pgbouncer configuration).
> And then a few more connections from our monitoring hosts.  In fact,
> looking at the aggregate pgbouncer active server connections we're
> only peaking at 120 connections under normal load conditions.  During
> an incident I can see it ramp up, with each client using all of its
> available server connections, going up to a total of ~360 connections.
>
> Given that we have 256GB of RAM in our server, is your statement
> about 500 connections @ 256MB work_mem still a concern?  Even,
> say, 400 connections at a work_mem of 256MB?  Why would we not
> want to use all of our RAM?
>
> My calculation shows that we're only using:
>
> shared_buffers + (max_connections * work_mem) +
> (autovacuum_max_workers * maintenance_work_mem)
> = 8.5GB + (500 * 256)MB + (6 * 1)GB
> = 8.5GB + 128GB + 6GB
> = 142.5GB of RAM.
>
> > I recommend pgbouncer. Dirt simple, fast, and will keep your incoming
> > connections limited to something your db can handle.
>
> We already use pgbouncer on our application servers with these
> configuration settings:
>
> pool_mode = transaction
> default_pool_size = 20
> max_client_conn = 125
> query_wait_timeout = 2
> server_idle_timeout = 60
>
> Jaco
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


--
Bill Moran <wmoran@potentialtech.com>

