
From: Jaco Engelbrecht
Subject: Re: Spurious Stalls
Msg-id: CAEorrh8=SZTt_FjUicFyqQskP3_4kCLarW05aHcQ4+Hms+YURw@mail.gmail.com
In response to: Re: Spurious Stalls (Scott Marlowe <scott.marlowe@gmail.com>)
List: pgsql-general
Hi Scott,

On 13 June 2014 00:30, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Just the db and not the server right? Have you tried killing any
> individual queries that seem to be problem / long running queries?

We have tried killing individual queries before, but it did not have any effect.
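
For reference, it was roughly along these lines from psql (12345 stands in for the offending backend's pid; on 9.1 the pg_stat_activity column is procpid rather than pid):

-- ask the backend to cancel its current query first
SELECT pg_cancel_backend(12345);
-- if that has no effect, terminate the backend entirely
SELECT pg_terminate_backend(12345);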

>> During this performance issue, we found the following symptoms.
>> Running queries do not return.
>
> Even simple ones like "select 1;" ?
> Or ones that are something like "select * from smalltable where id=123; ?

SELECTs against pg_stat_activity and pg_stat_user_tables do work.

I haven't run a SELECT against the affected database during an outage
for a while now, and I can't remember the result offhand, so I'll add
that to our script for next time.

> How much memory does your machine have? How much of that is shared buf
> then? If it's got 8G and 6G shared_buffers I'd lower shared_buffers.
> etc

256GB RAM.  FWIW, we used to have shared_buffers set to 8GB, but found
two mailing list posts suggesting that lowering this value fixed
similar issues at those sites, so we lowered it to 6GB.  We have still
experienced two incidents since.

> What is your IO subsystem like? Is it a single 7200 RPM SATA drive, an
> array of 16 15krpm disks under a caching raid controller? A 1TB
> FusionIO card?

A Dell PERC H700 RAID card, utilizing the CacheCade functionality (SSD
read-cache) for PostgreSQL data only.

2x 146GB 15k SAS disks in RAID 1 for OS
2x 50GB SSD disks in RAID 1 for CacheCade
2x 146GB 15k SAS disks in RAID 1 for PostgreSQL transaction logs
10x 146GB 15k SAS disks in RAID1+0 for PostgreSQL data

> Just because linux SAYS there's wait doesn't mean there isn't one.
> Sometimes the io wait numbers are zero while some kernel daemon uses
> up all the bandwidth and you really are waiting.

OK - any ideas on how to further investigate/validate this?
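
For next time I'm planning to capture something like the following alongside iostat, to catch kernel writeback that doesn't show up as per-process iowait (pidstat is from the sysstat package):

# per-process (and kernel thread) disk I/O, once per second
pidstat -d 1

# how much dirty data the kernel is holding / writing back
grep -E 'Dirty|Writeback' /proc/meminfo

# current writeback thresholds
sysctl vm.dirty_ratio vm.dirty_background_ratio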

> Your charts show 35MB/s write. If that's all random, that's a LOT of
> writes on anything but SSDs and even then it's a fair bit for a busy
> db server if it's being done by 100 or more processes, which is what
> this looks like.

It would appear that the specific writes in the graph were due to a
poorly written SQL query that used more than 256MB of work_mem.  We've
since fixed that query, and I now see only a few (<5) temp files being
created per day, and not close to when we have an incident.
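
In case it helps correlate, we could also have PostgreSQL log every temp file it creates (assuming log_temp_files isn't already set):

# postgresql.conf
log_temp_files = 0    # 0 = log every temp file; otherwise the value is a size threshold in kB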

Having said that, before an incident (again this morning) I can see
high disk latency (23,000 ms) on the /pg_data disks (sdc), and the
temp_files graph is now empty during incidents.

This coincides with a checkpoint:

2014-06-13 08:13:49 GMT [81383]: [224-1] LOG:  checkpoint complete:
wrote 11065 buffers (1.4%); 0 transaction log file(s) added, 0
removed, 56 recycled; write=789.974 s, sync=9.996 s, total=799.987 s;
sync files=644, longest=2.055 s, average=0.015 s

iostat -mx 1 2 output:

Fri Jun 13 08:13:44 UTC 2014
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.31    0.00    3.13    2.21    0.00   74.36

Device:   rrqm/s   wrqm/s     r/s       w/s    rMB/s    wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb         0.00   501.00    0.00    275.00     0.00     2.68     19.99      1.30    4.72   3.30  90.80
sdc         0.00  9275.00    0.00  12141.00     0.00    83.52     14.09    142.56   11.75   0.08  99.70

..

Fri Jun 13 08:13:52 UTC 2014
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19.10    0.00    2.99    0.04    0.00   77.87

Device:   rrqm/s   wrqm/s     r/s       w/s    rMB/s    wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb         0.00   226.00    0.00    264.00     0.00     1.40     10.85      0.01    0.05   0.05   1.40
sdc         0.00   744.00    0.00    302.00     0.00     4.09     27.71      0.03    0.10   0.10   3.00
sda         0.00   623.00    0.00    508.00     0.00     4.02     16.22      0.04    0.08   0.06   3.10
dm-0        0.00     0.00    0.00   1030.00     0.00     4.02      8.00      0.11    0.11   0.03   3.30
dm-1        0.00     0.00    0.00      0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00

Any thoughts on that?
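
If the checkpoint turns out to be the trigger, these are the settings I'd expect we would look at; the values below are purely illustrative, not a recommendation:

# postgresql.conf -- checkpoint visibility and spreading
log_checkpoints = on
checkpoint_timeout = 15min
checkpoint_segments = 64               # pre-9.5 name; sized so timed checkpoints dominate
checkpoint_completion_target = 0.9     # spread the write phase over more of the interval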

Also, at the moment we're using ext4 as the filesystem for PostgreSQL.
We were looking to switch to XFS as part of our upgrade to PostgreSQL
9.3.  What are your thoughts on this?

>> Once transactions stop returning, we see connections pile-up.  Eventually, we reach a max, and clients can no longer connect.
>
> You need to lower your max connections and institute connection
> pooling yesterday. 500 connections and 256MB work mem are a recipe for
> disaster. While a machine with a db pool in front of it can survive
> such scenarios, lack of a pool and high work mem are killing your
> machine.

We have only 360 (max potential) connections coming to the database
from our application servers (see below for the pgbouncer
configuration), plus a few more connections from our monitoring hosts.
In fact, looking at the aggregate pgbouncer active server connections,
we only peak at around 120 connections under normal load.  During an
incident I can see it ramp up, with each client host using all of its
available server connections, up to a total of ~360.

Given that we have 256GB of RAM in the server, is your concern about
500 connections @ 256MB work_mem still applicable?  Even at, say, 400
connections with a work_mem of 256MB?  Why would we not want to use
all of our RAM?

My calculation shows that we're only using:

shared_buffers + (max_connections * work_mem) +
(autovacuum_max_workers * maintenance_work_mem)
= 8.5GB + (500 * 256MB) + (6 * 1GB)
= 8.5GB + 128GB + 6GB
= 142.5GB RAM.
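
A quick way to double-check the inputs to that calculation on the server itself:

SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
                'max_connections', 'autovacuum_max_workers');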

> I recommend pgbouncer. Dirt simple, fast, and will keep your incoming
> connections limited to something your db can handle.

We already use pgbouncer on our application servers with these
configuration settings:

pool_mode = transaction
default_pool_size = 20
max_client_conn = 125
query_wait_timeout = 2
server_idle_timeout = 60
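
If it would help, we could also cap the server-side pool per database in each pgbouncer instance, something along these lines (hypothetical host/dbname, and the pool_size value is only for illustration):

[databases]
; caps each pool for this database at 10 server connections on this
; pgbouncer instance (overrides default_pool_size)
appdb = host=10.0.0.5 port=5432 dbname=appdb pool_size=10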

Jaco

