Spurious Stalls - Mailing list pgsql-general

From Christopher Nielsen
Subject Spurious Stalls
Date
Msg-id CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com
List pgsql-general

Hi Group,

My team has been very happy using Postgres to host Bitbucket.  Thanks very much for all the community contributions to the platform.

Lately, though, we have been experiencing periods of stalling, about once a day for about a week now.  When Postgres stalls, we unfortunately haven't been able to recover without restarting the database.

This has brought our uptime down (to 99.2%), which we'd like to avoid :(  We'd like to do a better job of keeping things running.

It would be great to get your input on this.  Alternatively, if someone is available as a consultant, that would be great too.

Here is some background on the issue.  During these incidents, we have observed the following symptoms.
  • Running queries do not return.
  • The application sometimes can no longer get new connections.
  • The CPU load increases.
  • There is no I/O wait.
  • There is no swapping.
Our database configuration is attached to this email as postgresql.conf for reference, along with a profile of our hardware and tuning as pg_db_profile.txt.

While the database was unavailable, we also collected a lot of data.  Looking through it, a few things stand out to us that may be problematic or otherwise worth noting.
  • Disk I/O appears to be almost entirely writes, with very few reads.
  • In previous incidents with the same symptoms, we have seen pg processes spending much of their time in s_lock.
  • That info is also attached to this email, as files named perf_*.
Additionally, our monitoring graphs show the following performance profile.

Problem

As the monitoring graphs show, at 11:54 the DB stops returning rows.

Transactions also stop returning, causing the active transaction time to climb continuously.

Consequences of Problem

Once transactions stop returning, we see connections pile up.  Eventually we reach the maximum, and clients can no longer connect.

CPU utilization increases to nearly 100% in user space and stays there until the database is restarted.
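
The pile-up described above could be watched for with a small headroom check. The following is only a sketch under stated assumptions: the function name connection_headroom is hypothetical, the per-state counts are assumed to come from a query like ACTIVITY_SQL (pg_stat_activity has a state column in PostgreSQL 9.2 and later), and reserved stands in for superuser_reserved_connections, whose default is 3. The result is approximate, since superuser sessions also appear in pg_stat_activity.

```python
# SQL one might run (via psql or any driver) to get per-state session counts.
# Assumes PostgreSQL 9.2+, where pg_stat_activity exposes a "state" column.
ACTIVITY_SQL = """
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
"""


def connection_headroom(state_counts, max_connections, reserved=3):
    """Summarize how close the server is to refusing ordinary clients.

    state_counts    -- dict mapping a session state to its count,
                       e.g. the rows returned by ACTIVITY_SQL
    max_connections -- the server's max_connections setting
    reserved        -- slots held back for superusers (assumed default: 3)

    Returns (used, available, fraction_used).
    """
    usable = max_connections - reserved
    if usable <= 0:
        raise ValueError("no connection slots left for ordinary clients")
    used = sum(state_counts.values())
    available = max(usable - used, 0)
    return used, available, used / usable
```

Polling this every few seconds and alerting when fraction_used stays high would give some warning before clients start being refused.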

Events Before Problem

This is likely the most useful part.  As the time approaches 11:54, there are periods of increased latency.  There is also a marked increase in write operations in general.
Lastly, about 10 minutes before the outage, Postgres writes a sustained 30 MB/s of temp files.


After investigating this, we found a query that was greatly exceeding work_mem.  We have since optimized it, and hopefully that will have a positive effect on the above.
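
One way to find which backends spill to disk like this is to set log_temp_files = 0 in postgresql.conf, so that every temporary file is logged, and then total the logged sizes. The sketch below assumes the server log contains lines of the form LOG:  temporary file: path "...", size N, and that the temp file name embeds the backend PID as pgsql_tmp<PID>.<N>. The function name temp_bytes_by_pid is hypothetical, and this is not necessarily how the query above was found.

```python
import re
from collections import defaultdict

# Matches the message PostgreSQL emits when log_temp_files is enabled.
TEMP_FILE_RE = re.compile(
    r'temporary file: path "(?P<path>[^"]+)", size (?P<size>\d+)'
)
# Assumption: temp files are named pgsql_tmp<PID>.<counter>.
PID_RE = re.compile(r"pgsql_tmp(\d+)\.")


def temp_bytes_by_pid(log_lines):
    """Return {backend_pid: total_temp_bytes} for the given log lines.

    Lines that are not temporary-file messages are ignored. If the PID
    cannot be read from the file name, the bytes are counted under None.
    """
    totals = defaultdict(int)
    for line in log_lines:
        match = TEMP_FILE_RE.search(line)
        if match is None:
            continue
        pid_match = PID_RE.search(match.group("path"))
        pid = int(pid_match.group(1)) if pid_match else None
        totals[pid] += int(match.group("size"))
    return dict(totals)
```

Sorting the result by total bytes and matching the top PIDs against the statements logged for those backends should point at the queries that exceed work_mem.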

We may not know until the next issue happens, though.

With a problem like this, I am not sure how to proceed.  I am really looking forward to hearing your thoughts and opinions, if you can share them.

Thanks very much,

-Chris
