Thread: Sudden query slowdown on our PostgreSQL server
Hi,

we are currently seeing some strange performance issues on our PostgreSQL database. Our setup currently contains:

- 1 master with 32 GB RAM, 6 x 100 GB SSDs in RAID 10 and two quad-core Intel processors (this one has a failover box; the data volume is shared via DRBD)
- 2 slaves with 16 GB RAM, 6 x 100 GB SAS disks in RAID 10 and one quad-core processor, connected via streaming replication

We currently use PostgreSQL 9.0.6 from Debian Squeeze backports with a 2.6.39-bpo.2 Squeeze backports kernel. All servers use pgbouncer as a connection pooler, which is installed on each box.

In times of higher usage we see some strange issues on the master box: connections start stacking up in an "<IDLE> in transaction" state, and queries get slower and slower when using the master server. We traced the application, which is connected via a private LAN, and could not find any hangups that could cause these states in the database. During this time the load on the master goes up a bit, to around 5-8 (the usual load is around 1 - 1.5), but CPU usage and iowait stay quite low.

19:14:28.654 4838 LOG Stats: 3156 req/s, in 1157187 b/s, out 1457656 b/s, query 6119 us
19:15:28.655 4838 LOG Stats: 3247 req/s, in 1159833 b/s, out 1421552 b/s, query 5025 us
19:16:28.660 4838 LOG Stats: 3045 req/s, in 1096349 b/s, out 1377927 b/s, query 3713 us
19:17:28.680 4838 LOG Stats: 2779 req/s, in 1030783 b/s, out 1343547 b/s, query 11977 us
19:18:28.688 4838 LOG Stats: 1723 req/s, in 664282 b/s, out 789989 b/s, query 67144 us
19:19:28.665 4838 LOG Stats: 1371 req/s, in 472587 b/s, out 622347 b/s, query 48417 us
19:20:28.668 4838 LOG Stats: 2161 req/s, in 748727 b/s, out 995794 b/s, query 2794 us

As you can see in the pgbouncer logs, the query exec time shoots up. We took a close look at locking issues during that time, but we do not see any excessive amount of locking. The issue popped up suddenly; we had times of higher usage before and the PostgreSQL DB handled them without any problems. We also did not recently change anything in this setup. We also took a look at the slow query log during that time, and it did not show anything unusual during the slowdown.

Does anyone have any idea what could cause this issue or how we can further debug it?

Thanks for your input!

Sebastian
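P.S. In case anyone wants to reproduce what we are looking at: a quick sketch of how the stuck sessions can be watched from the shell. This assumes the 9.0 catalog columns (procpid and current_query rather than the newer pid/state), and "ourdb" is just a placeholder database name:

  # age of all sessions currently sitting in "<IDLE> in transaction"
  psql -d ourdb -c "SELECT procpid, usename, now() - xact_start AS xact_age, waiting
                    FROM pg_stat_activity
                    WHERE current_query = '<IDLE> in transaction'
                    ORDER BY xact_start;"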
* Sebastian Melchior (webmaster@mailz.de) wrote:
> Does anyone have any idea what could cause this issue or how we can further debug it?

Are you logging checkpoints? If not, you should; if so, see whether they correlate with the time of the slowdown?

Thanks,

Stephen
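P.S. Turning that on needs only a reload, not a restart; roughly like this (the log path below is just the Debian default for a 9.0 cluster, adjust to your setup):

  # in postgresql.conf:
  #   log_checkpoints = on
  #   log_lock_waits  = on   # also worth having; logs lock waits longer than deadlock_timeout
  psql -U postgres -c "SELECT pg_reload_conf();"
  tail -f /var/log/postgresql/postgresql-9.0-main.log | grep -i checkpoint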
Hi,

yeah, we log those; those times do not match the times of the slowdown at all. Seems to be unrelated.

Sebastian

On 23.03.2012, at 01:47, Stephen Frost wrote:
> Are you logging checkpoints? If not, you should; if so, see whether
> they correlate with the time of the slowdown?
On Thu, Mar 22, 2012 at 10:37 PM, Sebastian Melchior <webmaster@mailz.de> wrote:
> yeah, we log those; those times do not match the times of the slowdown at all. Seems to be unrelated.

I'd suggest the handy troubleshooting tools sar, iostat, vmstat and iotop.
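To be concrete, the sort of sampling I mean, run while the problem is happening (the exact flags below are just a starting point):

  iostat -x 5     # per-device utilisation and await
  vmstat 5        # run queue, memory, swap, system-wide context switches
  sar -u -d 5     # CPU and disk activity over time
  iotop -obd 5    # per-process I/O (needs root)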
Hi,

we already used iostat and iotop during times of the slowdown; there is no sudden drop in I/O workload during the slowdown, and the iowait does not spike either, it stays as before. So I do not think that this is I/O related. As the disks are SSDs, there also still is some "head room" left.

Sebastian

On 23.03.2012, at 05:48, Scott Marlowe wrote:
> I'd suggest the handy troubleshooting tools sar, iostat, vmstat and iotop
On Thu, Mar 22, 2012 at 10:53 PM, Sebastian Melchior <webmaster@mailz.de> wrote:
> we already used iostat and iotop during times of the slowdown; there is no sudden drop in I/O workload during the
> slowdown, and the iowait does not spike either, it stays as before. So I do not think that this is I/O related.
> As the disks are SSDs, there also still is some "head room" left.

What does vmstat say about things like context switches / interrupts per second?
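Something along these lines, sampled during the slowdown, would show it (pidstat comes with the same sysstat package as sar and iostat):

  vmstat 1 30    # "cs" = context switches/s, "in" = interrupts/s
  pidstat -w 5   # voluntary/involuntary context switches per process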
On 2012-03-23 05:53, Sebastian Melchior wrote:
> we already used iostat and iotop during times of the slowdown; there is no sudden drop in I/O workload during the
> slowdown, and the iowait does not spike either, it stays as before. So I do not think that this is I/O related.
> As the disks are SSDs, there also still is some "head room" left.

I've seen an SSD completely lock up for a dozen seconds or so after giving it a smartctl command to trim a section of the disk. I'm not sure if that was the Vertex 2 Pro disk I was testing or the Intel 710, but it was enough reason for us to not mount filesystems with -o discard.

regards,
Yeb
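P.S. A quick way to check whether discard/TRIM is even in play at the filesystem level (this only looks at mount options, nothing controller-specific):

  mount | grep -i discard
  grep -i discard /proc/mounts /etc/fstab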
Hi,

unfortunately we cannot directly control the TRIM (I am not sure it even occurs) because the SSDs are behind an LSI MegaSAS 9260 controller which does not allow access via SMART. Even if some kind of TRIM command is the problem, shouldn't the iowait go up in this case?

Sebastian

On 23.03.2012, at 08:10, Yeb Havinga wrote:
> I've seen an SSD completely lock up for a dozen seconds or so after giving it a smartctl command to trim a section of
> the disk. I'm not sure if that was the Vertex 2 Pro disk I was testing or the Intel 710, but it was enough reason for us
> to not mount filesystems with -o discard.
On Fri, Mar 23, 2012 at 3:52 AM, Sebastian Melchior <webmaster@mailz.de> wrote:
> unfortunately we cannot directly control the TRIM (I am not sure it even occurs) because the SSDs are behind an LSI
> MegaSAS 9260 controller which does not allow access via SMART. Even if some kind of TRIM command is the problem,
> shouldn't the iowait go up in this case?

Based on my recent benchmarking experiences, maybe not. Suppose backend A takes a lock and then blocks on an I/O. Then all the other backends block waiting on the lock. So maybe one backend is stuck in I/O-wait, but on a multi-processor system the percentages are averaged across all CPUs, so it doesn't really look like there's much I/O-wait.

If you have 'perf' available, I've found the following quite helpful:

  perf record -e cs -g -a sleep 30
  perf report -g

Then you can look at the report and find out what's causing PostgreSQL to context-switch out - i.e. block - and therefore find out what lock and call path is contended. LWLocks don't show up in pg_locks, so you can't troubleshoot this sort of contention that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
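For the heavyweight locks that do show up in pg_locks, a cross-check along these lines can still be useful (a rough sketch for 9.0, which joins pg_locks.pid against pg_stat_activity.procpid; the ungranted locks are the interesting ones):

  psql -c "SELECT a.procpid, a.waiting, l.locktype, l.mode, l.granted,
                  now() - a.query_start AS query_age, a.current_query
           FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.procpid
           WHERE NOT l.granted
           ORDER BY query_age DESC;"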