Re: Slow queries / commits, mis-configuration or hardware issues? - Mailing list pgsql-performance

From Greg Smith
Subject Re: Slow queries / commits, mis-configuration or hardware issues?
Date
Msg-id 4EC33BA5.2040603@2ndQuadrant.com
In response to Slow queries / commits, mis-configuration or hardware issues?  (Cody Caughlan <toolbag@gmail.com>)
Responses Re: Slow queries / commits, mis-configuration or hardware issues?  ("Tomas Vondra" <tv@fuzzy.cz>)
List pgsql-performance
On 11/14/2011 01:16 PM, Cody Caughlan wrote:
> We're starting to see some slow queries, especially COMMITs that are
> happening more frequently. The slow queries are against seemingly
> well-indexed tables.
> Slow commits like:
>
> 2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG:  duration: 3062.784 ms
>   statement: COMMIT
> 2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG:  duration: 2593.351 ms
>   statement: COMMIT
>
> These slow COMMITs are against tables that received a large number of
> UPDATEs and are growing fairly rapidly.
>

Linux will fill its write cache with all of the writes coming out of
each checkpoint.  With a 16GB instance, I would expect that 5% * 16GB ~=
800MB of writes are batched up when your system is slow.  You should be
able to confirm that by looking at the "Dirty:" line in /proc/meminfo.
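
If you want to watch that while the slowdown is happening, a generic
check like this (nothing here is specific to your setup) shows how much
dirty data the kernel is currently holding:

  watch -n 5 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

If Dirty climbs into the hundreds of MB right before the slow COMMITs
show up, that matches the theory above.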

With 800MB queued up and I/O that is lucky to get 50MB/s, the sync calls
at the end of each checkpoint are sometimes blocking for multiple seconds:

> 2011-11-14 17:38:48 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
> 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8
> recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync
> files=60, longest=1.484 s, average=0.049 s
> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG:  checkpoint starting: time
> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
> 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9
> recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync
> files=82, longest=2.943 s, average=0.050 s
>
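
To put those numbers in perspective, assuming the default 8 kB block
size: 16462 buffers * 8 kB is roughly 130 MB, spread over a ~270 second
write phase, which is well under 1 MB/s on average.  The write phase
itself is gentle; the problem is that the kernel caches most of it, and
the sync phase at the end forces it all out to disk at once, which is
where the 3 to 4 seconds of blocking comes from.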

When an individual sync call gets stuck for that long, clients can
easily get stuck behind it too.  There are a few techniques that might
help (example settings follow the list):

-Switch filesystems if you're running a slow one.  ext3 has bad latency
behavior here; XFS and ext4 are better.
-Lower the dirty_* tunables, like dirty_background_ratio or its bytes
version.  This will reduce average throughput, but can lower latency.
-Spread checkpoints out more, so the same writes happen at a lower
average rate.
-Decrease shared_buffers, so less data gets pushed out at checkpoint
time.
-Reduce your reliability expectations and turn off synchronous_commit.
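
Just as a sketch of the sort of values involved; the exact numbers are
illustrative starting points, not a recommendation tuned for your
workload:

  # Linux write cache (sysctl or /etc/sysctl.conf)
  vm.dirty_background_ratio = 1
  vm.dirty_ratio = 5

  # postgresql.conf: spread checkpoint writes out, shrink the sync burst
  checkpoint_timeout = 15min
  checkpoint_segments = 32
  checkpoint_completion_target = 0.9
  shared_buffers = 2GB           # smaller than usual, to cut checkpoint size
  #synchronous_commit = off      # only if losing recent commits at crash is OK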

Your server is sometimes showing multi-second latency issues with
bonnie++ too, which suggests this problem is not even specific to
PostgreSQL.  Linux is hard to tune for low latency under all
circumstances; fighting latency down under a heavy update workload is
hard even with good hardware to accelerate write performance.  In an
EC2 environment, it may not even be possible without making trade-offs
like disabling synchronous writes.  I can easily get transactions hung
for 10 to 15 seconds on one of their servers if I try to make that
problem bad; you're only seeing the middle range of latency issues so
far.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

