Thread: Sudden query slowdown on our PostgreSQL server
Hi,

we are currently seeing some strange performance issues on our PostgreSQL database. Our setup currently contains:

- 1 master with 32 GB RAM, 6 x 100 GB SSDs in RAID 10 and two quad-core Intel processors (this one has a failover box; the data volume is shared via DRBD)
- 2 slaves with 16 GB RAM, 6 x 100 GB SAS disks in RAID 10 and one quad-core processor, connected via streaming replication

We currently use PostgreSQL 9.0.6 from Debian Squeeze backports with a 2.6.39-bpo.2 Squeeze backports kernel. All servers use pgbouncer as a connection pooler, which is installed on each box.

In times of higher usage we see some strange issues on the master box: connections start stacking up in an "<IDLE> in transaction" state, and queries get slower and slower when using the master server. We traced the application, which is connected via a private LAN, and could not find any hangups that could cause these states in the database. During this time the load on the master goes up a bit, to around 5-8 (the usual load is around 1 - 1.5), but CPU usage and iowait stay quite low.

19:14:28.654 4838 LOG Stats: 3156 req/s, in 1157187 b/s, out 1457656 b/s, query 6119 us
19:15:28.655 4838 LOG Stats: 3247 req/s, in 1159833 b/s, out 1421552 b/s, query 5025 us
19:16:28.660 4838 LOG Stats: 3045 req/s, in 1096349 b/s, out 1377927 b/s, query 3713 us
19:17:28.680 4838 LOG Stats: 2779 req/s, in 1030783 b/s, out 1343547 b/s, query 11977 us
19:18:28.688 4838 LOG Stats: 1723 req/s, in 664282 b/s, out 789989 b/s, query 67144 us
19:19:28.665 4838 LOG Stats: 1371 req/s, in 472587 b/s, out 622347 b/s, query 48417 us
19:20:28.668 4838 LOG Stats: 2161 req/s, in 748727 b/s, out 995794 b/s, query 2794 us

As you can see in the pgbouncer logs, the query exec time shoots up. We took a close look at locking issues during that time, but we do not see any excessive amount of locking. The issue popped up suddenly; we had times of higher usage before and the PostgreSQL DB handled them without any problems. We also did not recently change anything in this setup. We also took a look at the slow query log during that time, and it did not show anything unusual during the slowdown.

Does anyone have any idea what could cause this issue or how we can further debug it?

Thanks for your input!

Sebastian
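P.S. In case anyone wants to reproduce what we are looking at: a quick sketch of how the stuck sessions can be watched from the shell. This assumes the 9.0 catalog columns (procpid and current_query rather than the newer pid/state), and "ourdb" is just a placeholder database name:

  # age of all sessions currently sitting in "<IDLE> in transaction"
  psql -d ourdb -c "SELECT procpid, usename, now() - xact_start AS xact_age, waiting
                    FROM pg_stat_activity
                    WHERE current_query = '<IDLE> in transaction'
                    ORDER BY xact_start;"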
* Sebastian Melchior (webmaster@mailz.de) wrote:
> Does anyone have any idea what could cause this issue or how we can further debug it?

Are you logging checkpoints? If not, you should; if so, see whether they correlate with the time of the slowdown?

Thanks,

Stephen
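P.S. Turning that on needs only a reload, not a restart; roughly like this (the log path below is just the Debian default for a 9.0 cluster, adjust to your setup):

  # in postgresql.conf:
  #   log_checkpoints = on
  #   log_lock_waits  = on   # also worth having; logs lock waits longer than deadlock_timeout
  psql -U postgres -c "SELECT pg_reload_conf();"
  tail -f /var/log/postgresql/postgresql-9.0-main.log | grep -i checkpoint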
Hi,

yeah, we log those; those times do not match the times of the slowdown at all. Seems to be unrelated.

Sebastian

On 23.03.2012, at 01:47, Stephen Frost wrote:
> Are you logging checkpoints? If not, you should; if so, see whether
> they correlate with the time of the slowdown?
On Thu, Mar 22, 2012 at 10:37 PM, Sebastian Melchior <webmaster@mailz.de> wrote:
> yeah, we log those; those times do not match the times of the slowdown at all. Seems to be unrelated.

I'd suggest the handy troubleshooting tools sar, iostat, vmstat and iotop.
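To be concrete, the sort of sampling I mean, run while the problem is happening (the exact flags below are just a starting point):

  iostat -x 5     # per-device utilisation and await
  vmstat 5        # run queue, memory, swap, system-wide context switches
  sar -u -d 5     # CPU and disk activity over time
  iotop -obd 5    # per-process I/O (needs root)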
Hi,

we already used iostat and iotop during times of the slowdown; there is no sudden drop in I/O workload during the slowdown, and the iowait does not spike either, it stays as before. So I do not think that this is I/O related. As the disks are SSDs, there also still is some "head room" left.

Sebastian

On 23.03.2012, at 05:48, Scott Marlowe wrote:
> I'd suggest the handy troubleshooting tools sar, iostat, vmstat and iotop
On Thu, Mar 22, 2012 at 10:53 PM, Sebastian Melchior <webmaster@mailz.de> wrote:
> we already used iostat and iotop during times of the slowdown; there is no sudden drop in I/O workload during the
> slowdown, and the iowait does not spike either, it stays as before. So I do not think that this is I/O related.
> As the disks are SSDs, there also still is some "head room" left.

What does vmstat say about things like context switches / interrupts per second?
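Something along these lines, sampled during the slowdown, would show it (pidstat comes with the same sysstat package as sar and iostat):

  vmstat 1 30    # "cs" = context switches/s, "in" = interrupts/s
  pidstat -w 5   # voluntary/involuntary context switches per process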
On 2012-03-23 05:53, Sebastian Melchior wrote:
> we already used iostat and iotop during times of the slowdown; there is no sudden drop in I/O workload during the
> slowdown, and the iowait does not spike either, it stays as before. So I do not think that this is I/O related.
> As the disks are SSDs, there also still is some "head room" left.

I've seen an SSD completely lock up for a dozen seconds or so after giving it a smartctl command to trim a section of the disk. I'm not sure if that was the Vertex 2 Pro disk I was testing or the Intel 710, but it was enough reason for us to not mount filesystems with -o discard.

regards,
Yeb
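P.S. A quick way to check whether discard/TRIM is even in play at the filesystem level (this only looks at mount options, nothing controller-specific):

  mount | grep -i discard
  grep -i discard /proc/mounts /etc/fstab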
Hi,

unfortunately we cannot directly control the TRIM (I am not sure it even occurs) because the SSDs are behind an LSI MegaSAS 9260 controller which does not allow access via SMART. Even if some kind of TRIM command is the problem, shouldn't the iowait go up in this case?

Sebastian

On 23.03.2012, at 08:10, Yeb Havinga wrote:
> I've seen an SSD completely lock up for a dozen seconds or so after giving it a smartctl command to trim a section of
> the disk. I'm not sure if that was the Vertex 2 Pro disk I was testing or the Intel 710, but it was enough reason for us
> to not mount filesystems with -o discard.
On Fri, Mar 23, 2012 at 3:52 AM, Sebastian Melchior <webmaster@mailz.de> wrote:
> unfortunately we cannot directly control the TRIM (I am not sure it even occurs) because the SSDs are behind an LSI
> MegaSAS 9260 controller which does not allow access via SMART. Even if some kind of TRIM command is the problem,
> shouldn't the iowait go up in this case?

Based on my recent benchmarking experiences, maybe not. Suppose backend A takes a lock and then blocks on an I/O. Then all the other backends block waiting on the lock. So maybe one backend is stuck in I/O-wait, but on a multi-processor system the percentages are averaged across all CPUs, so it doesn't really look like there's much I/O-wait.

If you have 'perf' available, I've found the following quite helpful:

  perf record -e cs -g -a sleep 30
  perf report -g

Then you can look at the report and find out what's causing PostgreSQL to context-switch out - i.e. block - and therefore find out what lock and call path is contended. LWLocks don't show up in pg_locks, so you can't troubleshoot this sort of contention that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
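For the heavyweight locks that do show up in pg_locks, a cross-check along these lines can still be useful (a rough sketch for 9.0, which joins pg_locks.pid against pg_stat_activity.procpid; the ungranted locks are the interesting ones):

  psql -c "SELECT a.procpid, a.waiting, l.locktype, l.mode, l.granted,
                  now() - a.query_start AS query_age, a.current_query
           FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.procpid
           WHERE NOT l.granted
           ORDER BY query_age DESC;"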