Re: Cannot reproduce why a query is slow - Mailing list pgsql-general

From John Cheng
Subject Re: Cannot reproduce why a query is slow
Date
Msg-id CAJzZBAQn+7NU9WYv1pFJSSBxnJhkb934a0O45KUoxAbDhuh4Tw@mail.gmail.com
In response to Re: Cannot reproduce why a query is slow  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-general
Sorry about the long wait between replies.

On Sat, May 7, 2011 at 2:36 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Resource contention usually means there's too much I/O, so the query is slow,
but when you try it later the drives are idle and the query runs much faster.
Run some monitoring, e.g. even a simple 'iostat -x' or 'dstat' output
might be sufficient. Once the slow query happens, try to correlate it to
the CPU / disk activity.

I am unsure if disk I/O could be a part of the problem. iostat shows a max of ~2000 kB/second written. This query gets slow during a batch insert/update process (every 5 minutes or so, we batch-process a list of 20 or so XMLs, i.e., "leads"). During the batch processing, iostat shows pretty low usage.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.38    0.00    0.12    0.62    0.00   94.88

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.99  0.00  5.94     0.00    55.45     9.33     0.10   16.17   6.17   3.66
sdb               0.00   172.28  0.00 11.88     0.00  1473.27   124.00     0.03    2.42   1.08   1.29
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.35    0.00    0.37    0.37    0.00   90.90

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00   621.00  0.00 23.00     0.00  5152.00   224.00     0.08    3.52   1.22   2.80
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.62    0.00    0.12    1.88    0.00   88.38

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00    73.00  0.00  9.00     0.00   656.00    72.89     0.01    1.33   1.00   0.90
sdc               0.00   200.00 20.00 129.00   304.00  2632.00    19.70     1.22    8.16   0.68  10.20
sdd               0.00     0.00  9.00  0.00   120.00     0.00    13.33     0.04    4.67   4.56   4.10

The 'atop' command shows low disk activity (~2%), but I'm not sure how accurate atop is. If I run a similar query from the command line while the batch job is running, I still get a ~50 ms execution time.
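Since a point-in-time iostat sample can miss a transient spike, one option is to leave a timestamped sampling loop running across several batch cycles and line the log up afterward against slow entries in the Postgres log (via log_min_duration_statement). A sketch, assuming sysstat's iostat is installed; the log path and sample count are arbitrary:

```shell
# Timestamp each iostat sample so disk activity can later be correlated
# with slow-query log entries. Hypothetical path; in practice the loop
# would run many more samples than three.
LOG=/tmp/iostat-watch.log
for i in $(seq 1 3); do
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') ===" >> "$LOG"
    iostat -x 1 1 >> "$LOG" 2>&1 || echo "iostat not available" >> "$LOG"
done
```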

CPU usage does go up a lot during this time. There are 8 cores, and one of them shoots up to ~97%. The process updates XML records in the database and performs dozens of XPath-based queries to pull data out of the XML into relational tables. These extractions are performed with PL/pgSQL triggers and Postgres xpath() functions, hence the spike in CPU usage during that time.
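For reference, the extraction triggers look roughly like the following (a hypothetical sketch, not our actual schema; table, column, and path names are invented):

```sql
-- Sketch of an xpath-extraction trigger in the 8.3 style. Each xpath()
-- call re-parses the XML document, so dozens of calls per row are
-- CPU-heavy, which matches the single-core spike we see.
CREATE OR REPLACE FUNCTION extract_lead_fields() RETURNS trigger AS $$
BEGIN
    NEW.first_name := (xpath('/lead/contact/first_name/text()',
                             NEW.xml_data))[1]::text;
    NEW.last_name  := (xpath('/lead/contact/last_name/text()',
                             NEW.xml_data))[1]::text;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lead_extract BEFORE INSERT OR UPDATE ON leads
    FOR EACH ROW EXECUTE PROCEDURE extract_lead_fields();
```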
 

The unexpected plan change is usually caused by autovacuum/autoanalyze
collecting skewed data for some reason, resulting in bad plan choice.
Then the autovacuum runs again and you get different (much better) plan.
This can be detected using the auto_explain contrib module, as someone
already recommended.

We are using Postgres 8.3, and I don't believe the auto_explain contrib module is available for 8.3.
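For anyone on 8.4 or later reading this thread, auto_explain ships with contrib there and can be tried per-session without touching postgresql.conf; per the 8.4 docs it is loaded like so (the 100 ms threshold is just an example value):

```sql
-- 8.4+ only; auto_explain does not exist in 8.3's contrib.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 100;  -- log plans for statements over 100 ms
```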
 

> The query is a bit complex, as it is generated by code, but I tried to
> format it for easier reading. I've also replaced actual data with fake
> data to protect personal information.

I generally do recommend using explain.depesz.com to post explain plans,
especially in case of complex queries. I've posted your query and this
is the result

 http://explain.depesz.com/s/gJO

Not sure if it's relevant to your issue (probably not), but the bitmap
index scans are significantly overestimated. Not sure if the overlap
operator affects the estimate accuracy ...

BTW, what PostgreSQL version is this? How large is the database, how much
RAM is available? What is the size of shared_buffers?

I am loving explain.depesz.com!

We are using Postgres 8.3. This system has 16GB of RAM. The database is 114GB. shared_buffers is set to 2000MB. A possibly related note: the 'effective_cache_size' setting. The previous admin set it to 8000MB. The free command shows 13.8GB cached and 397MB in buffers. I wonder if raising effective_cache_size could improve query planning.
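Since effective_cache_size is only a planner costing hint (it allocates nothing), it can be changed per-session to see whether the plan improves before touching postgresql.conf. A sketch; the 12GB figure is an assumption based on the ~13.8GB that 'free' reports as cached:

```sql
-- Try a higher effective_cache_size for this session only and compare
-- the resulting plan against the one produced under 8000MB.
SHOW effective_cache_size;
SET effective_cache_size = '12GB';
EXPLAIN SELECT ...;  -- the slow query in question
```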
