Thread: Slow queries on 9.3.1 despite use of index
I'm trying to get to the bottom of a performance issue on a server running PostgreSQL 9.3.1 on CentOS 5. The machine is a dual quad-core Xeon E5620 with 24GB ECC RAM and four enterprise SATA Seagate Constellation ES drives configured as two software RAID1 volumes. The main DB is on one volume; some of the more heavily used indexes and the WAL logs are on the other.

Overall, the load doesn't appear to be heavy, but we're still getting slow queries. For example, here's an extract from the log:

2014-04-28 16:51:02.904 GMT 25998: LOG: checkpoint starting: time
2014-04-28 16:53:37.204 GMT 30053: LOG: duration: 1067.464 ms execute <unnamed>: select "id" from "ProductSupplier" where "product"='25553848082928'
2014-04-28 16:54:12.701 GMT 30053: LOG: duration: 1105.844 ms execute <unnamed>: select "id" from "ProductSupplier" where "product"='1626407082928'
2014-04-28 16:54:46.789 GMT 30053: LOG: duration: 1060.585 ms execute <unnamed>: select "id","updated","ean","frontImagePresent","backImagePresent","jacketScanned","x80ImagePresent","x120ImagePresent","x200ImagePresent","x400ImagePresent","type","brand","manProductCode","name","series","subtitle","keywords","mass","length","width","thickness","releaseDate","originalReleaseDate","firstAvailableDate","stockLevel","stockCost","url","ratingTotal","ratingCount","beta","available","reorderSource","reorderPoint","reorderQuantity","status","replacement","countryOfOrigin","formType","formDetail","formDetail1","formDetail2","formDetail3","formDetail4","contentType","packagingType","ebookType","mixedMedia","mediaCount","shortAnnotationType","longAnnotationType","wikipediaTopic","allMusicId","allMovieId","allGameId","imdbId","boost","salesMovingAverage","movingAverageDate","lifetimeSales","shippingRequirement","shippingSubsidyCalculator","edition","dewey","lcc","lccn","pages","minutes","bicSubjects","bisacSubjects","regionEncoding","audioFormat","videoFormat","platform","audit" from "Product" where id='6686838082928'
2014-04-28 16:55:58.058 GMT 30053: LOG: duration: 1309.192 ms execute <unnamed>: select "id" from "ProductCategory" where "product"='1932872082928'
2014-04-28 16:56:06.019 GMT 25998: LOG: checkpoint complete: wrote 6647 buffers (2.5%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=298.862 s, sync=4.072 s, total=303.115 s; sync files=155, longest=0.234 s, average=0.026 s

Although these tables (ProductSupplier, Product, ProductCategory) are large (millions of rows), all of these queries are served via indexes, for example:

=# explain (analyze, buffers) select "id" from "ProductSupplier" where "product"='25553848082928';
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using "i433275-productIndex" on "ProductSupplier"  (cost=0.57..15.81 rows=3 width=8) (actual time=0.069..0.105 rows=6 loops=1)
   Index Cond: (product = 25553848082928::bigint)
   Buffers: shared hit=10
 Total runtime: 0.138 ms
(4 rows)

So I would expect these to be really quick, and > 1 second seems really slow.
Overall the load on the server seems quite low; for example, typical vmstat 1 output is:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff    cache   si   so    bi    bo    in    cs us sy id wa st
 0  1    304  77740  11960 17709156    0    0    99    82     2     2  2  1 89  8  0
 1  0    304  75228  11960 17711164    0    0  1256   635  1418  6498  0  0 94  6  0
 0  1    304  73440  11968 17712036    0    0  1232   149  1253  6232  1  0 94  6  0
 0  2    304  78472  11968 17706016    0    0  1760    89  1325  6361  1  0 94  5  0
 0  1    304  75616  11968 17708480    0    0  2164    72  1371  6855  1  0 94  5  0
 0  1    304  73292  11968 17710320    0    0  1760   112  1335  6673  1  0 94  5  0
 0  2    304  77956  11964 17703528    0    0  1204  5614  1867  6712  0  0 94  6  0

And iostat also seems okay:

      sda            sdb            sdc            sdd            md0            md1       cpu
 kps tps svc_t   kps tps svc_t  kps tps svc_t  kps tps svc_t  kps tps svc_t  kps tps svc_t  us sy wt id
  619  64   5.4  1478  96   8.9 1494  98   8.7  542  60   5.5 1998 139   0.0  804  96   0.0   2  1  8 89
 1147 151   3.7  5328  88   8.0 5388  90   8.3 1147 145   3.0 5638 171   0.0 2245 284   0.0   0  0  4 95
  865 110   4.6   214  24  12.6  214  24  13.1  885 113   4.5  252  28   0.0 1749 222   0.0   4  0  1 95
  937 107   5.0    40   5  10.0   40   5  13.0  821  93   6.4   80  10   0.0 1596 193   0.0   9  0  1 90
 1206 154   3.2     0   0   0.0    0   0   0.0 1195 153   3.2    0   0   0.0 2401 307   0.0   6  0  0 94
 1111 139   3.8    24   3  10.0   16   2  12.0 1115 144   3.8   40   5   0.0 2141 274   0.0  10  0  0 89
 1222 156   3.6   101  12   4.8  101  12   4.2 1171 150   4.2   93  10   0.0 1986 252   0.0  10  0  0 90
  739  98   5.5     0   0   0.0    0   0   0.0  687  93   5.2    0   0   0.0 1425 191   0.0   7  0  0 93
  775 102   5.2     8   1  13.0    8   1  21.0  755  99   5.1   16   2   0.0 1529 201   0.0  10  0  0 89
  780 105   5.3    16   2  13.0   16   2  11.0  784 103   5.4   32   4   0.0 1555 205   0.0   6  0  1 92
  573  75   6.8   110  32   1.3  110  31   1.1  561  73   7.6  102  30   0.0 1109 143   0.0  10  1  0 89
  639  84   7.3  2833 349   2.1 3085 367   2.0  683  90   6.0 4854 592   0.0 1026 135   0.0   2  0  4 93
  510  70   7.8  2020 240   4.2 1808 227   4.2  586  81   7.2   40   5   0.0 1077 146   0.0   0  0  7 93
  538  75   7.8    24   3 290.0   20   3 239.7  582  81   7.5   40   5   0.0 1066 147   0.0   0  0  6 94
  504  69   9.3   132  17  25.8  128  16  14.4  600  81   8.2  256  32   0.0 1083 144   0.0   1  0  6 94

I've tried to optimise postgresql.conf for performance:

max_connections = 1000                  # (change requires restart)
shared_buffers = 2GB                    # min 128kB or max_connections*16kB
work_mem = 100MB                        # min 64kB
maintenance_work_mem = 100MB            # min 1MB
synchronous_commit = off                # immediate fsync at commit
wal_buffers = 16MB                      # min 32kB
checkpoint_segments = 64                # in logfile segments, min 1, 16MB each
checkpoint_timeout = 10min              # range 30s-1h
effective_cache_size = 16GB
logging_collector = on                  # Enable capturing of stderr and csvlog
log_directory = 'pg_log'                # directory where log files are written,
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'  # log file name pattern,
log_rotation_age = 1d                   # Automatic rotation of logfiles will
log_min_duration_statement = 1000       # -1 is disabled, 0 logs all statements
log_checkpoints = on
log_line_prefix = '%m %p: '             # special values:
autovacuum = on                         # Enable autovacuum subprocess?  'on'
autovacuum_vacuum_threshold = 10000     # min number of row updates before
autovacuum_analyze_threshold = 1000     # min number of row updates before
autovacuum_vacuum_scale_factor = 0.1    # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.05  # fraction of table size before analyze
autovacuum_vacuum_cost_delay = 5        # default vacuum cost delay for
datestyle = 'iso, dmy'
default_text_search_config = 'pg_catalog.english'
backslash_quote = on                    # on, off, or safe_encoding
standard_conforming_strings = off

If anyone can see anything amiss in the above, I'd be grateful for any suggestions...

Thanks, Michael.
Michael van Rooyen <michael@loot.co.za> writes:
> I'm trying to get to the bottom of a performance issue on a server
> running PostgreSQL 9.3.1 on Centos 5.

Hm ... it seems pretty suspicious that all of these examples take just about exactly 1 second longer than you might expect. I'm wondering if there is something sitting on an exclusive table lock somewhere, and releasing it after 1 second.

In particular, this looks quite a bit like the old behavior of autovacuum when it was trying to truncate empty pages off the end of a relation --- it would hold off other accesses to the table until deadlock_timeout elapsed, whereupon it'd get kicked off the exclusive lock (and have to retry the truncation next time). Are you *sure* this server is running 9.3.1, and not something pre-9.3?

			regards, tom lane
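[Editorial note: one way to test the exclusive-lock theory above is to look for ungranted lock requests from another session while a stall is in progress. This is a sketch against the stock pg_locks view, not something from the thread itself:]

```sql
-- Any lock requests currently waiting? Run in a second session
-- during one of the slow periods; an empty result weakens the theory.
SELECT locktype, relation::regclass AS relation, mode, pid
FROM pg_locks
WHERE NOT granted;
```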
On Mon, Apr 28, 2014 at 10:12 AM, Michael van Rooyen <michael@loot.co.za> wrote:
I'm trying to get to the bottom of a performance issue on a server running PostgreSQL 9.3.1 on Centos 5. The machine is a dual quad-core Xeon E5620 with 24GB ECC RAM and four enterprise SATA Seagate Constellation ES drives configured as two software RAID1 volumes. The main DB is on one volume; some of the more heavily used indexes and the WAL logs are on the other.
Overall, the load doesn't appear to be heavy, but we're still getting slow queries, for example, here's an extract from the log:
2014-04-28 16:51:02.904 GMT 25998: LOG: checkpoint starting: time
2014-04-28 16:53:37.204 GMT 30053: LOG: duration: 1067.464 ms execute <unnamed>: select "id" from "ProductSupplier" where "product"='25553848082928'
2014-04-28 16:54:12.701 GMT 30053: LOG: duration: 1105.844 ms execute <unnamed>: select "id" from "ProductSupplier" where "product"='1626407082928'
2014-04-28 16:54:46.789 GMT 30053: LOG: duration: 1060.585 ms execute <unnamed>: select
...
2014-04-28 16:55:58.058 GMT 30053: LOG: duration: 1309.192 ms execute <unnamed>: select "id" from "ProductCategory" where "product"='1932872082928'
2014-04-28 16:56:06.019 GMT 25998: LOG: checkpoint complete: wrote 6647 buffers (2.5%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=298.862 s, sync=4.072 s, total=303.115 s; sync files=155, longest=0.234 s, average=0.026 s
It looks like something is causing your IO to seize up briefly. It is common for the sync phase of the checkpoint to do that, but that would only explain 3 of the 4 reports above.
Is this causing an actual problem for your users, or are you just trying to be proactive?
You could change the kernel setting dirty_background_bytes to try to reduce this problem.
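[Editorial note: as a sketch, that change is an /etc/sysctl.conf fragment like the one below. The 16MB value matches what is tried later in this thread; treat it as a starting point, not a recommendation:]

```
# Start background writeback once ~16MB of dirty pages accumulate,
# instead of waiting for dirty_background_ratio percent of RAM.
# Setting the *_bytes form overrides the corresponding *_ratio knob.
vm.dirty_background_bytes = 16777216
```

Apply with `sysctl -p`, or `sysctl -w vm.dirty_background_bytes=16777216` for a one-off test.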
Overall the load on the server seems quite low, for example, typical vmstat 1 is:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 304 77740 11960 17709156 0 0 99 82 2 2 2 1 89 8 0
1 0 304 75228 11960 17711164 0 0 1256 635 1418 6498 0 0 94 6 0
0 1 304 73440 11968 17712036 0 0 1232 149 1253 6232 1 0 94 6 0
0 2 304 78472 11968 17706016 0 0 1760 89 1325 6361 1 0 94 5 0
0 1 304 75616 11968 17708480 0 0 2164 72 1371 6855 1 0 94 5 0
0 1 304 73292 11968 17710320 0 0 1760 112 1335 6673 1 0 94 5 0
0 2 304 77956 11964 17703528 0 0 1204 5614 1867 6712 0 0 94 6 0
Is that typical for when the problem is not occurring, or for when it is occurring? Without timestamps to correlate the vmstat output back to the log file, it is very hard to make use of this info. Some versions of vmstat have a -t flag.
I've tried to optimise postgresql.conf for performance:
max_connections = 1000 # (change requires restart)
1000 is extremely high. How many connections do you actually use at any one time?
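[Editorial note: a quick way to answer that question, sketched against the standard statistics views (the `state` column exists from 9.2 onward):]

```sql
-- How many backends are connected, and how many are actually busy?
SELECT count(*) AS total_connections,
       count(CASE WHEN state = 'active' THEN 1 END) AS active
FROM pg_stat_activity;
```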
shared_buffers = 2GB # min 128kB or max_connections*16kB
work_mem = 100MB # min 64kB
100MB is also very high, at least in conjunction with the high max_connections.
maintenance_work_mem = 100MB # min 1MB
synchronous_commit = off # immediate fsync at commit
wal_buffers = 16MB # min 32kB
checkpoint_segments = 64 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 10min # range 30s-1h
effective_cache_size = 16GB
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
log_rotation_age = 1d # Automatic rotation of logfiles will
log_min_duration_statement = 1000 # -1 is disabled, 0 logs all statements
I would lower this. You can see that a few statements were just over 1000 ms, but you can't tell whether there are a lot at 800 ms, or whether you have a bimodal distribution with most statements taking 1 ms and a few taking 1200 ms.
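[Editorial note: one quick way to see that distribution is to bucket the logged durations. This sketch runs against two embedded sample lines; in practice, replace the printf with `cat pg_log/postgresql-*.log`:]

```shell
# Bucket "duration: N ms" lines from the PostgreSQL log into 100 ms
# bins to see the shape of the distribution (long tail vs. bimodal).
printf '%s\n' \
  '2014-04-28 16:53:37.204 GMT 30053: LOG: duration: 1067.464 ms execute <unnamed>: select ...' \
  '2014-04-28 16:54:12.701 GMT 30053: LOG: duration: 105.844 ms execute <unnamed>: select ...' |
grep ' duration: ' |
awk '{ for (i = 1; i < NF; i++)
         if ($i == "duration:") print int($(i+1) / 100) * 100 }' |
sort -n | uniq -c    # count of statements per 100 ms bucket
```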
Cheers,
Jeff
On 2014/04/28 07:50 PM, Tom Lane wrote:
> Michael van Rooyen <michael@loot.co.za> writes:
>> I'm trying to get to the bottom of a performance issue on a server
>> running PostgreSQL 9.3.1 on Centos 5.
> Hm ... it seems pretty suspicious that all of these examples take just
> about exactly 1 second longer than you might expect. I'm wondering
> if there is something sitting on an exclusive table lock somewhere,
> and releasing it after 1 second.

I do have log_min_duration_statement = 1000, which may cause this.

> In particular, this looks quite a bit like the old behavior of autovacuum
> when it was trying to truncate empty pages off the end of a relation ---
> it would hold off other accesses to the table until deadlock_timeout
> elapsed, whereupon it'd get kicked off the exclusive lock (and have to
> retry the truncation next time). Are you *sure* this server is running
> 9.3.1, and not something pre-9.3?

Definitely 9.3.1. The strange thing is that I have other servers with similar configurations and load, running the same database, where performance is great, so it's hard for me to know what's different here. Maybe I'm expecting too much from these SATA drives, or it's time to add lots of RAM...
Michael van Rooyen <michael@loot.co.za> writes:
> On 2014/04/28 07:50 PM, Tom Lane wrote:
>> Hm ... it seems pretty suspicious that all of these examples take just
>> about exactly 1 second longer than you might expect. I'm wondering
>> if there is something sitting on an exclusive table lock somewhere,
>> and releasing it after 1 second.
> I do have log_min_duration_statement = 1000, which may cause this.

Ah, I overlooked that detail. Never mind that theory then. Although it might still be worth turning on log_lock_waits for awhile, just to eliminate the possibility of lock-induced delays.

			regards, tom lane
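[Editorial note: that diagnostic amounts to a small postgresql.conf change. A sketch; 1s is the default deadlock_timeout, which doubles as the lock-wait logging threshold:]

```
log_lock_waits = on      # log whenever a statement waits longer than deadlock_timeout
deadlock_timeout = 1s    # default value, shown here for clarity
```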
On 2014/04/28 07:52 PM, Jeff Janes wrote:
> On Mon, Apr 28, 2014 at 10:12 AM, Michael van Rooyen
> <michael@loot.co.za> wrote:
>
> It looks like something is causing your IO to seize up briefly. It is
> common for the sync phase of the checkpoint to do that, but that would
> only explain 3 of the 4 reports above.
>
> Is this causing an actual problem for your users, or are you just
> trying to be proactive?
>
> You could change the kernel setting dirty_background_bytes to try to
> reduce this problem.

The problem is that this server runs background tasks very slowly (about 10x slower than a similar server with the same DB but 3x more RAM). I changed dirty_background_bytes to 16M; previously, dirty_background_ratio was 10%. No real effect on the DB performance, but it seems a good change anyway. Thanks for the tip.

> > Overall the load on the server seems quite low, for example,
> > typical vmstat 1 is: [vmstat output snipped]
>
> Is that typical for when the problem is not occurring, or for when it
> is occurring? Without timestamps to correlate the vmstat output back
> to the log file, it is very hard to make use of this info. Some
> versions of vmstat have a -t flag.
It's fairly typical - and although the same underlying query will sometimes complete faster or slower, the overall performance/throughput is consistently (as opposed to sporadically) poor.

> > max_connections = 1000 # (change requires restart)
>
> 1000 is extremely high. How many connections do you actually use at
> any one time?
>
> > shared_buffers = 2GB # min 128kB or max_connections*16kB
> > work_mem = 100MB # min 64kB
>
> 100MB is also very high, at least in conjunction with the high
> max_connections.

Blush. Thanks - I've reduced these to more reasonable values (200 / 10MB), but it didn't have any effect on performance.

> > maintenance_work_mem = 100MB # min 1MB
> > synchronous_commit = off # immediate fsync at commit
> > wal_buffers = 16MB # min 32kB
> > checkpoint_segments = 64 # in logfile segments, min 1, 16MB each
> > checkpoint_timeout = 10min # range 30s-1h
> > effective_cache_size = 16GB
> > logging_collector = on # Enable capturing of stderr and csvlog
> > log_directory = 'pg_log' # directory where log files are written,
> > log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
> > log_rotation_age = 1d # Automatic rotation of logfiles will
> > log_min_duration_statement = 1000 # -1 is disabled, 0 logs all statements
>
> I would lower this. You can see that a few statements were just over
> 1000 ms, but can't tell if there are a lot at 800 ms, or if you have a
> bimodal distribution with most being 1 ms and a few being 1200 ms.

I lowered it to 100ms, and taking the same query in my original post over the last few hours, the times vary from 100ms to just over 1s. It looks like an exponential distribution with most durations close to 100ms. I am becoming increasingly sure that I'm just up against the limitations of the SATA disks given the load profile on this particular server. Maybe it's time to reassess the load, or install an SSD or lots of RAM...
> Cheers,
>
> Jeff