Thread: Slow query due to slow I/O

Slow query due to slow I/O

From
Bryce Covert
Date:

Hi,

I'm seeing a slow-running query. After some optimization with indexes, it appears that the query plan is correct; it's just slow. Running the query a second time, not surprisingly, is very fast, due to OS caching or shared_buffers caching. If the parameters for the query are different, however, the query is slow until it is run a second time. In our usage pattern, this query must run fast the first time it runs.

A couple of general stats: this is a Linode machine with a single 3 GB database and 4 GB of RAM. shared_buffers = 1024MB, effective_cache_size = 2048MB. We are running Postgres 9.1. The machine is otherwise dormant when this query runs. Here is the schema:

                                          Table "public.premiseaccount"
         Column         |           Type           |                                Modifiers
------------------------+--------------------------+-------------------------------------------------------------------------
 id                     | integer                  | not null default nextval('premiseaccount_id_seq'::regclass)
 created                | timestamp with time zone | not null default '2013-09-25 07:00:00+00'::timestamp with time zone
 modified               | timestamp with time zone | not null default '2013-09-25 07:00:00+00'::timestamp with time zone
 account_id             | integer                  | not null
 premise_id             | integer                  | not null
 bucket                 | character varying(255)   |
Indexes:
    "premiseaccount_pkey" PRIMARY KEY, btree (id)
    "premiseaccount_account_id" btree (account_id)
    "premiseaccount_bucket" btree (bucket)
    "premiseaccount_bucket_58c70392619aa36f" btree (bucket, id)
    "premiseaccount_bucket_like" btree (bucket varchar_pattern_ops)
    "premiseaccount_premise_id" btree (premise_id)
Foreign-key constraints:
    "account_id_refs_id_529631edfff28022" FOREIGN KEY (account_id) REFERENCES utilityaccount(id) DEFERRABLE INITIALLY DEFERRED
    "premise_id_refs_id_5ecea2842007328b" FOREIGN KEY (premise_id) REFERENCES premise(id) DEFERRABLE INITIALLY DEFERRED

                                        Table "public.electricusage"
       Column       |           Type           |                               Modifiers
--------------------+--------------------------+------------------------------------------------------------------------
 id                 | integer                  | not null default nextval('electricusage_id_seq'::regclass)
 created            | timestamp with time zone | not null default '2013-09-25 07:00:00+00'::timestamp with time zone
 modified           | timestamp with time zone | not null default '2013-09-25 07:00:00+00'::timestamp with time zone
 from_date          | timestamp with time zone | not null
 to_date            | timestamp with time zone | not null
 usage              | numeric(9,2)             |
 demand             | numeric(9,2)             |
 bill_amount        | numeric(9,2)             |
 premise_account_id | integer                  | not null
Indexes:
    "electricusage_pkey" PRIMARY KEY, btree (id)
    "electricusage_premise_account_id" btree (premise_account_id)
    "electricusage_covered_id_from_date_usage" btree (premise_account_id, from_date, usage)
Foreign-key constraints:
    "premise_account_id_refs_id_4c39e54406369128" FOREIGN KEY (premise_account_id) REFERENCES premiseaccount(id) DEFERRABLE INITIALLY DEFERRED


For reference, premiseaccount has about 1 million rows, one for each account, grouped in buckets with an average of 5000 accounts per bucket. electricusage has 10 million rows, about 10 rows per premiseaccount.

Here is the query:

explain analyze
SELECT premiseaccount.id, SUM(electricusage.usage) AS total_kwh
FROM premiseaccount
LEFT OUTER JOIN electricusage
    ON premiseaccount.id = electricusage.premise_account_id
WHERE premiseaccount.bucket = 'XXX'  AND electricusage.from_date >= '2012-11-20 00:00:00'
GROUP BY premiseaccount.id
HAVING SUM(electricusage.usage) BETWEEN 3284 and 3769
LIMIT 50;


 Limit  (cost=0.00..1987.24 rows=50 width=8) (actual time=931.524..203631.435 rows=50 loops=1)
   ->  GroupAggregate  (cost=0.00..179487.78 rows=4516 width=8) (actual time=931.519..203631.275 rows=50 loops=1)
         Filter: ((sum(electricusage.usage) >= 3284::numeric) AND (sum(electricusage.usage) <= 3769::numeric))
         ->  Nested Loop  (cost=0.00..179056.87 rows=36317 width=8) (actual time=101.934..203450.761 rows=30711 loops=1)
               ->  Index Scan using premiseaccount_bucket_58c70392619aa36f on premiseaccount premiseaccount  (cost=0.00..14882.30 rows=4516 width=4) (actual time=77.620..7199.527 rows=3437 loops=1)
                     Index Cond: ((bucket)::text = 'XXX'::text)
               ->  Index Scan using electricusage_premise_account_id_36bc8999ced10059 on electricusage electricusage  (cost=0.00..36.24 rows=9 width=8) (actual time=8.607..57.055 rows=9 loops=3437)
                     Index Cond: ((premise_account_id = premiseaccount.id) AND (from_date >= '2012-11-20 00:00:00+00'::timestamp with time zone))
 Total runtime: 203631.666 ms

 (see online at: http://explain.depesz.com/s/zeC)

What am I missing here? It seems like this is a relatively straightforward foreign key join, using the correct indexes, that is just slow. Warming the OS cache seems like the only way to make this perform reasonably, but that feels like masking the underlying problem. I could probably do some denormalization, like putting the bucket field on the electricusage table, but that seems like a shoddy solution.

Before running my query, I clear the OS and Postgres caches. I know that in theory some subset of the data will be in the OS cache and the Postgres cache. But in the real world, what we see is that a number of people get timeouts waiting for this query to finish, because the data isn't cached. E.g., running the query with WHERE bucket = 'xxx' will be fast a second time, but WHERE bucket = 'yyy' will not.


Here are things I've tried:
+ Clustering premiseaccount on premiseaccount_bucket_58c70392619aa36f and electricusage on electricusage_covered_id_from_date_usage. This provided a modest improvement (commands below).
+ Fitting the whole DB in memory. Since this query is fast when the data is cached, I tried to fit everything into the cache: bumping the RAM to 8 GB and warming the OS cache by tarring the Postgres data directory. On the 4 GB machine I still hit slow performance, but at 8 GB the entire DB fits in the OS cache. This made a huge improvement, but it will only work for a little while as the database grows.
+ Trying other Linode servers. Same results.
+ Removing the from_date part of the query. It uses a different index but has the same performance characteristics.
+ Looking at the kern, sys, and Postgres logs. Nothing interesting.
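
For reference, the clustering in the first item was done with something like this (reconstructed here from the index names above):

CLUSTER premiseaccount USING premiseaccount_bucket_58c70392619aa36f;
CLUSTER electricusage USING electricusage_covered_id_from_date_usage;
-- re-gather planner statistics after physically reordering the tables
ANALYZE premiseaccount;
ANALYZE electricusage;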


I would really appreciate any help I could get! Thanks!

Bryce

 

Re: Slow query due to slow I/O

From
Claudio Freire
Date:
On Thu, Dec 12, 2013 at 5:10 PM, Bryce Covert
<bryce@brycecovertoperations.com> wrote:
> Hi,
>
> I'm seeing a slow-running query. After some optimization with indexes, it
> appears that the query plan is correct; it's just slow. Running the query
> a second time, not surprisingly, is very fast, due to OS caching or
> shared_buffers caching. If the parameters for the query are different,
> however, the query is slow until it is run a second time. In our usage
> pattern, this query must run fast the first time it runs.
>
> A couple of general stats: this is a Linode machine with a single 3 GB
> database and 4 GB of RAM. shared_buffers = 1024MB, effective_cache_size =
> 2048MB. We are running Postgres 9.1. The machine is otherwise dormant
> when this query runs. Here is the schema:


For this kind of diagnostic, you need to include hardware details.

OS? Disks? RAID?


Re: Slow query due to slow I/O

From
Bryce Covert
Date:

I don't have much info on the disks, since this is a virtual server on Linode. It is running Ubuntu 12.04, with 8 CPUs, 4 GB of RAM, and a 95 GB ext3 volume (noatime). Hopefully that's helpful.

Bryce

Re: Slow query due to slow I/O

From
Claudio Freire
Date:

On Thu, Dec 12, 2013 at 5:20 PM, Bryce Covert <bryce@brycecovertoperations.com> wrote:
I don't have much info on the disks, since this is a virtual server on Linode. It is running Ubuntu 12.04, with 8 CPUs, 4 GB of RAM, and a 95 GB ext3 volume (noatime). Hopefully that's helpful.

Bryce


Well, did you run benchmarks? How many IOPS do you get from the volumes?

Try running "iostat -x -m -d 10" while the slow query is running and pasting the results (or a relevant sample of them).

Also, do run "explain (analyze, buffers)" instead of plain "explain analyze".
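
That is, for the query you posted:

explain (analyze, buffers)
SELECT premiseaccount.id, SUM(electricusage.usage) AS total_kwh
FROM premiseaccount
LEFT OUTER JOIN electricusage
    ON premiseaccount.id = electricusage.premise_account_id
WHERE premiseaccount.bucket = 'XXX' AND electricusage.from_date >= '2012-11-20 00:00:00'
GROUP BY premiseaccount.id
HAVING SUM(electricusage.usage) BETWEEN 3284 AND 3769
LIMIT 50;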

Re: Slow query due to slow I/O

From
Bryce Covert
Date:
Hey Claudio,

Thanks a lot for the help. I'm not familiar with explain buffers, but here are the results:

 Limit  (cost=0.00..648.71 rows=50 width=8) (actual time=653.681..52328.707 rows=50 loops=1)
   Buffers: shared hit=7875 read=9870
   ->  GroupAggregate  (cost=0.00..55672.36 rows=4291 width=8) (actual time=653.671..52328.480 rows=50 loops=1)
         Filter: ((sum(electricusage.usage) >= 3284::numeric) AND (sum(electricusage.usage) <= 3769::numeric))
         Buffers: shared hit=7875 read=9870
         ->  Nested Loop  (cost=0.00..55262.93 rows=34506 width=8) (actual time=432.129..52200.465 rows=30711 loops=1)
               Buffers: shared hit=7875 read=9870
               ->  Index Scan using premiseaccount_bucket_58c70392619aa36f on premiseaccount premiseaccount  (cost=0.00..15433.71 rows=4291 width=4) (actual time=338.160..10014.780 rows=3437 loops=1)
                     Index Cond: ((bucket)::text = '85349_single-family'::text)
                     Buffers: shared hit=744 read=2692
               ->  Index Scan using electricusage_premise_account_id on electricusage electricusage  (cost=0.00..9.17 rows=9 width=8) (actual time=11.430..12.235 rows=9 loops=3437)
                     Index Cond: (premise_account_id = premiseaccount.id)
                     Filter: (from_date >= '2012-11-20 00:00:00+00'::timestamp with time zone)
                     Buffers: shared hit=7131 read=7178
 Total runtime: 52329.028 ms
(15 rows)

and the iostat results...
Linux 3.11.6-x86_64-linode35 (preview-aps-new)     12/12/2013     _x86_64_    (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              6.94    65.68   12.16    7.64     0.36     0.29    67.06     0.44   22.11    3.56   51.66   1.15   2.28
xvdb              0.00     0.00    0.00    0.00     0.00     0.00    10.61     0.00    1.82    1.82    0.00   1.82   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.10    47.10  152.40    7.80     2.20     0.21    30.92     1.36    8.50    8.03   17.53   6.14  98.33
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.20     2.30  212.10    0.70     3.22     0.01    31.09     1.47    6.88    6.90    2.86   4.63  98.62
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.20     8.30  183.20    5.10     2.46     0.05    27.31     1.68    8.85    6.68   86.78   5.24  98.70
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.10     0.00  165.70    0.00     2.36     0.00    29.20     1.46    8.86    8.86    0.00   5.95  98.63
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

I'm not sure how to interpret this, except that 2.5 MB/s seems really slow, and it looks like it is hitting Postgres' cache quite a bit.

Thanks,
Bryce



Re: Slow query due to slow I/O

From
Bryce Covert
Date:
Also, I was reading this: http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.2, and I realized that index-only scans weren't introduced until 9.2. I tried creating a covered index for this, but I don't think it helps in this situation.
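
For reference, the covered index was along these lines (the name is just illustrative):

CREATE INDEX electricusage_covering
    ON electricusage (premise_account_id, from_date, usage);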

Bryce


Re: Slow query due to slow I/O

From
Claudio Freire
Date:

On Thu, Dec 12, 2013 at 6:16 PM, Bryce Covert <bryce@brycecovertoperations.com> wrote:

Thanks a lot for the help. I'm not familiar with explain buffers, but here are the results:

 Limit  (cost=0.00..648.71 rows=50 width=8) (actual time=653.681..52328.707 rows=50 loops=1)
   Buffers: shared hit=7875 read=9870
   ->  GroupAggregate  (cost=0.00..55672.36 rows=4291 width=8) (actual time=653.671..52328.480 rows=50 loops=1)

         Filter: ((sum(electricusage.usage) >= 3284::numeric) AND (sum(electricusage.usage) <= 3769::numeric))
         Buffers: shared hit=7875 read=9870
         ->  Nested Loop  (cost=0.00..55262.93 rows=34506 width=8) (actual time=432.129..52200.465 rows=30711 loops=1)
               Buffers: shared hit=7875 read=9870
               ->  Index Scan using premiseaccount_bucket_58c70392619aa36f on premiseaccount premiseaccount  (cost=0.00..15433.71 rows=4291 width=4) (actual time=338.160..10014.780 rows=3437 loops=1)
                     Index Cond: ((bucket)::text = '85349_single-family'::text)
                     Buffers: shared hit=744 read=2692
               ->  Index Scan using electricusage_premise_account_id on electricusage electricusage  (cost=0.00..9.17 rows=9 width=8) (actual time=11.430..12.235 rows=9 loops=3437)
                     Index Cond: (premise_account_id = premiseaccount.id)
                     Filter: (from_date >= '2012-11-20 00:00:00+00'::timestamp with time zone)
                     Buffers: shared hit=7131 read=7178
 Total runtime: 52329.028 ms
(15 rows)

and the iostat results...
Linux 3.11.6-x86_64-linode35 (preview-aps-new)     12/12/2013     _x86_64_    (8 CPU)

...
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.10    47.10  152.40    7.80     2.20     0.21    30.92     1.36    8.50    8.03   17.53   6.14  98.33
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


This means it's doing random I/O, and that your disk is a single 7200RPM drive (152 r/s being typical for that hardware).

You can improve this by either:

1 - Turning that random I/O pattern into sequential, or
2 - Getting better I/O.

I'll assume 2 isn't available to you on Linode, so for 1, you could try lowering effective_cache_size substantially. It seems you're not getting nearly as much caching as you think (i.e., 2GB). However, I doubt there's a plan that can get you significantly better performance given your hardware.

You may shave a few seconds, though, if you increase work_mem. It seems it should have used a bitmap index scan for at least one of the index scans there, and a low work_mem could be what's limiting the planner's possibilities. What are your settings in that area?
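
You can experiment with both per-session before touching postgresql.conf; the values here are just illustrative:

-- see what the session is currently running with
SHOW work_mem;
SHOW effective_cache_size;
-- then try different values for just this session
SET work_mem = '256MB';
SET effective_cache_size = '512MB';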


Re: Slow query due to slow I/O

From
Bryce Covert
Date:

It's strange that it isn't sequential, at least for electricusage, as I've clustered it using the index that the plan is using.

I had work_mem set to 128MB. I tried bumping it to 1024MB, and I don't see a difference in the query plan.

Do you think upgrading to 9.2 would help much here? With a covered index, I imagine reads would be limited quite a bit.

Thanks again,
Bryce

Re: Slow query due to slow I/O

From
Bryce Covert
Date:
Not sure if this is helpful, but I tried upgrading to 9.2, and here's what I got:

---------
 Limit  (cost=0.00..535.78 rows=50 width=8) (actual time=1037.376..135043.945 rows=50 loops=1)
   Output: premiseaccount.id, (sum(electricusage.usage))
   Buffers: shared hit=4851 read=18718
   ->  GroupAggregate  (cost=0.00..198012.28 rows=18479 width=8) (actual time=1037.369..135043.700 rows=50 loops=1)
         Output: premiseaccount.id, sum(electricusage.usage)
         Filter: ((sum(electricusage.usage) >= 3284::numeric) AND (sum(electricusage.usage) <= 3769::numeric))
         Rows Removed by Filter: 1476
         Buffers: shared hit=4851 read=18718
         ->  Nested Loop  (cost=0.00..196247.46 rows=148764 width=8) (actual time=107.092..134845.231 rows=15188 loops=1)
               Output: premiseaccount.id, electricusage.usage
               Buffers: shared hit=4851 read=18718
               ->  Index Only Scan using premiseaccount_bucket_58c70392619aa36f on public.premiseaccount premiseaccount  (cost=0.00..43135.13 rows=18479 width=4) (actual time=45.368..137.340 rows=1527 loops=1)
                     Output: premiseaccount.bucket, premiseaccount.id
                     Index Cond: (premiseaccount.bucket = '85375_single-family'::text)
                     Heap Fetches: 1527
                     Buffers: shared hit=1 read=685
               ->  Index Scan using electricusage_premise_account_id on public.electricusage electricusage  (cost=0.00..8.20 rows=9 width=8) (actual time=22.306..88.136 rows=10 loops=1527)
                     Output: electricusage.id, electricusage.created, electricusage.modified, electricusage.from_date, electricusage.to_date, electricusage.usage, electricusage.demand, electricusage.bill_amount, electricusage.premise_account_id
                     Index Cond: (electricusage.premise_account_id = premiseaccount.id)
                     Filter: (electricusage.from_date >= '2012-11-20 00:00:00+00'::timestamp with time zone)
                     Rows Removed by Filter: 2
                     Buffers: shared hit=4850 read=18033
 Total runtime: 135044.256 ms
(23 rows)


Looks like it is doing an index-only scan for the first table, but not for the second. I tried creating two indexes that theoretically should make it not have to go to the physical table:
    "electricusage_premise_account_id_36bc8999ced10059" btree (premise_account_id, from_date, usage)
    "ix_covered_2" btree (premise_account_id, from_date DESC, usage, id)

Any idea why it's not using that?

Thanks!
Bryce


Re: Slow query due to slow I/O

From
Claudio Freire
Date:

On Thu, Dec 12, 2013 at 6:49 PM, Bryce Covert <bryce@brycecovertoperations.com> wrote:
It's strange that it isn't sequential, at least for electricusage, as I've clustered it using the index that the plan is using.

electricusage is the inner part of the nested loop, which means it will do ~3000 small scans. That's not sequential no matter how much you cluster. And the join order cannot be reversed (because you're filtering on premiseaccount).

I had work_mem set to 128MB. I tried bumping it to 1024MB, and I don't see a difference in the query plan.

128MB already is abusive enough. If anything, you'd have to lower it.


On Thu, Dec 12, 2013 at 8:04 PM, Bryce Covert <bryce@brycecovertoperations.com> wrote:
Looks like it is doing an index-only scan for the first table, but not for the second. I tried creating two indexes that theoretically should make it not have to go to the physical table:
    "electricusage_premise_account_id_36bc8999ced10059" btree (premise_account_id, from_date, usage)
    "ix_covered_2" btree (premise_account_id, from_date DESC, usage, id)

Any idea why it's not using that?


Index-only scans not only need the covering index, they also need fully visible pages. That takes time to build up.
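
A manual vacuum will build up the visibility map without waiting for autovacuum to get around to it, e.g.:

VACUUM ANALYZE premiseaccount;
VACUUM ANALYZE electricusage;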

If you're still getting poor performance after that, I guess you just have a lousy schema for this query: you're trying to process data that is far too scattered, too fast.

See, your query processes 15k rows, and reads 18k pages. That's as scattered as it gets.

The biggest table you've got there (from the looks of this query) is by far electricusage. You need to cluster that by bucket (since that's your querying criterion), but your schema doesn't allow that. I'm not sure whether it's viable, but if it were, I'd normalize bucket in premiseaccount and de-normalize electricusage to also refer to that bucket directly. That way, you can filter on electricusage, get a bitmap index scan, and live happily ever after (sketch below).
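
A sketch of that de-normalization, if your write patterns allow it (column and index names here are illustrative):

-- carry the bucket on electricusage directly
ALTER TABLE electricusage ADD COLUMN bucket varchar(255);

UPDATE electricusage eu
   SET bucket = pa.bucket
  FROM premiseaccount pa
 WHERE pa.id = eu.premise_account_id;

-- now all usage rows for one bucket sit in one index range
CREATE INDEX electricusage_bucket_from_date
    ON electricusage (bucket, from_date, usage);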

Failing that, create materialized views, assuming your write patterns allow it.
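(Note that CREATE MATERIALIZED VIEW only arrives in 9.3, so on 9.1/9.2 that means a summary table you rebuild yourself, something like the sketch below; a real one would probably need some date bucketing to support your from_date parameter:)

    -- hand-rolled materialized view: rebuilt periodically; names are illustrative
    CREATE TABLE premise_usage_summary AS
    SELECT pa.bucket, pa.id AS premise_account_id, SUM(eu.usage) AS total_usage
      FROM premiseaccount pa
      JOIN electricusage eu ON eu.premise_account_id = pa.id
     GROUP BY pa.bucket, pa.id;
    CREATE INDEX ON premise_usage_summary (bucket);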

And failing that, add more hardware. If Linode doesn't provide it, move somewhere else.

Re: Slow query due to slow I/O

From
Jeff Janes
Date:
On Thu, Dec 12, 2013 at 3:04 PM, Bryce Covert <bryce@brycecovertoperations.com> wrote:
Not sure if this is helpful, but I tried upgrading to 9.2, and here's what I got:

---------
 Limit  (cost=0.00..535.78 rows=50 width=8) (actual time=1037.376..135043.945 rows=50 loops=1)
   Output: premiseaccount.id, (sum(electricusage.usage))
   Buffers: shared hit=4851 read=18718
   ->  GroupAggregate  (cost=0.00..198012.28 rows=18479 width=8) (actual time=1037.369..135043.700 rows=50 loops=1)
         Output: premiseaccount.id, sum(electricusage.usage)

         Filter: ((sum(electricusage.usage) >= 3284::numeric) AND (sum(electricusage.usage) <= 3769::numeric))
         Rows Removed by Filter: 1476
         Buffers: shared hit=4851 read=18718
         ->  Nested Loop  (cost=0.00..196247.46 rows=148764 width=8) (actual time=107.092..134845.231 rows=15188 loops=1)
               Output: premiseaccount.id, electricusage.usage
               Buffers: shared hit=4851 read=18718
               ->  Index Only Scan using premiseaccount_bucket_58c70392619aa36f on public.premiseaccount premiseaccount  (cost=0.00..43135.13 rows=18479 width=4) (actual time=45.368..137.340 rows=1527 loops=1)
                     Output: premiseaccount.bucket, premiseaccount.id
                     Index Cond: (premiseaccount.bucket = '85375_single-family'::text)
                      Heap Fetches: 1527
                      Buffers: shared hit=1 read=685
You had to hit the heap for every row, meaning the index-only feature was useless.  Are you vacuuming enough?  How fast does this table change?  What is relallvisible from pg_class for these tables?
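Something like:

    SELECT relname, relpages, relallvisible
      FROM pg_class
     WHERE relname IN ('premiseaccount', 'electricusage');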

 
               ->  Index Scan using electricusage_premise_account_id on public.electricusage electricusage  (cost=0.00..8.20 rows=9 width=8) (actual time=22.306..88.136 rows=10 loops=1527)
                     Output: electricusage.id, electricusage.created, electricusage.modified, electricusage.from_date, electricusage.to_date, electricusage.usage, electricusage.demand, electricusage.bill_amount, electricusage.premise_account_id
                     Index Cond: (electricusage.premise_account_id = premiseaccount.id)
                     Filter: (electricusage.from_date >= '2012-11-20 00:00:00+00'::timestamp with time zone)
                     Rows Removed by Filter: 2
                     Buffers: shared hit=4850 read=18033
 Total runtime: 135044.256 ms
(23 rows)


Looks like it is doing an index-only scan for the first table, but not for the second. I tried creating two indexes that theoretically should make it not have to go to the physical table:
    "electricusage_premise_account_id_36bc8999ced10059" btree (premise_account_id, from_date, usage)
    "ix_covered_2" btree (premise_account_id, from_date DESC, usage, id)

Any idea why it's not using that?

If the other IOS in this plan is anything to go by, then your table doesn't have enough all-visible pages to make it worthwhile.  So it chooses the smaller index, instead of the bigger one that could theoretically support an IOS.
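(You can compare the on-disk sizes of the candidate indexes directly, e.g.:)

    SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
      FROM pg_class
     WHERE relname IN ('electricusage_premise_account_id',
                       'electricusage_premise_account_id_36bc8999ced10059',
                       'ix_covered_2');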

Cheers,

Jeff

Re: Slow query due to slow I/O

From
Bryce Covert
Date:
It looks like you guys were right. I think vacuum analyzing this made it do an IOS. It seems like materialized views are going to be the best bet. I see how that would allow sequential reading. Thanks!

Bryce

December 12, 2013 1:35 PM

On Thu, Dec 12, 2013 at 6:16 PM, Bryce Covert <bryce@brycecovertoperations.com> wrote:

Thanks a lot for the help. I'm not familiar with explain buffers, but here are the results:

 Limit  (cost=0.00..648.71 rows=50 width=8) (actual time=653.681..52328.707 rows=50 loops=1)
   Buffers: shared hit=7875 read=9870
   ->  GroupAggregate  (cost=0.00..55672.36 rows=4291 width=8) (actual time=653.671..52328.480 rows=50 loops=1)

         Filter: ((sum(electricusage.usage) >= 3284::numeric) AND (sum(electricusage.usage) <= 3769::numeric))
         Buffers: shared hit=7875 read=9870
         ->  Nested Loop  (cost=0.00..55262.93 rows=34506 width=8) (actual time=432.129..52200.465 rows=30711 loops=1)
               Buffers: shared hit=7875 read=9870
               ->  Index Scan using premiseaccount_bucket_58c70392619aa36f on premiseaccount premiseaccount  (cost=0.00..15433.71 rows=4291 width=4) (actual time=338.160..10014.780 rows=3437 loops=1)
                     Index Cond: ((bucket)::text = '85349_single-family'::text)
                     Buffers: shared hit=744 read=2692
               ->  Index Scan using electricusage_premise_account_id on electricusage electricusage  (cost=0.00..9.17 rows=9 width=8) (actual time=11.430..12.235 rows=9 loops=3437)
                     Index Cond: (premise_account_id = premiseaccount.id)
                     Filter: (from_date >= '2012-11-20 00:00:00+00'::timestamp with time zone)
                     Buffers: shared hit=7131 read=7178
 Total runtime: 52329.028 ms
(15 rows)

and the iostat results...
Linux 3.11.6-x86_64-linode35 (preview-aps-new)     12/12/2013     _x86_64_    (8 CPU)

...
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.10    47.10  152.40    7.80     2.20     0.21    30.92     1.36    8.50    8.03   17.53   6.14  98.33
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


This means it's doing random I/O, and that your disk is a single 7200RPM drive. Each random read pays a seek plus rotational latency, which caps a single spindle at very roughly 100-200 reads per second, so 152 r/s is right in that range.

You can improve this by either:

1 - Turning that random I/O pattern into sequential, or
2 - Getting better I/O.

I'll assume 2 isn't available to you on Linode, so for 1, you could try lowering effective_cache_size substantially. It seems you're not getting nearly as much caching as you think (i.e. the 2GB you've configured). However, I doubt there's a plan that can get you significantly better performance given your hardware.

You may shave a few seconds, though, if you increase work_mem. It seems it should have used a bitmap index scan for at least one of those index scans, and a low work_mem could be what's limiting the planner's options. What are your settings in that area?
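For reference, you can check the current values with:

    SHOW work_mem;
    SHOW effective_cache_size;
    SHOW shared_buffers;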


December 12, 2013 1:16 PM
Hey Claudio,

Thanks a lot for the help. I'm not familiar with explain buffers, but here are the results:

 Limit  (cost=0.00..648.71 rows=50 width=8) (actual time=653.681..52328.707 rows=50 loops=1)
   Buffers: shared hit=7875 read=9870
   ->  GroupAggregate  (cost=0.00..55672.36 rows=4291 width=8) (actual time=653.671..52328.480 rows=50 loops=1)
         Filter: ((sum(electricusage.usage) >= 3284::numeric) AND (sum(electricusage.usage) <= 3769::numeric))
         Buffers: shared hit=7875 read=9870
         ->  Nested Loop  (cost=0.00..55262.93 rows=34506 width=8) (actual time=432.129..52200.465 rows=30711 loops=1)
               Buffers: shared hit=7875 read=9870
               ->  Index Scan using premiseaccount_bucket_58c70392619aa36f on premiseaccount premiseaccount  (cost=0.00..15433.71 rows=4291 width=4) (actual time=338.160..10014.780 rows=3437 loops=1)
                     Index Cond: ((bucket)::text = '85349_single-family'::text)
                     Buffers: shared hit=744 read=2692
               ->  Index Scan using electricusage_premise_account_id on electricusage electricusage  (cost=0.00..9.17 rows=9 width=8) (actual time=11.430..12.235 rows=9 loops=3437)
                     Index Cond: (premise_account_id = premiseaccount.id)
                     Filter: (from_date >= '2012-11-20 00:00:00+00'::timestamp with time zone)
                     Buffers: shared hit=7131 read=7178
 Total runtime: 52329.028 ms
(15 rows)

and the iostat results...
Linux 3.11.6-x86_64-linode35 (preview-aps-new)     12/12/2013     _x86_64_    (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              6.94    65.68   12.16    7.64     0.36     0.29    67.06     0.44   22.11    3.56   51.66   1.15   2.28
xvdb              0.00     0.00    0.00    0.00     0.00     0.00    10.61     0.00    1.82    1.82    0.00   1.82   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.10    47.10  152.40    7.80     2.20     0.21    30.92     1.36    8.50    8.03   17.53   6.14  98.33
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.20     2.30  212.10    0.70     3.22     0.01    31.09     1.47    6.88    6.90    2.86   4.63  98.62
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.20     8.30  183.20    5.10     2.46     0.05    27.31     1.68    8.85    6.68   86.78   5.24  98.70
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.10     0.00  165.70    0.00     2.36     0.00    29.20     1.46    8.86    8.86    0.00   5.95  98.63
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

I'm not sure how to interpret this, except that 2.5MB/s seems really slow, and it looks like the query is hitting postgres's cache quite a bit.

Thanks,
Bryce

