Thread: Slow query and indexes...

Slow query and indexes...

From: "Jonas Henriksen"

Hi,

I'm trying to figure out how to make postgres use my indexes on a table.
This query:
>> explain analyze SELECT max(date_time) FROM data_values;
runs fast and returns:

                                                 QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=0.08..0.09 rows=1 width=0) (actual time=0.108..0.111
rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..0.08 rows=1 width=8) (actual
time=0.090..0.092 rows=1 loops=1)
           ->  Index Scan Backward using
data_values_data_date_time_index on data_values  (cost=0.00..58113.06
rows=765121 width=8) (actual time=0.078..0.078 rows=1 loops=1)
                 Filter: (date_time IS NOT NULL)
 Total runtime: 0.204 ms
(6 rows)

while if I add a GROUP BY data_logger_id, the query uses a seq scan and takes
a lot of time:
>> explain analyze SELECT max(date_time) FROM data_values GROUP BY
data_logger_id;

                                                QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=20171.82..20171.85 rows=3 width=12) (actual
time=3510.500..3510.506 rows=3 loops=1)
   ->  Seq Scan on data_values  (cost=0.00..16346.21 rows=765121
width=12) (actual time=0.039..1598.518 rows=765121 loops=1)
 Total runtime: 3510.634 ms
(3 rows)

The table contains approximately 765000 rows. It has three distinct
data_logger_id's. I can make quick queries on each of them using:
SELECT max(date_time) FROM data_values where data_logger_id=1

I have an index on the date_time field and on the data_logger_id
field, and I have also tried an index on both date_time and
data_logger_id together. Does anyone have any idea what's going on, and
any suggestions for how to speed up my query?


Regards Jonas:)))

I'm using PostgreSQL 8.2.3 on Windows XP.

My table:
CREATE TABLE data_values
(
  data_value_id serial NOT NULL,
  data_type_id integer NOT NULL,
  data_collection_id integer NOT NULL,
  data_logger_id integer NOT NULL,
  date_time timestamp without time zone NOT NULL,
  lat_wgs84 double precision NOT NULL,
  lon_wgs84 double precision NOT NULL,
  height integer NOT NULL,
  parallell integer NOT NULL DEFAULT 0,
  data_value double precision NOT NULL,
  sensor_id integer,
  CONSTRAINT data_values_pkey PRIMARY KEY (data_value_id),
  CONSTRAINT data_values_data_collection_id_fkey FOREIGN KEY
(data_collection_id)
      REFERENCES data_collections (data_collection_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_data_logger_id_fkey FOREIGN KEY (data_logger_id)
      REFERENCES data_loggers (data_logger_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_data_type_id_fkey FOREIGN KEY (data_type_id)
      REFERENCES data_types (data_type_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_sensor_id_fkey FOREIGN KEY (sensor_id)
      REFERENCES sensors (sensor_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT data_values_data_type_id_key UNIQUE (data_type_id,
data_logger_id, date_time, lat_wgs84, lon_wgs84, height, parallell)
);

CREATE INDEX data_values_data_date_time_index
  ON data_values
  USING btree
  (date_time);

CREATE INDEX data_values_data_logger_id_index
  ON data_values
  USING btree
  (data_logger_id);

CREATE INDEX data_values_time_logger_index
  ON data_values
  USING btree
  (data_logger_id, date_time);

Re: Slow query and indexes...

From: Peter Eisentraut

On Monday, 7 May 2007 15:53, Jonas Henriksen wrote:
> while if I add a GROUP BY data_logger the query uses a seq scan and a
> lot of time:
> >> explain analyze SELECT max(date_time) FROM data_values GROUP BY
> data_logger_id;

I don't think there is anything you can do about this.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Slow query and indexes...

From: Jim Nasby

On May 7, 2007, at 8:53 AM, Jonas Henriksen wrote:
> while if I add a GROUP BY data_logger  the query uses a seq scan and a
> lot of time:
>>> explain analyze SELECT max(date_time) FROM data_values GROUP BY
> data_logger_id;

What do you get if you run that with SET enable_seqscan = off; ?
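
Something like this, just for the session -- remember to turn it back on (or
simply close the session) afterwards:

SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT max(date_time) FROM data_values GROUP BY data_logger_id;
SET enable_seqscan = on;
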
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: Slow query and indexes...

From: "Jonas Henriksen"

Well, thanks, but that doesn't help me much.

I've tried adding an extra condition using date_time > (now() - '14
weeks'::interval):

explain analyze
SELECT max(date_time) FROM data_values
where date_time > (now() - '14 weeks'::interval)
 GROUP BY data_logger_id;

HashAggregate  (cost=23264.52..23264.55 rows=2 width=12) (actual
time=1691.447..1691.454 rows=3 loops=1)
  ->  Bitmap Heap Scan on data_values  (cost=7922.08..21787.31
rows=295442 width=12) (actual time=320.643..951.043 rows=298589
loops=1)
        Recheck Cond: (date_time > (now() - '98 days'::interval))
        ->  Bitmap Index Scan on data_values_data_date_time_index
(cost=0.00..7848.22 rows=295442 width=0) (actual time=319.708..319.708
rows=298589 loops=1)
              Index Cond: (date_time > (now() - '98 days'::interval))
Total runtime: 1691.598 ms

However, when I switch to using date_time > (now() - '15 weeks'::interval) I get:
explain analyze
SELECT max(date_time) FROM data_values
where date_time > (now() - '15 weeks'::interval)
 GROUP BY data_logger_id;

HashAggregate  (cost=23798.26..23798.28 rows=2 width=12) (actual
time=3237.816..3237.823 rows=3 loops=1)
  ->  Seq Scan on data_values  (cost=0.00..22084.62 rows=342728
width=12) (actual time=0.037..2409.234 rows=344111 loops=1)
        Filter: (date_time > (now() - '105 days'::interval))
Total runtime: 3237.944 ms

Doing "SET enable_seqscan=off" speeds up the query and forces the use
of the index, but I dont really love that solution...


regards Jonas:))




On 5/7/07, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday, 7 May 2007 15:53, Jonas Henriksen wrote:
> > while if I add a GROUP BY data_logger the query uses a seq scan and a
> > lot of time:
> > >> explain analyze SELECT max(date_time) FROM data_values GROUP BY
> > data_logger_id;
>
> I don't think there is anything you can do about this.
>
> --
> Peter Eisentraut
> http://developer.postgresql.org/~petere/
>

Re: Slow query and indexes...

From: Andrew Kroeger

Jonas Henriksen wrote:

>>> explain analyze SELECT max(date_time) FROM data_values;
> Goes fast and returns:

In prior postgres versions, the planner could not take advantage of
indexes for max() (or min()) calculations.  A workaround for this was
(given an appropriate index) a query like:

select date_time from data_values order by date_time desc limit 1;

The planner in recent versions has been upgraded to recognize this case
and basically apply the same workaround automatically.  This is shown by
the "Index Scan Backward" and "Limit" nodes in the plan you posted.

>>> explain analyze SELECT max(date_time) FROM data_values GROUP BY
> data_logger_id;

I cannot think of a workaround like the one above that would speed this up.
The planner modifications that work in the above case probably don't handle
queries like this in the same way.

> Tha table contains approx 765000 rows. It has three distinct
> data_logger_id's. I can make quick queries on each of them using:
> SELECT max(date_time) FROM data_values where data_logger_id=1

If your 3 distinct data_logger_id values will never change (or if you can
handle code changes if/when they do change), the following might provide
what you are looking for:

select max(date_time) from data_values where data_logger_id=1
union all
select max(date_time) from data_values where data_logger_id=2
union all
select max(date_time) from data_values where data_logger_id=3

If that works for you, you may also be able to eliminate the
(data_logger_id, date_time) index if no other queries need it (i.e. you
added it in an attempt to speed up this specific case).

Hope this helps.

Andrew


Re: Slow query and indexes...

From: "Isak Hansen"

On 5/7/07, Andrew Kroeger <andrew@sprocks.gotdns.com> wrote:
> Jonas Henriksen wrote:
>
> >>> explain analyze SELECT max(date_time) FROM data_values;
> > Goes fast and returns:
>
> In prior postgres versions, the planner could not take advantage of
> indexes with max() (nor min()) calculations.  A workaround to this was
> (given an appropriate index) a query like:
>
> select date_time from data_values order by date_time desc limit 1;
>
> The planner in recent versions has been upgraded to recognize this case
> and basically apply the same workaround automatically.  This is shown by
> the "Index Scan Backward" and "Limit" nodes in the plan you posted.
>
> >>> explain analyze SELECT max(date_time) FROM data_values GROUP BY
> > data_logger_id;
>
> I cannot think of a workaround like above that would speed this up.  The
> planner modifications that work in the above case probably don't handle
> queries like this in the same way.
>
> > Tha table contains approx 765000 rows. It has three distinct
> > data_logger_id's. I can make quick queries on each of them using:
> > SELECT max(date_time) FROM data_values where data_logger_id=1
>
> If your 3 distinct data_logger_id will never change (or if you can
> handle code changes if/when they do change), the following might provide
> what you are looking for:
>
> select max(date_time) from data_values where data_logger_id=1
> union all
> select max(date_time) from data_values where data_logger_id=2
> union all
> select max(date_time) from data_values where data_logger_id=3
>
> If that works for you, you may also be able to eliminate the
> (data_logger_id, date_time) index if no other queries need it (i.e. you
> added it in an attempt to speed up this specific case).

Naive question, but how would an index on (date_time, data_logger_id)
affect things?

Say, coupled with a LIMIT 3 for the case above, or with the date-interval condition.


Isak

>
> Hope this helps.
>
> Andrew
>
>

Re: Slow query and indexes...

From: "Jonas Henriksen"

Thanks for a good answer, I'll try to find a workaround. The number of
data_loggers will change, but not too frequently. I was actually hoping
to make a view showing the latest data for each logger; maybe I can
manage that with a stored procedure thingy...

Regards, Jonas:))


On 5/7/07, Andrew Kroeger <andrew@sprocks.gotdns.com> wrote:
> Jonas Henriksen wrote:
>
> >>> explain analyze SELECT max(date_time) FROM data_values;
> > Goes fast and returns:
>
> In prior postgres versions, the planner could not take advantage of
> indexes with max() (nor min()) calculations.  A workaround to this was
> (given an appropriate index) a query like:
>
> select date_time from data_values order by date_time desc limit 1;
>
> The planner in recent versions has been upgraded to recognize this case
> and basically apply the same workaround automatically.  This is shown by
> the "Index Scan Backward" and "Limit" nodes in the plan you posted.
>
> >>> explain analyze SELECT max(date_time) FROM data_values GROUP BY
> > data_logger_id;
>
> I cannot think of a workaround like above that would speed this up.  The
> planner modifications that work in the above case probably don't handle
> queries like this in the same way.
>
> > Tha table contains approx 765000 rows. It has three distinct
> > data_logger_id's. I can make quick queries on each of them using:
> > SELECT max(date_time) FROM data_values where data_logger_id=1
>
> If your 3 distinct data_logger_id will never change (or if you can
> handle code changes if/when they do change), the following might provide
> what you are looking for:
>
> select max(date_time) from data_values where data_logger_id=1
> union all
> select max(date_time) from data_values where data_logger_id=2
> union all
> select max(date_time) from data_values where data_logger_id=3
>
> If that works for you, you may also be able to eliminate the
> (data_logger_id, date_time) index if no other queries need it (i.e. you
> added it in an attempt to speed up this specific case).
>
> Hope this helps.
>
> Andrew
>
>

Re: Slow query and indexes...

From: PFC

> Thanks for a good answer, I'll try to find a workaround. The number of
> data_loggers will change, but not to frequently. I was actually hoping
> to make a view showing the latest data for each logger, maybe I can
> manage that with a stored procedure thingy...

    - Create a table which contains your list of loggers (since it's good
normalization anyway, you probably have it already) and have your data
table's logger_id REFERENCE it
    - You now have a simple way to get the list of loggers (just select from
the loggers table which will contain 3 rows)
    - Then, to get the most recent record for each logger_id, you do:

SELECT l.logger_id, (SELECT id FROM data d WHERE d.logger_id = l.logger_id
ORDER BY d.logger_id DESC, d.date_time DESC LIMIT 1) AS last_record_id
 FROM loggers l

    2-minute example:

forum_bench=> CREATE TABLE loggers (id SERIAL PRIMARY KEY, name TEXT );
CREATE TABLE

forum_bench=> INSERT INTO loggers (name) VALUES ('logger 1'),('logger
2'),('logger 3');
INSERT 0 3

forum_bench=> CREATE TABLE data (id SERIAL PRIMARY KEY, logger_id INTEGER
NOT NULL REFERENCES loggers( id ));
CREATE TABLE

forum_bench=> INSERT INTO data (logger_id) SELECT 1+floor(random()*3) FROM
generate_series(1,1000000);

forum_bench=> SELECT logger_id, count(*) FROM data GROUP BY logger_id;
  logger_id | count
-----------+--------
          3 | 333058
          2 | 333278
          1 | 333664


NOTE: I use id rather than the timestamp to get the last one.

forum_bench=> EXPLAIN ANALYZE SELECT logger_id, max(id) FROM data GROUP BY
logger_id;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=19166.82..19169.32 rows=200 width=8) (actual
time=1642.556..1642.558 rows=3 loops=1)
    ->  Seq Scan on data  (cost=0.00..14411.88 rows=950988 width=8) (actual
time=0.028..503.308 rows=1000000 loops=1)
  Total runtime: 1642.610 ms

forum_bench=> CREATE INDEX data_by_logger ON data (logger_id, id);
CREATE INDEX

forum_bench=> EXPLAIN ANALYZE SELECT l.id, (SELECT d.id FROM data d WHERE
d.logger_id=l.id ORDER BY d.logger_id DESC, d.id DESC LIMIT 1) FROM
loggers l;
                                                                      QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
  Seq Scan on loggers l  (cost=0.00..3128.51 rows=1160 width=4) (actual
time=0.044..0.074 rows=3 loops=1)
    SubPlan
      ->  Limit  (cost=0.00..2.68 rows=1 width=8) (actual time=0.020..0.020
rows=1 loops=3)
            ->  Index Scan Backward using data_by_logger on data d
(cost=0.00..13391.86 rows=5000 width=8) (actual time=0.018..0.018 rows=1
loops=3)
                  Index Cond: (logger_id = $0)
  Total runtime: 0.113 ms
(6 rows)

forum_bench=> SELECT l.id, (SELECT d.id FROM data d WHERE d.logger_id=l.id
ORDER BY d.logger_id DESC, d.id DESC LIMIT 1) FROM loggers l;
  id | ?column?
----+----------
   1 |   999999
   2 |  1000000
   3 |   999990
(3 rows)
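
Adapted to the schema at the top of this thread (there is already a
data_loggers table behind the foreign key, and the existing
(data_logger_id, date_time) index plays the role of data_by_logger above),
the "latest data for each logger" view Jonas asked about might look roughly
like this -- only a sketch, and the view name is made up here:

CREATE VIEW latest_data_per_logger AS
SELECT l.data_logger_id,
       (SELECT d.date_time
          FROM data_values d
         WHERE d.data_logger_id = l.data_logger_id
         ORDER BY d.data_logger_id DESC, d.date_time DESC
         LIMIT 1) AS last_date_time
  FROM data_loggers l;

Each subselect should collapse to the same backward index scan as in the
example above, so selecting from the view should stay fast as data_values
grows.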

Re: Slow query and indexes...

From: Jim Nasby

On May 8, 2007, at 3:29 AM, PFC wrote:
> Create a table which contains your list of loggers (since it's good
> normalization anyway, you probably have it already) and have your
> data table's logger_id REFERENCE it

BTW, you could do that dynamically with a subselect: (SELECT DISTINCT
logger_id FROM data) AS loggers, though I'm not sure how optimal the
plan would be.
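
Plugged into PFC's query, that might look something like this (a sketch; note
the DISTINCT subselect itself still has to scan data_values, so only the
per-logger lookups benefit from the index):

SELECT loggers.data_logger_id,
       (SELECT d.date_time
          FROM data_values d
         WHERE d.data_logger_id = loggers.data_logger_id
         ORDER BY d.data_logger_id DESC, d.date_time DESC
         LIMIT 1) AS last_date_time
  FROM (SELECT DISTINCT data_logger_id FROM data_values) AS loggers;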

BTW, I encourage you not to use 'id' as a field name; I've found it
makes doing things like joins a lot trickier. It's easier to just give
every id field the same name (logger_id in this case).
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: Slow query and indexes...

From: Jim Nasby

There are other ways to influence the selection of a seqscan, notably
effective_cache_size and random_page_cost.

First, you need to find out at what point a seqscan is actually
faster than an index scan. That's going to be a trial-and-error
search, but if you go back far enough in time the seqscan will
eventually be faster. EXPLAIN ANALYZE has its own overhead, so a
better way to test this is with psql's \timing command, wrapping the
query in a count so you're not shoving a bunch of data across to psql:

SELECT count(*) FROM (... your query goes here ...) a;

(SELECT 1 might work too and would be more accurate)
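
For instance, with the interval query from earlier in the thread (use whatever
interval you are currently probing for the break-even point):

\timing
SELECT count(*) FROM (
    SELECT max(date_time) FROM data_values
    WHERE date_time > (now() - '15 weeks'::interval)
    GROUP BY data_logger_id
) a;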

Once you've found the break-even point, you can tweak all the cost
estimates. Start by making sure that effective_cache_size is set to
approximately how much memory you have. Increasing it will favor an
index scan. Decreasing random_page_cost will also favor an index
scan, though I'd try not to go below 2 and definitely not below 1.
You can also tweak the CPU cost estimates (lower numbers will favor
indexes). But keep in mind that doing that at a system level will
impact every query running on the system. You may have no choice but
to set custom parameters for just this statement; SET LOCAL inside a
transaction wrapping the SELECT is a less painful way to do that.
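
Roughly like this (the random_page_cost value is only a placeholder -- use
whatever your break-even testing suggests):

BEGIN;
SET LOCAL random_page_cost = 2;
SELECT max(date_time) FROM data_values
 WHERE date_time > (now() - '15 weeks'::interval)
 GROUP BY data_logger_id;
COMMIT;  -- the setting reverts automatically when the transaction ends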

On May 7, 2007, at 10:47 AM, Jonas Henriksen wrote:

> Well thanks, but that don't help me much.
>
> I've tried setting an extra condition using datetime>(now() - '14
> weeks'::interval)
>
> explain analyze
> SELECT max(date_time) FROM data_values
> where date_time > (now() - '14 weeks'::interval)
> GROUP BY data_logger_id;
>
> HashAggregate  (cost=23264.52..23264.55 rows=2 width=12) (actual
> time=1691.447..1691.454 rows=3 loops=1)
>  ->  Bitmap Heap Scan on data_values  (cost=7922.08..21787.31
> rows=295442 width=12) (actual time=320.643..951.043 rows=298589
> loops=1)
>        Recheck Cond: (date_time > (now() - '98 days'::interval))
>        ->  Bitmap Index Scan on data_values_data_date_time_index
> (cost=0.00..7848.22 rows=295442 width=0) (actual time=319.708..319.708
> rows=298589 loops=1)
>              Index Cond: (date_time > (now() - '98 days'::interval))
> Total runtime: 1691.598 ms
>
> However, when I switch to using datetime>(now() - '15
> weeks'::interval) I get:
> explain analyze
> SELECT max(date_time) FROM data_values
> where date_time > (now() - '15 weeks'::interval)
> GROUP BY data_logger_id;
>
> HashAggregate  (cost=23798.26..23798.28 rows=2 width=12) (actual
> time=3237.816..3237.823 rows=3 loops=1)
>  ->  Seq Scan on data_values  (cost=0.00..22084.62 rows=342728
> width=12) (actual time=0.037..2409.234 rows=344111 loops=1)
>        Filter: (date_time > (now() - '105 days'::interval))
> Total runtime: 3237.944 ms
>
> Doing "SET enable_seqscan=off" speeds up the query and forces the use
> of the index, but I dont really love that solution...
>
>
> regards Jonas:))
>
>
>
>
> On 5/7/07, Peter Eisentraut <peter_e@gmx.net> wrote:
>> On Monday, 7 May 2007 15:53, Jonas Henriksen wrote:
>> > while if I add a GROUP BY data_logger the query uses a seq scan and a
>> > lot of time:
>> > >> explain analyze SELECT max(date_time) FROM data_values GROUP BY
>> > data_logger_id;
>>
>> I don't think there is anything you can do about this.
>>
>> --
>> Peter Eisentraut
>> http://developer.postgresql.org/~petere/
>>
>

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)