inconsistent/weird index usage - Mailing list pgsql-performance

From:       Dustin Sallings
Subject:    inconsistent/weird index usage
Date:
Msg-id:     6FA9AB99-1373-11D9-A87D-000A957659CC@spy.net
Responses:  Re: inconsistent/weird index usage
            Re: inconsistent/weird index usage
            Re: inconsistent/weird index usage
List:       pgsql-performance
To save some time, let me start by saying:

    PostgreSQL 7.4.3 on powerpc-apple-darwin7.4.0, compiled by GCC
    gcc (GCC) 3.3 20030304 (Apple Computer, Inc. build 1640)

OK, now on to details...

I'm trying to implement oracle style ``partitions'' in postgres.  I've
run into my first snag on what should be a fairly quick query.
Basically, I started with the following schema and split the
``samples'' table into one table for each year (1999-2004).

-- BEGIN SCHEMA

create table sensor_types (
    sensor_type_id serial,
    sensor_type text not null,
    units varchar(10) not null,
    primary key(sensor_type_id)
);

create table sensors (
    sensor_id serial,
    sensor_type_id integer not null,
    serial char(16) not null,
    name text not null,
    low smallint not null,
    high smallint not null,
    active boolean default true,
    primary key(sensor_id),
    foreign key(sensor_type_id) references sensor_types(sensor_type_id)
);
create unique index sensors_byserial on sensors(serial);

create table samples (
    ts datetime not null,
    sensor_id integer not null,
    sample float not null,
    foreign key(sensor_id) references sensors(sensor_id)
);
create index samples_bytime on samples(ts);
create unique index samples_bytimeid on samples(ts, sensor_id);

-- END SCHEMA

Each samples_[year] table looks, and is indexed, exactly like the
samples table above; the indexes were created with the following
commands:

create index samples_1999_bytime on samples_1999(ts);
create index samples_2000_bytime on samples_2000(ts);
create index samples_2001_bytime on samples_2001(ts);
create index samples_2002_bytime on samples_2002(ts);
create index samples_2003_bytime on samples_2003(ts);
create index samples_2004_bytime on samples_2004(ts);

create unique index samples_1999_bytimeid on samples_1999(ts, sensor_id);
create unique index samples_2000_bytimeid on samples_2000(ts, sensor_id);
create unique index samples_2001_bytimeid on samples_2001(ts, sensor_id);
create unique index samples_2002_bytimeid on samples_2002(ts, sensor_id);
create unique index samples_2003_bytimeid on samples_2003(ts, sensor_id);
create unique index samples_2004_bytimeid on samples_2004(ts, sensor_id);

The tables contain the following numbers of rows:

    samples_1999     311030
    samples_2000    2142245
    samples_2001    2706571
    samples_2002    3111602
    samples_2003    3149316
    samples_2004    2375972

The following view creates the illusion of the old ``single-table''
model:

create view samples as
        select * from samples_1999
  union select * from samples_2000
  union select * from samples_2001
  union select * from samples_2002
  union select * from samples_2003
  union select * from samples_2004;

...along with the following rule on the view for the applications
performing inserts:

create rule sample_rule as on insert to samples
    do instead
        insert into samples_2004 (ts, sensor_id, sample)
            values(new.ts, new.sensor_id, new.sample);
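(As an aside, here is a minimal usage sketch of that rule; it is not
from the original post, and the literal values are invented for
illustration.)

-- An insert aimed at the samples view is rewritten by sample_rule
-- into an insert on samples_2004; the ts/sensor_id/sample values
-- here are made-up examples.
insert into samples (ts, sensor_id, sample)
    values (current_timestamp, 1, 68.5);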
OK, now that that's over with, I have one particular query that I
attempt to run for a report from my phone that no longer works,
because it tries to do a table scan on *some* of the tables.  Why it
chooses this table scan, I can't imagine.  The query is as follows:

select s.serial as serial_num, s.name as name, date(ts) as day,
        min(sample) as min_temp, avg(sample) as avg_temp,
        stddev(sample) as stddev_temp, max(sample) as max_temp
    from samples inner join sensors s using (sensor_id)
    where ts > current_date - 7
    group by serial_num, name, day
    order by serial_num, day desc

explain analyze reports the following (sorry for the horrible
wrapping):

 Sort  (cost=1185281.45..1185285.95 rows=1800 width=50) (actual time=82832.106..82832.147 rows=56 loops=1)
   Sort Key: s.serial, date(samples.ts)
   ->  HashAggregate  (cost=1185161.62..1185184.12 rows=1800 width=50) (actual time=82830.624..82831.601 rows=56 loops=1)
         ->  Hash Join  (cost=1063980.21..1181539.96 rows=206952 width=50) (actual time=80408.123..81688.590 rows=66389 loops=1)
               Hash Cond: ("outer".sensor_id = "inner".sensor_id)
               ->  Subquery Scan samples  (cost=1063979.10..1155957.38 rows=4598914 width=20) (actual time=80392.477..80922.764 rows=66389 loops=1)
                     ->  Unique  (cost=1063979.10..1109968.24 rows=4598914 width=20) (actual time=80392.451..80646.761 rows=66389 loops=1)
                           ->  Sort  (cost=1063979.10..1075476.39 rows=4598914 width=20) (actual time=80392.437..80442.787 rows=66389 loops=1)
                                 Sort Key: ts, sensor_id, sample
                                 ->  Append  (cost=0.00..312023.46 rows=4598914 width=20) (actual time=79014.428..80148.396 rows=66389 loops=1)
                                       ->  Subquery Scan "*SELECT* 1"  (cost=0.00..9239.37 rows=103677 width=20) (actual time=4010.181..4010.181 rows=0 loops=1)
                                             ->  Seq Scan on samples_1999  (cost=0.00..8202.60 rows=103677 width=20) (actual time=4010.165..4010.165 rows=0 loops=1)
                                                   Filter: (ts > ((('now'::text)::date - 7))::timestamp without time zone)
                                       ->  Subquery Scan "*SELECT* 2"  (cost=0.00..28646.17 rows=714082 width=20) (actual time=44.827..44.827 rows=0 loops=1)
                                             ->  Index Scan using samples_2000_bytime on samples_2000  (cost=0.00..21505.35 rows=714082 width=20) (actual time=44.818..44.818 rows=0 loops=1)
                                                   Index Cond: (ts > ((('now'::text)::date - 7))::timestamp without time zone)
                                       ->  Subquery Scan "*SELECT* 3"  (cost=0.00..80393.33 rows=902191 width=20) (actual time=34772.377..34772.377 rows=0 loops=1)
                                             ->  Seq Scan on samples_2001  (cost=0.00..71371.42 rows=902191 width=20) (actual time=34772.366..34772.366 rows=0 loops=1)
                                                   Filter: (ts > ((('now'::text)::date - 7))::timestamp without time zone)
                                       ->  Subquery Scan "*SELECT* 4"  (cost=0.00..92424.05 rows=1037201 width=20) (actual time=40072.103..40072.103 rows=0 loops=1)
                                             ->  Seq Scan on samples_2002  (cost=0.00..82052.04 rows=1037201 width=20) (actual time=40072.090..40072.090 rows=0 loops=1)
                                                   Filter: (ts > ((('now'::text)::date - 7))::timestamp without time zone)
                                       ->  Subquery Scan "*SELECT* 5"  (cost=0.00..42380.58 rows=1049772 width=20) (actual time=49.455..49.455 rows=0 loops=1)
                                             ->  Index Scan using samples_2003_bytime on samples_2003  (cost=0.00..31882.86 rows=1049772 width=20) (actual time=49.448..49.448 rows=0 loops=1)
                                                   Index Cond: (ts > ((('now'::text)::date - 7))::timestamp without time zone)
                                       ->  Subquery Scan "*SELECT* 6"  (cost=0.00..58939.96 rows=791991 width=20) (actual time=65.458..1124.363 rows=66389 loops=1)
                                             ->  Index Scan using samples_2004_bytime on samples_2004  (cost=0.00..51020.05 rows=791991 width=20) (actual time=65.430..750.336 rows=66389 loops=1)
                                                   Index Cond: (ts > ((('now'::text)::date - 7))::timestamp without time zone)
               ->  Hash  (cost=1.09..1.09 rows=9 width=38) (actual time=15.295..15.295 rows=0 loops=1)
                     ->  Seq Scan on sensors s  (cost=0.00..1.09 rows=9 width=38) (actual time=15.122..15.187 rows=9 loops=1)
 Total runtime: 82865.119 ms

Essentially, what you can see here is that it's doing an index scan on
samples_2000, samples_2003, and samples_2004, but a sequential scan on
samples_1999, samples_2001, and samples_2002.
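(One way to compare what the planner does for each table is to run the
same date filter against a single year's table in isolation; this is a
sketch of my own, not from the original post.)

-- Sketch: examine the plan for one partition by itself; samples_2002
-- is one of the tables that got a sequential scan above.
explain analyze
select * from samples_2002
    where ts > current_date - 7;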
It's very strange to me that it would make these choices.  If I
disable sequential scans altogether for this session (see the sketch
below), the query runs in under 4 seconds.

This is a very cool solution for long-term storage, and isn't terribly
hard to manage.  I actually have other report queries that seem to be
making pretty good index selection currently...but I always want
more!  :)

Does anyone have any suggestions as to how to get this to do what I
want?  Of course, ideally, it would ignore five of the tables
altogether.  :)
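(For reference, the session-level test described above would look
something like the following; enable_seqscan is a standard planner
setting, and the query is the one shown earlier, but the exact
commands are my reconstruction.)

-- Discourage sequential scans for the current session only, then
-- re-run the report query; per the above, it then finishes in under
-- 4 seconds.
set enable_seqscan = off;
explain analyze
select s.serial as serial_num, s.name as name, date(ts) as day,
        min(sample) as min_temp, avg(sample) as avg_temp,
        stddev(sample) as stddev_temp, max(sample) as max_temp
    from samples inner join sensors s using (sensor_id)
    where ts > current_date - 7
    group by serial_num, name, day
    order by serial_num, day desc;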
--
SPY                      My girlfriend asked me which one I like better.
pub 1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint = 87 02 57 08 02 D0 DA D6 C8 0F 3E 65 51 98 D8 BE
L_______________________ I hope the answer won't upset her. ____________