Re: Confirmation of bad query plan generated by 7.4 - Mailing list pgsql-performance
From | Jim C. Nasby |
---|---|
Subject | Re: Confirmation of bad query plan generated by 7.4 |
Date | |
Msg-id | 20060613223950.GK34196@pervasive.com Whole thread Raw |
In response to | Re: Confirmation of bad query plan generated by 7.4 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Confirmation of bad query plan generated by 7.4
|
List | pgsql-performance |
On Tue, Jun 13, 2006 at 06:04:42PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > Also, I'm not sure that the behavior is entirely changed, either. On a > > 8.1.4 database I'm still seeing a difference between now() - interval > > and a hard-coded date. > > It'd depend on the context, possibly, but it's easy to show that the > current planner does fold "now() - interval_constant" when making > estimates. Simple example: Turns out the difference is between feeding a date vs a timestamp into the query... I would have thought that since date is a date that the WHERE clause would be casted to a date if it was a timestamptz, but I guess not... stats=# explain select * from email_contrib where project_id=8 and date >= now()-'15 days'::interval; QUERY PLAN ---------------------------------------------------------------------------------------------------------- Index Scan using email_contrib__project_date on email_contrib (cost=0.01..45405.83 rows=14225 width=24) Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval))) (2 rows) stats=# explain select * from email_contrib where project_id=8 AND date >= '2006-05-29 22:09:56.814897+00'::date; QUERY PLAN ---------------------------------------------------------------------------------------------------------- Index Scan using email_contrib__project_date on email_contrib (cost=0.00..48951.74 rows=15336 width=24) Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date)) (2 rows) stats=# explain select * from email_contrib where project_id=8 AND date >= '2006-05-29 22:09:56.814897+00'::timestamp; QUERY PLAN ---------------------------------------------------------------------------------------------------------- Index Scan using email_contrib__project_date on email_contrib (cost=0.00..45472.76 rows=14246 width=24) Index Cond: ((project_id = 8) AND (date >= '2006-05-29 22:09:56.814897'::timestamp without time zone)) (2 rows) Actual row count is 109071; reason for the vast difference is querying on two columns. I know comming up with general-purpose multicolumn stats is extremely difficult, but can't we at least add histograms for multi-column indexes?? In this case that would most likely make the estimate dead-on, because there's an index on project_id, date. Details below for the morbidly curious/bored... stats=# \d email_contrib Table "public.email_contrib" Column | Type | Modifiers ------------+---------+----------- project_id | integer | not null id | integer | not null date | date | not null team_id | integer | work_units | bigint | not null Indexes: "email_contrib_pkey" PRIMARY KEY, btree (project_id, id, date), tablespace "raid10" "email_contrib__pk24" btree (id, date) WHERE project_id = 24, tablespace "raid10" "email_contrib__pk25" btree (id, date) WHERE project_id = 25, tablespace "raid10" "email_contrib__pk8" btree (id, date) WHERE project_id = 8, tablespace "raid10" "email_contrib__project_date" btree (project_id, date), tablespace "raid10" "email_contrib__project_id" btree (project_id), tablespace "raid10" "email_contrib__team_id" btree (team_id), tablespace "raid10" Foreign-key constraints: "fk_email_contrib__id" FOREIGN KEY (id) REFERENCES stats_participant(id) ON UPDATE CASCADE "fk_email_contrib__team_id" FOREIGN KEY (team_id) REFERENCES stats_team(team) ON UPDATE CASCADE Tablespace: "raid10" stats=# explain analyze select * from email_contrib where project_id=8 and date >= now()-'15 days'::interval; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using email_contrib__project_date on email_contrib (cost=0.01..45475.95 rows=14247 width=24) (actual time=0.294..264.345rows=109071 loops=1) Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval))) Total runtime: 412.167 ms (3 rows) stats=# select now()-'15 days'::interval; ?column? ------------------------------- 2006-05-29 22:09:56.814897+00 (1 row) stats=# explain analyze select * from email_contrib where project_id=8 and date >= '2006-05-29 22:09:56.814897+00'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using email_contrib__project_date on email_contrib (cost=0.00..48951.74 rows=15336 width=24) (actual time=0.124..229.800rows=116828 loops=1) Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date)) Total runtime: 391.240 ms (3 rows) stats=# explain select * from email_contrib where project_id=8 and date >= '2006-05-29 22:09:56.814897+00'::date; QUERY PLAN ---------------------------------------------------------------------------------------------------------- Index Scan using email_contrib__project_date on email_contrib (cost=0.00..48951.74 rows=15336 width=24) Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date)) (2 rows) So casting to date doesn't change anything, but dropping project_id from the where clause certainly does... stats=# explain analyze select * from email_contrib where date >= now()-'15 days'::interval; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on email_contrib (cost=847355.98..1256538.96 rows=152552 width=24) (actual time=74886.028..75267.633 rows=148894loops=1) Recheck Cond: (date >= (now() - '15 days'::interval)) -> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98 rows=152552 width=0) (actual time=74885.690..74885.690rows=148894 loops=1) Index Cond: (date >= (now() - '15 days'::interval)) Total runtime: 75472.490 ms (5 rows) That estimate is dead-on. So it appears it's yet another case of cross-column stats. :( But there's still a difference between now()-interval and something hard-coded: stats=# explain analyze select * from email_contrib where date >= '2006-05-29 22:09:56.814897+00'::date; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on email_contrib (cost=847355.98..1278756.22 rows=164256 width=24) (actual time=19356.752..19623.450 rows=159348loops=1) Recheck Cond: (date >= '2006-05-29'::date) -> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98 rows=164256 width=0) (actual time=19356.391..19356.391rows=159348 loops=1) Index Cond: (date >= '2006-05-29'::date) Total runtime: 19841.614 ms (5 rows) stats=# explain analyze select * from email_contrib where date >= (now()-'15 days'::interval)::date; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on email_contrib (cost=847355.98..1279988.15 rows=164256 width=24) (actual time=19099.417..19372.167 rows=159348loops=1) Recheck Cond: (date >= ((now() - '15 days'::interval))::date) -> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98 rows=164256 width=0) (actual time=19099.057..19099.057rows=159348 loops=1) Index Cond: (date >= ((now() - '15 days'::interval))::date) Total runtime: 19589.785 ms Aha! It's the casting to date that changes things. The stats target is 100... stats=# select attname, n_distinct from pg_stats where tablename='email_contrib'; attname | n_distinct ------------+------------ project_id | 6 team_id | 4104 work_units | 6795 date | 3034 id | 35301 (5 rows) The n_distinct for project_id and date both look about right. stats=# select * from pg_stats where tablename='email_contrib' and attname='project_id'; -[ RECORD 1 ]-----+------------------------------------------------------------ schemaname | public tablename | email_contrib attname | project_id null_frac | 0 avg_width | 4 n_distinct | 6 most_common_vals | {205,5,8,25,24,3} most_common_freqs | {0.4273,0.419833,0.0933667,0.0514667,0.00506667,0.00296667} histogram_bounds | correlation | 0.605662 stats=# select relpages,reltuples from pg_class where relname='email_contrib'; relpages | reltuples ----------+------------- 996524 | 1.35509e+08 If we look at how many rows would match project_id 8 and any 15 dates... stats=# SELECT 1.35509e+08 * 0.0933667 / 3034 * 15; ?column? ------------------------ 62551.2268472313777195 We come up with something much closer to reality (116828 rows). I guess the problem is in the histogram for date; where the last 3 values are: 2005-11-02,2006-03-05,2006-06-11 -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
pgsql-performance by date: