Re: Confirmation of bad query plan generated by 7.4 - Mailing list pgsql-performance

From Jim C. Nasby
Subject Re: Confirmation of bad query plan generated by 7.4
Date
Msg-id 20060613223950.GK34196@pervasive.com
Whole thread Raw
In response to Re: Confirmation of bad query plan generated by 7.4  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Confirmation of bad query plan generated by 7.4
List pgsql-performance
On Tue, Jun 13, 2006 at 06:04:42PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Also, I'm not sure that the behavior is entirely changed, either. On a
> > 8.1.4 database I'm still seeing a difference between now() - interval
> > and a hard-coded date.
>
> It'd depend on the context, possibly, but it's easy to show that the
> current planner does fold "now() - interval_constant" when making
> estimates.  Simple example:

Turns out the difference is between feeding a date vs a timestamp into the
query... I would have thought that since date is a date that the WHERE clause
would be casted to a date if it was a timestamptz, but I guess not...

stats=# explain select * from email_contrib where project_id=8 and date >= now()-'15 days'::interval;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using email_contrib__project_date on email_contrib  (cost=0.01..45405.83 rows=14225 width=24)
   Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval)))
(2 rows)

stats=# explain select * from email_contrib where project_id=8 AND date >= '2006-05-29 22:09:56.814897+00'::date;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using email_contrib__project_date on email_contrib  (cost=0.00..48951.74 rows=15336 width=24)
   Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
(2 rows)

stats=# explain select * from email_contrib where project_id=8 AND date >= '2006-05-29 22:09:56.814897+00'::timestamp;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using email_contrib__project_date on email_contrib  (cost=0.00..45472.76 rows=14246 width=24)
   Index Cond: ((project_id = 8) AND (date >= '2006-05-29 22:09:56.814897'::timestamp without time zone))
(2 rows)

Actual row count is 109071; reason for the vast difference is querying on two columns.

I know comming up with general-purpose multicolumn stats is extremely
difficult, but can't we at least add histograms for multi-column indexes?? In
this case that would most likely make the estimate dead-on, because there's an
index on project_id, date.

Details below for the morbidly curious/bored...

stats=# \d email_contrib
   Table "public.email_contrib"
   Column   |  Type   | Modifiers
------------+---------+-----------
 project_id | integer | not null
 id         | integer | not null
 date       | date    | not null
 team_id    | integer |
 work_units | bigint  | not null
Indexes:
    "email_contrib_pkey" PRIMARY KEY, btree (project_id, id, date), tablespace "raid10"
    "email_contrib__pk24" btree (id, date) WHERE project_id = 24, tablespace "raid10"
    "email_contrib__pk25" btree (id, date) WHERE project_id = 25, tablespace "raid10"
    "email_contrib__pk8" btree (id, date) WHERE project_id = 8, tablespace "raid10"
    "email_contrib__project_date" btree (project_id, date), tablespace "raid10"
    "email_contrib__project_id" btree (project_id), tablespace "raid10"
    "email_contrib__team_id" btree (team_id), tablespace "raid10"
Foreign-key constraints:
    "fk_email_contrib__id" FOREIGN KEY (id) REFERENCES stats_participant(id) ON UPDATE CASCADE
    "fk_email_contrib__team_id" FOREIGN KEY (team_id) REFERENCES stats_team(team) ON UPDATE CASCADE
Tablespace: "raid10"

stats=# explain analyze select * from email_contrib where project_id=8 and date >= now()-'15 days'::interval;
                                                                        QUERY PLAN
                                   

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using email_contrib__project_date on email_contrib  (cost=0.01..45475.95 rows=14247 width=24) (actual
time=0.294..264.345rows=109071 loops=1) 
   Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval)))
 Total runtime: 412.167 ms
(3 rows)

stats=# select now()-'15 days'::interval;
           ?column?
-------------------------------
 2006-05-29 22:09:56.814897+00
(1 row)

stats=# explain analyze select * from email_contrib where project_id=8 and date >= '2006-05-29 22:09:56.814897+00';
                                                                        QUERY PLAN
                                   

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using email_contrib__project_date on email_contrib  (cost=0.00..48951.74 rows=15336 width=24) (actual
time=0.124..229.800rows=116828 loops=1) 
   Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
 Total runtime: 391.240 ms
(3 rows)

stats=# explain select * from email_contrib where project_id=8 and date >= '2006-05-29 22:09:56.814897+00'::date;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using email_contrib__project_date on email_contrib  (cost=0.00..48951.74 rows=15336 width=24)
   Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
(2 rows)

So casting to date doesn't change anything, but dropping project_id from the
where clause certainly does...

stats=# explain analyze select * from email_contrib where date >= now()-'15 days'::interval;
                                                                        QUERY PLAN
                                   

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on email_contrib  (cost=847355.98..1256538.96 rows=152552 width=24) (actual time=74886.028..75267.633
rows=148894loops=1) 
   Recheck Cond: (date >= (now() - '15 days'::interval))
   ->  Bitmap Index Scan on email_contrib__project_date  (cost=0.00..847355.98 rows=152552 width=0) (actual
time=74885.690..74885.690rows=148894 loops=1) 
         Index Cond: (date >= (now() - '15 days'::interval))
 Total runtime: 75472.490 ms
(5 rows)

That estimate is dead-on. So it appears it's yet another case of cross-column
stats. :( But there's still a difference between now()-interval and something hard-coded:

stats=# explain analyze select * from email_contrib where date >= '2006-05-29 22:09:56.814897+00'::date;
                                                                        QUERY PLAN
                                   

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on email_contrib  (cost=847355.98..1278756.22 rows=164256 width=24) (actual time=19356.752..19623.450
rows=159348loops=1) 
   Recheck Cond: (date >= '2006-05-29'::date)
   ->  Bitmap Index Scan on email_contrib__project_date  (cost=0.00..847355.98 rows=164256 width=0) (actual
time=19356.391..19356.391rows=159348 loops=1) 
         Index Cond: (date >= '2006-05-29'::date)
 Total runtime: 19841.614 ms
(5 rows)

stats=# explain analyze select * from email_contrib where date >= (now()-'15 days'::interval)::date;
                                                                        QUERY PLAN
                                   

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on email_contrib  (cost=847355.98..1279988.15 rows=164256 width=24) (actual time=19099.417..19372.167
rows=159348loops=1) 
   Recheck Cond: (date >= ((now() - '15 days'::interval))::date)
   ->  Bitmap Index Scan on email_contrib__project_date  (cost=0.00..847355.98 rows=164256 width=0) (actual
time=19099.057..19099.057rows=159348 loops=1) 
         Index Cond: (date >= ((now() - '15 days'::interval))::date)
 Total runtime: 19589.785 ms

Aha! It's the casting to date that changes things.

The stats target is 100...

stats=# select attname, n_distinct from pg_stats where tablename='email_contrib';
  attname   | n_distinct
------------+------------
 project_id |          6
 team_id    |       4104
 work_units |       6795
 date       |       3034
 id         |      35301
(5 rows)

The n_distinct for project_id and date both look about right.

stats=# select * from pg_stats where tablename='email_contrib' and attname='project_id';
-[ RECORD 1 ]-----+------------------------------------------------------------
schemaname        | public
tablename         | email_contrib
attname           | project_id
null_frac         | 0
avg_width         | 4
n_distinct        | 6
most_common_vals  | {205,5,8,25,24,3}
most_common_freqs | {0.4273,0.419833,0.0933667,0.0514667,0.00506667,0.00296667}
histogram_bounds  |
correlation       | 0.605662

stats=# select relpages,reltuples from pg_class where relname='email_contrib';
 relpages |  reltuples
----------+-------------
   996524 | 1.35509e+08

If we look at how many rows would match project_id 8 and any 15 dates...

stats=# SELECT 1.35509e+08 * 0.0933667 / 3034 * 15;
        ?column?
------------------------
 62551.2268472313777195

We come up with something much closer to reality (116828 rows). I guess the
problem is in the histogram for date; where the last 3 values are:

2005-11-02,2006-03-05,2006-06-11
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

pgsql-performance by date:

Previous
From: Michael Fuhr
Date:
Subject: Re: Solaris shared_buffers anomaly?
Next
From: "Shaun Thomas"
Date:
Subject: Re: Confirmation of bad query plan generated by 7.4