strange estimate for number of rows - Mailing list pgsql-performance

From Andrew Sullivan
Subject strange estimate for number of rows
Date
Msg-id 20031113181415.GB25546@libertyrms.info
Whole thread Raw
Responses Re: strange estimate for number of rows
List pgsql-performance
Hi all,

I've one here that I cannot fathom.  Any suggestions?

We have a table, call it tablename, where we're selecting by a range
of dates and an identifier.  (This is redacted, obviously):

\d tablename

       Column       |           Type           |     Modifiers
--------------------+--------------------------+--------------------
 id                 | integer                  | not null
 transaction_date   | timestamp with time zone | not null
 product_id         | integer                  | not null
Indexes:
    "trans_posted_trans_date_idx" btree (transaction_date, product_id)


The statistics on transaction_date and product_id are set to 1000.
Everything is all analysed nicely.  But I'm getting a poor plan,
because of an estimate that the number of rows to be returned is
about double how many actually are:

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on transactions_posted  (cost=0.00..376630.33 rows=700923
width=91) (actual time=8422.253..36176.078 rows=316029 loops=1)
   Filter: ((transaction_date >= '2003-09-01 00:00:00-04'::timestamp
with time zone) AND (transaction_date < '2003-10-01
00:00:00-04'::timestamp with time zone) AND (product_id = 2))
 Total runtime: 36357.630 ms
(3 rows)

SET enable_seqscan = off;

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trans_posted_trans_date_idx on transactions_posted
(cost=0.00..1088862.56 rows=700923 width=91) (actual
time=35.214..14816.257 rows=316029 loops=1)
   Index Cond: ((transaction_date >= '2003-09-01
00:00:00-04'::timestamp with time zone) AND (transaction_date <
'2003-10-01 00:00:00-04'::timestamp with time zone) AND (product_id =
2))
 Total runtime: 15009.816 ms
(3 rows)

SELECT attname,null_frac,avg_width,n_distinct,correlation FROM
pg_stats where tablename = 'transactions_posted' AND attname in
('transaction_date','product_id');
     attname      | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
 product_id       |         0 |         4 |          2 |    0.200956
 transaction_date |         0 |         8 |  -0.200791 |    0.289248

Any ideas?  I'm loathe to recommend cluster, since the data will not
stay clustered.

A


--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


pgsql-performance by date:

Previous
From: Robert Treat
Date:
Subject: Re: IN surpasses NOT EXISTS in 7.4RC2 ??
Next
From: "Arthur Ward"
Date:
Subject: Union+group by planner estimates way off?