Re: Using LIMIT changes index used by planner - Mailing list pgsql-performance
From | Sven Willenberger |
---|---|
Subject | Re: Using LIMIT changes index used by planner |
Date | |
Msg-id | 1103048932.5534.22.camel@lanshark.dmv.com Whole thread Raw |
In response to | Re: Using LIMIT changes index used by planner (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Using LIMIT changes index used by planner
|
List | pgsql-performance |
On Mon, 2004-12-13 at 17:43 -0500, Tom Lane wrote: > Sven Willenberger <sven@dmv.com> writes: > > explain analyze select storelocation,order_number from custacct where > > referrer = 1365 and orderdate between '2004-12-07' and '2004-12-07 > > 12:00:00' order by custacctid limit 10; > > > QUERY PLAN > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Limit (cost=0.00..43065.76 rows=10 width=43) (actual > > time=1306957.216..1307072.111 rows=10 loops=1) > > -> Index Scan using custacct2_pkey on custacct > > (cost=0.00..92083209.38 rows=21382 width=43) (actual > > time=1306957.205..1307072.017 rows=10 loops=1) > > Filter: ((referrer = 1365) AND (orderdate >= '2004-12-07 > > 00:00:00'::timestamp without time zone) AND (orderdate <= '2004-12-07 > > 12:00:00'::timestamp without time zone)) > > Total runtime: 1307072.231 ms > > (4 rows) > > I think this is the well-known issue of lack of cross-column correlation > statistics. The planner is well aware that this indexscan will be > horridly expensive if run to completion --- > <snip> > There isn't any near-term fix in the wind for this, since storing > cross-column statistics is an expensive proposition that we haven't > decided how to handle. Your workaround with separating the ORDER BY > from the LIMIT is a good one. > You are correct in that there is a high degree of correlation between the custacctid (which is a serial key) and the orderdate as the orders generally get entered in the order that they arrive. I will go with the workaround subselect query plan then. On a related note, is there a way (other than set enable_seqscan=off) to give a hint to the planner that it is cheaper to use and index scan versus seq scan? Using the "workaround" query on any time period greater than 12 hours results in the planner using a seq scan. Disabling the seq scan and running the query on a full day period for example shows: explain analyze select foo.storelocaion, foo.order_number from (select storelocation,order_number from custacct where referrer = 1365 and ordertdate between '2004-12-09' and '2004-12-10' order by custacctid) as foo limit 10 offset 100; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=2661326.22..2661326.35 rows=10 width=100) (actual time=28446.605..28446.796 rows=10 loops=1) -> Subquery Scan foo (cost=2661324.97..2661866.19 rows=43297 width=100) (actual time=28444.916..28446.298 rows=110 loops=1) -> Sort (cost=2661324.97..2661433.22 rows=43297 width=41) (actual time=28444.895..28445.334 rows=110 loops=1) Sort Key: custacctid -> Index Scan using orderdate_idx on custacct (cost=0.00..2657990.68 rows=43297 width=41) (actual time=4.432..28145.212 rows=44333 loops=1) Index Cond: ((orderdate >= '2004-12-09 00:00:00'::timestamp without time zone) AND (orderdate <= '2004-12-10 00:00:00'::timestamp without time zone)) Filter: (referrer = 1365) Total runtime: 28456.893 ms (8 rows) If I interpret the above correctly, the planner guestimates a cost of 2661326 but the actual cost is much less (assuming time is equivalent to cost). Would the set statistics command be of any benefit here in "training" the planner? Sven
pgsql-performance by date: