Re: limit clause breaks query planner? - Mailing list pgsql-performance
From | Matt Smiley |
---|---|
Subject | Re: limit clause breaks query planner? |
Date | |
Msg-id | 48BD33B2.D078.0028.0@rentrak.com Whole thread Raw |
In response to | limit clause breaks query planner? ("David West" <david.west@cusppoint.com>) |
Responses |
Re: limit clause breaks query planner?
|
List | pgsql-performance |
Hi David, Early in this thread, Pavel suggested: > you should partial index > > create index foo(b) on mytable where a is null; Rather, you might try the opposite partial index (where a is NOT null) as a replacement for the original unqualified indexon column A. This new index will be ignored by the query you're trying to tune, but it'll be available to the otherqueries that filter to a non-null value of column A. (Omitting NULL from that index should be ok because you normallywouldn't want to use an index when 95% of the table's rows match the filtered key.) Then you can temporarily disable Seq Scans in your session for just this one query, as follows: SQL> create table my_table ( a int, b int ) ; CREATE TABLE SQL> create index idx_a_not_null on my_table ( a ) where a is not null ; CREATE INDEX SQL> create index idx_b on my_table ( b ) ; CREATE INDEX SQL> insert into my_table (a, b) select case when random() <= 0.95 then null else i end as a, mod(i, 10) as b from generate_series(1, 10000000) s(i) ; INSERT 0 10000000 SQL> analyze my_table ; ANALYZE Review the statistics available to the optimizer: SQL> select attname, null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds, correlation from pg_stats where tablename = 'my_table' order by attname ; attname | null_frac | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ---------+-----------+------------+-----------------------+--------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+------------- a | 0.945 | -1 | | | {2771,1301755,2096051,3059786,3680728,4653531,5882434,6737141,8240245,9428702,9875768} | 1 b | 0 | 10 | {9,4,3,1,2,6,8,5,7,0} | {0.110333,0.104,0.102333,0.100333,0.100333,0.0996667,0.0986667,0.0983333,0.096,0.09}| | 0.127294 (2 rows) SQL> select relname, reltuples, relpages from pg_class where relname in ('my_table', 'idx_a_not_null', 'idx_b') order byrelname ; relname | reltuples | relpages ----------------+-----------+---------- idx_a_not_null | 499955 | 1100 idx_b | 1e+07 | 21946 my_table | 1e+07 | 39492 (3 rows) Run the test query, first without disabling Seq Scan to show this example reproduces the plan you're trying to avoid. SQL> explain analyze select * from my_table where a is null and b = 5 limit 15 ; QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..2.66 rows=15 width=8) (actual time=0.070..0.263 rows=15 loops=1) -> Seq Scan on my_table (cost=0.00..164492.00 rows=929250 width=8) (actual time=0.061..0.159 rows=15 loops=1) Filter: ((a IS NULL) AND (b = 5)) Total runtime: 0.371 ms (4 rows) Now run the same query without the Seq Scan option. SQL> set enable_seqscan = false ; SET SQL> explain analyze select * from my_table where a is null and b = 5 limit 15 ; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..46.33 rows=15 width=8) (actual time=0.081..0.232 rows=15 loops=1) -> Index Scan using idx_b on my_table (cost=0.00..2869913.63 rows=929250 width=8) (actual time=0.072..0.130 rows=15loops=1) Index Cond: (b = 5) Filter: (a IS NULL) Total runtime: 0.341 ms (5 rows) SQL> reset enable_seqscan ; RESET Yes, it's unsavory to temporarily adjust a session-level parameter to tune a single query, but I don't know of a less intrusiveway to avoid the SeqScan. Here's why I think it might be your simplest option: As far as I can tell, the plan nodes for accessing the table/index are unaware of the LIMIT. The cost of the Limit nodeis estimated as the cost of its input row-source multiplied by the ratio of requested/returned rows. For example, fromthe preceding plan output: 2869913.63 for "Index Scan" upper cost * (15 row limit / 929250 returned rows) = 46.326 upper cost for the "Limit" node The underlying plan nodes each assume that all the rows matching their filter predicates will be returned up the pipeline;the cost estimate is only reduced at the Limit node. A Seq Scan and an Index Scan (over a complete index) willboth expected the same number of input rows (pg_class.reltuples). They also produce the same estimated result set, sinceboth apply the same filters before outputing rows to the next node. So an Index Scan is always going to have a highercost estimate than an equivalent Seq Scan returning the same result rows (unless random_page_cost is < 1). That'swhy I think the planner is always preferring the plan that uses a Seq Scan. Hope this helps!
pgsql-performance by date: