Re: limit clause breaks query planner? - Mailing list pgsql-performance

From Matt Smiley
Subject Re: limit clause breaks query planner?
Date
Msg-id 48BD33B2.D078.0028.0@rentrak.com
Whole thread Raw
In response to limit clause breaks query planner?  ("David West" <david.west@cusppoint.com>)
Responses Re: limit clause breaks query planner?
List pgsql-performance
Hi David,

Early in this thread, Pavel suggested:

> you should partial index
>
> create index foo(b) on mytable where a is null;

Rather, you might try the opposite partial index (where a is NOT null) as a replacement for the original unqualified
indexon column A.  This new index will be ignored by the query you're trying to tune, but it'll be available to the
otherqueries that filter to a non-null value of column A.  (Omitting NULL from that index should be ok because you
normallywouldn't want to use an index when 95% of the table's rows match the filtered key.) 

Then you can temporarily disable Seq Scans in your session for just this one query, as follows:

SQL> create table my_table ( a int, b int ) ;
CREATE TABLE

SQL> create index idx_a_not_null on my_table ( a ) where a is not null ;
CREATE INDEX

SQL> create index idx_b on my_table ( b ) ;
CREATE INDEX

SQL> insert into my_table (a, b)
select
  case when random() <= 0.95 then null else i end as a,
  mod(i, 10) as b
from generate_series(1, 10000000) s(i)
;
INSERT 0 10000000

SQL> analyze my_table ;
ANALYZE


Review the statistics available to the optimizer:

SQL> select attname, null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds, correlation
from pg_stats
where tablename = 'my_table'
order by attname
;
 attname | null_frac | n_distinct |   most_common_vals    |                                  most_common_freqs
                        |                                    histogram_bounds                                    |
correlation

---------+-----------+------------+-----------------------+--------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-------------
 a       |     0.945 |         -1 |                       |
                        | {2771,1301755,2096051,3059786,3680728,4653531,5882434,6737141,8240245,9428702,9875768} |
    1 
 b       |         0 |         10 | {9,4,3,1,2,6,8,5,7,0} |
{0.110333,0.104,0.102333,0.100333,0.100333,0.0996667,0.0986667,0.0983333,0.096,0.09}|
                                                    |    0.127294 
(2 rows)

SQL> select relname, reltuples, relpages from pg_class where relname in ('my_table', 'idx_a_not_null', 'idx_b') order
byrelname ; 
    relname     | reltuples | relpages
----------------+-----------+----------
 idx_a_not_null |    499955 |     1100
 idx_b          |     1e+07 |    21946
 my_table       |     1e+07 |    39492
(3 rows)


Run the test query, first without disabling Seq Scan to show this example reproduces the plan you're trying to avoid.

SQL> explain analyze select * from my_table where a is null and b = 5 limit 15 ;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.66 rows=15 width=8) (actual time=0.070..0.263 rows=15 loops=1)
   ->  Seq Scan on my_table  (cost=0.00..164492.00 rows=929250 width=8) (actual time=0.061..0.159 rows=15 loops=1)
         Filter: ((a IS NULL) AND (b = 5))
 Total runtime: 0.371 ms
(4 rows)


Now run the same query without the Seq Scan option.

SQL> set enable_seqscan = false ;
SET

SQL> explain analyze select * from my_table where a is null and b = 5 limit 15 ;
                                                            QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..46.33 rows=15 width=8) (actual time=0.081..0.232 rows=15 loops=1)
   ->  Index Scan using idx_b on my_table  (cost=0.00..2869913.63 rows=929250 width=8) (actual time=0.072..0.130
rows=15loops=1) 
         Index Cond: (b = 5)
         Filter: (a IS NULL)
 Total runtime: 0.341 ms
(5 rows)

SQL> reset enable_seqscan ;
RESET


Yes, it's unsavory to temporarily adjust a session-level parameter to tune a single query, but I don't know of a less
intrusiveway to avoid the SeqScan.  Here's why I think it might be your simplest option: 

As far as I can tell, the plan nodes for accessing the table/index are unaware of the LIMIT.  The cost of the Limit
nodeis estimated as the cost of its input row-source multiplied by the ratio of requested/returned rows.  For example,
fromthe preceding plan output: 
    2869913.63 for "Index Scan" upper cost * (15 row limit / 929250 returned rows) = 46.326 upper cost for the "Limit"
node
The underlying plan nodes each assume that all the rows matching their filter predicates will be returned up the
pipeline;the cost estimate is only reduced at the Limit node.  A Seq Scan and an Index Scan (over a complete index)
willboth expected the same number of input rows (pg_class.reltuples).  They also produce the same estimated result set,
sinceboth apply the same filters before outputing rows to the next node.  So an Index Scan is always going to have a
highercost estimate than an equivalent Seq Scan returning the same result rows (unless random_page_cost is < 1).
That'swhy I think the planner is always preferring the plan that uses a Seq Scan. 

Hope this helps!



pgsql-performance by date:

Previous
From: Guillaume Lelarge
Date:
Subject: Re: logging options...
Next
From: Tom Lane
Date:
Subject: Re: limit clause breaks query planner?