Does "correlation" mislead the optimizer on large tables? - Mailing list pgsql-performance

From Ron Mayer
Subject Does "correlation" mislead the optimizer on large tables?
Date
Msg-id Pine.LNX.4.44.0301231909070.986-100000@localhost.localdomain
Whole thread Raw
In response to Re: Same query, same performance  ("alexandre :: aldeia digital" <alepaes@aldeiadigital.com.br>)
Responses Re: Does "correlation" mislead the optimizer on large tables?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Short summary:

  On large tables, I think the "correlation" pg_stats field as calculated
  by "vacuum analyze" or "analyze" can mislead the optimizer.

  By forcing index scans on the queries shown below, some queries
  in my database sped up from 197 seconds to under 30 seconds.

  I'd like feedback on whether or not a smarter "analyze"
  function (which I think I could write as a separate utility)
  would help in situations like this.

Longer:

  In particular, if I have a large table t with columns 'a','b','c', etc,
  and I cluster the table as follows:

    create table t_ordered as select * from t order by a,b;
    vacuum analyze t_ordered;

  Column "b" will (correctly) get a very low "correlation" in
  the pg_stats table -- but I think the optimizer would do better
  assuming a high correlation because similar 'b' values are still
  grouped closely on the same disk pages.
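  The effect is easy to reproduce without a database.  Here's a minimal
  Python sketch (all numbers invented, not from my real table): 'b' forms
  a sawtooth against physical row position, so its global correlation is
  near zero, yet a range scan on 'b' touches only a small fraction of the
  "pages" it would touch under truly random placement.

```python
def pearson(xs, ys):
    # Plain Pearson correlation, same quantity pg_stats reports
    # between a column's values and their physical row order.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

PAGE = 10                                                  # rows per simulated disk page
rows = [(a, b) for a in range(100) for b in range(100)]    # table ordered by (a, b)

# Correlation of 'b' with physical position: a sawtooth, so ~0.
corr = pearson(list(range(len(rows))), [b for _, b in rows])

# Yet the 1000 rows matching "b < 10" sit on only 100 of the 1000
# pages; 1000 rows scattered at random would hit ~632 distinct pages.
hit_pages = {i // PAGE for i, (_, b) in enumerate(rows) if b < 10}
print(round(corr, 4), len(hit_pages))
```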



  Below is a real-world example of this issue.

  The table "fact" is a large one (reltuples = 1e8, relpages = 1082385)
  and contains about one year's worth of data.  The data was loaded
  sequentially (ordered by dat, tim).

    logs=# \d fact;
            Table "fact"
     Column |          Type          | Modifiers
    --------+------------------------+-----------
     dat    | date                   |
     tim    | time without time zone |
     ip_id  | integer                |
     bid_id | integer                |
     req_id | integer                |
     ref_id | integer                |
     uag_id | integer                |
    Indexes: i_fact_2__bid_id,
         i_fact_2__dat,
         i_fact_2__tim,
         i_fact_2__ip_id,
         i_fact_2__ref_id,
         i_fact_2__req_id


  With a table this large, each day's worth of data spans
  about 3000 pages; conversely, each page covers only about
  a 30-second range of values for "tim".
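  The arithmetic behind those figures, from the relpages number quoted
  above:

```python
relpages = 1_082_385                 # from pg_class, quoted above
days = 365                           # about one year of data, loaded in (dat, tim) order

pages_per_day = relpages / days                    # ~2965 pages per day
seconds_per_page = 24 * 60 * 60 / pages_per_day    # ~29 seconds of 'tim' per page
print(round(pages_per_day), round(seconds_per_page, 1))
```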

  As the queries below show, the optimizer wanted to do
  a sequential scan when looking at a 10-minute slice of the day.
  However, as also shown, forcing an index scan did much better.

  I'm guessing this happened because the optimizer saw the
  horrible correlation, and decided it would have to read
  an enormous number of pages if it did an index scan.
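  If I'm reading costsize.c right, the planner interpolates between a
  best-case (perfectly ordered) and worst-case (randomly scattered) I/O
  cost using the *square* of the correlation, roughly as below.  This is
  a hedged sketch: the function name and the min/max page counts are mine
  for illustration, not PostgreSQL's actual variables.

```python
def index_io_cost(correlation, min_io, max_io):
    # min_io: pages fetched if the heap were perfectly ordered on the key
    # max_io: pages fetched if matching rows were scattered at random
    csq = correlation * correlation
    return max_io + csq * (min_io - max_io)

# With tim's correlation of -0.00281447, the blended estimate is
# essentially the full worst case, even though the rows matching a
# narrow tim range are in fact locally well grouped on disk.
print(index_io_cost(-0.00281447, min_io=4_400, max_io=526_340))
```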

===========================================

logs=# select tablename,attname,n_distinct,correlation from pg_stats where tablename='fact';
 tablename | attname | n_distinct | correlation
-----------+---------+------------+-------------
 fact      | dat     |        365 |           1
 fact      | tim     |      80989 | -0.00281447
 fact      | ip_id   |      44996 |    0.660689
 fact      | bid_id  |     742850 |    0.969026
 fact      | req_id  |       2778 |     0.67896
 fact      | ref_id  |        595 |    0.258023
 fact      | uag_id  |        633 |    0.234216
(7 rows)


logs=# explain analyze select * from fact where tim<'00:10:00';
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on fact  (cost=0.00..1949838.40 rows=526340 width=32) (actual time=0.39..197447.50 rows=402929 loops=1)
   Filter: (tim < '00:10:00'::time without time zone)
 Total runtime: 197810.01 msec
(3 rows)

logs=# explain analyze select * from fact where tim<'00:10:00';
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on fact  (cost=0.00..1949838.40 rows=526340 width=32) (actual time=15.25..156705.76 rows=402929 loops=1)
   Filter: (tim < '00:10:00'::time without time zone)
 Total runtime: 157089.15 msec
(3 rows)

logs=# set enable_seqscan = off;
SET
logs=# explain analyze select * from fact where tim<'00:10:00';
                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using i__fact__tim on fact  (cost=0.00..2110978.39 rows=526340 width=32) (actual time=104.41..23307.84 rows=402929 loops=1)
   Index Cond: (tim < '00:10:00'::time without time zone)
 Total runtime: 23660.95 msec
(3 rows)

logs=# explain analyze select * from fact where tim<'00:10:00';
                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using i__fact__tim on fact  (cost=0.00..2110978.39 rows=526340 width=32) (actual time=0.03..1477.35 rows=402929 loops=1)
   Index Cond: (tim < '00:10:00'::time without time zone)
 Total runtime: 1827.94 msec
(3 rows)



logs=#

*******************************************************************************
*******************************************************************************


So two questions:

  a) Am I on to something.... or is something else the reason why
     the optimizer chose the much slower sequential scan?

  b) If I did write an "analyze" that tried to set "correlation" values
     that took into account such local grouping of data, would anyone
     be interested?
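One shape such a block-aware statistic could take, sketched in Python
(the name "clustering_factor" and the approach are my own invention, not
anything PostgreSQL computes today): instead of a single global Pearson
correlation, measure how few distinct pages a narrow value range actually
occupies, relative to random placement.

```python
def clustering_factor(values, rows_per_page, lo, hi):
    # Ratio of pages actually holding rows in [lo, hi) to the number
    # of distinct pages random placement of those rows would touch.
    matches = [i for i, v in enumerate(values) if lo <= v < hi]
    if not matches:
        return 1.0
    pages_hit = len({i // rows_per_page for i in matches})
    total_pages = (len(values) + rows_per_page - 1) // rows_per_page
    n = len(matches)
    # expected distinct pages under uniform random placement
    expected = total_pages * (1 - (1 - 1 / total_pages) ** n)
    return pages_hit / expected    # << 1.0 means strong local grouping

# Secondary sort key again: globally uncorrelated, locally grouped.
values = [b for a in range(100) for b in range(100)]
print(clustering_factor(values, rows_per_page=10, lo=0, hi=10))
```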


       Ron





