Thread: Statistics use with functions

From:
Matthew Wakeling
Date:

I'm running a rather complex query and noticed a peculiarity in the usage
of statistics that seriously affects the plan generated. I can extract the
relevant bit:

modmine-r9=# select * from pg_stats where tablename = 'geneflankingregion' AND attname IN ('distance', 'direction');
  schemaname |     tablename      |  attname  | null_frac | avg_width | n_distinct |         most_common_vals         |
             most_common_freqs                | histogram_bounds | correlation 

------------+--------------------+-----------+-----------+-----------+------------+----------------------------------+------------------------------------------------+------------------+-------------
  public     | geneflankingregion | distance  |         0 |         6 |          5 | {5.0kb,0.5kb,1.0kb,2.0kb,10.0kb} |
{0.201051,0.200798,0.200479,0.199088,0.198583}|                  |    0.197736 
  public     | geneflankingregion | direction |         0 |        10 |          2 | {downstream,upstream}            |
{0.500719,0.499281}                           |                  |    0.495437 
(2 rows)

modmine-r9=# SELECT COUNT(*) FROM geneflankingregion;
  count
--------
  455020
(1 row)

modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE distance = '10.0kb' AND direction = 'upstream';
                                                          QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
  Seq Scan on geneflankingregion  (cost=0.00..15507.30 rows=45115 width=213) (actual time=0.053..181.764 rows=45502
loops=1)
    Filter: ((distance = '10.0kb'::text) AND (direction = 'upstream'::text))
  Total runtime: 227.245 ms
(3 rows)

modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE LOWER(distance) = '10.0kb' AND LOWER(direction) =
'upstream';
                                                                            QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on geneflankingregion
    (cost=66.95..88.77 rows=11 width=213)
    (actual time=207.555..357.359 rows=45502 loops=1)
    Recheck Cond: ((lower(distance) = '10.0kb'::text) AND (lower(direction) = 'upstream'::text))
    ->  BitmapAnd
          (cost=66.95..66.95 rows=11 width=0)
          (actual time=205.978..205.978 rows=0 loops=1)
          ->  Bitmap Index Scan on geneflankingregion__distance_equals
                (cost=0.00..31.34 rows=2275 width=0)
                (actual time=79.380..79.380 rows=91004 loops=1)
                Index Cond: (lower(distance) = '10.0kb'::text)
          ->  Bitmap Index Scan on geneflankingregion__direction_equals
                (cost=0.00..35.35 rows=2275 width=0)
                (actual time=124.639..124.639 rows=227510 loops=1)
                Index Cond: (lower(direction) = 'upstream'::text)
  Total runtime: 401.740 ms
(8 rows)

When I wrap the fields in the constraints in a LOWER() function, the
planner stops looking at the statistics and makes a wild guess, even
though it is very obvious from just looking what the result should be.
Embedded in a much larger query, the inaccuracy in the number of rows (11
instead of 45502) causes major planning problems. Also, why does the
BitmapAnd say zero actual rows?

I understand this probably isn't Priority No. 1, and there are some
interesting corner cases when n_distinct is higher than the histogram
width, but would it be possible to fix this one up?

Matthew

--
 I would like to think that in this day and age people would know better than
 to open executables in an e-mail. I'd also like to be able to flap my arms
 and fly to the moon.                                    -- Tim Mullen

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> When I wrap the fields in the constraints in a LOWER() function, the
> planner stops looking at the statistics and makes a wild guess, even
> though it is very obvious from just looking what the result should be.

Well, in general the planner can't assume anything about the statistics
of a function result, since it doesn't know how the function behaves.
In this case, however, you evidently have an index on lower(distance)
which should have caused ANALYZE to gather stats on the values of that
functional expression.  It looks like there might be something wrong
there --- can you look into pg_stats and see if there is such an entry
and if it looks sane?

> Also, why does the BitmapAnd say zero actual rows?

There isn't any reasonably-inexpensive way for EXPLAIN ANALYZE to
determine how many rows are represented by a bitmap result, so it
doesn't try.

            regards, tom lane

From:
Matthew Wakeling
Date:

On Fri, 8 May 2009, Tom Lane wrote:
> In this case, however, you evidently have an index on lower(distance)
> which should have caused ANALYZE to gather stats on the values of that
> functional expression.  It looks like there might be something wrong
> there --- can you look into pg_stats and see if there is such an entry
> and if it looks sane?

What should I be looking for? I don't see anything obvious from this:

modmine-r9=# select attname from pg_stats where tablename = 'geneflankingregion';

Ah, now I see it - I re-analysed, and found entries in pg_stats where
tablename is the name of the index. Now the query plans correctly and has
the right estimates. So, one needs to analyse AFTER creating indexes -
didn't know that.

modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE
LOWER(distance) = '10.0kb' AND LOWER(direction) = 'upstream';
                     QUERY PLAN
-----------------------------------------------------------------
  Bitmap Heap Scan on geneflankingregion
    (cost=1197.19..11701.87 rows=45614 width=212)
    (actual time=18.336..153.825 rows=45502 loops=1)
    Recheck Cond: (lower(distance) = '10.0kb'::text)
    Filter: (lower(direction) = 'upstream'::text)
    ->  Bitmap Index Scan on geneflankingregion__distance_equals
          (cost=0.00..1185.78 rows=91134 width=0)
          (actual time=16.565..16.565 rows=91004 loops=1)
          Index Cond: (lower(distance) = '10.0kb'::text)
  Total runtime: 199.282 ms
(6 rows)

Matthew

--
 It is better to keep your mouth closed and let people think you are a fool
 than to open it and remove all doubt.                  -- Mark Twain

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> Ah, now I see it - I re-analysed, and found entries in pg_stats where
> tablename is the name of the index. Now the query plans correctly and has
> the right estimates. So, one needs to analyse AFTER creating indexes -
> didn't know that.

Yes, for functional indexes it's helpful to do that.  Doesn't matter
for plain-old-plain-old indexes.

            regards, tom lane