Thread: Expression index ignores column statistics target

Expression index ignores column statistics target

From

Michael Fuhr

Date:

30 September 2005, 23:15:54

I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target.  That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected.  However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran.  I'm
still rummaging through the archives looking for past discussion;
is this behavior a known limitation or just an oversight?

CREATE TABLE foo (x integer);

CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));

INSERT INTO foo (x) SELECT r1 % r2 FROM generate_series(1, 100) AS g1(r1),      generate_series(1, 100) AS g2(r2);

SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;

SELECT most_common_vals FROM pg_stats WHERE attname = 'x';                 most_common_vals                   
-----------------------------------------------------{0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;                                                     QUERY PLAN
                                           
 

----------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo  (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)  Recheck Cond: (x =
13) ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
     Index Cond: (x = 13)Total runtime: 2.905 ms
 
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;                                                       QUERY PLAN
                                                    
 

--------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo  (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)  Recheck Cond: (abs(x)
=13)  ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220
loops=1)       Index Cond: (abs(x) = 13)Total runtime: 2.875 ms
 
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;                                                     QUERY PLAN
                                           
 

----------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo  (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)  Recheck Cond: (x =
18) ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
     Index Cond: (x = 18)Total runtime: 2.393 ms
 
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;                                                      QUERY PLAN
                                                   
 

-------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on foo  (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)  Recheck Cond: (abs(x)
=18)  ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180
loops=1)       Index Cond: (abs(x) = 18)Total runtime: 2.418 ms
 
(5 rows)

-- 
Michael Fuhr

Re: Expression index ignores column statistics target

From

Bruce Momjian

Date:

01 October 2005, 00:59:49

This is expected.  The main TODO items is:* Allow accurate statistics to be collected on indexes with more than  one
columnor expression indexes, perhaps using per-index statistics
 

Basically, we don't have multi-column or expression statistics.  ANALYZE
just analyzes columns, even if an expression index exists.

---------------------------------------------------------------------------

Michael Fuhr wrote:
> I've noticed that row count estimates for expression indexes appear
> to rely on default_statistics_target rather than on a column's
> actual statistics target.  That is, if I use ALTER TABLE SET
> STATISTICS to increase a column's statistics target and then run
> ANALYZE, then estimates for non-expression-index queries improve
> as expected.  However, queries that use an expression index remain
> accurate for only around the N most common values, where N is the
> default_statistics_target that was in effect when ANALYZE ran.  I'm
> still rummaging through the archives looking for past discussion;
> is this behavior a known limitation or just an oversight?
> 
> CREATE TABLE foo (x integer);
> 
> CREATE INDEX foo_x_idx ON foo (x);
> CREATE INDEX foo_abs_x_idx ON foo (abs(x));
> 
> INSERT INTO foo (x)
>   SELECT r1 % r2
>   FROM generate_series(1, 100) AS g1(r1),
>        generate_series(1, 100) AS g2(r2);
> 
> SET default_statistics_target TO 15;
> ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
> ANALYZE foo;
> 
> SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
>                   most_common_vals                   
> -----------------------------------------------------
>  {0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
> (1 row)
> 
> EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
>                                                       QUERY PLAN

>
----------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
>    Recheck Cond: (x = 13)
>    ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220
loops=1)
>          Index Cond: (x = 13)
>  Total runtime: 2.905 ms
> (5 rows)
> 
> EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
>                                                         QUERY PLAN
   
 
>
--------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
>    Recheck Cond: (abs(x) = 13)
>    ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220
loops=1)
>          Index Cond: (abs(x) = 13)
>  Total runtime: 2.875 ms
> (5 rows)
> 
> EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
>                                                       QUERY PLAN

>
----------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
>    Recheck Cond: (x = 18)
>    ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180
loops=1)
>          Index Cond: (x = 18)
>  Total runtime: 2.393 ms
> (5 rows)
> 
> EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
>                                                        QUERY PLAN
  
 
>
-------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
>    Recheck Cond: (abs(x) = 18)
>    ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180
loops=1)
>          Index Cond: (abs(x) = 18)
>  Total runtime: 2.418 ms
> (5 rows)
> 
> -- 
> Michael Fuhr
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Expression index ignores column statistics target

From

Tom Lane

Date:

01 October 2005, 01:53:08

Michael Fuhr <mike@fuhr.org> writes:
> I've noticed that row count estimates for expression indexes appear
> to rely on default_statistics_target rather than on a column's
> actual statistics target.  That is, if I use ALTER TABLE SET
> STATISTICS to increase a column's statistics target and then run
> ANALYZE, then estimates for non-expression-index queries improve
> as expected.  However, queries that use an expression index remain
> accurate for only around the N most common values, where N is the
> default_statistics_target that was in effect when ANALYZE ran.

The code does in fact honor per-column statistics targets attached to
expression indexes, viz

alter table myfuncindex alter column pg_expression_1 set statistics 100;

This isn't documented, mainly because pg_dump doesn't promise to dump
such things, which it doesn't do because I didn't want to see the
"pg_expression_N" naming for expression index columns become graven on
stone tablets.  I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.
        regards, tom lane

Re: Expression index ignores column statistics target

From

Michael Fuhr

Date:

01 October 2005, 02:10:13

On Fri, Sep 30, 2005 at 11:59:26PM -0400, Bruce Momjian wrote:
> This is expected.  The main TODO items is:
>     
>     * Allow accurate statistics to be collected on indexes with more than
>       one column or expression indexes, perhaps using per-index statistics
> 
> Basically, we don't have multi-column or expression statistics.  ANALYZE
> just analyzes columns, even if an expression index exists.

But the row count estimates imply that expression index queries do
use column statistics, presumably as a proxy in the absence of
expression statistics.  This looks like a relevant commit:

http://archives.postgresql.org/pgsql-committers/2004-02/msg00124.php

The behavior I observed is that the planner does appear to use
column statistics when planning an expression index query, but it
doesn't appear to honor a column's non-default statistics target.
In other words:

* Row count estimates for expression index queries (at least simple ones) are reasonably accurate for the N most common
columnvalues, where N is the value of default_statistics_target when ANALYZE was run.

* Specifically setting the column's statistics target with ALTER TABLE SET STATISTICS doesn't result in better
statisticsfor expression index queries.

That difference in behavior seems odd: if default_statistics_target
has an effect, why doesn't ALTER TABLE SET STATISTICS?

-- 
Michael Fuhr

Re: Expression index ignores column statistics target

From

Michael Fuhr

Date:

01 October 2005, 02:23:27

On Sat, Oct 01, 2005 at 12:53:03AM -0400, Tom Lane wrote:
> The code does in fact honor per-column statistics targets attached to
> expression indexes, viz
> 
> alter table myfuncindex alter column pg_expression_1 set statistics 100;

Aha -- that's the piece I didn't know about.  I was wondering where
those statistics were being stored, since they were affected by
default_statistics_target but not by per-column statistics targets.
And now I see them when I don't restrict queries against pg_stats
by just the table or column name.  Thanks.

-- 
Michael Fuhr

Re: Expression index ignores column statistics target

From

Tom Lane

Date:

01 October 2005, 03:19:10

I wrote:
> I seem to recall bringing up the question of whether
> we could find a less implementation-specific way of commanding this
> behavior, but I can't find it in the archives right now.

Ah, here it is:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php

No responses :-(
        regards, tom lane

Re: Expression index ignores column statistics target

From

Michael Fuhr

Date:

01 October 2005, 03:40:19

On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote:
> I wrote:
> > I seem to recall bringing up the question of whether
> > we could find a less implementation-specific way of commanding this
> > behavior, but I can't find it in the archives right now.
> 
> Ah, here it is:
> http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php
> 
> No responses :-(

Would an ALTER INDEX SET STATISTICS form be possible?

-- 
Michael Fuhr

Re: Expression index ignores column statistics target

From

Tom Lane

Date:

01 October 2005, 03:42:35

Michael Fuhr <mike@fuhr.org> writes:
> On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote:
>> Ah, here it is:
>> http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php

> Would an ALTER INDEX SET STATISTICS form be possible?

It's not so much the table/index misnomer that's bothering me, it's
the lack of a clean way to identify which column of the index you
are talking about.
        regards, tom lane

Re: Expression index ignores column statistics target

From

Michael Fuhr

Date:

01 October 2005, 04:41:23

On Sat, Oct 01, 2005 at 02:42:32AM -0400, Tom Lane wrote:
> Michael Fuhr <mike@fuhr.org> writes:
> > Would an ALTER INDEX SET STATISTICS form be possible?
> 
> It's not so much the table/index misnomer that's bothering me, it's
> the lack of a clean way to identify which column of the index you
> are talking about.

Ah, I see -- I wasn't thinking about expressions in multicolumn
indexes.  What about identifying the column with the expression
itself, ala quote_ident(pg_get_indexdef())?  That might be tedious
for the user to type but it would be attractive from a self-documentation
standpoint.

ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100;

I do see that indexes allow multiple instances of the same expression,
so this approach could be ambiguous.  Or should such repetition be
prohibited as it is with column names?

test=> CREATE TABLE foo (x integer);
CREATE TABLE
test=> CREATE INDEX foo1_idx ON foo (x, x);
ERROR:  duplicate key violates unique constraint "pg_attribute_relid_attnam_index"
test=> CREATE INDEX foo2_idx ON foo (abs(x), abs(x));
CREATE INDEX

-- 
Michael Fuhr

Re: Expression index ignores column statistics target

From

Bruce Momjian

Date:

01 October 2005, 11:34:59

Tom Lane wrote:
> Michael Fuhr <mike@fuhr.org> writes:
> > I've noticed that row count estimates for expression indexes appear
> > to rely on default_statistics_target rather than on a column's
> > actual statistics target.  That is, if I use ALTER TABLE SET
> > STATISTICS to increase a column's statistics target and then run
> > ANALYZE, then estimates for non-expression-index queries improve
> > as expected.  However, queries that use an expression index remain
> > accurate for only around the N most common values, where N is the
> > default_statistics_target that was in effect when ANALYZE ran.
> 
> The code does in fact honor per-column statistics targets attached to
> expression indexes, viz
> 
> alter table myfuncindex alter column pg_expression_1 set statistics 100;
> 
> This isn't documented, mainly because pg_dump doesn't promise to dump
> such things, which it doesn't do because I didn't want to see the
> "pg_expression_N" naming for expression index columns become graven on
> stone tablets.  I seem to recall bringing up the question of whether
> we could find a less implementation-specific way of commanding this
> behavior, but I can't find it in the archives right now.

Is this a TODO?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Expression index ignores column statistics target

From

Tom Lane

Date:

01 October 2005, 14:16:43

Michael Fuhr <mike@fuhr.org> writes:
> ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100;

Yeah, that could probably be made to work.

> I do see that indexes allow multiple instances of the same expression,
> so this approach could be ambiguous.

I can't think of an actual use for that, though, so we could just ignore
the possible ambiguity.  Or we could have the ALTER update all columns
matching the given expression.

> test=> CREATE INDEX foo1_idx ON foo (x, x);
> ERROR:  duplicate key violates unique constraint "pg_attribute_relid_attnam_index"

Hmm, seems like there should be a more direct check for this ...
        regards, tom lane