Thread: Expression index ignores column statistics target
I've noticed that row count estimates for expression indexes appear to rely on default_statistics_target rather than on a column's actual statistics target. That is, if I use ALTER TABLE SET STATISTICS to increase a column's statistics target and then run ANALYZE, then estimates for non-expression-index queries improve as expected. However, queries that use an expression index remain accurate for only around the N most common values, where N is the default_statistics_target that was in effect when ANALYZE ran. I'm still rummaging through the archives looking for past discussion; is this behavior a known limitation or just an oversight? CREATE TABLE foo (x integer); CREATE INDEX foo_x_idx ON foo (x); CREATE INDEX foo_abs_x_idx ON foo (abs(x)); INSERT INTO foo (x) SELECT r1 % r2 FROM generate_series(1, 100) AS g1(r1), generate_series(1, 100) AS g2(r2); SET default_statistics_target TO 15; ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20; ANALYZE foo; SELECT most_common_vals FROM pg_stats WHERE attname = 'x'; most_common_vals -----------------------------------------------------{0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18} (1 row) EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1) Recheck Cond: (x = 13) -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1) Index Cond: (x = 13)Total runtime: 2.905 ms (5 rows) EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1) Recheck Cond: (abs(x) =13) -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1) Index Cond: (abs(x) = 13)Total runtime: 2.875 ms (5 rows) EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1) Recheck Cond: (x = 18) -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1) Index Cond: (x = 18)Total runtime: 2.393 ms (5 rows) EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1) Recheck Cond: (abs(x) =18) -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1) Index Cond: (abs(x) = 18)Total runtime: 2.418 ms (5 rows) -- Michael Fuhr
This is expected. The main TODO items is:* Allow accurate statistics to be collected on indexes with more than one columnor expression indexes, perhaps using per-index statistics Basically, we don't have multi-column or expression statistics. ANALYZE just analyzes columns, even if an expression index exists. --------------------------------------------------------------------------- Michael Fuhr wrote: > I've noticed that row count estimates for expression indexes appear > to rely on default_statistics_target rather than on a column's > actual statistics target. That is, if I use ALTER TABLE SET > STATISTICS to increase a column's statistics target and then run > ANALYZE, then estimates for non-expression-index queries improve > as expected. However, queries that use an expression index remain > accurate for only around the N most common values, where N is the > default_statistics_target that was in effect when ANALYZE ran. I'm > still rummaging through the archives looking for past discussion; > is this behavior a known limitation or just an oversight? > > CREATE TABLE foo (x integer); > > CREATE INDEX foo_x_idx ON foo (x); > CREATE INDEX foo_abs_x_idx ON foo (abs(x)); > > INSERT INTO foo (x) > SELECT r1 % r2 > FROM generate_series(1, 100) AS g1(r1), > generate_series(1, 100) AS g2(r2); > > SET default_statistics_target TO 15; > ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20; > ANALYZE foo; > > SELECT most_common_vals FROM pg_stats WHERE attname = 'x'; > most_common_vals > ----------------------------------------------------- > {0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18} > (1 row) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13; > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1) > Recheck Cond: (x = 13) > -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1) > Index Cond: (x = 13) > Total runtime: 2.905 ms > (5 rows) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1) > Recheck Cond: (abs(x) = 13) > -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1) > Index Cond: (abs(x) = 13) > Total runtime: 2.875 ms > (5 rows) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18; > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1) > Recheck Cond: (x = 18) > -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1) > Index Cond: (x = 18) > Total runtime: 2.393 ms > (5 rows) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1) > Recheck Cond: (abs(x) = 18) > -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1) > Index Cond: (abs(x) = 18) > Total runtime: 2.418 ms > (5 rows) > > -- > Michael Fuhr > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Michael Fuhr <mike@fuhr.org> writes: > I've noticed that row count estimates for expression indexes appear > to rely on default_statistics_target rather than on a column's > actual statistics target. That is, if I use ALTER TABLE SET > STATISTICS to increase a column's statistics target and then run > ANALYZE, then estimates for non-expression-index queries improve > as expected. However, queries that use an expression index remain > accurate for only around the N most common values, where N is the > default_statistics_target that was in effect when ANALYZE ran. The code does in fact honor per-column statistics targets attached to expression indexes, viz alter table myfuncindex alter column pg_expression_1 set statistics 100; This isn't documented, mainly because pg_dump doesn't promise to dump such things, which it doesn't do because I didn't want to see the "pg_expression_N" naming for expression index columns become graven on stone tablets. I seem to recall bringing up the question of whether we could find a less implementation-specific way of commanding this behavior, but I can't find it in the archives right now. regards, tom lane
On Fri, Sep 30, 2005 at 11:59:26PM -0400, Bruce Momjian wrote: > This is expected. The main TODO items is: > > * Allow accurate statistics to be collected on indexes with more than > one column or expression indexes, perhaps using per-index statistics > > Basically, we don't have multi-column or expression statistics. ANALYZE > just analyzes columns, even if an expression index exists. But the row count estimates imply that expression index queries do use column statistics, presumably as a proxy in the absence of expression statistics. This looks like a relevant commit: http://archives.postgresql.org/pgsql-committers/2004-02/msg00124.php The behavior I observed is that the planner does appear to use column statistics when planning an expression index query, but it doesn't appear to honor a column's non-default statistics target. In other words: * Row count estimates for expression index queries (at least simple ones) are reasonably accurate for the N most common columnvalues, where N is the value of default_statistics_target when ANALYZE was run. * Specifically setting the column's statistics target with ALTER TABLE SET STATISTICS doesn't result in better statisticsfor expression index queries. That difference in behavior seems odd: if default_statistics_target has an effect, why doesn't ALTER TABLE SET STATISTICS? -- Michael Fuhr
On Sat, Oct 01, 2005 at 12:53:03AM -0400, Tom Lane wrote: > The code does in fact honor per-column statistics targets attached to > expression indexes, viz > > alter table myfuncindex alter column pg_expression_1 set statistics 100; Aha -- that's the piece I didn't know about. I was wondering where those statistics were being stored, since they were affected by default_statistics_target but not by per-column statistics targets. And now I see them when I don't restrict queries against pg_stats by just the table or column name. Thanks. -- Michael Fuhr
I wrote: > I seem to recall bringing up the question of whether > we could find a less implementation-specific way of commanding this > behavior, but I can't find it in the archives right now. Ah, here it is: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php No responses :-( regards, tom lane
On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote: > I wrote: > > I seem to recall bringing up the question of whether > > we could find a less implementation-specific way of commanding this > > behavior, but I can't find it in the archives right now. > > Ah, here it is: > http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php > > No responses :-( Would an ALTER INDEX SET STATISTICS form be possible? -- Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes: > On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote: >> Ah, here it is: >> http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php > Would an ALTER INDEX SET STATISTICS form be possible? It's not so much the table/index misnomer that's bothering me, it's the lack of a clean way to identify which column of the index you are talking about. regards, tom lane
On Sat, Oct 01, 2005 at 02:42:32AM -0400, Tom Lane wrote: > Michael Fuhr <mike@fuhr.org> writes: > > Would an ALTER INDEX SET STATISTICS form be possible? > > It's not so much the table/index misnomer that's bothering me, it's > the lack of a clean way to identify which column of the index you > are talking about. Ah, I see -- I wasn't thinking about expressions in multicolumn indexes. What about identifying the column with the expression itself, ala quote_ident(pg_get_indexdef())? That might be tedious for the user to type but it would be attractive from a self-documentation standpoint. ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100; I do see that indexes allow multiple instances of the same expression, so this approach could be ambiguous. Or should such repetition be prohibited as it is with column names? test=> CREATE TABLE foo (x integer); CREATE TABLE test=> CREATE INDEX foo1_idx ON foo (x, x); ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index" test=> CREATE INDEX foo2_idx ON foo (abs(x), abs(x)); CREATE INDEX -- Michael Fuhr
Tom Lane wrote: > Michael Fuhr <mike@fuhr.org> writes: > > I've noticed that row count estimates for expression indexes appear > > to rely on default_statistics_target rather than on a column's > > actual statistics target. That is, if I use ALTER TABLE SET > > STATISTICS to increase a column's statistics target and then run > > ANALYZE, then estimates for non-expression-index queries improve > > as expected. However, queries that use an expression index remain > > accurate for only around the N most common values, where N is the > > default_statistics_target that was in effect when ANALYZE ran. > > The code does in fact honor per-column statistics targets attached to > expression indexes, viz > > alter table myfuncindex alter column pg_expression_1 set statistics 100; > > This isn't documented, mainly because pg_dump doesn't promise to dump > such things, which it doesn't do because I didn't want to see the > "pg_expression_N" naming for expression index columns become graven on > stone tablets. I seem to recall bringing up the question of whether > we could find a less implementation-specific way of commanding this > behavior, but I can't find it in the archives right now. Is this a TODO? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Michael Fuhr <mike@fuhr.org> writes: > ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100; Yeah, that could probably be made to work. > I do see that indexes allow multiple instances of the same expression, > so this approach could be ambiguous. I can't think of an actual use for that, though, so we could just ignore the possible ambiguity. Or we could have the ALTER update all columns matching the given expression. > test=> CREATE INDEX foo1_idx ON foo (x, x); > ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index" Hmm, seems like there should be a more direct check for this ... regards, tom lane