Thread: Unused expression indexes
In a blog post [1], Bruce Momjian notes that expression indexes can help with planning even if they're not used directly. But the examples cited in that post are vague (i.e., they improve stats, but it's not clear how they could change plans), and Bruce's answer to a comment [2] suggests that this is not documented. Is there any more info on this mechanism? Specifically, if one has unused expression indexes (according to pg_stat_user_indexes), is it safe to drop them? Or could they be providing statistics that materially affect query planning even though the indexes themselves are unused? It looks like a very similar question was asked earlier this year on pgsql-general [3], but got only a vague answer. Any background or tips for where to look in the source regarding this behavior would be greatly appreciated. Thanks, Maciek [1]: https://momjian.us/main/blogs/pgblog/2017.html#February_20_2017 [2]: https://momjian.us/main/comment_item.html?/main/blogs/pgblog.html/February_20_2017#comment-3174376969 [3]: https://www.postgresql.org/message-id/flat/CAHg%3DPn%3DOZu7A3p%2B0Z-CDG4s2CHYe3UFQCTZp4RWGCEn2gmD35A%40mail.gmail.com
Maciek Sakrejda <maciek@pganalyze.com> writes:
> In a blog post [1], Bruce Momjian notes that expression indexes can
> help with planning even if they're not used directly. But the examples
> cited in that post are vague (i.e., they improve stats, but it's not
> clear how they could change plans), and Bruce's answer to a comment
> [2] suggests that this is not documented.

> Is there any more info on this mechanism? Specifically, if one has
> unused expression indexes (according to pg_stat_user_indexes), is it
> safe to drop them? Or could they be providing statistics that
> materially affect query planning even though the indexes themselves
> are unused?

Expression indexes definitely can affect planning, because ANALYZE
collects stats on the values of those expressions. As a trivial
example,

regression=# create table foo (x1 float8);
CREATE TABLE
regression=# insert into foo select 10 * random() from generate_series(1,10000);
INSERT 0 10000
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..195.00 rows=3333 width=8) (actual time=0.009..0.546 rows=1028 loops=1)
   Filter: (sqrt(x1) < '1'::double precision)
   Rows Removed by Filter: 8972
 Planning Time: 0.065 ms
 Execution Time: 0.572 ms
(5 rows)

The planner has no info about the values of sqrt(x1), so you get a
default estimate (one-third) of the selectivity of the WHERE clause.
But watch this:

regression=# create index on foo (sqrt(x1));
CREATE INDEX
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
   Recheck Cond: (sqrt(x1) < '1'::double precision)
   Heap Blocks: exact=45
   ->  Bitmap Index Scan on foo_sqrt_idx  (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
         Index Cond: (sqrt(x1) < '1'::double precision)
 Planning Time: 0.113 ms
 Execution Time: 0.259 ms
(7 rows)

Now there are stats about the values of sqrt(x1), allowing a far more
accurate selectivity estimate to be made. In this particular example
there's no change of plan (it would have used the index anyway), but
certainly a different rowcount estimate can make a big difference.

This mechanism is quite ancient, and in principle it's now superseded
by extended statistics. For example, I can drop this index and
instead do

regression=# drop index foo_sqrt_idx;
DROP INDEX
regression=# create statistics foostats on sqrt(x1) from foo;
CREATE STATISTICS
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..195.00 rows=1026 width=8) (actual time=0.006..0.479 rows=1028 loops=1)
   Filter: (sqrt(x1) < '1'::double precision)
   Rows Removed by Filter: 8972
 Planning Time: 0.079 ms
 Execution Time: 0.503 ms
(5 rows)

So the accurate rowcount estimate is still obtained in this example;
and we're not incurring any index maintenance costs, only ANALYZE
costs that are going to be roughly the same either way.
However, I am not certain that extended statistics are plugged into
all the places where the older mechanism applies. Tomas Vondra might
have a better idea than I of where gaps remain in that.

			regards, tom lane
On Tue, Aug 6, 2024 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The planner has no info about the values of sqrt(x1), so you get a
> default estimate (one-third) of the selectivity of the WHERE clause.
> But watch this:
>
> regression=# create index on foo (sqrt(x1));
> CREATE INDEX
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                                           QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
>    Recheck Cond: (sqrt(x1) < '1'::double precision)
>    Heap Blocks: exact=45
>    ->  Bitmap Index Scan on foo_sqrt_idx  (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
>          Index Cond: (sqrt(x1) < '1'::double precision)
>  Planning Time: 0.113 ms
>  Execution Time: 0.259 ms
> (7 rows)
>
> Now there are stats about the values of sqrt(x1), allowing a far more
> accurate selectivity estimate to be made. In this particular example
> there's no change of plan (it would have used the index anyway), but
> certainly a different rowcount estimate can make a big difference.

Thanks, but I was asking specifically about _unused_ indexes
(according to pg_stat_user_indexes). Bruce's blog post showed how they
can still influence rowcount estimates, but can they do that (1) even
if they don't end up being used by the query plan and (2) in a way
that leads to a different plan? Basically, if I have some unused
expression indexes, is it safe to drop them, or could they be used for
planning optimizations even if they are not used directly?

Thanks,
Maciek
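As a side note for anyone auditing their own schema for indexes like the ones described above, a catalog query along these lines can list expression indexes with no recorded scans. This is a sketch, not from the thread; it relies on pg_index.indexprs being non-null only for indexes with expression columns, and on idx_scan being cumulative since the last statistics reset:

```sql
-- Sketch: expression indexes with zero recorded scans.
-- Caveat: idx_scan counts since the last stats reset, so zero only
-- means "unused since the counters were last reset".
select s.schemaname, s.relname, s.indexrelname, s.idx_scan
from pg_stat_user_indexes s
join pg_index i on i.indexrelid = s.indexrelid
where s.idx_scan = 0
  and i.indexprs is not null;
```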
Maciek Sakrejda <maciek@pganalyze.com> writes:
> Thanks, but I was asking specifically about _unused_ indexes
> (according to pg_stat_user_indexes). Bruce's blog post showed how they
> can still influence rowcount estimates, but can they do that (1) even
> if they don't end up being used by the query plan and (2) in a way
> that leads to a different plan?

Certainly. This example was too simple to illustrate that point
perhaps, but we would have arrived at the same rowcount estimate
whether it then chose to use the index or not. (You could prove that
with the same example by modifying the comparison constant to make the
rowcount estimate high enough to discourage use of the index.)

In turn, a different rowcount estimate might change the plan for
subsequent joins or other processing. I didn't spend enough time on
the example to show that, but it's surely not hard to show.

			regards, tom lane
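The parenthetical above can be made concrete with the same table. The following is an illustrative sketch (not from the original session, so no exact plan numbers are reproduced): with the sqrt(x1) index from the earlier example still in place, a less selective constant yields a plan that does not touch the index, yet the rowcount estimate still comes from the index's expression stats rather than the one-third default.

```sql
-- Sketch: sqrt(x1) < 3 is equivalent to x1 < 9, which matches roughly
-- 90% of the uniformly distributed values, so the planner should
-- estimate on the order of 9000 rows and prefer a seq scan -- an
-- estimate it can only make from the expression index's stats, even
-- though the plan does not use the index at all.
explain select * from foo where sqrt(x1) < 3;
```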
On 8/6/24 22:25, Tom Lane wrote:
> Maciek Sakrejda <maciek@pganalyze.com> writes:
>> In a blog post [1], Bruce Momjian notes that expression indexes can
>> help with planning even if they're not used directly. But the examples
>> cited in that post are vague (i.e., they improve stats, but it's not
>> clear how they could change plans), and Bruce's answer to a comment
>> [2] suggests that this is not documented.
>
>> Is there any more info on this mechanism? Specifically, if one has
>> unused expression indexes (according to pg_stat_user_indexes), is it
>> safe to drop them? Or could they be providing statistics that
>> materially affect query planning even though the indexes themselves
>> are unused?
>
> Expression indexes definitely can affect planning, because ANALYZE
> collects stats on the values of those expressions. As a trivial
> example,
>
> regression=# create table foo (x1 float8);
> CREATE TABLE
> regression=# insert into foo select 10 * random() from generate_series(1,10000);
> INSERT 0 10000
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Seq Scan on foo  (cost=0.00..195.00 rows=3333 width=8) (actual time=0.009..0.546 rows=1028 loops=1)
>    Filter: (sqrt(x1) < '1'::double precision)
>    Rows Removed by Filter: 8972
>  Planning Time: 0.065 ms
>  Execution Time: 0.572 ms
> (5 rows)
>
> The planner has no info about the values of sqrt(x1), so you get a
> default estimate (one-third) of the selectivity of the WHERE clause.
> But watch this:
>
> regression=# create index on foo (sqrt(x1));
> CREATE INDEX
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                                           QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
>    Recheck Cond: (sqrt(x1) < '1'::double precision)
>    Heap Blocks: exact=45
>    ->  Bitmap Index Scan on foo_sqrt_idx  (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
>          Index Cond: (sqrt(x1) < '1'::double precision)
>  Planning Time: 0.113 ms
>  Execution Time: 0.259 ms
> (7 rows)
>
> Now there are stats about the values of sqrt(x1), allowing a far more
> accurate selectivity estimate to be made. In this particular example
> there's no change of plan (it would have used the index anyway), but
> certainly a different rowcount estimate can make a big difference.
>
> This mechanism is quite ancient, and in principle it's now superseded
> by extended statistics. For example, I can drop this index and
> instead do
>
> regression=# drop index foo_sqrt_idx;
> DROP INDEX
> regression=# create statistics foostats on sqrt(x1) from foo;
> CREATE STATISTICS
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Seq Scan on foo  (cost=0.00..195.00 rows=1026 width=8) (actual time=0.006..0.479 rows=1028 loops=1)
>    Filter: (sqrt(x1) < '1'::double precision)
>    Rows Removed by Filter: 8972
>  Planning Time: 0.079 ms
>  Execution Time: 0.503 ms
> (5 rows)
>
> So the accurate rowcount estimate is still obtained in this example;
> and we're not incurring any index maintenance costs, only ANALYZE
> costs that are going to be roughly the same either way.
> However, I am not certain that extended statistics are plugged into
> all the places where the older mechanism applies. Tomas Vondra might
> have a better idea than I of where gaps remain in that.

AFAIK it handles all / exactly the same cases. The magic happens in
examine_variable() in selfuncs.c, where we look for stats for simple
Vars, and then for stats for expressions. First we look at all
indexes, and then (if there's no suitable index) at all extended
stats. There might be a place doing something ad hoc, but I can't
think of any.

regards

-- 
Tomas Vondra
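A practical corollary, sketched here as an addition (using the foo_sqrt_idx and foostats names from Tom's examples): you can see which mechanism is feeding the planner by querying the stats views directly. Stats gathered through an expression index are stored against the index itself, so they show up in pg_stats keyed by the index's name; stats from CREATE STATISTICS on expressions appear in pg_stats_ext_exprs instead (a view available since PostgreSQL 14).

```sql
-- Expression-index stats: pg_stats rows keyed by the index name.
select tablename, attname, n_distinct
from pg_stats
where tablename = 'foo_sqrt_idx';

-- Extended-statistics expression stats live in a separate view (PG 14+).
select statistics_name, expr, n_distinct
from pg_stats_ext_exprs
where statistics_name = 'foostats';
```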
Great, thank you both for the info.