Thread: Unused expression indexes
In a blog post [1], Bruce Momjian notes that expression indexes can help with planning even if they're not used directly. But the examples cited in that post are vague (i.e., they improve stats, but it's not clear how they could change plans), and Bruce's answer to a comment [2] suggests that this is not documented. Is there any more info on this mechanism? Specifically, if one has unused expression indexes (according to pg_stat_user_indexes), is it safe to drop them? Or could they be providing statistics that materially affect query planning even though the indexes themselves are unused? It looks like a very similar question was asked earlier this year on pgsql-general [3], but got only a vague answer. Any background or tips for where to look in the source regarding this behavior would be greatly appreciated. Thanks, Maciek [1]: https://momjian.us/main/blogs/pgblog/2017.html#February_20_2017 [2]: https://momjian.us/main/comment_item.html?/main/blogs/pgblog.html/February_20_2017#comment-3174376969 [3]: https://www.postgresql.org/message-id/flat/CAHg%3DPn%3DOZu7A3p%2B0Z-CDG4s2CHYe3UFQCTZp4RWGCEn2gmD35A%40mail.gmail.com
Maciek Sakrejda <maciek@pganalyze.com> writes:
> In a blog post [1], Bruce Momjian notes that expression indexes can
> help with planning even if they're not used directly. But the examples
> cited in that post are vague (i.e., they improve stats, but it's not
> clear how they could change plans), and Bruce's answer to a comment
> [2] suggests that this is not documented.

> Is there any more info on this mechanism? Specifically, if one has
> unused expression indexes (according to pg_stat_user_indexes), is it
> safe to drop them? Or could they be providing statistics that
> materially affect query planning even though the indexes themselves
> are unused?

Expression indexes definitely can affect planning, because ANALYZE
collects stats on the values of those expressions. As a trivial
example,

regression=# create table foo (x1 float8);
CREATE TABLE
regression=# insert into foo select 10 * random() from generate_series(1,10000);
INSERT 0 10000
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..195.00 rows=3333 width=8) (actual time=0.009..0.546 rows=1028 loops=1)
   Filter: (sqrt(x1) < '1'::double precision)
   Rows Removed by Filter: 8972
 Planning Time: 0.065 ms
 Execution Time: 0.572 ms
(5 rows)

The planner has no info about the values of sqrt(x1), so you get a
default estimate (one-third) of the selectivity of the WHERE clause.
But watch this:

regression=# create index on foo (sqrt(x1));
CREATE INDEX
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
   Recheck Cond: (sqrt(x1) < '1'::double precision)
   Heap Blocks: exact=45
   ->  Bitmap Index Scan on foo_sqrt_idx  (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
         Index Cond: (sqrt(x1) < '1'::double precision)
 Planning Time: 0.113 ms
 Execution Time: 0.259 ms
(7 rows)

Now there are stats about the values of sqrt(x1), allowing a far more
accurate selectivity estimate to be made. In this particular example
there's no change of plan (it would have used the index anyway), but
certainly a different rowcount estimate can make a big difference.

This mechanism is quite ancient, and in principle it's now superseded
by extended statistics. For example, I can drop this index and
instead do

regression=# drop index foo_sqrt_idx;
DROP INDEX
regression=# create statistics foostats on sqrt(x1) from foo;
CREATE STATISTICS
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..195.00 rows=1026 width=8) (actual time=0.006..0.479 rows=1028 loops=1)
   Filter: (sqrt(x1) < '1'::double precision)
   Rows Removed by Filter: 8972
 Planning Time: 0.079 ms
 Execution Time: 0.503 ms
(5 rows)

So the accurate rowcount estimate is still obtained in this example;
and we're not incurring any index maintenance costs, only ANALYZE
costs that are going to be roughly the same either way.
However, I am not certain that extended statistics are plugged into
all the places where the older mechanism applies. Tomas Vondra might
have a better idea than I of where gaps remain in that.

			regards, tom lane
On Tue, Aug 6, 2024 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The planner has no info about the values of sqrt(x1), so you get a
> default estimate (one-third) of the selectivity of the WHERE clause.
> But watch this:
>
> regression=# create index on foo (sqrt(x1));
> CREATE INDEX
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                                           QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
>    Recheck Cond: (sqrt(x1) < '1'::double precision)
>    Heap Blocks: exact=45
>    ->  Bitmap Index Scan on foo_sqrt_idx  (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
>          Index Cond: (sqrt(x1) < '1'::double precision)
>  Planning Time: 0.113 ms
>  Execution Time: 0.259 ms
> (7 rows)
>
> Now there are stats about the values of sqrt(x1), allowing a far more
> accurate selectivity estimate to be made. In this particular example
> there's no change of plan (it would have used the index anyway), but
> certainly a different rowcount estimate can make a big difference.

Thanks, but I was asking specifically about _unused_ indexes
(according to pg_stat_user_indexes). Bruce's blog post showed how they
can still influence rowcount estimates, but can they do that (1) even
if they don't end up being used by the query plan and (2) in a way
that leads to a different plan? Basically, if I have some unused
expression indexes, is it safe to drop them, or could they be used for
planning optimizations even if they are not used directly?

Thanks,
Maciek
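As a side note for anyone auditing their own schema for indexes like the ones described above, a catalog query along these lines can list expression indexes with no recorded scans. This is a sketch, not from the thread; it relies on pg_index.indexprs being non-null only for indexes with expression columns, and on idx_scan being cumulative since the last statistics reset:

```sql
-- Sketch: expression indexes with zero recorded scans.
-- Caveat: idx_scan counts since the last stats reset, so zero only
-- means "unused since the counters were last reset".
select s.schemaname, s.relname, s.indexrelname, s.idx_scan
from pg_stat_user_indexes s
join pg_index i on i.indexrelid = s.indexrelid
where s.idx_scan = 0
  and i.indexprs is not null;
```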
Maciek Sakrejda <maciek@pganalyze.com> writes:
> Thanks, but I was asking specifically about _unused_ indexes
> (according to pg_stat_user_indexes). Bruce's blog post showed how they
> can still influence rowcount estimates, but can they do that (1) even
> if they don't end up being used by the query plan and (2) in a way
> that leads to a different plan?

Certainly. This example was too simple to illustrate that point
perhaps, but we would have arrived at the same rowcount estimate
whether it then chose to use the index or not. (You could prove that
with the same example by modifying the comparison constant to make the
rowcount estimate high enough to discourage use of the index.)

In turn, a different rowcount estimate might change the plan for
subsequent joins or other processing. I didn't spend enough time on
the example to show that, but it's surely not hard to show.

			regards, tom lane
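The parenthetical above can be made concrete with the same table. The following is an illustrative sketch (not from the original session, so no exact plan numbers are reproduced): with the sqrt(x1) index from the earlier example still in place, a less selective constant yields a plan that does not touch the index, yet the rowcount estimate still comes from the index's expression stats rather than the one-third default.

```sql
-- Sketch: sqrt(x1) < 3 is equivalent to x1 < 9, which matches roughly
-- 90% of the uniformly distributed values, so the planner should
-- estimate on the order of 9000 rows and prefer a seq scan -- an
-- estimate it can only make from the expression index's stats, even
-- though the plan does not use the index at all.
explain select * from foo where sqrt(x1) < 3;
```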
On 8/6/24 22:25, Tom Lane wrote:
> Maciek Sakrejda <maciek@pganalyze.com> writes:
>> In a blog post [1], Bruce Momjian notes that expression indexes can
>> help with planning even if they're not used directly. But the examples
>> cited in that post are vague (i.e., they improve stats, but it's not
>> clear how they could change plans), and Bruce's answer to a comment
>> [2] suggests that this is not documented.
>
>> Is there any more info on this mechanism? Specifically, if one has
>> unused expression indexes (according to pg_stat_user_indexes), is it
>> safe to drop them? Or could they be providing statistics that
>> materially affect query planning even though the indexes themselves
>> are unused?
>
> Expression indexes definitely can affect planning, because ANALYZE
> collects stats on the values of those expressions. As a trivial
> example,
>
> regression=# create table foo (x1 float8);
> CREATE TABLE
> regression=# insert into foo select 10 * random() from generate_series(1,10000);
> INSERT 0 10000
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Seq Scan on foo  (cost=0.00..195.00 rows=3333 width=8) (actual time=0.009..0.546 rows=1028 loops=1)
>    Filter: (sqrt(x1) < '1'::double precision)
>    Rows Removed by Filter: 8972
>  Planning Time: 0.065 ms
>  Execution Time: 0.572 ms
> (5 rows)
>
> The planner has no info about the values of sqrt(x1), so you get a
> default estimate (one-third) of the selectivity of the WHERE clause.
> But watch this:
>
> regression=# create index on foo (sqrt(x1));
> CREATE INDEX
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                                           QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
>    Recheck Cond: (sqrt(x1) < '1'::double precision)
>    Heap Blocks: exact=45
>    ->  Bitmap Index Scan on foo_sqrt_idx  (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
>          Index Cond: (sqrt(x1) < '1'::double precision)
>  Planning Time: 0.113 ms
>  Execution Time: 0.259 ms
> (7 rows)
>
> Now there are stats about the values of sqrt(x1), allowing a far more
> accurate selectivity estimate to be made. In this particular example
> there's no change of plan (it would have used the index anyway), but
> certainly a different rowcount estimate can make a big difference.
>
> This mechanism is quite ancient, and in principle it's now superseded
> by extended statistics. For example, I can drop this index and
> instead do
>
> regression=# drop index foo_sqrt_idx;
> DROP INDEX
> regression=# create statistics foostats on sqrt(x1) from foo;
> CREATE STATISTICS
> regression=# analyze foo;
> ANALYZE
> regression=# explain analyze select * from foo where sqrt(x1) < 1;
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Seq Scan on foo  (cost=0.00..195.00 rows=1026 width=8) (actual time=0.006..0.479 rows=1028 loops=1)
>    Filter: (sqrt(x1) < '1'::double precision)
>    Rows Removed by Filter: 8972
>  Planning Time: 0.079 ms
>  Execution Time: 0.503 ms
> (5 rows)
>
> So the accurate rowcount estimate is still obtained in this example;
> and we're not incurring any index maintenance costs, only ANALYZE
> costs that are going to be roughly the same either way.
> However, I am not certain that extended statistics are plugged into
> all the places where the older mechanism applies. Tomas Vondra might
> have a better idea than I of where gaps remain in that.

AFAIK it handles all / exactly the same cases. The magic happens in
examine_variable() in selfuncs.c, where we look for stats for simple
Vars, and then for stats for expressions. First we look at all
indexes, and then (if there's no suitable index) at all extended
stats. There might be a place doing something ad hoc, but I can't
think of any.

regards

-- 
Tomas Vondra
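A practical corollary, sketched here as an addition (using the foo_sqrt_idx and foostats names from Tom's examples): you can see which mechanism is feeding the planner by querying the stats views directly. Stats gathered through an expression index are stored against the index itself, so they show up in pg_stats keyed by the index's name; stats from CREATE STATISTICS on expressions appear in pg_stats_ext_exprs instead (a view available since PostgreSQL 14).

```sql
-- Expression-index stats: pg_stats rows keyed by the index name.
select tablename, attname, n_distinct
from pg_stats
where tablename = 'foo_sqrt_idx';

-- Extended-statistics expression stats live in a separate view (PG 14+).
select statistics_name, expr, n_distinct
from pg_stats_ext_exprs
where statistics_name = 'foostats';
```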
Great, thank you both for the info.