Thread: Understanding partial index selection

Understanding partial index selection

From
Owen Nelson
Date:
Hi!

I've got a query running periodically which has been degrading in performance as time goes on. I'm hoping to better understand what the contributing factors are.

Given a table with:
```
postgres=# \d message
                                   Table "public.message"
   Column   |           Type           | Collation | Nullable |           Default          
------------+--------------------------+-----------+----------+-----------------------------
 id         | character varying        | C         | not null |
 created_at | timestamp with time zone |           | not null |
 org_id     | character varying        | C         | not null |
 app_id     | character varying        | C         | not null |
 event_type | character varying        |           | not null |
 uid        | character varying        |           |          |
 payload    | jsonb                    |           |          |
 channels   | jsonb                    |           |          |
 expiration | timestamp with time zone |           | not null | now() + '90 days'::interval
Indexes:
    "pk_message" PRIMARY KEY, btree (id)
    "ix_message_per_app" btree (app_id, id DESC)
    "ix_message_uid_unique_app_cond" UNIQUE, btree (app_id, uid) WHERE uid IS NOT NULL
    "message_payload_not_null_pidx" btree (expiration) WHERE payload IS NOT NULL
Foreign-key constraints:
    "fk_message_app_id_application" FOREIGN KEY (app_id) REFERENCES application(id) ON DELETE CASCADE
Referenced by:
    TABLE "messagedestination" CONSTRAINT "fk_messagedestination_msg_id_message" FOREIGN KEY (msg_id) REFERENCES message(id) ON DELETE CASCADE
```

I periodically run a query like this:
```
UPDATE message SET payload = NULL WHERE id IN (
    SELECT id FROM message
    WHERE
        payload IS NOT NULL
        AND expiration <= now()
    LIMIT 5000
    FOR UPDATE SKIP LOCKED
);
```

The aim is to cull payload values after the retention period has expired.
The hope is the sub-select would leverage the index "message_payload_not_null_pidx" but when I `EXPLAIN ANALYZE` the query, I see a seq scan instead.

```
Update on message  (cost=1773.41..44611.36 rows=5000 width=283) (actual time=20913.192..20913.194 rows=0 loops=1)
  ->  Nested Loop  (cost=1773.41..44611.36 rows=5000 width=283) (actual time=20881.320..20886.541 rows=51 loops=1)
        ->  HashAggregate  (cost=1772.85..1822.85 rows=5000 width=88) (actual time=20881.286..20882.052 rows=51 loops=1)
              Group Key: ("ANY_subquery".id)::text
              ->  Subquery Scan on "ANY_subquery"  (cost=0.00..1760.35 rows=5000 width=88) (actual time=8425.022..20881.244 rows=51 loops=1)
                    ->  Limit  (cost=0.00..1710.35 rows=5000 width=38) (actual time=8425.017..20881.219 rows=51 loops=1)
                          ->  LockRows  (cost=0.00..2112304.92 rows=6175068 width=38) (actual time=8425.016..20881.212 rows=51 loops=1)
                                ->  Seq Scan on message message_1  (cost=0.00..2050554.24 rows=6175068 width=38) (actual time=8424.977..20880.945 rows=65 loops=1)
                                      Filter: ((payload IS NOT NULL) AND (expiration <= now()))
                                      Rows Removed by Filter: 37772897
        ->  Index Scan using pk_message on message  (cost=0.56..8.56 rows=1 width=191) (actual time=0.073..0.073 rows=1 loops=51)
              Index Cond: ((id)::text = ("ANY_subquery".id)::text)
Planning Time: 0.237 ms
Execution Time: 20913.310 ms
```

I have read that the planner's index preference is driven by statistics. I looked at the correlation of both the expiration and payload columns, ~0.8 and ~0.4 respectively. I understand the planner prefers indices where the correlation is farther from 0 (closer to the min/max of the -1, 1 range), but I'm not sure where the threshold might be. Still, 0.4-ish feels low to me.

I wonder if the fact the index is partial using `WHERE payload IS NOT NULL` combined with the fact we're changing the value to NULL is sort of "punching holes" in the index, reducing correlation over time. I'm not sure how to test or prove this is the case.

I also wonder if the index type (btree) is inappropriate for some reason, or what other factors I should be thinking about here.

Separately, I saw this earlier thread which describes a somewhat similar problem, though the details are somewhat different. https://www.postgresql.org/message-id/20230119090425.GA8608%40arp.lijzij.de

Perhaps all I'd need is to `CREATE STATISTICS` on these two columns to make the index viable again.

I hope to just get a better understanding of what's going on. There's an upcoming change to the schema which will likely sidestep this situation, but there's still a lesson to learn here.

Thanks,
Owen

Re: Understanding partial index selection

From
Tom Lane
Date:
Owen Nelson <onelson@gmail.com> writes:
> The hope is the sub-select would leverage the index
> "message_payload_not_null_pidx" but when I `EXPLAIN ANALYZE` the query, I
> see a seq scan instead.

I think your problem is the horrid rowcount misestimation here:

>                                 ->  Seq Scan on message message_1 (cost=0.00..2050554.24 rows=6175068 width=38)
(actualtime=8424.977..20880.945 rows=65 loops=1) 
>                                       Filter: ((payload IS NOT NULL) AND (expiration <= now()))
>                                       Rows Removed by Filter: 37772897

The planner seems to think that about a sixth of the table satisfies
the filter condition, which is way more than enough to discourage it
from using an indexscan.  If it had gotten an estimate within even
one or two orders of magnitude of the reality of 65 rows, it'd have
gone for an indexscan, I'm pretty sure.

Are your ANALYZE stats up to date on this table?  If so, there must
be some strong correlation between the payload and expiration
conditions that the planner doesn't know about.  Perhaps creating
extended statistics on those two columns would help.

            regards, tom lane



Re: Understanding partial index selection

From
David Rowley
Date:
On Wed, 29 Nov 2023 at 11:23, Owen Nelson <onelson@gmail.com> wrote:
>     "message_payload_not_null_pidx" btree (expiration) WHERE payload IS NOT NULL

> I periodically run a query like this:
> ```
> UPDATE message SET payload = NULL WHERE id IN (

> Update on message  (cost=1773.41..44611.36 rows=5000 width=283) (actual time=20913.192..20913.194 rows=0 loops=1)
>   ->  Nested Loop  (cost=1773.41..44611.36 rows=5000 width=283) (actual time=20881.320..20886.541 rows=51 loops=1)
>         ->  HashAggregate  (cost=1772.85..1822.85 rows=5000 width=88) (actual time=20881.286..20882.052 rows=51
loops=1)
>               Group Key: ("ANY_subquery".id)::text
>               ->  Subquery Scan on "ANY_subquery"  (cost=0.00..1760.35 rows=5000 width=88) (actual
time=8425.022..20881.244rows=51 loops=1)
 
>                     ->  Limit  (cost=0.00..1710.35 rows=5000 width=38) (actual time=8425.017..20881.219 rows=51
loops=1)
>                           ->  LockRows  (cost=0.00..2112304.92 rows=6175068 width=38) (actual
time=8425.016..20881.212rows=51 loops=1)
 
>                                 ->  Seq Scan on message message_1  (cost=0.00..2050554.24 rows=6175068 width=38)
(actualtime=8424.977..20880.945 rows=65 loops=1)
 
>                                       Filter: ((payload IS NOT NULL) AND (expiration <= now()))
>                                       Rows Removed by Filter: 37772897
>         ->  Index Scan using pk_message on message  (cost=0.56..8.56 rows=1 width=191) (actual time=0.073..0.073
rows=1loops=51)
 
>               Index Cond: ((id)::text = ("ANY_subquery".id)::text)
> Planning Time: 0.237 ms
> Execution Time: 20913.310 ms

I think the most likely cause is that the index has just become
bloated from all the updates.  If you run the query after running SET
enable_seqscan TO off; then, proving the planner opts to use the
message_payload_not_null_pidx, you'll see what the planner's estimated
cost of that scan is. If you see the index being used, then that'll at
least confirm the index was not picked due to costs.

If the index gets used, then I'd check the size of the
message_payload_not_null_pidx index.  You could also consider using
pgstatindex() [1] to check the state of the index and if it's bloated,
reindex it.

David

[1] https://www.postgresql.org/docs/current/pgstattuple.html



Re: Understanding partial index selection

From
Owen Nelson
Date:
Embarrassed to say that it's been so long since I participated in a mailing list I neglected to ensure my replies were directed back at the list rather than select individuals.
I'll recap what I shared for the list here, for posterity.

@Boris

This is where the iteration aspect is tricky. I don't have much in the way of access to run ad hoc queries against the database, so right now I can't just run it.
I can see that pg_stat_user_tables.n_mod_since_analyze is 9834 which seems reasonable to me given the long life of the table so far, and I assume that the autovacuum/autoanalyze are doing the right thing here.

@Tom

> Are your ANALYZE stats up to date on this table?

It's a very good question! Right now, I'm taking it on faith that autovacuum and autoanalyze are keeping things up to date, but if I'm honest I've been getting some conflicting information from pg_stat_user_tables and I'm starting to suspect this might be an AWS Aurora quirk. The timestamps for last autovacuum seem recent enough to not be concerning, but sometimes all the stats come back with zeros and nulls. Surely that can't be right :(

I also speculated about using extended statistics. I would/will experiment with this if I can reproduce this degraded state outside of the production system. TBD.


I'm glad you mentioned pgstattuple. This is new to me, and I'm anxious to find some signal that could indicate the problem. I'll work to get the extension enabled and see what it shows.

I'm not currently able to `SET seqscan_enabled TO off`. The access I have right now only permits single statements, executed separately. Comparing the cost is an interesting idea though, so I'll see what I can do.


On Tue, Nov 28, 2023 at 2:23 PM Owen Nelson <onelson@gmail.com> wrote:
Hi!

I've got a query running periodically which has been degrading in performance as time goes on. I'm hoping to better understand what the contributing factors are.

Given a table with:
```
postgres=# \d message
                                   Table "public.message"
   Column   |           Type           | Collation | Nullable |           Default          
------------+--------------------------+-----------+----------+-----------------------------
 id         | character varying        | C         | not null |
 created_at | timestamp with time zone |           | not null |
 org_id     | character varying        | C         | not null |
 app_id     | character varying        | C         | not null |
 event_type | character varying        |           | not null |
 uid        | character varying        |           |          |
 payload    | jsonb                    |           |          |
 channels   | jsonb                    |           |          |
 expiration | timestamp with time zone |           | not null | now() + '90 days'::interval
Indexes:
    "pk_message" PRIMARY KEY, btree (id)
    "ix_message_per_app" btree (app_id, id DESC)
    "ix_message_uid_unique_app_cond" UNIQUE, btree (app_id, uid) WHERE uid IS NOT NULL
    "message_payload_not_null_pidx" btree (expiration) WHERE payload IS NOT NULL
Foreign-key constraints:
    "fk_message_app_id_application" FOREIGN KEY (app_id) REFERENCES application(id) ON DELETE CASCADE
Referenced by:
    TABLE "messagedestination" CONSTRAINT "fk_messagedestination_msg_id_message" FOREIGN KEY (msg_id) REFERENCES message(id) ON DELETE CASCADE
```

I periodically run a query like this:
```
UPDATE message SET payload = NULL WHERE id IN (
    SELECT id FROM message
    WHERE
        payload IS NOT NULL
        AND expiration <= now()
    LIMIT 5000
    FOR UPDATE SKIP LOCKED
);
```

The aim is to cull payload values after the retention period has expired.
The hope is the sub-select would leverage the index "message_payload_not_null_pidx" but when I `EXPLAIN ANALYZE` the query, I see a seq scan instead.

```
Update on message  (cost=1773.41..44611.36 rows=5000 width=283) (actual time=20913.192..20913.194 rows=0 loops=1)
  ->  Nested Loop  (cost=1773.41..44611.36 rows=5000 width=283) (actual time=20881.320..20886.541 rows=51 loops=1)
        ->  HashAggregate  (cost=1772.85..1822.85 rows=5000 width=88) (actual time=20881.286..20882.052 rows=51 loops=1)
              Group Key: ("ANY_subquery".id)::text
              ->  Subquery Scan on "ANY_subquery"  (cost=0.00..1760.35 rows=5000 width=88) (actual time=8425.022..20881.244 rows=51 loops=1)
                    ->  Limit  (cost=0.00..1710.35 rows=5000 width=38) (actual time=8425.017..20881.219 rows=51 loops=1)
                          ->  LockRows  (cost=0.00..2112304.92 rows=6175068 width=38) (actual time=8425.016..20881.212 rows=51 loops=1)
                                ->  Seq Scan on message message_1  (cost=0.00..2050554.24 rows=6175068 width=38) (actual time=8424.977..20880.945 rows=65 loops=1)
                                      Filter: ((payload IS NOT NULL) AND (expiration <= now()))
                                      Rows Removed by Filter: 37772897
        ->  Index Scan using pk_message on message  (cost=0.56..8.56 rows=1 width=191) (actual time=0.073..0.073 rows=1 loops=51)
              Index Cond: ((id)::text = ("ANY_subquery".id)::text)
Planning Time: 0.237 ms
Execution Time: 20913.310 ms
```

I have read that the planner's index preference is driven by statistics. I looked at the correlation of both the expiration and payload columns, ~0.8 and ~0.4 respectively. I understand the planner prefers indices where the correlation is farther from 0 (closer to the min/max of the -1, 1 range), but I'm not sure where the threshold might be. Still, 0.4-ish feels low to me.

I wonder if the fact the index is partial using `WHERE payload IS NOT NULL` combined with the fact we're changing the value to NULL is sort of "punching holes" in the index, reducing correlation over time. I'm not sure how to test or prove this is the case.

I also wonder if the index type (btree) is inappropriate for some reason, or what other factors I should be thinking about here.

Separately, I saw this earlier thread which describes a somewhat similar problem, though the details are somewhat different. https://www.postgresql.org/message-id/20230119090425.GA8608%40arp.lijzij.de

Perhaps all I'd need is to `CREATE STATISTICS` on these two columns to make the index viable again.

I hope to just get a better understanding of what's going on. There's an upcoming change to the schema which will likely sidestep this situation, but there's still a lesson to learn here.

Thanks,
Owen

Re: Understanding partial index selection

From
Tom Lane
Date:
Owen Nelson <onelson@gmail.com> writes:
>> Are your ANALYZE stats up to date on this table?

> It's a very good question! Right now, I'm taking it on faith that
> autovacuum and autoanalyze are keeping things up to date, but if I'm honest
> I've been getting some conflicting information from pg_stat_user_tables and
> I'm starting to suspect this might be an AWS Aurora quirk.

Oh, you didn't say this was Aurora.  Aurora is not really Postgres ---
at least, my not-an-insider impression is that the storage engine
is quite different.  Not sure how much of the advice you got here
is relevant to it.

            regards, tom lane



Re: Understanding partial index selection

From
Owen Nelson
Date:
> Aurora is not really Postgres

Oh geez, I didn't realize there was such a divide. This is my first look at Aurora and I thought it was just a hosted postgres offering.

Still, I'll take what I can get. Hopefully, some of this will carry over.

Re: Understanding partial index selection

From
Adrian Klaver
Date:
On 11/28/23 18:13, Owen Nelson wrote:
>  > Aurora is not really Postgres
> 
> Oh geez, I didn't realize there was such a divide. This is my first look 
> at Aurora and I thought it was just a hosted postgres offering.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html#aur-shared-resp


"Aurora includes a high-performance storage subsystem. Its MySQL- and 
PostgreSQL-compatible database engines are customized to take advantage 
of that fast distributed storage. "

When I see things like *-compatible alarms start going off.

> 
> Still, I'll take what I can get. Hopefully, some of this will carry over.

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Understanding partial index selection

From
Owen Nelson
Date:
I was able to pull some stats with pgstattuple and nothing looks particularly hinky to me.

version: 4
tree_level: 2
index_size: 499589120
root_block_no: 412
internal_pages: 194
leaf_pages: 54572
empty_pages: 0
deleted_pages: 6218
avg_leaf_density: 90.08
leaf_fragmentation: 0.01

For flavor, If I remember correctly, the table has around 50mil rows, and around 17mil of them should be included in the partial index due to the "where payload is not null" predicate.

0 deleted pages would be nicer than ~6k, but by my count, that's around 10% of the total index size. I also assume if the index was not cleaned up during regular operations this number would be much larger. I think this points away from index bloat as the culprit, but please check me on this.

We're checking assumptions about when/how often the table is getting analyzed, but other than possibly using extended stats it sounds like the only other odd thing is "Aurora not being Postgres," which I'm not sure there's much I can do about right now :(

On Tue, Nov 28, 2023 at 9:23 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 11/28/23 18:13, Owen Nelson wrote:
>  > Aurora is not really Postgres
>
> Oh geez, I didn't realize there was such a divide. This is my first look
> at Aurora and I thought it was just a hosted postgres offering.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html#aur-shared-resp


"Aurora includes a high-performance storage subsystem. Its MySQL- and
PostgreSQL-compatible database engines are customized to take advantage
of that fast distributed storage. "

When I see things like *-compatible alarms start going off.

>
> Still, I'll take what I can get. Hopefully, some of this will carry over.

--
Adrian Klaver
adrian.klaver@aklaver.com