Use of inefficient index in the presence of dead tuples

From
Alexander Staubo
Date:
I am encountering an odd problem where Postgres will use the wrong index, particularly if the table
has some dead tuples. The database affected is running 12.6, but I can also reproduce with 16.3.

To reproduce:

(1) Disable autovacuum. This is just so we can induce a scenario where there are lots of dead tuples.

(2) Set up schema. It's important to create the index before insertion, in order to provoke a
situation where the indexes have dead tuples:

    CREATE TABLE outbox_batches (
        id             text    NOT NULL,
        receiver       text    NOT NULL,
        created_at     timestamp without time zone DEFAULT now() NOT NULL,
        PRIMARY KEY (receiver, id)
    );
    CREATE INDEX outbox_batches_on_receiver_and_created_at
        ON outbox_batches (receiver, created_at DESC);

(3) Insert 5M rows of dummy data. Note that we are using UUIDs here for the purposes of testing; in
    my real database, I use much shorter unique IDs.

    INSERT INTO outbox_batches (receiver, id)
        SELECT 'dummy', uuid_generate_v4()
        FROM (SELECT * FROM generate_series(1, 5000000, 1)) AS foo;
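Note that uuid_generate_v4() comes from the uuid-ossp extension; if that is not installed, the built-in gen_random_uuid() (PostgreSQL 13+) works just as well for this repro:

```sql
-- Either enable the extension first:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Or, on PostgreSQL 13 and later, use the built-in function instead:
INSERT INTO outbox_batches (receiver, id)
    SELECT 'dummy', gen_random_uuid()::text
    FROM generate_series(1, 5000000);
```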

(4) Then ensure all tuples are dead except one:

    DELETE FROM outbox_batches;
    INSERT INTO outbox_batches (receiver, id) VALUES ('dummy', 'test');

(5) Analyze:

    ANALYZE outbox_batches;

(6) You should now have 5M dead rows and 1 live row:

    SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'outbox_batches';
    ┌────────────┬────────────┐
    │ n_live_tup │ n_dead_tup │
    ├────────────┼────────────┤
    │          1 │    5000000 │
    └────────────┴────────────┘

    We also observe that the outbox_batches_pkey index is 454 MB, while
    outbox_batches_on_receiver_and_created_at is only 31 MB.
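    (For reference, these sizes can be checked with pg_relation_size:

```sql
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname IN ('outbox_batches_pkey',
                  'outbox_batches_on_receiver_and_created_at');
```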

(7) Try the following query:

    EXPLAIN (ANALYZE, VERBOSE, BUFFERS, COSTS, TIMING, SETTINGS, SUMMARY)
    SELECT id FROM outbox_batches
    WHERE receiver = 'dummy'
    AND id = 'test';

Here's the query plan:

    Index Scan using outbox_batches_on_receiver_and_created_at on public.outbox_batches  (cost=0.38..8.39 rows=1 width=5) (actual time=0.426..984.038 rows=1 loops=1)
    Output: id
    Index Cond: (outbox_batches.receiver = 'dummy'::text)
    Filter: (outbox_batches.id = 'test'::text)
    Buffers: shared hit=3948 read=60742 dirtied=60741 written=30209
    Settings: work_mem = '32MB'
    Query Identifier: -2232653838283363139
    Planning:
    Buffers: shared hit=18 read=3
    Planning Time: 1.599 ms
    Execution Time: 984.082 ms

This query is reading 60K buffers even though it only needs to read a single row. Notice in particular the
use of the index outbox_batches_on_receiver_and_created_at, even though outbox_batches_pkey would be
a much better choice. We know this because if we drop the first index:

    Index Only Scan using outbox_batches_pkey on public.outbox_batches  (cost=0.50..8.52 rows=1 width=5) (actual time=2.067..2.070 rows=1 loops=1)
    Output: id
    Index Cond: ((outbox_batches.receiver = 'dummy'::text) AND (outbox_batches.id = 'test'::text))
    Heap Fetches: 1
    Buffers: shared hit=1 read=4
    Settings: work_mem = '32MB'
    Query Identifier: -2232653838283363139
    Planning:
    Buffers: shared hit=5 dirtied=1
    Planning Time: 0.354 ms
    Execution Time: 2.115 ms

This is also the index that's used in the normal case when there are no dead tuples at all.

Interestingly, the cost of an index only scan on outbox_batches_pkey is 8.52, whereas the other is
8.39. Is this because it considers the number of index pages? I've tried adjusting the various cost
and memory settings, but they have no effect.

In this test, we created 5M dead tuples. However, for me it also reproduces with just 1,000 rows.
For such a small table, the performance degradation is minimal, but it increases as more and more
tuples are deleted.

In a production environment, we have rows being constantly deleted at a high rate, leaving a table
that often has very few live tuples, and often 500K+ dead tuples before autovacuum can kick in. Here
I am consistently seeing the wrong index used, leading to poor performance.

The autovacuum settings are aggressive, but for whatever reason it is not keeping up. We also have
long-running transactions that sometimes cause the xmin to hang back for a while, preventing
vacuums from helping.
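(One way to spot the transactions holding back the xmin horizon is to look at pg_stat_activity, e.g.:

```sql
-- Sessions whose snapshot prevents vacuum from removing dead tuples,
-- oldest transaction first:
SELECT pid, state, backend_xmin, xact_start, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;
```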

All of that said, I would rather Postgres choose the right index than spend a lot of time optimizing
vacuums.

Here's my full server config: https://gist.github.com/atombender/54207d473e415fab26fc59751a22feca.




Re: Use of inefficient index in the presence of dead tuples

From
Laurenz Albe
Date:
On Tue, 2024-05-28 at 10:00 +0200, Alexander Staubo wrote:
> I am encountering an odd problem where Postgres will use the wrong index, particularly if the table
> has some dead tuples. The database affected is running 12.6, but I can also reproduce with 16.3.
>
> To reproduce:
> [create a table with a larger index on "id" and "receiver" and a smaller on
>  "receiver" and "created_at", then delete all but one row and ANALYZE]
>
> (7) Try the following query:
>
>     EXPLAIN (ANALYZE, VERBOSE, BUFFERS, COSTS, TIMING, SETTINGS, SUMMARY)
>     SELECT id FROM outbox_batches
>     WHERE receiver = 'dummy'
>     AND id = 'test';
>
> Here's the query plan:
>
>     Index Scan using outbox_batches_on_receiver_and_created_at on public.outbox_batches  (cost=0.38..8.39 rows=1 width=5) (actual time=0.426..984.038 rows=1 loops=1)
>     Output: id
>     Index Cond: (outbox_batches.receiver = 'dummy'::text)
>     Filter: (outbox_batches.id = 'test'::text)
>     Buffers: shared hit=3948 read=60742 dirtied=60741 written=30209
>     Settings: work_mem = '32MB'
>     Query Identifier: -2232653838283363139
>     Planning:
>     Buffers: shared hit=18 read=3
>     Planning Time: 1.599 ms
>     Execution Time: 984.082 ms
>
> This query is reading 60K buffers even though it only needs to read a single row. Notice in particular the
> use of the index outbox_batches_on_receiver_and_created_at, even though outbox_batches_pkey would be
> a much better choice. We know this because if we drop the first index:
>
>     Index Only Scan using outbox_batches_pkey on public.outbox_batches  (cost=0.50..8.52 rows=1 width=5) (actual time=2.067..2.070 rows=1 loops=1)
>     Output: id
>     Index Cond: ((outbox_batches.receiver = 'dummy'::text) AND (outbox_batches.id = 'test'::text))
>     Heap Fetches: 1
>     Buffers: shared hit=1 read=4
>     Settings: work_mem = '32MB'
>     Query Identifier: -2232653838283363139
>     Planning:
>     Buffers: shared hit=5 dirtied=1
>     Planning Time: 0.354 ms
>     Execution Time: 2.115 ms
>
> This is also the index that's used in the normal case when there are no dead tuples at all.
>
> Interestingly, the cost of an index only scan on outbox_batches_pkey is 8.52, whereas the other is
> 8.39. Is this because it considers the number of index pages? I've tried adjusting the various cost
> and memory settings, but they have no effect.

ANALYZE considers only the live rows, so PostgreSQL knows that the query will
return only a few results.  So it chooses the smaller index rather than the one
that matches the WHERE condition perfectly.

Unfortunately, it has to wade through all the deleted rows, which is slow.

But try to execute the query a second time, and it will be much faster.
PostgreSQL marks the index entries as "dead" during the first execution, so the
second execution won't have to look at the heap any more.

See https://www.cybertec-postgresql.com/en/killed-index-tuples/
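For example, simply running the same query again should show the difference in the BUFFERS output:

```sql
-- Run this twice in a row: the second run should touch far fewer
-- buffers, because the first scan marked the dead index entries.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM outbox_batches
WHERE receiver = 'dummy' AND id = 'test';
```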

> In this test, we created 5M dead tuples. However, for me it also reproduces with just 1,000 rows.
> For such a small table, the performance degradation is minimal, but it increases as more and more
> tuples are deleted.
>
> In a production environment, we have rows being constantly deleted at a high rate, leaving a table
> that often has very few live tuples, and often 500K+ dead tuples before autovacuum can kick in. Here
> I am consistently seeing the wrong index used, leading to poor performance.
>
> The autovacuum settings are aggressive, but for whatever reason it is not keeping up. We also have
> long-running transactions that sometimes cause the xmin to hang back for a while, preventing
> vacuums from helping.
>
> All of that said, I would rather Postgres choose the right index than spend a lot of time optimizing
> vacuums.

I understand your pain, but your use case is somewhat unusual.

What I would consider in your place is
a) running an explicit VACUUM after you delete lots of rows or
b) using partitioning to get rid of old data
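
A sketch of both options (the partition name below is illustrative):

```sql
-- (a) Reclaim dead tuples explicitly right after a bulk delete:
VACUUM (ANALYZE) outbox_batches;

-- (b) With time-based partitions, old data is removed wholesale,
--     leaving no dead tuples behind:
DROP TABLE outbox_batches_2024_05;
```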

I don't know how the PostgreSQL optimizer could be improved to take dead rows into account.

Yours,
Laurenz Albe



Re: Use of inefficient index in the presence of dead tuples

From
Alexander Staubo
Date:
On 28 May 2024, at 13:02, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> ANALYZE considers only the live rows, so PostgreSQL knows that the query will
> return only few results.  So it chooses the smaller index rather than the one
> that matches the WHERE condition perfectly.
>
> Unfortunately, it has to wade through all the deleted rows, which is slow.

Sounds like the planner _should_ take the dead tuples into account. I’m surprised there are no parameters to tweak to
make the planner understand that one index is more selective even though it is technically larger.

> But try to execute the query a second time, and it will be much faster.
> PostgreSQL marks the index entries as "dead" during the first execution, so the
> second execution won't have to look at the heap any more.

Of course. It’s still not _free_; it’s still trawling through many megabytes of dead data, and going through the shared
buffer cache, and therefore competing with other queries that need shared buffers.

> I understand your pain, but your use case is somewhat unusual.

I don’t think rapidly updated tables are an unusual use of Postgres, nor is the problem of long-running transactions
preventing dead tuple vacuuming.

> What I would consider in your place is
> a) running an explicit VACUUM after you delete lots of rows or

The rows are deleted individually. It’s just that there are many transactions doing it concurrently.

> b) using partitioning to get rid of old data

Partitioning will generate dead tuples in the original partition when tuples are moved to the other partition, so I’m
not sure how that would help?

I did explore a solution which is my “plan B” — adding a “done” column, then using “UPDATE … SET done = true” rather
than deleting the rows. This causes dead tuples, of course, but then adding a new index with a “… WHERE NOT done” filter
fixes the problem by forcing the query to use the right index. However, with this solution, rows will still have to be
deleted *sometime*, so this just delays the problem. But it would allow a “batch cleanup”: “DELETE … WHERE done; VACUUM”
in one fell swoop.
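
A sketch of that “plan B” (the column and index names are illustrative):

```sql
ALTER TABLE outbox_batches ADD COLUMN done boolean NOT NULL DEFAULT false;

-- Partial index covering only live work, so queries skip the cruft:
CREATE INDEX outbox_batches_pending
    ON outbox_batches (receiver, id)
    WHERE NOT done;

-- Instead of DELETE, soft-delete:
UPDATE outbox_batches SET done = true
WHERE receiver = 'dummy' AND id = 'test';

-- Periodic batch cleanup in one fell swoop:
DELETE FROM outbox_batches WHERE done;
VACUUM outbox_batches;
```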




Re: Use of inefficient index in the presence of dead tuples

From
"David G. Johnston"
Date:

On Tue, May 28, 2024, 07:21 Alexander Staubo <alex@purefiction.net> wrote:


I did explore a solution which is my “plan B” — adding a “done” column, then using “UPDATE … SET done = true” rather than deleting the rows. This causes dead tuples, of course, but then adding a new index with a “… WHERE NOT done” filter fixes the problem by forcing the query to use the right index. However, with this solution, rows will still have to be deleted *sometime*, so this just delays the problem. But it would allow a “batch cleanup”: “DELETE … WHERE done; VACUUM” in one fell swoop.

If you incorporate partitions into this, the final removal of the soft-deleted rows becomes a truncate or a drop instead of a delete.
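
A sketch of how that could look with range partitioning (names and bounds are illustrative; note that a partitioned table's primary key would have to include the partition key):

```sql
CREATE TABLE outbox_batches (
    id         text NOT NULL,
    receiver   text NOT NULL,
    created_at timestamp without time zone NOT NULL DEFAULT now(),
    done       boolean NOT NULL DEFAULT false
) PARTITION BY RANGE (created_at);

CREATE TABLE outbox_batches_2024_05
    PARTITION OF outbox_batches
    FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');

-- Removing a fully soft-deleted partition leaves no dead tuples behind:
DROP TABLE outbox_batches_2024_05;
-- or: TRUNCATE outbox_batches_2024_05;
```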

David J.


Re: Use of inefficient index in the presence of dead tuples

From
Tom Lane
Date:
Alexander Staubo <alex@purefiction.net> writes:
> (2) Set up schema. It's important to create the index before insertion, in order to provoke a
> situation where the indexes have dead tuples:
> ...
> (4) Then ensure all tuples are dead except one:

>     DELETE FROM outbox_batches;
>     INSERT INTO outbox_batches (receiver, id) VALUES ('dummy', 'test');

> (5) Analyze:

>     ANALYZE outbox_batches;

So the problem here is that the ANALYZE didn't see any of the dead rows
and thus there is no way to know that they all match 'dummy'.  The cost
estimation is based on the conclusion that there is exactly one row
that will pass the index condition in each case, and thus the "right"
index doesn't look any cheaper than the "wrong" one --- in fact, it
looks a little worse because of the extra access to the visibility
map that will be incurred by an index-only scan.

I'm unpersuaded by the idea that ANALYZE should count dead tuples.
Since those are going to go away pretty soon, we would risk
estimating on the basis of no-longer-relevant stats and thus
creating problems worse than the one we solve.

What is interesting here is that had you done ANALYZE *before*
the delete-and-insert, you'd have been fine.  So it seems like
somewhat out-of-date stats would have benefited you.

It would be interesting to see a non-artificial example that took
into account when the last auto-vacuum and auto-analyze really
happened, so we could see if there's any less-fragile way of
dealing with this situation.

            regards, tom lane



Re: Use of inefficient index in the presence of dead tuples

From
David Rowley
Date:
On Wed, 29 May 2024 at 12:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It would be interesting to see a non-artificial example that took
> into account when the last auto-vacuum and auto-analyze really
> happened, so we could see if there's any less-fragile way of
> dealing with this situation.

I think we need to find a way to give some additional preference to
using indexes with more keys matching the quals.

David



Re: Use of inefficient index in the presence of dead tuples

From
Alexander Staubo
Date:
> On 29 May 2024, at 02:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alexander Staubo <alex@purefiction.net> writes:
>> (2) Set up schema. It's important to create the index before insertion, in order to provoke a
>> situation where the indexes have dead tuples:
>> ...
>> (4) Then ensure all tuples are dead except one:
>
>>    DELETE FROM outbox_batches;
>>    INSERT INTO outbox_batches (receiver, id) VALUES ('dummy', 'test');
>
>> (5) Analyze:
>
>>    ANALYZE outbox_batches;
>
> So the problem here is that the ANALYZE didn't see any of the dead rows
> and thus there is no way to know that they all match 'dummy'.  The cost
> estimation is based on the conclusion that there is exactly one row
> that will pass the index condition in each case, and thus the "right"
> index doesn't look any cheaper than the "wrong" one --- in fact, it
> looks a little worse because of the extra access to the visibility
> map that will be incurred by an index-only scan.
>
> I'm unpersuaded by the idea that ANALYZE should count dead tuples.
> Since those are going to go away pretty soon, we would risk
> estimating on the basis of no-longer-relevant stats and thus
> creating problems worse than the one we solve.

Mind you, “pretty soon” could actually be “hours” if a pg_dump is running, or some other long-running transaction is
holding back the xmin. Granted, long-running transactions should be avoided, but they happen, and the result is
operationally surprising.

I have another use case where I used a transaction to lock a resource to prevent concurrent access. I.e. the logic
did “SELECT … FROM … WHERE id = $1 FOR UPDATE” and held that transaction open for hours while doing maintenance. This
ended up causing the exact same index issue with dead tuples, with some queries taking 30 minutes where they previously
took just a few milliseconds. In retrospect, this process should have used advisory locks to avoid holding back vacuums.
But the point stands that a small amount of dead tuple cruft can massively skew performance in surprising ways.
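
For reference, the advisory-lock alternative might look like this (the key value is arbitrary):

```sql
-- Session-level advisory lock instead of a long-lived row lock;
-- no open transaction is required, so no snapshot holds back the xmin.
SELECT pg_advisory_lock(42);
-- ... perform maintenance ...
SELECT pg_advisory_unlock(42);
```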

> What is interesting here is that had you done ANALYZE *before*
> the delete-and-insert, you'd have been fine.  So it seems like
> somewhat out-of-date stats would have benefited you.
>
> It would be interesting to see a non-artificial example that took
> into account when the last auto-vacuum and auto-analyze really
> happened, so we could see if there's any less-fragile way of
> dealing with this situation.

Just to clarify, this is a real use case, though the repro is of course artificial, since the real production case is
inserting and deleting rows very quickly.

According to collected metrics, the average time since the last autoanalyze is around 20 seconds for this table, same
for autovacuum. The times I have observed poor performance are in situations where the autovacuum was not able to
reclaim non-removable rows, i.e. it’s not the absence of autovacuum, but rather the inability to clear up dead tuples.




Re: Use of inefficient index in the presence of dead tuples

From
Laurenz Albe
Date:
On Wed, 2024-05-29 at 14:36 +0200, Alexander Staubo wrote:
> > On 29 May 2024, at 02:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm unpersuaded by the idea that ANALYZE should count dead tuples.
> > Since those are going to go away pretty soon, we would risk
> > estimating on the basis of no-longer-relevant stats and thus
> > creating problems worse than the one we solve.
>
> Mind you, “pretty soon” could actually be “hours" if a pg_dump is running,
> or some other long-running transaction is holding back the xmin. Granted,
> long-running transactions should be avoided, but they happen, and the
> result is operationally surprising.

Don't do these things on a busy transactional database.

> I have another use case where I used a transaction to do lock a resource
> to prevent concurrent access. I.e. the logic did
> “SELECT … FROM … WHERE id = $1 FOR UPDATE” and held that transaction open
> for hours while doing maintenance.

That's a dreadful idea.

>
> Just to clarify, this is a real use case, though the repro is of course
> artificial since the real production case is inserting and deleting rows
> very quickly.

No doubt.
Still I think that your main trouble are long-running transactions.
They will always give you trouble on a busy PostgreSQL database.
You should avoid them.

Yours,
Laurenz Albe