Thread: Frequetly updated partial index leads to bloat on index for Postresql 11
Hi, We have a partial index on a column of the form: CREATE TABLE table_p2021q1 ( pk_id BIGINT, col1 BIGINT NOT NULL, status character varying(255) NOT NULL, ...other columns PRIMARY KEY (pk_id) ); CREATE INDEX table_p2021q1_ix04 ON table_p2021q1 (col1) WHERE status = 'IN_PROGRESS'; (the table is also a partitioned table on a date field) Every one of our entries in the table will go through a state where status = 'IN_PROGRESS' but there are only a few hundred of them at any one time. The index grows quite quickly to about 300MB. I have run some bloat queries and the size should always be 8KB and bloat is always high (99.99....% bloat). We have found that the only way to reduce the bloat is through reindexing, but after a couple of days the size goes back to 300MB or so. Running a vacuum does not reduce the bloat. Is this a known issue, are they any ways around it, and if it is an issue is there a plan to fix it if a fix is possible? Thanks.
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
From
Peter Geoghegan
Date:
On Thu, Mar 18, 2021 at 6:51 AM Tom Dearman <tom.dearman@gmail.com> wrote: > Is this a known issue, are they any ways around it, and if it is an > issue is there a plan to fix it if a fix is possible? It's not exactly a known issue per se, but I think the problem here is related to the fact that you have lots of duplicates, which did perform rather badly prior to Postgres 12. I bet that you'd benefit from upgrading to Postgres 12, or especially to Postgres 13. The B-Tree space management is a lot better now. (Actually, it'll be better again in Postgres 14.) -- Peter Geoghegan
Hi,
We upgraded to 13 a couple of months ago on production but are still having an issue with bloated partial indexes which have an impact on our partial queries especially towards the end of a quarter when our quarterly-partitioned tables are getting big. I have built 14 (on macOS catalina, 14 beta 2) and run a test but the index grows fairly large (even though vacuums are running as the table is still relatively small - I put in 2 million inserts, each having one update of the column that makes up the partial index). The table is:
Table "public.buyer"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
---------------+------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
buyer_id | integer | | not null | | plain | | |
first_name | character varying(35) | | not null | | extended | | |
last_name | character varying(35) | | not null | | extended | | |
email_address | character varying(50) | | | | extended | | |
status | character varying(256) | | not null | | extended | | |
Indexes:
"buyer_pkey" PRIMARY KEY, btree (buyer_id)
"idex_buyer_inactive" btree (first_name) WHERE status::text = 'IN_PROGRESS'::text
Access method: heap
I run a loop to insert, commit, update, commit one row at a time as this is an emulation of what a similar table would experience in production. The index never has many rows with status=‘IN_PROGRESS’ as each row is set to CANCEL in the update. If the index is reindexed it takes 1 page as expected but without the reindexing it keeps growing, currently reaching 3MB - this is with 2 million inserts and updates but our production will have about 300 million inserts and > 300 million updates on the partial index in the quarter. Should we have seen more of an improvement in 14? Is it valid to look at the size of the index (\di+) as a measure of whether this latest change to bottom up index deleting has helped?
Thanks,
Tom
On 18 Mar 2021, at 16:30, Peter Geoghegan <pg@bowt.ie> wrote:On Thu, Mar 18, 2021 at 6:51 AM Tom Dearman <tom.dearman@gmail.com> wrote:Is this a known issue, are they any ways around it, and if it is an
issue is there a plan to fix it if a fix is possible?
It's not exactly a known issue per se, but I think the problem here is
related to the fact that you have lots of duplicates, which did
perform rather badly prior to Postgres 12. I bet that you'd benefit
from upgrading to Postgres 12, or especially to Postgres 13. The
B-Tree space management is a lot better now. (Actually, it'll be
better again in Postgres 14.)
--
Peter Geoghegan
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
From
Michael Lewis
Date:
Have you tried setting autovacuum to run quite aggressively, perhaps just on this table? Have you tried an index on the status column only, rather than partial?
We have change autovacuum so that it runs more frequently autovacuum_vacuum_scale_factor=0.035, the reason we have a partial index on the status is that in a table of 300 million entries, only about 100 or so would have status=‘IN_PROGRESS’ so we think this should be a nice small index and many of our queries want to look up with a where clause status=‘IN_PROGRESS’. In theory it works well, but we get a lot of index bloat as there is a lot of churn on the status value, ie each row starts as IN_PROGRESS and then goes to one of 4 possible completed statuses.
On 16 Jul 2021, at 15:49, Michael Lewis <mlewis@entrata.com> wrote:Have you tried setting autovacuum to run quite aggressively, perhaps just on this table? Have you tried an index on the status column only, rather than partial?
Tom Dearman <tom.dearman@gmail.com> writes: > We have change autovacuum so that it runs more frequently autovacuum_vacuum_scale_factor=0.035, the reason we have a partialindex on the status is that in a table of 300 million entries, only about 100 or so would have status=‘IN_PROGRESS’so we think this should be a nice small index and many of our queries want to look up with a where clausestatus=‘IN_PROGRESS’. In theory it works well, but we get a lot of index bloat as there is a lot of churn on the statusvalue, ie each row starts as IN_PROGRESS and then goes to one of 4 possible completed statuses. Is it really the case that only this index is bloating? In principle, an update on a row of the table should result in new entries in every index of the table. A partial index, due to the filter applied to possibly not store any index entry, should in theory have less bloat than other indexes. If that's not what you're seeing, there must be something about the data being stored in that index (not the partial-index filter condition) that results in a lot of low-occupancy index pages over time. You didn't say anything about what the data payload is. But we've seen bloat problems in indexes where, say, every tenth or hundredth value in the index ordering would persist for a long time while the ones in between get deleted quickly. That leads to low-density indexes that VACUUM can't do anything about. regards, tom lane
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
From
Francisco Olarte
Date:
Tom-Tom: On Fri, Jul 16, 2021 at 5:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tom Dearman <tom.dearman@gmail.com> writes: > > We have change autovacuum so that it runs more frequently autovacuum_vacuum_scale_factor=0.035, the reason we have apartial index on the status is that in a table of 300 million entries, only about 100 or so would have status=‘IN_PROGRESS’so we think this should be a nice small index and many of our queries want to look up with a where clausestatus=‘IN_PROGRESS’. In theory it works well, but we get a lot of index bloat as there is a lot of churn on the statusvalue, ie each row starts as IN_PROGRESS and then goes to one of 4 possible completed statuses. > Is it really the case that only this index is bloating? In principle, an > update on a row of the table should result in new entries in every index > of the table. A partial index, due to the filter applied to possibly not > store any index entry, should in theory have less bloat than other > indexes. May be not in relative terms. If I understand correctly, lets say you start unbloated with 300M entries with 100 in progress. You insert, in small batches, 10K rows "in_progress" and update 10k "in_progress" row to, let's say, "done" ( may be including the 100 original ones ) ( description seems to fit this kind of flow, something like a task queue ). Then you will have 10k dead tuples bloating a 100 live entries index, for an absolutely small but relatively large bloat, while a full pk index will have 10k for 300m. I may be misunderstanding some thing and HOT and similar things may help here, but it seems like a plausible explanation for an apparent bloat ( on a small index, I had similar things but for task-like things I use a pending and a done table, and the pending table bloated a lot on the before-autovacuum times, small table nearly all dead tuples, not a problem vaccuming it via cron every some minutes, as in this case only the index is bloated autovacuum may not fire frequently enough for it, as the table would have only 10k/30M~333 ppm bloat ) Francisco Olarte
Other indexes do bloat, but the percentage bloat is a lot less, presumably because this is a partial index where the partialcolumn has a high degree of changes ie maybe 100 genuinely ‘live’ rows in a table of 300 million where every row hasgone through a state where it would have been in the index. In some of our partitions we might have 2000 old rows thatdo hang around for a long time and another 100 or so ‘real’ partial index entries so 2200 in total but the number ofrows would be 300 million so it is a lot less than 1%. > On 16 Jul 2021, at 16:43, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Tom Dearman <tom.dearman@gmail.com> writes: >> We have change autovacuum so that it runs more frequently autovacuum_vacuum_scale_factor=0.035, the reason we have a partialindex on the status is that in a table of 300 million entries, only about 100 or so would have status=‘IN_PROGRESS’so we think this should be a nice small index and many of our queries want to look up with a where clausestatus=‘IN_PROGRESS’. In theory it works well, but we get a lot of index bloat as there is a lot of churn on the statusvalue, ie each row starts as IN_PROGRESS and then goes to one of 4 possible completed statuses. > > Is it really the case that only this index is bloating? In principle, an > update on a row of the table should result in new entries in every index > of the table. A partial index, due to the filter applied to possibly not > store any index entry, should in theory have less bloat than other > indexes. > > If that's not what you're seeing, there must be something about the data > being stored in that index (not the partial-index filter condition) that > results in a lot of low-occupancy index pages over time. You didn't say > anything about what the data payload is. But we've seen bloat problems in > indexes where, say, every tenth or hundredth value in the index ordering > would persist for a long time while the ones in between get deleted > quickly. That leads to low-density indexes that VACUUM can't do anything > about. > > regards, tom lane
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
From
Peter Geoghegan
Date:
On Fri, Jul 16, 2021 at 9:19 AM Tom Dearman <tom.dearman@gmail.com> wrote: > Other indexes do bloat, but the percentage bloat is a lot less I have to imagine that the remaining problems have a lot to do with the fact that this is a partial index -- the partial index naturally gets vacuumed much less frequently than what would generally be considered ideal for the index itself. In general VACUUM scheduling makes the naive assumption that the indexes have the same needs as the table, which is far from the case with this partial index, for your workload. It's all of the specifics, taken together. It sounds like this is a case where bottom-up index deletion won't help -- it will only trigger in those indexes that are not "logically modified" by updates. But you're logically modifying these values. Or you're causing them to not need to be in the index anymore, by modifying the predicate. But that won't trigger bottom-up deletion. It's a bit like a delete, as far as the physical index structure is concerned -- the index won't be eagerly modified by the executor. The overall picture is that you cycle through all of the values in the table, and no cleanup can take place other than plain VACUUM (at least not to any significant extent). Although only a few hundred values are logically required to be indexed by the partial index at any one time, in practice no cleanup can run for long stretches of time (autovacuum just doesn't know about cases like this). This is why the partial index inevitably exceeds its theoretical pristine/high watermark size, which is actually more than 1 page/8KB, but still probably a lot less than what you actually see -- the partial index "falls through the cracks", even with recent enhancements that made cleanup more eager and more dynamic in certain other cases. I am afraid that I don't have a good suggestion right now. I can think of incremental improvements that would address this case, but for now they're just ideas. Fundamentally, we need to get to the partial index much more frequently than the other indexes, either within VACUUM or within some other mechanism. For example a conservative implementation of retail index tuple deletion might totally fix the issue. It does very much look like a problem in these partial indexes in particular -- it's quite possible that the other indexes won't grow at all due to garbage index tuples, especially on Postgres 14. -- Peter Geoghegan
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
From
Peter Geoghegan
Date:
> On Thu, Mar 18, 2021 at 6:51 AM Tom Dearman <tom.dearman@gmail.com> wrote: > Is this a known issue, are they any ways around it, and if it is an > issue is there a plan to fix it if a fix is possible? On second thought I do think that the improvements to 14 will fix this for you. See the test case here: https://www.postgresql.org/message-id/flat/CAL9smLC%3DSxYiN7yZ4HDyk0RnZyXoP2vaHD-Vg1JskOEHyhMXug%40mail.gmail.com#e79eca5922789de828314e296fdcb82d -- Peter Geoghegan