So I've been looking at issues we used to have in production some time
ago which eventually lead us to migrating away from partial indexes in
some cases. In the end, I'm surprised how easy this (or at least a
similar case) was to reproduce. The attached program does some
UPDATEs where around every third update deletes the row from the
partial index since it doesn't match indpred anymore. In that case
the row is immediately UPDATEd back to match the index WHERE clause
again. This roughly emulates what some of our processes do in
production.
Today, running the program for a few minutes (until the built-in
262144 iteration limit), I usually end up with a partial index through
which producing the only row takes milliseconds on a cold cache, and
over a millisecond on a hot one. Finding the row through the primary
key is still fast, because the bloat there gets cleaned up. As far as
I can tell, after the index has gotten into this state, there's no way
to clean it up except VACUUMing the entire table or a REINDEX. Both
solutions are pretty bad.
My working theory was that this has to do with the fact that
HeapTupleSatisfiesMVCC doesn't set the HEAP_XMAX_COMMITTED bit here,
but I'm not so sure anymore. Has anyone seen something like this? If
that really is what's happening here, then I can see why we wouldn't
want to slow down SELECTs with expensive visibility checks. But that
really leaves me wishing for something like VACUUM INDEX partial_idx.
Otherwise your elephant just keeping getting slower and slower until
you get called at 2 AM to play REINDEX.
(I've tested this on 9.6, v11 and v13. 13 seems to be a bit better
here, but not "fixed", I think.)
.m