Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations - Mailing list pgsql-hackers
From: Peter Geoghegan <pg@bowt.ie>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date:
Msg-id: CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com
In response to: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Another testing strategy occurs to me: we could stress-test the
> implementation by simulating an environment where the no-cleanup-lock
> path is hit an unusually large number of times, possibly a fixed
> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
> ConditionalLockBufferForCleanup() call return false randomly. Now that
> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
> similar to the regular lazy_scan_prune path as possible), I wouldn't
> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
> disruptive.

I tried this out, using the attached patch. It was quite interesting, even when run against HEAD. I think that I might have found a bug on HEAD, though I'm not really sure.

If you modify the patch to simulate conditions under which ConditionalLockBufferForCleanup() fails about 2% of the time, you get much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze, without it being so aggressive as to make "make check-world" fail -- which is exactly what I expected. If you are much more aggressive about it, and make it 50% instead (which you get just by using the patch as written), then some tests will fail, mostly for reasons that aren't surprising or interesting (e.g. plan changes). This is also what I'd have guessed would happen.

However, it gets more interesting. One thing that I did not expect to happen at all also happened (with the current 50% rate of simulated ConditionalLockBufferForCleanup() failure from the patch): if I run "make check" from the pg_surgery directory, then the Postgres backend gets stuck in an infinite loop inside lazy_scan_prune, which has been a symptom of several tricky bugs in the past year (not every time, but usually). Specifically, the VACUUM launched by the SQL command "vacuum freeze htab2;" at line 54 of contrib/pg_surgery/sql/heap_surgery.sql leads to this misbehavior.

htab2 is a temp table, a choice made by the tests specifically because they need to "use a temp table so that vacuum behavior doesn't depend on global xmin". This is a convenient way of avoiding spurious regression test failures (e.g. from autoanalyze), and it relies on the GlobalVisTempRels behavior established by Andres' 2020 bugfix commit 94bc27b5.

It's quite possible that this is nothing more than a bug in my adversarial gizmo patch, since I don't think that ConditionalLockBufferForCleanup() can ever fail with a temp buffer (though even that's not completely clear right now). Even if the behavior that I saw does not indicate a bug on HEAD, it still seems informative. At the very least, it wouldn't hurt to Assert() that the target table isn't a temp table inside lazy_scan_noprune, documenting our assumptions around temp tables and ConditionalLockBufferForCleanup().

I haven't actually tried to debug the issue just yet, so take all this with a grain of salt.
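To give a concrete idea of the shape of the gizmo (this is not the attached patch itself, just a minimal sketch of the approach; the wrapper name, the use of random() as the randomness source, and the FAIL_PERCENT constant are all illustrative): the point is simply to make the cleanup lock attempt fail at random some fixed percentage of the time, so that lazy_scan_noprune runs far more often than it otherwise would.

    /*
     * Hypothetical testing-only helper for vacuumlazy.c (not the attached
     * patch).  Relies on vacuumlazy.c's existing includes ("postgres.h",
     * "storage/bufmgr.h").
     *
     * FAIL_PERCENT of 2 is the gentle setting discussed above; 50 is the
     * aggressive one.
     */
    #define FAIL_PERCENT	2

    static bool
    test_conditional_cleanup_lock(Buffer buf)
    {
        /* Simulate a ConditionalLockBufferForCleanup() failure at random */
        if (random() % 100 < FAIL_PERCENT)
            return false;

        /* Otherwise behave exactly as before */
        return ConditionalLockBufferForCleanup(buf);
    }

lazy_scan_heap would call the wrapper where it currently calls ConditionalLockBufferForCleanup() directly. As for the Assert, something like Assert(!RelationUsesLocalBuffers(vacrel->rel)) at the top of lazy_scan_noprune would probably be enough to document the temp table assumption.

--
Peter Geoghegan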