Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Msg-id: 87697D0B-3A41-4A5B-B9CF-85D989361E27@anarazel.de
In response to: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
Hi,

(On phone, so crappy formatting and no source access)

On February 19, 2022 3:08:41 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
>On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Another testing strategy occurs to me: we could stress-test the
>> implementation by simulating an environment where the no-cleanup-lock
>> path is hit an unusually large number of times, possibly a fixed
>> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
>> ConditionalLockBufferForCleanup() call return false randomly. Now that
>> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
>> similar to the regular lazy_scan_prune path as possible), I wouldn't
>> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
>> disruptive.
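
Presumably the gizmo amounts to something like the wrapper below (a hypothetical sketch written from memory, not the attached patch; the helper name and signature are made up here):

    #include "storage/bufmgr.h"    /* ConditionalLockBufferForCleanup() */

    /*
     * Hypothetical wrapper: pretend the cleanup lock was denied a fixed
     * percentage of the time, so that the lazy_scan_noprune() path gets
     * exercised far more often than it would be naturally.
     */
    static bool
    ConditionalCleanupLockWithFuzz(Buffer buf, int fail_percent)
    {
        if (random() % 100 < fail_percent)
            return false;          /* simulated cleanup-lock failure */

        return ConditionalLockBufferForCleanup(buf);
    }
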
>
>I tried this out, using the attached patch. It was quite interesting,
>even when run against HEAD. I think that I might have found a bug on
>HEAD, though I'm not really sure.
>
>If you modify the patch to simulate conditions under which
>ConditionalLockBufferForCleanup() fails about 2% of the time, you get
>much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
>without it being so aggressive as to make "make check-world" fail --
>which is exactly what I expected. If you are much more aggressive
>about it, and make it 50% instead (which you can get just by using the
>patch as written), then some tests will fail, mostly for reasons that
>aren't surprising or interesting (e.g. plan changes). This is also
>what I'd have guessed would happen.
>
>However, it gets more interesting. One thing that I did not expect to
>happen at all also happened (with the current 50% rate of simulated
>ConditionalLockBufferForCleanup() failure from the patch): if I run
>"make check" from the pg_surgery directory, then the Postgres backend
>gets stuck in an infinite loop inside lazy_scan_prune, which has been
>a symptom of several tricky bugs in the past year (not every time, but
>usually). Specifically, the VACUUM statement launched by the SQL
>command "vacuum freeze htab2;" from the file
>contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
>misbehavior.


>This is a temp table, which is a choice made by the tests specifically
>because they need to "use a temp table so that vacuum behavior doesn't
>depend on global xmin". This is a convenient way of avoiding spurious
>regression test failures (e.g. from autoanalyze), and relies on the
>GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
>94bc27b5.

We don't have a blocking path for cleanup locks of temporary buffers IIRC (normally not reachable). So I wouldn't be
surprised if a failing cleanup lock caused some odd behavior.
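
From memory (again, no source at hand), the local-buffer path at the top of ConditionalLockBufferForCleanup() is roughly the following, so the only way it can fail for a temp buffer is an extra local pin, which shouldn't happen during VACUUM:

    /* Roughly the start of ConditionalLockBufferForCleanup(), from memory */
    if (BufferIsLocal(buffer))
    {
        refcount = LocalRefCount[-buffer - 1];
        /* There should be exactly one pin */
        Assert(refcount > 0);
        if (refcount != 1)
            return false;
        /* Nobody else to wait for */
        return true;
    }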

>It's quite possible that this is nothing more than a bug in my
>adversarial gizmo patch -- since I don't think that
>ConditionalLockBufferForCleanup() can ever fail with a temp buffer
>(though even that's not completely clear right now). Even if the
>behavior that I saw does not indicate a bug on HEAD, it still seems
>informative. At the very least, it wouldn't hurt to Assert() that the
>target table isn't a temp table inside lazy_scan_noprune, documenting
>our assumptions around temp tables and
>ConditionalLockBufferForCleanup().

Definitely worth looking into more.
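
If we go the Assert route, something like the following at the top of lazy_scan_noprune() would presumably do (untested, and the field/macro names are from memory):

    /*
     * Hypothetical assertion: cleanup locks on a temp table's buffers are
     * never contended, so the no-cleanup-lock path shouldn't be reachable
     * for temp tables.
     */
    Assert(!RelationUsesLocalBuffers(vacrel->rel));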


This reminds me of a recent thing I noticed in the AIO patch: SP-GiST can end up busy-looping when buffers are locked,
instead of blocking. Not actually related, of course.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


