On Sun, Mar 27, 2022 at 01:18:46PM -0400, Tom Lane wrote:
> skink has passed several runs since the commit went in, so it's
> "unstable" not "fails consistently". I see the test tries to
> disable autovacuum on that table, so that doesn't seem to be
> the problem ... what is?

This is a race condition, not directly related to valgrind but easier to
trigger under it because things get slower.  It takes me a dozen
tries to reproduce the failure locally, but I can do it with
valgrind enabled.

So, the output of the test is simply telling us that the FSM of the
main table is not getting truncated. From what I can see, the
difference is in should_attempt_truncation(), where we finish with
nonempty_pages set to 1 rather than 0 on failure.  And it just takes
one autovacuum running in parallel with the manual VACUUM after the
DELETE to prevent the removal of those tuples, which is what I can see
from the logs on failure:
LOG: statement: DELETE FROM freespace_tab;
DEBUG: autovacuum: processing database "contrib_regression"
LOG: statement: VACUUM freespace_tab;

It seems to me that the snapshot held by autovacuum during the
scan of pg_database to find the relations to process is enough to
prevent the FSM truncation, as the tuples removed by the DELETE
query still need to be visible.  One simple way to keep this test
would be to run it with a custom configuration file that disables
autovacuum, together with NO_INSTALLCHECK.  Any better ideas?
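
To make that concrete, here is a rough, untested sketch of what I have
in mind.  The module directory (contrib/pg_freespacemap) and the
configuration file name are assumptions for illustration only; the
Makefile lines follow the --temp-config pattern that other contrib
modules already use:

```
# contrib/pg_freespacemap/pg_freespacemap.conf (hypothetical file name)
# Keep autovacuum out of the way so it cannot hold a snapshot during the test.
autovacuum = off

# contrib/pg_freespacemap/Makefile (additions, assumed location)
# Start the temporary test instance with the settings above.
REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf
# The test cannot force these settings on an already-running server,
# so skip it under installcheck.
NO_INSTALLCHECK = 1
```

The downside is that the test would no longer run under installcheck,
which is the price for not depending on the server's configuration.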
--
Michael