Re: pgsql: Test replay of regression tests, attempt II. - Mailing list pgsql-committers

From Thomas Munro
Subject Re: pgsql: Test replay of regression tests, attempt II.
Msg-id CA+hUKG+nHX+NNjm-ig0zWLxeMiivH8omey5Onfhnxzh6g524Cg@mail.gmail.com
In response to Re: pgsql: Test replay of regression tests, attempt II.  (Andres Freund <andres@anarazel.de>)
Responses Re: pgsql: Test replay of regression tests, attempt II.  (Andres Freund <andres@anarazel.de>)
List pgsql-committers
On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > That's an extremely small shared_buffers for running the regression tests, it'd not
> > > be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
> > > ends up skipping over a page because of page contention?
> >
> > Hmm, good thought.  I tried running the test with even smaller
> > shared_buffers, but could not make the reloptions test fall over for
> > me.  But this theory implies a strong timing dependency, so it might
> > still only happen on particular machines.  (If anyone else tries it:
> > below about 400kB, other tests start failing with "no free unpinned
> > buffers" and the like.)
>
> I ran the test in a loop for 200+ times now, without reproducing the
> problem. Rorqual runs on a shared machine though, so it's quite possible that
> IO will be slower there, thus triggering the issue.
>
> I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
> that'd show information about the number of pages with tuples etc. But I don't
> currently see a way of that causing the regression tests to fail.
>
> Even if I set client_min_messages=error, the messages still get sent to the
> client, because elevel == INFO is special cased in
> should_output_to_client(). And I don't see a way of redirecting the output of
> common.c:NoticeProcessor() in psql either.

I hacked a branch thusly:

@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        verbose = (params->options & VACOPT_VERBOSE) != 0;
        instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
                                  params->log_min_duration >= 0));
+       instrument = true;
        if (instrument)
        {
                pg_rusage_init(&ru0);

Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]
there, with the smoking gun:

pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen

There are three places that skip and bump that counter, but two of
them were disabled when I added DISABLE_PAGE_SKIPPING, leaving this
one:

            LockBuffer(buf, BUFFER_LOCK_SHARE);
            if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
            {
                UnlockReleaseBuffer(buf);
                vacrel->scanned_pages++;
                vacrel->pinskipped_pages++;
                if (hastup)
                    vacrel->nonempty_pages = blkno + 1;
                continue;
            }

Since this page doesn't require wraparound vacuuming, if we fail to
conditionally acquire the cleanup lock, this block skips the page.
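
To make the control flow concrete, here is a toy model of that decision. It is only a sketch, not the vacuumlazy.c code: the enum, both function names and the three boolean parameters are made up for illustration. It also assumes that the two other skip paths apply only to non-aggressive vacuums (folded into a single branch here), and that DISABLE_PAGE_SKIPPING is what makes the vacuum aggressive.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not PostgreSQL identifiers. */
typedef enum
{
    PAGE_PROCESSED,         /* cleanup lock obtained, page is vacuumed */
    PAGE_SKIPPED_PINNED     /* reported as "skipped due to pins" */
} PageOutcome;

/*
 * Toy model of what happens to one heap page.
 *
 * got_cleanup_lock: the conditional cleanup-lock attempt succeeded
 * aggressive:       assumed to be forced on by DISABLE_PAGE_SKIPPING
 * needs_freeze:     stands in for the lazy_check_needs_freeze() result
 */
PageOutcome
model_page_outcome(bool got_cleanup_lock, bool aggressive, bool needs_freeze)
{
    if (got_cleanup_lock)
        return PAGE_PROCESSED;

    /* Assumption: the two other skip paths only fire when not aggressive. */
    if (!aggressive)
        return PAGE_SKIPPED_PINNED;

    /* The surviving path: nothing to freeze, so the pinned page is skipped. */
    if (!needs_freeze)
        return PAGE_SKIPPED_PINNED;

    /* Otherwise the vacuum would wait for the cleanup lock. */
    return PAGE_PROCESSED;
}

/* Human-readable label for an outcome, for printing in a test driver. */
const char *
outcome_name(PageOutcome outcome)
{
    return outcome == PAGE_PROCESSED ? "processed" : "skipped due to pins";
}
```

In this model, model_page_outcome(false, true, false) is the case seen in the CI log: the vacuum is aggressive, the page is pinned by someone else and nothing on it needs freezing, so the page is still counted as skipped.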

[1] https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log
