Re: Intermittent buildfarm failures on wrasse - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Intermittent buildfarm failures on wrasse
Date
Msg-id 1648548.1650039400@sss.pgh.pa.us
In response to Re: Intermittent buildfarm failures on wrasse  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> Off for a bit, but I realized that we likely don't exclude the launcher because it's not database associated...

Yeah.  I think this bit in ComputeXidHorizons needs rethinking:

        /*
         * Normally queries in other databases are ignored for anything but
         * the shared horizon. ...
         */
        if (in_recovery ||
            MyDatabaseId == InvalidOid || proc->databaseId == MyDatabaseId ||
            proc->databaseId == 0)    /* always include WalSender */
        {

The "proc->databaseId == 0" business apparently means to include only
walsender processes, and it's broken because that condition doesn't
include only walsender processes: anything not associated with a
database, such as the autovacuum launcher, passes it too.
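
To make the problem concrete, here is a minimal standalone sketch of the
quoted test next to one possible tightening.  This is a toy model, not
the actual procarray.c code: ModelProc, its is_walsender flag, and both
function names are made up for illustration, and how a walsender would
really be identified is left open.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, simplified stand-in for a PGPROC entry. */
typedef struct ModelProc
{
    Oid         databaseId;     /* 0 if not associated with a database */
    bool        is_walsender;   /* assumed flag, not a real PGPROC field */
} ModelProc;

/* The test as quoted: any process with databaseId == 0 qualifies. */
static bool
affects_data_horizon_quoted(const ModelProc *proc, Oid my_db, bool in_recovery)
{
    return in_recovery ||
        my_db == InvalidOid ||
        proc->databaseId == my_db ||
        proc->databaseId == 0;
}

/* A possible tightening: databaseId == 0 counts only for walsenders. */
static bool
affects_data_horizon_strict(const ModelProc *proc, Oid my_db, bool in_recovery)
{
    return in_recovery ||
        my_db == InvalidOid ||
        proc->databaseId == my_db ||
        (proc->databaseId == 0 && proc->is_walsender);
}

/* Walk through the cases that matter, from a backend in database 5. */
static void
check_examples(void)
{
    ModelProc   launcher = {0, false};  /* e.g. autovacuum launcher */
    ModelProc   walsender = {0, true};
    ModelProc   other_db = {7, false};  /* backend in another database */

    /* The launcher slips through the quoted test but not the strict one. */
    assert(affects_data_horizon_quoted(&launcher, 5, false));
    assert(!affects_data_horizon_strict(&launcher, 5, false));

    /* Walsenders are included either way. */
    assert(affects_data_horizon_quoted(&walsender, 5, false));
    assert(affects_data_horizon_strict(&walsender, 5, false));

    /* Backends in other databases are ignored either way. */
    assert(!affects_data_horizon_quoted(&other_db, 5, false));
    assert(!affects_data_horizon_strict(&other_db, 5, false));
}
```

In this model the two tests differ only for a non-walsender process with
databaseId == 0, which is exactly the launcher case under discussion.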

At this point we have the following conclusions:

1. A slow transaction in the launcher's initial get_database_list()
call fully explains these failures.  (I had been thinking that the
launcher's xact would have to persist as far as the create_index
script, but that's not so: it only has to last until test_setup
begins vacuuming tenk1.  The CREATE INDEX steps are not doing any
visibility map changes of their own, but what they are doing is
updating relallvisible from the results of visibilitymap_count().
That's why they undid the effects of manually poking relallvisible,
without actually inserting any data better than what the initial
VACUUM computed.)

2. We can probably explain why only wrasse sees this as some quirk
of the Solaris scheduler.  I'm satisfied to blame it-happens-in-
installcheck-but-not-check on that too.

3. It remains unclear why we suddenly started seeing this last week.
I suppose it has to be a side-effect of the pgstats changes, but
the mechanism is obscure.  Probably not worth the effort to pin
down exactly why.

As for fixing it, what I think would be the preferable answer is to
fix the above-quoted logic so that it indeed includes only walsenders
and not random other background workers.  (Why does it need to include
walsenders, anyway?  The commentary sucks.)  Alternatively, or perhaps
also, we could do what was discussed previously and make a hack to
allow delaying vacuum until the system is quiescent.

            regards, tom lane


