Re: Parallel Append subplan order instability on aye-aye - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Parallel Append subplan order instability on aye-aye
Date
Msg-id 20190717231232.ssx6fxay756exn6a@alap3.anarazel.de
Whole thread Raw
In response to Re: Parallel Append subplan order instability on aye-aye  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Parallel Append subplan order instability on aye-aye  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 2019-07-17 11:53:48 -0400, Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > Surely it can't be that since that just sets what *pages gets set to.
> > Tom mentioned that following was returning 0 pages and tuples:
> 
> > -- Temporary hack to investigate whether extra vacuum/analyze is happening
> > select relname, relpages, reltuples
> > from pg_class
> > where relname like '__star' order by relname;
> >  relname | relpages | reltuples
> > ---------+----------+-----------
> >  a_star  |        1 |         3
> 
> I poked around a little and came up with a much simpler theory:
> VACUUM will not change relpages/reltuples if it does not scan any pages
> (cf. special case for tupcount_pages == 0 in heap_vacuum_rel, at line 343
> in HEAD's vacuumlazy.c).  And, because sanity_check.sql's VACUUM is a
> plain unaggressive vacuum, all that it takes to make it skip over a_star's
> one page is for somebody else to have a pin on that page.

I wonder if we could set log_min_messages to DEBUG2 on occasionally
failing machines to test that theory. That ought to hit

    appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
                                    "Skipped %u pages due to buffer pins, ",
                                    vacrelstats->pinskipped_pages),
        ...
    ereport(elevel,
            (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
                    RelationGetRelationName(onerel),
                    tups_vacuumed, num_tuples,
                    vacrelstats->scanned_pages, nblocks),
             errdetail_internal("%s", buf.data)));



> So a chance
> collision with the bgwriter or checkpointer could cause the observed
> symptom, not just for a_star but for the other single-page relations that
> are at stake here.  Those pages are certainly dirty after create_misc.sql,
> so it's hardly implausible for one of these processes to be holding pin
> while trying to write out the buffer at the time sanity_check.sql runs.
> 
> A brute-force way to fix this (or at least reduce the odds quite a bit)
> would be to have sanity_check.sql issue a CHECKPOINT before its VACUUM,
> thereby guaranteeing that none of these pages are still in need of being
> written.  Not sure how much that'd penalize the regression tests' runtime,
> or whether we'd have a loss of test coverage of VACUUM behaviors.

Alternatively we could VACUUM FREEZE the relevant tables? That then
ought to hit the blocking codepath in lazu_scan_heap()?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: refactoring - share str2*int64 functions
Next
From: Andres Freund
Date:
Subject: Re: refactoring - share str2*int64 functions