Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Msg-id CAH2-Wz=4yg7PBaqmxJjhxEJYPNz7VZC3_NDJ7_RHcnicmX+B7A@mail.gmail.com
In response to Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Tue, Jun 8, 2021 at 2:23 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I'm not sure what you're suggesting ?  Maybe I should add some NOTICES there.

Here is one approach that might work: Can you check if the assertion
added by the attached patch fails very quickly with your test case?

This does nothing more than trigger an assertion failure in the event
of retrying a second time for any given heap page. Theoretically that
could happen without there being any bug -- in principle we might have
to retry several times for the same page. In practice the chances of
it happening even once are vanishingly low, though -- so two times
strongly signals a bug. It was quite hard to hit the "goto restart"
even once during my testing. There is still no test coverage for
that line of code because it's so hard to hit.
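
To make the idea concrete, here is a minimal standalone sketch of the
"retry at most once per page" check. It is not the attached patch and it
is not PostgreSQL code: FakePage, fake_prune(), fake_visibility_check()
and scan_page() are hypothetical stand-ins for the prune step, the
HeapTupleSatisfiesVacuum() call and the per-page loop. The sketch returns
-1 at the point where an assertion would fire, so the behavior can be
checked without crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL. */
typedef enum { TUPLE_LIVE, TUPLE_DEAD } TupleStatus;

typedef struct
{
    /* How many times a concurrent abort lands between prune and check. */
    int aborts_pending;
} FakePage;

/* Stand-in for pruning: removes whatever is dead at this instant. */
static void
fake_prune(FakePage *page)
{
    (void) page;
}

/*
 * Stand-in for the post-prune visibility check. It reports TUPLE_DEAD
 * only when a transaction aborted after pruning returned.
 */
static TupleStatus
fake_visibility_check(FakePage *page)
{
    if (page->aborts_pending > 0)
    {
        page->aborts_pending--;
        return TUPLE_DEAD;
    }
    return TUPLE_LIVE;
}

/*
 * Process one page. Returns the number of retries (0 or 1), or -1 when
 * a second retry would be needed for the same page, which is where the
 * debugging assertion would trigger.
 */
static int
scan_page(FakePage *page)
{
    int retries = 0;

    assert(page != NULL);

restart:
    fake_prune(page);

    if (fake_visibility_check(page) == TUPLE_DEAD)
    {
        if (retries >= 1)
            return -1;      /* second retry: strong signal of a bug */
        retries++;
        goto restart;
    }

    return retries;
}
```

With aborts_pending set to 0 the page is processed with no retry, with 1
it retries once and succeeds, and with 2 or more it reaches the point
where the assertion would fail.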

If you find that the assertion is hit pretty quickly with the same
workload then you've all but reproduced the issue, probably in far
less time. And, if you know that there were no concurrently aborting
transactions then you can be 100% sure that you have reproduced the
issue -- this goto is only supposed to be executed when a transaction
that was in progress during the heap_page_prune() aborts after it
returns, but before we call HeapTupleSatisfiesVacuum() for one of the
aborted-xact tuples. It's supposed to be a super narrow thing.

> I'm not sure why/if pg_statistic is special, but I guess when analyze happens,
> it gets updated, and eventually processed by autovacuum.

pg_statistic is probably special, though only in a superficial way: it
is the system catalog that tends to be the most frequently vacuumed in
practice.

> In pg14, the parent table is auto-analyzed.

I wouldn't expect that to matter. The "ANALYZE portion" of the VACUUM
ANALYZE won't have started at the point that we get stuck.

-- 
Peter Geoghegan
