Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Matthias van de Meent
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date
Msg-id CAEze2Wj7O5tnM_U151Baxr5ObTJafwH=71_JEmgJV+6eBgjL7g@mail.gmail.com
Whole thread Raw
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
List pgsql-bugs
On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > I can propose the debugging patch to reproduce the issue that replaces
> > the hang with the assert and modifies a pair of crash-causing test
> > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > down the scripts further as I have to leave for a week.)
>
> This bug is similar to the one fixed in commit d9d8aa9b. And so I
> wonder if code like GlobalVisTestFor() is missing something that it
> needs for partitioned tables.

Without `autovacuum = off; fsync = off` I could not replicate the
issue in the configured 10m time window; with those options I did get
the reported trace in minutes.

I think that I also have found the culprit, which is something we
talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
to never move backwards when recalculated, and because vacuum can
update its snapshot bounds (heap_prune_satisfies_vacuum ->
GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
could move backwards, resulting in the observed behaviour.

It was my understanding based on the mail conversation that Andres
would fix this observed issue too while fixing [0] (whose fix was
included with beta 2), but apparently I was wrong; I can't find the
code for 'maybe_needed'-won't-move-backwards-in-a-backend.

I (again) propose the attached patch, which ensures that this
maybe_needed field will not move backwards for a backend. It is
based on 14, but should be applied on head as well, because it's
lacking there as well.

Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).

Note: After fixing the issue with retreating maybe_needed I also hit
your segfault, and I'm still trying to find out what the source of
that issue might be. I do think it is an issue seperate from stuck
vacuum, though.


Kind regards,

Matthias van de Meent

[0]
https://www.postgresql.org/message-id/flat/20210609184506.rqm5rikoikm47csf%40alap3.anarazel.de#e9d55b5cfff34238a24dc85c8c75a46f

Attachment

pgsql-bugs by date:

Previous
From: Robert Haas
Date:
Subject: Re: CREATE INDEX CONCURRENTLY does not index prepared xact's data
Next
From: PG Bug reporting form
Date:
Subject: BUG #17261: FK ON UPDATE CASCADE can break referential integrity with columns of different types