Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs
From | Matthias van de Meent |
---|---|
Subject | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
Date | |
Msg-id | CAEze2WhxhEQEx+c+CXoDpQs1H1HgkYUK4BW-hFw5_eQxuVWqRw@mail.gmail.com |
In response to | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
Responses | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
List | pgsql-bugs |
On Mon, 1 Nov 2021 at 16:15, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > I can propose the debugging patch to reproduce the issue that replaces
> > > the hang with the assert and modifies a pair of crash-causing test
> > > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > > down the scripts further as I have to leave for a week.)
> >
> > This bug is similar to the one fixed in commit d9d8aa9b. And so I
> > wonder if code like GlobalVisTestFor() is missing something that it
> > needs for partitioned tables.
>
> Without `autovacuum = off; fsync = off` I could not replicate the
> issue in the configured 10m time window; with those options I did get
> the reported trace in minutes.
>
> I think that I also have found the culprit, which is something we
> talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
> to never move backwards when recalculated, and because vacuum can
> update its snapshot bounds (heap_prune_satisfies_vacuum ->
> GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
> could move backwards, resulting in the observed behaviour.
>
> It was my understanding based on the mail conversation that Andres
> would fix this observed issue too while fixing [0] (whose fix was
> included with beta 2), but apparently I was wrong; I can't find the
> code for 'maybe_needed'-won't-move-backwards-in-a-backend.
>
> I (again) propose the attached patch, which ensures that this
> maybe_needed field will not move backwards for a backend. It is
> based on 14, but should be applied on head too, because the fix is
> lacking there as well.
>
> Another alternative would be to replace the use of vacrel->OldestXmin
> with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
> that is not legal in how vacuum works (we cannot unilaterally decide
> that we want to retain tuples < OldestXmin).
>
> Note: After fixing the issue with the retreating maybe_needed I also
> hit your segfault, and I'm still trying to find out what the source
> of that issue might be. I do think it is an issue separate from the
> stuck vacuum, though.

After further debugging, I think both issues might have the same cause: xmin horizon confusion resulting from restored snapshots.

I repeatedly see backends whose xmin is set from InvalidTransactionId to some value < min(ProcGlobal->xids), which then results in shared_oldest_nonremovable (and the other horizons) being less than the value computed in the previous iteration. This leads to the infinite loop in lazy_scan_prune: it stores and uses one value of *_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date variant.
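To make the failure mode concrete: a tuple whose deleting transaction falls between the recomputed pruning horizon and the latched OldestXmin is "dead" for the retry check in lazy_scan_prune, but not removable for heap_page_prune, so the retry can never make progress. Here is a self-contained toy model of just that predicate disagreement (illustrative only; the names are made up and none of this is PostgreSQL code):

```c
/*
 * Toy model of the disagreement described above -- illustrative only, not
 * PostgreSQL code; all names (ToyXid, dead_by_oldest_xmin, ...) are made up.
 * "oldest_xmin" plays the role of the horizon latched once at the start of
 * vacuum; "maybe_needed" plays the role of the pruning horizon that is
 * recomputed per pass and, per this report, can move backwards.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int ToyXid;

/* "Dead" according to the latched OldestXmin (the lazy_scan_prune check). */
static bool
dead_by_oldest_xmin(ToyXid tuple_xmax, ToyXid oldest_xmin)
{
	return tuple_xmax < oldest_xmin;
}

/* "Removable" according to the recomputed pruning horizon (heap_page_prune). */
static bool
removable_by_vistest(ToyXid tuple_xmax, ToyXid maybe_needed)
{
	return tuple_xmax < maybe_needed;
}

int
main(void)
{
	ToyXid		oldest_xmin = 1000;	/* latched once, never refreshed */
	ToyXid		maybe_needed;		/* recomputed on every pruning pass */
	ToyXid		tuple_xmax = 990;	/* deleter committed before oldest_xmin */

	for (int attempt = 1; attempt <= 5; attempt++)
	{
		/* the recomputation moves the horizon backwards, as observed */
		maybe_needed = 985;

		bool		pruned = removable_by_vistest(tuple_xmax, maybe_needed);
		bool		dead = dead_by_oldest_xmin(tuple_xmax, oldest_xmin);

		printf("attempt %d: pruned=%d dead=%d -> %s\n",
			   attempt, pruned, dead,
			   (dead && !pruned) ? "retry, no progress" : "done");

		if (pruned || !dead)
			break;				/* the real loop only exits when they agree */
	}

	return 0;
}
```

With the horizon stuck behind OldestXmin, every attempt reports "retry, no progress", which is the hang reported in this thread.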
Ergo, this issue is not really solved by my previous patch, because apparently at this point we have snapshots with an xmin that is only registered in the backend's procarray entry once that xmin is already out of scope, which makes it generally impossible to determine which tuples may or may not yet be vacuumed.

I noticed that when this happens, a parallel vacuum worker is generally involved. I also think that this is intimately related to [0] and to how snapshots are restored in parallel workers: a vacuum worker is generally ignored for horizon purposes, but if its snapshot has the oldest xmin available, then a parallel worker launched from that vacuum worker will move the visible xmin backwards. The same holds for concurrent index creation jobs.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/202110191807.5svc3kmm32tl%40alvherre.pgsql