On Fri, Oct 8, 2021 at 8:13 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Oct-06, Masahiko Sawada wrote:
> > A customer reported that during parallel index vacuum, the oldest xmin
> > doesn't advance. Normally, the calculation of oldest xmin
> > (ComputeXidHorizons()) ignores xmin/xid of processes having
> > PROC_IN_VACUUM flag in MyProc->statusFlags. But since parallel vacuum
> > workers don’t set their statusFlags, the xmin of the parallel vacuum
> > worker is considered to calculate the oldest xmin. This issue happens
> > from PG13 where the parallel vacuum was introduced. I think it's a
> > bug.
>
> Augh, yeah, I agree this is a pretty serious problem.
So is this comparable problem, which happens to be much older:
https://postgr.es/m/CAH2-WzkjrK556enVtFLmyXEdw91xGuwiyZVep2kp5yQT_-3JDg@mail.gmail.com
In both cases we see bugs (or implementation deficiencies) that
accidentally block ComputeXidHorizons() for hours, when that isn't
truly necessary. Practically all users are not sure of whether or not
VACUUM behaves like a long running transaction already, in general, so
we shouldn't be surprised that it takes so long for us to hear about
issues like this.
I think that we should try to find a way of making this whole class of
problems easier to identify in production. There needs to be greater
visibility into what process holds back VACUUM, and how long that
lasts -- something easy to use, and *obvious*. That would be a very
useful feature in general. It would also make catching these issues
early far more likely. It's just *not okay* that you have to follow long
and complicated instructions [1] to get just some of this information.
How can something this important just be an afterthought?
[1] https://www.cybertec-postgresql.com/en/reasons-why-vacuum-wont-remove-dead-rows/
--
Peter Geoghegan