Re: Assert while autovacuum was executing - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Assert while autovacuum was executing
Msg-id CAA4eK1KKUaVEhCbOcs9UrO3cPK6Z_2n80GoybnW-yiPavjNzCA@mail.gmail.com
In response to Re: Assert while autovacuum was executing  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova
> > <jcasanov@systemguards.com.ec> wrote:
> > > I have been testing 16beta1, last commit
> > > a14e75eb0b6a73821e0d66c0d407372ec8376105
> > > I just let sqlsmith do its magic before trying something else, and
> > > today I found a core with the attached backtrace.
> >
> > The assertion that fails is the IsPageLockHeld assertion from commit 72e78d831a.
> >
>
> I'll look into this and share my analysis.
>

This failure mode appears to have been introduced by commit 7d71d3dd08
(in PG16), where we started to process the config file after acquiring
a page lock during autovacuum. The problem is that after acquiring the
page lock (a heavyweight lock), while processing the config file, we
try to access the catalog cache, which in turn attempts to acquire a
lock on the catalog relation, and that leads to the assertion failure.
This is because of an existing rule that we don't acquire any other
heavyweight lock while holding a page lock, except for the relation
extension lock. I think normally we should be careful about the lock
ordering for heavyweight locks to avoid deadlocks, but here there may
not be any existing hazard in acquiring a lock on the catalog table
after acquiring the page lock on the GIN index's metapage, as I am not
aware of a scenario where we can acquire them in the reverse order. One
naive idea is to have a parameter like vacuum_config_reload_safe to
allow config reload during autovacuum and make it false for the GIN
index cleanup code path.
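
To make the rule concrete, here is a minimal sketch of the kind of
check that fires. It is a standalone model, not the lock manager code:
the type and function names (ModelLockTagType, model_lock_acquire, and
so on) are hypothetical, and I am assuming only a single per-backend
flag that is set while a page lock is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified lock tag types; not the real definitions. */
typedef enum
{
    MODEL_TAG_RELATION,         /* e.g. a lock on a catalog relation */
    MODEL_TAG_RELATION_EXTEND,  /* relation extension lock */
    MODEL_TAG_PAGE              /* page lock, e.g. on a GIN metapage */
} ModelLockTagType;

/* Assumption: one per-backend flag, true while a page lock is held. */
static bool model_page_lock_held = false;

/*
 * The rule from the mail: while a page lock is held, the only other
 * heavyweight lock we may acquire is a relation extension lock.
 */
static bool
model_lock_acquire_allowed(ModelLockTagType type)
{
    return !model_page_lock_held || type == MODEL_TAG_RELATION_EXTEND;
}

/* Acquire a lock in the model; asserts if the rule is violated. */
static void
model_lock_acquire(ModelLockTagType type)
{
    assert(model_lock_acquire_allowed(type));
    if (type == MODEL_TAG_PAGE)
        model_page_lock_held = true;
}

/* Release a lock in the model; only page locks affect the flag. */
static void
model_lock_release(ModelLockTagType type)
{
    if (type == MODEL_TAG_PAGE)
        model_page_lock_held = false;
}
```

In this model, calling model_lock_acquire(MODEL_TAG_PAGE) followed by
model_lock_acquire(MODEL_TAG_RELATION) trips the assert, which
corresponds to the config reload reaching the catalog cache while the
metapage lock is held.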

The existing rule for page locks and relation extension locks was
introduced so that they need not participate in group locking, which
in turn allows other parallel operations such as a parallel vacuum
where multiple workers can work on the same index, parallel inserts,
parallel copy, etc. The related commits are 15ef6ff4b9, 72e78d831ab,
85f6b49c2c, and 3ba59ccc89. See 3ba59ccc89 for more details (To allow
parallel inserts and parallel copy, we have ensured that relation
extension and page locks don't participate in group locking which
means such locks can conflict among the same group members. This is
required as it is no safer for two related processes to extend the
same relation or perform clean up in gin indexes at a time than for
unrelated processes to do the same....).
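
As a rough illustration of what "don't participate in group locking"
means for conflict checking, consider the sketch below. It is a
deliberately simplified model with hypothetical names
(GroupModelTagType, group_model_conflicts); the real conflict check
works on lock modes and per-group hold counts, which I am collapsing
into two booleans here.

```c
#include <stdbool.h>

/* Hypothetical, simplified lock tag types for this sketch only. */
typedef enum
{
    GROUP_MODEL_RELATION,
    GROUP_MODEL_RELATION_EXTEND,
    GROUP_MODEL_PAGE
} GroupModelTagType;

/*
 * Returns true if a lock request must wait for the current holder.
 *
 * modes_conflict:  the requested mode conflicts with the held mode.
 * same_lock_group: requester and holder belong to the same lock group
 *                  (e.g. a leader and its parallel workers).
 */
static bool
group_model_conflicts(GroupModelTagType type,
                      bool modes_conflict,
                      bool same_lock_group)
{
    /* Compatible modes never conflict, regardless of grouping. */
    if (!modes_conflict)
        return false;

    /*
     * Relation extension and page locks are exempt from group locking:
     * they conflict even among members of the same group.
     */
    if (type == GROUP_MODEL_RELATION_EXTEND || type == GROUP_MODEL_PAGE)
        return true;

    /* Other lock types do not conflict within one lock group. */
    return !same_lock_group;
}
```

So in this model two workers of the same parallel vacuum would still
block each other on the page lock, while they would not block each
other on a relation lock.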

--
With Regards,
Amit Kapila.


