RE: Assert while autovacuum was executing - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Assert while autovacuum was executing
Date
Msg-id OS0PR01MB57167E9C3D729F55B5EC1E71945DA@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Assert while autovacuum was executing  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tuesday, June 20, 2023 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > >
> > > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova
> > > <jcasanov@systemguards.com.ec> wrote:
> > > > I have been testing 16beta1, last commit
> > > > a14e75eb0b6a73821e0d66c0d407372ec8376105
> > > > I just let sqlsmith do its magic before trying something else, and
> > > > today I found a core with the attached backtrace.
> > >
> > > The assertion that fails is the IsPageLockHeld assertion from commit
> 72e78d831a.
> > >
> >
> > I'll look into this and share my analysis.
> >
> 
> This failure mode appears to be introduced in commit 7d71d3dd08 (in
> PG16) where we started to process the config file after acquiring page lock
> during autovacuum. The problem here is that after acquiring page lock (a
> heavy-weight lock), while processing the config file, we tried to access the
> catalog cache which in turn attempts to acquire a lock on the catalog relation,
> and that leads to the assertion failure. This is because of an existing rule that we
> don't acquire any other heavyweight lock while holding the page lock except
> for relation extension. I think normally we should be careful about the lock
> ordering for heavy-weight locks to avoid deadlocks but here there may not be
> any existing hazard in acquiring a lock on the catalog table after acquiring page
> lock on the gin index's metapage as I am not aware of a scenario where we can
> acquire them in reverse order. One naive idea is to have a parameter like
> vacuum_config_reload_safe to allow config reload during autovacuum and
> make it false for the gin index cleanup code path.

I also think it would be better to skip reloading config in page lock cases to be consistent
with the rule. And here is the patch which does the same.

I tried to reproduce the assert failure(before applying the patch) using the following steps:

1. I added a sleep before vacuum_delay_point in ginInsertCleanup and LOG("attach this process") before sleeping.
2. And changed few GUC to make autovacuum happen more frequently and then start the server.
-
autovacuum_naptime = 5s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_insert_threshold = 100
-

3. Then I execute the following sqls:
-
create table gin_test_tbl(i int4[]);
create index gin_test_idx on gin_test_tbl using gin (i)
  with (fastupdate = on, gin_pending_list_limit = 4096);
insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g;
insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g;
-
4. After a while, I can see the LOG from autovacuum worker and then use gdb to attach to the autovacuum worker.
5. When the autovacuum worker is blocked, I changed the "default_text_search_config = 'pg_catalog.public'" in configure
fileand reload it.
 
6. Release the autovacuum worker and then I can see the assert failure.

And I can see the assert failure doesn't happen after applying the patch.

> 
> The reason for the existing rule for page lock and relation extension locks was
> to not allow them to participate in group locking which will allow other parallel
> operations like a parallel vacuum where multiple workers can work on the same
> index, or parallel inserts, parallel copy, etc. The related commits are 15ef6ff4b9,
> 72e78d831ab, 85f6b49c2c, and 3ba59ccc89. See 3ba59ccc89 for more details
> (To allow parallel inserts and parallel copy, we have ensured that relation
> extension and page locks don't participate in group locking which means such
> locks can conflict among the same group members. This is required as it is no
> safer for two related processes to extend the same relation or perform clean
> up in gin indexes at a time than for unrelated processes to do the same....).

Best Regards,
Hou zj

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
Next
From: Nathan Bossart
Date:
Subject: Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX