RE: Assert while autovacuum was executing - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Assert while autovacuum was executing |
Date | |
Msg-id | OS0PR01MB57167E9C3D729F55B5EC1E71945DA@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Assert while autovacuum was executing (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Tuesday, June 20, 2023 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > > > > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova > > > <jcasanov@systemguards.com.ec> wrote: > > > > I have been testing 16beta1, last commit > > > > a14e75eb0b6a73821e0d66c0d407372ec8376105 > > > > I just let sqlsmith do its magic before trying something else, and > > > > today I found a core with the attached backtrace. > > > > > > The assertion that fails is the IsPageLockHeld assertion from commit > 72e78d831a. > > > > > > > I'll look into this and share my analysis. > > > > This failure mode appears to be introduced in commit 7d71d3dd08 (in > PG16) where we started to process the config file after acquiring page lock > during autovacuum. The problem here is that after acquiring page lock (a > heavy-weight lock), while processing the config file, we tried to access the > catalog cache which in turn attempts to acquire a lock on the catalog relation, > and that leads to the assertion failure. This is because of an existing rule that we > don't acquire any other heavyweight lock while holding the page lock except > for relation extension. I think normally we should be careful about the lock > ordering for heavy-weight locks to avoid deadlocks but here there may not be > any existing hazard in acquiring a lock on the catalog table after acquiring page > lock on the gin index's metapage as I am not aware of a scenario where we can > acquire them in reverse order. One naive idea is to have a parameter like > vacuum_config_reload_safe to allow config reload during autovacuum and > make it false for the gin index cleanup code path. I also think it would be better to skip reloading config in page lock cases to be consistent with the rule. And here is the patch which does the same. I tried to reproduce the assert failure(before applying the patch) using the following steps: 1. I added a sleep before vacuum_delay_point in ginInsertCleanup and LOG("attach this process") before sleeping. 2. And changed few GUC to make autovacuum happen more frequently and then start the server. - autovacuum_naptime = 5s autovacuum_vacuum_threshold = 1 autovacuum_vacuum_insert_threshold = 100 - 3. Then I execute the following sqls: - create table gin_test_tbl(i int4[]); create index gin_test_idx on gin_test_tbl using gin (i) with (fastupdate = on, gin_pending_list_limit = 4096); insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g; insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g; - 4. After a while, I can see the LOG from autovacuum worker and then use gdb to attach to the autovacuum worker. 5. When the autovacuum worker is blocked, I changed the "default_text_search_config = 'pg_catalog.public'" in configure fileand reload it. 6. Release the autovacuum worker and then I can see the assert failure. And I can see the assert failure doesn't happen after applying the patch. > > The reason for the existing rule for page lock and relation extension locks was > to not allow them to participate in group locking which will allow other parallel > operations like a parallel vacuum where multiple workers can work on the same > index, or parallel inserts, parallel copy, etc. The related commits are 15ef6ff4b9, > 72e78d831ab, 85f6b49c2c, and 3ba59ccc89. See 3ba59ccc89 for more details > (To allow parallel inserts and parallel copy, we have ensured that relation > extension and page locks don't participate in group locking which means such > locks can conflict among the same group members. This is required as it is no > safer for two related processes to extend the same relation or perform clean > up in gin indexes at a time than for unrelated processes to do the same....). Best Regards, Hou zj
Attachment
pgsql-hackers by date: