catcache search while BUFFER_LOCK_EXCLUSIVE on catalog - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | catcache search while BUFFER_LOCK_EXCLUSIVE on catalog |
Date | |
Msg-id | 20250410191830.0e.nmisch@google.com Whole thread Raw |
List | pgsql-hackers |
Given the severity of the (never released) bug that commit 0bada39 fixed, I felt I owed improvements to our ability to detect the next bug like it. The attached patch adds pertinent assertions. For obstacles to the original bug detection, see Appendix 1. One of the exceptions this patch allows is new for v18. I think that exception is okay, but I'm starting this thread to invite other opinions. In v17, varstr_cmp() avoids catcache searches for collation "C", so catalog index maintenance doesn't do catcache searches. Non-catalog, non-"C" index maintenance does COLLID catcache searches in v17 (likely in all supported versions). v18 additionally reaches those COLLID catcache searches when updating catalog indexes having text columns: [local] test=*# select attrelid::regclass, attname , attcollation::regcollation from pg_attribute a join pg_index i on attrelid= i.indexrelid where atttypid = 'text'::regtype and indisunique; attrelid │ attname │ attcollation ────────────────────────────────────┼──────────┼────────────── pg_replication_origin_roname_index │ roname │ "C" pg_seclabel_object_index │ provider │ "C" pg_shseclabel_object_index │ provider │ "C" pg_parameter_acl_parname_index │ parname │ "C" (4 rows) See Appendix 2 for a stack trace from an earlier version of the patch. relcache builds don't use those indexes, and that's not likely to change. Hence, I don't see a need to move or avoid these catcache searches. We'd have problems if we later make a text column part of a UNIQUE key of, say, pg_class. An INSERT of pg_class would then do a COLLID search while we hold an exclusive lock on a pg_class index page. If the COLLID search reached a relcache build that attempts to read that index page, the outcome would be an undetected lwlock deadlock. Separately, the catcache.c part of the attached patch makes a questionable allowance for TypeCacheRelCallback() to reach GetSysCacheHashValue1(TYPEOID) when not in a transaction. I suspect that's buggy after an OOM at lookup_type_cache() -> CreateCacheMemoryContext() leaves the callback registered without having initialized the TYPEOID catcache. Other than that improbable outcome, it seems fine. Unlike the main $SUBJECT, this is ancient. ==== Appendix 1: difficulty of original commit 0bada39 bug detection It required a relcache inval in this call stack: PrepareToInvalidateCacheTuple() [common] CatalogCacheInitializeCache() [common: happens *unless* debug_discard_caches] table_open(cache->cc_reloid, AccessShareLock) [always] RelationBuildDesc() [rare: only on cache miss / inval] debug_discard_caches doesn't reproduce it, because it reaches CatalogCacheInitializeCache() earlier, via this code: if (needNewCacheFile) { /* * Force all the catcaches to finish initializing and thereby open the * catalogs and indexes they use. This will preload the relcache with * entries for all the most important system catalogs and indexes, so * that the init files will be most useful for future backends. */ InitCatalogCachePhase2(); ==== Appendix 2: stack of COLLID search while catalog buffer exclusive-locked #0 0x00007f07a3241387 in raise () from /lib64/libc.so.6 #1 0x00007f07a3242a78 in abort () from /lib64/libc.so.6 #2 0x0000000000c08133 in ExceptionalCondition ( conditionName=0xe4a298 "tag.spcOid != GLOBALTABLESPACE_OID", fileName=0xe49640 "bufmgr.c", lineNumber=4119) at assert.c:66 #3 0x00000000009d90f0 in AssertNotCatalogBufferLock (lock=0x7f07921e4ab0, unused_context=0x0) at bufmgr.c:4119 #4 0x0000000000a10849 in ForEachLWLockHeldByMeInMode (mode=LW_EXCLUSIVE, callback=0x9d906f <AssertNotCatalogBufferLock>, context=0x0) at lwlock.c:1976 #5 0x00000000009d9165 in AssertNoCatalogBufferLocksInMode (mode=2) at bufmgr.c:4142 #6 0x0000000000be3770 in ConditionalCatalogCacheInitializeCache ( cache=0x246bd80) at catcache.c:1064 #7 0x0000000000be3d59 in SearchCatCacheInternal (cache=0x246bd80, nkeys=1, v1=950, v2=0, v3=0, v4=0) at catcache.c:1397 #8 0x0000000000be3c29 in SearchCatCache1 (cache=0x246bd80, v1=950) at catcache.c:1344 #9 0x0000000000c01064 in SearchSysCache1 (cacheId=16, key1=950) at syscache.c:228 #10 0x0000000000b4054a in create_pg_locale (collid=950, context=0x24dec20) at pg_locale.c:1075 #11 0x0000000000b40b1e in pg_newlocale_from_collation (collid=950) at pg_locale.c:1223 #12 0x0000000000bcd4a4 in varstr_cmp (arg1=0x7f07965dbfe1 "pg_27254", len1=8, arg2=0x242da19 "pg_34104", len2=8, collid=950) at varlena.c:1617 #13 0x0000000000bcd760 in text_cmp (arg1=0x7f07965dbfe0, arg2=0x242da18, collid=950) at varlena.c:1671 #14 0x0000000000bce102 in bttextcmp (fcinfo=0x7fff47fe50a0) at varlena.c:1892 #15 0x0000000000c13332 in FunctionCall2Coll (flinfo=0x242c878, collation=950, arg1=139670564224992, arg2=37935640) at fmgr.c:1161 #16 0x000000000054019a in _bt_compare (rel=0x7f07a4a6b218, key=0x242c850, page=0x7f07965da000 "", offnum=1) at nbtsearch.c:765 #17 0x000000000053fc17 in _bt_binsrch_insert (rel=0x7f07a4a6b218, insertstate=0x7fff47fe54c0) at nbtsearch.c:537 #18 0x000000000052dbd8 in _bt_check_unique (rel=0x7f07a4a6b218, insertstate=0x7fff47fe54c0, heapRel=0x7f07a4a6a338, checkUnique=UNIQUE_CHECK_YES, is_unique=0x7fff47fe54f1, speculativeToken=0x7fff47fe54bc) at nbtinsert.c:443 #19 0x000000000052d74f in _bt_doinsert (rel=0x7f07a4a6b218, itup=0x242da10, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, heapRel=0x7f07a4a6a338) at nbtinsert.c:210 #20 0x000000000053c4e3 in btinsert (rel=0x7f07a4a6b218, values=0x7fff47fe5610, isnull=0x7fff47fe55f0, ht_ctid=0x242d534, heapRel=0x7f07a4a6a338, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x24ea0c8) at nbtree.c:215 #21 0x0000000000527c5a in index_insert (indexRelation=0x7f07a4a6b218, values=0x7fff47fe5610, isnull=0x7fff47fe55f0, heap_t_ctid=0x242d534, heapRelation=0x7f07a4a6a338, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x24ea0c8) at indexam.c:230 #22 0x00000000005dd100 in CatalogIndexInsert (indstate=0x243c1e8, heapTuple=0x242d530, updateIndexes=TU_All) at indexing.c:170 #23 0x00000000005dd21c in CatalogTupleInsert (heapRel=0x7f07a4a6a338, tup=0x242d530) at indexing.c:243 #24 0x000000000096fbac in replorigin_create (roname=0x7fff47fe5a30 "pg_34104") at origin.c:324 #25 0x00000000006f88c6 in CreateSubscription (pstate=0x242c710, stmt=0x2401f80, isTopLevel=true) at subscriptioncmds.c:696 #26 0x0000000000a34d61 in ProcessUtilitySlow (pstate=0x242c710, pstmt=0x2402030, queryString=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpubWITH (connect = false);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x24023b0, qc=0x7fff47fe6360) at utility.c:1860 #27 0x0000000000a334db in standard_ProcessUtility (pstmt=0x2402030, queryString=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpubWITH (connect = false);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x24023b0, qc=0x7fff47fe6360) at utility.c:1070 #28 0x0000000000a32674 in ProcessUtility (pstmt=0x2402030, queryString=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpubWITH (connect = false);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x24023b0, qc=0x7fff47fe6360) at utility.c:523 #29 0x0000000000a31259 in PortalRunUtility (portal=0x2485240, pstmt=0x2402030, isTopLevel=true, setHoldSnapshot=false, dest=0x24023b0, qc=0x7fff47fe6360) at pquery.c:1185 #30 0x0000000000a314f5 in PortalRunMulti (portal=0x2485240, isTopLevel=true, setHoldSnapshot=false, dest=0x24023b0, altdest=0x24023b0, qc=0x7fff47fe6360) at pquery.c:1349 #31 0x0000000000a309e8 in PortalRun (portal=0x2485240, count=9223372036854775807, isTopLevel=true, dest=0x24023b0, altdest=0x24023b0, qc=0x7fff47fe6360) at pquery.c:820 #32 0x0000000000a29c84 in exec_simple_query ( query_string=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpubWITH (connect = false);") at postgres.c:1274 #33 0x0000000000a2ea3d in PostgresMain (dbname=0x243eaf0 "regression", username=0x243ead8 "nm") at postgres.c:4768 #34 0x0000000000a2599e in BackendMain (startup_data=0x7fff47fe66b0, startup_data_len=24) at backend_startup.c:124 #35 0x000000000093fe47 in postmaster_child_launch (child_type=B_BACKEND, child_slot=20, startup_data=0x7fff47fe66b0, startup_data_len=24, client_sock=0x7fff47fe6700) at launch_backend.c:290 #36 0x0000000000945d55 in BackendStartup (client_sock=0x7fff47fe6700) at postmaster.c:3580 #37 0x0000000000943694 in ServerLoop () at postmaster.c:1702 #38 0x000000000094305c in PostmasterMain (argc=8, argv=0x23fbb30) at postmaster.c:1400 #39 0x00000000007f6c6e in main (argc=8, argv=0x23fbb30) at main.c:227
Attachment
pgsql-hackers by date: