catcache search while BUFFER_LOCK_EXCLUSIVE on catalog - Mailing list pgsql-hackers

From Noah Misch
Subject catcache search while BUFFER_LOCK_EXCLUSIVE on catalog
Date
Msg-id 20250410191830.0e.nmisch@google.com
Whole thread Raw
List pgsql-hackers
Given the severity of the (never released) bug that commit 0bada39 fixed, I
felt I owed improvements to our ability to detect the next bug like it.  The
attached patch adds pertinent assertions.  For obstacles to the original bug
detection, see Appendix 1.

One of the exceptions this patch allows is new for v18.  I think that
exception is okay, but I'm starting this thread to invite other opinions.  In
v17, varstr_cmp() avoids catcache searches for collation "C", so catalog index
maintenance doesn't do catcache searches.  Non-catalog, non-"C" index
maintenance does COLLID catcache searches in v17 (likely in all supported
versions).  v18 additionally reaches those COLLID catcache searches when
updating catalog indexes having text columns:

[local] test=*# select attrelid::regclass, attname , attcollation::regcollation from pg_attribute a join pg_index i on
attrelid= i.indexrelid where atttypid = 'text'::regtype and indisunique;
 
              attrelid              │ attname  │ attcollation 
────────────────────────────────────┼──────────┼──────────────
 pg_replication_origin_roname_index │ roname   │ "C"
 pg_seclabel_object_index           │ provider │ "C"
 pg_shseclabel_object_index         │ provider │ "C"
 pg_parameter_acl_parname_index     │ parname  │ "C"
(4 rows)

See Appendix 2 for a stack trace from an earlier version of the patch.
relcache builds don't use those indexes, and that's not likely to change.
Hence, I don't see a need to move or avoid these catcache searches.  We'd have
problems if we later make a text column part of a UNIQUE key of, say,
pg_class.  An INSERT of pg_class would then do a COLLID search while we hold
an exclusive lock on a pg_class index page.  If the COLLID search reached a
relcache build that attempts to read that index page, the outcome would be an
undetected lwlock deadlock.


Separately, the catcache.c part of the attached patch makes a questionable
allowance for TypeCacheRelCallback() to reach GetSysCacheHashValue1(TYPEOID)
when not in a transaction.  I suspect that's buggy after an OOM at
lookup_type_cache() -> CreateCacheMemoryContext() leaves the callback
registered without having initialized the TYPEOID catcache.  Other than that
improbable outcome, it seems fine.  Unlike the main $SUBJECT, this is ancient.


==== Appendix 1: difficulty of original commit 0bada39 bug detection

It required a relcache inval in this call stack:

PrepareToInvalidateCacheTuple() [common]
  CatalogCacheInitializeCache() [common: happens *unless* debug_discard_caches]
    table_open(cache->cc_reloid, AccessShareLock) [always]
      RelationBuildDesc() [rare: only on cache miss / inval]

debug_discard_caches doesn't reproduce it, because it reaches
CatalogCacheInitializeCache() earlier, via this code:

    if (needNewCacheFile)
    {
        /*
         * Force all the catcaches to finish initializing and thereby open the
         * catalogs and indexes they use.  This will preload the relcache with
         * entries for all the most important system catalogs and indexes, so
         * that the init files will be most useful for future backends.
         */
        InitCatalogCachePhase2();


==== Appendix 2: stack of COLLID search while catalog buffer exclusive-locked

#0  0x00007f07a3241387 in raise () from /lib64/libc.so.6
#1  0x00007f07a3242a78 in abort () from /lib64/libc.so.6
#2  0x0000000000c08133 in ExceptionalCondition (
    conditionName=0xe4a298 "tag.spcOid != GLOBALTABLESPACE_OID",
    fileName=0xe49640 "bufmgr.c", lineNumber=4119) at assert.c:66
#3  0x00000000009d90f0 in AssertNotCatalogBufferLock (lock=0x7f07921e4ab0,
    unused_context=0x0) at bufmgr.c:4119
#4  0x0000000000a10849 in ForEachLWLockHeldByMeInMode (mode=LW_EXCLUSIVE,
    callback=0x9d906f <AssertNotCatalogBufferLock>, context=0x0)
    at lwlock.c:1976
#5  0x00000000009d9165 in AssertNoCatalogBufferLocksInMode (mode=2)
    at bufmgr.c:4142
#6  0x0000000000be3770 in ConditionalCatalogCacheInitializeCache (
    cache=0x246bd80) at catcache.c:1064
#7  0x0000000000be3d59 in SearchCatCacheInternal (cache=0x246bd80, nkeys=1,
    v1=950, v2=0, v3=0, v4=0) at catcache.c:1397
#8  0x0000000000be3c29 in SearchCatCache1 (cache=0x246bd80, v1=950)
    at catcache.c:1344
#9  0x0000000000c01064 in SearchSysCache1 (cacheId=16, key1=950)
    at syscache.c:228
#10 0x0000000000b4054a in create_pg_locale (collid=950, context=0x24dec20)
    at pg_locale.c:1075
#11 0x0000000000b40b1e in pg_newlocale_from_collation (collid=950)
    at pg_locale.c:1223
#12 0x0000000000bcd4a4 in varstr_cmp (arg1=0x7f07965dbfe1 "pg_27254", len1=8,
    arg2=0x242da19 "pg_34104", len2=8, collid=950) at varlena.c:1617
#13 0x0000000000bcd760 in text_cmp (arg1=0x7f07965dbfe0, arg2=0x242da18,
    collid=950) at varlena.c:1671
#14 0x0000000000bce102 in bttextcmp (fcinfo=0x7fff47fe50a0) at varlena.c:1892
#15 0x0000000000c13332 in FunctionCall2Coll (flinfo=0x242c878, collation=950,
    arg1=139670564224992, arg2=37935640) at fmgr.c:1161
#16 0x000000000054019a in _bt_compare (rel=0x7f07a4a6b218, key=0x242c850,
    page=0x7f07965da000 "", offnum=1) at nbtsearch.c:765
#17 0x000000000053fc17 in _bt_binsrch_insert (rel=0x7f07a4a6b218,
    insertstate=0x7fff47fe54c0) at nbtsearch.c:537
#18 0x000000000052dbd8 in _bt_check_unique (rel=0x7f07a4a6b218,
    insertstate=0x7fff47fe54c0, heapRel=0x7f07a4a6a338,
    checkUnique=UNIQUE_CHECK_YES, is_unique=0x7fff47fe54f1,
    speculativeToken=0x7fff47fe54bc) at nbtinsert.c:443
#19 0x000000000052d74f in _bt_doinsert (rel=0x7f07a4a6b218, itup=0x242da10,
    checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, heapRel=0x7f07a4a6a338)
    at nbtinsert.c:210
#20 0x000000000053c4e3 in btinsert (rel=0x7f07a4a6b218, values=0x7fff47fe5610,
    isnull=0x7fff47fe55f0, ht_ctid=0x242d534, heapRel=0x7f07a4a6a338,
    checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x24ea0c8)
    at nbtree.c:215
#21 0x0000000000527c5a in index_insert (indexRelation=0x7f07a4a6b218,
    values=0x7fff47fe5610, isnull=0x7fff47fe55f0, heap_t_ctid=0x242d534,
    heapRelation=0x7f07a4a6a338, checkUnique=UNIQUE_CHECK_YES,
    indexUnchanged=false, indexInfo=0x24ea0c8) at indexam.c:230
#22 0x00000000005dd100 in CatalogIndexInsert (indstate=0x243c1e8,
    heapTuple=0x242d530, updateIndexes=TU_All) at indexing.c:170
#23 0x00000000005dd21c in CatalogTupleInsert (heapRel=0x7f07a4a6a338,
    tup=0x242d530) at indexing.c:243
#24 0x000000000096fbac in replorigin_create (roname=0x7fff47fe5a30 "pg_34104")
    at origin.c:324
#25 0x00000000006f88c6 in CreateSubscription (pstate=0x242c710,
    stmt=0x2401f80, isTopLevel=true) at subscriptioncmds.c:696
#26 0x0000000000a34d61 in ProcessUtilitySlow (pstate=0x242c710,
    pstmt=0x2402030,
    queryString=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION
testpubWITH (connect = false);",
 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x24023b0, qc=0x7fff47fe6360) at utility.c:1860
#27 0x0000000000a334db in standard_ProcessUtility (pstmt=0x2402030,
    queryString=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION
testpubWITH (connect = false);",
 
    readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x24023b0, qc=0x7fff47fe6360) at utility.c:1070
#28 0x0000000000a32674 in ProcessUtility (pstmt=0x2402030,
    queryString=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION
testpubWITH (connect = false);",
 
    readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x24023b0, qc=0x7fff47fe6360) at utility.c:523
#29 0x0000000000a31259 in PortalRunUtility (portal=0x2485240, pstmt=0x2402030,
    isTopLevel=true, setHoldSnapshot=false, dest=0x24023b0, qc=0x7fff47fe6360)
    at pquery.c:1185
#30 0x0000000000a314f5 in PortalRunMulti (portal=0x2485240, isTopLevel=true,
    setHoldSnapshot=false, dest=0x24023b0, altdest=0x24023b0,
    qc=0x7fff47fe6360) at pquery.c:1349
#31 0x0000000000a309e8 in PortalRun (portal=0x2485240,
    count=9223372036854775807, isTopLevel=true, dest=0x24023b0,
    altdest=0x24023b0, qc=0x7fff47fe6360) at pquery.c:820
#32 0x0000000000a29c84 in exec_simple_query (
    query_string=0x2401390 "CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION
testpubWITH (connect = false);")
 
    at postgres.c:1274
#33 0x0000000000a2ea3d in PostgresMain (dbname=0x243eaf0 "regression",
    username=0x243ead8 "nm") at postgres.c:4768
#34 0x0000000000a2599e in BackendMain (startup_data=0x7fff47fe66b0,
    startup_data_len=24) at backend_startup.c:124
#35 0x000000000093fe47 in postmaster_child_launch (child_type=B_BACKEND,
    child_slot=20, startup_data=0x7fff47fe66b0, startup_data_len=24,
    client_sock=0x7fff47fe6700) at launch_backend.c:290
#36 0x0000000000945d55 in BackendStartup (client_sock=0x7fff47fe6700)
    at postmaster.c:3580
#37 0x0000000000943694 in ServerLoop () at postmaster.c:1702
#38 0x000000000094305c in PostmasterMain (argc=8, argv=0x23fbb30)
    at postmaster.c:1400
#39 0x00000000007f6c6e in main (argc=8, argv=0x23fbb30) at main.c:227

Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Non-text mode for pg_dumpall
Next
From: Kirill Reshke
Date:
Subject: Re: tab complete for COPY populated materialized view TO