BUG #17947: Combination of replslots pgstat issues causes error/assertion failure - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #17947: Combination of replslots pgstat issues causes error/assertion failure |
Date | |
Msg-id | 17947-b9554521ad963c9c@postgresql.org Whole thread Raw |
Responses |
Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure
|
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 17947 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 15.3 Operating system: Ubuntu 22.04 Description: The following script: numclients=10 for ((c=1;c<=numclients;c++)); do ( echo " SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', false); SELECT pg_sleep(0.05); SELECT pg_drop_replication_slot('regression_slot'); SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', false); SELECT pg_sleep(0.05); SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding', true); SELECT pg_drop_replication_slot('regression_slot'); CREATE TABLE tbl_$c(id int); " | psql >psql-$c.log ) & done wait (with wal_level = logical in postgresql.conf) leads to an assertion failure (up to 10 runs required in my environment): 2023-05-26 13:08:09.243 MSK [2787598] ERROR: can only drop stats once TRAP: FailedAssertion("!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock, DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))", File: "dshash.c", Line: 400, PID: 2787598) Core was generated by `postgres: law regression [local] CREATE TABLE '. Program terminated with signal SIGABRT, Aborted. #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140618617759552) at ./nptl/pthread_kill.c:44 44 ./nptl/pthread_kill.c: No such file or directory. (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140618617759552) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=140618617759552) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140618617759552, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x00007fe45210e476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x00007fe4520f47f3 in __GI_abort () at ./stdlib/abort.c:79 #5 0x0000555f13dad3e8 in ExceptionalCondition ( conditionName=conditionName@entry=0x555f13f06090 "!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock, DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))", errorType=errorType@entry=0x555f13e0b00b "FailedAssertion", fileName=fileName@entry=0x555f13f05f6c "dshash.c", lineNumber=lineNumber@entry=400) at assert.c:69 #6 0x0000555f13b25896 in dshash_find (hash_table=0x555f15b6b668, key=key@entry=0x7fff8481e240, exclusive=exclusive@entry=false) at dshash.c:400 #7 0x0000555f13c93ea6 in pgstat_get_entry_ref (kind=kind@entry=PGSTAT_KIND_RELATION, dboid=dboid@entry=16384, objoid=objoid@entry=3467, create=create@entry=true, created_entry=created_entry@entry=0x0) at pgstat_shmem.c:439 #8 0x0000555f13c8edd9 in pgstat_prep_pending_entry (kind=kind@entry=PGSTAT_KIND_RELATION, dboid=16384, objoid=objoid@entry=3467, created_entry=created_entry@entry=0x0) at pgstat.c:1094 #9 0x0000555f13c91599 in pgstat_prep_relation_pending (isshared=false, rel_id=3467) at pgstat_relation.c:855 #10 pgstat_assoc_relation (rel=rel@entry=0x7fe446392740) at pgstat_relation.c:138 #11 0x0000555f1398ff8e in _bt_first (scan=scan@entry=0x555f15c3b5e0, dir=dir@entry=ForwardScanDirection) at nbtsearch.c:882 #12 0x0000555f13989e46 in btgettuple (scan=0x555f15c3b5e0, dir=ForwardScanDirection) at nbtree.c:242 #13 0x0000555f1397c014 in index_getnext_tid (scan=0x555f15c3b5e0, direction=<optimized out>) at indexam.c:533 #14 0x0000555f1397c28b in index_getnext_slot (scan=0x555f15c3b5e0, direction=direction@entry=ForwardScanDirection, slot=0x555f15c3b4c8) at indexam.c:625 #15 0x0000555f1397b426 in systable_getnext_ordered (sysscan=sysscan@entry=0x555f15c3b470, direction=direction@entry=ForwardScanDirection) at genam.c:715 #16 0x0000555f13d95e25 in BuildEventTriggerCache () at evtcache.c:151 #17 EventCacheLookup (event=<optimized out>, event@entry=EVT_SQLDrop) at evtcache.c:69 #18 0x0000555f13a72e32 in trackDroppedObjectsNeeded () at event_trigger.c:1147 #19 0x0000555f13a72ea0 in EventTriggerBeginCompleteQuery () at event_trigger.c:1089 #20 0x0000555f13c7c510 in ProcessUtilitySlow (pstate=0x555f15b91540, pstmt=0x555f15b70440, queryString=0x555f15b6f410 "CREATE TABLE tbl_4(id int);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, qc=0x7fff8481fe10, dest=<optimized out>) at utility.c:1114 #21 0x0000555f13c7bb53 in standard_ProcessUtility (pstmt=0x555f15b70440, queryString=0x555f15b6f410 "CREATE TABLE tbl_4(id int);", readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x555f15b70530, qc=0x7fff8481fe10) at utility.c:1074 #22 0x0000555f13c7a201 in PortalRunUtility (portal=portal@entry=0x555f15bdbbc0, pstmt=pstmt@entry=0x555f15b70440, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x555f15b70530, qc=qc@entry=0x7fff8481fe10) at pquery.c:1158 #23 0x0000555f13c7a33d in PortalRunMulti (portal=portal@entry=0x555f15bdbbc0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x555f15b70530, altdest=altdest@entry=0x555f15b70530, qc=qc@entry=0x7fff8481fe10) at pquery.c:1315 #24 0x0000555f13c7a9e1 in PortalRun (portal=portal@entry=0x555f15bdbbc0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x555f15b70530, altdest=altdest@entry=0x555f15b70530, qc=0x7fff8481fe10) at pquery.c:791 #25 0x0000555f13c766bd in exec_simple_query (query_string=0x555f15b6f410 "CREATE TABLE tbl_4(id int);") at postgres.c:1250 #26 0x0000555f13c78282 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4593 #27 0x0000555f13be2b42 in BackendRun (port=0x555f15b930a0, port=0x555f15b930a0) at postmaster.c:4511 #28 BackendStartup (port=0x555f15b930a0) at postmaster.c:4239 #29 ServerLoop () at postmaster.c:1806 #30 0x0000555f13be3beb in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x555f15b69670) at postmaster.c:1478 #31 0x0000555f13908cff in main (argc=3, argv=0x555f15b69670) at main.c:202 (Initially, I'd got this exception when playing with a concurrent installcheck for contrib/test_decoding [1].) With some debug logging added, I see that this failure is a consequence of the following combination of circumstances/issues: First, when a replication slot created, it can just reuse an pre-existing pgstat hash_entry, with increasing refcount. ReplicationSlotCreate() -> pgstat_create_replslot() -> pgstat_get_entry_ref_locked() -> pgstat_get_entry_ref() - > pgstat_reinit_entry(): /* mark as not dropped anymore */ pg_atomic_fetch_add_u32(&shhashent->refcount, 1); shhashent->dropped = false; Later, when that replication slot getting dropped, that hash_entry stays alive (with refcount > 0), but has dropped = true; (So, pg_create_logical_replication_slot/pg_drop_logical_replication_slot can leave the pgstat hash_entry having dropped = true, but refcount > 1.) Second, when another replication slot created, it can reuse the defective pgstat_get_entry_ref from the local cache: ReplicationSlotCreate() -> pgstat_create_replslot() -> pgstat_get_entry_ref_locked() -> pgstat_get_entry_ref(): /* * First check the lookup cache hashtable in local memory. If we find a * match here we can avoid taking locks / causing contention. */ if (pgstat_get_entry_ref_cached(key, &entry_ref)) return entry_ref; (Thus, a backend can get a new replication slot with an existing pgstat hash_entry, that has dropped = true.) Third, when the replication slot with pgstat hash_entry->dropped = true is dropped again, the error "can only drop stats once" is raised in pgstat_drop_entry_internal(): /* * Signal that the entry is dropped - this will eventually cause other * backends to release their references. */ if (shent->dropped) elog(ERROR, "can only drop stats once; hstat: %p", hstat); but in this case dshash_release_lock() is not called, so that leads to the assertion failure, when trying to access pgstat's hash on creating a relation later. [1] https://www.postgresql.org/message-id/7e4d4a80-3e3c-231f-f886-6cada2aa582b%40gmail.com
pgsql-bugs by date: