BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker - Mailing list pgsql-bugs

From: PG Bug reporting form
Subject: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Msg-id: 18031-ebe2d08cb405f6cc@postgresql.org
The following bug has been logged on the website:

Bug reference:      18031
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16beta2
Operating system:   Ubuntu 22.04
Description:

The following script:
numclients=100
for ((c=1;c<=$numclients;c++)); do
  createdb db$c
done

for ((i=1;i<=50;i++)); do
  echo "ITERATION $i"
  for ((c=1;c<=$numclients;c++)); do
    echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
    echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
    echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g \\gexec" | psql db$c >psql3_$c.out 2>&1 &
  done
  wait
  grep -C1 'signal 11' server.log && break
done

when executed with the custom settings:
fsync = off
max_connections = 1000
deadlock_timeout = 100ms
min_parallel_table_scan_size = 1kB

leads to a server crash:
...
ITERATION 15
2023-07-19 17:01:46.542 MSK [1449673] STATEMENT:  VACUUM FULL pg_class
2023-07-19 17:01:46.617 MSK [1434014] LOG:  background worker "parallel worker" (PID 1450094) was terminated by signal 11: Segmentation fault
2023-07-19 17:01:46.617 MSK [1434014] DETAIL:  Failed process was running: VACUUM FULL pg_class

With the following stack trace:
Core was generated by `postgres: parallel worker for PID 1449148  waiting'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/1450094' in core file too small.
#0  RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at lock.c:1874
1874            LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at lock.c:1874
#1  0x000055dcfc60f595 in LockErrorCleanup () at proc.c:735
#2  0x000055dcfc618757 in ProcessInterrupts () at postgres.c:3207
#3  0x000055dcfc60fbe1 in ProcSleep (locallock=locallock@entry=0x55dcfe964380, lockMethodTable=lockMethodTable@entry=0x55dcfca92800 <default_lockmethod>) at proc.c:1295
#4  0x000055dcfc5ffe0b in WaitOnLock (locallock=locallock@entry=0x55dcfe964380, owner=owner@entry=0x55dcfe978ed8) at lock.c:1818
#5  0x000055dcfc601ba5 in LockAcquireExtended (locktag=locktag@entry=0x7ffd32e28110, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x0) at lock.c:1082
#6  0x000055dcfc601c67 in LockAcquire (locktag=locktag@entry=0x7ffd32e28110, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false) at lock.c:740
#7  0x000055dcfc5fe85f in XactLockTableWait (xid=xid@entry=43588, rel=rel@entry=0x7fbad92f4348, ctid=ctid@entry=0x55dcfe9f74b4, oper=oper@entry=XLTW_InsertIndexUnique) at lmgr.c:702
#8  0x000055dcfc2720d2 in heapam_index_build_range_scan (heapRelation=0x7fbad92f4348, indexRelation=0x7fbad92f90c8, indexInfo=0x55dcfe9f7138, allow_sync=<optimized out>, anyvisible=false, progress=false, start_blockno=0, numblocks=4294967295, callback=0x55dcfc29aba6 <_bt_build_callback>, callback_state=0x7ffd32e28610, scan=0x55dcfe9f7458) at heapam_handler.c:1496
#9  0x000055dcfc29af07 in table_index_build_scan (scan=<optimized out>, callback_state=0x7ffd32e28610, callback=0x55dcfc29aba6 <_bt_build_callback>, progress=false, allow_sync=true, index_info=0x55dcfe9f7138, index_rel=<optimized out>, table_rel=<optimized out>) at ../../../../src/include/access/tableam.h:1781
#10 _bt_parallel_scan_and_sort (btspool=btspool@entry=0x55dcfe9f7d08, btspool2=btspool2@entry=0x0, btshared=btshared@entry=0x7fbad93122a0, sharedsort=sharedsort@entry=0x7fbad9312240, sharedsort2=sharedsort2@entry=0x0, sortmem=<optimized out>, progress=false) at nbtsort.c:1985
#11 0x000055dcfc29d343 in _bt_parallel_build_main (seg=0x55dcfe979608, toc=0x7fbad9312000) at nbtsort.c:1888
#12 0x000055dcfc2c64a6 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1520
#13 0x000055dcfc56faf0 in StartBackgroundWorker () at bgworker.c:861
#14 0x000055dcfc57631a in do_start_bgworker (rw=rw@entry=0x55dcfe969ad0) at postmaster.c:5762
#15 0x000055dcfc576496 in maybe_start_bgworkers () at postmaster.c:5986
#16 0x000055dcfc576dbf in process_pm_pmsignal () at postmaster.c:5149
#17 0x000055dcfc57736b in ServerLoop () at postmaster.c:1770
#18 0x000055dcfc5788ef in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x55dcfe938620) at postmaster.c:1463
#19 0x000055dcfc4964f5 in main (argc=3, argv=0x55dcfe938620) at main.c:198

(I also got another stack trace:
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x7f763dde44c0, hashcode=1514899366) at lock.c:1874
#1  0x0000559984e4e595 in LockErrorCleanup () at proc.c:735
#2  0x0000559984b125cc in AbortTransaction () at xact.c:2735
#3  0x0000559984b1307d in AbortCurrentTransaction () at xact.c:3338
#4  0x0000559984e5b68a in PostgresMain (dbname=<optimized out>, username=0x55998640c6d8 "law") at postgres.c:4354
#5  0x0000559984db3183 in BackendRun (port=port@entry=0x559986445a60) at postmaster.c:4461
#6  0x0000559984db623a in BackendStartup (port=port@entry=0x559986445a60) at postmaster.c:4189
#7  0x0000559984db63d8 in ServerLoop () at postmaster.c:1779
#8  0x0000559984db78ef in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x55998640a620) at postmaster.c:1463
#9  0x0000559984cd54f5 in main (argc=3, argv=0x55998640a620) at main.c:198
)

server.log also contains:
2023-07-19 17:01:46.542 MSK [1449148] ERROR:  deadlock detected
2023-07-19 17:01:46.542 MSK [1449148] DETAIL:  Process 1449148 waits for ShareLock on transaction 43588; blocked by process 1449150.
    Process 1449150 waits for RowExclusiveLock on relation 1259 of database 16385; blocked by process 1449148.
    Process 1449148: VACUUM FULL pg_class
    Process 1449150: CREATE TABLE t2_2_15 (t TEXT);
2023-07-19 17:01:46.542 MSK [1449148] HINT:  See server log for query details.
2023-07-19 17:01:46.542 MSK [1449148] CONTEXT:  while checking uniqueness of tuple (28,39) in relation "pg_class"
2023-07-19 17:01:46.542 MSK [1449148] STATEMENT:  VACUUM FULL pg_class

It looks like in LockErrorCleanup():
    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }
MyProc->links happened to be not detached while MyProc->waitLock was already NULL, so RemoveFromWaitQueue() dereferenced that NULL pointer at lock.c:1874 (LOCK_LOCKMETHOD(*waitLock)).
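
Just to make the failure mode explicit, here is a purely illustrative, defensive variant of that branch (a sketch only, not a proposed or actual fix; the real question is how the proc can still be linked into a wait queue while waitLock has already been cleared):

    if (!dlist_node_is_detached(&MyProc->links))
    {
        if (MyProc->waitLock != NULL)
        {
            /* Still queued behind a lock: dequeue ourselves as before. */
            RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
        }
        else
        {
            /*
             * Inconsistent state seen in this report: links not detached,
             * but waitLock already NULL, so RemoveFromWaitQueue() would
             * crash on LOCK_LOCKMETHOD(*waitLock) at lock.c:1874.
             */
            elog(PANIC, "proc still linked in a lock wait queue but waitLock is NULL");
        }
    }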

`git bisect` for this anomaly points at 5764f611e.

