Thread: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18031
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16beta2
Operating system:   Ubuntu 22.04
Description:

The following script:

numclients=100
for ((c=1;c<=$numclients;c++)); do
  createdb db$c
done

for ((i=1;i<=50;i++)); do
  echo "ITERATION $i"
  for ((c=1;c<=$numclients;c++)); do
    echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
    echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
    echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g \\gexec" | psql db$c >psql3_$c.out 2>&1 &
  done
  wait
  grep -C1 'signal 11' server.log && break;
done

when executed with the custom settings:

fsync = off
max_connections = 1000
deadlock_timeout = 100ms
min_parallel_table_scan_size = 1kB

leads to a server crash:
...
ITERATION 15
2023-07-19 17:01:46.542 MSK [1449673] STATEMENT:  VACUUM FULL pg_class
2023-07-19 17:01:46.617 MSK [1434014] LOG:  background worker "parallel worker" (PID 1450094) was terminated by signal 11: Segmentation fault
2023-07-19 17:01:46.617 MSK [1434014] DETAIL:  Failed process was running: VACUUM FULL pg_class

with the following stack trace:

Core was generated by `postgres: parallel worker for PID 1449148 waiting'.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Section `.reg-xstate/1450094' in core file too small.
#0  RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at lock.c:1874
1874            LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at lock.c:1874
#1  0x000055dcfc60f595 in LockErrorCleanup () at proc.c:735
#2  0x000055dcfc618757 in ProcessInterrupts () at postgres.c:3207
#3  0x000055dcfc60fbe1 in ProcSleep (locallock=locallock@entry=0x55dcfe964380, lockMethodTable=lockMethodTable@entry=0x55dcfca92800 <default_lockmethod>) at proc.c:1295
#4  0x000055dcfc5ffe0b in WaitOnLock (locallock=locallock@entry=0x55dcfe964380, owner=owner@entry=0x55dcfe978ed8) at lock.c:1818
#5  0x000055dcfc601ba5 in LockAcquireExtended (locktag=locktag@entry=0x7ffd32e28110, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x0) at lock.c:1082
#6  0x000055dcfc601c67 in LockAcquire (locktag=locktag@entry=0x7ffd32e28110, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false) at lock.c:740
#7  0x000055dcfc5fe85f in XactLockTableWait (xid=xid@entry=43588, rel=rel@entry=0x7fbad92f4348, ctid=ctid@entry=0x55dcfe9f74b4, oper=oper@entry=XLTW_InsertIndexUnique) at lmgr.c:702
#8  0x000055dcfc2720d2 in heapam_index_build_range_scan (heapRelation=0x7fbad92f4348, indexRelation=0x7fbad92f90c8, indexInfo=0x55dcfe9f7138, allow_sync=<optimized out>, anyvisible=false, progress=false, start_blockno=0, numblocks=4294967295, callback=0x55dcfc29aba6 <_bt_build_callback>, callback_state=0x7ffd32e28610, scan=0x55dcfe9f7458) at heapam_handler.c:1496
#9  0x000055dcfc29af07 in table_index_build_scan (scan=<optimized out>, callback_state=0x7ffd32e28610, callback=0x55dcfc29aba6 <_bt_build_callback>, progress=false, allow_sync=true, index_info=0x55dcfe9f7138, index_rel=<optimized out>, table_rel=<optimized out>) at ../../../../src/include/access/tableam.h:1781
#10 _bt_parallel_scan_and_sort (btspool=btspool@entry=0x55dcfe9f7d08, btspool2=btspool2@entry=0x0, btshared=btshared@entry=0x7fbad93122a0, sharedsort=sharedsort@entry=0x7fbad9312240, sharedsort2=sharedsort2@entry=0x0, sortmem=<optimized out>, progress=false) at nbtsort.c:1985
#11 0x000055dcfc29d343 in _bt_parallel_build_main (seg=0x55dcfe979608, toc=0x7fbad9312000) at nbtsort.c:1888
#12 0x000055dcfc2c64a6 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1520
#13 0x000055dcfc56faf0 in StartBackgroundWorker () at bgworker.c:861
#14 0x000055dcfc57631a in do_start_bgworker (rw=rw@entry=0x55dcfe969ad0) at postmaster.c:5762
#15 0x000055dcfc576496 in maybe_start_bgworkers () at postmaster.c:5986
#16 0x000055dcfc576dbf in process_pm_pmsignal () at postmaster.c:5149
#17 0x000055dcfc57736b in ServerLoop () at postmaster.c:1770
#18 0x000055dcfc5788ef in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x55dcfe938620) at postmaster.c:1463
#19 0x000055dcfc4964f5 in main (argc=3, argv=0x55dcfe938620) at main.c:198

(I also got another stack trace:
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x7f763dde44c0, hashcode=1514899366) at lock.c:1874
#1  0x0000559984e4e595 in LockErrorCleanup () at proc.c:735
#2  0x0000559984b125cc in AbortTransaction () at xact.c:2735
#3  0x0000559984b1307d in AbortCurrentTransaction () at xact.c:3338
#4  0x0000559984e5b68a in PostgresMain (dbname=<optimized out>, username=0x55998640c6d8 "law") at postgres.c:4354
#5  0x0000559984db3183 in BackendRun (port=port@entry=0x559986445a60) at postmaster.c:4461
#6  0x0000559984db623a in BackendStartup (port=port@entry=0x559986445a60) at postmaster.c:4189
#7  0x0000559984db63d8 in ServerLoop () at postmaster.c:1779
#8  0x0000559984db78ef in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x55998640a620) at postmaster.c:1463
#9  0x0000559984cd54f5 in main (argc=3, argv=0x55998640a620) at main.c:198
)

server.log also contains:

2023-07-19 17:01:46.542 MSK [1449148] ERROR:  deadlock detected
2023-07-19 17:01:46.542 MSK [1449148] DETAIL:  Process 1449148 waits for ShareLock on transaction 43588; blocked by process 1449150.
        Process 1449150 waits for RowExclusiveLock on relation 1259 of database 16385; blocked by process 1449148.
        Process 1449148: VACUUM FULL pg_class
        Process 1449150: CREATE TABLE t2_2_15 (t TEXT);
2023-07-19 17:01:46.542 MSK [1449148] HINT:  See server log for query details.
2023-07-19 17:01:46.542 MSK [1449148] CONTEXT:  while checking uniqueness of tuple (28,39) in relation "pg_class"
2023-07-19 17:01:46.542 MSK [1449148] STATEMENT:  VACUUM FULL pg_class

It looks like in LockErrorCleanup():

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

MyProc->links happened to be not detached, but MyProc->waitLock is NULL.

`git bisect` for this anomaly points at 5764f611e.
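To make the failure mode concrete, here is a minimal standalone model of the double-removal path described in the report (an illustration only: the structs and function below are simplified stand-ins that merely echo the names in proc.c/lock.c, not the PostgreSQL sources):

    #include <stddef.h>

    /* Simplified stand-ins for PostgreSQL's LOCK and PGPROC */
    typedef struct LOCK
    {
        int         lockmethodid;
    } LOCK;

    typedef struct PGPROC
    {
        void       *links_next;     /* stand-in for the dlist_node "links" */
        LOCK       *waitLock;
    } PGPROC;

    /* Echoes RemoveFromWaitQueue(): dereferences waitLock before any check */
    static void
    remove_from_wait_queue(PGPROC *proc)
    {
        LOCK       *waitLock = proc->waitLock;
        int         lockmethodid = waitLock->lockmethodid; /* 2nd call: waitLock is NULL */

        (void) lockmethodid;
        /* unlink proc from the queue, but (pre-fix) leave links_next untouched */
        proc->waitLock = NULL;
    }

    int
    main(void)
    {
        LOCK        lock = {1};
        PGPROC      proc = {&lock, &lock};

        /* CheckDeadLock() path: first removal succeeds and clears waitLock */
        remove_from_wait_queue(&proc);

        /* LockErrorCleanup() path: links_next was never reset, so the
         * "are we still queued?" test still passes and we call it again */
        if (proc.links_next != NULL)
            remove_from_wait_queue(&proc);  /* NULL dereference, i.e. SIGSEGV */

        return 0;
    }

Compiled and run as-is, the second call dereferences the NULL waitLock, which matches the fault the backtrace shows at lock.c:1874.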
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: Masahiko Sawada
Hi,

On Thu, Jul 20, 2023 at 12:21 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18031
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> The following script:
> numclients=100
> for ((c=1;c<=$numclients;c++)); do
>   createdb db$c
> done
>
> for ((i=1;i<=50;i++)); do
>   echo "ITERATION $i"
>   for ((c=1;c<=$numclients;c++)); do
>     echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
>     echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
>     echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g \\gexec" | psql db$c >psql3_$c.out 2>&1 &
>   done
>   wait
>   grep -C1 'signal 11' server.log && break;
> done
>
> when executed with the custom settings:
> fsync = off
> max_connections = 1000
> deadlock_timeout = 100ms
> min_parallel_table_scan_size = 1kB
>
> leads to a server crash:

Thank you for reporting! I've reproduced the issue in my environment with the provided script. The crashed process is actually not a parallel vacuum worker but a parallel worker for rebuilding the index. The scenario seems to be that when detecting a deadlock, the process removes itself from the wait queue by RemoveFromWaitQueue() (called by CheckDeadLock()), and then RemoveFromWaitQueue() is called again by LockErrorCleanup() while aborting the transaction. With commit 5764f611e, in RemoveFromWaitQueue() we remove the process from the wait queue using dclist_delete_from():

    /* Remove proc from lock's wait queue */
    dclist_delete_from(&waitLock->waitProcs, &proc->links);
    :
    /* Clean up the proc's own state, and pass it the ok/fail signal */
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    proc->waitStatus = PROC_WAIT_STATUS_ERROR;

However, since dclist_delete_from() doesn't clear proc->links, in LockErrorCleanup(), dlist_node_is_detached() still returns false:

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

leading to calling RemoveFromWaitQueue() again. I think we should use dclist_delete_from_thoroughly() instead. With the attached patch, the issue doesn't happen in my environment.

Another thing I noticed is that the Assert(waitLock) in RemoveFromWaitQueue() is actually useless, since we access *waitLock before that:

    void
    RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
    {
        LOCK       *waitLock = proc->waitLock;
        PROCLOCK   *proclock = proc->waitProcLock;
        LOCKMODE    lockmode = proc->waitLockMode;
        LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

        /* Make sure proc is waiting */
        Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
        Assert(proc->links.next != NULL);
        Assert(waitLock);
        Assert(!dclist_is_empty(&waitLock->waitProcs));
        Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

I think we should fix it as well. This fix is also included in the attached patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
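As background for the proposed fix, the contract difference between the two delete variants can be sketched like this (a simplified stand-in for the ilist.h API, assuming only that the "thorough" variant additionally resets the node's own pointers; not the real implementation):

    #include <stdio.h>
    #include <stddef.h>

    /* Minimal stand-in for PostgreSQL's dlist_node (see src/include/lib/ilist.h) */
    typedef struct node_t
    {
        struct node_t *prev;
        struct node_t *next;
    } node_t;

    /* Like dclist_delete_from(): unlink, but leave the node's own pointers alone */
    static void
    delete_from(node_t *node)
    {
        node->prev->next = node->next;
        node->next->prev = node->prev;
    }

    /* Like dclist_delete_from_thoroughly(): unlink AND reset the node's pointers */
    static void
    delete_from_thoroughly(node_t *node)
    {
        delete_from(node);
        node->prev = NULL;
        node->next = NULL;
    }

    /* Like dlist_node_is_detached(): only meaningful after a "thorough" delete */
    static int
    is_detached(const node_t *node)
    {
        return node->next == NULL;
    }

    int
    main(void)
    {
        node_t      head, a;

        /* build a one-element circular list: head <-> a */
        head.next = &a;  head.prev = &a;
        a.next = &head;  a.prev = &head;

        delete_from(&a);
        printf("after delete_from: detached? %d\n", is_detached(&a));               /* 0 */

        /* re-link and delete "thoroughly" this time */
        head.next = &a;  head.prev = &a;
        a.next = &head;  a.prev = &head;

        delete_from_thoroughly(&a);
        printf("after delete_from_thoroughly: detached? %d\n", is_detached(&a));    /* 1 */

        return 0;
    }

LockErrorCleanup()'s dlist_node_is_detached() test only gives the right answer after the "thorough" form, which is why switching RemoveFromWaitQueue() to dclist_delete_from_thoroughly() prevents the second removal.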
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: Alexander Lakhin
Hello Sawada-san,

21.07.2023 11:01, Masahiko Sawada wrote:
> ...
> leading to calling RemoveFromWaitQueue() again. I think we should use
> dclist_delete_from_thoroughly() instead. With the attached patch, the
> issue doesn't happen in my environment.

Thank you for the analysis and the patch! It fixes the issue for me too.

> Another thing I noticed is that the Assert(waitLock) in
> RemoveFromWaitQueue() is actually useless, since we access *waitLock
> before that:

Yeah, I also noticed that, and a simple grep-based script with a little eyeballing discovered one similar Assert usage:

    void
    llvm_split_symbol_name(const char *name, char **modname, char **funcname)
    {
    ...
        *funcname = rindex(name, '.');
        (*funcname)++;              /* jump over . */

        *modname = pnstrdup(name + strlen("pgextern."),
                            *funcname - name - strlen("pgextern.") - 1);
        Assert(funcname);

Maybe it's worth fixing it in passing.

Best regards,
Alexander
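Presumably the check there was meant to apply to the result of the search rather than to the out parameter, i.e. something along these lines (a guess at the intended assertion, not a committed fix):

    *funcname = rindex(name, '.');
    Assert(*funcname != NULL);      /* assert the '.' was found, before stepping past it */
    (*funcname)++;                  /* jump over . */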
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: Masahiko Sawada
On Sat, Jul 22, 2023 at 7:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Sawada-san,
>
> 21.07.2023 11:01, Masahiko Sawada wrote:
> > ...
> > leading to calling RemoveFromWaitQueue() again. I think we should use
> > dclist_delete_from_thoroughly() instead. With the attached patch, the
> > issue doesn't happen in my environment.
>
> Thank you for the analysis and the patch! It fixes the issue for me too.

Thank you for the confirmation. I found that this issue is the same as one that has already been reported[1].

> > Another thing I noticed is that the Assert(waitLock) in
> > RemoveFromWaitQueue() is actually useless, since we access *waitLock
> > before that:
>
> Yeah, I also noticed that, and a simple grep-based script with a little
> eyeballing discovered one similar Assert usage:
>
>     void
>     llvm_split_symbol_name(const char *name, char **modname, char **funcname)
>     {
>     ...
>         *funcname = rindex(name, '.');
>         (*funcname)++;              /* jump over . */
>
>         *modname = pnstrdup(name + strlen("pgextern."),
>                             *funcname - name - strlen("pgextern.") - 1);
>         Assert(funcname);
>
> Maybe it's worth fixing it in passing.

Yeah, but it should be discussed on another thread.

[1] https://www.postgresql.org/message-id/ZKy4AdrLEfbqrxGJ%40telsasoft.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: Andres Freund
Hi,

Thanks for finding / debugging this issue!

On 2023-07-21 17:01:11 +0900, Masahiko Sawada wrote:
> I've reproduced the issue in my environment with the provided script.
> The crashed process is actually not a parallel vacuum worker but a
> parallel worker for rebuilding the index. The scenario seems to be that
> when detecting a deadlock, the process removes itself from the wait queue
> by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
> RemoveFromWaitQueue() is called again by LockErrorCleanup() while
> aborting the transaction. With commit 5764f611e, in
> RemoveFromWaitQueue() we remove the process from the wait queue using
> dclist_delete_from():
>
>     /* Remove proc from lock's wait queue */
>     dclist_delete_from(&waitLock->waitProcs, &proc->links);
>     :
>     /* Clean up the proc's own state, and pass it the ok/fail signal */
>     proc->waitLock = NULL;
>     proc->waitProcLock = NULL;
>     proc->waitStatus = PROC_WAIT_STATUS_ERROR;
>
> However, since dclist_delete_from() doesn't clear proc->links, in
> LockErrorCleanup(), dlist_node_is_detached() still returns false:
>
>     if (!dlist_node_is_detached(&MyProc->links))
>     {
>         /* We could not have been granted the lock yet */
>         RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
>     }

Indeed :(

> leading to calling RemoveFromWaitQueue() again. I think we should use
> dclist_delete_from_thoroughly() instead. With the attached patch, the
> issue doesn't happen in my environment.

Yep. Do you want to push that fix, or should I?

> Another thing I noticed is that the Assert(waitLock) in
> RemoveFromWaitQueue() is actually useless, since we access *waitLock
> before that:
>
> I think we should fix it as well. This fix is also included in the
> attached patch.

Don't really have an opinion on that. It's been this way for longer, afaict.

Greetings,

Andres Freund
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: Masahiko Sawada
Hi,

On Tue, Jul 25, 2023 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Thanks for finding / debugging this issue!
>
> On 2023-07-21 17:01:11 +0900, Masahiko Sawada wrote:
> > I've reproduced the issue in my environment with the provided script.
> > The crashed process is actually not a parallel vacuum worker but a
> > parallel worker for rebuilding the index. The scenario seems to be that
> > when detecting a deadlock, the process removes itself from the wait queue
> > by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
> > RemoveFromWaitQueue() is called again by LockErrorCleanup() while
> > aborting the transaction. With commit 5764f611e, in
> > RemoveFromWaitQueue() we remove the process from the wait queue using
> > dclist_delete_from():
> >
> >     /* Remove proc from lock's wait queue */
> >     dclist_delete_from(&waitLock->waitProcs, &proc->links);
> >     :
> >     /* Clean up the proc's own state, and pass it the ok/fail signal */
> >     proc->waitLock = NULL;
> >     proc->waitProcLock = NULL;
> >     proc->waitStatus = PROC_WAIT_STATUS_ERROR;
> >
> > However, since dclist_delete_from() doesn't clear proc->links, in
> > LockErrorCleanup(), dlist_node_is_detached() still returns false:
> >
> >     if (!dlist_node_is_detached(&MyProc->links))
> >     {
> >         /* We could not have been granted the lock yet */
> >         RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
> >     }
>
> Indeed :(
>
> > leading to calling RemoveFromWaitQueue() again. I think we should use
> > dclist_delete_from_thoroughly() instead. With the attached patch, the
> > issue doesn't happen in my environment.
>
> Yep. Do you want to push that fix, or should I?

Thank you for your confirmation. I can push the fix if you're okay. I've attached the patch.

> > Another thing I noticed is that the Assert(waitLock) in
> > RemoveFromWaitQueue() is actually useless, since we access *waitLock
> > before that:
> >
> > I think we should fix it as well. This fix is also included in the
> > attached patch.
>
> Don't really have an opinion on that. It's been this way for longer, afaict.

True. I would leave this part alone. We can fix it in a separate commit later if necessary.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
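For readers who don't open the attachment: based on the discussion upthread, the substance of the patch is presumably the following one-line change in RemoveFromWaitQueue() (a reconstruction from the thread, not the committed diff):

    /* Remove proc from lock's wait queue */
    dclist_delete_from_thoroughly(&waitLock->waitProcs, &proc->links);  /* previously dclist_delete_from() */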
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: Andres Freund
On 2023-07-25 16:35:54 +0900, Masahiko Sawada wrote:
> Thank you for your confirmation. I can push the fix if you're okay.
> I've attached the patch.

LGTM. Thanks!
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
From: Masahiko Sawada
On Wed, Jul 26, 2023 at 1:27 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-07-25 16:35:54 +0900, Masahiko Sawada wrote:
> > Thank you for your confirmation. I can push the fix if you're okay.
> > I've attached the patch.
>
> LGTM.

Thank you for reviewing it! Pushed, and marked as resolved on the open items page.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com