Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker - Mailing list pgsql-bugs
From | Masahiko Sawada |
---|---|
Subject | Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker |
Date | |
Msg-id | CAD21AoDs7vzK7NErse7xTruqY-FXmM+3K00SdBtMcQhiRNkoeQ@mail.gmail.com |
In response to | BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker (PG Bug reporting form <noreply@postgresql.org>) |
Responses |
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker |
List | pgsql-bugs |
Hi,

On Thu, Jul 20, 2023 at 12:21 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 18031
> Logged by: Alexander Lakhin
> Email address: exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system: Ubuntu 22.04
> Description:
>
> The following script:
> numclients=100
> for ((c=1;c<=$numclients;c++)); do
>   createdb db$c
> done
>
> for ((i=1;i<=50;i++)); do
>   echo "ITERATION $i"
>   for ((c=1;c<=$numclients;c++)); do
>     echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
>     echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
>     echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g
> \\gexec" | psql db$c >psql3_$c.out 2>&1 &
>   done
>   wait
>   grep -C1 'signal 11' server.log && break;
> done
>
> when executed with the custom settings:
> fsync = off
> max_connections = 1000
> deadlock_timeout = 100ms
> min_parallel_table_scan_size = 1kB
>
> Leads to a server crash:

Thank you for reporting!

I've reproduced the issue in my environment with the provided script.
The crashed process is not a parallel vacuum worker actually but a
parallel worker for rebuilding the index. The scenario seems that when
detecting a deadlock, the process removes itself from the wait queue
by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
RemoveFromWaitQueue() is called again by LockErrorCleanup() while
aborting the transaction.

With commit 5764f611e, in RemoveFromWaitQueue() we remove the process
from the wait queue using dclist_delete_from():

    /* Remove proc from lock's wait queue */
    dclist_delete_from(&waitLock->waitProcs, &proc->links);
    :
    /* Clean up the proc's own state, and pass it the ok/fail signal */
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    proc->waitStatus = PROC_WAIT_STATUS_ERROR;

However, since dclist_delete_from() doesn't clear proc->links, in
LockErrorCleanup(), dlist_node_is_detached() still returns false:

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

leading to calling RemoveFromWaitQueue() again. I think we should use
dclist_delete_from_thoroughly() instead. With the attached patch, the
issue doesn't happen in my environment.

Another thing I noticed is that the Assert(waitLock) in
RemoveFromWaitQueue() is useless actually, since we access *waitLock
before that:

    void
    RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
    {
        LOCK       *waitLock = proc->waitLock;
        PROCLOCK   *proclock = proc->waitProcLock;
        LOCKMODE    lockmode = proc->waitLockMode;
        LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

        /* Make sure proc is waiting */
        Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
        Assert(proc->links.next != NULL);
        Assert(waitLock);
        Assert(!dclist_is_empty(&waitLock->waitProcs));
        Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

I think we should fix it as well. This fix is also included in the
attached patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
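
The double-removal path described in the message can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the types and functions below (node, list, list_delete, list_delete_thoroughly, node_is_detached, removals_attempted) are hypothetical stand-ins written for this example, not the PostgreSQL ilist.h code, and the second removal is only counted rather than performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical circular doubly linked list; NOT PostgreSQL's ilist.h. */
typedef struct node { struct node *prev, *next; } node;
typedef struct list { node head; size_t count; } list;

static void list_init(list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->count = 0;
}

static void list_push_tail(list *l, node *n)
{
    n->next = &l->head;
    n->prev = l->head.prev;
    n->prev->next = n;
    l->head.prev = n;
    l->count++;
}

/* Unlinks n, but leaves n->prev and n->next pointing at old neighbours. */
static void list_delete(list *l, node *n)
{
    assert(l->count > 0);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    l->count--;
}

/* Unlinks n and also clears its pointers, so it reads as detached. */
static void list_delete_thoroughly(list *l, node *n)
{
    list_delete(l, n);
    n->prev = n->next = NULL;
}

static bool node_is_detached(const node *n)
{
    return n->next == NULL;
}

/*
 * Models the two steps from the message: a removal when the deadlock is
 * detected, then an abort-time cleanup that removes again unless the node
 * reports as detached. Returns how many removals would be attempted.
 */
static int removals_attempted(bool thorough)
{
    list queue;
    node waiter = {NULL, NULL};
    node other = {NULL, NULL};
    int calls = 0;

    list_init(&queue);
    list_push_tail(&queue, &waiter);
    list_push_tail(&queue, &other);

    /* Step 1: removal on deadlock detection. */
    if (thorough)
        list_delete_thoroughly(&queue, &waiter);
    else
        list_delete(&queue, &waiter);
    calls++;

    /* Step 2: cleanup re-checks the node; stale pointers look "attached". */
    if (!node_is_detached(&waiter))
        calls++;            /* counted only; a real second unlink is the bug */

    return calls;
}
```

In this model, removals_attempted(false) returns 2 because the stale next pointer makes the node look as if it were still queued, while removals_attempted(true) returns 1.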
Attachment