Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Msg-id CAD21AoDs7vzK7NErse7xTruqY-FXmM+3K00SdBtMcQhiRNkoeQ@mail.gmail.com
In response to BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
Hi,

On Thu, Jul 20, 2023 at 12:21 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18031
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> The following script:
> numclients=100
> for ((c=1;c<=$numclients;c++)); do
> createdb db$c
> done
>
> for ((i=1;i<=50;i++)); do
>   echo "ITERATION $i"
>   for ((c=1;c<=$numclients;c++)); do
>     echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
>     echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
>     echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g
> \\gexec" | psql db$c >psql3_$c.out 2>&1 &
>   done
>   wait
>   grep -C1 'signal 11' server.log && break;
> done
>
> when executed with the custom settings:
> fsync = off
> max_connections = 1000
> deadlock_timeout = 100ms
> min_parallel_table_scan_size = 1kB
>
> Leads to a server crash:
Thank you for reporting!

I've reproduced the issue in my environment with the provided script.
The crashed process is actually not a parallel vacuum worker but a
parallel worker for rebuilding the index. The scenario seems to be as
follows: when a deadlock is detected, the process removes itself from
the wait queue via RemoveFromWaitQueue() (called by CheckDeadLock()),
and then RemoveFromWaitQueue() is called again by LockErrorCleanup()
while aborting the transaction. Since commit 5764f611e,
RemoveFromWaitQueue() removes the process from the wait queue using
dclist_delete_from():

    /* Remove proc from lock's wait queue */
    dclist_delete_from(&waitLock->waitProcs, &proc->links);
:
    /* Clean up the proc's own state, and pass it the ok/fail signal */
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    proc->waitStatus = PROC_WAIT_STATUS_ERROR;

However, since dclist_delete_from() doesn't clear proc->links,
dlist_node_is_detached() still returns false in LockErrorCleanup():

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

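so RemoveFromWaitQueue() is called a second time. By then
proc->waitLock has already been set to NULL by the first call, so
dereferencing it crashes. I think we should use
dclist_delete_from_thoroughly() instead. With the attached patch, the
issue no longer occurs in my environment.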
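
To illustrate the difference between the two delete variants, here is a
minimal standalone sketch. It is a simplified model and not the actual
ilist.h code: the names node, list_init(), list_push_tail(),
list_delete(), list_delete_thoroughly() and node_is_detached() are
hypothetical, and the "detached" test is assumed to be a NULL check on
the node's next pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified circular doubly linked list node (hypothetical names). */
typedef struct node
{
    struct node *prev;
    struct node *next;
} node;

/* Initialize an empty circular list: the head points at itself. */
static void
list_init(node *head)
{
    head->prev = head->next = head;
}

/* Append n at the tail of the list headed by head. */
static void
list_push_tail(node *head, node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/* Plain delete: unlink n, but leave n's own pointers untouched (stale). */
static void
list_delete(node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Thorough delete: unlink n and also reset its pointers to NULL. */
static void
list_delete_thoroughly(node *n)
{
    list_delete(n);
    n->prev = n->next = NULL;
}

/* A node counts as detached only once its pointers have been cleared. */
static bool
node_is_detached(const node *n)
{
    return n->next == NULL;
}
```

In this model, a node removed with list_delete() still looks attached
to a later node_is_detached() check, which is the same shape as the
second RemoveFromWaitQueue() call described above. A node removed with
list_delete_thoroughly() is reported as detached.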

Another thing I noticed is that the Assert(waitLock) in
RemoveFromWaitQueue() is actually useless, since we dereference
waitLock before reaching it:

void
RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
{
    LOCK       *waitLock = proc->waitLock;
    PROCLOCK   *proclock = proc->waitProcLock;
    LOCKMODE    lockmode = proc->waitLockMode;
    LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

    /* Make sure proc is waiting */
    Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
    Assert(proc->links.next != NULL);
    Assert(waitLock);
    Assert(!dclist_is_empty(&waitLock->waitProcs));
    Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

I think we should fix it as well. This fix is also included in the
attached patch.
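
As a rough standalone illustration of the ordering problem (not the
attached patch itself, which may be shaped differently), the sketch
below checks the pointer before the first dereference. The type
demo_lock and the function demo_lock_method() are hypothetical
stand-ins, and the standard assert() is used in place of Assert().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock object; not PostgreSQL's LOCK struct. */
typedef struct demo_lock
{
    int method_id;
} demo_lock;

/*
 * Validate the pointer first, then dereference it. If the dereference
 * came first (for example in a variable initializer), a NULL pointer
 * would crash before the assertion could report anything.
 */
static int
demo_lock_method(const demo_lock *lock)
{
    int method_id;

    assert(lock != NULL);           /* check before any dereference */
    method_id = lock->method_id;    /* safe: lock is known to be non-NULL */
    assert(method_id > 0);          /* sanity check on the value read */

    return method_id;
}
```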

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
