Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker - Mailing list pgsql-bugs

From	Masahiko Sawada
Subject	Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Date	July 21, 2023 11:01:11
Msg-id	CAD21AoDs7vzK7NErse7xTruqY-FXmM+3K00SdBtMcQhiRNkoeQ@mail.gmail.com
In response to	BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker (PG Bug reporting form <noreply@postgresql.org>)
Responses	Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
List	pgsql-bugs

Hi,

On Thu, Jul 20, 2023 at 12:21 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18031
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> The following script:
> numclients=100
> for ((c=1;c<=$numclients;c++)); do
> createdb db$c
> done
>
> for ((i=1;i<=50;i++)); do
>   echo "ITERATION $i"
>   for ((c=1;c<=$numclients;c++)); do
>     echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
>     echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM
> generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
>     echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g
> \\gexec" | psql db$c >psql3_$c.out 2>&1 &
>   done
>   wait
>   grep -C1 'signal 11' server.log && break;
> done
>
> when executed with the custom settings:
> fsync = off
> max_connections = 1000
> deadlock_timeout = 100ms
> min_parallel_table_scan_size = 1kB
>
> Leads to a server crash:

Thank you for reporting!

I've reproduced the issue in my environment with the provided script.
The process that crashes is actually not a parallel vacuum worker but
a parallel worker rebuilding an index. The scenario seems to be that,
on detecting a deadlock, the process removes itself from the wait
queue via RemoveFromWaitQueue() (called by CheckDeadLock()), and then
RemoveFromWaitQueue() is called again by LockErrorCleanup() while
aborting the transaction. Since commit 5764f611e,
RemoveFromWaitQueue() removes the process from the wait queue using
dclist_delete_from():

    /* Remove proc from lock's wait queue */
    dclist_delete_from(&waitLock->waitProcs, &proc->links);
:
    /* Clean up the proc's own state, and pass it the ok/fail signal */
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    proc->waitStatus = PROC_WAIT_STATUS_ERROR;

However, since dclist_delete_from() doesn't clear proc->links,
dlist_node_is_detached() still returns false in LockErrorCleanup():

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

which leads to RemoveFromWaitQueue() being called a second time. I
think we should use dclist_delete_from_thoroughly() instead. With the
attached patch, the issue no longer occurs in my environment.
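
To illustrate the difference, here is a minimal standalone sketch. It
does not use PostgreSQL's ilist.h; the types and functions (node, list,
list_delete, list_delete_thoroughly, node_is_detached,
remove_twice_guarded) are hypothetical stand-ins that I assume behave
like their dclist/dlist counterparts for the purpose of this scenario.
A plain delete leaves the removed node's own pointers stale, so a later
"is detached" check still says the node is queued:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list (not ilist.h). */
typedef struct node
{
    struct node *prev;
    struct node *next;
} node;

typedef struct list
{
    node        head;           /* sentinel element */
    int         count;          /* number of queued nodes */
} list;

static void
list_init(list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->count = 0;
}

static void
list_push_tail(list *l, node *n)
{
    n->next = &l->head;
    n->prev = l->head.prev;
    n->prev->next = n;
    l->head.prev = n;
    l->count++;
}

/* Plain delete: fixes up the neighbours but leaves n's pointers stale. */
static void
list_delete(list *l, node *n)
{
    assert(l->count > 0);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    l->count--;
}

/* Thorough delete: also resets n's pointers so it reads as detached. */
static void
list_delete_thoroughly(list *l, node *n)
{
    list_delete(l, n);
    n->prev = NULL;
    n->next = NULL;
}

static bool
node_is_detached(const node *n)
{
    return n->next == NULL;
}

/*
 * Models the two code paths: the first removal always happens (deadlock
 * check), the second is attempted only if the node does not look detached
 * (error cleanup). Returns the number of removal attempts; the second
 * attempt is only counted, not performed.
 */
static int
remove_twice_guarded(bool thorough)
{
    list        queue;
    node        waiter = {NULL, NULL};
    int         removals = 0;

    list_init(&queue);
    list_push_tail(&queue, &waiter);

    if (thorough)
        list_delete_thoroughly(&queue, &waiter);
    else
        list_delete(&queue, &waiter);
    removals++;

    if (!node_is_detached(&waiter))
        removals++;             /* cleanup would try to remove it again */

    return removals;
}
```

In this model, remove_twice_guarded(false) reports two removal attempts
and remove_twice_guarded(true) reports one, which is the behaviour the
patch is aiming for.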

Another thing I noticed is that the Assert(waitLock) in
RemoveFromWaitQueue() is actually useless, since we already
dereference waitLock before reaching it:

void
RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
{
    LOCK       *waitLock = proc->waitLock;
    PROCLOCK   *proclock = proc->waitProcLock;
    LOCKMODE    lockmode = proc->waitLockMode;
    LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

    /* Make sure proc is waiting */
    Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
    Assert(proc->links.next != NULL);
    Assert(waitLock);
    Assert(!dclist_is_empty(&waitLock->waitProcs));
    Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));

I think we should fix that as well. The attached patch includes this
fix too.
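
As a rough sketch of the ordering I have in mind (standalone C with
hypothetical demo_lock/demo_proc types and a hypothetical
wait_lock_method() helper, not the actual PostgreSQL structures or the
patch itself), the pointer check should come before the first
dereference:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified types, for illustration only. */
typedef struct demo_lock
{
    int         method_id;
} demo_lock;

typedef struct demo_proc
{
    demo_lock  *wait_lock;
} demo_proc;

static int
wait_lock_method(const demo_proc *proc)
{
    const demo_lock *wait_lock = proc->wait_lock;
    int         method_id;

    /* Check the pointer first ... */
    assert(wait_lock != NULL);

    /* ... and only then dereference it. */
    method_id = wait_lock->method_id;

    return method_id;
}
```

With the check placed after the dereference, a NULL pointer would crash
before the assertion could ever report it.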

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

remove_proc_from_wait_queue_thorougly.patch
