RE: subscription/015_stream sometimes breaks - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: subscription/015_stream sometimes breaks
Date
Msg-id OS0PR01MB5716A340DFCA5042CCAA5F7B941CA@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to RE: subscription/015_stream sometimes breaks  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Responses Re: subscription/015_stream sometimes breaks
List pgsql-hackers

> -----Original Message-----
> From: Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> Sent: Wednesday, August 23, 2023 10:27 AM
> To: Thomas Munro <thomas.munro@gmail.com>
> Cc: Amit Kapila <amit.kapila16@gmail.com>; pgsql-hackers
> <pgsql-hackers@postgresql.org>
> Subject: RE: subscription/015_stream sometimes breaks
> 
> On Wednesday, August 23, 2023 4:55 AM Thomas Munro
> <thomas.munro@gmail.com> wrote:
> >
> > On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
> <thomas.munro@gmail.com>
> > wrote:
> > > I didn't study it closely but it looks like there might be a second
> > > deadlock, after the one that is expected by the test?  Examples from
> > > the past couple of weeks:
> >
> > I should add, it's not correlated with the patches that cfbot is
> > testing, and it's the most frequent failure for which that is the case.
> >
> >     suite     |    name    | distinct_patches | errors
> > --------------+------------+------------------+--------
> >  subscription | 015_stream |               47 |     61
> >
> 
> Thanks for reporting !
> I am researching the failure and will share my analysis.

Hi,

After an off-list discussion with Amit, we figured out the reason.
From the crash log, I can see the apply worker crashed when accessing the
worker->proc, so I think it's because the work->proc has been released.

   577:     /* Now terminate the worker ... */
>  578:     kill(worker->proc->pid, signo);
   579: 
   580:     /* ... and wait for it to die. */

Normally, this should not happen because we take a lock on LogicalRepWorkerLock
when shutting all the parallel workers[1] which can prevent concurrent worker
to free the worker info. But in logicalrep_worker_stop_internal(), when
stopping parallel worker #1, we will release the lock shortly. and at this
timing it's possible that another parallel worker #2 which reported an ERROR
will shutdown by itself and free the worker->proc. So when we try to stop that
parallel worker #2 in next round, we didn't realize it has been closed,
thus accessing invalid memory(worker->proc).

[1]--
        LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

        workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
        foreach(lc, workers)
        {
            LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

**            if (isParallelApplyWorker(w))
                logicalrep_worker_stop_internal(w, SIGTERM);
        }
--

The bug happens after commit 2a8b40e where isParallelApplyWorker() start to use
the worker->type to check but we forgot to reset the worker type at worker exit
time. So, even if the worker #2 has shutdown, the worker_type is still valid
and we try to stop it again.

Previously, the isParallelApplyWorker() used the worker->leader_pid which will
be reset when the worker exits, so the "if (isParallelApplyWorker(w))" won't pass
in this case and we don't try to stop the worker #2.

To fix it I think we need to reset the worker type at exit as well.
Attach the patch which does the same. I am also testing it locally
to see if there are other issues here.

Best Regards,
Hou Zhijie


Attachment

pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: Synchronizing slots from primary to standby
Next
From: Andres Freund
Date:
Subject: Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?