Thread: subscription/015_stream sometimes breaks

subscription/015_stream sometimes breaks

From: Thomas Munro

Hi,

A couple of times a day, cfbot reports an error like this:

https://cirrus-ci.com/task/6424286882168832

I didn't study it closely but it looks like there might be a second
deadlock, after the one that is expected by the test?  Examples from
the past couple of weeks:

cfbot=> select test.task_id, task.task_name, task.created as time
        from test join task using (task_id)
        where suite = 'subscription' and name = '015_stream'
          and result = 'ERROR'
          and task.created > now() - interval '14 days'
        order by task.created desc;
     task_id      |                   task_name
------------------+------------------------------------------------
 6600867550330880 | Windows - Server 2019, VS 2019 - Meson & ninja
 6222795470798848 | Windows - Server 2019, VS 2019 - Meson & ninja
 6162088322662400 | macOS - Ventura - Meson
 5014781862608896 | Windows - Server 2019, VS 2019 - Meson & ninja
 6424286882168832 | Windows - Server 2019, VS 2019 - Meson & ninja
 6683705222103040 | Windows - Server 2019, VS 2019 - Meson & ninja
 5881665076068352 | Windows - Server 2019, VS 2019 - Meson & ninja
 5865929054093312 | Windows - Server 2019, VS 2019 - Meson & ninja
 5855144995192832 | Windows - Server 2019, VS 2019 - Meson & ninja
 6071567994585088 | Windows - Server 2019, VS 2019 - Meson & ninja
 5311312343859200 | Windows - Server 2019, VS 2019 - Meson & ninja
 4986100071006208 | FreeBSD - 13 - Meson
 6302301388800000 | Windows - Server 2019, VS 2019 - Meson & ninja
 4554627119579136 | Windows - Server 2019, VS 2019 - Meson & ninja
 6106090807492608 | Windows - Server 2019, VS 2019 - Meson & ninja
 5190113534148608 | Windows - Server 2019, VS 2019 - Meson & ninja
 6452324697112576 | Windows - Server 2019, VS 2019 - Meson & ninja
 4610228927332352 | Windows - Server 2019, VS 2019 - Meson & ninja
 4928567608344576 | Windows - Server 2019, VS 2019 - Meson & ninja
 5705451157848064 | FreeBSD - 13 - Meson
 5952066133164032 | Windows - Server 2019, VS 2019 - Meson & ninja
 5341101565935616 | Windows - Server 2019, VS 2019 - Meson & ninja
 6751165837213696 | Windows - Server 2019, VS 2019 - Meson & ninja
 4624168109473792 | Windows - Server 2019, VS 2019 - Meson & ninja
 6730863963013120 | Windows - Server 2019, VS 2019 - Meson & ninja
 6174140269330432 | Windows - Server 2019, VS 2019 - Meson & ninja
 4637318561136640 | Windows - Server 2019, VS 2019 - Meson & ninja
 4535300303618048 | Windows - Server 2019, VS 2019 - Meson & ninja
 5672693542944768 | FreeBSD - 13 - Meson
 6087381225308160 | Windows - Server 2019, VS 2019 - Meson & ninja
 6098413217906688 | Windows - Server 2019, VS 2019 - Meson & ninja
 6130601380544512 | Windows - Server 2019, VS 2019 - Meson & ninja
 5546054284738560 | Windows - Server 2019, VS 2019 - Meson & ninja
 6674258676416512 | Windows - Server 2019, VS 2019 - Meson & ninja
 4571976740634624 | FreeBSD - 13 - Meson
 6566515328155648 | Windows - Server 2019, VS 2019 - Meson & ninja
 6576879084240896 | Windows - Server 2019, VS 2019 - Meson & ninja
 5295804827566080 | Windows - Server 2019, VS 2019 - Meson & ninja
 6426387188285440 | Windows - Server 2019, VS 2019 - Meson & ninja
 4763275859066880 | Windows - Server 2019, VS 2019 - Meson & ninja
 6137227240013824 | Windows - Server 2019, VS 2019 - Meson & ninja
 5185063273365504 | Windows - Server 2019, VS 2019 - Meson & ninja
 6542656449282048 | Windows - Server 2019, VS 2019 - Meson & ninja
 4874919171850240 | Windows - Server 2019, VS 2019 - Meson & ninja
 6531290556530688 | Windows - Server 2019, VS 2019 - Meson & ninja
 5377848232378368 | FreeBSD - 13 - Meson
 6436049925177344 | FreeBSD - 13 - Meson
 6057679748071424 | Windows - Server 2019, VS 2019 - Meson & ninja
 5534694867992576 | Windows - Server 2019, VS 2019 - Meson & ninja
 4831311429369856 | Windows - Server 2019, VS 2019 - Meson & ninja
 4704271531245568 | macOS - Ventura - Meson
 5297047549509632 | Windows - Server 2019, VS 2019 - Meson & ninja
 5836487120388096 | macOS - Ventura - Meson
 6527459915464704 | FreeBSD - 13 - Meson
 4985483743199232 | FreeBSD - 13 - Meson
 4583651082502144 | Linux - Debian Bullseye - Meson
 5498444756811776 | FreeBSD - 13 - Meson
 5146601035923456 | Windows - Server 2019, VS 2019 - Meson & ninja
 5709550989344768 | macOS - Ventura - Meson
 6357936616767488 | FreeBSD - 13 - Meson
(60 rows)



Re: subscription/015_stream sometimes breaks

From: Thomas Munro

On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I didn't study it closely but it looks like there might be a second
> deadlock, after the one that is expected by the test?  Examples from
> the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is
testing, and it's the most frequent failure for which that is the
case.

    suite     |    name    | distinct_patches | errors
--------------+------------+------------------+--------
 subscription | 015_stream |               47 |     61



RE: subscription/015_stream sometimes breaks

From: "Zhijie Hou (Fujitsu)"

On Wednesday, August 23, 2023 4:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
> <thomas.munro@gmail.com> wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
> 
> I should add, it's not correlated with the patches that cfbot is testing, and it's
> the most frequent failure for which that is the case.
> 
>     suite     |    name    | distinct_patches | errors
> --------------+------------+------------------+--------
>  subscription | 015_stream |               47 |     61
> 

Thanks for reporting!
I am researching the failure and will share my analysis.

Best Regards,
Hou zj

Re: subscription/015_stream sometimes breaks

From: vignesh C

On Wed, 23 Aug 2023 at 02:25, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
>
> I should add, it's not correlated with the patches that cfbot is
> testing, and it's the most frequent failure for which that is the
> case.
>
>     suite     |    name    | distinct_patches | errors
> --------------+------------+------------------+--------
>  subscription | 015_stream |               47 |     61

I had noticed that it is failing because of a segmentation fault:
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel
worker][4/44:767] FATAL:  terminating logical replication worker due
to administrator command
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel
worker][4/44:767] CONTEXT:  processing remote data for replication
origin "pg_16397" during message type "STREAM STOP" in transaction 748
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:
unregistering background worker "logical replication parallel apply
worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG:  background
worker "logical replication parallel worker" (PID 3823455) exited with
exit code 1
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:
unregistering background worker "logical replication parallel apply
worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG:  background
worker "logical replication parallel worker" (PID 3823023) exited with
exit code 1
2023-08-22 19:07:22.419 UTC [3819892][postmaster][:0] LOG:  background
worker "logical replication apply worker" (PID 3822876) was terminated
by signal 11: Segmentation fault

The stack trace for this crash, from [1], is:
Core was generated by `postgres: subscriber: logical replication apply
worker for subscription 16397 '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/3822876' in core file too small.
#0  0x00000000007b461e in logicalrep_worker_stop_internal
(worker=<optimized out>, signo=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
583 kill(worker->proc->pid, signo);
#0  0x00000000007b461e in logicalrep_worker_stop_internal
(worker=<optimized out>, signo=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
#1  0x00000000007b565a in logicalrep_worker_detach () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:774
#2  0x00000000007b49ff in logicalrep_worker_onexit (code=<optimized
out>, arg=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:829
#3  0x00000000008034c5 in shmem_exit (code=<optimized out>) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:239
#4  0x00000000008033dc in proc_exit_prepare (code=1) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:194
#5  0x000000000080333d in proc_exit (code=1) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:107
#6  0x0000000000797068 in StartBackgroundWorker () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:827
#7  0x000000000079f257 in do_start_bgworker (rw=0x284e750) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5734
#8  0x000000000079b541 in maybe_start_bgworkers () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5958
#9  0x000000000079cb51 in process_pm_pmsignal () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5121
#10 0x000000000079b6bb in ServerLoop () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1769
#11 0x000000000079aaa5 in PostmasterMain (argc=4, argv=<optimized
out>) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1462
#12 0x00000000006d82a0 in main (argc=4, argv=0x27e3fd0) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:198
$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad =
{64, 0 <repeats 27 times>}, _kill = {si_pid = 64, si_uid = 0}, _timer
= {si_tid = 64, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr
= 0x0}}, _rt = {si_pid = 64, si_uid = 0, si_sigval = {sival_int = 0,
sival_ptr = 0x0}}, _sigchld = {si_pid = 64, si_uid = 0, si_status = 0,
si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x40, _addr_lsb =
0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band =
64, si_fd = 0}, _sigsys = {_call_addr = 0x40, _syscall = 0, _arch =
0}}}

[1] -
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dragonet&dt=2023-08-22%2018%3A56%3A04&stg=subscription-check

Regards,
Vignesh



RE: subscription/015_stream sometimes breaks

From: "Zhijie Hou (Fujitsu)"

> -----Original Message-----
> From: Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> Sent: Wednesday, August 23, 2023 10:27 AM
> To: Thomas Munro <thomas.munro@gmail.com>
> Cc: Amit Kapila <amit.kapila16@gmail.com>; pgsql-hackers
> <pgsql-hackers@postgresql.org>
> Subject: RE: subscription/015_stream sometimes breaks
> 
> On Wednesday, August 23, 2023 4:55 AM Thomas Munro
> <thomas.munro@gmail.com> wrote:
> >
> > On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
> <thomas.munro@gmail.com>
> > wrote:
> > > I didn't study it closely but it looks like there might be a second
> > > deadlock, after the one that is expected by the test?  Examples from
> > > the past couple of weeks:
> >
> > I should add, it's not correlated with the patches that cfbot is
> > testing, and it's the most frequent failure for which that is the case.
> >
> >     suite     |    name    | distinct_patches | errors
> > --------------+------------+------------------+--------
> >  subscription | 015_stream |               47 |     61
> >
> 
> Thanks for reporting !
> I am researching the failure and will share my analysis.

Hi,

After an off-list discussion with Amit, we figured out the reason.
From the crash log, I can see the apply worker crashed when accessing
worker->proc, so I think it's because worker->proc had already been released.

   577:     /* Now terminate the worker ... */
>  578:     kill(worker->proc->pid, signo);
   579: 
   580:     /* ... and wait for it to die. */

Normally, this should not happen, because we hold LogicalRepWorkerLock
while shutting down all the parallel workers [1], which prevents a
concurrent worker from freeing the worker info. But
logicalrep_worker_stop_internal() briefly releases the lock while
stopping parallel worker #1, and at that point it's possible that
another parallel worker #2, which reported an ERROR, shuts down by
itself and frees worker->proc. So when we try to stop that parallel
worker #2 in the next round, we don't realize it has already exited,
and we end up accessing invalid memory (worker->proc).

[1]--
        LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

        workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
        foreach(lc, workers)
        {
            LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

**            if (isParallelApplyWorker(w))
                logicalrep_worker_stop_internal(w, SIGTERM);
        }
--
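
For reference, the window where the lock is dropped is the wait loop in
logicalrep_worker_stop_internal(). An abridged sketch (paraphrased, not
the exact source):

    /* Abridged sketch of logicalrep_worker_stop_internal(); paraphrased. */
    static void
    logicalrep_worker_stop_internal(LogicalRepWorker *worker, int signo)
    {
        uint16      generation = worker->generation;

        Assert(LWLockHeldByMe(LogicalRepWorkerLock));

        /* Now terminate the worker ... */
        kill(worker->proc->pid, signo);

        /* ... and wait for it to die. */
        for (;;)
        {
            /* is it gone? */
            if (!worker->proc || worker->generation != generation)
                break;

            /*
             * The lock is released here while we sleep, so another worker
             * can exit and clean up its own slot concurrently.
             */
            LWLockRelease(LogicalRepWorkerLock);

            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             10L, WAIT_EVENT_BGWORKER_SHUTDOWN);
            ResetLatch(MyLatch);

            LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
        }
    }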

The bug appears after commit 2a8b40e, where isParallelApplyWorker()
started to use worker->type for the check, but we forgot to reset the
worker type at worker exit time. So, even after worker #2 has shut
down, the worker type still looks valid and we try to stop it again.

Previously, isParallelApplyWorker() used worker->leader_pid, which is
reset when the worker exits, so the "if (isParallelApplyWorker(w))"
check would not pass in this case and we would not try to stop worker #2.
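
Roughly, the relevant check in worker_internal.h changed along these
lines (paraphrased; the exact definitions may differ slightly):

    /* Before commit 2a8b40e (paraphrased): based on leader_pid, which is
     * reset when the worker exits. */
    #define isParallelApplyWorker(worker)  ((worker)->leader_pid != InvalidPid)

    /* After commit 2a8b40e (paraphrased): based on the worker type, which
     * was not being reset at exit. */
    #define isParallelApplyWorker(worker)  ((worker)->type == WORKERTYPE_PARALLEL_APPLY)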

To fix it, I think we need to reset the worker type at exit as well.
Attached is a patch which does that. I am also testing it locally to
see if there are other issues here.
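
The idea is just to clear the type where the rest of the slot is reset
at exit. A sketch of the approach (assuming the reset belongs in
logicalrep_worker_cleanup(); the attached patch is authoritative):

    /* Sketch only; see the attached patch for the real change. */
    static void
    logicalrep_worker_cleanup(LogicalRepWorker *worker)
    {
        Assert(LWLockHeldByMeInMode(LogicalRepWorkerLock, LW_EXCLUSIVE));

        worker->type = WORKERTYPE_UNKNOWN;  /* proposed addition */
        worker->in_use = false;
        worker->proc = NULL;
        worker->dbid = InvalidOid;
        worker->userid = InvalidOid;
        worker->subid = InvalidOid;
        worker->relid = InvalidOid;
        worker->leader_pid = InvalidPid;
    }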

Best Regards,
Hou Zhijie


Attachment

Re: subscription/015_stream sometimes breaks

From: Alvaro Herrera

On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:

> [1]--
>         LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 
>         workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
>         foreach(lc, workers)
>         {
>             LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
> 
> **            if (isParallelApplyWorker(w))
>                 logicalrep_worker_stop_internal(w, SIGTERM);
>         }

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.
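
In other words, have the macro refuse to look at the rest of the struct
unless the slot is in use, along these lines (an illustration of the
idea, not necessarily the attached patch):

    /* Illustration only; not necessarily the attached 0001. */
    #define isParallelApplyWorker(worker) \
        ((worker)->in_use && (worker)->type == WORKERTYPE_PARALLEL_APPLY)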

In fact, I'd go further and propose that if we do take that stance, then
we don't need to clear out the contents of this struct at all, so let's
not.  That's 0002.

And the reason 0002 does not remove the zeroing of ->proc is that the
tests get stuck when I do that, and the reason for that looks to be
some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
change that too, as in 0003.

Thoughts?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Attachment

Re: subscription/015_stream sometimes breaks

From: Alvaro Herrera

On 2023-Aug-23, Alvaro Herrera wrote:

> And the reason 0002 does not remove the zeroing of ->proc is that the
> tests gets stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.

Hmm, actually the test got stuck when I ran it repeatedly with this
0003.  I guess there might be other places that depend on ->proc being
set to NULL on exit.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."                               (Nathaniel Smith)



Re: subscription/015_stream sometimes breaks

From: Amit Kapila

On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:
>
> > [1]--
> >               LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> >
> >               workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
> >               foreach(lc, workers)
> >               {
> >                       LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
> >
> > **                    if (isParallelApplyWorker(w))
> >                               logicalrep_worker_stop_internal(w, SIGTERM);
> >               }
>
> Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> of the struct at all, so I propose to add the attached 0001 as a minimal
> fix.
>

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields, or form some rule
about which fields (or places) are okay to access without checking the
in_use field.

> In fact, I'd go further and propose that if we do take that stance, then
> we don't need clear out the contents of this struct at all, so let's
> not.  That's 0002.
>
> And the reason 0002 does not remove the zeroing of ->proc is that the
> tests gets stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.
>

Personally, I think we should consider these changes (0002 and 0003) separately.

--
With Regards,
Amit Kapila.



Re: subscription/015_stream sometimes breaks

From: Alvaro Herrera

On 2023-Aug-24, Amit Kapila wrote:

> On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > of the struct at all, so I propose to add the attached 0001 as a minimal
> > fix.
> 
> I think that way we may need to add the check for in_use before
> accessing each of the LogicalRepWorker struct fields or form some rule
> about which fields (or places) are okay to access without checking
> in_use field.

As far as I can tell, we have that rule already.  It's only a few
relatively new places that have broken it.  I understand that the in_use
concept comes from the field of the same name in ReplicationSlot, except
that it is not documented at all in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

> > In fact, I'd go further and propose that if we do take that stance, then
> > we don't need clear out the contents of this struct at all, so let's
> > not.  That's 0002.
> 
> Personally, I think we should consider these changes (0002 and 0003) separately.

I agree.  I'd maybe even retract them.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: subscription/015_stream sometimes breaks

From: Amit Kapila

On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Aug-24, Amit Kapila wrote:
>
> > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > fix.
> >
> > I think that way we may need to add the check for in_use before
> > accessing each of the LogicalRepWorker struct fields or form some rule
> > about which fields (or places) are okay to access without checking
> > in_use field.
>
> As far as I realize, we have that rule already.  It's only a few
> relatively new places that have broken it.  I understand that the in_use
> concept comes from the one of the same name in ReplicationSlot, except
> that it is not at all documented in worker_internal.h.
>
> So I propose we do both: apply Zhijie's patch and my 0001 now; and
> somebody gets to document the locking design for LogicalRepWorker.
>

Agreed.

--
With Regards,
Amit Kapila.



Re: subscription/015_stream sometimes breaks

From: Peter Smith

On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
> >
>
> Agreed.

Both of these patches (Hou-san's expedient resetting of the worker
type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
type macros) appear to be blending the concept of "type" with whether
the worker is "alive" or not, which I am not sure is a good thing. IMO
the type is the type forever, so I feel the type should be assigned only
once, when the worker is "born". For example, a dead horse is still a
horse.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: subscription/015_stream sometimes breaks

From: Amit Kapila

On Fri, Aug 25, 2023 at 9:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> > >
> >
> > Agreed.
>
> Both of these patches (Hou-san's expedient resetting of the worker
> type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
> type macros) appear to be blending the concept of "type" with whether
> the worker is "alive" or not, which I am not sure is a good thing. IMO
> the type is the type forever, so I felt type should get assigned only
> once when the worker is "born". For example, a dead horse is still a
> horse.
>

I think it is important to have an alive check before accessing the
worker type, as we do for some of the other fields. For example, see
the usage of the in_use flag in the function logicalrep_worker_find().
The parallel apply worker checks don't consider the in_use flag,
whereas the checks for other worker types look at in_use first.
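
For reference, the search loop there looks roughly like this
(paraphrased from launcher.c); the in_use test guards every other field
comparison:

    /* Paraphrased sketch of the loop in logicalrep_worker_find(). */
    LogicalRepWorker *res = NULL;

    Assert(LWLockHeldByMe(LogicalRepWorkerLock));

    for (int i = 0; i < max_logical_replication_workers; i++)
    {
        LogicalRepWorker *w = &LogicalRepCtx->workers[i];

        /* in_use is checked before any other field is trusted */
        if (w->in_use && w->subid == subid && w->relid == relid &&
            (!only_running || w->proc))
        {
            res = w;
            break;
        }
    }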

--
With Regards,
Amit Kapila.



Re: subscription/015_stream sometimes breaks

From: Amit Kapila

On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
> >
>
> Agreed.
>

Pushed both the patches.

--
With Regards,
Amit Kapila.



Re: subscription/015_stream sometimes breaks

From: Peter Smith

On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> > >
> >
> > Agreed.
> >
>
> Pushed both the patches.
>

IMO there are inconsistencies in the second patch that was pushed.

1. In the am_xxx functions, why is there an Assert on 'in_use' only for
the APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?

2. In the am_xxx functions there is now an Assert on 'in_use', so why
are we still using macros to check again what we have already asserted
is not possible? (Or, if the checking overkill was a deliberate choice,
then why is there no isLeaderApplyWorker macro?)

~

PSA a small patch to address these.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: subscription/015_stream sometimes breaks

From: Amit Kapila

On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> IMO there are inconsistencies in the second patch that was pushed.
>
> 1. In the am_xxx functions, why is there Assert 'in_use' only for the
> APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?
>
> 2. In the am_xxx functions there is now Assert 'in_use', so why are we
> still using macros to check again what we already asserted is not
> possible? (Or, if the checking overkill was a deliberate choice then
> why is there no isLeaderApplyWorker macro?)
>
> ~
>
> PSA a small patch to address these.
>

I find your suggestions reasonable. Alvaro, do you have any comments?

--
With Regards,
Amit Kapila.



Re: subscription/015_stream sometimes breaks

From: Alvaro Herrera

On 2023-Aug-29, Amit Kapila wrote:

> On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > IMO there are inconsistencies in the second patch that was pushed.

> I find your suggestions reasonable. Alvaro, do you have any comments?

Well, my main comment is that at this point I'm not sure these
isFooWorker() macros are worth their salt.  It looks like we could
replace their uses with direct type comparisons at their call sites and
remove them, with no loss of readability.  The am_sth_worker() functions
are probably a bit more useful, though some of their call sites might
read better with a straight type comparison as well.
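
That is, a call site could just say (illustrative only):

    /* illustrative only: direct type comparison instead of the macro */
    if (w->type == WORKERTYPE_PARALLEL_APPLY)
        logicalrep_worker_stop_internal(w, SIGTERM);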

All in all, I don't disagree with Peter's suggestions, but this is
pretty much in "meh" territory for me.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: subscription/015_stream sometimes breaks

From: Peter Smith

On Tue, Aug 29, 2023 at 10:35 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Aug-29, Amit Kapila wrote:
>
> > On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > IMO there are inconsistencies in the second patch that was pushed.
>
> > I find your suggestions reasonable. Alvaro, do you have any comments?
>
> Well, my main comment is that at this point I'm not sure these
> isFooWorker() macros are worth their salt.  It looks like we could
> replace their uses with direct type comparisons in their callsites and
> remove them, with no loss of readability.  The am_sth_worker() are
> probably a bit more useful, though some of the callsites could end up
> better if replaced with straight type comparison.
>
> All in all, I don't disagree with Peter's suggestions, but this is
> pretty much in "meh" territory for me.

I had written a small non-functional patch [1] to address some macro
inconsistencies introduced by an earlier patch on this thread.

It received initial feedback from Amit ("I find your suggestions
reasonable") and from Alvaro ("I don't disagree with Peter's
suggestions") but then nothing further happened. I also created a CF
entry https://commitfest.postgresql.org/46/4570/ for it.

AFAIK my patch is still valid, but after 4 months of no activity it
seems there is no interest in pushing it, so I am withdrawing the CF
entry.

======
[1] https://www.postgresql.org/message-id/CAHut%2BPuwaF4Sb41pWQk69d2WO_ZJQpj-_2JkQvP%3D1jwozUpcCQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia