Thread: subscription/015_stream sometimes breaks
Hi,

A couple of times a day, cfbot reports an error like this:

https://cirrus-ci.com/task/6424286882168832

I didn't study it closely but it looks like there might be a second
deadlock, after the one that is expected by the test?  Examples from
the past couple of weeks:

cfbot=> select test.task_id, task.task_name, task.created as time
          from test join task using (task_id)
         where suite = 'subscription' and name = '015_stream'
           and result = 'ERROR'
           and task.created > now() - interval '14 days'
         order by task.created desc;
     task_id      |                   task_name
------------------+------------------------------------------------
 6600867550330880 | Windows - Server 2019, VS 2019 - Meson & ninja
 6222795470798848 | Windows - Server 2019, VS 2019 - Meson & ninja
 6162088322662400 | macOS - Ventura - Meson
 5014781862608896 | Windows - Server 2019, VS 2019 - Meson & ninja
 6424286882168832 | Windows - Server 2019, VS 2019 - Meson & ninja
 6683705222103040 | Windows - Server 2019, VS 2019 - Meson & ninja
 5881665076068352 | Windows - Server 2019, VS 2019 - Meson & ninja
 5865929054093312 | Windows - Server 2019, VS 2019 - Meson & ninja
 5855144995192832 | Windows - Server 2019, VS 2019 - Meson & ninja
 6071567994585088 | Windows - Server 2019, VS 2019 - Meson & ninja
 5311312343859200 | Windows - Server 2019, VS 2019 - Meson & ninja
 4986100071006208 | FreeBSD - 13 - Meson
 6302301388800000 | Windows - Server 2019, VS 2019 - Meson & ninja
 4554627119579136 | Windows - Server 2019, VS 2019 - Meson & ninja
 6106090807492608 | Windows - Server 2019, VS 2019 - Meson & ninja
 5190113534148608 | Windows - Server 2019, VS 2019 - Meson & ninja
 6452324697112576 | Windows - Server 2019, VS 2019 - Meson & ninja
 4610228927332352 | Windows - Server 2019, VS 2019 - Meson & ninja
 4928567608344576 | Windows - Server 2019, VS 2019 - Meson & ninja
 5705451157848064 | FreeBSD - 13 - Meson
 5952066133164032 | Windows - Server 2019, VS 2019 - Meson & ninja
 5341101565935616 | Windows - Server 2019, VS 2019 - Meson & ninja
 6751165837213696 | Windows - Server 2019, VS 2019 - Meson & ninja
 4624168109473792 | Windows - Server 2019, VS 2019 - Meson & ninja
 6730863963013120 | Windows - Server 2019, VS 2019 - Meson & ninja
 6174140269330432 | Windows - Server 2019, VS 2019 - Meson & ninja
 4637318561136640 | Windows - Server 2019, VS 2019 - Meson & ninja
 4535300303618048 | Windows - Server 2019, VS 2019 - Meson & ninja
 5672693542944768 | FreeBSD - 13 - Meson
 6087381225308160 | Windows - Server 2019, VS 2019 - Meson & ninja
 6098413217906688 | Windows - Server 2019, VS 2019 - Meson & ninja
 6130601380544512 | Windows - Server 2019, VS 2019 - Meson & ninja
 5546054284738560 | Windows - Server 2019, VS 2019 - Meson & ninja
 6674258676416512 | Windows - Server 2019, VS 2019 - Meson & ninja
 4571976740634624 | FreeBSD - 13 - Meson
 6566515328155648 | Windows - Server 2019, VS 2019 - Meson & ninja
 6576879084240896 | Windows - Server 2019, VS 2019 - Meson & ninja
 5295804827566080 | Windows - Server 2019, VS 2019 - Meson & ninja
 6426387188285440 | Windows - Server 2019, VS 2019 - Meson & ninja
 4763275859066880 | Windows - Server 2019, VS 2019 - Meson & ninja
 6137227240013824 | Windows - Server 2019, VS 2019 - Meson & ninja
 5185063273365504 | Windows - Server 2019, VS 2019 - Meson & ninja
 6542656449282048 | Windows - Server 2019, VS 2019 - Meson & ninja
 4874919171850240 | Windows - Server 2019, VS 2019 - Meson & ninja
 6531290556530688 | Windows - Server 2019, VS 2019 - Meson & ninja
 5377848232378368 | FreeBSD - 13 - Meson
 6436049925177344 | FreeBSD - 13 - Meson
 6057679748071424 | Windows - Server 2019, VS 2019 - Meson & ninja
 5534694867992576 | Windows - Server 2019, VS 2019 - Meson & ninja
 4831311429369856 | Windows - Server 2019, VS 2019 - Meson & ninja
 4704271531245568 | macOS - Ventura - Meson
 5297047549509632 | Windows - Server 2019, VS 2019 - Meson & ninja
 5836487120388096 | macOS - Ventura - Meson
 6527459915464704 | FreeBSD - 13 - Meson
 4985483743199232 | FreeBSD - 13 - Meson
 4583651082502144 | Linux - Debian Bullseye - Meson
 5498444756811776 | FreeBSD - 13 - Meson
 5146601035923456 | Windows - Server 2019, VS 2019 - Meson & ninja
 5709550989344768 | macOS - Ventura - Meson
 6357936616767488 | FreeBSD - 13 - Meson
(60 rows)
On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I didn't study it closely but it looks like there might be a second
> deadlock, after the one that is expected by the test?  Examples from
> the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is
testing, and it's the most frequent failure for which that is the case.

    suite     |    name    | distinct_patches | errors
--------------+------------+------------------+--------
 subscription | 015_stream |               47 |     61
On Wednesday, August 23, 2023 4:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
> <thomas.munro@gmail.com> wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
>
> I should add, it's not correlated with the patches that cfbot is testing, and it's
> the most frequent failure for which that is the case.
>
> suite | name | distinct_patches | errors
> --------------+------------+------------------+--------
> subscription | 015_stream | 47 | 61
>

Thanks for reporting!
I am researching the failure and will share my analysis.

Best Regards,
Hou zj
On Wed, 23 Aug 2023 at 02:25, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
>
> I should add, it's not correlated with the patches that cfbot is
> testing, and it's the most frequent failure for which that is the
> case.
>
> suite | name | distinct_patches | errors
> --------------+------------+------------------+--------
> subscription | 015_stream | 47 | 61

I had noticed that it is failing because of a segmentation fault:

2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel worker][4/44:767] FATAL: terminating logical replication worker due to administrator command
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel worker][4/44:767] CONTEXT: processing remote data for replication origin "pg_16397" during message type "STREAM STOP" in transaction 748
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG: unregistering background worker "logical replication parallel apply worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG: background worker "logical replication parallel worker" (PID 3823455) exited with exit code 1
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG: unregistering background worker "logical replication parallel apply worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG: background worker "logical replication parallel worker" (PID 3823023) exited with exit code 1
2023-08-22 19:07:22.419 UTC [3819892][postmaster][:0] LOG: background worker "logical replication apply worker" (PID 3822876) was terminated by signal 11: Segmentation fault

The stack trace for the same generated at [1] is:

Core was generated by `postgres: subscriber: logical replication apply worker for subscription 16397 '.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Section `.reg-xstate/3822876' in core file too small.
#0 0x00000000007b461e in logicalrep_worker_stop_internal (worker=<optimized out>, signo=<optimized out>) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
583             kill(worker->proc->pid, signo);
#0 0x00000000007b461e in logicalrep_worker_stop_internal (worker=<optimized out>, signo=<optimized out>) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
#1 0x00000000007b565a in logicalrep_worker_detach () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:774
#2 0x00000000007b49ff in logicalrep_worker_onexit (code=<optimized out>, arg=<optimized out>) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:829
#3 0x00000000008034c5 in shmem_exit (code=<optimized out>) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:239
#4 0x00000000008033dc in proc_exit_prepare (code=1) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:194
#5 0x000000000080333d in proc_exit (code=1) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:107
#6 0x0000000000797068 in StartBackgroundWorker () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:827
#7 0x000000000079f257 in do_start_bgworker (rw=0x284e750) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5734
#8 0x000000000079b541 in maybe_start_bgworkers () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5958
#9 0x000000000079cb51 in process_pm_pmsignal () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5121
#10 0x000000000079b6bb in ServerLoop () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1769
#11 0x000000000079aaa5 in PostmasterMain (argc=4, argv=<optimized out>) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1462
#12 0x00000000006d82a0 in main (argc=4, argv=0x27e3fd0) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:198

$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {64, 0 <repeats 27 times>}, _kill = {si_pid = 64, si_uid = 0}, _timer = {si_tid = 64, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 64, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 64, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x40, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 64, si_fd = 0}, _sigsys = {_call_addr = 0x40, _syscall = 0, _arch = 0}}}

[1] - https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dragonet&dt=2023-08-22%2018%3A56%3A04&stg=subscription-check

Regards,
Vignesh
> -----Original Message-----
> From: Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> Sent: Wednesday, August 23, 2023 10:27 AM
> To: Thomas Munro <thomas.munro@gmail.com>
> Cc: Amit Kapila <amit.kapila16@gmail.com>; pgsql-hackers
> <pgsql-hackers@postgresql.org>
> Subject: RE: subscription/015_stream sometimes breaks
>
> On Wednesday, August 23, 2023 4:55 AM Thomas Munro
> <thomas.munro@gmail.com> wrote:
> >
> > On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
> > <thomas.munro@gmail.com> wrote:
> > > I didn't study it closely but it looks like there might be a second
> > > deadlock, after the one that is expected by the test?  Examples from
> > > the past couple of weeks:
> >
> > I should add, it's not correlated with the patches that cfbot is
> > testing, and it's the most frequent failure for which that is the case.
> >
> > suite | name | distinct_patches | errors
> > --------------+------------+------------------+--------
> > subscription | 015_stream | 47 | 61
> >
>
> Thanks for reporting!
> I am researching the failure and will share my analysis.

Hi,

After an off-list discussion with Amit, we figured out the reason.

From the crash log, I can see that the apply worker crashed when accessing
worker->proc, so I think it's because worker->proc had already been released.

577:	/* Now terminate the worker ... */
> 578:	kill(worker->proc->pid, signo);
579:
580:	/* ... and wait for it to die. */

Normally, this should not happen because we take a lock on
LogicalRepWorkerLock when shutting down all the parallel workers [1], which
prevents a concurrent worker from freeing the worker info.  But in
logicalrep_worker_stop_internal(), when stopping parallel worker #1, we
release the lock for a short time, and at that point it's possible that
another parallel worker #2, which reported an ERROR, shuts down by itself
and frees its worker->proc.  So when we try to stop that parallel worker #2
in the next round, we don't realize it has already gone away, and thus
access invalid memory (worker->proc).

[1]--
		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

		workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
		foreach(lc, workers)
		{
			LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

**			if (isParallelApplyWorker(w))
				logicalrep_worker_stop_internal(w, SIGTERM);
		}
--

The bug started to happen after commit 2a8b40e, where isParallelApplyWorker()
began to use worker->type for the check, but we forgot to reset the worker
type at worker exit time.  So, even after worker #2 has shut down, its
worker type still looks valid and we try to stop it again.  Previously,
isParallelApplyWorker() used worker->leader_pid, which is reset when the
worker exits, so the "if (isParallelApplyWorker(w))" check didn't pass in
this case and we didn't try to stop worker #2.

To fix it, I think we need to reset the worker type at exit as well.
Attached is a patch which does that.  I am also testing it locally to see
if there are other issues here.

Best Regards,
Hou Zhijie
Attachment
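To make the timing described above concrete, here is a small self-contained
model of the race (simplified stand-in types and names, not PostgreSQL
source): an exiting worker clears its proc pointer, and unless its type is
reset as well, a later stop loop that keys off the type alone consults a
stale slot.

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	WT_UNKNOWN,
	WT_APPLY,
	WT_PARALLEL_APPLY
} WorkerType;

typedef struct
{
	int			pid;
} Proc;

typedef struct
{
	bool		in_use;
	WorkerType	type;
	Proc	   *proc;			/* NULL once the worker has exited */
} WorkerSlot;

/* What an exiting worker does with its own shared slot. */
static void
worker_cleanup(WorkerSlot *w, bool reset_type)
{
	w->in_use = false;
	w->proc = NULL;
	if (reset_type)
		w->type = WT_UNKNOWN;	/* the idea behind the proposed fix */
}

/* What the leader does when stopping its parallel workers. */
static void
stop_parallel_workers(WorkerSlot *slots, int n)
{
	for (int i = 0; i < n; i++)
	{
		WorkerSlot *w = &slots[i];

		/* keying off the type alone, as in the buggy path */
		if (w->type == WT_PARALLEL_APPLY)
		{
			if (w->proc == NULL)
			{
				printf("slot %d: stale, would dereference a NULL proc\n", i);
				continue;
			}
			printf("slot %d: stopping pid %d\n", i, w->proc->pid);
		}
	}
}

int
main(void)
{
	Proc		p1 = {101};
	Proc		p2 = {102};
	WorkerSlot	slots[2] = {
		{true, WT_PARALLEL_APPLY, &p1},
		{true, WT_PARALLEL_APPLY, &p2},
	};

	/*
	 * Worker #2 hits an ERROR and exits on its own while the leader's stop
	 * loop has temporarily dropped the lock; without resetting the type,
	 * its slot still looks like a live parallel apply worker.
	 */
	worker_cleanup(&slots[1], false);
	stop_parallel_workers(slots, 2);

	/* With the type reset, the stale slot is no longer picked up. */
	slots[1].type = WT_PARALLEL_APPLY;	/* restore, then clean up properly */
	worker_cleanup(&slots[1], true);
	stop_parallel_workers(slots, 2);

	return 0;
}

(The real code crashes because the slot's proc pointer is dereferenced
without such a NULL guard; the guard here only makes the model printable.)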
On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:

> [1]--
> 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
>
> 		workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
> 		foreach(lc, workers)
> 		{
> 			LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
>
> **			if (isParallelApplyWorker(w))
> 				logicalrep_worker_stop_internal(w, SIGTERM);
> 		}

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

In fact, I'd go further and propose that if we do take that stance, then
we don't need to clear out the contents of this struct at all, so let's
not.  That's 0002.

And the reason 0002 does not remove the zeroing of ->proc is that the
tests get stuck when I do that, and the reason for that looks to be
some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
change that too, as in 0003.

Thoughts?

-- 
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
Attachment
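For readers who can't see the attachment, the shape of the 0001 change being
described is roughly the following (a sketch written from the discussion
only; the attached patch is authoritative and may differ in detail): the
type-test macro refuses to look at a slot whose in_use flag is clear.

#define isParallelApplyWorker(worker) \
	((worker)->in_use && \
	 (worker)->type == WORKERTYPE_PARALLEL_APPLY)

With that, the stop loop quoted above skips slots that a concurrently
exiting worker has already marked free, regardless of what is left in the
rest of the struct.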
On 2023-Aug-23, Alvaro Herrera wrote:

> And the reason 0002 does not remove the zeroing of ->proc is that the
> tests get stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.

Hmm, actually the test got stuck when I ran it repeatedly with this
0003.  I guess there might be other places that depend on ->proc being
set to NULL on exit.

-- 
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."   (Nathaniel Smith)
On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:
>
> > [1]--
> > LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> >
> > workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true);
> > foreach(lc, workers)
> > {
> > LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
> >
> > ** if (isParallelApplyWorker(w))
> > logicalrep_worker_stop_internal(w, SIGTERM);
> > }
>
> Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> of the struct at all, so I propose to add the attached 0001 as a minimal
> fix.
>

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

> In fact, I'd go further and propose that if we do take that stance, then
> we don't need to clear out the contents of this struct at all, so let's
> not.  That's 0002.
>
> And the reason 0002 does not remove the zeroing of ->proc is that the
> tests get stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.
>

Personally, I think we should consider these changes (0002 and 0003)
separately.

-- 
With Regards,
Amit Kapila.
On 2023-Aug-24, Amit Kapila wrote:

> On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > of the struct at all, so I propose to add the attached 0001 as a minimal
> > fix.
>
> I think that way we may need to add the check for in_use before
> accessing each of the LogicalRepWorker struct fields or form some rule
> about which fields (or places) are okay to access without checking
> in_use field.

As far as I realize, we have that rule already.  It's only a few
relatively new places that have broken it.  I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

> > In fact, I'd go further and propose that if we do take that stance, then
> > we don't need to clear out the contents of this struct at all, so let's
> > not.  That's 0002.
>
> Personally, I think we should consider these changes (0002 and 0003)
> separately.

I agree.  I'd maybe even retract them.

-- 
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Aug-24, Amit Kapila wrote:
>
> > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > fix.
> >
> > I think that way we may need to add the check for in_use before
> > accessing each of the LogicalRepWorker struct fields or form some rule
> > about which fields (or places) are okay to access without checking
> > in_use field.
>
> As far as I realize, we have that rule already.  It's only a few
> relatively new places that have broken it.  I understand that the in_use
> concept comes from the one of the same name in ReplicationSlot, except
> that it is not at all documented in worker_internal.h.
>
> So I propose we do both: apply Zhijie's patch and my 0001 now; and
> somebody gets to document the locking design for LogicalRepWorker.
>

Agreed.

-- 
With Regards,
Amit Kapila.
On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
> >
>
> Agreed.

Both of these patches (Hou-san's expedient resetting of the worker
type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
type macros) appear to be blending the concept of "type" with whether
the worker is "alive" or not, which I am not sure is a good thing.  IMO
the type is the type forever, so I felt type should get assigned only
once when the worker is "born".  For example, a dead horse is still a
horse.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Aug 25, 2023 at 9:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> > >
> >
> > Agreed.
>
> Both of these patches (Hou-san's expedient resetting of the worker
> type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
> type macros) appear to be blending the concept of "type" with whether
> the worker is "alive" or not, which I am not sure is a good thing.  IMO
> the type is the type forever, so I felt type should get assigned only
> once when the worker is "born".  For example, a dead horse is still a
> horse.
>

I think it is important to have an alive check before accessing the
worker type, as we are doing for some of the other fields.  For example,
see the usage of the in_use flag in the function
logicalrep_worker_find().  The check for parallel apply workers doesn't
consider the in_use flag, whereas the checks for other worker types look
at in_use first.

-- 
With Regards,
Amit Kapila.
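For illustration, the rule being referred to looks roughly like this in a
lookup over the shared worker array (a sketch in the spirit of
logicalrep_worker_find(), not verbatim source; treat the names and details
as approximate): a slot whose in_use flag is clear is treated as empty
before any other field is read.

	for (int i = 0; i < max_logical_replication_workers; i++)
	{
		LogicalRepWorker *w = &LogicalRepCtx->workers[i];

		if (!w->in_use)
			continue;			/* free slot: no other field is meaningful */

		if (w->subid == subid && w->relid == relid &&
			(!only_running || w->proc != NULL))
			return w;			/* live, matching worker */
	}
	return NULL;

The complaint above is that the parallel-apply check in the stop path
skipped that first test.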
On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
> >
>
> Agreed.
>

Pushed both the patches.

-- 
With Regards,
Amit Kapila.
On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> > >
> >
> > Agreed.
> >
>
> Pushed both the patches.
>

IMO there are inconsistencies in the second patch that was pushed.

1. In the am_xxx functions, why is there Assert 'in_use' only for the
APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?

2. In the am_xxx functions there is now Assert 'in_use', so why are we
still using macros to check again what we already asserted is not
possible?  (Or, if the checking overkill was a deliberate choice then
why is there no isLeaderApplyWorker macro?)

~

PSA a small patch to address these.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
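For context, the kind of consistency being asked for could look something
like the following (a sketch of the idea only, not the attached patch; the
function and enum names follow worker_internal.h as I understand it):
assert in_use uniformly in all three functions and compare the type
directly, instead of asserting in some and re-checking in_use through a
macro in others.

static inline bool
am_tablesync_worker(void)
{
	Assert(MyLogicalRepWorker->in_use);
	return (MyLogicalRepWorker->type == WORKERTYPE_TABLESYNC);
}

static inline bool
am_leader_apply_worker(void)
{
	Assert(MyLogicalRepWorker->in_use);
	return (MyLogicalRepWorker->type == WORKERTYPE_APPLY);
}

static inline bool
am_parallel_apply_worker(void)
{
	Assert(MyLogicalRepWorker->in_use);
	return (MyLogicalRepWorker->type == WORKERTYPE_PARALLEL_APPLY);
}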
On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> IMO there are inconsistencies in the second patch that was pushed.
>
> 1. In the am_xxx functions, why is there Assert 'in_use' only for the
> APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?
>
> 2. In the am_xxx functions there is now Assert 'in_use', so why are we
> still using macros to check again what we already asserted is not
> possible?  (Or, if the checking overkill was a deliberate choice then
> why is there no isLeaderApplyWorker macro?)
>
> ~
>
> PSA a small patch to address these.
>

I find your suggestions reasonable.  Alvaro, do you have any comments?

-- 
With Regards,
Amit Kapila.
On 2023-Aug-29, Amit Kapila wrote:

> On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > IMO there are inconsistencies in the second patch that was pushed.
>
> I find your suggestions reasonable. Alvaro, do you have any comments?

Well, my main comment is that at this point I'm not sure these
isFooWorker() macros are worth their salt.  It looks like we could
replace their uses with direct type comparisons in their callsites and
remove them, with no loss of readability.  The am_sth_worker() functions
are probably a bit more useful, though some of the callsites could end
up better if replaced with straight type comparison.

All in all, I don't disagree with Peter's suggestions, but this is
pretty much in "meh" territory for me.

-- 
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
On Tue, Aug 29, 2023 at 10:35 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Aug-29, Amit Kapila wrote:
>
> > On Mon, Aug 28, 2023 at 5:35 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > IMO there are inconsistencies in the second patch that was pushed.
> >
> > I find your suggestions reasonable. Alvaro, do you have any comments?
>
> Well, my main comment is that at this point I'm not sure these
> isFooWorker() macros are worth their salt.  It looks like we could
> replace their uses with direct type comparisons in their callsites and
> remove them, with no loss of readability.  The am_sth_worker() functions
> are probably a bit more useful, though some of the callsites could end
> up better if replaced with straight type comparison.
>
> All in all, I don't disagree with Peter's suggestions, but this is
> pretty much in "meh" territory for me.

I had written a small non-functional patch [1] to address some macro
inconsistencies introduced by a prior patch of this thread.  It received
initial feedback from Amit ("I find your suggestions reasonable") and
from Alvaro ("I don't disagree with Peter's suggestions") but then
nothing further happened.

I also created a CF entry https://commitfest.postgresql.org/46/4570/ for it.

AFAIK my patch is still valid, but after 4 months of no activity it
seems there is no interest in pushing it, so I am withdrawing the CF
entry.

======
[1] https://www.postgresql.org/message-id/CAHut%2BPuwaF4Sb41pWQk69d2WO_ZJQpj-_2JkQvP%3D1jwozUpcCQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia