On Sun, Mar 20, 2022 at 4:53 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/20/22 07:23, Amit Kapila wrote:
> > On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
> >> <tomas.vondra@enterprisedb.com> wrote:
> >>
> >>> So the question is why those two sync workers never complete - I guess
> >>> there's some sort of lock wait (deadlock?) or infinite loop.
> >>>
> >>
> >> It would be a bit tricky to reproduce this even if the above theory is
> >> correct but I'll try it today or tomorrow.
> >>
> >
> > I am able to reproduce it with the help of a debugger. Firstly, I have
> > added the LOG message and some While (true) loops to debug sync and
> > apply workers. Test setup
> >
> > Node-1:
> > create table t1(c1);
> > create table t2(c1);
> > insert into t1 values(1);
> > create publication pub1 for table t1;
> > create publication pu2;
> >
> > Node-2:
> > change max_sync_workers_per_subscription to 1 in potgresql.conf
> > create table t1(c1);
> > create table t2(c1);
> > create subscription sub1 connection 'dbname = postgres' publication pub1;
> >
> > Till this point, just allow debuggers in both workers just continue.
> >
> > Node-1:
> > alter publication pub1 add table t2;
> > insert into t1 values(2);
> >
> > Here, we have to debug the apply worker such that when it tries to
> > apply the insert, stop the debugger in function apply_handle_insert()
> > after doing begin_replication_step().
> >
> > Node-2:
> > alter subscription sub1 set pub1, pub2;
> >
> > Now, continue the debugger of apply worker, it should first start the
> > sync worker and then exit because of parameter change. All of these
> > debugging steps are to just ensure the point that it should first
> > start the sync worker and then exit. After this point, table sync
> > worker never finishes and log is filled with messages: "reached
> > max_sync_workers_per_subscription limit" (a newly added message by me
> > in the attached debug patch).
> >
> > Now, it is not completely clear to me how exactly '013_partition.pl'
> > leads to this situation but there is a possibility based on the LOGs
> > it shows.
> >
>
> Thanks, I'll take a look later.
>
This is still failing [1][2].
[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-28%2005%3A16%3A53
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-24%2013%3A13%3A08
--
With Regards,
Amit Kapila.