On 3/20/22 07:23, Amit Kapila wrote:
> On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>>> So the question is why those two sync workers never complete - I guess
>>> there's some sort of lock wait (deadlock?) or infinite loop.
>>>
>>
>> It would be a bit tricky to reproduce this even if the above theory is
>> correct but I'll try it today or tomorrow.
>>
>
> I am able to reproduce it with the help of a debugger. First, I
> added a LOG message and some while (true) loops to debug the sync and
> apply workers. Test setup:
>
> Node-1:
> create table t1(c1 int);
> create table t2(c1 int);
> insert into t1 values(1);
> create publication pub1 for table t1;
> create publication pub2;
>
> Node-2:
> change max_sync_workers_per_subscription to 1 in postgresql.conf
> create table t1(c1 int);
> create table t2(c1 int);
> create subscription sub1 connection 'dbname = postgres' publication pub1;
>
> Up to this point, just let the debuggers in both workers continue.
>
> Node-1:
> alter publication pub1 add table t2;
> insert into t1 values(2);
>
> Here, we have to debug the apply worker so that, when it tries to
> apply the insert, the debugger stops in apply_handle_insert() just
> after begin_replication_step().
>
> Node-2:
> alter subscription sub1 set publication pub1, pub2;
>
> Now, continue the apply worker in the debugger; it should first start
> the sync worker and then exit because of the parameter change. The
> only purpose of these debugging steps is to ensure that ordering:
> the sync worker is started first, and then the apply worker exits.
> After this point, the table sync worker never finishes and the log is
> filled with the message "reached max_sync_workers_per_subscription
> limit" (a message newly added by me in the attached debug patch).
>
> Now, it is not completely clear to me how exactly '013_partition.pl'
> leads to this situation but there is a possibility based on the LOGs
> it shows.
>
Thanks, I'll take a look later. From the description it seems this is an
issue that existed before any of the patches, right? It might be more
likely to hit due to some test changes, but the root cause is older.
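
For anyone following along, one way to watch the stuck state on Node-2
without the debug patch might be to query the subscription catalogs.
This is only a sketch, assuming the subscription name sub1 from the
steps above; the column aliases are made up for illustration, and I
have not run it against this exact reproduction.

```sql
-- Run on Node-2 (the subscriber). Sketch only; aliases are illustrative.

-- Per-table sync state for sub1. srsubstate is 'r' once a table is
-- ready; a table whose sync never completes would stay in an earlier
-- state such as 'i' (initialize) or 'd' (data is being copied).
select sr.srrelid::regclass as tbl, sr.srsubstate
from pg_subscription_rel sr
join pg_subscription s on s.oid = sr.srsubid
where s.subname = 'sub1';

-- Workers currently attached to sub1. relid is NULL for the apply
-- worker and set for a table sync worker.
select pid, relid::regclass as syncing_table
from pg_stat_subscription
where subname = 'sub1';
```

If the second query keeps showing the same sync worker while the first
never reports 'r' for that table, that would match the behavior
described above.
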
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company