On 3/29/22 12:00, Amit Kapila wrote:
> On Sun, Mar 20, 2022 at 4:53 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/20/22 07:23, Amit Kapila wrote:
>>> On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>>> So the question is why those two sync workers never complete - I guess
>>>>> there's some sort of lock wait (deadlock?) or infinite loop.
>>>>>
>>>>
>>>> It would be a bit tricky to reproduce this even if the above theory is
>>>> correct but I'll try it today or tomorrow.
>>>>
>>>
>>> I am able to reproduce it with the help of a debugger. First, I
>>> added a LOG message and some while (true) loops to debug the sync and
>>> apply workers. Test setup:
>>>
>>> Node-1:
>>> create table t1(c1 int);
>>> create table t2(c1 int);
>>> insert into t1 values(1);
>>> create publication pub1 for table t1;
>>> create publication pub2;
>>>
>>> Node-2:
>>> change max_sync_workers_per_subscription to 1 in postgresql.conf
>>> create table t1(c1 int);
>>> create table t2(c1 int);
>>> create subscription sub1 connection 'dbname = postgres' publication pub1;
>>>
>>> Up to this point, just let the debuggers in both workers continue.
>>>
>>> Node-1:
>>> alter publication pub1 add table t2;
>>> insert into t1 values(2);
>>>
>>> Here, debug the apply worker so that when it tries to apply the
>>> insert, the debugger stops in apply_handle_insert() right after
>>> begin_replication_step().
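
For anyone trying to reproduce this: the spot Amit describes is roughly
here, in apply_handle_insert() in src/backend/replication/logical/worker.c
(a simplified sketch from memory, not a verbatim excerpt):

    static void
    apply_handle_insert(StringInfo s)
    {
        ...
        begin_replication_step();
        /* pause the apply worker's debugger here, right after this call */

        relid = logicalrep_read_insert(s, &newtup);
        rel = logicalrep_rel_open(relid, RowExclusiveLock);
        ...
    }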
>>>
>>> Node-2:
>>> alter subscription sub1 set publication pub1, pub2;
>>>
>>> Now continue the apply worker's debugger; it should first start the
>>> sync worker and then exit because of the parameter change. All of
>>> these debugging steps just ensure that it starts the sync worker
>>> before exiting. After this point, the table sync worker never
>>> finishes and the log fills with the message "reached
>>> max_sync_workers_per_subscription limit" (a message newly added by me
>>> in the attached debug patch).
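
I would guess the debug patch adds something along these lines near the
nsyncworkers check in logicalrep_worker_launch() in launcher.c -- this is
only my sketch of what it might look like, not the attached patch:

    if (OidIsValid(relid) &&
        nsyncworkers >= max_sync_workers_per_subscription)
    {
        /* hypothetical debug-only message matching the log text above */
        elog(LOG, "reached max_sync_workers_per_subscription limit");
        LWLockRelease(LogicalRepWorkerLock);
        return;
    }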
>>>
>>> It is not completely clear to me how exactly '013_partition.pl' ends
>>> up in this situation, but based on the LOGs it produces this looks
>>> like a plausible explanation.
>>>
>>
>> Thanks, I'll take a look later.
>>
>
> This is still failing [1][2].
>
> [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-28%2005%3A16%3A53
> [2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-24%2013%3A13%3A08
>
AFAICS we've concluded this is a pre-existing issue, not something
introduced by a recently committed patch, and I don't think there's any
proposal for how to fix it. So I've put it on the back burner until
after the current CF.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company