Re: Column Filtering in Logical Replication - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Column Filtering in Logical Replication
Date
Msg-id 369ae611-8822-f499-87cd-58ad0d60c60c@enterprisedb.com
In response to Re: Column Filtering in Logical Replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Column Filtering in Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers

On 3/18/22 15:43, Tomas Vondra wrote:
> 
> 
> On 3/18/22 06:52, Amit Kapila wrote:
>> On Fri, Mar 18, 2022 at 12:47 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I pushed the second fix. Interestingly enough, wrasse failed in the
>>> 013_partition test. I don't see how that could be caused by this
>>> particular commit, though - see the pgsql-committers thread [1].
>>>
>>
>> I have a theory about what's going on here. I think this is due to a
>> test added in your previous commit c91f71b9dc. The newly added test
>> hangs in tablesync because there was no apply worker to set the
>> state to SUBREL_STATE_CATCHUP which blocked tablesync workers from
>> proceeding.
>>
>> See below logs from pogona [1].
>> 2022-03-18 01:33:15.190 CET [2551176][client
>> backend][3/74:0][013_partition.pl] LOG:  statement: ALTER SUBSCRIPTION
>> sub2 SET PUBLICATION pub_lower_level, pub_all
>> 2022-03-18 01:33:15.354 CET [2551193][logical replication
>> worker][4/57:0][] LOG:  logical replication apply worker for
>> subscription "sub2" has started
>> 2022-03-18 01:33:15.605 CET [2551176][client
>> backend][:0][013_partition.pl] LOG:  disconnection: session time:
>> 0:00:00.415 user=bf database=postgres host=[local]
>> 2022-03-18 01:33:15.607 CET [2551209][logical replication
>> worker][3/76:0][] LOG:  logical replication table synchronization
>> worker for subscription "sub2", table "tab4_1" has started
>> 2022-03-18 01:33:15.609 CET [2551211][logical replication
>> worker][5/11:0][] LOG:  logical replication table synchronization
>> worker for subscription "sub2", table "tab3" has started
>> 2022-03-18 01:33:15.617 CET [2551193][logical replication
>> worker][4/62:0][] LOG:  logical replication apply worker for
>> subscription "sub2" will restart because of a parameter change
>>
>> You will notice that the apply worker is never restarted after a
>> parameter change. The reason was that the particular subscription
>> reaches the limit of max_sync_workers_per_subscription after which we
>> don't allow restarting the apply worker. I think you might want to
>> increase the values of
>> max_sync_workers_per_subscription/max_logical_replication_workers to
>> make it work.
>>
> 
> Hmmm. So the theory is that in most runs we manage to sync the tables
> faster than starting the workers, so we don't hit the limit. But on some
> machines the sync worker takes a bit longer, we hit the limit. Seems
> possible, yes. Unfortunately we don't seem to log anything when we hit
> the limit, so hard to say for sure :-( I suggest we add a WARNING
> message to logicalrep_worker_launch or something. Not just because of
> this test, it seems useful in general.
> 
> However, how come we don't retry the sync? Surely we don't just give up
> forever, that'd be a pretty annoying behavior. Presumably we just end up
> sleeping for a long time before restarting the sync worker, somewhere.
> 

I tried lowering max_sync_workers_per_subscription to 1 and making
the workers run for a couple of seconds (doing some CPU-intensive
stuff), but everything still works just fine.

Looking a bit closer at the logs (from pogona and others), I doubt this
is about hitting the max_sync_workers_per_subscription limit. Notice we
start two sync workers, but neither of them ever completes. So we never
update the sync status or start syncing the remaining tables.
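
The log check described above can be sketched as a small script. This is
only an illustration in Python, not part of the test suite: the helper name
unfinished_sync_workers is hypothetical, the sample input is the three
relevant entries from the pogona log quoted earlier, and the "has finished"
message wording is my assumption about what a completed sync worker logs.

```python
import re

# Matches the tablesync worker messages. The "has started" wording is taken
# from the quoted pogona log; "has finished" is an assumed counterpart.
SYNC_MSG = re.compile(
    r'table synchronization worker for subscription "([^"]+)", '
    r'table "([^"]+)" has (started|finished)'
)

def unfinished_sync_workers(log_text):
    """Return (subscription, table) pairs that started but never finished."""
    # Log entries may be wrapped across lines, so collapse all whitespace
    # before matching.
    flat = re.sub(r"\s+", " ", log_text)
    pending = {}  # dicts keep insertion order, so the start order is preserved
    for sub, table, state in SYNC_MSG.findall(flat):
        if state == "started":
            pending[(sub, table)] = True
        else:
            pending.pop((sub, table), None)
    return list(pending)

SAMPLE_LOG = """
2022-03-18 01:33:15.607 CET [2551209][logical replication
worker][3/76:0][] LOG:  logical replication table synchronization
worker for subscription "sub2", table "tab4_1" has started
2022-03-18 01:33:15.609 CET [2551211][logical replication
worker][5/11:0][] LOG:  logical replication table synchronization
worker for subscription "sub2", table "tab3" has started
2022-03-18 01:33:15.617 CET [2551193][logical replication
worker][4/62:0][] LOG:  logical replication apply worker for
subscription "sub2" will restart because of a parameter change
"""

print(unfinished_sync_workers(SAMPLE_LOG))
```

On the sample this reports both tab4_1 and tab3 as still pending, which is
the pattern that makes me doubt the worker-limit theory.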

So the question is why those two sync workers never complete - I guess
there's some sort of lock wait (deadlock?) or infinite loop.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


