Re: Column Filtering in Logical Replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Column Filtering in Logical Replication
Msg-id CAA4eK1JzzoE61CY1qi9Vcdi742JFwG4YA3XpoMHwfKNhbFic6g@mail.gmail.com
In response to Re: Column Filtering in Logical Replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Column Filtering in Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/18/22 15:43, Tomas Vondra wrote:
> >>
> >
> > Hmmm. So the theory is that in most runs we manage to sync the tables
> > faster than starting the workers, so we don't hit the limit. But on some
> > machines the sync worker takes a bit longer, we hit the limit. Seems
> > possible, yes. Unfortunately we don't seem to log anything when we hit
> > the limit, so hard to say for sure :-( I suggest we add a WARNING
> > message to logicalrep_worker_launch or something. Not just because of
> > this test, it seems useful in general.
> >
> > However, how come we don't retry the sync? Surely we don't just give up
> > forever, that'd be a pretty annoying behavior. Presumably we just end up
> > sleeping for a long time before restarting the sync worker, somewhere.
> >
>
> I tried lowering the max_sync_workers_per_subscription to 1 and making
> the workers to run for a couple seconds (doing some CPU intensive
> stuff), but everything still works just fine.
>

Did the apply worker restart during that time? If not, you can try
changing some subscription parameter, which leads to its restart. This
has to happen before copy_table has finished. In the logs, you should
see the message: "logical replication apply worker for subscription
"<subscription_name>" will restart because of a parameter change".
IIUC, the code that prevents the apply worker from restarting once
max_sync_workers_per_subscription is reached is as below:
logicalrep_worker_launch()
{
    ...
    if (nsyncworkers >= max_sync_workers_per_subscription)
    {
        LWLockRelease(LogicalRepWorkerLock);
        return;
    }
    ...
}

This happens before we allocate a worker for apply. So, it can happen
only during a restart of the apply worker, because we always start the
apply worker first; in that case, it will never restart.
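
To make the scenario concrete, here is a minimal toy model of the check
above. It is only a sketch, not the PostgreSQL source: ToySubscription,
toy_worker_launch and the is_sync flag are hypothetical names invented
for illustration, and the model assumes (as described above) that the
limit check is applied to every launch request, including a relaunch of
the apply worker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in PostgreSQL. */
typedef struct ToySubscription
{
    int  max_sync_workers;  /* stands in for max_sync_workers_per_subscription */
    int  nsyncworkers;      /* tablesync workers currently running */
    bool apply_running;     /* is the apply worker alive? */
} ToySubscription;

/*
 * Try to launch a worker for the subscription.
 * is_sync = true requests a tablesync worker, false requests the apply worker.
 * Returns true if the worker was launched.
 */
static bool
toy_worker_launch(ToySubscription *sub, bool is_sync)
{
    assert(sub->max_sync_workers >= 0);

    /* Same shape as the quoted check: it runs for every launch request. */
    if (sub->nsyncworkers >= sub->max_sync_workers)
        return false;

    if (is_sync)
        sub->nsyncworkers++;
    else
        sub->apply_running = true;

    return true;
}
```

In this model, with a limit of 2, the first apply launch succeeds
because no sync workers exist yet. Once two sync workers are running
and the apply worker exits (say, after a parameter change), the
relaunch request is refused every time, because the sync workers never
finish and the count never drops.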

> Looking a bit closer at the logs (from pogona and other), I doubt this
> is about hitting the max_sync_workers_per_subscription limit. Notice we
> start two sync workers, but neither of them ever completes. So we never
> update the sync status or start syncing the remaining tables.
>

I think they never complete because they are in a sort of
infinite loop. If you look at process_syncing_tables_for_sync(), it will
never mark the status as SUBREL_STATE_SYNCDONE unless the apply worker
has set it to SUBREL_STATE_CATCHUP. In LogicalRepSyncTableStart(), we do
wait for the state to change to catchup via wait_for_worker_state_change(),
but we bail out of that function if the apply worker has died. After
that, the tablesync worker won't be able to complete because, in our
case, the apply worker won't be able to restart.
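
The dependency described above can be sketched as a tiny state machine.
Again, this is a hedged illustration and not the actual tablesync code:
the Toy* names, the TOY_WAITING state and the round-based loop are all
assumptions made up for the example. The only thing taken from the
description is that the sync side may move to "done" only after the
apply side has moved the state to "catchup".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, loosely mirroring the SUBREL_STATE_* idea. */
typedef enum ToyRelState
{
    TOY_WAITING,   /* sync worker waits for the apply worker */
    TOY_CATCHUP,   /* set only by the apply worker */
    TOY_SYNCDONE   /* set only by the sync worker, and only from CATCHUP */
} ToyRelState;

typedef struct ToySyncCtx
{
    ToyRelState state;
    bool        apply_alive;
} ToySyncCtx;

/* Apply side: a live apply worker moves a waiting table to CATCHUP. */
static void
toy_apply_step(ToySyncCtx *ctx)
{
    if (ctx->apply_alive && ctx->state == TOY_WAITING)
        ctx->state = TOY_CATCHUP;
}

/* Sync side: may finish only if the apply side already set CATCHUP. */
static bool
toy_sync_try_finish(ToySyncCtx *ctx)
{
    if (ctx->state != TOY_CATCHUP)
        return false;
    ctx->state = TOY_SYNCDONE;
    return true;
}

/*
 * Run both sides for at most max_rounds rounds.
 * Returns the round in which the table reached SYNCDONE, or -1 if it never did.
 */
static int
toy_run(ToySyncCtx *ctx, int max_rounds)
{
    assert(max_rounds >= 0);

    for (int round = 1; round <= max_rounds; round++)
    {
        toy_apply_step(ctx);
        if (toy_sync_try_finish(ctx))
            return round;
    }
    return -1;
}
```

With apply_alive set to true the table reaches TOY_SYNCDONE in the first
round. With apply_alive set to false it stays in TOY_WAITING no matter
how many rounds are allowed, which is the stuck situation suspected
here.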

> So the question is why those two sync workers never complete - I guess
> there's some sort of lock wait (deadlock?) or infinite loop.
>

It would be a bit tricky to reproduce this even if the above theory is
correct, but I'll try it today or tomorrow.

-- 
With Regards,
Amit Kapila.


