On Wed, Oct 18, 2023 at 10:22:14AM +0900, Michael Paquier wrote:
> On Tue, Oct 17, 2023 at 09:50:52AM +0530, Amit Kapila wrote:
>> On Tue, Oct 17, 2023 at 4:46 AM Callahan, Drew <callaan@amazon.com> wrote:
>>> On the server side, we did not see evidence of WALSenders being launched. As a result, the gap kept increasing
>>> further and further since the workers would not transition to the catchup state after several hours due to this.
>>
>> One possibility is that the system has reached the
>> 'max_logical_replication_workers' limit, which prevents it from
>> launching the apply worker. If so, then consider increasing the
>> value of 'max_logical_replication_workers'. You can query
>> 'pg_stat_subscription' for more information about the workers. See
>> the description of the subscriber-side parameters [1].
>
> Hmm. So you basically mean that not being able to launch new workers
> prevents the existing workers from moving on with their individual
> sync, freeing slots for other tables once their sync is done. Then,
> this causes all of the existing workers to remain in a syncwait
> state, further increasing the gap in WAL replay. Am I getting that
> right?
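
As a side note, whether the subscriber is saturating its worker slots
can be roughly checked by comparing pg_stat_subscription with the GUCs
in place, say with something like that (just a sketch):
  -- Apply workers show up with relid NULL, table sync workers with
  -- relid set to the table being synchronized.
  SELECT (relid IS NOT NULL) AS is_tablesync, count(*) AS workers
    FROM pg_stat_subscription WHERE pid IS NOT NULL GROUP BY 1;
  SHOW max_logical_replication_workers;
  SHOW max_sync_workers_per_subscription;
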
I was looking more at what we could do here and played with the apply
and sync workers, and I really doubt that ERROR-ing in a sync worker
on v14~ is a good idea just for the sake of freeing bgworker slots so
as to spawn more apply workers. I mean, if your system is under
pressure because of a lack of bgworker slots, assuming that a set of
N apply workers is gone, leaving M sync workers (M >= N) lying around
without being able to get updates from an apply worker, ERRORs in the
sync workers would give more room to spawn apply workers, but then
the launcher may fight for the freed slots when attempting to spawn
sync workers for the fresh apply workers, no? It also seems to make
the reasoning around the table states harder to think about, as any
future apply worker would need to handle more error cases related to
sync workers.
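
For reference, the persistent side of these table states is visible in
pg_subscription_rel, with something like that giving a quick picture
(a sketch; in-memory states like syncwait and catchup never show up in
the catalog):
  -- srsubstate: i = init, d = data copy, f = finished copy,
  -- s = sync done, r = ready.
  SELECT srsubstate, count(*) FROM pg_subscription_rel
    GROUP BY srsubstate;
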
I got a few extra questions about all that:
- How did the apply worker stop? That looks like the origin of the
issue.
- Did logicalrep_worker_launch() emit some WARNINGs because the
cluster was running out of slots, preventing the spawn of a new apply
worker to replace the one that went down, so that the table sync
workers would be able to move on with their sync phases? I guess it
did. Their frequency would offer hints about the pressure put on the
system (a quick cross-check from the subscriber is sketched below).
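
If the logs are not conclusive, the set of logical replication
processes still alive can be cross-checked from the subscriber, say
with that (again just a sketch):
  -- The launcher and the workers are reported with dedicated
  -- backend_type values; a missing apply worker entry would confirm
  -- that it has gone away.
  SELECT pid, backend_type, wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE backend_type IN ('logical replication launcher',
                           'logical replication worker');
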
--
Michael