Re: Improving the latch handling between logical replication launcher and worker processes. - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Improving the latch handling between logical replication launcher and worker processes.
Date
Msg-id CAHut+Pv=a47ttZTABkrcR+=SX2qinmThBL72cu7Hz3wuqqTQ6Q@mail.gmail.com
Whole thread Raw
In response to Improving the latch handling between logical replication launcher and worker processes.  (vignesh C <vignesh21@gmail.com>)
Responses Re: Improving the latch handling between logical replication launcher and worker processes.
List pgsql-hackers
On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
>
...
> a) Introduce a new latch to handle worker attach and exit.

IIUC there is normally only the “shared” latch or a “process local”
latch. i.e. AFAICT is is quite uncommon to invent a new latches like
option a is proposing. I did not see any examples of making new
latches (e.g. MyLatchA, MyLatchB, MyLatchC) in the PostgreSQL source
other than that ‘recoveryWakeupLatch’ mentioned above by Hou-san. So
this option could be OK, but OTOH since there is hardly any precedent
maybe that should be taken as an indication to avoid doing it this way
(???)

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1].

I'm not sure that introducing a new GUC is a good option because this
seems a rare problem to have -- so it will be hard to tune since it
will be difficult to know you even have this problem and then
difficult to know that it is fixed. Anyway. this made me consider more
what the WaitLatch timeout value should be. Here are some thoughts:

Idea 1)

I was wondering where did that DEFAULT_NAPTIME_PER_CYCLE value of 180
seconds come from or was that just a made-up number? AFAICT it just
came into existence in the first pub/sub commit [1] but there is no
explanation for why 180s was chosen. Anyway, I assume a low value
(5s?) would be bad because it incurs unacceptable CPU usage, right?
But if 180s is too long and 5s is too short then what does a “good”
number even look like? E.g.,. if 60s is deemed OK, then is there any
harm in just defining DEFAULT_NAPTIME_PER_CYCLE to be 60s and leaving
it at that?

Idea 2)

Another idea could be to use a “dynamic timeout” in the WaitLatch of
ApplyLauncherMain. Let me try to explain my thought bubble:
- If the preceding foreach(lc, sublist) loop launched any workers then
the WaitLatch timeout can be a much shorter 10s
- If the preceding foreach(lc, sublist) loop did NOT launch any
workers (this would be the most common case) then the WaitLatch
timeout can be the usual 180s.

IIUC this strategy will still give any recently launched workers
enough time to attach shmem but it also means that any concurrent
CREATE SUBSCRIPTION will be addressed within 10s instead of 180s.
Maybe this is sufficient to make an already rare problem become
insignificant.


> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

This option c seems the easiest. Can you explain what are the
drawbacks of using this approach?

======
[1] github -
https://github.com/postgres/postgres/commit/665d1fad99e7b11678b0d5fa24d2898424243cd6#diff-127f8eb009151ec548d14c877a57a89d67da59e35ea09189411aed529c6341bf

Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning
Next
From: Pradeep Kumar
Date:
Subject: Re: Need clarification on compilation errors in PG 16.2