Re: logical apply worker's lock waits in subscriber can stall checkpointer in publisher - Mailing list pgsql-hackers

From	Fujii Masao
Subject	Re: logical apply worker's lock waits in subscriber can stall checkpointer in publisher
Date	January 29 17:33:17
Msg-id	CAHGQGwHn8NsKjGjkh+wGeoS9pc199UAbCpcC9+wrYPF7j3L4Gg@mail.gmail.com
In response to	RE: logical apply worker's lock waits in subscriber can stall checkpointer in publisher ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses	RE: logical apply worker's lock waits in subscriber can stall checkpointer in publisher
List	pgsql-hackers

On Thu, Jan 29, 2026 at 4:03 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Fujii-san,
>
> > While reviewing the patch at [1], I noticed a case where lock waits on
> > a logical apply worker in the subscriber can cause the checkpointer on
> > the publisher to stall. This seems like unexpected behavior and may
> > need to be addressed.
> >
> > The issue can occur as follows:
> >
> > 1. A logical apply worker on the subscriber blocks waiting for a lock.
> > 2. Because the apply worker cannot receive further messages, the walsender's
> >     send buffer on the publisher becomes full.
> > 3. If the walsender then encounters a max_slot_wal_keep_size error,
> >     it attempts to send an error message to the subscriber before exiting.
> >     However, with a full send buffer, the walsender blocks while trying to
> >     send this message.
> > 4. The checkpointer on the publisher calls InvalidateObsoleteReplicationSlots()
> >     and waits for the slot to be released. Since the walsender is stuck and
> >     the slot is not released, the checkpointer also becomes stuck.
>
> I confirmed this could happen if the max_slot_wal_keep_size is enabled
> (in other words, the value is not -1).
> Per my test, wal_sender_timeout cannot work well because the process is stuck at
> the lower layer, but tcp_user_timeout can terminate the process. Can we mention
> the workaround in the doc instead of fixing the code?
>
> It won't work for a Unix domain socket connection, but it's not realistic for the
> production stage.
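
For illustration only, steps 2 and 3 above can be reproduced outside
PostgreSQL with a plain socket pair. This is a minimal sketch, not walsender
code: the helper names (fill_send_buffer, try_send, drain) are hypothetical,
it assumes a Unix-like platform, and it uses non-blocking mode so that a send
which would hang shows up as BlockingIOError instead of a stuck process.

```python
import socket


def fill_send_buffer(sock, chunk=b"x" * 4096):
    """Queue data until the kernel accepts no more; return bytes queued."""
    sock.setblocking(False)
    queued = 0
    while True:
        try:
            queued += sock.send(chunk)
        except BlockingIOError:
            # Send buffer is full because the peer is not reading.
            return queued


def try_send(sock, payload):
    """Return True if payload could be queued without blocking."""
    try:
        sock.send(payload)
        return True
    except BlockingIOError:
        return False


def drain(sock):
    """Read everything currently pending on sock; return bytes read."""
    sock.setblocking(False)
    drained = 0
    while True:
        try:
            data = sock.recv(65536)
        except BlockingIOError:
            return drained
        if not data:
            return drained
        drained += len(data)


# "sender" stands in for the walsender, "receiver" for the apply worker.
sender, receiver = socket.socketpair()

# Step 2: the receiver never reads, so the sender's buffer fills up.
queued = fill_send_buffer(sender)

# Step 3: even a short error message can no longer be queued. A blocking
# socket would hang here instead of raising.
stuck = not try_send(sender, b"ERROR")

# Once the receiver resumes reading, the sender can make progress again.
drain(receiver)
unstuck = try_send(sender, b"ERROR")

sender.close()
receiver.close()
```

With a blocking socket, the second send would simply not return until the
peer reads, which corresponds to the state the walsender is reported to be in.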

That workaround doesn't seem to help on platforms that don't support
TCP_USER_TIMEOUT, i.e., where tcp_user_timeout is not available, right?
If I remember correctly, Windows is one of those platforms.
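
As a rough way to see the platform dependence, the sketch below checks
whether the socket option is exposed at all before trying to set it. It is
written in Python purely for illustration and is not how PostgreSQL does it;
the function names are hypothetical, and the runtime hasattr() test is only
an analogue of checking for the TCP_USER_TIMEOUT macro when building in C.

```python
import socket


def tcp_user_timeout_supported():
    """True if this platform's socket module exposes TCP_USER_TIMEOUT."""
    return hasattr(socket, "TCP_USER_TIMEOUT")


def set_tcp_user_timeout(sock, timeout_ms):
    """Set TCP_USER_TIMEOUT (milliseconds) on sock if the option exists.

    Returns False without touching the socket when the option is
    unavailable, in which case the workaround does not apply.
    """
    if not tcp_user_timeout_supported():
        return False
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return True
```

On a platform where tcp_user_timeout_supported() returns False, there is
nothing to set, so a documentation-only workaround would leave those users
exposed to the stall.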

Regards,

--
Fujii Masao
