logical apply worker's lock waits in subscriber can stall checkpointer in publisher - Mailing list pgsql-hackers

From Fujii Masao
Subject logical apply worker's lock waits in subscriber can stall checkpointer in publisher
Msg-id CAHGQGwFOW_EWtUa-8sTL21KGsWy76CaQZF-FarZqur2RONk3nA@mail.gmail.com
List pgsql-hackers
Hi,

While reviewing the patch at [1], I noticed a case where lock waits on
a logical apply worker in the subscriber can cause the checkpointer on
the publisher to stall. This seems like unexpected behavior and may
need to be addressed.

The issue can occur as follows:

1. A logical apply worker on the subscriber blocks waiting for a lock.
2. Because the apply worker cannot receive further messages, the walsender's
    send buffer on the publisher becomes full.
3. If the walsender then encounters a max_slot_wal_keep_size error,
    it attempts to send an error message to the subscriber before exiting.
    However, with a full send buffer, the walsender blocks while trying to
    send this message.
4. The checkpointer on the publisher calls InvalidateObsoleteReplicationSlots()
    and waits for the slot to be released. Since the walsender is stuck and
    the slot is not released, the checkpointer also becomes stuck.
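
As a rough illustration of steps 2 and 3 (plain Python sockets, not
PostgreSQL code), the sketch below connects two sockets, never reads from
one end, and writes to the other until the kernel refuses more data. The
names walsender and apply_worker are only labels for the two ends, and the
message text is made up. A blocking send() at the point where the
non-blocking one fails would simply hang, which is the state the walsender
ends up in.

```python
import socket


def fill_send_buffer(sock):
    """Write to a non-blocking socket until the kernel accepts no more data.

    Returns the number of bytes queued. A large chunk is tried first, then
    single bytes, so that no free space is left when the function returns.
    """
    sock.setblocking(False)
    total = 0
    for chunk in (b"x" * 4096, b"x"):
        while True:
            try:
                total += sock.send(chunk)
            except BlockingIOError:
                break  # this chunk size no longer fits; try the next one
    return total


# Two connected ends. The "apply worker" end never calls recv(), as if it
# were blocked waiting for a lock.
walsender, apply_worker = socket.socketpair()
queued = fill_send_buffer(walsender)

# With the buffer full, even a short final message cannot be queued.
try:
    walsender.send(b"ERROR: hypothetical final message")
    final_message_sent = True
except BlockingIOError:
    final_message_sent = False

print("bytes queued before the buffer filled:", queued)
print("final message sent:", final_message_sent)

walsender.close()
apply_worker.close()
```

The number of bytes queued depends on the platform's socket buffer sizes,
so the sketch prints it rather than assuming a value.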

This behavior seems problematic, doesn't it?

One possible approach to address this issue would be to make the walsender
send the error message in non-blocking mode. Even if the send buffer is full,
the walsender could then exit, allowing the slot to be released and
the checkpointer to proceed. This would mean that, in some cases,
the final error message might not reach the subscriber, which seems
acceptable to me, though others may disagree.
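
A minimal sketch of that idea, again with plain Python sockets rather than
walsender code: the final message is attempted once in non-blocking mode,
and the slot is released whether or not the message could be queued. The
slot dictionary, the active_pid field as used here, the PID value and the
message text are all hypothetical stand-ins.

```python
import socket


def send_best_effort(sock, message):
    """Try once to queue message without blocking.

    Returns True only if the whole message was queued. If the send buffer
    is full the message is dropped and False is returned.
    """
    sock.setblocking(False)
    try:
        sent = sock.send(message)
    except BlockingIOError:
        return False
    return sent == len(message)


def walsender_exit(sock, slot, message):
    """Attempt the final message, then release the slot unconditionally."""
    delivered = send_best_effort(sock, message)
    slot["active_pid"] = None  # release happens even if the message was lost
    sock.close()
    return delivered


def fill_send_buffer(sock):
    """Fill the socket's send buffer so that further sends cannot succeed."""
    sock.setblocking(False)
    for chunk in (b"x" * 4096, b"x"):
        while True:
            try:
                sock.send(chunk)
            except BlockingIOError:
                break


message = b"ERROR: hypothetical final message"

# Case 1: the peer is stalled and the send buffer is already full.
sender, stalled_peer = socket.socketpair()
fill_send_buffer(sender)
slot_a = {"name": "sub_a", "active_pid": 4242}
delivered_stalled = walsender_exit(sender, slot_a, message)

# Case 2: the peer is healthy, so the message is queued and can be read.
sender, healthy_peer = socket.socketpair()
slot_b = {"name": "sub_b", "active_pid": 4243}
delivered_healthy = walsender_exit(sender, slot_b, message)
received = healthy_peer.recv(1024)

stalled_peer.close()
healthy_peer.close()
```

In the stalled case the message is lost but the slot is free, which is the
trade-off described above. In the healthy case nothing changes from the
subscriber's point of view.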

This approach would also help when users want to terminate a walsender
via pg_terminate_backend() but the send buffer is full. Today, the walsender
can similarly block in that case while trying to send the error message.

Another idea would be to change the checkpointer so that
InvalidateObsoleteReplicationSlots() operates in a non-blocking manner.
I'm not sure whether that is feasible, but if immediate invalidation is not
strictly required, the checkpointer could give up and retry later.
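
To make the "give up and retry later" idea concrete, here is a toy model in
Python. It is not how InvalidateObsoleteReplicationSlots() is implemented;
the slot dictionaries, their fields, the integer LSNs and the function name
are all assumptions made for illustration. Each pass invalidates the
obsolete slots it can, and returns the ones still held by another process
instead of waiting for them.

```python
def try_invalidate_obsolete_slots(slots, oldest_kept_lsn):
    """Run one non-blocking invalidation pass over a list of slots.

    A slot is treated as obsolete when its restart_lsn is older than
    oldest_kept_lsn. Obsolete slots that are not in use are invalidated.
    Obsolete slots that are still held are left alone and their names are
    returned so that the caller can retry on a later pass.
    """
    pending = []
    for slot in slots:
        if slot["invalidated"] or slot["restart_lsn"] >= oldest_kept_lsn:
            continue  # nothing to do for this slot
        if slot["active_pid"] is not None:
            # Real code would signal the owning process here. This sketch
            # only records the slot and moves on instead of waiting.
            pending.append(slot["name"])
            continue
        slot["invalidated"] = True
    return pending


slots = [
    {"name": "sub_a", "restart_lsn": 100, "active_pid": 4242, "invalidated": False},
    {"name": "sub_b", "restart_lsn": 120, "active_pid": None, "invalidated": False},
    {"name": "sub_c", "restart_lsn": 900, "active_pid": None, "invalidated": False},
]

# First pass: sub_a is still held by a stuck walsender, so it is deferred.
first_pass = try_invalidate_obsolete_slots(slots, oldest_kept_lsn=500)

# Later the walsender exits and releases the slot; the next pass finishes.
slots[0]["active_pid"] = None
second_pass = try_invalidate_obsolete_slots(slots, oldest_kept_lsn=500)
```

What a deferred slot would mean for WAL removal during the same checkpoint
is deliberately left out of this sketch.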

Thoughts?

Regards,

[1]
https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
Fujii Masao


