Re: Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value
Msg-id CALj2ACUtyW94TF76WEM-2JvMMD1a1PzLuaW5Qd9rrKRgnMAZnw@mail.gmail.com
In response to Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value  (sirisha chamarthi <sirichamarthi22@gmail.com>)
Responses Re: Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On Wed, Oct 19, 2022 at 12:39 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
>
> Hi Hackers,
>
> The current code comment says that the replication stream on a slot with
> the given targetLSN can't continue after a restart but even without a
> restart the stream cannot continue. The slot is invalidated and the
> walsender process is terminated by the checkpoint process. Attaching a
> small patch to fix the comment.
>
> 2022-10-19 06:26:22.387 UTC [144482] STATEMENT:  START_REPLICATION SLOT "s2" LOGICAL 0/0
> 2022-10-19 06:27:41.998 UTC [2553755] LOG:  checkpoint starting: time
> 2022-10-19 06:28:04.974 UTC [2553755] LOG:  terminating process 144482 to release replication slot "s2"
> 2022-10-19 06:28:04.974 UTC [144482] FATAL:  terminating connection due to administrator command
> 2022-10-19 06:28:04.974 UTC [144482] CONTEXT:  slot "s2", output plugin "test_decoding", in the change callback, associated LSN 0/1E23AB68
> 2022-10-19 06:28:04.974 UTC [144482] STATEMENT:  START_REPLICATION SLOT "s2" LOGICAL 0/0

I think the walsender/replication stream can still continue until the
checkpointer signals it to terminate. There's an illuminating comment
(see [1]) describing when that can happen. It means that
GetWALAvailability() can return WALAVAIL_REMOVED while the checkpointer
hasn't yet signalled, or is still in the process of signalling, the
walsender to terminate.

 * * WALAVAIL_REMOVED means it has been removed. A replication stream on
 *   a slot with this LSN cannot continue after a restart.

The existing comment above says that the slot isn't usable if
"someone" (the checkpointer, the walsender, or the entire server itself)
gets restarted. It looks fine, no?

[1]
            case WALAVAIL_REMOVED:

                /*
                 * If we read the restart_lsn long enough ago, maybe that file
                 * has been removed by now.  However, the walsender could have
                 * moved forward enough that it jumped to another file after
                 * we looked.  If checkpointer signalled the process to
                 * termination, then it's definitely lost; but if a process is
                 * still alive, then "unreserved" seems more appropriate.
                 *
                 * If we do change it, save the state for safe_wal_size below.
                 */
                if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
                {
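
The excerpt in [1] is cut off at the opening brace. As a rough illustration of the decision that comment describes, here is a minimal standalone sketch. It is not the PostgreSQL source: the types SketchWalAvail and SketchSlot and the function sketch_wal_status() are hypothetical stand-ins, locking is omitted, and a restart_lsn of 0 is assumed to model an invalid LSN.

```c
#include <stdint.h>

/* Hypothetical stand-in for the WAL availability states (not the real enum). */
typedef enum
{
    SKETCH_WAL_RESERVED,
    SKETCH_WAL_EXTENDED,
    SKETCH_WAL_UNRESERVED,
    SKETCH_WAL_REMOVED
} SketchWalAvail;

/* Hypothetical, simplified view of a replication slot. */
typedef struct
{
    uint64_t restart_lsn;   /* assumption: 0 models an invalid restart_lsn */
    int      active_pid;    /* assumption: 0 models "no process holds the slot" */
} SketchSlot;

/*
 * Map an availability state to a status string.
 *
 * For the "removed" state, follow the reasoning in the quoted comment: if
 * the slot still has a valid restart_lsn and a process is still attached,
 * the walsender may have moved past the removed segment, so report
 * "unreserved".  Otherwise the WAL is considered lost.
 */
const char *
sketch_wal_status(SketchWalAvail avail, const SketchSlot *slot)
{
    switch (avail)
    {
        case SKETCH_WAL_RESERVED:
            return "reserved";
        case SKETCH_WAL_EXTENDED:
            return "extended";
        case SKETCH_WAL_UNRESERVED:
            return "unreserved";
        case SKETCH_WAL_REMOVED:
            if (slot->restart_lsn != 0 && slot->active_pid != 0)
                return "unreserved";    /* process still alive */
            return "lost";              /* terminated or slot invalidated */
    }
    return "unknown";
}
```

With this model, a slot whose walsender is still running when the "removed" state is observed is reported as "unreserved", and it becomes "lost" only once the process is gone or the slot's restart_lsn has been invalidated. That is the window discussed above, in which the stream can keep going for a while without any restart.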

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


