Re: pg_upgrade: optimize replication slot caught-up check - Mailing list pgsql-hackers
| From | Chao Li |
|---|---|
| Subject | Re: pg_upgrade: optimize replication slot caught-up check |
| Date | |
| Msg-id | 614F0F15-47E8-47A3-BE00-B8500586BB2E@gmail.com Whole thread Raw |
| In response to | Re: pg_upgrade: optimize replication slot caught-up check (Masahiko Sawada <sawada.mshk@gmail.com>) |
| List | pgsql-hackers |
> On Jan 10, 2026, at 05:32, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 9, 2026 at 10:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Thu, Jan 8, 2026 at 6:59 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>>
>>>
>>>
>>> I just looked into v3. Basically, it now does a shared WAL scan to find the newest decodable LSN and uses that to
comparewith all slots’ confirmed_flush_lsn, which significantly reduces WAL scan effort when there are many slots.
>>
>> Thank you for reviewing the patch!
>>
>>>
>>> One thing I'm thinking about is that if all slots are far behind, the shared scan may still take a long time.
Beforethis change, a scan could stop as soon as it found a pending WAL. So after the change, when there are only a few
slotsand they are far behind, the scan might end up doing more work than before.
>>
>> That's a valid concern.
>>
>>> As a possible optimization, maybe we could also pass the newest confirmed_flush_lsn to the scan. Once it finds a
decodableWAL record newer than that confirmed_flush_lsn, we already know all slots are behind, so the scan could stop
atthat point.
>>
>> Sounds like a reasonable idea. I'll give it a try and see how it's worthwhile.
>>
>>>
>>> WRT the code change, I got a few comments:
>>>
>>> 1
>>> ···
>>> + * otherwise false. If last_pending_wal_p is set by the caller, it continues
>>> + * scanning WAL even after detecting a decodable WAL record, in order to
>>> + * get the last decodable WAL record's LSN.
>>> */
>>> bool
>>> -LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal)
>>> +LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal,
>>> + XLogRecPtr *last_pending_wal_p)
>>> {
>>> bool has_pending_wal = false;
>>>
>>> Assert(MyReplicationSlot);
>>>
>>> + if (last_pending_wal_p != NULL)
>>> + *last_pending_wal_p = InvalidXLogRecPtr;
>>> ···
>>>
>>> The header comment seems to conflict to the code. last_pending_wal_p is unconditionally set to InvalidXLogRecPtr,
sowhatever a caller set is overwritten. I think you meant to say “if last_pending_wal_p is not NULL”.
>>
>> Agreed.
>>
>>>
>>> 2
>>> ```
>>> @@ -286,9 +287,9 @@ binary_upgrade_logical_slot_has_caught_up(PG_FUNCTION_ARGS)
>>> {
>>> Name slot_name;
>>> XLogRecPtr end_of_wal;
>>> - bool found_pending_wal;
>>> + XLogRecPtr last_pending_wal;
>>> ```
>>>
>>> The function header comment still says “returns true if …”, that should be updated.
>>>
>>> Also, with the change, the function name becomes misleading, where “has” implies a boolean result, but now it will
returnthe newest docodeable wal when no catching up. The function name doesn’t reflect to the actual behavior. Looks
likethe function is only used by pg_upgrade, so maybe rename it.
>>
>> Agreed, I'll incorporate the comment in the next version patch.
>>
>
> I've attached the updated patch that addressed all review comments.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
> <v4-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>
A few comments on v4:
1
```
- * slot is considered caught up is done by an upgrade function. This
- * regards the slot as caught up if we don't find any decodable changes.
- * See binary_upgrade_logical_slot_has_caught_up().
+ * slot is considered caught up is done by an upgrade function, unless the
+ * caller sets skip_caught_up_check. This regards the slot as caught up if
+ * we don't find any decodable changes. See
+ * binary_upgrade_logical_slot_has_caught_up().
```
binary_upgrade_logical_slot_has_caught_up has been renamed, so this commend needs to be updated.
2
```
+ "temporary IS FALSE "
+ "ORDER BY 1;",
+ (skip_caught_up_check || user_opts.live_check) ? "FALSE" :
"(CASE WHEN invalidation_reason IS NOT NULL THEN FALSE "
"ELSE (SELECT pg_catalog.binary_upgrade_logical_slot_has_caught_up(slot_name)) "
"END)");
```
pg_catalog.binary_upgrade_logical_slot_has_caught_up has been renamed and it takes two parameters now.
3
```
+ if (last_pending_wal > scan_cutoff_lsn)
+ break;
```
In LogicalReplicationSlotHasPendingWal() we early break when last_pending_wal > scan_cutoff_lsn, and later the SQL
check“confirmed_flush_lsn > last_pending_wal”. So there is an edge case, where last_pending_wal == scan_cutoff_lsn and
confirmed_flush_lsn== last_pending_wal, then neither early break nor caught up happens.
So, I think we should use “>=“ for the both checks.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
pgsql-hackers by date: