Re: pg_upgrade: optimize replication slot caught-up check - Mailing list pgsql-hackers

From Chao Li
Subject Re: pg_upgrade: optimize replication slot caught-up check
Date
Msg-id 00AA6098-1ECA-4FC9-839F-BFE26E4508E4@gmail.com
Whole thread Raw
In response to pg_upgrade: optimize replication slot caught-up check  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: pg_upgrade: optimize replication slot caught-up check
List pgsql-hackers

> On Jan 6, 2026, at 02:02, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi all,
>
> Commit 29d0a77fa6 improved pg_upgrade to allow migrating logical
> slots. Currently, to check if the slots are ready to be migrated, we
> call binary_upgrade_logical_slot_has_caught_up() for every single
> slot. This checks if there are any unconsumed WAL records. However, we
> noticed a performance issue. If there are many slots (e.g., 100 or
> more) or if there is a WAL backlog, checking all slots one by one
> takes a long time.
>
> Here are some test results from my environment:
> With an empty cluster: 1.55s
> With 200 slots and 30MB backlog: 15.51s
>
> Commit 6d3d2e8e5 introduced parallel checks per database, but a single
> job might still have to check too many slots, causing delays.
>
> Since binary_upgrade_logical_slot_has_caught_up() essentially checks
> if any decodable record exists in the database, IIUC it is not
> necessary to check every slot. We can optimize this by checking only
> the slot with the minimum confirmed_flush_lsn. If that slot is caught
> up, we can assume others are too. The attached patch implements this
> optimization. With the patch, the test with 200 slots finished in
> 2.512s. The execution time is now stable regardless of the number of
> slots.
>
> One thing to note is that DecodeTXNNeedSkip() also considers
> replication origin filters. Theoretically, a plugin could filter out
> specific origins, which might lead to different results. However, this
> is a very rare case. Even if it happens, it would just result in a
> false positive (the upgrade fails safely), so the impact is minimal.
> Therefore, the patch simplifies the check to be per-database instead
> of per-slot.
>
> Feedback is very welcome.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
> <v1-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>

Hi Masahiko-san,

Basically I think the optimization idea makes sense to me. Reducing the check from O(slots × WAL) to O(databases × WAL)
soundsa win. Just a few small comments: 

1
```
+static void process_old_cluter_logical_slot_catchup_infos(DbInfo *dbinfo, PGresult *res, void *arg);
```

Typo in the function name: cluter->cluster

2
···
-        logical_slot_infos_query = get_old_cluster_logical_slot_infos_query();
+        const char *slot_info_query = "SELECT slot_name, plugin, two_phase, failover, “
···

logical_slot_infos_query is no longer used, should be removed.

3
```
+                "  ORDER BY confirmed_flush_lsn ASC "
```

I am thinking, if confirmed_flush_lsn can be NULL? If that’s true, we may want to add “NULLS LAST”.

4
```
+    Assert(num_slots == dbinfo->slot_arr.nslots);
+
+    for (int i = 0; i < num_slots; i++)
+    {
+        char       *slotname;
+        bool        caught_up;
+
+        slotname = PQgetvalue(res, i, PQfnumber(res, "slot_name"));
+        caught_up = (strcmp(PQgetvalue(res, i, PQfnumber(res, "caught_up")), "t") == 0);
+
+        for (int slotnum = 0; slotnum < dbinfo->slot_arr.nslots; slotnum++)
+        {
```

As num_slots == dbinfo->slot_arr.nslots, this is still a O(num_slots^2) check. Given this patch’s goal is to eliminate
theburden from large number of slots, maybe we can build a hash table to make the check faster. 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: apply_scanjoin_target_to_paths and partitionwise join
Next
From: Fujii Masao
Date:
Subject: Re: Allow GUC settings in CREATE SUBSCRIPTION CONNECTION to take effect