Re: pg_upgrade: optimize replication slot caught-up check - Mailing list pgsql-hackers
| From | Chao Li |
|---|---|
| Subject | Re: pg_upgrade: optimize replication slot caught-up check |
| Date | |
| Msg-id | B098C626-EE7D-4E98-8685-FBB7980C795E@gmail.com Whole thread Raw |
| In response to | Re: pg_upgrade: optimize replication slot caught-up check (Masahiko Sawada <sawada.mshk@gmail.com>) |
| List | pgsql-hackers |
> On Jan 23, 2026, at 04:33, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 21, 2026 at 7:49 PM shveta malik <shveta.malik@gmail.com> wrote: >> >> On Thu, Jan 22, 2026 at 5:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> >>> On Mon, Jan 19, 2026 at 10:38 PM shveta malik <shveta.malik@gmail.com> wrote: >>>> >>>> On Wed, Jan 14, 2026 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> >>>>> I've attached the updated patch. >>>>> >>>> >>>> Thank You for the patch. I like the idea of optimization. Few initial comments: >>> >>> Thank you for reviewing the patch! >>> >>>> >>>> 1) >>>> + * The query returns the slot names and their caught-up status in >>>> + * the same order as the results collected by >>>> + * get_old_cluster_logical_slot_infos(). If this query is changed, >>>> >>>> I could not find the function get_old_cluster_logical_slot_infos(), do >>>> you mean get_old_cluster_logical_slot_infos_query()? >>> >>> It seems an oversight in commit 6d3d2e8e541f0. I think it should be >>> get_db_rel_and_slot_infos(). >>> >>>> >>>> 2) >>>> " WHERE database = current_database() AND " >>>> " slot_type = 'logical' AND " >>>> >>>> Is there a reason why database = current_database() is placed before >>>> slot_type = 'logical'? I am not sure how the PostgreSQL optimizer and >>>> executor will order these predicates, but from the first look, >>>> slot_type = 'logical' appears cheaper and could be placed first, >>>> consistent with the ordering used at other places. >>> >>> Changed. >>> >>>> >>>> 3) >>>> Shouldn’t we add a sanity check inside >>>> get_old_cluster_logical_slot_infos_query() to ensure that when >>>> skip_caught_up_check is true, we are on PostgreSQL 18 or lower? This >>>> would make the function safer for future use if it's called elsewhere. >>>> I understand the caller already performs a similar check, but I think >>>> it's more appropriate here since we call >>>> binary_upgrade_logical_slot_has_caught_up() from inside, which doesn’t >>>> even exist on newer versions. >>> >>> What kind of sanity check did you mean? We can have a check with >>> pg_fatal() but it seems almost the same to me even if pg_upgrade fails >>> with an error due to missing >>> binary_upgrade_logical_slot_has_caught_up(). >> >> I was referring to a development-level sanity check, something like: >> >> /* skip_caught_up_check is required iff PG19 or newer */ >> Assert((GET_MAJOR_VERSION(cluster->major_version) >= 1900) == >> skip_caught_up_check); >> >> But performing this check requires access to the cluster version (or >> cluster information), which this function currently does not have. >> Given that, do you think it would make sense to pass the cluster as an >> argument to this function in order to perform the sanity check here? > > Hmm, I think it's better not to have the same check in multiple > places, but it might make sense to have > get_old_cluster_logical_slot_infos_query() decide whether to use the > fast method. I've updated the patch accordingly, please review it. > >> >>>> >>>> 4) >>>> +# Check the file content. While both test_slot1 and test_slot2 should >>>> be reporting >>>> +# that they have unconsumed WAL records, test_slot3 should not be reported as >>>> +# it has caught up. >>>> >>>> Can you please elaborate the reason behind test_slot3 not being >>>> reported? Also mention in the comment if possible. >>> >>> We advance test_slot3 to the current WAL LSN before executing >>> pg_upgrade, so the test_slot3 should have consumed all pending WALs. >>> Please refer to the following changes: >> >> I understand the test, and the comments are clear to me. I also >> understand that only test_slot3 is expected to be in the caught-up >> state. My questions were specifically about the following points: >> 1) Why do we expect 'slot3 caught-up' not to be mentioned in the LOG? >> Is it simply because there is no corresponding logging in the code, or >> is this behavior related to some aspect of your fix that I may have >> missed? >> >> 2) In general, we do not validate the absence of LOG messages in >> tests. Why is this considered a special case where such a check is >> appropriate? > > What LOG do you refer to? In these tests, we check the > invalid_logical_slots.txt file where pg_upgrade reports only invalid > slots (in terms of pg_upgrade). For test_slot3, it should not be > mentioned in that file as it has caught up. Given that the file has > only invalid slots, checking the absence of test_slot3 in the file > makes sense to me. > > > Regards, > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com > <v7-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch> I just went through v7, and overall looks good to me. Only nitpick is: ``` + * + * use_fast_caught_up_check is set to true on return if available in the given + * cluster. ``` Strictly speaking, use_fast_caught_up_check is a pointer, it cannot be set, it is *use_fast_caught_up_check that is set totrue. So, please add a star sign in front of use_fast_caught_up_check. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
pgsql-hackers by date: