Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Build-farm - intermittent error in 031_column_list.pl
Date
Msg-id CAA4eK1K40xhObN1MWO7=rzrJmo+oQ048O8drZ86-F7artVvwQQ@mail.gmail.com
Whole thread Raw
In response to Re: Build-farm - intermittent error in 031_column_list.pl  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses RE: Build-farm - intermittent error in 031_column_list.pl
List pgsql-hackers
On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 5/25/22 13:26, Amit Kapila wrote:
> > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >>
> >> It does "fix" the case of [1].  But AFAIS
> >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> >> loading in get_rel_sync_entry and the function doesn't seem to be
> >> assumed to return a invalid entry. (Since the flag is not checked
> >> nowhere else.)
> >>
> >> For example pgoutput_change does not check for the flag of the entry
> >> returned from the function before uses it, which is not seemingly
> >> safe. (I didn't check further, though)
> >>
> >> Don't we need to explicitly avoid using invalid entries outside the
> >> function?
> >>
> >
> > We decide that based on pubactions in the callers, so even if entry is
> > valid, it won't do anything.  Actually, we don't need to avoid setting
> > replication_valid flag as some of the publications for the table may
> > be already present. We can check if the publications_valid flag is set
> > while trying to validate the entry. Now, even if we don't find any
> > publications the replicate_valid flag will be set but none of the
> > actions will be set, so it won't do anything in the caller. Is this
> > better than the previous approach?
> >
>
> For the record, I'm not convinced this is the right way to fix the
> issue, as it may easily mask the real problem.
>
> We do silently ignore missing objects in various places, but only when
> either requested or when it's obvious it's expected and safe to ignore.
> But I'm not sure that applies here, in a clean way.
>
> Imagine you have a subscriber using two publications p1 and p2, and
> someone comes around and drops p1 by mistake. With the proposed patch,
> the subscription will notice this, but it'll continue sending data
> ignoring the missing publication. Yes, it will continue working, but
> it's quite possible this breaks the subscriber and it's be better to
> fail and stop replicating.
>

Ideally, shouldn't we disallow drop of publication in such cases where
it is part of some subscription? I know it will be tricky because some
subscriptions could be disabled.

> The other aspect I dislike is that we just stop caching publication
> info, forcing us to reload it for every replicated change/row. So even
> if dropping the publication happens not to "break" the subscriber (i.e.
> the data makes sense), this may easily cause performance issues, lag in
> the replication, and so on. And the users will have no idea why and/or
> how to fix it, because we just do this silently.
>

Yeah, this is true that if there are missing publications, it needs to
reload all the publications info again unless we build a mechanism to
change the existing cached entry by loading only required info. The
other thing we could do here is to LOG the info for missing
publications to make users aware of the fact. I think we can also
introduce a new option while defining/altering subscription to
indicate whether to continue on missing publication or not, that way
by default we will stop replication as we are doing now but users will
have a way to move replication.

BTW, what are the other options we have to fix the cases where
replication is broken (or the user has no clue on how to proceed) as
we are discussing the case here or the OP reported yet another case on
pgsql-bugs [1]?

[1] -
https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: "ERROR: latch already owned" on gharial
Next
From: Zhihong Yu
Date:
Subject: Re: adding status for COPY progress report