RE: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers

From osumi.takamichi@fujitsu.com
Subject RE: Build-farm - intermittent error in 031_column_list.pl
Date
Msg-id TYCPR01MB83738A1CE0AC7B63C1E95F68EDDF9@TYCPR01MB8373.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Build-farm - intermittent error in 031_column_list.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Build-farm - intermittent error in 031_column_list.pl
List pgsql-hackers
On Thursday, May 26, 2022 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 5/25/22 13:26, Amit Kapila wrote:
> > > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > >>
> > >> It does "fix" the case of [1].  But AFAIS
> > >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> > >> loading in get_rel_sync_entry and the function doesn't seem to be
> > >> assumed to return a invalid entry. (Since the flag is not checked
> > >> nowhere else.)
> > >>
> > >> For example pgoutput_change does not check for the flag of the
> > >> entry returned from the function before uses it, which is not
> > >> seemingly safe. (I didn't check further, though)
> > >>
> > >> Don't we need to explicitly avoid using invalid entries outside the
> > >> function?
> > >>
> > >
> > > We decide that based on pubactions in the callers, so even if entry
> > > is valid, it won't do anything.  Actually, we don't need to avoid
> > > setting replication_valid flag as some of the publications for the
> > > table may be already present. We can check if the publications_valid
> > > flag is set while trying to validate the entry. Now, even if we
> > > don't find any publications the replicate_valid flag will be set but
> > > none of the actions will be set, so it won't do anything in the
> > > caller. Is this better than the previous approach?
> > >
> >
> > For the record, I'm not convinced this is the right way to fix the
> > issue, as it may easily mask the real problem.
> >
> > We do silently ignore missing objects in various places, but only when
> > either requested or when it's obvious it's expected and safe to ignore.
> > But I'm not sure that applies here, in a clean way.
> >
> > Imagine you have a subscriber using two publications p1 and p2, and
> > someone comes around and drops p1 by mistake. With the proposed patch,
> > the subscription will notice this, but it'll continue sending data
> > ignoring the missing publication. Yes, it will continue working, but
> > it's quite possible this breaks the subscriber and it's be better to
> > fail and stop replicating.
> >
> 
> Ideally, shouldn't we disallow drop of publication in such cases where it is part
> of some subscription? I know it will be tricky because some subscriptions
> could be disabled.
> 
> > The other aspect I dislike is that we just stop caching publication
> > info, forcing us to reload it for every replicated change/row. So even
> > if dropping the publication happens not to "break" the subscriber (i.e.
> > the data makes sense), this may easily cause performance issues, lag
> > in the replication, and so on. And the users will have no idea why
> > and/or how to fix it, because we just do this silently.
> >
> 
> Yeah, this is true that if there are missing publications, it needs to reload all the
> publications info again unless we build a mechanism to change the existing
> cached entry by loading only required info. The other thing we could do here is
> to LOG the info for missing publications to make users aware of the fact. I think
> we can also introduce a new option while defining/altering subscription to
> indicate whether to continue on missing publication or not, that way by default
> we will stop replication as we are doing now but users will have a way to move
> replication.
> 
> BTW, what are the other options we have to fix the cases where replication is
> broken (or the user has no clue on how to proceed) as we are discussing the
> case here or the OP reported yet another case on pgsql-bugs [1]?
Hi, 


FYI, I've noticed that after the last report by Peter-san
we've gotten the same errors on Build Farm.
We need to keep discussing to conclude this.


1. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-05-31 13:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04


2. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-05-26 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04


Best Regards,
    Takamichi Osumi


pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)