RE: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers
From: osumi.takamichi@fujitsu.com
Subject: RE: Build-farm - intermittent error in 031_column_list.pl
Date:
Msg-id: TYCPR01MB83738A1CE0AC7B63C1E95F68EDDF9@TYCPR01MB8373.jpnprd01.prod.outlook.com
In response to: Re: Build-farm - intermittent error in 031_column_list.pl (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Build-farm - intermittent error in 031_column_list.pl
List: pgsql-hackers
On Thursday, May 26, 2022 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 5/25/22 13:26, Amit Kapila wrote:
> > > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > >>
> > >> It does "fix" the case of [1]. But AFAIS
> > >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> > >> loading in get_rel_sync_entry, and the function doesn't seem to be
> > >> assumed to return an invalid entry. (Since the flag is not checked
> > >> anywhere else.)
> > >>
> > >> For example, pgoutput_change does not check the flag of the
> > >> entry returned from the function before using it, which does not
> > >> seem safe. (I didn't check further, though.)
> > >>
> > >> Don't we need to explicitly avoid using invalid entries outside the
> > >> function?
> > >>
> > >
> > > We decide that based on pubactions in the callers, so even if the entry
> > > is valid, it won't do anything. Actually, we don't need to avoid
> > > setting the replicate_valid flag, as some of the publications for the
> > > table may already be present. We can check if the publications_valid
> > > flag is set while trying to validate the entry. Now, even if we
> > > don't find any publications, the replicate_valid flag will be set but
> > > none of the actions will be set, so it won't do anything in the
> > > caller. Is this better than the previous approach?
> > >
> >
> > For the record, I'm not convinced this is the right way to fix the
> > issue, as it may easily mask the real problem.
> >
> > We do silently ignore missing objects in various places, but only when
> > either requested or when it's obvious it's expected and safe to ignore.
> > But I'm not sure that applies here, in a clean way.
> >
> > Imagine you have a subscriber using two publications p1 and p2, and
> > someone comes around and drops p1 by mistake. With the proposed patch,
> > the subscription will notice this, but it'll continue sending data,
> > ignoring the missing publication. Yes, it will continue working, but
> > it's quite possible this breaks the subscriber, and it'd be better to
> > fail and stop replicating.
> >
>
> Ideally, shouldn't we disallow dropping a publication in such cases, where
> it is part of some subscription? I know it will be tricky because some
> subscriptions could be disabled.
>
> > The other aspect I dislike is that we just stop caching publication
> > info, forcing us to reload it for every replicated change/row. So even
> > if dropping the publication happens not to "break" the subscriber (i.e.
> > the data makes sense), this may easily cause performance issues, lag
> > in the replication, and so on. And the users will have no idea why
> > and/or how to fix it, because we just do this silently.
> >
>
> Yeah, it is true that if there are missing publications, we need to reload
> all the publication info again, unless we build a mechanism to update the
> existing cached entry by loading only the required info. The other thing we
> could do here is to LOG the info for missing publications to make users
> aware of the fact. I think we can also introduce a new option while
> defining/altering a subscription to indicate whether to continue on a
> missing publication or not; that way, by default we will stop replication
> as we do now, but users will have a way to keep replication moving.
>
> BTW, what are the other options we have to fix the cases where replication
> is broken (or the user has no clue on how to proceed), as in the case we are
> discussing here or the yet another case the OP reported on pgsql-bugs [1]?

Hi,

FYI, I've noticed that after the last report by Peter-san we have gotten the
same errors on the buildfarm. We need to keep discussing this in order to
reach a conclusion.

1. Details for system "xenodermus" failure at stage subscriptionCheck,
   snapshot taken 2022-05-31 13:00:04
   https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04

2. Details for system "phycodurus" failure at stage subscriptionCheck,
   snapshot taken 2022-05-26 17:30:04
   https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04


Best Regards,
	Takamichi Osumi
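For readers following the pubactions discussion above, here is a minimal,
self-contained sketch of the caching behaviour Amit describes. The type and
function names only mirror pgoutput's RelationSyncEntry / get_rel_sync_entry
for readability; they are simplified stand-ins, not the actual PostgreSQL
code. The sketch shows how an entry that is marked valid but has no
pubactions set simply causes the caller to skip every change.

/*
 * Simplified model of the entry cache and the caller-side gating on
 * pubactions.  Not PostgreSQL source; names are borrowed for clarity only.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct PubActions
{
    bool        pubinsert;
    bool        pubupdate;
    bool        pubdelete;
} PubActions;

typedef struct RelSyncEntryModel
{
    bool        replicate_valid;    /* cache entry has been (re)built */
    PubActions  pubactions;         /* OR of actions of matching publications */
} RelSyncEntryModel;

/*
 * Rebuild the cached entry.  npubs_found == 0 models "no publication for the
 * table was found", e.g. because it was dropped.
 */
static void
validate_entry(RelSyncEntryModel *entry, int npubs_found)
{
    entry->pubactions.pubinsert = false;
    entry->pubactions.pubupdate = false;
    entry->pubactions.pubdelete = false;

    /* Pretend every surviving publication publishes all actions. */
    if (npubs_found > 0)
    {
        entry->pubactions.pubinsert = true;
        entry->pubactions.pubupdate = true;
        entry->pubactions.pubdelete = true;
    }

    /*
     * Mark the entry valid even when nothing matched: the caller then finds
     * no action enabled and sends nothing, instead of forcing a rebuild of
     * the entry for every single change.
     */
    entry->replicate_valid = true;
}

typedef enum ChangeAction
{
    CHANGE_INSERT,
    CHANGE_UPDATE,
    CHANGE_DELETE
} ChangeAction;

/* Caller-side gating: decide per change, based only on pubactions. */
static void
handle_change(const RelSyncEntryModel *entry, ChangeAction action)
{
    bool        send = false;

    switch (action)
    {
        case CHANGE_INSERT: send = entry->pubactions.pubinsert; break;
        case CHANGE_UPDATE: send = entry->pubactions.pubupdate; break;
        case CHANGE_DELETE: send = entry->pubactions.pubdelete; break;
    }

    printf("change %d: %s\n", (int) action, send ? "sent" : "skipped");
}

int
main(void)
{
    RelSyncEntryModel entry = {0};

    validate_entry(&entry, 0);              /* no publications for the table */
    handle_change(&entry, CHANGE_INSERT);   /* skipped */

    validate_entry(&entry, 1);              /* one matching publication */
    handle_change(&entry, CHANGE_INSERT);   /* sent */
    return 0;
}

The alternative approach, leaving the entry invalid whenever a publication is
missing, would force the rebuild step to run for every single change, which
is the performance concern Tomas raises in the quoted text above.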