Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Build-farm - intermittent error in 031_column_list.pl
Date
Msg-id CAA4eK1JTwOAniPua04o2EcOXfzRa8ANax=3bpx4H-8dH7M2p=A@mail.gmail.com
In response to Re: Build-farm - intermittent error in 031_column_list.pl  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Build-farm - intermittent error in 031_column_list.pl
List pgsql-hackers
On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 5/20/22 05:58, Amit Kapila wrote:
>
> Are we really querying the publications (in get_rel_sync_entry) using
> the historical snapshot?
>

Yes.

> I haven't really realized this, but yeah, that
> might explain the issue.
>
> The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
> often than any other test (there are ~15 calls, 12 of which are in this
> new test). That might be why we haven't seen failures before. Or maybe
> the existing tests simply are not vulnerable to this,
>

Right, I have checked that the other cases are not vulnerable to this;
otherwise, I think we would have seen intermittent failures by now.
They either don't perform DML before creating a publication, or
they create a subscription pointing to the same publication beforehand.

> because they
> either do wait_for_catchup late enough or don't do any DML right before
> executing SET PUBLICATION.
>
> >>  That timetravel seems unintuitive but it's the
> >> (current) way it works.
> >>
> >
> > I have thought about it but couldn't come up with a good way to change
> > the way it currently works. Moreover, I think it is easy to hit this
> > in other ways as well. Say, you first create a subscription with a
> > non-existent publication and then perform an operation on any unrelated
> > table on the publisher before creating the required publication; we will
> > hit exactly this problem of "publication does not exist", so I think we
> > may need to live with this behavior and write tests carefully.
> >
>
> Yeah, I think it pretty much requires ensuring the subscriber is fully
> caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
> replication in an unrecoverable way (actually, you can alter the
> subscription and remove the publication again, right?).
>

Right.

> But this is not just about tests, of course - the same issue applies to
> regular replication. That's a bit unfortunate, so maybe we should think
> about making this less fragile.
>

Agreed, provided we find some reasonable solution.

> We might make sure the subscriber is not lagging (essentially the
> wait_for_catchup) - which the users will have to do anyway (although
> maybe they know the publisher is beyond the LSN where it was created).
>

This won't work for the case mentioned above where we create a
subscription with non-existent publications, then perform DML and then
'CREATE PUBLICATION'.
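
To make that sequence concrete, here is a toy Python model of it. This is
only a sketch and not PostgreSQL code: Catalog, decode_change and the LSN
values are made-up names and numbers, and the only assumption is that the
publication lookup sees the catalog as of the change being decoded. The
subscriber can never get past the DML, so waiting for catchup cannot help.

```python
# Toy model only. Catalog, decode_change and the LSN values are hypothetical
# illustrations, not PostgreSQL internals.

class PublicationMissing(Exception):
    """Stands in for the 'publication does not exist' error."""


class Catalog:
    """Remembers the LSN at which each publication was created."""

    def __init__(self):
        self.created_at = {}

    def create_publication(self, name, lsn):
        self.created_at[name] = lsn

    def visible(self, name, snapshot_lsn):
        # A historic snapshot only sees publications created at or before
        # the LSN it was taken for.
        lsn = self.created_at.get(name)
        return lsn is not None and lsn <= snapshot_lsn


def decode_change(catalog, subscribed_pubs, change_lsn):
    # Publications are looked up as of the change's LSN, not as of "now".
    for pub in subscribed_pubs:
        if not catalog.visible(pub, change_lsn):
            raise PublicationMissing(f'publication "{pub}" does not exist')
    return change_lsn


catalog = Catalog()
subscribed = ["pub1"]                    # subscription created; pub1 is missing
dml_lsn = 200                            # DML on an unrelated table
catalog.create_publication("pub1", 300)  # CREATE PUBLICATION comes later

try:
    decode_change(catalog, subscribed, dml_lsn)
    outcome = "decoded"
except PublicationMissing as exc:
    outcome = str(exc)

print(outcome)  # publication "pub1" does not exist
```

In this model the publication exists by the time decoding runs, yet the
change at LSN 200 still fails, because the lookup is done as of LSN 200.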

> The other option would be to detect such case, somehow - if you don't
> see the publication yet, see if it exists in current snapshot, and then
> maybe ignore this error. But that has other issues (the publication
> might have been created and dropped, in which case you won't see it).
>

True, the dropped case would again be tricky to deal with, and I think
we would end up publishing some operations that were performed before
the publication was even created.

> Also, we'd probably have to ignore RelationSyncEntry for a while, which
> seems quite expensive.
>

Yet another option could be to continue using a historic snapshot
but ignore publications that are not found when computing the
RelationSyncEntry attributes. We won't mark such an entry as valid
until all the publications are loaded without anything missing. I
think such cases will be rare enough in practice not to matter. This
means we won't publish operations on tables corresponding to that
publication until we find such a publication, and that seems okay.
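
A rough sketch of that idea as a toy Python model follows. It is not the
pgoutput code: RelSyncEntryModel, refresh_entry and should_publish are
hypothetical names, and the model assumes the entry is simply recomputed on
every change for as long as it is not valid.

```python
# Toy model only. RelSyncEntryModel, refresh_entry and should_publish are
# hypothetical names, not the pgoutput implementation.

class RelSyncEntryModel:
    def __init__(self):
        self.valid = False    # True only once no publication was missing
        self.publish = False  # whether changes to the table are sent


def refresh_entry(entry, pub_created_at, subscribed_pubs, table_pubs,
                  snapshot_lsn):
    """Recompute the entry as of snapshot_lsn, skipping missing publications."""
    missing = False
    publish = False
    for pub in subscribed_pubs:
        created = pub_created_at.get(pub)
        if created is None or created > snapshot_lsn:
            # Not visible in the historic snapshot: ignore it rather than
            # raising an error, but remember that something was missing.
            missing = True
            continue
        if pub in table_pubs:
            publish = True
    entry.publish = publish
    # Stay invalid so the next change repeats the publication lookup.
    entry.valid = not missing


def should_publish(entry, pub_created_at, subscribed_pubs, table_pubs,
                   change_lsn):
    if not entry.valid:
        refresh_entry(entry, pub_created_at, subscribed_pubs, table_pubs,
                      change_lsn)
    return entry.publish


pub_created_at = {"pub1": 300}  # publication created at LSN 300
subscribed = ["pub1"]
table_pubs = {"pub1"}           # the table is a member of pub1
entry = RelSyncEntryModel()

results = [should_publish(entry, pub_created_at, subscribed, table_pubs, lsn)
           for lsn in (200, 400)]
print(results)  # [False, True]
```

The change at LSN 200 is skipped instead of raising an error, and the
change at LSN 400 is published once the publication becomes visible. The
extra cost is the repeated lookup while the entry stays invalid.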

-- 
With Regards,
Amit Kapila.


