Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Build-farm - intermittent error in 031_column_list.pl |
Date | |
Msg-id | CAA4eK1JTwOAniPua04o2EcOXfzRa8ANax=3bpx4H-8dH7M2p=A@mail.gmail.com |
In response to | Re: Build-farm - intermittent error in 031_column_list.pl (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses | Re: Build-farm - intermittent error in 031_column_list.pl |
List | pgsql-hackers |
On Fri, May 20, 2022 at 4:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 5/20/22 05:58, Amit Kapila wrote:
>
> Are we really querying the publications (in get_rel_sync_entry) using
> the historical snapshot?
>

Yes.

> I haven't really realized this, but yeah, that
> might explain the issue.
>
> The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
> often than any other test (there are ~15 calls, 12 of which are in this
> new test). That might be why we haven't seen failures before. Or maybe
> the existing tests simply are not vulnerable to this,
>

Right, I have checked that the other cases are not vulnerable to this;
otherwise, I think we would have seen intermittent failures by now.
They either don't do any DML before the creation of a publication, or
they create a subscription pointing to the same publication beforehand.

> because they
> either do wait_for_catchup late enough or don't do any DML right before
> executing SET PUBLICATION.
>
> >> That timetravel seems unintuitive but it's the
> >> (current) way it works.
> >>
> >
> > I have thought about it but couldn't come up with a good way to change
> > the way currently it works. Moreover, I think it is easy to hit this
> > in other ways as well. Say, you first create a subscription with a
> > non-existent publication and then do operation on any unrelated table
> > on the publisher before creating the required publication, we will hit
> > exactly this problem of "publication does not exist", so I think we
> > may need to live with this behavior and write tests carefully.
> >
>
> Yeah, I think it pretty much requires ensuring the subscriber is fully
> caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
> replication in an unrecoverable way (actually, you can alter the
> subscription and remove the publication again, right?).
>

Right.

> But this is not just about tests, of course - the same issue applies to
> regular replication. That's a bit unfortunate, so maybe we should think
> about making this less fragile.
>

Agreed, provided we find some reasonable solution.

> We might make sure the subscriber is not lagging (essentially the
> wait_for_catchup) - which the users will have to do anyway (although
> maybe they know the publisher is beyond the LSN where it was created).
>

This won't work for the case mentioned above, where we create a
subscription with non-existent publications, then perform DML, and only
then run 'CREATE PUBLICATION'.

> The other option would be to detect such case, somehow - if you don't
> see the publication yet, see if it exists in current snapshot, and then
> maybe ignore this error. But that has other issues (the publication
> might have been created and dropped, in which case you won't see it).
>

True, the dropped case would again be tricky to deal with, and I think
we would end up publishing some operations that were performed before
the publication was even created.

> Also, we'd probably have to ignore RelationSyncEntry for a while, which
> seems quite expensive.
>

Yet another option could be that we continue using a historic snapshot
but ignore publications that are not found for the purpose of computing
RelSyncEntry attributes. We won't mark such an entry as valid until all
the publications are loaded without anything missing. I think such
cases won't be common enough in practice to matter. This means we won't
publish operations on tables corresponding to that publication until we
find such a publication, and that seems okay.

--
With Regards,
Amit Kapila.
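The last option in the message above (keep the historic snapshot, skip publications that are not visible yet, and leave the entry invalid until nothing is missing) can be illustrated with a toy model. This is only a sketch under stated assumptions, not pgoutput's actual code: `Catalog`, `build_entry`, the `skip_missing` flag, and the idea of reducing a historic snapshot to a single LSN comparison are all hypothetical simplifications, and the model assumes every listed publication covers the table.

```python
from dataclasses import dataclass


@dataclass
class RelSyncEntry:
    """Toy stand-in for the per-relation cache entry (hypothetical fields)."""
    valid: bool = False    # True only once every publication was found
    publish: bool = False  # whether changes to the table get published


class Catalog:
    """Toy catalog: publication name -> LSN at which it was created."""

    def __init__(self):
        self.created_at = {}

    def create_publication(self, name, lsn):
        self.created_at[name] = lsn

    def visible(self, name, snapshot_lsn):
        # Simplification: a "historic snapshot" sees a publication only if
        # it was created at or before the LSN of the change being decoded.
        lsn = self.created_at.get(name)
        return lsn is not None and lsn <= snapshot_lsn


def build_entry(catalog, wanted_pubs, snapshot_lsn, skip_missing):
    """Compute the entry for one table as of snapshot_lsn.

    skip_missing=False models the behaviour discussed in the thread:
    a publication invisible to the historic snapshot raises an error.
    skip_missing=True models the proposed alternative: missing
    publications are ignored and the entry stays invalid, so it is
    rebuilt on the next change.
    """
    missing = [p for p in wanted_pubs if not catalog.visible(p, snapshot_lsn)]
    if missing and not skip_missing:
        raise LookupError(f'publication "{missing[0]}" does not exist')

    entry = RelSyncEntry()
    # Assumption of this toy: every visible publication covers the table.
    entry.publish = len(missing) < len(wanted_pubs)
    entry.valid = not missing
    return entry


if __name__ == "__main__":
    catalog = Catalog()
    catalog.create_publication("pub1", lsn=200)

    # DML at LSN 100 happened before CREATE PUBLICATION at LSN 200.
    early = build_entry(catalog, ["pub1"], snapshot_lsn=100, skip_missing=True)
    late = build_entry(catalog, ["pub1"], snapshot_lsn=250, skip_missing=True)
    print(early, late)
```

In this model, a change decoded before the publication exists is simply not published and the entry is recomputed later, which matches the trade-off described in the message: operations on the affected tables are withheld until the publication becomes visible.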