Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Build-farm - intermittent error in 031_column_list.pl |
Date | |
Msg-id | 04caa3b1-fa97-2f33-e69f-91f612de668d@enterprisedb.com Whole thread Raw |
In response to | Re: Build-farm - intermittent error in 031_column_list.pl (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Build-farm - intermittent error in 031_column_list.pl
|
List | pgsql-hackers |
On 5/20/22 05:58, Amit Kapila wrote: > On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> >>>> the apply worker will use the existing slot and replication origin >>>> corresponding to the subscription. Now, it is possible that before >>>> restart the origin has not been updated and the WAL start location >>>> points to a location prior to where PUBLICATION pub9 exists which can >>>> lead to such an error. Once this error occurs, apply worker will never >>>> be able to proceed and will always return the same error. Does this >>>> make sense? >> >> Wow. I didin't thought that line. That theory explains the silence and >> makes sense even though I don't see LSN transistions that clearly >> support it. I dimly remember a similar kind of problem.. >> >>>> Unless you or others see a different theory, this seems to be the >>>> existing problem in logical replication which is manifested by this >>>> test. If we just want to fix these test failures, we can create a new >>>> subscription instead of altering the existing publication to point to >>>> the new publication. >>>> >>> >>> If the above theory is correct then I think allowing the publisher to >>> catch up with "$node_publisher->wait_for_catchup('sub1');" before >>> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER >>> both publisher and subscriber are in sync then the new publication >>> should be visible to WALSender. >> >> It looks right to me. >> > > Let's wait for Tomas or others working in this area to share their thoughts. > Are we really querying the publications (in get_rel_sync_entry) using the historical snapshot? I haven't really realized this, but yeah, that might explain the issue. The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more often than any other test (there are ~15 calls, 12 of which are in this new test). That might be why we haven't seen failures before. Or maybe the existing tests simply are not vulnerable to this, because they either do wait_for_catchup late enough or don't do any DML right before executing SET PUBLICATION. >> That timetravel seems inintuitive but it's the >> (current) way it works. >> > > I have thought about it but couldn't come up with a good way to change > the way currently it works. Moreover, I think it is easy to hit this > in other ways as well. Say, you first create a subscription with a > non-existent publication and then do operation on any unrelated table > on the publisher before creating the required publication, we will hit > exactly this problem of "publication does not exist", so I think we > may need to live with this behavior and write tests carefully. > Yeah, I think it pretty much requires ensuring the subscriber is fully caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the replication in an unrecoverable way (actually, you can alter the subscription and remove the publication again, right?). But this is not just about tests, of course - the same issue applies to regular replication. That's a bit unfortunate, so maybe we should think about making this less fragile. We might make sure the subscriber is not lagging (essentially the wait_for_catchup) - which the users will have to do anyway (although maybe they know the publisher is beyond the LSN where it was created). The other option would be to detect such case, somehow - if you don't see the publication yet, see if it exists in current snapshot, and then maybe ignore this error. But that has other issues (the publication might have been created and dropped, in which case you won't see it). Also, we'd probably have to ignore RelationSyncEntry for a while, which seems quite expensive. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: