Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Build-farm - intermittent error in 031_column_list.pl
Date
Msg-id 04caa3b1-fa97-2f33-e69f-91f612de668d@enterprisedb.com
Whole thread Raw
In response to Re: Build-farm - intermittent error in 031_column_list.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Build-farm - intermittent error in 031_column_list.pl
List pgsql-hackers

On 5/20/22 05:58, Amit Kapila wrote:
> On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>>
>>>> the apply worker will use the existing slot and replication origin
>>>> corresponding to the subscription. Now, it is possible that before
>>>> restart the origin has not been updated and the WAL start location
>>>> points to a location prior to where PUBLICATION pub9 exists which can
>>>> lead to such an error. Once this error occurs, apply worker will never
>>>> be able to proceed and will always return the same error. Does this
>>>> make sense?
>>
>> Wow. I didin't thought that line. That theory explains the silence and
>> makes sense even though I don't see LSN transistions that clearly
>> support it.  I dimly remember a similar kind of problem..
>>
>>>> Unless you or others see a different theory, this seems to be the
>>>> existing problem in logical replication which is manifested by this
>>>> test. If we just want to fix these test failures, we can create a new
>>>> subscription instead of altering the existing publication to point to
>>>> the new publication.
>>>>
>>>
>>> If the above theory is correct then I think allowing the publisher to
>>> catch up with "$node_publisher->wait_for_catchup('sub1');" before
>>> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
>>> both publisher and subscriber are in sync then the new publication
>>> should be visible to WALSender.
>>
>> It looks right to me.
>>
> 
> Let's wait for Tomas or others working in this area to share their thoughts.
> 

Are we really querying the publications (in get_rel_sync_entry) using
the historical snapshot? I haven't really realized this, but yeah, that
might explain the issue.

The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
often than any other test (there are ~15 calls, 12 of which are in this
new test). That might be why we haven't seen failures before. Or maybe
the existing tests simply are not vulnerable to this, because they
either do wait_for_catchup late enough or don't do any DML right before
executing SET PUBLICATION.

>>  That timetravel seems inintuitive but it's the
>> (current) way it works.
>>
> 
> I have thought about it but couldn't come up with a good way to change
> the way currently it works. Moreover, I think it is easy to hit this
> in other ways as well. Say, you first create a subscription with a
> non-existent publication and then do operation on any unrelated table
> on the publisher before creating the required publication, we will hit
> exactly this problem of "publication does not exist", so I think we
> may need to live with this behavior and write tests carefully.
> 

Yeah, I think it pretty much requires ensuring the subscriber is fully
caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
replication in an unrecoverable way (actually, you can alter the
subscription and remove the publication again, right?).

But this is not just about tests, of course - the same issue applies to
regular replication. That's a bit unfortunate, so maybe we should think
about making this less fragile.

We might make sure the subscriber is not lagging (essentially the
wait_for_catchup) - which the users will have to do anyway (although
maybe they know the publisher is beyond the LSN where it was created).

The other option would be to detect such case, somehow - if you don't
see the publication yet, see if it exists in current snapshot, and then
maybe ignore this error. But that has other issues (the publication
might have been created and dropped, in which case you won't see it).
Also, we'd probably have to ignore RelationSyncEntry for a while, which
seems quite expensive.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: Handle infinite recursion in logical replication setup
Next
From: Michael Paquier
Date:
Subject: Re: Addition of PostgreSQL::Test::Cluster::pg_version()