Re: row filtering for logical replication - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: row filtering for logical replication
Msg-id: 4b213b91-8428-85a7-7ed6-48ceefd7f7be@enterprisedb.com
In response to: Re: row filtering for logical replication (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers

On 9/24/21 8:09 AM, Amit Kapila wrote:
> On Thu, Sep 23, 2021 at 6:03 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> 13) turning update into insert
>>
>> I agree with Ajin Cherian [4] that looking at just the old or the new
>> row for updates is not the right solution, because each option will
>> "break" the replica in some cases. So I think "keeping the replica in
>> sync" is the right goal, and converting the update to an insert/delete
>> when needed seems appropriate.
>>
>> This seems somewhat similar to what pglogical does, because it may
>> also convert updates (although only to inserts, IIRC) when handling
>> replication conflicts. The difference is that pglogical does all this
>> on the subscriber, while this patch makes the decision on the
>> publisher.
>>
>> I wonder if this might have some negative consequences, or whether
>> "moving" this downstream would be useful for other purposes in the
>> future (e.g. it might be reused for handling other conflicts).
>>
> 
> Apart from the additional traffic, I am not sure how we would handle
> all the conditions on subscribers. Say the new row doesn't match the
> filter: how would subscribers know about this unless we pass the row
> filter or some additional information along with the tuple? Previously,
> I did some research and shared in one of the emails above that IBM's
> InfoSphere Data Replication [1] performs filtering in this way, which
> also suggests that we won't be off track here.
> 

I'm certainly not suggesting what we're doing is wrong. Given the design
of built-in logical replication it makes sense to do it this way; I was
just thinking aloud about what we might want to do in the future (e.g.
pglogical uses this to deal with conflicts between multiple sources, and
so on).
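
To make the publisher-side decision concrete, here's a minimal sketch
of the transformation logic being discussed (the enum and function
names are hypothetical, not the actual patch code):

#include <stdbool.h>

/*
 * Decide how to publish an UPDATE, based on whether the old and new
 * tuples each match the publication's row filter.
 */
typedef enum UpdateTransform
{
    TRANSFORM_SKIP,     /* neither tuple matches: publish nothing */
    TRANSFORM_INSERT,   /* only new matches: row enters the set */
    TRANSFORM_DELETE,   /* only old matches: row leaves the set */
    TRANSFORM_UPDATE    /* both match: publish the update as-is */
} UpdateTransform;

static UpdateTransform
decide_update_transform(bool old_matches, bool new_matches)
{
    if (old_matches && new_matches)
        return TRANSFORM_UPDATE;
    if (new_matches)
        return TRANSFORM_INSERT;
    if (old_matches)
        return TRANSFORM_DELETE;
    return TRANSFORM_SKIP;
}

The benefit of deciding this on the publisher is that the subscriber
never needs to see the filter at all - it simply applies whatever
change arrives.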

>>
>>
>> 15) pgoutput_row_filter initializing filter
>>
>> I'm not sure I understand why the filter initialization gets moved out
>> of get_rel_sync_entry. Presumably, most of what replication does is
>> replicate rows, so I see little point in not initializing this along
>> with the rest of the rel_sync_entry.
>>
> 
> Sorry, IIRC, this was suggested by me, and I thought it best to do any
> expensive computation the first time it is required. I have shared a
> few cases, like in [2], where doing it earlier would lead to additional
> cost without any gain. Unless I am missing something, I don't see any
> downside to doing it in a delayed fashion.
> 

Not sure, but the arguments presented there seem a bit wonky ...

Yes, the work would be wasted if we discard the cached data without 
using it (it might happen for truncate, I'm not sure). But how likely is 
it that such operations happen *in isolation*? I'd bet the workload is 
almost never just a stream of truncates - there are always some 
operations in between that would actually use this.

Similarly for the errors - IIRC hitting an error means the replication 
restarts, which is orders of magnitude more expensive than anything we 
can save by this delayed evaluation.

I'd keep it simple, for the sake of simplicity of the whole patch.
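
Just to illustrate what's being compared, the delayed approach is
roughly the following pattern (simplified stand-in types and a
hypothetical build_row_filter() helper, not the actual patch code):

#include <stdbool.h>

typedef struct ExprState ExprState;     /* stand-in for the real type */

typedef struct RelSyncEntry
{
    bool        filter_valid;   /* built lazily, on first use */
    ExprState  *filter;
} RelSyncEntry;

/* hypothetical helper: compile the publication's WHERE clause */
extern ExprState *build_row_filter(RelSyncEntry *entry);

static ExprState *
get_row_filter(RelSyncEntry *entry)
{
    if (!entry->filter_valid)
    {
        entry->filter = build_row_filter(entry);    /* expensive step */
        entry->filter_valid = true;
    }
    return entry->filter;
}

The eager variant I'm arguing for simply moves the build_row_filter()
call into the code that sets up the cache entry (get_rel_sync_entry),
so the per-row callback doesn't need the validity check at all.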

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


