Re: row filtering for logical replication - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: row filtering for logical replication
Msg-id ace1bc6e-f763-1791-2371-1d88c977441f@enterprisedb.com
In response to Re: row filtering for logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: row filtering for logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On 7/20/21 7:23 AM, Amit Kapila wrote:
> On Mon, Jul 19, 2021 at 7:02 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/19/21 1:00 PM, Dilip Kumar wrote:
>>> On Mon, Jul 19, 2021 at 3:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> a. Just log it and move to the next row
>>>> b. send to stats collector some info about this which can be displayed
>>>> in a view and then move ahead
>>>> c. just skip it like any other row that doesn't match the filter clause.
>>>>
>>>> I am not sure there is any use in sending a row if one of the
>>>> old/new rows doesn't match the filter. Because if the old row doesn't
>>>> match but the new one matches the criteria, we will anyway just throw
>>>> such a row away on the subscriber instead of applying it.
>>>
>>> But at some point that will be true even if we skip the row based on
>>> (a) or (c), right?  Suppose the OLD row did not satisfy the
>>> condition but the NEW row does: even if we skip this operation, then
>>> on the next operation on the same row, even if both OLD and NEW rows
>>> satisfy the filter, the operation will just be dropped by the
>>> subscriber, right? Because we did not send the row when it was first
>>> updated to a value satisfying the condition.  So basically, once a
>>> row is inserted that does not satisfy the condition, then no matter
>>> how many updates we do to that row, either it will be skipped by the
>>> publisher because the OLD row did not satisfy the condition, or it
>>> will be skipped by the subscriber because there is no matching row.
>>>
>>
>> I have a feeling it's getting overly complicated, to the extent that
>> it'll be hard to explain to users and reason about. I don't think
>> there's a "perfect" solution for cases when the filter expression gives
>> different answers for old/new row - it'll always be surprising for some
>> users :-(
>>
> 
> 
> It is possible, but OTOH the three replication solutions (Debezium,
> Oracle, IBM's InfoSphere Data Replication) which have this feature
> seem to filter based on both old and new rows in one way or another.
> Also, I am not sure the simple approach of just filtering based on the
> new row is very clear either, because it can also confuse users: even
> if all the new rows match the filter, they may see nothing on the
> subscriber, and in fact that can cause a lot of network overhead
> without any gain.
> 

True. My point is that it's easier to explain than when using some
combination of the old/new rows, and the approach "replicate if the
filter matches both rows" proposed in this thread would be confusing too.
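
To make the problematic case concrete, here's a minimal sketch (the
table and publication are hypothetical, and I'm assuming the
WHERE-clause syntax from the patch under discussion):

  CREATE TABLE t (id int PRIMARY KEY, flag boolean);
  CREATE PUBLICATION pub FOR TABLE t WHERE (flag = true);

  INSERT INTO t VALUES (1, false);        -- fails the filter, never
                                          -- reaches the subscriber
  UPDATE t SET flag = true WHERE id = 1;  -- old row fails the filter,
                                          -- new row passes it

Whatever we decide to do with that UPDATE, some users will be
surprised - either the subscriber silently never gets row 1, or we
have to send something that isn't a plain UPDATE.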

If the subscriber database can be modified, we kinda have this issue
already - the row can be deleted on the subscriber, and all subsequent
UPDATEs will be lost. Yes, for read-only replicas that won't happen, but
I think we're moving towards use cases more advanced than that.
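
For example, with a hypothetical replicated table t and a writable
subscriber:

  -- on the subscriber
  DELETE FROM t WHERE id = 1;

  -- on the publisher
  UPDATE t SET flag = false WHERE id = 1;
  -- the apply worker finds no matching row on the subscriber and
  -- silently skips the change

That's essentially the same "lost update" effect, just triggered by a
local write instead of a row filter.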

I think there are only two ways to *guarantee* this does not happen:

* prohibit updates of columns referenced in row filters (sketched below)

* some sort of conflict resolution, turning UPDATE to INSERT etc.
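
For the first option, a sketch of how that might look to users (the
error message is hypothetical, not something the patch implements):

  UPDATE t SET flag = false WHERE id = 1;
  ERROR:  cannot update column "flag", because it is referenced in the
          row filter of publication "pub"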

>> So maybe the best thing is to stick to the simple approach already used
>> e.g. by pglogical, which simply uses the new row when available (insert,
>> update) and the old one for deletes.
>>
>> I think that behaves more or less sensibly and it's easy to explain.
>>
> 
> Okay, if nothing better comes up, then we can fall back to this option.
> 
>> All the other things (e.g. turning UPDATE to INSERT, advanced conflict
>> resolution etc.) will require a lot of other stuff,
>>
> 
> I have not evaluated this yet, but I think spending some time thinking
> about turning an Update into an Insert/Delete (yesterday's suggestion by
> Alvaro) might be worthwhile, especially as that seems to be followed by
> some other replication solutions as well.
> 

I think that requires quite a bit of infrastructure, and I'd bet we'll
need to handle other types of conflicts too. I don't have a clear
opinion on whether that's required to get this patch working - I'd try
getting the simplest implementation with reasonable behavior, and treat
those more advanced things as future enhancements.

>> and I see them as
>> improvements of this simple approach.
>>
>>>>> Maybe a second option is to have replication change any UPDATE into
>>>>> either an INSERT or a DELETE, if the old or the new row does not pass
>>>>> the filter, respectively.  That way, the databases would remain
>>>>> consistent.
>>>
>>> Yeah, I think this is the best way to keep the data consistent.
>>>
>>
>> It'd also require REPLICA IDENTITY FULL, which seems like it'd add a
>> rather significant overhead.
>>
> 
> Why? I think it would just need restrictions similar to what we are
> planning for the Delete operation, such that filter columns must be
> present in either the primary key or the replica identity columns.
> 

How else would you turn an UPDATE into an INSERT? For UPDATE we only send
the identity columns and the modified columns, and the decision happens
on the subscriber. So we'd need to send everything if there's a risk
we'll need those columns. But it's early and I've only had one coffee, so
I may be missing something glaringly obvious.
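
To illustrate, assuming a hypothetical table with the default REPLICA
IDENTITY (the primary key):

  CREATE TABLE t (id int PRIMARY KEY, flag boolean, payload text);

  UPDATE t SET flag = true WHERE id = 1;
  -- the decoded change carries the key (id) and the modified column
  -- (flag); "payload" is not included, so the subscriber cannot
  -- construct the complete row it would need to apply this as an INSERT

  ALTER TABLE t REPLICA IDENTITY FULL;
  -- now the whole old row is logged for every UPDATE/DELETE, which is
  -- the overhead I'm worried about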


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


