Re: row filtering for logical replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: row filtering for logical replication
Date
Msg-id CAA4eK1JXYQ3B7kLXwi2buDP-LZ_qnqZDf=nDTPt_c8hHkyztww@mail.gmail.com
Whole thread Raw
In response to Re: row filtering for logical replication  ("Euler Taveira" <euler@eulerto.com>)
Responses Re: row filtering for logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Aug 25, 2021 at 5:52 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Tue, Aug 24, 2021, at 4:46 AM, Peter Smith wrote:
>
> I have used debug logging to confirm that what Amit wrote [1] is
> correct; the row-filter ExprState of *every* table's row_filter will
> be invalidated (and so subsequently gets rebuilt) when the user
> changes the PUBLICATION tables. This was a side-effect of the
> rel_sync_cache_publication_cb which is freeing the cached ExprState
> and setting the entry->replicate_valid = false; for *every* entry.
>
> So yes, the ExprCache is getting rebuilt for some situations where it
> is not strictly necessary to do so.
>
> I'm afraid we are overenginnering this feature. We already have a cache
> mechanism that was suggested (that shows a small improvement). As you said the
> gain for this new improvement is zero or minimal (it depends on your logical
> replication setup/maintenance).
>

Hmm, I think the gain via caching is not visible because we are using
simple expressions. It will be visible when we use somewhat complex
expressions where expression evaluation cost is significant.
Similarly, the impact of this change will magnify and it will also be
visible when a publication has many tables. Apart from performance,
this change is logically correct as well because it would be any way
better if we don't invalidate the cached expressions unless required.

> 1. Although the ExprState cache is effective, in practice the
> performance improvement was not very much. My previous results [2]
> showed only about 2sec saving for 100K calls to the
> pgoutput_row_filter function. So I think eliminating just one or two
> unnecessary calls in the get_rel_sync_entry is going to make zero
> observable difference.
>
> 2. IMO it is safe to expect that the ALTER PUBLICATION is a rare
> operation relative to the number of times that pgoutput_row_filter
> will be called (the pgoutput_row_filter is quite a "hot" function
> since it is called for every INSERT/UPDATE/DELETE). It will be orders
> of magnitude difference 1:1000, 1:100000 etc.
>
> ~~
>
> Anyway, I have implemented the suggested cache change because I agree
> it is probably theoretically superior, even if in practice there is
> almost no difference.
>
> I didn't inspect your patch carefully but it seems you add another List to
> control this new cache mechanism. I don't like it. IMO if we can use the data
> structures that we have now, let's implement your idea; otherwise, -1 for this
> new micro optimization.
>

As mentioned above, without this we will invalidate many cached
expressions even though it is not required. I don't deny that there
might be a better way to achieve the same and if you or Peter have any
ideas, I am all ears. If there are technical challenges to achieve the
same or it makes the patch complex then certainly we can discuss but
according to me, this should not introduce additional complexity.

> [By the way, it took some time to extract what you changed. Since we're trading
> patches, I personally appreciate if you can send a patch on the top of the
> current one.
>

+1.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: replay of CREATE TABLESPACE eats data at wal_level=minimal
Next
From: Noah Misch
Date:
Subject: Re: Mark all GUC variable as PGDLLIMPORT