Thread: pgsql: Fix row filters with multiple publications

pgsql: Fix row filters with multiple publications

From
Tomas Vondra
Date:
Fix row filters with multiple publications

When publishing changes through a artition root, we should use the row
filter for the top-most ancestor. The relation may be added to multiple
publications, using different ancestors, and 52e4f0cd47 handled this
incorrectly. With c91f71b9dc we find the correct top-most ancestor, but
the code tried to fetch the row filter from all publications, including
those using a different ancestor etc. No row filter can be found for
such publications, which was treated as replicating all rows.

Similarly to c91f71b9dc, this seems to be a rare issue in practice. It
requires multiple publications including the same partitioned relation,
through different ancestors.

Fixed by only passing publications containing the top-most ancestor to
pgoutput_row_filter_init(), so that treating a missing row filter as
replicating all rows is correct.

Report and fix by me, test case by Hou zj. Reviews and improvements by
Amit Kapila.

Author: Tomas Vondra, Hou zj, Amit Kapila
Reviewed-by: Amit Kapila, Hou zj
Discussion: https://postgr.es/m/d26d24dd-2fab-3c48-0162-2b7f84a9c893%40enterprisedb.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/5a079662256e381fde699c4fbbed6c2504a6d30a

Modified Files
--------------
src/backend/replication/pgoutput/pgoutput.c | 26 +++++++++++++---
src/test/subscription/t/028_row_filter.pl   | 47 ++++++++++++++++++++++++++++-
2 files changed, 67 insertions(+), 6 deletions(-)


Re: pgsql: Fix row filters with multiple publications

From
Tomas Vondra
Date:
Hmm, this seems to have failed on wrasse [1], due to a timeout when
waiting for tablesync to complete:

2022-03-17 17:39:28.247 CET [19962:1] LOG:  logical replication table
synchronization worker for subscription "sub2", table "tab1" has started
2022-03-17 17:39:28.258 CET [19964:1] LOG:  logical replication table
synchronization worker for subscription "sub2", table "tab4" has started

In the previous runs this completed pretty much immediately (less than a
second), but this time the workers got stuck, so the script keeps
looping on the $synced_query. There's nothing in the log, so either it's
some sort of lock wait or infinite loop.

However, this fails in 013_partition.sql, which was not modified in this
commit. And there have been multiple successful runs since it was
modified (in c91f71b9dc). So it's not clear if this is a pre-existing
issue and we just happened to hit it now, or maybe it's introduced by
either c91f71b9dc or 5a07966225. But neither of these commits touched
tablesync at all, so I'm puzzled how could it happen.


regards

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Fix row filters with multiple publications

From
Tomas Vondra
Date:
On 3/17/22 20:15, Tomas Vondra wrote:
> Hmm, this seems to have failed on wrasse [1], due to a timeout when
> waiting for tablesync to complete:
> 
> 2022-03-17 17:39:28.247 CET [19962:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab1" has started
> 2022-03-17 17:39:28.258 CET [19964:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab4" has started
> 
> In the previous runs this completed pretty much immediately (less than a
> second), but this time the workers got stuck, so the script keeps
> looping on the $synced_query. There's nothing in the log, so either it's
> some sort of lock wait or infinite loop.
> 
> However, this fails in 013_partition.sql, which was not modified in this
> commit. And there have been multiple successful runs since it was
> modified (in c91f71b9dc). So it's not clear if this is a pre-existing
> issue and we just happened to hit it now, or maybe it's introduced by
> either c91f71b9dc or 5a07966225. But neither of these commits touched
> tablesync at all, so I'm puzzled how could it happen.
> 

And sure enough - now it passed on wrasse, while lapwing failed with
exactly the same symptoms. Clearly some sort of race condition, but I've
been unable to reproduce that :-(

I'll try on my rpi4 once I get back home next week, but maybe we could
try reproducing this on one of the machines that triggered this so far.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Fix row filters with multiple publications

From
Amit Kapila
Date:
On Fri, Mar 18, 2022 at 12:45 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hmm, this seems to have failed on wrasse [1], due to a timeout when
> waiting for tablesync to complete:
>
> 2022-03-17 17:39:28.247 CET [19962:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab1" has started
> 2022-03-17 17:39:28.258 CET [19964:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab4" has started
>
> In the previous runs this completed pretty much immediately (less than a
> second), but this time the workers got stuck, so the script keeps
> looping on the $synced_query. There's nothing in the log, so either it's
> some sort of lock wait or infinite loop.
>
> However, this fails in 013_partition.sql, which was not modified in this
> commit. And there have been multiple successful runs since it was
> modified (in c91f71b9dc). So it's not clear if this is a pre-existing
> issue and we just happened to hit it now, or maybe it's introduced by
> either c91f71b9dc or 5a07966225. But neither of these commits touched
> tablesync at all, so I'm puzzled how could it happen.
>

I have shared my analysis on the -hackers thread [1]. See, if that helps.

[1] -
https://www.postgresql.org/message-id/CAA4eK1LpBFU49Ohbnk%3Ddv_v9YP%2BKqh1%2BSf8i%2B%2B_s-QhD1Gy4Qw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.