Thread: pgsql: Fix row filters with multiple publications
Fix row filters with multiple publications

When publishing changes through a partition root, we should use the row
filter for the top-most ancestor. The relation may be added to multiple
publications, using different ancestors, and 52e4f0cd47 handled this
incorrectly.

With c91f71b9dc we find the correct top-most ancestor, but the code
tried to fetch the row filter from all publications, including those
using a different ancestor. No row filter can be found for such
publications, which was treated as replicating all rows.

Similarly to c91f71b9dc, this seems to be a rare issue in practice. It
requires multiple publications including the same partitioned relation,
through different ancestors.

Fixed by only passing publications containing the top-most ancestor to
pgoutput_row_filter_init(), so that treating a missing row filter as
replicating all rows is correct.

Report and fix by me, test case by Hou zj. Reviews and improvements by
Amit Kapila.

Author: Tomas Vondra, Hou zj, Amit Kapila
Reviewed-by: Amit Kapila, Hou zj
Discussion: https://postgr.es/m/d26d24dd-2fab-3c48-0162-2b7f84a9c893%40enterprisedb.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/5a079662256e381fde699c4fbbed6c2504a6d30a

Modified Files
--------------
src/backend/replication/pgoutput/pgoutput.c | 26 +++++++++++++---
src/test/subscription/t/028_row_filter.pl   | 47 ++++++++++++++++++++++++++++-
2 files changed, 67 insertions(+), 6 deletions(-)
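To make the mechanics concrete, here is a minimal sketch of the idea in C.
The struct and helper names are hypothetical, not the actual pgoutput.c
code: before initializing row filters, keep only the publications that
publish the relation through the top-most ancestor, so that a missing
filter genuinely means "replicate all rows".

#include "postgres.h"
#include "nodes/pg_list.h"

/* hypothetical stand-in for the per-publication info relevant here */
typedef struct PubInfo
{
	Oid			ancestor_oid;	/* ancestor used to publish the relation */
	Node	   *rowfilter;		/* row filter expression, or NULL if none */
} PubInfo;

/*
 * Return only the publications that publish the relation through the
 * given top-most ancestor.  Publications using a different ancestor have
 * no row filter for this relation, so passing them along would wrongly
 * be interpreted as "replicate all rows".
 */
static List *
pubs_for_topmost_ancestor(List *publications, Oid topmost_ancestor)
{
	List	   *rel_publications = NIL;
	ListCell   *lc;

	foreach(lc, publications)
	{
		PubInfo    *pub = (PubInfo *) lfirst(lc);

		if (pub->ancestor_oid == topmost_ancestor)
			rel_publications = lappend(rel_publications, pub);
	}

	/* only these publications are handed to pgoutput_row_filter_init() */
	return rel_publications;
}

Whether the field names match the real code is beside the point; the key
property is that pgoutput_row_filter_init() only ever sees publications
for which "no filter found" truly means "publish all rows".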
Hmm, this seems to have failed on wrasse [1], due to a timeout when
waiting for tablesync to complete:

2022-03-17 17:39:28.247 CET [19962:1] LOG:  logical replication table
synchronization worker for subscription "sub2", table "tab1" has started
2022-03-17 17:39:28.258 CET [19964:1] LOG:  logical replication table
synchronization worker for subscription "sub2", table "tab4" has started

In the previous runs this completed pretty much immediately (in less
than a second), but this time the workers got stuck, so the script keeps
looping on the $synced_query. There's nothing in the log, so it's either
some sort of lock wait or an infinite loop.

However, the failure is in 013_partition.pl, which was not modified by
this commit. And there have been multiple successful runs since it was
modified (in c91f71b9dc). So it's not clear whether this is a
pre-existing issue we just happened to hit now, or whether it was
introduced by either c91f71b9dc or 5a07966225. But neither of those
commits touched tablesync at all, so I'm puzzled how this could happen.

regards

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/17/22 20:15, Tomas Vondra wrote:
> Hmm, this seems to have failed on wrasse [1], due to a timeout when
> waiting for tablesync to complete:
>
> 2022-03-17 17:39:28.247 CET [19962:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab1" has started
> 2022-03-17 17:39:28.258 CET [19964:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab4" has started
>
> In the previous runs this completed pretty much immediately (in less
> than a second), but this time the workers got stuck, so the script
> keeps looping on the $synced_query. There's nothing in the log, so it's
> either some sort of lock wait or an infinite loop.
>
> However, the failure is in 013_partition.pl, which was not modified by
> this commit. And there have been multiple successful runs since it was
> modified (in c91f71b9dc). So it's not clear whether this is a
> pre-existing issue we just happened to hit now, or whether it was
> introduced by either c91f71b9dc or 5a07966225. But neither of those
> commits touched tablesync at all, so I'm puzzled how this could happen.
>

And sure enough - now it passed on wrasse, while lapwing failed with
exactly the same symptoms. Clearly some sort of race condition, but I've
been unable to reproduce it :-(

I'll try on my rpi4 once I get back home next week, but maybe we could
try reproducing this on one of the machines that have triggered it so
far.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 18, 2022 at 12:45 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hmm, this seems to have failed on wrasse [1], due to a timeout when
> waiting for tablesync to complete:
>
> 2022-03-17 17:39:28.247 CET [19962:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab1" has started
> 2022-03-17 17:39:28.258 CET [19964:1] LOG:  logical replication table
> synchronization worker for subscription "sub2", table "tab4" has started
>
> In the previous runs this completed pretty much immediately (in less
> than a second), but this time the workers got stuck, so the script
> keeps looping on the $synced_query. There's nothing in the log, so it's
> either some sort of lock wait or an infinite loop.
>
> However, the failure is in 013_partition.pl, which was not modified by
> this commit. And there have been multiple successful runs since it was
> modified (in c91f71b9dc). So it's not clear whether this is a
> pre-existing issue we just happened to hit now, or whether it was
> introduced by either c91f71b9dc or 5a07966225. But neither of those
> commits touched tablesync at all, so I'm puzzled how this could happen.
>

I have shared my analysis on the -hackers thread [1]. See if that helps.

[1] - https://www.postgresql.org/message-id/CAA4eK1LpBFU49Ohbnk%3Ddv_v9YP%2BKqh1%2BSf8i%2B%2B_s-QhD1Gy4Qw%40mail.gmail.com

--
With Regards,
Amit Kapila.