Thread: Some thoughts about the TAP tests' wait_for_catchup()
I noticed that some test scripts, instead of using wait_for_catchup to wait for replication catchup, use ad-hoc code like my $primary_lsn = $primary->safe_psql('postgres', 'select pg_current_wal_lsn()'); $standby->poll_query_until('postgres', qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()}) or die "standby never caught up"; This does not look much like what wait_for_catchup does, which typically ends up issuing queries like SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby'; It seems to me that for most purposes wait_for_catchup's approach is strictly worse, for two reasons: 1. It continually recomputes the primary's pg_current_wal_lsn(). Thus, if anything is happening on the primary (e.g. autovacuum), we're moving the goalposts as to how much the standby is required to replay before we continue. That slows down the tests, makes them less reproducible, and could help to mask actual bugs of the form this-hasn't-been-done-when-it-should-have. 2. It's querying the primary's view of the standby's state, which introduces a reporting delay. This has exactly the same three problems as the other point. So I think we ought to change wait_for_catchup to do things more like the first way, where feasible. This seems pretty easy for the call sites that provide the standby's PostgresNode; but it's not so easy for the ones that just provide a subscription name. So my first question is: to what extent are these tests specifically intended to exercise the replay reporting mechanisms? Would we lose important coverage if we changed *all* the call sites to use the query-the-standby approach? If so, which ones might be okay to change? The second question is what do we want the API to look like? This is simple if we can change all wait_for_catchup callers to pass the standby node. Otherwise, the choices are to split wait_for_catchup into two implementations, or to keep its current API and have it do significantly different things depending on isa("PostgresNode"). Thoughts? regards, tom lane
On Wed, Sep 29, 2021 at 3:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I noticed that some test scripts, instead of using wait_for_catchup > to wait for replication catchup, use ad-hoc code like > > my $primary_lsn = > $primary->safe_psql('postgres', 'select pg_current_wal_lsn()'); > $standby->poll_query_until('postgres', > qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()}) > or die "standby never caught up"; > > This does not look much like what wait_for_catchup does, which > typically ends up issuing queries like > > SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' > FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby'; > > It seems to me that for most purposes wait_for_catchup's approach is > strictly worse, for two reasons: > > 1. It continually recomputes the primary's pg_current_wal_lsn(). > Thus, if anything is happening on the primary (e.g. autovacuum), > we're moving the goalposts as to how much the standby is required > to replay before we continue. That slows down the tests, makes > them less reproducible, and could help to mask actual bugs of the > form this-hasn't-been-done-when-it-should-have. > > 2. It's querying the primary's view of the standby's state, which > introduces a reporting delay. This has exactly the same three > problems as the other point. > I can't comment on all the use cases of wait_for_catchup() but I think there are some use cases in logical replication where we need the publisher to use wait_for_catchup after setting up the replication to ensure that wal sender is started and in-proper state by checking its state (which should be 'streaming'). That also implicitly checks if the wal receiver has responded to initial ping requests by sending replay location. The typical use is as below where after setting up initial replication we wait for a publisher to catch up and then check if the initial table sync is finished. There is no use in checking the second till the first statement is completed. subscription/t/001_rep_changes.pl ... $node_publisher->wait_for_catchup('tap_sub'); # Also wait for initial table sync to finish my $synced_query = "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"; $node_subscriber->poll_query_until('postgres', $synced_query) or die "Timed out while waiting for subscriber to synchronize data"; I am not sure in such tests if checking solely the subscriber's wal_replay_lsn would be sufficient. So, I think there are cases especially in physical replication tests where we can avoid using wait_for_catchup but I am not sure if we can completely eliminate it. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > On Wed, Sep 29, 2021 at 3:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It seems to me that for most purposes wait_for_catchup's approach is >> strictly worse, for two reasons: >> 1. It continually recomputes the primary's pg_current_wal_lsn(). >> 2. It's querying the primary's view of the standby's state, which >> introduces a reporting delay. > I can't comment on all the use cases of wait_for_catchup() but I think > there are some use cases in logical replication where we need the > publisher to use wait_for_catchup after setting up the replication to > ensure that wal sender is started and in-proper state by checking its > state (which should be 'streaming'). That also implicitly checks if > the wal receiver has responded to initial ping requests by sending > replay location. Yeah, for logical replication we can't look at the subscriber's WAL positions because they could be totally different. What I'm on about is the tests that use physical replication. I think examining the standby's state directly is better in that case, for the reasons I cited. I guess the question of interest is whether it's sufficient to test the walreceiver feedback mechanisms in the context of logical replication, or whether we feel that the physical-replication code path is enough different that there should be a specific test for that combination too. regards, tom lane
On Wed, Sep 29, 2021 at 9:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Wed, Sep 29, 2021 at 3:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> It seems to me that for most purposes wait_for_catchup's approach is > >> strictly worse, for two reasons: > >> 1. It continually recomputes the primary's pg_current_wal_lsn(). > >> 2. It's querying the primary's view of the standby's state, which > >> introduces a reporting delay. > > > I can't comment on all the use cases of wait_for_catchup() but I think > > there are some use cases in logical replication where we need the > > publisher to use wait_for_catchup after setting up the replication to > > ensure that wal sender is started and in-proper state by checking its > > state (which should be 'streaming'). That also implicitly checks if > > the wal receiver has responded to initial ping requests by sending > > replay location. > > Yeah, for logical replication we can't look at the subscriber's WAL > positions because they could be totally different. What I'm on > about is the tests that use physical replication. I think examining > the standby's state directly is better in that case, for the reasons > I cited. > > I guess the question of interest is whether it's sufficient to test > the walreceiver feedback mechanisms in the context of logical > replication, or whether we feel that the physical-replication code > path is enough different that there should be a specific test for > that combination too. > There is a difference in the handling of feedback messages for physical and logical replication code paths. It is mainly about how we advance slot's lsn based on wal flushed. See ProcessStandbyReplyMessage, towards end, we call different functions based on slot_type. So, I think it is better to have a test for the physical replication feedback mechanism. -- With Regards, Amit Kapila.