Thread: subscriptioncheck failure
I noticed $SUBJECT in curculio. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23 The failure test report: t/020_messages.pl (Wstat: 7424 Tests: 1 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 5 tests but ran 1. Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36 cusr 9.47 csys = 31.89 CPU) Result: FAIL Makefile:21: recipe for target 'check' failed gmake: *** [check] Error 1 As far as I can understand the failure happens due to error: 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR: replication slot "tap_sub" is active for PID 7354 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT: SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape') FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, 'proto_version', '1', 'publication_names', 'tap_pub', 'messages', 'true') OFFSET 1 LIMIT 1 As per my initial analysis, this happens because in the test after creating a subscription, we are not waiting for the subscriber to catch up. I think there is a narrow window where it is possible that after creating a subscription, the apply worker starts replication even after we have done Alter Subscription .. DISABLE. This is also visible in buildfarm logs: 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT: SELECT get_byte(data, 0) FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, 'proto_version', '1', 'publication_names', 'tap_pub', 'messages', 'true') ... 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') Notice that we have checked the messages before receiving the start replication command from the subscriber. Now, if we would have waited for the subscription to catch up then our check in pg_stat_replication wouldn't have given a false positive. I have noticed that we do have such wait in other subscription tests but somehow missed it here. Thoughts? -- With Regards, Amit Kapila.
On Thu, May 13, 2021 at 3:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I noticed $SUBJECT in curculio. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23 > > The failure test report: > t/020_messages.pl (Wstat: 7424 Tests: 1 Failed: 0) > Non-zero exit status: 29 > Parse errors: Bad plan. You planned 5 tests but ran 1. > Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36 > cusr 9.47 csys = 31.89 CPU) > Result: FAIL > Makefile:21: recipe for target 'check' failed > gmake: *** [check] Error 1 > > As far as I can understand the failure happens due to error: > 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR: > replication slot "tap_sub" is active for PID 7354 > 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT: > SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape') > FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, > 'proto_version', '1', > 'publication_names', 'tap_pub', > 'messages', 'true') > OFFSET 1 LIMIT 1 > > As per my initial analysis, this happens because in the test after > creating a subscription, we are not waiting for the subscriber to > catch up. I think there is a narrow window where it is possible that > after creating a subscription, the apply worker starts replication > even after we have done Alter Subscription .. DISABLE. This is also > visible in buildfarm logs: > > 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT: > SELECT get_byte(data, 0) > FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, > 'proto_version', '1', > 'publication_names', 'tap_pub', > 'messages', 'true') > ... > > 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 > (proto_version '2', publication_names '"tap_pub"') > > Notice that we have checked the messages before receiving the start > replication command from the subscriber. Now, if we would have waited > for the subscription to catch up then our check in pg_stat_replication > wouldn't have given a false positive. I have noticed that we do have > such wait in other subscription tests but somehow missed it here. I agree with your analysis, apart from this issue there is one other small problem. Currently after disabling subscription pg_stat_replication is used to verify if the walsender is exited. The steps of the walsender exit includes cleaning up of walsender and then releasing a replication slot. There will be random test failure if pg_logical_slot_peek_binary_changes is called in this narrow time window. I think it is better to fix it by checking the active column in pg_replication_slot instead of pg_stat_replication which is more reliable. Attached patch which has the fixes for the same. Thoughts? Regards, Vignesh
Attachment
On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > +$node_publisher->wait_for_catchup('tap_sub'); > + > # Ensure a transactional logical decoding message shows up on the slot > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > # wait for the replication connection to drop from the publisher > $node_publisher->poll_query_until('postgres', > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); There are three places in this test where a slot is disabled, followed by a wait to make sure that the slot is gone. Perhaps it would be better to wrap that in a small-ish routine? -- Michael
Attachment
On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > > +$node_publisher->wait_for_catchup('tap_sub'); > > + > > # Ensure a transactional logical decoding message shows up on the slot > > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > > > # wait for the replication connection to drop from the publisher > > $node_publisher->poll_query_until('postgres', > > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); > > There are three places in this test where a slot is disabled, followed > by a wait to make sure that the slot is gone. Perhaps it would be > better to wrap that in a small-ish routine? Yes that would be better, I will make the changes for this and post a patch. Regards, Vignesh
On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > > +$node_publisher->wait_for_catchup('tap_sub'); > > + > > # Ensure a transactional logical decoding message shows up on the slot > > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > > > # wait for the replication connection to drop from the publisher > > $node_publisher->poll_query_until('postgres', > > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); > > There are three places in this test where a slot is disabled, followed > by a wait to make sure that the slot is gone. Perhaps it would be > better to wrap that in a small-ish routine? Thanks for the comments, Please find the attached patch having the changes. Regards, Vignesh
Attachment
On Thu, May 13, 2021 at 07:05:55PM +0530, vignesh C wrote: > Thanks for the comments, Please find the attached patch having the changes. Cool, thanks for the new version. I have spent some time understanding the initial report from Amit as well as what you are proposing here, and refactoring the test so as the set of CREATE/ALTER SUBSCRIPTION commands are added within this routine is a good idea. I would have made the comment on top of setup_subscription a bit more talkative regarding the fact that it may reuse an existing subscription, but that's a nit. Let's wait for Amit and see what he thinks about what you are proposing. -- Michael
Attachment
On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > > > +$node_publisher->wait_for_catchup('tap_sub'); > > > + > > > # Ensure a transactional logical decoding message shows up on the slot > > > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > > > > > # wait for the replication connection to drop from the publisher > > > $node_publisher->poll_query_until('postgres', > > > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); > > > > There are three places in this test where a slot is disabled, followed > > by a wait to make sure that the slot is gone. Perhaps it would be > > better to wrap that in a small-ish routine? > > Thanks for the comments, Please find the attached patch having the changes. > Few comments: 1. + # Ensure a transactional logical decoding message shows up on the slot + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); After you have encapsulated this command in the function, the above comment doesn't make sense because we do this for both transactional and non-transactional messages. I suggest we can change it to something like: "This is done to ensure a logical decoding message is shown up on the slot". 2. +# Setup the subscription before checking pg_logical_slot_peek_binary_changes +sub setup_subscription I think here the functionality is more for the catchup of subscription, so it might be better to name the function as subscription_catchup or catchup_subscription. I think you can expand the comments atop this function a bit as suggested by Michael. -- With Regards, Amit Kapila.
On Mon, May 17, 2021 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote: > > Few comments: > 1. > + # Ensure a transactional logical decoding message shows up on the slot > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > After you have encapsulated this command in the function, the above > comment doesn't make sense because we do this for both transactional > and non-transactional messages. I suggest we can change it to > something like: "This is done to ensure a logical decoding message is > shown up on the slot". > > 2. > +# Setup the subscription before checking pg_logical_slot_peek_binary_changes > +sub setup_subscription > > I think here the functionality is more for the catchup of > subscription, so it might be better to name the function as > subscription_catchup or catchup_subscription. I think you can expand > the comments atop this function a bit as suggested by Michael. > One more point: + $node_publisher->wait_for_catchup('tap_sub'); + + # Ensure a transactional logical decoding message shows up on the slot + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); + + # wait for the replication connection to drop from the publisher + $node_publisher->poll_query_until('postgres', + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); In the above sequence, wait_for_catchup will query pg_stat_replication whereas after disabling subscription we are checking pg_replication_slots. I understand from your explanation why we can't rely on pg_stat_replication after DISABLE but it might be better to check that the slot is active before disabling it. I think currently, the test assumes that, isn't it better to have an explicit check for that? -- With Regards, Amit Kapila.
On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 17, 2021 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > Few comments: > > 1. > > + # Ensure a transactional logical decoding message shows up on the slot > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > > > After you have encapsulated this command in the function, the above > > comment doesn't make sense because we do this for both transactional > > and non-transactional messages. I suggest we can change it to > > something like: "This is done to ensure a logical decoding message is > > shown up on the slot". > > > > 2. > > +# Setup the subscription before checking pg_logical_slot_peek_binary_changes > > +sub setup_subscription > > > > I think here the functionality is more for the catchup of > > subscription, so it might be better to name the function as > > subscription_catchup or catchup_subscription. I think you can expand > > the comments atop this function a bit as suggested by Michael. > > > > One more point: > + $node_publisher->wait_for_catchup('tap_sub'); > + > + # Ensure a transactional logical decoding message shows up on the slot > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > + > + # wait for the replication connection to drop from the publisher > + $node_publisher->poll_query_until('postgres', > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE > slot_name = 'tap_sub' AND active='f'", 1); > > In the above sequence, wait_for_catchup will query pg_stat_replication > whereas after disabling subscription we are checking > pg_replication_slots. I understand from your explanation why we can't > rely on pg_stat_replication after DISABLE but it might be better to > check that the slot is active before disabling it. I think currently, > the test assumes that, isn't it better to have an explicit check for > that? I felt this is not required, wait_for_catchup will poll_query_until the state = 'streaming', even if START_REPLICATION takes time, state will be in 'startup' state, this way poll_query_until will take care of handling this. On further analysis I found that we need to do "Alter subscription tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple time, Instead we can change pg_logical_slot_peek_binary_changes to pg_logical_slot_get_binary_changes at appropriate steps. This way the common function can be removed and the enable/disable multiple times can be removed. If we are going ahead with this approach the above comments provided are no more valid. I have made the changes in similar lines in the attached patch. If you are ok we can go ahead with the new approach which will simplify the changes required. Thoughts? Regards, Vignesh
Attachment
On Mon, May 17, 2021 at 5:48 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > One more point: > > + $node_publisher->wait_for_catchup('tap_sub'); > > + > > + # Ensure a transactional logical decoding message shows up on the slot > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > + > > + # wait for the replication connection to drop from the publisher > > + $node_publisher->poll_query_until('postgres', > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE > > slot_name = 'tap_sub' AND active='f'", 1); > > > > In the above sequence, wait_for_catchup will query pg_stat_replication > > whereas after disabling subscription we are checking > > pg_replication_slots. I understand from your explanation why we can't > > rely on pg_stat_replication after DISABLE but it might be better to > > check that the slot is active before disabling it. I think currently, > > the test assumes that, isn't it better to have an explicit check for > > that? > > I felt this is not required, wait_for_catchup will poll_query_until > the state = 'streaming', even if START_REPLICATION takes time, state > will be in 'startup' state, this way poll_query_until will take care > of handling this. > makes sense, but let's add some comments to clarify the same. > On further analysis I found that we need to do "Alter subscription > tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple > time, Instead we can change pg_logical_slot_peek_binary_changes to > pg_logical_slot_get_binary_changes at appropriate steps. This way the > common function can be removed and the enable/disable multiple times > can be removed. > I think that is a valid point. This was probably kept so that we can peek multiple times for the same message to test various things but that can be achieved with the way you have changed the test. One more thing, shouldn't we make auto_vacuum=off for this test by using 'append_conf' before starting the publisher. That will avoid the risk of empty transactions. -- With Regards, Amit Kapila.
On Tue, May 18, 2021 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 17, 2021 at 5:48 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > One more point: > > > + $node_publisher->wait_for_catchup('tap_sub'); > > > + > > > + # Ensure a transactional logical decoding message shows up on the slot > > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); > > > + > > > + # wait for the replication connection to drop from the publisher > > > + $node_publisher->poll_query_until('postgres', > > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE > > > slot_name = 'tap_sub' AND active='f'", 1); > > > > > > In the above sequence, wait_for_catchup will query pg_stat_replication > > > whereas after disabling subscription we are checking > > > pg_replication_slots. I understand from your explanation why we can't > > > rely on pg_stat_replication after DISABLE but it might be better to > > > check that the slot is active before disabling it. I think currently, > > > the test assumes that, isn't it better to have an explicit check for > > > that? > > > > I felt this is not required, wait_for_catchup will poll_query_until > > the state = 'streaming', even if START_REPLICATION takes time, state > > will be in 'startup' state, this way poll_query_until will take care > > of handling this. > > > > makes sense, but let's add some comments to clarify the same. > Modified. > > On further analysis I found that we need to do "Alter subscription > > tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple > > time, Instead we can change pg_logical_slot_peek_binary_changes to > > pg_logical_slot_get_binary_changes at appropriate steps. This way the > > common function can be removed and the enable/disable multiple times > > can be removed. > > > > I think that is a valid point. This was probably kept so that we can > peek multiple times for the same message to test various things but > that can be achieved with the way you have changed the test. > > One more thing, shouldn't we make auto_vacuum=off for this test by > using 'append_conf' before starting the publisher. That will avoid the > risk of empty transactions. I felt that makes sense, added it. Thanks for the comments, the attached patch has the changes for the same. Regards, Vignesh
Attachment
On Tue, May 18, 2021 at 11:25 AM vignesh C <vignesh21@gmail.com> wrote: > > Thanks for the comments, the attached patch has the changes for the same. > Thanks, I have pushed your patch after making minor changes in the comments. -- With Regards, Amit Kapila.
On Wed, May 19, 2021 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 18, 2021 at 11:25 AM vignesh C <vignesh21@gmail.com> wrote: > > > > Thanks for the comments, the attached patch has the changes for the same. > > > > Thanks, I have pushed your patch after making minor changes in the comments. Thanks for pushing this patch. Regards, Vignesh