Thread: subscription/026_stats test is intermittently slow?
I just did a run of the regression test where this test was the last one to finish by quite a lot. Key log entries: [13:35:48.583](0.039s) # initializing database system by copying initdb template ... [13:35:52.397](0.108s) ok 5 - Check reset timestamp for 'test_tab1_sub' is newer after second reset. #### Begin standard error psql:<stdin>:1: NOTICE: created replication slot "test_tab2_sub" on publisher #### End standard error Waiting for replication conn test_tab2_sub's replay_lsn to pass 0/151E8C8 on publisher done [13:38:53.706](181.310s) ok 6 - Check that table 'test_tab2' now has 1 row. ... [13:38:54.344](0.294s) 1..13 I reran the test and it looks very different: [13:54:01.703](0.090s) ok 5 - Check reset timestamp for 'test_tab1_sub' is newer after second reset. ... Waiting for replication conn test_tab2_sub's replay_lsn to pass 0/151E900 on publisher ... [13:54:03.006](1.303s) ok 6 - Check that table 'test_tab2' now has 1 row. It looks to me like in the first run it took 3 minutes for the replay_lsn to catch up to the desired value, and in the second run, two seconds. I think I have seen previous instances where something similar happened, although in those cases I did not stop to record any details. Have others seen this? Is there something we can/should do about it? -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote: > It looks to me like in the first run it took 3 minutes for the > replay_lsn to catch up to the desired value, and in the second run, > two seconds. I think I have seen previous instances where something > similar happened, although in those cases I did not stop to record any > details. Have others seen this? Is there something we can/should do > about it? FWIW, I've also seen delays as well with this test on a few occasions. Thanks for mentioning it. -- Michael
Attachment
Hello Michael and Robert, 20.04.2024 05:57, Michael Paquier wrote: > On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote: >> It looks to me like in the first run it took 3 minutes for the >> replay_lsn to catch up to the desired value, and in the second run, >> two seconds. I think I have seen previous instances where something >> similar happened, although in those cases I did not stop to record any >> details. Have others seen this? Is there something we can/should do >> about it? > FWIW, I've also seen delays as well with this test on a few occasions. > Thanks for mentioning it. It reminds me of https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com At least, I could reproduce such a delay with the attached patch applied. Best regards, Alexander
Attachment
On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote: > > Hello Michael and Robert, > > 20.04.2024 05:57, Michael Paquier wrote: > > On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote: > >> It looks to me like in the first run it took 3 minutes for the > >> replay_lsn to catch up to the desired value, and in the second run, > >> two seconds. I think I have seen previous instances where something > >> similar happened, although in those cases I did not stop to record any > >> details. Have others seen this? Is there something we can/should do > >> about it? > > FWIW, I've also seen delays as well with this test on a few occasions. > > Thanks for mentioning it. > > It reminds me of > https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com Thanks Alexander for the test, I was able to reproduce the issue with the test you shared and also verify that the patch at [1] fixes the same. This is the same issue where the apply worker for test_tab2_sub was getting started after 180 seconds because the common latch (which is used for worker attached, subscription creation/modification and apply worker process exit) was getting reset when the other subscription test_tab1_sub's worker gets started. The same can be seen from the logs: 2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG: statement: BEGIN; 2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG: statement: SELECT pg_sleep(0.5); 2024-04-22 20:47:52.426 IST [323281] LOG: logical replication apply worker for subscription "test_tab1_sub" has started 2024-04-22 20:47:52.511 IST [323280] 026_stats.pl LOG: statement: CREATE TABLE test_tab2(a int primary key); 2024-04-22 20:47:52.518 IST [323280] 026_stats.pl LOG: statement: INSERT INTO test_tab2 VALUES (1); 2024-04-22 20:47:52.519 IST [323280] 026_stats.pl LOG: statement: COMMIT; 2024-04-22 20:47:52.540 IST [323286] 026_stats.pl LOG: statement: CREATE SUBSCRIPTION test_tab2_sub CONNECTION 'port=56685 host=/tmp/RwzpQrVMYH dbname=postgres' PUBLICATION test_tab2_pub 2024-04-22 20:50:52.658 IST [326265] LOG: logical replication apply worker for subscription "test_tab2_sub" has started 2024-04-22 20:50:52.668 IST [326267] LOG: logical replication table synchronization worker for subscription "test_tab2_sub", table "test_tab2" has started [1] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com Regards, Vignesh
Hi, On 2024-04-19 13:57:41 -0400, Robert Haas wrote: > Have others seen this? Is there something we can/should do about it? Yes, I've also seen this - but never quite reproducible enough to properly tackle it. The first thing I'd like to do is to make the wait_for_catchup routine regularly log the current state, so we can in retrospect analyze e.g. whether there was continual, but slow, replay progress, or whether replay was entirely stuck. wait_for_catchup() not being debuggable has been a problem in many different tests, so I think it's high time to fix that. Greetings, Andres Freund
On Tue, Apr 23, 2024 at 11:49 AM vignesh C <vignesh21@gmail.com> wrote: > > On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote: > > > > Hello Michael and Robert, > > > > 20.04.2024 05:57, Michael Paquier wrote: > > > On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote: > > >> It looks to me like in the first run it took 3 minutes for the > > >> replay_lsn to catch up to the desired value, and in the second run, > > >> two seconds. I think I have seen previous instances where something > > >> similar happened, although in those cases I did not stop to record any > > >> details. Have others seen this? Is there something we can/should do > > >> about it? > > > FWIW, I've also seen delays as well with this test on a few occasions. > > > Thanks for mentioning it. > > > > It reminds me of > > https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com > > Thanks Alexander for the test, I was able to reproduce the issue with > the test you shared and also verify that the patch at [1] fixes the > same. > One of the issues reported in the thread you referred to has the same symptoms [1]. I'll review and analyze your proposal. [1] - https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com -- With Regards, Amit Kapila.