Re: Replication slot stats misgivings - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Replication slot stats misgivings |
Date | |
Msg-id | CAD21AoD8DvPDFKtZJOV1YkeTeemAREQ+ppuQL7U3LJcqFi1V-w@mail.gmail.com Whole thread Raw |
In response to | Re: Replication slot stats misgivings (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Replication slot stats misgivings
|
List | pgsql-hackers |
On Tue, Apr 27, 2021 at 11:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Apr 27, 2021 at 8:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I have pushed this patch and seeing one buildfarm failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-27%2009%3A23%3A14 > > > > starting permutation: s1_init s1_begin s1_insert_tbl1 s1_insert_tbl2 > > s2_alter_tbl1_char s1_commit s2_get_changes > > + isolationtester: canceling step s1_init after 314 seconds > > step s1_init: SELECT 'init' FROM > > pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); > > ?column? > > > > I am analyzing this. > > > > After checking below logs corresponding to this test, it seems test > has been executed and create_slot was successful: The pg_create_logical_replication_slot() was executed at 11:04:25: 2021-04-27 11:04:25.494 UTC [17694956:49] isolation/concurrent_ddl_dml LOG: statement: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); Therefore this command took 314 sec that matches the number the isolation test reported. And the folling logs follow: 2021-04-27 11:06:43.770 UTC [17694956:50] isolation/concurrent_ddl_dml LOG: logical decoding found consistent point at 0/17F9078 2021-04-27 11:06:43.770 UTC [17694956:51] isolation/concurrent_ddl_dml DETAIL: There are no running transactions. > 2021-04-27 11:06:43.770 UTC [17694956:52] isolation/concurrent_ddl_dml > STATEMENT: SELECT 'init' FROM > pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); > 2021-04-27 11:07:11.748 UTC [5243096:9] LOG: checkpoint starting: time > 2021-04-27 11:09:24.332 UTC [5243096:10] LOG: checkpoint complete: > wrote 14 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; > write=0.716 s, sync=0.001 s, total=132.584 s; sync files=0, > longest=0.000 s, average=0.000 s; distance=198 kB, estimate=406 kB > 2021-04-27 11:09:40.116 UTC [6226046:1] [unknown] LOG: connection > received: host=[local] > 2021-04-27 11:09:40.117 UTC [17694956:53] isolation/concurrent_ddl_dml > LOG: statement: BEGIN; > 2021-04-27 11:09:40.117 UTC [17694956:54] isolation/concurrent_ddl_dml > LOG: statement: INSERT INTO tbl1 (val1, val2) VALUES (1, 1); > 2021-04-27 11:09:40.118 UTC [17694956:55] isolation/concurrent_ddl_dml > LOG: statement: INSERT INTO tbl2 (val1, val2) VALUES (1, 1); > 2021-04-27 11:09:40.119 UTC [10944636:49] isolation/concurrent_ddl_dml > LOG: statement: ALTER TABLE tbl1 ALTER COLUMN val2 TYPE character > varying; > > I am not sure but there is some possibility that even though create > slot is successful, the isolation tester got successful in canceling > it, maybe because create_slot is just finished at the same time. Yeah, we see the test log "canceling step s1_init after 314 seconds" but don't see any log indicating canceling query. > As we > can see from logs, during this test checkpoint also happened which > could also lead to the slowness of this particular command. Yes. I also think the checkpoint could somewhat lead to the slowness. And since create_slot() took 2min to find a consistent snapshot the system might have already been busy. > > Also, I see a lot of messages like below which indicate stats > collector is also quite slow: > 2021-04-27 10:57:59.385 UTC [18743536:1] LOG: using stale statistics > instead of current ones because stats collector is not responding > > I am not sure if the timeout happened because the machine is slow or > is it in any way related to code. I am seeing some previous failures > due to timeout on this machine [1][2]. In those failures, I see the > "using stale stats...." message. It seems like a time-dependent issue but I'm wondering why the logical decoding test failed at this time. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
pgsql-hackers by date: