Thread: Replication slot drop message is sent after pgstats shutdown.
Hi all,

I found another path where we report stats after the stats collector has shut down. The reproducer and the backtrace I got are below:

1. psql -c "begin; create table a (a int); select pg_sleep(30); commit;" &
2. pg_recvlogical --create-slot -S slot -d postgres &
3. stop the server

TRAP: FailedAssertion("pgstat_is_initialized && !pgstat_is_shutdown", File: "pgstat.c", Line: 4752, PID: 62789)
0   postgres        0x000000010a8ed79a ExceptionalCondition + 234
1   postgres        0x000000010a5e03d2 pgstat_assert_is_up + 66
2   postgres        0x000000010a5e1dc4 pgstat_send + 20
3   postgres        0x000000010a5e1d5c pgstat_report_replslot_drop + 108
4   postgres        0x000000010a64c796 ReplicationSlotDropPtr + 838
5   postgres        0x000000010a64c0e9 ReplicationSlotDropAcquired + 89
6   postgres        0x000000010a64bf23 ReplicationSlotRelease + 99
7   postgres        0x000000010a6d60ab ProcKill + 219
8   postgres        0x000000010a6a350c shmem_exit + 444
9   postgres        0x000000010a6a326a proc_exit_prepare + 122
10  postgres        0x000000010a6a3163 proc_exit + 19
11  postgres        0x000000010a8ee665 errfinish + 1109
12  postgres        0x000000010a6e3535 ProcessInterrupts + 1445
13  postgres        0x000000010a65f654 WalSndWaitForWal + 164
14  postgres        0x000000010a65edb2 logical_read_xlog_page + 146
15  postgres        0x000000010a22c336 ReadPageInternal + 518
16  postgres        0x000000010a22b860 XLogReadRecord + 320
17  postgres        0x000000010a619c67 DecodingContextFindStartpoint + 231
18  postgres        0x000000010a65c105 CreateReplicationSlot + 1237
19  postgres        0x000000010a65b64c exec_replication_command + 1180
20  postgres        0x000000010a6e6d2b PostgresMain + 2459
21  postgres        0x000000010a5ef1a9 BackendRun + 89
22  postgres        0x000000010a5ee6fd BackendStartup + 557
23  postgres        0x000000010a5ed487 ServerLoop + 759
24  postgres        0x000000010a5eac22 PostmasterMain + 6610
25  postgres        0x000000010a4c32d3 main + 819
26  libdyld.dylib   0x00007fff73477cc9 start + 1

At step #2, the walsender waits for the transaction started at step #1 to complete after creating the replication slot. When the server is stopping, the walsender process drops the slot while releasing it, since the slot is still RS_EPHEMERAL. Then, after dropping the slot, we report the slot-drop message (see ReplicationSlotDropPtr()). These are executed in ReplicationSlotRelease(), called by ProcKill(), which runs during the on_shmem_exit callbacks, that is, after pgstats has been shut down during the before_shmem_exit callbacks. I haven't tested it yet, but I think this can also happen when dropping a temporary slot; ProcKill() also calls ReplicationSlotCleanup() to clean up temporary slots.

There are some ideas to fix this issue, but I don't think it's a good idea to move either ProcKill() or the slot-releasing code to before_shmem_exit in this case, as we did for other similar issues[1][2].

Reporting the slot-drop message when dropping the slot isn't strictly essential, since autovacuum workers periodically check for already-dropped slots and report that their stats should be dropped. So another idea would be to move pgstat_report_replslot_drop() to a higher layer such as ReplicationSlotDrop() and ReplicationSlotsDropDBSlots(), which are not called from exit callbacks. The replication slot stats would then be dropped when the slot is dropped via commands such as pg_drop_replication_slot() and DROP_REPLICATION_SLOT, while for temporary slots and ephemeral slots we would rely on autovacuum to drop their stats. Even if we delay dropping the stats for those slots, pg_stat_replication_slots doesn't show stats for already-dropped slots.

Any other ideas?
Regards,

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=675c945394b36c2db0e8c8c9f6209c131ce3f0a8
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=dcac5e7ac157964f71f15d81c7429130c69c3f9b

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
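[To make the ordering described in the report concrete, here is a tiny standalone C sketch. It is a toy reimplementation, not PostgreSQL source; the names only mirror the backtrace above. The stand-in for the pgstats shutdown runs in the before_shmem_exit phase, so the stand-in ProcKill() trips an equivalent assertion in the later on_shmem_exit phase.]

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool pgstat_is_initialized = true;
static bool pgstat_is_shutdown = false;

/* toy stand-in for pgstat_send(); same condition as the failed assertion */
static void
pgstat_send(void)
{
    assert(pgstat_is_initialized && !pgstat_is_shutdown);
    printf("stats message sent\n");
}

/* runs as a before_shmem_exit callback in the real server */
static void
pgstat_shutdown(void)
{
    pgstat_is_shutdown = true;
}

/* runs as an on_shmem_exit callback in the real server */
static void
ProcKill(void)
{
    /*
     * ReplicationSlotRelease() drops the still-RS_EPHEMERAL slot, and
     * ReplicationSlotDropPtr() then reports the drop:
     */
    pgstat_send();              /* fires the assertion: pgstats already shut down */
}

int
main(void)
{
    /* proc_exit_prepare(): before_shmem_exit callbacks run first ... */
    pgstat_shutdown();
    /* ... then shmem_exit() runs the on_shmem_exit callbacks */
    ProcKill();
    return 0;
}

[Compiling and running this aborts at the assert, which is the toy analogue of the TRAP above.]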
Hi,

On 2021-08-31 11:37:08 +0900, Masahiko Sawada wrote:
> At step #2, the walsender waits for the transaction started at step #1
> to complete after creating the replication slot. When the server is
> stopping, the walsender process drops the slot while releasing it,
> since the slot is still RS_EPHEMERAL. Then, after dropping the slot, we
> report the slot-drop message (see ReplicationSlotDropPtr()). These are
> executed in ReplicationSlotRelease(), called by ProcKill(), which runs
> during the on_shmem_exit callbacks, that is, after pgstats has been
> shut down during the before_shmem_exit callbacks. I haven't tested it
> yet, but I think this can also happen when dropping a temporary slot;
> ProcKill() also calls ReplicationSlotCleanup() to clean up temporary
> slots.
>
> There are some ideas to fix this issue, but I don't think it's a good
> idea to move either ProcKill() or the slot-releasing code to
> before_shmem_exit in this case, as we did for other similar
> issues[1][2].

Yea, that's clearly not an option.

I wonder why the replication slot stuff is in ProcKill() rather than its
own callback. That's probably my fault, but I don't remember what led to
that.

> Reporting the slot-drop message when dropping the slot isn't strictly
> essential, since autovacuum workers periodically check for
> already-dropped slots and report that their stats should be dropped. So
> another idea would be to move pgstat_report_replslot_drop() to a higher
> layer such as ReplicationSlotDrop() and ReplicationSlotsDropDBSlots(),
> which are not called from exit callbacks. The replication slot stats
> would then be dropped when the slot is dropped via commands such as
> pg_drop_replication_slot() and DROP_REPLICATION_SLOT, while for
> temporary slots and ephemeral slots we would rely on autovacuum to drop
> their stats. Even if we delay dropping the stats for those slots,
> pg_stat_replication_slots doesn't show stats for already-dropped slots.

Yea, we could do that, but I think it'd be nicer to find a bit more
principled solution...

Perhaps moving this stuff out from ProcKill() into its own
before_shmem_exit() callback would do the trick?

Greetings,

Andres Freund
On Tue, Aug 31, 2021 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-31 11:37:08 +0900, Masahiko Sawada wrote:
> > At step #2, the walsender waits for the transaction started at step #1
> > to complete after creating the replication slot. When the server is
> > stopping, the walsender process drops the slot while releasing it,
> > since the slot is still RS_EPHEMERAL. Then, after dropping the slot, we
> > report the slot-drop message (see ReplicationSlotDropPtr()). These are
> > executed in ReplicationSlotRelease(), called by ProcKill(), which runs
> > during the on_shmem_exit callbacks, that is, after pgstats has been
> > shut down during the before_shmem_exit callbacks. I haven't tested it
> > yet, but I think this can also happen when dropping a temporary slot;
> > ProcKill() also calls ReplicationSlotCleanup() to clean up temporary
> > slots.
> >
> > There are some ideas to fix this issue, but I don't think it's a good
> > idea to move either ProcKill() or the slot-releasing code to
> > before_shmem_exit in this case, as we did for other similar
> > issues[1][2].
>
> Yea, that's clearly not an option.
>
> I wonder why the replication slot stuff is in ProcKill() rather than its
> own callback. That's probably my fault, but I don't remember what led to
> that.
>
> > Reporting the slot-drop message when dropping the slot isn't strictly
> > essential, since autovacuum workers periodically check for
> > already-dropped slots and report that their stats should be dropped. So
> > another idea would be to move pgstat_report_replslot_drop() to a higher
> > layer such as ReplicationSlotDrop() and ReplicationSlotsDropDBSlots(),
> > which are not called from exit callbacks. The replication slot stats
> > would then be dropped when the slot is dropped via commands such as
> > pg_drop_replication_slot() and DROP_REPLICATION_SLOT, while for
> > temporary slots and ephemeral slots we would rely on autovacuum to drop
> > their stats. Even if we delay dropping the stats for those slots,
> > pg_stat_replication_slots doesn't show stats for already-dropped slots.
>
> Yea, we could do that, but I think it'd be nicer to find a bit more
> principled solution...
>
> Perhaps moving this stuff out from ProcKill() into its own
> before_shmem_exit() callback would do the trick?

You mean to move only the part of sending the message to its own
before_shmem_exit() callback? or move ReplicationSlotRelease() and
ReplicationSlotCleanup() from ProcKill() to it?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On 2021-08-31 14:22:39 +0900, Masahiko Sawada wrote:
> You mean to move only the part of sending the message to its own
> before_shmem_exit() callback? or move ReplicationSlotRelease() and
> ReplicationSlotCleanup() from ProcKill() to it?

The latter.
On Tue, Aug 31, 2021 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-08-31 14:22:39 +0900, Masahiko Sawada wrote:
> > You mean to move only the part of sending the message to its own
> > before_shmem_exit() callback? or move ReplicationSlotRelease() and
> > ReplicationSlotCleanup() from ProcKill() to it?
>
> The latter.

Makes sense.

I've attached the patch that moves them to its own before_shmem_exit().
Unless I missed registering the callback somewhere, it works the same
as before except for where the slots are released and cleaned up.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
At Tue, 31 Aug 2021 17:14:45 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> On Tue, Aug 31, 2021 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2021-08-31 14:22:39 +0900, Masahiko Sawada wrote:
> > > You mean to move only the part of sending the message to its own
> > > before_shmem_exit() callback? or move ReplicationSlotRelease() and
> > > ReplicationSlotCleanup() from ProcKill() to it?
> >
> > The latter.
>
> Makes sense.
>
> I've attached the patch that moves them to its own before_shmem_exit().
> Unless I missed registering the callback somewhere, it works the same
> as before except for where the slots are released and cleaned up.

Is there any reason we need to register the callback dynamically? It
seems to me what we need to do here is to call the functions at
before-shmem-exit.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2021-08-31 18:34:12 +0900, Kyotaro Horiguchi wrote:
> At Tue, 31 Aug 2021 17:14:45 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> > On Tue, Aug 31, 2021 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2021-08-31 14:22:39 +0900, Masahiko Sawada wrote:
> > > > You mean to move only the part of sending the message to its own
> > > > before_shmem_exit() callback? or move ReplicationSlotRelease() and
> > > > ReplicationSlotCleanup() from ProcKill() to it?
> > >
> > > The latter.
> >
> > Makes sense.
> >
> > I've attached the patch that moves them to its own before_shmem_exit().
> > Unless I missed registering the callback somewhere, it works the same
> > as before except for where the slots are released and cleaned up.
>
> Is there any reason we need to register the callback dynamically? It
> seems to me what we need to do here is to call the functions at
> before-shmem-exit.

+1. I'd just add a ReplicationSlotInitialize() to BaseInit().

Greetings,

Andres Freund
On Wed, Sep 1, 2021 at 2:39 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-31 18:34:12 +0900, Kyotaro Horiguchi wrote:
> > At Tue, 31 Aug 2021 17:14:45 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> > > On Tue, Aug 31, 2021 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > On 2021-08-31 14:22:39 +0900, Masahiko Sawada wrote:
> > > > > You mean to move only the part of sending the message to its own
> > > > > before_shmem_exit() callback? or move ReplicationSlotRelease() and
> > > > > ReplicationSlotCleanup() from ProcKill() to it?
> > > >
> > > > The latter.
> > >
> > > Makes sense.
> > >
> > > I've attached the patch that moves them to its own before_shmem_exit().
> > > Unless I missed registering the callback somewhere, it works the same
> > > as before except for where the slots are released and cleaned up.
> >
> > Is there any reason we need to register the callback dynamically? It
> > seems to me what we need to do here is to call the functions at
> > before-shmem-exit.
>
> +1. I'd just add a ReplicationSlotInitialize() to BaseInit().

+1. But BaseInit() is also called by auxiliary processes, which doesn't
seem necessary here. So isn't it better to add it to InitPostgres() or
InitProcess()?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,

On 2021-09-01 10:05:18 +0900, Masahiko Sawada wrote:
> On Wed, Sep 1, 2021 at 2:39 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-08-31 18:34:12 +0900, Kyotaro Horiguchi wrote:
> > > At Tue, 31 Aug 2021 17:14:45 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> > > > On Tue, Aug 31, 2021 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I've attached the patch that moves them to its own before_shmem_exit().
> > > > Unless I missed registering the callback somewhere, it works the same
> > > > as before except for where the slots are released and cleaned up.
> > >
> > > Is there any reason we need to register the callback dynamically? It
> > > seems to me what we need to do here is to call the functions at
> > > before-shmem-exit.
> >
> > +1. I'd just add a ReplicationSlotInitialize() to BaseInit().
>
> +1. But BaseInit() is also called by auxiliary processes, which doesn't
> seem necessary here. So isn't it better to add it to InitPostgres() or
> InitProcess()?

-0.5 - I think we should default to making the environments more similar,
rather than the opposite. With exceptions for cases where that'd cause
overhead or undue complexity. Which I don't see here?

Greetings,

Andres Freund
On Wed, Sep 1, 2021 at 12:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-09-01 10:05:18 +0900, Masahiko Sawada wrote:
> > On Wed, Sep 1, 2021 at 2:39 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2021-08-31 18:34:12 +0900, Kyotaro Horiguchi wrote:
> > > > At Tue, 31 Aug 2021 17:14:45 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> > > > > On Tue, Aug 31, 2021 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > I've attached the patch that moves them to its own before_shmem_exit().
> > > > > Unless I missed registering the callback somewhere, it works the same
> > > > > as before except for where the slots are released and cleaned up.
> > > >
> > > > Is there any reason we need to register the callback dynamically? It
> > > > seems to me what we need to do here is to call the functions at
> > > > before-shmem-exit.
> > >
> > > +1. I'd just add a ReplicationSlotInitialize() to BaseInit().
> >
> > +1. But BaseInit() is also called by auxiliary processes, which doesn't
> > seem necessary here. So isn't it better to add it to InitPostgres() or
> > InitProcess()?
>
> -0.5 - I think we should default to making the environments more similar,
> rather than the opposite. With exceptions for cases where that'd cause
> overhead or undue complexity. Which I don't see here?

Sorry for the late response. I'd missed this discussion for some reason.

I agreed with Andres and Horiguchi-san and attached an updated patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
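[The attached patch itself is not reproduced in this archive. Purely for illustration, a minimal sketch of the shape being discussed follows; this is assumed code, not the actual attachment. before_shmem_exit(), ReplicationSlotRelease(), ReplicationSlotCleanup(), MyReplicationSlot and BaseInit() are existing PostgreSQL names, while the two new functions use the names proposed in the thread.]

/* in replication/slot.c */
#include "postgres.h"

#include "replication/slot.h"
#include "storage/ipc.h"

/*
 * before_shmem_exit callback: release a still-acquired (possibly
 * RS_EPHEMERAL) slot and drop any temporary slots owned by this backend,
 * while pgstats is still available.
 */
static void
ReplicationSlotShmemExit(int code, Datum arg)
{
	/* Make sure an active replication slot is released on exit. */
	if (MyReplicationSlot != NULL)
		ReplicationSlotRelease();

	/* Also cleanup the temporary slots. */
	ReplicationSlotCleanup();
}

/*
 * Called once per process at startup, so the callback is registered at a
 * fixed place instead of dynamically.
 */
void
ReplicationSlotInitialize(void)
{
	before_shmem_exit(ReplicationSlotShmemExit, 0);
}

/*
 * And in BaseInit(), alongside the other per-process initialization:
 *
 *		ReplicationSlotInitialize();
 */

[Because before_shmem_exit callbacks run before the on_shmem_exit phase that contains ProcKill(), the slot-drop stats message would then be sent while pgstats is still up.]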
At Fri, 10 Dec 2021 18:13:31 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> I agreed with Andres and Horiguchi-san and attached an updated patch.

Thanks for the new version. It seems fine, but I have some comments from a
cosmetic viewpoint.

+	/*
+	 * Register callback to make sure cleanup and releasing the replication
+	 * slot on exit.
+	 */
+	ReplicationSlotInitialize();

Actually that is all the function does, but the comment looks slightly
inconsistent with the function name. I think we can call it just
"initialization" here. I don't think we need to change the function
comment in the same way, but either will do for me.

+ReplicationSlotBeforeShmemExit(int code, Datum arg)

The name looks a bit too verbose. Doesn't just "ReplicationSlotShmemExit"
work?

-	 * so releasing here is fine. There's another cleanup in ProcKill()
-	 * ensuring we'll correctly cleanup on FATAL errors as well.
+	 * so releasing here is fine. There's another cleanup in
+	 * ReplicationSlotBeforeShmemExit() callback ensuring we'll correctly
+	 * cleanup on FATAL errors as well.

I'd prefer "during before_shmem_exit()" over "in ReplicationSlotBeforeShmemExit()
callback" here. (But the current wording is also fine by me.)

The attached detects that bug, but I'm not sure it's worth expending test
time, or this might belong in the server test suite.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
index 90da1662e3..0fb4e67697 100644
--- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -5,7 +5,8 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Utils;
 use PostgreSQL::Test::Cluster;
-use Test::More tests => 20;
+use Test::More tests => 25;
+use IPC::Run qw(pump finish timer);
 
 program_help_ok('pg_recvlogical');
 program_version_ok('pg_recvlogical');
@@ -106,3 +107,44 @@ $node->command_ok(
 		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
 	],
 	'replayed a two-phase transaction');
+
+## Check for a crash bug caused by replication-slot cleanup after
+## pgstat shutdown.
+# fire up an interactive psql session
+my $in  = '';
+my $out = '';
+my $timer = timer(5);
+my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+like($out, qr/psql/, "print startup banner");
+
+# open a transaction
+$out = "";
+$in .= "BEGIN;\nCREATE TABLE a (a int);\n";
+pump $h until ($out =~ /CREATE TABLE/ || timer->is_expired);
+ok(!timer->is_expired, 'background CREATE TABLE passed');
+
+# this recvlogical waits for the transaction to end
+ok(open(my $recvlogical, '-|',
+	'pg_recvlogical', '--create-slot', '-S', 'test2',
+	'-d', $node->connstr('postgres')),
+   'launch background pg_recvlogical');
+
+$node->poll_query_until('postgres',
+	qq{SELECT count(*) > 0 FROM pg_stat_activity
+	   WHERE backend_type='walsender'
+	   AND query like 'CREATE_REPLICATION_SLOT %';});
+# stop server while it hangs. This shouldn't crash the server.
+$node->stop;
+ok(open(my $cont, '-|', 'pg_controldata', $node->data_dir),
+   'run pg_controldata');
+my $stop_result = '';
+while (<$cont>)
+{
+	if (/^Database cluster state: *([^ ].*)$/)
+	{
+		$stop_result = $1;
+		last;
+	}
+}
+
+is($stop_result, 'shut down', 'server is properly shut down');
At Mon, 14 Feb 2022 17:20:16 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> On 2021-12-22 22:34:45 +0900, Masahiko Sawada wrote:
> > I've attached an updated patch. Please review it.
>
> Sorry for dropping the ball on this again :(. I've pushed the fix with some
> very minor polishing.

Thanks!

> > > The attached detects that bug, but I'm not sure it's worth expending
> > > test time, or this might belong in the server test suite.
> >
> > Thanks. It's convenient to test this issue but I'm also not sure it's
> > worth adding to the test suite.
>
> I think it's definitely worth adding a test, but I don't particularly like the
> specific test implementation. Primarily because I think it's better to test
> this in a cluster that stays running, so that we can verify that the slot drop
> worked. It also doesn't seem necessary to create a separate cluster.

One of the points I was not satisfied with in the TAP test was the second
point above. FWIW I agree with the direction of the proposed test.

> I wrote the attached isolation test. I ended up not committing it yet - I'm
> worried that there could be some OS dependent output difference, due to some
> libpq error handling issues. See [1], which Tom pointed out is caused by the
> issue discussed in [2].

Mmm.. This is..

slot_creation_error.out
> step s2_init: <... completed>
> FATAL:  terminating connection due to administrator command
> FATAL:  terminating connection due to administrator command

> Greetings,
>
> Andres Freund
>
> [1] https://postgr.es/m/20220215004143.dlzsn72oqsmqa7uw%40alap3.anarazel.de
> [2] https://postgr.es/m/20220215004143.dlzsn72oqsmqa7uw%40alap3.anarazel.de

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Feb 15, 2022 at 12:09 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Mon, 14 Feb 2022 17:20:16 -0800, Andres Freund <andres@anarazel.de> wrote in
> > Hi,
> >
> > On 2021-12-22 22:34:45 +0900, Masahiko Sawada wrote:
> > > I've attached an updated patch. Please review it.
> >
> > Sorry for dropping the ball on this again :(. I've pushed the fix with some
> > very minor polishing.
>
> Thanks!
>
> > > > The attached detects that bug, but I'm not sure it's worth expending
> > > > test time, or this might belong in the server test suite.
> > >
> > > Thanks. It's convenient to test this issue but I'm also not sure it's
> > > worth adding to the test suite.
> >
> > I think it's definitely worth adding a test, but I don't particularly like the
> > specific test implementation. Primarily because I think it's better to test
> > this in a cluster that stays running, so that we can verify that the slot drop
> > worked. It also doesn't seem necessary to create a separate cluster.
>
> FWIW I agree with the direction of the proposed test.

+1

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Feb 15, 2022 at 08:58:56AM -0800, Andres Freund wrote:
> Pushed the test yesterday evening, after Tom checked if it is likely to be
> problematic. Seems to have worked without problems so far.

wrasse        │ 2022-02-15 09:29:06 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-02-15%2009%3A29%3A06
flaviventris  │ 2022-02-24 15:17:30 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-02-24%2015%3A17%3A30
calliphoridae │ 2022-03-08 01:14:51 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-03-08%2001%3A14%3A51

The buildfarm failed to convey adequate logs for this particular test suite.
Here's regression.diffs from the wrasse case (saved via keep_error_builds):

===
diff -U3 /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
--- /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out	Tue Feb 15 06:58:14 2022
+++ /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out	Tue Feb 15 11:38:14 2022
@@ -29,16 +29,17 @@
 t
 (1 row)
 
-step s2_init: <... completed>
-ERROR:  canceling statement due to user request
 step s1_view_slot:
     SELECT slot_name, slot_type, active FROM pg_replication_slots WHERE slot_name = 'slot_creation_error'
 
-slot_name|slot_type|active
----------+---------+------
-(0 rows)
+slot_name          |slot_type|active
+-------------------+---------+------
+slot_creation_error|logical  |t
+(1 row)
 
 step s1_c: COMMIT;
+step s2_init: <... completed>
+ERROR:  canceling statement due to user request
 
 starting permutation: s1_b s1_xid s2_init s1_c s1_view_slot s1_drop_slot
 step s1_b: BEGIN;
===

I can make it fail that way by injecting a 1s delay here:

--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3339,6 +3339,7 @@ ProcessInterrupts(void)
 		 */
 		if (!DoingCommandRead)
 		{
+			pg_usleep(1 * 1000 * 1000);
 			LockErrorCleanup();
 			ereport(ERROR,
 					(errcode(ERRCODE_QUERY_CANCELED),

I plan to fix this as attached, similar to how commit c04c767 fixed the same
challenge in detach-partition-concurrently-[34].
Attachment
At Fri, 18 Mar 2022 00:28:37 -0700, Noah Misch <noah@leadboat.com> wrote in
> ===
> diff -U3 /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
> --- /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out	Tue Feb 15 06:58:14 2022
> +++ /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out	Tue Feb 15 11:38:14 2022

> I plan to fix this as attached, similar to how commit c04c767 fixed the same
> challenge in detach-partition-concurrently-[34].

It looks correct, and I confirmed that it works. It looks like a similar
issue to [1], but this is cleaner and more stable.

[1] https://www.postgresql.org/message-id/20220304.113347.2105652521035137491.horikyota.ntt@gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2022-03-18 00:28:37 -0700, Noah Misch wrote:
> On Tue, Feb 15, 2022 at 08:58:56AM -0800, Andres Freund wrote:
> > Pushed the test yesterday evening, after Tom checked if it is likely to be
> > problematic. Seems to have worked without problems so far.
>
> wrasse        │ 2022-02-15 09:29:06 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-02-15%2009%3A29%3A06
> flaviventris  │ 2022-02-24 15:17:30 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-02-24%2015%3A17%3A30
> calliphoridae │ 2022-03-08 01:14:51 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-03-08%2001%3A14%3A51

Huh. Weirdly enough, I saw this failure twice in a development branch
yesterday...

> The buildfarm failed to convey adequate logs for this particular test suite.
> Here's regression.diffs from the wrasse case (saved via keep_error_builds):

Thanks for getting that!

> ===
> diff -U3 /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
> --- /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out	Tue Feb 15 06:58:14 2022
> +++ /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out	Tue Feb 15 11:38:14 2022
> @@ -29,16 +29,17 @@
>  t
>  (1 row)
>  
> -step s2_init: <... completed>
> -ERROR:  canceling statement due to user request
>  step s1_view_slot:
>      SELECT slot_name, slot_type, active FROM pg_replication_slots WHERE slot_name = 'slot_creation_error'
>  
> -slot_name|slot_type|active
> ----------+---------+------
> -(0 rows)
> +slot_name          |slot_type|active
> +-------------------+---------+------
> +slot_creation_error|logical  |t
> +(1 row)
>  
>  step s1_c: COMMIT;
> +step s2_init: <... completed>
> +ERROR:  canceling statement due to user request
>  
>  starting permutation: s1_b s1_xid s2_init s1_c s1_view_slot s1_drop_slot
>  step s1_b: BEGIN;
> ===
>
> I can make it fail that way by injecting a 1s delay here:
>
> --- a/src/backend/tcop/postgres.c
> +++ b/src/backend/tcop/postgres.c
> @@ -3339,6 +3339,7 @@ ProcessInterrupts(void)
>  		 */
>  		if (!DoingCommandRead)
>  		{
> +			pg_usleep(1 * 1000 * 1000);
>  			LockErrorCleanup();
>  			ereport(ERROR,
>  					(errcode(ERRCODE_QUERY_CANCELED),

So isolationtester still sees the blocking condition from before the cancel
processing is finished and thus proceeds to the next statement - which it
normally should only do once the other running step is detected as still
blocking?

I wonder if we should emit <waiting> every time a step is detected anew as
being blocked, to make it easier to understand issues like this.

> diff --git a/contrib/test_decoding/specs/slot_creation_error.spec b/contrib/test_decoding/specs/slot_creation_error.spec
> index 6816696..d1e35bf 100644
> --- a/contrib/test_decoding/specs/slot_creation_error.spec
> +++ b/contrib/test_decoding/specs/slot_creation_error.spec
> @@ -35,7 +35,7 @@ step s2_init {
>  # The tests first start a transaction with an xid assigned in s1, then create
>  # a slot in s2. The slot creation waits for s1's transaction to end. Instead
>  # we cancel / terminate s2.
> -permutation s1_b s1_xid s2_init s1_view_slot s1_cancel_s2 s1_view_slot s1_c
> +permutation s1_b s1_xid s2_init s1_view_slot s1_cancel_s2(s2_init) s1_view_slot s1_c
>  permutation s1_b s1_xid s2_init s1_c s1_view_slot s1_drop_slot # check slot creation still works
> -permutation s1_b s1_xid s2_init s1_terminate_s2 s1_c s1_view_slot
> +permutation s1_b s1_xid s2_init s1_terminate_s2(s2_init) s1_c s1_view_slot
>  # can't run tests after this, due to s2's connection failure

That looks good to me.

Thanks!

Andres Freund
On Fri, Mar 18, 2022 at 01:24:15PM -0700, Andres Freund wrote:
> On 2022-03-18 00:28:37 -0700, Noah Misch wrote:
> > ===
> > diff -U3 /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
> > --- /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out	Tue Feb 15 06:58:14 2022
> > +++ /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out	Tue Feb 15 11:38:14 2022
> > @@ -29,16 +29,17 @@
> >  t
> >  (1 row)
> >  
> > -step s2_init: <... completed>
> > -ERROR:  canceling statement due to user request
> >  step s1_view_slot:
> >      SELECT slot_name, slot_type, active FROM pg_replication_slots WHERE slot_name = 'slot_creation_error'
> >  
> > -slot_name|slot_type|active
> > ----------+---------+------
> > -(0 rows)
> > +slot_name          |slot_type|active
> > +-------------------+---------+------
> > +slot_creation_error|logical  |t
> > +(1 row)
> >  
> >  step s1_c: COMMIT;
> > +step s2_init: <... completed>
> > +ERROR:  canceling statement due to user request
> >  
> >  starting permutation: s1_b s1_xid s2_init s1_c s1_view_slot s1_drop_slot
> >  step s1_b: BEGIN;
> > ===
> >
> > I can make it fail that way by injecting a 1s delay here:
> >
> > --- a/src/backend/tcop/postgres.c
> > +++ b/src/backend/tcop/postgres.c
> > @@ -3339,6 +3339,7 @@ ProcessInterrupts(void)
> >  		 */
> >  		if (!DoingCommandRead)
> >  		{
> > +			pg_usleep(1 * 1000 * 1000);
> >  			LockErrorCleanup();
> >  			ereport(ERROR,
> >  					(errcode(ERRCODE_QUERY_CANCELED),
>
> So isolationtester still sees the blocking condition from before the cancel
> processing is finished and thus proceeds to the next statement - which it
> normally should only do once the other running step is detected as still
> blocking?

Essentially. It called s1_view_slot too early. s2_init can remain blocked
arbitrarily long after pg_cancel_backend returns. Writing
s1_cancel_s2(s2_init) directs the isolationtester to send the cancel, then
wait for s2_init to finish, then wait for the cancel to finish.

> I wonder if we should emit <waiting> every time a step is detected anew as
> being blocked, to make it easier to understand issues like this.

Good question.

> > diff --git a/contrib/test_decoding/specs/slot_creation_error.spec b/contrib/test_decoding/specs/slot_creation_error.spec
> > index 6816696..d1e35bf 100644
> > --- a/contrib/test_decoding/specs/slot_creation_error.spec
> > +++ b/contrib/test_decoding/specs/slot_creation_error.spec
> > @@ -35,7 +35,7 @@ step s2_init {
> >  # The tests first start a transaction with an xid assigned in s1, then create
> >  # a slot in s2. The slot creation waits for s1's transaction to end. Instead
> >  # we cancel / terminate s2.
> > -permutation s1_b s1_xid s2_init s1_view_slot s1_cancel_s2 s1_view_slot s1_c
> > +permutation s1_b s1_xid s2_init s1_view_slot s1_cancel_s2(s2_init) s1_view_slot s1_c
> >  permutation s1_b s1_xid s2_init s1_c s1_view_slot s1_drop_slot # check slot creation still works
> > -permutation s1_b s1_xid s2_init s1_terminate_s2 s1_c s1_view_slot
> > +permutation s1_b s1_xid s2_init s1_terminate_s2(s2_init) s1_c s1_view_slot
> >  # can't run tests after this, due to s2's connection failure
>
> That looks good to me.

Pushed.

Kyotaro Horiguchi had posted a patch that also changed the
pg_terminate_backend call in temp-schema-cleanup.spec.
I think that one is fine as-is, because it does pg_terminate_backend(pg_backend_pid()). There's no way for a backend running that particular command to report that it's ready for another query, so the problem doesn't arise.
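[For readers unfamiliar with the blocker-marker syntax discussed above, here is a small, hypothetical isolation spec. It is not part of the tree; all table, session, and step names are made up for illustration. The (s2_insert) annotation tells isolationtester not to consider s1_cancel done until the cancelled s2_insert has actually finished erroring out, so later steps cannot observe a half-cancelled backend.]

setup
{
    CREATE TABLE t (a int);
}

teardown
{
    DROP TABLE t;
}

session s1
step s1_begin  { BEGIN; }
step s1_lock   { LOCK TABLE t IN ACCESS EXCLUSIVE MODE; }
step s1_cancel {
    SELECT pg_cancel_backend(pid) FROM pg_stat_activity
    WHERE query LIKE 'INSERT INTO t%';
}
step s1_commit { COMMIT; }

session s2
step s2_insert { INSERT INTO t VALUES (1); }

# s2_insert blocks on s1's lock; the (s2_insert) marker makes s1_cancel wait
# for it to finish (with an error) before the permutation continues.
permutation s1_begin s1_lock s2_insert s1_cancel(s2_insert) s1_commit

[Without the marker, isolationtester could report s1_cancel as completed while s2_insert is still erroring out, which is the same kind of race seen in the wrasse failure above.]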