Amit Kapila <amit.kapila16@gmail.com> writes:
> On Thu, Jul 12, 2018 at 1:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
>> Looks unrelated, but flaviventris (a snake?) has just complained:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2018-07-12%2007%3A36%3A01
> Yeah, I am confused as to how this commit can lead to that failure.
It didn't; looks like just a random failure to me. The test is timing
out:
ok 26 - catalog xmin of cascaded slot still null with hs_feedback reset
# re-enabling hot_standby_feedback and disabling while stopped
### Reloading node "standby_2"
# Running: pg_ctl -D
/home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata
reload
server signaled
Waiting for replication conn standby_1's replay_lsn to pass '0/30A1720' on master
done
Waiting for replication conn standby_2's replay_lsn to pass '0/30A1720' on standby_1
done
### Stopping node "standby_2" using mode fast
# Running: pg_ctl -D
/home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata
-mfast stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby_2"
#
Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
This looks to me like it's probably a low-probability timing problem in
the test script itself. The relevant part of the script is
$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();
$node_standby_2->safe_psql('postgres',
'ALTER SYSTEM SET hot_standby_feedback = off;');
$node_standby_2->stop;
($xmin, $catalog_xmin) =
get_slot_xmins($node_standby_1, $slotname_2, "xmin IS NOT NULL");
isnt($xmin, '', 'xmin of cascaded slot non-null with postgres shut down');
where the failure is occurring because get_slot_xmins() never sees the
expected state. I'm suspicious that this is happening because the script
is making no attempt to ensure the relative timing of events on the three
servers --- notably, issuing a "reload" does nothing to guarantee that
the server has actually seen and responded to the parameter change.
As an extreme case, I can reproduce the failure exactly with this:
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index a0d3e8f..0824184 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -286,7 +286,7 @@ is($catalog_xmin, '',
note "re-enabling hot_standby_feedback and disabling while stopped";
$node_standby_2->safe_psql('postgres',
'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
+# $node_standby_2->reload;
$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();
so I'm thinking that what we're seeing is an effect of the second standby
sometimes being too slow to turn on hot_standby_feedback.
We could maybe hack around that with a short "sleep" in this script, but
a less cruddy solution would be to change PostgresNode::reload so that
it somehow waits for the server to process the reload signal.
regards, tom lane