Re: pgsql: Allow using the updated tuple while moving it to a different par - Mailing list pgsql-committers

From Tom Lane
Subject Re: pgsql: Allow using the updated tuple while moving it to a different par
Date
Msg-id 11887.1531432634@sss.pgh.pa.us
Whole thread Raw
In response to Re: pgsql: Allow using the updated tuple while moving it to adifferent par  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-committers
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Thu, Jul 12, 2018 at 1:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
>> Looks unrelated, but flaviventris (a snake?) has just complained:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2018-07-12%2007%3A36%3A01

> Yeah, I am confused as to how this commit can lead to that failure.

It didn't; looks like just a random failure to me.  The test is timing
out:

ok 26 - catalog xmin of cascaded slot still null with hs_feedback reset
# re-enabling hot_standby_feedback and disabling while stopped
### Reloading node "standby_2"
# Running: pg_ctl -D
/home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata
reload
server signaled
Waiting for replication conn standby_1's replay_lsn to pass '0/30A1720' on master
done
Waiting for replication conn standby_2's replay_lsn to pass '0/30A1720' on standby_1
done
### Stopping node "standby_2" using mode fast
# Running: pg_ctl -D
/home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata
-mfast stop 
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby_2"
#
Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.


This looks to me like it's probably a low-probability timing problem in
the test script itself.  The relevant part of the script is


$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();

$node_standby_2->safe_psql('postgres',
    'ALTER SYSTEM SET hot_standby_feedback = off;');
$node_standby_2->stop;

($xmin, $catalog_xmin) =
  get_slot_xmins($node_standby_1, $slotname_2, "xmin IS NOT NULL");
isnt($xmin, '', 'xmin of cascaded slot non-null with postgres shut down');


where the failure is occurring because get_slot_xmins() never sees the
expected state.  I'm suspicious that this is happening because the script
is making no attempt to ensure the relative timing of events on the three
servers --- notably, issuing a "reload" does nothing to guarantee that
the server has actually seen and responded to the parameter change.

As an extreme case, I can reproduce the failure exactly with this:

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index a0d3e8f..0824184 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -286,7 +286,7 @@ is($catalog_xmin, '',
 note "re-enabling hot_standby_feedback and disabling while stopped";
 $node_standby_2->safe_psql('postgres',
    'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
+# $node_standby_2->reload;

 $node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
 replay_check();

so I'm thinking that what we're seeing is an effect of the second standby
sometimes being too slow to turn on hot_standby_feedback.

We could maybe hack around that with a short "sleep" in this script, but
a less cruddy solution would be to change PostgresNode::reload so that
it somehow waits for the server to process the reload signal.

            regards, tom lane


pgsql-committers by date:

Previous
From: Michael Paquier
Date:
Subject: pgsql: Clean up temporary WAL segments after an instance crash
Next
From: Michael Paquier
Date:
Subject: pgsql: Fix argument of pg_create_logical_replication_slot for slotname