Thread: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

[HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
Since I put in that patch to reduce pg_ctl's reaction time, we've
seen several buildfarm failures with identical symptoms:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2012%3A56%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2023%3A42%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-01%2000%3A22%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-07-01%2003%3A30%3A01

to wit that the recovery/009 test blows up with

Bailout called.  Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Fortunately, it reports the relevant log files, and we can see things
like this in the "slave" postmaster's log:

2017-07-01 02:31:51.131 UTC [11665568] LOG:  entering standby mode
2017-07-01 02:31:51.198 UTC [11665568] LOG:  consistent recovery state reached at 0/301CB10
2017-07-01 02:31:51.198 UTC [11665568] LOG:  invalid record length at 0/301CB10: wanted 24, got 0
2017-07-01 02:31:51.200 UTC [12779594] LOG:  database system is ready to accept read only connections
2017-07-01 02:31:51.214 UTC [5111896] LOG:  fetching timeline history file for timeline 3 from primary server
2017-07-01 02:31:51.246 UTC [5111896] LOG:  started streaming WAL from primary at 0/3000000 on timeline 2
2017-07-01 02:31:51.248 UTC [5111896] LOG:  replication terminated by primary server
2017-07-01 02:31:51.248 UTC [5111896] DETAIL:  End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.263 UTC [11665568] LOG:  new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.263 UTC [5111896] LOG:  restarted WAL streaming at 0/3000000 on timeline 2
2017-07-01 02:31:51.264 UTC [5111896] LOG:  replication terminated by primary server
2017-07-01 02:31:51.264 UTC [5111896] DETAIL:  End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.264 UTC [11665568] LOG:  new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.579 UTC [12779594] LOG:  received immediate shutdown request
2017-07-01 02:31:51.629 UTC [12779594] LOG:  database system is shut down
2017-07-01 02:31:51.790 UTC [7471446] LOG:  listening on Unix socket "/tmp/ELoHNJ0v5u/.s.PGSQL.49810"
2017-07-01 02:31:51.826 UTC [9502748] LOG:  database system was interrupted while in recovery at log time 2017-07-01 02:31:49 UTC
2017-07-01 02:31:51.826 UTC [9502748] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2017-07-01 02:31:51.826 UTC [9502748] LOG:  entering standby mode
2017-07-01 02:31:51.827 UTC [9502748] FATAL:  requested timeline 3 is not a child of this server's history
2017-07-01 02:31:51.827 UTC [9502748] DETAIL:  Latest checkpoint is at 0/301CAA0 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/301C168.
2017-07-01 02:31:51.828 UTC [7471446] LOG:  startup process (PID 9502748) exited with exit code 1
2017-07-01 02:31:51.829 UTC [7471446] LOG:  aborting startup due to startup process failure

What seems to have happened here is that the slave has WAL up to
0/301CB10, but the master's WAL ends at 0/301C168, so we've got
trouble.  How did the slave get ahead of the master?

The reporting critters are all on the slow side, so I suspected
a timing problem, especially since it only started to show up
after changing pg_ctl's timing behavior.  I can't reproduce it
locally on unmodified sources, but I could after putting my thumb
on the scales like this:

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ea9d21a..7ef22c1 100644
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
*************** WalReceiverMain(void)
*** 446,451 ****
--- 446,454 ----
                               * Something was received from master, so reset
                               * timeout
                               */
+
+                             pg_usleep(100000); /* be slow about processing */
+
                              last_recv_timestamp = GetCurrentTimestamp();
                              ping_sent = false;
                              XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

(If this doesn't do it for you alone, you might try also increasing
WAITS_PER_SEC in pg_ctl.c, so as to make pg_ctl faster yet.)

I'm not entirely sure what's happening here, although the gist of it
has to be that we're killing off a slave server before it's quite
absorbed all the WAL the master has sent it, and then when we exchange
the roles of the master and slave, all of a sudden it's the slave that's
ahead, not the master.

Part of the reason I'm confused is that the programming technique
being used in 009_twophase.pl, namely doing
($node_master, $node_slave) = ($node_slave, $node_master);

and then working with the reversed variable names, is ENTIRELY TOO CUTE
FOR ITS OWN GOOD.  The objects continue to have their original "name"
fields and continue to refer to the same datadirs and log files
internally, which means that when you're trying to trace through the
log files to understand what's happening, you have to remember how many
of these swaps have been done so far in order to understand whether
script operations on "$node_master" are actually affecting the "master"
or the "slave" installation, and whether when the log says that it started
the "slave" server, that corresponds to a script action on "$node_slave"
or "$node_master".  This might not be the worst, most confusing code I've
ever seen committed into Postgres, but it's right up there, and you can
bet that I'm going to be holding somebody's feet to the fire for a
rewrite.

Anyway, having vented about that ... it's not very clear to me whether the
test script is at fault for not being careful to let the slave catch up to
the master before we promote it (and then deem the master to be usable as
a slave without rebuilding it first), or whether we actually imagine that
should work, in which case there's a replication logic bug here someplace.
The failure messages are also less clear than they could be, perhaps.
        regards, tom lane



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Alvaro Herrera
Tom Lane wrote:

> Part of the reason I'm confused is that the programming technique
> being used in 009_twophase.pl, namely doing
> 
>     ($node_master, $node_slave) = ($node_slave, $node_master);
> 
> and then working with the reversed variable names, is ENTIRELY TOO CUTE
> FOR ITS OWN GOOD.

This is my fault.  I noticed this in the submitted test (and was pretty
confused about it too) when I reviewed it for commit, but somehow it
didn't reach my threshold to require a rewrite.  I'll fix it sometime
during the week.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Part of the reason I'm confused is that the programming technique
>> being used in 009_twophase.pl, namely doing
>> ($node_master, $node_slave) = ($node_slave, $node_master);
>> and then working with the reversed variable names, is ENTIRELY TOO CUTE
>> FOR ITS OWN GOOD.

> This is my fault.  I noticed this in the submitted test (and was pretty
> confused about it too) when I reviewed it for commit, but somehow it
> didn't reach my threshold to require a rewrite.  I'll fix it sometime
> during the week.

I'd kind of like to fix it now, so I can reason in a less confused way
about the actual problem.  Last night I didn't have a clear idea of how
to make it better, but what I'm thinking this morning is:

* Naming the underlying server objects "master" and "slave" is just
wrong, because the script makes them swap those roles repeatedly.
Better to choose neutral names, like "alice" and "bob" or "london"
and "paris".

* We could simply make the test script refer directly to the appropriate
server at each step, ie s/node_master/node_london/ in relevant parts of
the script and s/node_slave/node_london/ elsewhere.  Maybe that's the
best way, but there is some value in identifying commands as to whether
we're issuing them to the current master or the current slave.  Plan B
might be to do
    ($cur_master, $cur_slave) = ($node_paris, $node_london);
at the swap points and use those names where it seems clearer; a rough
sketch of that follows below.

* Some effort should be put into emitting text to the log showing
what's going on, eg print("Now london is master."); as appropriate.

* Another reason why I had the feeling of being lost in a maze of
twisty little passages all alike was the relentless sameness of the
commands being sent to the servers.  There is no good reason for the
prepared transactions to be all alike; they should be all different
so that you can match up postmaster log entries to points in the
script.
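
For concreteness, a rough sketch of what Plan B could look like (variable
names and messages here are illustrative only, not necessarily what the
rewritten script would end up using):

    # At each promotion/failback point, rebind the role aliases to the
    # fixed per-installation node objects and say so in the output.
    ($cur_master, $cur_slave) = ($node_paris, $node_london);
    print "# now paris is master, london is slave\n";

    # Steps that care about the current role use the aliases ...
    $cur_master->psql('postgres', "CHECKPOINT");
    # ... while steps tied to one installation keep its fixed name.
    $node_london->start;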

If anyone has other ideas please speak up.

Barring objection I'll go do something about this shortly.
        regards, tom lane



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Alvaro Herrera
Tom Lane wrote:

> I'd kind of like to fix it now, so I can reason in a less confused way
> about the actual problem.

OK, no objections here.

> Last night I didn't have a clear idea of how
> to make it better, but what I'm thinking this morning is:
> 
> * Naming the underlying server objects "master" and "slave" is just
> wrong, because the script makes them swap those roles repeatedly.
> Better to choose neutral names, like "alice" and "bob" or "london"
> and "paris".

Check.

> * We could simply make the test script refer directly to the appropriate
> server at each step, ie s/node_master/node_london/ in relevant parts of
> the script and s/node_slave/node_london/ elsewhere.  Maybe that's the
> best way, but there is some value in identifying commands as to whether
> we're issuing them to the current master or the current slave.  Plan B
> might be to do
>     ($cur_master, $cur_slave) = ($node_paris, $node_london);
> at the swap points and use those names where it seems clearer.

Hmm, yeah, using role-based aliases seems like a good idea too.

> * Some effort should be put into emitting text to the log showing
> what's going on, eg print("Now london is master."); as appropriate.

Check.  Not "print" though; I think using note(" .. ") (from Test::More)
is more appropriate.

> * Another reason why I had the feeling of being lost in a maze of
> twisty little passages all alike was the relentless sameness of the
> commands being sent to the servers.  There is no good reason for the
> prepared transactions to be all alike; they should be all different
> so that you can match up postmaster log entries to points in the
> script.

Check.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> * Some effort should be put into emitting text to the log showing
>> what's going on, eg print("Now london is master."); as appropriate.

> Check.  Not "print" though; I think using note(" .. ") (from Test::More)
> is more appropriate.

Will do, thanks for the suggestion.
        regards, tom lane



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
I wrote:
> The reporting critters are all on the slow side, so I suspected
> a timing problem, especially since it only started to show up
> after changing pg_ctl's timing behavior.  I can't reproduce it
> locally on unmodified sources, but I could after putting my thumb
> on the scales like this:
>
> diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
> index ea9d21a..7ef22c1 100644
> *** a/src/backend/replication/walreceiver.c
> --- b/src/backend/replication/walreceiver.c
> *************** WalReceiverMain(void)
> *** 446,451 ****
> --- 446,454 ----
>                                * Something was received from master, so reset
>                                * timeout
>                                */
> +
> +                             pg_usleep(100000); /* be slow about processing */
> +
>                               last_recv_timestamp = GetCurrentTimestamp();
>                               ping_sent = false;
>                               XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);


BTW, I neglected to mention that this hack also produces a second failure
in the recovery test suite:

[20:28:31] t/001_stream_rep.pl .................. 11/28
#   Failed test 'xmin of cascaded slot null with no hs_feedback'
#   at t/001_stream_rep.pl line 178.
#          got: undef
#     expected: ''

#   Failed test 'catalog xmin of cascaded slot null with no hs_feedback'
#   at t/001_stream_rep.pl line 179.
#          got: undef
#     expected: ''
[20:28:31] t/001_stream_rep.pl .................. 28/28 # Looks like you failed 2 tests of 28.
[20:28:31] t/001_stream_rep.pl .................. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/28 subtests

I had supposed last night that this was merely a matter of needing
to insert a wait_slot_xmins() call before those tests, but doing so
doesn't help.  And, looking closer, the report is complaining that
$xmin and $catalog_xmin are coming back as undef, not empty strings.
My Perl-fu is too weak to figure out how that could possibly be ---
the documentation for PostgresNode::query_hash certainly states
flatly that it will never happen.  And it even more certainly
shouldn't happen if we already successfully probed the
pg_replication_slots row in wait_slot_xmins().

The corresponding entries in the postmaster's log don't look abnormal:

2017-07-02 20:07:29.313 UTC [30862] LOG:  database system is ready to accept read only connections
2017-07-02 20:07:29.318 UTC [30867] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-07-02 20:07:29.415 UTC [30870] t/001_stream_rep.pl LOG:  statement: SELECT pg_create_physical_replication_slot('standby_2');
2017-07-02 20:07:29.619 UTC [30863] LOG:  redo starts at 0/3029458
2017-07-02 20:07:29.665 UTC [30882] t/001_stream_rep.pl LOG:  statement: SELECT xmin IS NULL AND catalog_xmin IS NULL FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2';
2017-07-02 20:07:29.679 UTC [30884] t/001_stream_rep.pl LOG:  statement: SELECT plugin, slot_type, datoid, database, active, active_pid, xmin, catalog_xmin, restart_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2'

(This is from a run with a wait_slot_xmins() call added, else the
statement logged by PID 30882 would not have been issued.)

Any ideas what's wrong there?
        regards, tom lane



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
I wrote:
> Any ideas what's wrong there?

Hah: the answer is that query_hash's split() call is broken.
"man perlfunc" quoth
      split   Splits the string EXPR into a list of strings and returns that
              list.  By default, empty leading fields are preserved, and
              empty trailing ones are deleted.
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the case at hand, the SQL query returns something like
"|physical|||t|10338|||0/302B7E8" with normal timing, but with
enough delay in there, you get "|physical|||t|11542|||" which
triggers split's default behavior of ignoring the trailing empty
fields.  It looks like the way to get split to not do that is
to pass it a "limit" of -1.
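
To illustrate with a standalone snippet (just a demonstration of the
behavior, not test code):

    use strict;
    use warnings;

    # A row shaped like the ones query_hash parses, with empty trailing fields.
    my $row = "|physical|||t|11542|||";

    my @default = split /\|/, $row;        # trailing empty fields are dropped
    my @keep    = split /\|/, $row, -1;    # a limit of -1 preserves them

    printf "default: %d fields, limit -1: %d fields\n",
        scalar(@default), scalar(@keep);   # prints "default: 6 fields, limit -1: 9 fields"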
        regards, tom lane



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
I wrote:
> Anyway, having vented about that ... it's not very clear to me whether the
> test script is at fault for not being careful to let the slave catch up to
> the master before we promote it (and then deem the master to be usable as
> a slave without rebuilding it first), or whether we actually imagine that
> should work, in which case there's a replication logic bug here someplace.

OK, now that I can make some sense of what's going on in the 009 test
script ... it seems like that test script is presuming synchronous
replication behavior, but it's only actually set up sync rep in one
direction, namely the london->paris direction.  The failure occurs
when we lose data in the paris->london direction.  Specifically,
with the delay hack in place, I find this in the log before things
go south completely:

# Now london is master and paris is slave
ok 11 - Restore prepared transactions from files with master down
### Enabling streaming replication for node "paris"
### Starting node "paris"
# Running: pg_ctl -D /home/postgres/pgsql/src/test/recovery/tmp_check/data_paris_xSFF/pgdata -l /home/postgres/pgsql/src/test/recovery/tmp_check/log/009_twophase_paris.log start
waiting for server to start.... done
server started
# Postmaster PID for node "paris" is 30930
psql:<stdin>:1: ERROR:  prepared transaction with identifier "xact_009_11" does not exist

That ERROR is being reported by the london node, at line 267 of the
current script:
    $cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_11'");
So london is missing a prepared transaction that was created while
paris was master, a few lines earlier.  (It's not real good that
the test script isn't bothering to check the results of any of
these queries, although the end-state test I just added should close
the loop on that.)  london has no idea that it's missing data, but
when we restart the paris node a little later, it notices that its
WAL is past where london is.
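
One way to close that loop at the point of origin would be something like
this (a sketch only, using the role-alias naming proposed upthread, and
distinct from the end-state check mentioned above):

    # Check that the command actually succeeded: PostgresNode::psql returns
    # psql's exit status, so 0 means success.  (safe_psql, which dies on
    # error, would be another option.)
    is( $cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_11'"),
        0, "COMMIT PREPARED 'xact_009_11' succeeds on current master");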

I'm now inclined to think that the correct fix is to ensure that we
run synchronous rep in both directions, rather than to insert delays
to substitute for that.  Just setting synchronous_standby_names for
node paris at the top of the script doesn't work, because there's
at least one place where the script intentionally issues commands
to paris while london is stopped.  But we could turn off sync rep
for that step, perhaps.

Anyone have a different view of what to fix here?
        regards, tom lane



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Michael Paquier
(catching up test threads)

On Mon, Jul 3, 2017 at 7:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm now inclined to think that the correct fix is to ensure that we
> run synchronous rep in both directions, rather than to insert delays
> to substitute for that.  Just setting synchronous_standby_names for
> node paris at the top of the script doesn't work, because there's
> at least one place where the script intentionally issues commands
> to paris while london is stopped.

I bet that using syncrep in both directions will likely avoid
inconsistencies in the future if the test suite is extended one way or
another.

> But we could turn off sync rep for that step, perhaps.

Yeah, by using synchronous_commit = off.
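
Roughly like this for the steps that run against the node whose partner
is stopped (table and transaction names below are just placeholders, not
necessarily what the attached patch uses):

    # Sketch only: turn off synchronous commit for this session so the
    # transaction doesn't wait forever for a sync standby that is down.
    $cur_master->psql(
        'postgres', "
        SET synchronous_commit = off;
        BEGIN;
        INSERT INTO t_example VALUES (42);
        PREPARE TRANSACTION 'xact_009_example';");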

> Anyone have a different view of what to fix here?

No, this sounds like a good plan. What do you think about the attached?
-- 
Michael



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Jul 3, 2017 at 7:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyone have a different view of what to fix here?

> No, this sounds like a good plan. What do you think about the attached?

Oh, that's a good way.  I just finished testing a fix that involved
not turning on the second server's sync commit until later (it seems
that only the first action on "paris" is really at risk currently).
But disabling sync commit for individual transactions is clearly cleaner
and more extensible to future test script changes.

FWIW, I just got done doing a few check-world cycles with the delay in
WalReceiverMain plus speeding up pg_ctl.c to WAITS_PER_SEC = 1000.
No other problems seem to be revealed this way.
        regards, tom lane



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Craig Ringer
On 3 July 2017 at 05:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Any ideas what's wrong there?
>
> Hah: the answer is that query_hash's split() call is broken.
> "man perlfunc" quoth
>
>        split   Splits the string EXPR into a list of strings and returns that
>                list.  By default, empty leading fields are preserved, and
>                empty trailing ones are deleted.
>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> In the case at hand, the SQL query returns something like
> "|physical|||t|10338|||0/302B7E8" with normal timing, but with
> enough delay in there, you get "|physical|||t|11542|||" which
> triggers split's default behavior of ignoring the trailing empty
> fields.  It looks like the way to get split to not do that is
> to pass it a "limit" of -1.

That's my bad.

(Insert dark muttering about Perl here).

Thanks for spotting it.


--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane
Craig Ringer <craig@2ndquadrant.com> writes:
> That's my bad.
> (Insert dark muttering about Perl here).

Yeah, pretty much the only good thing about Perl is it's ubiquitous.
But you could say the same of C.  Or SQL.  For a profession that's
under 70 years old, we sure spend a lot of time dealing with legacy
stuff.
        regards, tom lane