[HACKERS] Race-like failure in recovery/t/009_twophase.pl - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | [HACKERS] Race-like failure in recovery/t/009_twophase.pl
Date |
Msg-id | 908.1498965681@sss.pgh.pa.us
Responses | 3 × Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl
List | pgsql-hackers
Since I put in that patch to reduce pg_ctl's reaction time, we've seen several buildfarm failures with identical symptoms:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2012%3A56%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2023%3A42%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-01%2000%3A22%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-07-01%2003%3A30%3A01

to wit that the recovery/009 test blows up with

Bailout called.  Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Fortunately, it reports the relevant log files, and we can see things like this in the "slave" postmaster's log:

2017-07-01 02:31:51.131 UTC [11665568] LOG:  entering standby mode
2017-07-01 02:31:51.198 UTC [11665568] LOG:  consistent recovery state reached at 0/301CB10
2017-07-01 02:31:51.198 UTC [11665568] LOG:  invalid record length at 0/301CB10: wanted 24, got 0
2017-07-01 02:31:51.200 UTC [12779594] LOG:  database system is ready to accept read only connections
2017-07-01 02:31:51.214 UTC [5111896] LOG:  fetching timeline history file for timeline 3 from primary server
2017-07-01 02:31:51.246 UTC [5111896] LOG:  started streaming WAL from primary at 0/3000000 on timeline 2
2017-07-01 02:31:51.248 UTC [5111896] LOG:  replication terminated by primary server
2017-07-01 02:31:51.248 UTC [5111896] DETAIL:  End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.263 UTC [11665568] LOG:  new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.263 UTC [5111896] LOG:  restarted WAL streaming at 0/3000000 on timeline 2
2017-07-01 02:31:51.264 UTC [5111896] LOG:  replication terminated by primary server
2017-07-01 02:31:51.264 UTC [5111896] DETAIL:  End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.264 UTC [11665568] LOG:  new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.579 UTC [12779594] LOG:  received immediate shutdown request
2017-07-01 02:31:51.629 UTC [12779594] LOG:  database system is shut down
2017-07-01 02:31:51.790 UTC [7471446] LOG:  listening on Unix socket "/tmp/ELoHNJ0v5u/.s.PGSQL.49810"
2017-07-01 02:31:51.826 UTC [9502748] LOG:  database system was interrupted while in recovery at log time 2017-07-01 02:31:49 UTC
2017-07-01 02:31:51.826 UTC [9502748] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2017-07-01 02:31:51.826 UTC [9502748] LOG:  entering standby mode
2017-07-01 02:31:51.827 UTC [9502748] FATAL:  requested timeline 3 is not a child of this server's history
2017-07-01 02:31:51.827 UTC [9502748] DETAIL:  Latest checkpoint is at 0/301CAA0 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/301C168.
2017-07-01 02:31:51.828 UTC [7471446] LOG:  startup process (PID 9502748) exited with exit code 1
2017-07-01 02:31:51.829 UTC [7471446] LOG:  aborting startup due to startup process failure

What seems to have happened here is that the slave has WAL up to 0/301CB10, but the master's WAL ends at 0/301C168, so we've got trouble.  How did the slave get ahead of the master?

The reporting critters are all on the slow side, so I suspected a timing problem, especially since it only started to show up after changing pg_ctl's timing behavior.
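(For illustration only: a minimal sketch of how one might ask both servers where they are from within a TAP script, assuming the v10 function names and the stock PostgresNode helpers.  This is not code from 009_twophase.pl, just a way to see the divergence directly.)

	# Hypothetical diagnostic, not part of the test: compare the master's
	# current WAL write position with what the standby has replayed.
	my $master_lsn = $node_master->safe_psql('postgres',
		'SELECT pg_current_wal_lsn()');
	my $standby_lsn = $node_slave->safe_psql('postgres',
		'SELECT pg_last_wal_replay_lsn()');
	note "master at $master_lsn, standby replayed to $standby_lsn";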
I can't reproduce it locally on unmodified sources, but I could after putting my thumb on the scales like this:

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ea9d21a..7ef22c1 100644
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
*************** WalReceiverMain(void)
*** 446,451 ****
--- 446,454 ----
  					 * Something was received from master, so reset
  					 * timeout
  					 */
+
+ 					pg_usleep(100000);		/* be slow about processing */
+
  					last_recv_timestamp = GetCurrentTimestamp();
  					ping_sent = false;
  					XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

(If this doesn't do it for you alone, you might try also increasing WAITS_PER_SEC in pg_ctl.c, so as to make pg_ctl faster yet.)

I'm not entirely sure what's happening here, although the gist of it has to be that we're killing off a slave server before it's quite absorbed all the WAL the master has sent it, and then when we exchange the roles of the master and slave, all of a sudden it's the slave that's ahead, not the master.

Part of the reason I'm confused is that the programming technique being used in 009_twophase.pl, namely doing

	($node_master, $node_slave) = ($node_slave, $node_master);

and then working with the reversed variable names, is ENTIRELY TOO CUTE FOR ITS OWN GOOD.  The objects continue to have their original "name" fields and continue to refer to the same datadirs and log files internally, which means that when you're trying to trace through the log files to understand what's happening, you have to remember how many of these swaps have been done so far in order to understand whether script operations on "$node_master" are actually affecting the "master" or the "slave" installation, and whether when the log says that it started the "slave" server, that corresponds to a script action on "$node_slave" or "$node_master".  This might not be the worst, most confusing code I've ever seen committed into Postgres, but it's right up there, and you can bet that I'm going to be holding somebody's feet to the fire for a rewrite.

Anyway, having vented about that ... it's not very clear to me whether the test script is at fault for not being careful to let the slave catch up to the master before we promote it (and then deem the master to be usable as a slave without rebuilding it first), or whether we actually imagine that should work, in which case there's a replication logic bug here someplace.  The failure messages are also less clear than they could be, perhaps.

			regards, tom lane
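(For concreteness, a minimal sketch of what "let the slave catch up before promoting" could look like with the PostgresNode helpers.  The function names assume a v10 server; this is an illustration of the idea, not the actual test code or any committed fix.)

	# Hypothetical catch-up wait: remember the master's current
	# write position ...
	my $until_lsn = $node_master->safe_psql('postgres',
		'SELECT pg_current_wal_lsn()');
	# ... and poll until the standby has replayed at least that far,
	# before swapping roles and promoting it.
	my $caught_up =
		"SELECT pg_last_wal_replay_lsn() >= '$until_lsn'::pg_lsn";
	$node_slave->poll_query_until('postgres', $caught_up)
		or die "standby never caught up to $until_lsn";
	$node_slave->promote;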