Re: 001_rep_changes.pl stalls - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: 001_rep_changes.pl stalls |
Date | |
Msg-id | 20200414013849.GA886648@rfd.leadboat.com Whole thread Raw |
In response to | 001_rep_changes.pl stalls (Noah Misch <noah@leadboat.com>) |
Responses |
Re: 001_rep_changes.pl stalls
|
List | pgsql-hackers |
On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote: > Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is > important for promptly updating pg_stat_replication. When caught up, we > should impose that logic before every sleep. The one-line fix is to sleep in > WalSndLoop() only when pq_is_send_pending(), not when caught up. This seems to have made the following race condition easier to hit: https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us Now it happened eight times in three days, all on BSD machines: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22 While I don't think that indicates anything wrong with the fix for $SUBJECT, creating more buildfarm noise is itself bad. I am inclined to revert the fix after a week. Not immediately, in case it uncovers lower-probability bugs. I'd then re-commit it after one of those threads fixes the other bug. Would anyone like to argue for a revert earlier, later, or not at all? There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01 Most-relevant lines of the test script: $node_master->safe_psql('postgres', "INSERT INTO decoding(blah) VALUES ('afterbb');"); $node_master->safe_psql('postgres', 'CHECKPOINT'); $node_master->stop('immediate'); The failure suggested the INSERT was not replicated before the immediate stop. I can reproduce that consistently, before or after the fix for $SUBJECT, by modifying walsender to delay 0.2s before sending WAL: --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -65,2 +65,3 @@ #include "libpq/pqformat.h" +#include "libpq/pqsignal.h" #include "miscadmin.h" @@ -2781,2 +2782,5 @@ retry: + PG_SETMASK(&BlockSig); + pg_usleep(200 * 1000); + PG_SETMASK(&UnBlockSig); pq_putmessage_noblock('d', output_message.data, output_message.len); I will shortly push a fix adding a wait_for_catchup to the test. I don't know if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race condition easier to hit.
pgsql-hackers by date: