Re: 001_rep_changes.pl stalls - Mailing list pgsql-hackers

From Noah Misch
Subject Re: 001_rep_changes.pl stalls
Date
Msg-id 20200414013849.GA886648@rfd.leadboat.com
Whole thread Raw
In response to 001_rep_changes.pl stalls  (Noah Misch <noah@leadboat.com>)
Responses Re: 001_rep_changes.pl stalls  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote:
> Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
> important for promptly updating pg_stat_replication.  When caught up, we
> should impose that logic before every sleep.  The one-line fix is to sleep in
> WalSndLoop() only when pq_is_send_pending(), not when caught up.

This seems to have made the following race condition easier to hit:
https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us

Now it happened eight times in three days, all on BSD machines:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22

While I don't think that indicates anything wrong with the fix for $SUBJECT,
creating more buildfarm noise is itself bad.  I am inclined to revert the fix
after a week.  Not immediately, in case it uncovers lower-probability bugs.
I'd then re-commit it after one of those threads fixes the other bug.  Would
anyone like to argue for a revert earlier, later, or not at all?


There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01

Most-relevant lines of the test script:

    $node_master->safe_psql('postgres',
        "INSERT INTO decoding(blah) VALUES ('afterbb');");
    $node_master->safe_psql('postgres', 'CHECKPOINT');
    $node_master->stop('immediate');

The failure suggested the INSERT was not replicated before the immediate stop.
I can reproduce that consistently, before or after the fix for $SUBJECT, by
modifying walsender to delay 0.2s before sending WAL:

--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2781,2 +2782,5 @@ retry:
 
+    PG_SETMASK(&BlockSig);
+    pg_usleep(200 * 1000);
+    PG_SETMASK(&UnBlockSig);
     pq_putmessage_noblock('d', output_message.data, output_message.len);

I will shortly push a fix adding a wait_for_catchup to the test.  I don't know
if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race
condition easier to hit.



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Next
From: Tom Lane
Date:
Subject: Re: 001_rep_changes.pl stalls