I wrote:
> More to the point, aren't these proposals just band-aids that
> would stabilize the test without fixing the actual problem?
> The same thing is likely to happen to people in the field,
> unless we do something drastic like removing ALTER SUBSCRIPTION.
I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.
While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed. (That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
But "replication falls over if the walsender is slow" isn't
something I'd call acceptable.
regards, tom lane
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 77c8baa32a..c897ef4c60 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2699,6 +2699,8 @@ WalSndLoop(WalSndSendDataCallback send_data)
!pq_is_send_pending())
break;
+ pg_usleep(10000);
+
/*
* If we don't have any pending data in the output buffer, try to send
* some more. If there is some, we don't bother to call send_data