Hello,
At Materialize we observed a strange behavior of walsender during logical replication: it sends a flood of keepalive messages to its subscriber while the database that the walsender runs on is itself syncing a large table from another, unrelated database.
The setup that reproduces the problem requires three database instances: A, B, and C. Performing the following steps causes B to send a flood of keepalive messages to C:
First, in db A, we create a large table:
CREATE TABLE large (a int);
INSERT INTO large (SELECT generate_series(1, 100000000));
CREATE PUBLICATION large_pub FOR TABLE large;
Then, in db B, we create a tiny table:
CREATE TABLE tiny (a int);
INSERT INTO tiny VALUES (1);
CREATE PUBLICATION tiny_pub FOR TABLE tiny;
Then, in db C, we subscribe to tiny_pub:
CREATE TABLE tiny(a int);
CREATE SUBSCRIPTION tiny_sub CONNECTION 'host=B' PUBLICATION tiny_pub;
At this point db C receives keepalive messages only occasionally, at a rate governed by the wal_sender_timeout parameter.
Finally, in db B, we subscribe to large_pub:
CREATE TABLE large(a int);
CREATE SUBSCRIPTION large_sub CONNECTION 'host=A' PUBLICATION large_pub;
This triggers a flood of keepalive messages from B to C, even though C doesn't need to learn anything about the large table, nor does it seem to act on the information carried by the keepalive messages.
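For reference, here is a minimal Python sketch of what each keepalive carries, assuming the primary keepalive layout described in the streaming replication protocol documentation: a 'k' byte, an 8-byte WAL end position, an 8-byte timestamp in microseconds since 2000-01-01, and a one-byte reply-requested flag. The function name parse_keepalive is only illustrative and is not taken from PostgreSQL or Materialize code.

```python
import struct
from datetime import datetime, timedelta, timezone

# PostgreSQL timestamps count microseconds since 2000-01-01 UTC.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def parse_keepalive(payload: bytes) -> dict:
    """Decode one primary keepalive message (the CopyData payload).

    Assumed layout: Byte1('k'), Int64 WAL end, Int64 send time, Byte1 flag,
    all in network byte order (18 bytes in total).
    """
    tag, wal_end, clock_us, reply = struct.unpack(">cQqB", payload)
    if tag != b"k":
        raise ValueError("not a primary keepalive message")
    return {
        # Render the LSN in the usual hi/lo hexadecimal form.
        "wal_end": f"{wal_end >> 32:X}/{wal_end & 0xFFFFFFFF:X}",
        "sent_at": PG_EPOCH + timedelta(microseconds=clock_us),
        "reply_requested": bool(reply),
    }
```

A subscriber that only tracks the WAL end position learns nothing new from a keepalive whose wal_end is unchanged, which is why the flood looks redundant from C's side.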
I used the patch included in this message to log every keepalive received so that I could count them, and observed a rate of 20 keepalives per second sustained for multiple minutes.
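The counting itself can be sketched roughly as follows in Python, assuming the arrival time of each logged keepalive has already been parsed into a datetime. The helper names per_second_counts and mean_rate are hypothetical, and the log format produced by the patch is not assumed here.

```python
from collections import Counter
from datetime import datetime
from typing import Sequence


def per_second_counts(stamps: Sequence[datetime]) -> Counter:
    """Bucket keepalive arrival times by whole second."""
    return Counter(ts.replace(microsecond=0) for ts in stamps)


def mean_rate(stamps: Sequence[datetime]) -> float:
    """Average keepalives per second between the first and last arrival."""
    if len(stamps) < 2:
        return 0.0
    span = (max(stamps) - min(stamps)).total_seconds()
    # n arrivals delimit n - 1 intervals over the observed span.
    return (len(stamps) - 1) / span if span > 0 else float("inf")
```

The per-second buckets make it easy to see whether the rate is steady or bursty, while mean_rate gives a single number to compare against the baseline before large_sub is created.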
We identified the following code as the one sending these keepalives:
The comment on this if statement seems to imply that these keepalives are relevant to synchronous replication and shutdown, but neither of those is actually happening in the reproduction. There is another section of walsender with a similar-looking comment which does have an explicit check for synchronous replication:
Is it expected that this happens? Does the identified if statement also need synchronous replication guards?
For full context, the real system this was observed on was a Materialize instance, which supports importing tables from PostgreSQL databases using logical replication. In our implementation, receiving a keepalive triggers a non-trivial amount of work, so the flood of keepalives caught us by surprise and caused high CPU usage.
Best,
Petros