I have a publication/subscription (logical replication). Ten days ago
replication simply stopped, with no interesting log message on either side. A
SELECT from pg_stat_replication returned nothing; an hour or two later,
pg_stat_replication showed a record again, as it should. So, was it
replicating? No: not a single new record arrived on the replica, and after
some more time pg_stat_replication was empty again. The only log entries were
on the primary. Initially wal_sender_timeout was 15 minutes; I tried raising
it to 1 hour in case a huge transaction was the cause, but the problem
remained the same.
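
For context, the checks and the timeout change were along these lines (a
sketch; the catalog names are standard, the values are from my setup):

  -- on the primary: is there an active walsender for the subscription?
  SELECT pid, application_name, state, sent_lsn, replay_lsn
  FROM pg_stat_replication;

  -- raise the sender timeout; a reload is enough, no restart needed
  ALTER SYSTEM SET wal_sender_timeout = '1h';
  SELECT pg_reload_conf();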
When I set the timeout to 0 on both sender and receiver, I could see in
pg_stat_activity on the replica a backend with DataFileRead in the wait_event
field and IO in the wait_event_type field, and it stayed that way for hours.
The documentation says DataFileRead means "Waiting for a read from a relation
data file", so it does not say which file is being waited on.
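
For reference, that stuck wait state is visible with a simple query against
pg_stat_activity (standard columns only):

  -- on the replica: which backends are stuck reading a relation data file?
  SELECT pid, backend_type, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
  WHERE wait_event = 'DataFileRead';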
So there was a problem with some table on my replica, but which file was it?
It would take a lot of time, but I decided to run VACUUM FULL. It took one or
two days, but it did find two corrupted tables. I couldn't SELECT from or
dump those tables, so I just dropped them, and then replication finally came
back.
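
In hindsight, running VACUUM FULL table by table would have named the broken
table directly in the error message; a sketch using psql's \gexec (not
exactly what I ran, just the idea):

  -- generate one VACUUM FULL per user table and execute each result row
  SELECT format('VACUUM FULL VERBOSE %I.%I', schemaname, relname)
  FROM pg_stat_user_tables
  ORDER BY 1 \gexec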
This long explanation was only to show that, at least for me, it would be
desirable to have additional information when Postgres is waiting on a file.
What if DataFileRead reported the relfilenode it is waiting for?
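
Today that mapping is only possible if you catch the file name some other way
(e.g. lsof or strace on the stuck backend) and translate it back; a sketch
with the built-in functions (16384 is a placeholder filenode, 'some_table' a
placeholder name):

  -- which relation owns base file 16384? (tablespace 0 = the database default)
  SELECT pg_filenode_relation(0, 16384);

  -- the reverse: where does a given table live on disk?
  SELECT pg_relation_filepath('some_table');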
The logs below are from the primary; on the replica I didn't find any.
Strangely, those errors did not occur on every wal_sender_timeout interval:
even when it was configured at 15 minutes, they occurred only two or three
times a day.
log_time;user_name;session_start_time;error_severity;message
2021-04-06 16:21:23.385;replicate;2021-04-06 16:21:23.000;LOG;starting logical decoding for slot sub_google_bkp
2021-04-06 16:21:23.386;replicate;2021-04-06 16:21:23.000;LOG;logical decoding found consistent point at 5C1/5C7A3D48
2021-04-06 16:36:24.175;replicate;2021-04-06 16:21:23.000;LOG;terminating walsender process due to replication timeout
2021-04-06 17:47:15.744;replicate;2021-04-06 17:47:15.000;LOG;starting logical decoding for slot sub_google_bkp
2021-04-06 17:47:15.745;replicate;2021-04-06 17:47:15.000;LOG;logical decoding found consistent point at 5C1/5CBFBF38
2021-04-06 18:02:15.757;replicate;2021-04-06 17:47:15.000;LOG;terminating walsender process due to replication timeout
2021-04-07 11:59:27.831;replicate;2021-04-07 11:59:27.000;LOG;starting logical decoding for slot sub_google_bkp
2021-04-07 11:59:27.832;replicate;2021-04-07 11:59:27.000;LOG;logical decoding found consistent point at 5C1/5CBFBF38
2021-04-07 12:14:27.867;replicate;2021-04-07 11:59:27.000;LOG;terminating walsender process due to replication timeout
2021-04-07 21:45:22.230;replicate;2021-04-07 21:45:22.000;LOG;starting logical decoding for slot sub_google_bkp
2021-04-07 21:45:22.231;replicate;2021-04-07 21:45:22.000;LOG;logical decoding found consistent point at 5C1/5CBFD438
2021-04-07 22:45:22.586;replicate;2021-04-07 21:45:22.000;LOG;terminating walsender process due to replication timeout
2021-04-08 00:06:26.253;replicate;2021-04-08 00:06:25.000;LOG;starting logical decoding for slot sub_google_bkp
2021-04-08 00:06:26.255;replicate;2021-04-08 00:06:25.000;LOG;logical decoding found consistent point at 5C1/5CCF7E20
2021-04-08 02:15:10.342;replicate;2021-04-08 00:06:25.000;LOG;terminating walsender process due to replication timeout
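
Note how little the consistent point moves between restarts (5C1/5CBFBF38
shows up twice, hours apart), which matches the slot being stuck. The slot
position can be watched on the primary; a sketch:

  SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
  WHERE slot_name = 'sub_google_bkp';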