BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached
Date
Msg-id 17327-89d0efa8b9ae6271@postgresql.org
Responses Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17327
Logged by:          Alex E
Email address:      alex@altmetric.com
PostgreSQL version: 13.5
Operating system:   Ubuntu 18.04
Description:

We recently ran into a situation where our pg_basebackup-based backups
started failing unexpectedly. The backups use WAL streaming to keep up with
changes, which uses a temporary replication slot on the server side.

The only errors logged on the client side were as listed below:

pg_basebackup: error: could not receive data from WAL stream: SSL connection
has been closed unexpectedly
pg_basebackup: error: could not read COPY data: server closed the connection
unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
pg_basebackup: removing contents of data directory "/backups/some/path/"

whilst on the server side we only got:

2021-12-03 16:21:54 UTC [29724-2647] LOG:  terminating process 42601 to
release replication slot "pg_basebackup_42601"
2021-12-03 16:21:54 UTC [42601-1] replicator@[unknown] FATAL:  terminating
connection due to administrator command
2021-12-03 16:21:54 UTC [42601-2] replicator@[unknown] STATEMENT:
START_REPLICATION SLOT "pg_basebackup_42601" 4721F/45000000 TIMELINE 3

These messages were unhelpful: they led us to believe we were dealing with
either a network interruption or some other mysterious hardware error.

We then tried several things to determine the root cause and eventually
realized (by trial and error and by monitoring various statistics) that the
temporary replication slot was breaching our max_slot_wal_keep_size limit
while pg_basebackup was running. We only realized this after switching the
backup to a permanent physical replication slot instead of a temporary one:
with the permanent slot, a relevant error about max_slot_wal_keep_size being
breached was issued.
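
As a rough illustration of the kind of check that can reveal this condition
from monitoring data, the sketch below converts two LSNs to byte positions
and compares the distance between them with the configured limit. It
assumes the two LSNs are obtained separately (for example from
pg_current_wal_lsn() and the restart_lsn column of pg_replication_slots).
The function names and the sample "current" LSN are hypothetical, and the
server itself works in whole WAL segments, so its decision may differ from
this byte-level approximation.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '4721F/45000000' to an absolute byte position."""
    hi, lo = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(hi, 16) << 32) | int(lo, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Approximate amount of WAL a slot forces the server to keep, in bytes."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


def slot_exceeds_limit(current_lsn: str, restart_lsn: str, max_keep_mb: int) -> bool:
    """Return True if the slot retains more WAL than max_slot_wal_keep_size.

    max_keep_mb is the setting expressed in megabytes; -1 means no limit.
    """
    if max_keep_mb < 0:
        return False
    return retained_wal_bytes(current_lsn, restart_lsn) > max_keep_mb * 1024 * 1024


if __name__ == "__main__":
    # Start position taken from the server log above; the current LSN is made up.
    restart = "4721F/45000000"
    current = "4721F/46000000"
    print(retained_wal_bytes(current, restart))
    print(slot_exceeds_limit(current, restart, max_keep_mb=8))
```

Running a check like this periodically during a backup would show the
retained WAL approaching the limit before the slot's connection is
terminated.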

The core issue, in our opinion, is that the Postgres server should log an
error when the max_slot_wal_keep_size limit is breached for temporary
replication slots as well as for permanent ones. Otherwise,
users/administrators are presented only with nondescript connection
termination errors which do not point to the actual cause of the problem.

