Hi
pg_basebackup -F t fails when fsync() takes longer than tcp_user_timeout in the following environment.
[Environment]
Postgres 13dev (master branch)
Red Hat Enterprise Linux 7.4
[Error]
$ pg_basebackup -F t --progress --verbose -h <hostname> -D <directory>
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/5A000060 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_15647"
pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[Analysis]
- pg_basebackup -F t creates a tar file for each tablespace and calls fsync() on each of them.
(By contrast, -F p calls fsync() only once, at the end.)
- While pg_basebackup is in fsync() on the tar file for one tablespace, the WAL sender is already sending the contents of the next tablespace.
If fsync() takes a long time, pg_basebackup's TCP socket starts returning "zero window" packets to the WAL sender.
This means the socket's receive buffer is full, because pg_basebackup cannot read from it during fsync().
- The WAL sender's socket keeps retrying the send, but resets the connection once tcp_user_timeout has elapsed.
After the reset, pg_basebackup can no longer receive data and fails with the error above.
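To illustrate the stalled-receiver part of this analysis, here is a minimal Python sketch (not PostgreSQL code). The function name fill_until_blocked and its parameters are hypothetical. A local "receiver" socket stands in for pg_basebackup blocked in fsync() and never reads, so the "sender" can only queue a bounded amount of data before the kernel refuses more. The sketch does not wait for tcp_user_timeout to expire, so it shows only the stall, not the reset.

```python
import socket


def fill_until_blocked(chunk_size=65536, limit=1 << 30):
    """Send to a peer that never calls recv(); return the bytes the kernel accepted."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free local port
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()  # plays pg_basebackup; it never reads

    sender.setblocking(False)
    chunk = b"x" * chunk_size
    sent = 0
    try:
        while sent < limit:  # safety cap so the loop always terminates
            sent += sender.send(chunk)
    except BlockingIOError:
        pass  # the kernel will not queue any more data for this socket
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return sent


if __name__ == "__main__":
    print("bytes accepted before the sender stalled:", fill_until_blocked())
```

The exact byte count depends on the kernel's socket buffer settings. Once the sender is in this state, a real connection with tcp_user_timeout set would be reset if the receiver stayed blocked for longer than the timeout.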
[Solution]
I think calling fsync() for each tablespace is not necessary.
As with pg_basebackup -F p, I think it is enough to call fsync() only once, at the end.
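The idea can be sketched roughly as follows. This is only an illustration in Python under my own assumptions, not the actual pg_basebackup code: the function write_then_sync_all is hypothetical, and fsync() on a directory descriptor assumes a POSIX system. All files are written first without any fsync(), and the fsync() calls happen in one pass at the end, when nothing more has to be read from the connection.

```python
import os


def write_then_sync_all(directory, files):
    """Write every file first, then fsync() them all in one final pass."""
    paths = []
    for name, data in files.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)  # no fsync() here, so the receive loop is not stalled
        paths.append(path)

    # Single pass at the end: flush each file to stable storage.
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Also flush the directory entry itself (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return paths
```

For example, write_then_sync_all("/path/to/backup", {"base.tar": b"...", "16384.tar": b"..."}) would write both tar files and only then sync them. The path and file contents here are placeholders.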
Could you give me any comments?
Regards,
Ryohei Takahashi