pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout - Mailing list pgsql-hackers

From: r.takahashi_2@fujitsu.com
Subject: pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout
Msg-id: OSBPR01MB4550DAE2F8C9502894A45AAB82BE0@OSBPR01MB4550.jpnprd01.prod.outlook.com
Responses: Re: pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout
List: pgsql-hackers
Hi


pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout in the following environment.

[Environment]
Postgres 13dev (master branch)
Red Hat Enterprise Linux 7.4

[Error]
$ pg_basebackup -F t --progress --verbose -h <hostname> -D <directory>
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/5A000060 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_15647"
pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[Analysis]
- pg_basebackup -F t writes one tar file per tablespace and calls fsync() on each tar file as soon as it is complete.
  (By contrast, -F p calls fsync() only once, at the end.)
- While pg_basebackup is blocked in fsync() for one tablespace's tar file, the walsender keeps sending the contents of the next tablespace.
  If fsync() takes a long time, pg_basebackup's TCP socket answers the walsender with "zero window" packets.
  This means pg_basebackup's socket receive buffer is full, because pg_basebackup does not read from the socket while it is in fsync().
- The walsender's socket keeps retrying the send, but resets the connection once tcp_user_timeout has elapsed.
  After the walsender resets the connection, pg_basebackup can no longer receive data and fails with the error above.
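
The mechanism above can be put into rough numbers. This is a simplified model, not pg_basebackup code: it assumes the walsender sends at a constant rate, that pg_basebackup reads nothing while blocked in fsync(), and, following the behavior reported in this mail, that the connection is reset once the zero-window stall lasts longer than tcp_user_timeout (how kernels apply that timeout to zero-window probes may differ by version). The names Scenario, zero_window_stall_seconds and connection_reset, and all numbers in the example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical inputs for a rough model of the stall; none are measured values."""
    recv_buffer_bytes: int        # free space in pg_basebackup's socket receive buffer
    send_rate_bytes_per_s: float  # rate at which the walsender pushes tar data
    fsync_seconds: float          # how long pg_basebackup stays blocked in fsync()
    tcp_user_timeout_ms: int      # server's tcp_user_timeout (assumed > 0 here)


def zero_window_stall_seconds(s: Scenario) -> float:
    """Seconds the walsender faces a zero window while pg_basebackup is in fsync()."""
    # The receive buffer absorbs data for this long before the window drops to zero.
    fill_seconds = s.recv_buffer_bytes / s.send_rate_bytes_per_s
    # For the rest of the fsync(), nothing is read, so the sender is stuck.
    return max(0.0, s.fsync_seconds - fill_seconds)


def connection_reset(s: Scenario) -> bool:
    """True if, in this model, the stall outlasts tcp_user_timeout."""
    return zero_window_stall_seconds(s) * 1000.0 > s.tcp_user_timeout_ms


# Example: 4 MiB buffer, 100 MiB/s sender, 30 s fsync(), tcp_user_timeout = 10 s.
example = Scenario(4 * 1024 * 1024, 100 * 1024 * 1024, 30.0, 10_000)
print(connection_reset(example))  # True: the buffer fills in 0.04 s, leaving a ~30 s stall
```

In this example the receive buffer only postpones the zero window by a fraction of a second; what decides the outcome is how long the client stops reading compared with tcp_user_timeout.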

[Solution]
I think the fsync() for each tablespace is not necessary.
As with pg_basebackup -F p, I think fsync() is needed only once, at the end.
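
A minimal sketch of the proposed change, assuming the backup data has already been split per tablespace: each tar file is written without fsync(), and all tar files plus their directory are fsync()ed once after everything has been received, as -F p does. This is illustrative Python, not pg_basebackup's C code; write_tablespace_tar, fsync_path and receive_backup are hypothetical names, and the in-memory dict stands in for the COPY stream coming from the walsender.

```python
import io
import os
import tarfile


def write_tablespace_tar(directory: str, name: str, files: dict[str, bytes]) -> str:
    """Write one tablespace's files to <name>.tar without calling fsync()."""
    path = os.path.join(directory, f"{name}.tar")
    with tarfile.open(path, "w") as tar:
        for member_name, data in files.items():
            info = tarfile.TarInfo(member_name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return path


def fsync_path(path: str) -> None:
    """fsync() a file or directory by path (directory fsync works on Linux/POSIX, not Windows)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def receive_backup(directory: str, tablespaces: dict[str, dict[str, bytes]]) -> list[str]:
    """Write every tablespace first, then make all of them durable in one pass at the end."""
    written = []
    for name, files in tablespaces.items():
        # No fsync() here, so the client goes straight back to reading the next tablespace.
        written.append(write_tablespace_tar(directory, name, files))
    # Single durability pass after all data has been received.
    for path in written:
        fsync_path(path)
    fsync_path(directory)  # also persist the new directory entries
    return written
```

The trade-off is that no tar file is durable until the whole backup has arrived, which is what -F p already accepts; an interrupted backup is unusable either way.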


Could you give me any comments?


Regards,
Ryohei Takahashi



