Thread: Issue with pg_basebackup v.11

Issue with pg_basebackup v.11

From
Ninad Shah
Date:
Hello experts,

I am facing an issue on a customer's production server while trying to take a backup using pg_basebackup.

Below is the log from pg_basebackup execution.

 115338208/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.1)
 115355616/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.1)
 115372640/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.1)
 115389568/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.1)
 115405792/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.1)
 115423776/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.1)
 115440640/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.2)
 115454656/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.2)
pgbasebackup: could not read COPY data: could not receive data from server: Connection timed out
pgbasebackup: removing contents of data directory "/u01/PostgreSQL/11/datastaging"

It copied nearly 110 GB of data and then exited. Initially, we suspected a network or OS issue. However, we then copied a 150 GB file over the same network, and that transfer finished successfully.

What I observed is that a couple of hours pass between the following two lines.

 115454656/1304172127 kB (8%), 0/1 tablespace (...atastaging/base/115868/154220.2)
pgbasebackup: could not read COPY data: could not receive data from server: Connection timed out

In other words, it runs for about an hour, then sits for about 2 hours before it times out.

Can someone please help me out here?


Regards,
Ninad Shah

Re: Issue with pg_basebackup v.11

From
Tom Lane
Date:
Ninad Shah <nshah.postgres@gmail.com> writes:
> What I observed is that it takes a couple of hours between below 2 lines.

>  115454656/1304172127 kB (8%), 0/1 tablespace
> (...atastaging/base/115868/154220.2)
> pgbasebackup: could not read COPY data: could not receive data from server:
> Connection timed out

We have heard reports of network connections dropping while pg_basebackup
is busy doing something disk-intensive such as fsync'ing.  The apparent
2-hour delay here does not mean that pg_basebackup was out to lunch for
2 hours; more likely that reflects the TCP timeout delay before the kernel
realizes that the connection is lost.  The actual blame probably resides
with some firewall or router that has a short timeout for idle
connections.
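
To put rough numbers on that delay: libpq enables TCP keepalives by default but leaves the timings to the operating system, and stock Linux waits 7200 seconds of idle time before probing, then sends 9 probes 75 seconds apart. Assuming the client runs Linux with those defaults (not confirmed anywhere in this thread), a silently dropped idle connection would only be noticed after a bit more than two hours, which would be consistent with the gap reported. The sketch below just does that arithmetic; the function name and the "aggressive" values are illustrative, not taken from the thread.

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate time until the kernel gives up on an idle, silently
    dropped TCP connection: wait `idle` seconds, then send `count`
    probes spaced `interval` seconds apart with no reply."""
    return idle + interval * count


# Stock Linux values of net.ipv4.tcp_keepalive_time, tcp_keepalive_intvl
# and tcp_keepalive_probes. Assumption: the client host was never tuned.
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "count": 9}

# Illustrative aggressive settings; only idle=30 echoes a value that
# appears in this thread, the other two are hypothetical.
AGGRESSIVE = {"idle": 30, "interval": 10, "count": 3}

if __name__ == "__main__":
    for label, cfg in (("defaults", LINUX_DEFAULTS), ("aggressive", AGGRESSIVE)):
        secs = keepalive_detection_seconds(**cfg)
        print(f"{label}: about {secs} s ({secs / 60:.1f} min)")
```

Short keepalive timings help in two ways: the probes themselves count as traffic, so an idle-connection timer on a firewall is less likely to expire, and if the connection is lost anyway the failure shows up in about a minute rather than hours.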

I'd try turning on fairly aggressive TCP keepalive settings for the
connection, say keepalives_idle=30 or so.
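
One way to try this is to pass the libpq keepalive parameters in the connection string given to pg_basebackup with -d. The following is a minimal sketch that only builds the command line; it does not run anything. The host name, user name and target directory are hypothetical placeholders, and the interval and count values are assumptions chosen for illustration (only keepalives_idle=30 comes from the suggestion above).

```python
import shlex


def build_conninfo(host: str, user: str, idle: int = 30,
                   interval: int = 10, count: int = 3) -> str:
    """Build a libpq key=value connection string with TCP keepalives on.

    Values are assumed to contain no spaces or quotes, so no libpq-level
    quoting is applied here.
    """
    params = {
        "host": host,
        "user": user,
        "keepalives": 1,                  # libpq default, stated explicitly
        "keepalives_idle": idle,          # idle seconds before first probe
        "keepalives_interval": interval,  # seconds between probes
        "keepalives_count": count,        # unanswered probes before giving up
    }
    return " ".join(f"{key}={value}" for key, value in params.items())


def build_basebackup_cmd(conninfo: str, target_dir: str) -> list:
    """Return an argv list: -d takes the connection string, -D the target
    directory, -P turns on progress reporting."""
    return ["pg_basebackup", "-d", conninfo, "-D", target_dir, "-P"]


# Hypothetical host, user and directory; substitute real values.
cmd = build_basebackup_cmd(
    build_conninfo("primary.example.com", "replicator"),
    "/path/to/backup",
)

if __name__ == "__main__":
    print(shlex.join(cmd))  # shell-quoted form, ready to paste
```

These parameters only affect the client end of the connection. The server has matching settings (tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count) if the sending side needs the same treatment.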

            regards, tom lane



Re: Issue with pg_basebackup v.11

From
Ninad Shah
Date:
Hey Tom,

Thank you for your response. When we copy the data using scp or rsync, it works without any issue. However, the transfer fails when we attempt it with pg_basebackup.

Would a keepalive setting address or mitigate the issue?


Regards,
Ninad Shah

On Fri, 22 Oct 2021 at 21:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [...]

Re: Issue with pg_basebackup v.11

From
Tom Lane
Date:
Ninad Shah <nshah.postgres@gmail.com> writes:
> Would keepalive setting address and mitigate the issue?

[ shrug... ]  Maybe; nobody else has more information about this
situation than you do.  I suggested something to experiment with.

            regards, tom lane



Re: Issue with pg_basebackup v.11

From
Ninad Shah
Date:
Thanks Tom.


Regards,
Ninad Shah

On Sat, 23 Oct 2021 at 20:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [...]