Thread: pgBackRest : "ERROR: [042]: unexpected EOF reading line from remoteprocess on" and stale pgbackrest processes hanging

Hello List,

I had this error today with postgresql 10.6 and pgbackrest 2.07. I was running test backups and during a backup
The command given to the repo host (pg1-host=10.9.0.77) was :
pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup
I noticed that the (two) ssh login sessions to the repo host just froze. No disconnections, just froze. So some network
problemsurely occurred.
 
On the repo host pgbackrest log I got :
2019-01-16 12:20:38.372 P00  ERROR: [042]: unexpected EOF reading line from remote process on '10.9.0.77'
on the pg host (10.9.0.77) I noticed that the pgbackrest backup processes had died, the only ones running was spawn
fromthe archive-push , however on the repo host the pgbackrest processes were 
 
alive, when in fact the backup was stalled.
On the repo I got :
root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77
tcp        0     88 10.9.1.22:43837 10.9.0.77:22            ESTABLISHED 11918/ssh
tcp        0     88 10.9.1.22:43838 10.9.0.77:22            ESTABLISHED 11920/ssh
tcp        0     88 10.9.1.22:43839 10.9.0.77:22            ESTABLISHED 11919/ssh
tcp        0     88 10.9.1.22:43840 10.9.0.77:22            ESTABLISHED 11922/ssh
root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77
tcp        0     88 10.9.1.22:43837 10.9.0.77:22            ESTABLISHED 11918/ssh        on (65.50/12/0)
tcp        0     88 10.9.1.22:43838 10.9.0.77:22            ESTABLISHED 11920/ssh        on (69.60/12/0)
tcp        0     88 10.9.1.22:43839 10.9.0.77:22            ESTABLISHED 11919/ssh        on (65.50/12/0)
tcp        0     88 10.9.1.22:43840 10.9.0.77:22            ESTABLISHED 11922/ssh        on (61.40/12/0)

In order to start a new backup I had to kill the 4 processes.

Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ?

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Hello so should I post an issue to github? or just wait to see if this 
happens again?

On 16/1/19 5:24 μ.μ., Achilleas Mantzios wrote:
> Hello List,
>
> I had this error today with postgresql 10.6 and pgbackrest 2.07. I was 
> running test backups and during a backup
> The command given to the repo host (pg1-host=10.9.0.77) was :
> pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup
> I noticed that the (two) ssh login sessions to the repo host just 
> froze. No disconnections, just froze. So some network problem surely 
> occurred.
> On the repo host pgbackrest log I got :
> 2019-01-16 12:20:38.372 P00  ERROR: [042]: unexpected EOF reading line 
> from remote process on '10.9.0.77'
> on the pg host (10.9.0.77) I noticed that the pgbackrest backup 
> processes had died, the only ones running was spawn from the 
> archive-push , however on the repo host the pgbackrest processes were 
> alive, when in fact the backup was stalled.
> On the repo I got :
> root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77
> tcp        0     88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 11918/ssh
> tcp        0     88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 11920/ssh
> tcp        0     88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 11919/ssh
> tcp        0     88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 11922/ssh
> root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77
> tcp        0     88 10.9.1.22:43837 10.9.0.77:22 ESTABLISHED 
> 11918/ssh        on (65.50/12/0)
> tcp        0     88 10.9.1.22:43838 10.9.0.77:22 ESTABLISHED 
> 11920/ssh        on (69.60/12/0)
> tcp        0     88 10.9.1.22:43839 10.9.0.77:22 ESTABLISHED 
> 11919/ssh        on (65.50/12/0)
> tcp        0     88 10.9.1.22:43840 10.9.0.77:22 ESTABLISHED 
> 11922/ssh        on (61.40/12/0)
>
> In order to start a new backup I had to kill the 4 processes.
>
> Is this behavior normal ? Would the 4 processes be killed due to 
> KEEPALIVE settings eventually ?
>


Greetings,

* Achilleas Mantzios (achill@matrix.gatewaynet.com) wrote:
> I had this error today with postgresql 10.6 and pgbackrest 2.07. I was running test backups and during a backup
> The command given to the repo host (pg1-host=10.9.0.77) was :
> pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup
> I noticed that the (two) ssh login sessions to the repo host just froze. No disconnections, just froze. So some
networkproblem surely occurred. 
> On the repo host pgbackrest log I got :
> 2019-01-16 12:20:38.372 P00  ERROR: [042]: unexpected EOF reading line from remote process on '10.9.0.77'
> on the pg host (10.9.0.77) I noticed that the pgbackrest backup processes
> had died, the only ones running was spawn from the archive-push , however on
> the repo host the pgbackrest processes were alive, when in fact the backup
> was stalled.

> On the repo I got :
> root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77
> tcp        0     88 10.9.1.22:43837 10.9.0.77:22            ESTABLISHED 11918/ssh
> tcp        0     88 10.9.1.22:43838 10.9.0.77:22            ESTABLISHED 11920/ssh
> tcp        0     88 10.9.1.22:43839 10.9.0.77:22            ESTABLISHED 11919/ssh
> tcp        0     88 10.9.1.22:43840 10.9.0.77:22            ESTABLISHED 11922/ssh
> root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77
> tcp        0     88 10.9.1.22:43837 10.9.0.77:22            ESTABLISHED 11918/ssh        on (65.50/12/0)
> tcp        0     88 10.9.1.22:43838 10.9.0.77:22            ESTABLISHED 11920/ssh        on (69.60/12/0)
> tcp        0     88 10.9.1.22:43839 10.9.0.77:22            ESTABLISHED 11919/ssh        on (65.50/12/0)
> tcp        0     88 10.9.1.22:43840 10.9.0.77:22            ESTABLISHED 11922/ssh        on (61.40/12/0)

This looks like the kernel thinks the connections are still alive.

> In order to start a new backup I had to kill the 4 processes.
>
> Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ?

Yes, they almost certainly would have timed out eventually- but how
long were you already waiting..?  The timeouts in pgbackrest are not
terribly long, in general.

Thanks!

Stephen

Attachment
Hello Stephen

On 16/1/19 10:27 μ.μ., Stephen Frost wrote:

> Greetings,
>
> * Achilleas Mantzios (achill@matrix.gatewaynet.com) wrote:
>> I had this error today with postgresql 10.6 and pgbackrest 2.07. I was running test backups and during a backup
>> The command given to the repo host (pg1-host=10.9.0.77) was :
>> pgbackrest@smadb2cs:~$ pgbackrest --stanza=dynacomtest backup
>> I noticed that the (two) ssh login sessions to the repo host just froze. No disconnections, just froze. So some
networkproblem surely occurred.
 
>> On the repo host pgbackrest log I got :
>> 2019-01-16 12:20:38.372 P00  ERROR: [042]: unexpected EOF reading line from remote process on '10.9.0.77'
>> on the pg host (10.9.0.77) I noticed that the pgbackrest backup processes
>> had died, the only ones running was spawn from the archive-push , however on
>> the repo host the pgbackrest processes were alive, when in fact the backup
>> was stalled.
>> On the repo I got :
>> root@smadb2cs:/home/pgbackrest# netstat -apn | grep 10.9.0.77
>> tcp        0     88 10.9.1.22:43837 10.9.0.77:22            ESTABLISHED 11918/ssh
>> tcp        0     88 10.9.1.22:43838 10.9.0.77:22            ESTABLISHED 11920/ssh
>> tcp        0     88 10.9.1.22:43839 10.9.0.77:22            ESTABLISHED 11919/ssh
>> tcp        0     88 10.9.1.22:43840 10.9.0.77:22            ESTABLISHED 11922/ssh
>> root@smadb2cs:/home/pgbackrest# netstat -anp --timers | grep 0.77
>> tcp        0     88 10.9.1.22:43837 10.9.0.77:22            ESTABLISHED 11918/ssh        on (65.50/12/0)
>> tcp        0     88 10.9.1.22:43838 10.9.0.77:22            ESTABLISHED 11920/ssh        on (69.60/12/0)
>> tcp        0     88 10.9.1.22:43839 10.9.0.77:22            ESTABLISHED 11919/ssh        on (65.50/12/0)
>> tcp        0     88 10.9.1.22:43840 10.9.0.77:22            ESTABLISHED 11922/ssh        on (61.40/12/0)
> This looks like the kernel thinks the connections are still alive.
>
>> In order to start a new backup I had to kill the 4 processes.
>>
>> Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ?
> Yes, they almost certainly would have timed out eventually- but how
> long were you already waiting..?  The timeouts in pgbackrest are not
> terribly long, in general.
Looking at the log I waited 9 minutes, before manually killing.
>
> Thanks!
>
> Stephen


On 1/16/19 10:27 PM, Stephen Frost wrote:
> * Achilleas Mantzios (achill@matrix.gatewaynet.com) wrote:
>>
>> Is this behavior normal ? Would the 4 processes be killed due to KEEPALIVE settings eventually ?
> 
> Yes, they almost certainly would have timed out eventually- but how
> long were you already waiting..?  The timeouts in pgbackrest are not
> terribly long, in general.

By default the remote processes will shutdown after 30 minutes if they 
have not gotten a message from the main process.  This is controlled by 
the `protocol-timeout` option.

It's a bit unusual for them to get stuck like this -- usually they get 
an eof and exit immediately.  Definitely looks like something was up 
with SSH.

Regards,
-- 
-David
david@pgmasters.net