Re: WAL receive process dies - Mailing list pgsql-general

From Patrick Krecker
Subject Re: WAL receive process dies
Msg-id CAK2mJFNK1aCNtGOqWAbf8fU3r-7y+yRa8VQ=9F=m0jx=d5CXeQ@mail.gmail.com
In response to Re: WAL receive process dies  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: WAL receive process dies  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-general
Hi Craig -- Sorry for the late response; I've been tied up with some other things for the last day. Just to give some context, this is a machine that sits in our office and replicates from another read slave in production via a tunnel set up with spiped. The spiped tunnel is working, but postgres is still stuck (it has been stuck since 8-25).

The last moment that replication was working was 2014-08-25 22:06:05.03972. We have a table called replication_time with one column and one row, holding a timestamp that is updated every second, so it's easy to tell the last time this machine was in sync with production.
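
For reference, the check itself is simple. Here is a minimal Python sketch of it, assuming the timestamp has already been read from replication_time as text on the replica (e.g. with SELECT * FROM replication_time). The function names and the five-minute threshold are made up for illustration; they are not something we actually run.

```python
from datetime import datetime, timedelta


def parse_heartbeat(text):
    """Parse a heartbeat timestamp as printed by postgres (no time zone)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def replication_lag(last_heartbeat, now):
    """How far behind the replica is, judged by the heartbeat row."""
    return now - last_heartbeat


def is_stuck(last_heartbeat, now, threshold=timedelta(minutes=5)):
    """True if the heartbeat is older than the (arbitrary) threshold."""
    return replication_lag(last_heartbeat, now) > threshold


if __name__ == "__main__":
    heartbeat = parse_heartbeat("2014-08-25 22:06:05.03972")
    # Both values are naive datetimes, so they must be in the same time zone.
    print(replication_lag(heartbeat, datetime.now()))
```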


Currently the WAL receive process is still not running. Interestingly, another pg instance running on the same machine is replicating just fine.

A note about that: with two instances running on that machine, there is a definite race condition in restore_wal_s3.py, which writes the file to /tmp before copying it to the destination requested by postgres (I just discovered this today; this is not generally how we run our servers). If both instances are restoring at the same time, they will step on each other's WAL archives being unzipped in /tmp and bad things will happen. Interestingly, though, I checked the logs for the other machine and there is no activity on that day. It does not appear that the WAL replay was invoked or that the WAL receive timed out.
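
One way the /tmp collision could be avoided, as a rough sketch only: give every restore its own uniquely named scratch file and rename it into place once it is complete. This is not the actual restore_wal_s3.py. The names restore_segment and fetch_to are hypothetical, with fetch_to standing in for whatever downloads and unzips a segment to a given path.

```python
import os
import tempfile


def restore_segment(fetch_to, destination):
    """Restore one WAL segment to `destination` without a shared temp path.

    `fetch_to(path)` is a hypothetical callable that downloads and
    decompresses the requested segment into `path`.
    """
    dest_dir = os.path.dirname(os.path.abspath(destination))
    # mkstemp picks a unique name, so two instances restoring at the same
    # time never write to the same scratch file.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".wal-restore-")
    os.close(fd)
    try:
        fetch_to(tmp_path)
        # Same directory means same filesystem, so the rename is atomic:
        # postgres sees either no file or the complete file.
        os.replace(tmp_path, destination)
    except BaseException:
        # Do not leave a partial segment behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The scratch file is created next to the destination here only to keep the rename atomic; a per-instance scratch directory on the same filesystem would serve equally well.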

As for enabling the core dump, it seems that it needs to be done when Postgres starts, and I thought I would leave it running in its "stuck" state for now. However, if you know how to enable it on a running process, let me know. We are running Ubuntu 13.10.


On Wed, Aug 27, 2014 at 11:30 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 08/28/2014 09:39 AM, Patrick Krecker wrote:
> We have a periodic network connectivity issue (unrelated to Postgres)
> that is causing the replication to fail.
>
> We are running Postgres 9.3 using streaming replication. We also have
> WAL archives available to be replayed with restore_command. Typically
> when I bring up a slave it copies over WAL archives for a while before
> connecting via streaming replication.
>
> When I notice the machine is behind in replication, I also notice that
> the WAL receiver process has died. There didn't seem to be any
> information in the logs about it.

What did you search for?

Do you have core dumps enabled? That'd be a good first step. (Exactly
how to do this depends on the OS/distro/version, but you basically want
to set "ulimit -c unlimited" on some ancestor of the postmaster).

> 1. It seems that Postgres does not fall back to copying WAL archives
> with its restore_command. I just want to confirm that this is what
> Postgres is supposed to do when its connection via streaming replication
> times out.

It should fall back.

> 2. Is it possible to restart replication after the WAL receiver process
> has died without restarting Postgres?

PostgreSQL should do so itself.

Please show your recovery.conf (appropriately redacted) and
postgresql.conf for the replica, and complete logs for the time period
of interest. You'll want to upload the logs somewhere then link to them,
do not attach them to an email to the list.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
