Re: WAL receive process dies - Mailing list pgsql-general

From Patrick Krecker
Subject Re: WAL receive process dies
Date
Msg-id CAK2mJFMz7ST1BBL4_CJnoiRkWdhppy5yDeDmS5SQTm+mjVHs3w@mail.gmail.com
In response to Re: WAL receive process dies  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: WAL receive process dies  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-general

On Fri, Aug 29, 2014 at 2:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-08-29 13:04:43 -0700, Patrick Krecker wrote:
> > Hi Craig -- Sorry for the late response, I've been tied up with some other
> > things for the last day. Just to give some context, this is a machine that
> > sits in our office and replicates from another read slave in production via
> > a tunnel set up with spiped. The spiped tunnel is working and postgres is
> > still stuck (it has been stuck since 8-25).
> >
> > The last moment that replication was working was 2014-08-25
> > 22:06:05.03972. We have a table called replication_time with one column and
> > one row that has a timestamp that is updated every second, so it's easy to
> > tell the last time this machine was in sync with production.
> >
> > recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
> > logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q
>
> The problem is this log entry:
>
> 2014-08-27 18:44:27 PDT ERROR:  requested starting point 175/28000000 is
> ahead of the WAL flush position of this server 174/B76D16A8
>
> That's the walreceiver connecting to the upstream node and asking for
> WAL. Somehow the standby has gotten *considerably* ahead of the node
> it's trying to receive WAL from.
>
> Are you sure primary_conninfo / the spiped tunnel connects to the right
> postgres instance? Did you possibly have a failover or something like that?
>
> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

We didn't have a failover. We just have one master and one replica, in different AZs on Amazon EC2.
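
For what it's worth, checking pg_stat_replication on the upstream node is one way to confirm that topology, since it lists whichever standbys are actually streaming from it. A rough sketch (the host name here is a placeholder, and the location columns are only visible to a superuser):

psql -h production-replica.example.com -U postgres judicata -c "select application_name, client_addr, state, sent_location, flush_location, replay_location from pg_stat_replication;"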

I ran the following on the local endpoint of spiped:

while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location() from replication_time";  done;

I ran the same command directly on the production replica and was able to verify that the xlog positions for a given point in time were the same (modtime is updated every second by an upstart job):

spiped from office -> production:
          modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
 2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8

Ran directly on production replica:
          modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
 2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8

To me, this is sufficient proof that spiped is indeed talking to the machine I think it's talking to (also lsof reports the correct hostname).
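
For reference, checking both ends of the tunnel looks roughly like this (hypothetical lsof invocations; 5445 is the local spiped port from the psql loop above):

# connections into the local spiped endpoint (the walreceiver shows up here)
sudo lsof -iTCP:5445 -sTCP:ESTABLISHED
# where the spiped client itself connects to (should be the production replica)
sudo lsof -a -c spiped -iTCP -sTCP:ESTABLISHED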

I created another basebackup from the currently stuck postgres instance on another machine, and I also get this error:

2014-08-29 15:27:30 PDT FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 177/2D000000 is ahead of the WAL flush position of this server 174/B76D16A8

However, this new instance is able to fetch logs from S3 and replay them without issue.
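
For context, the standby is configured for both streaming replication and archive recovery from S3, so recovery.conf is shaped roughly like the sketch below. The real file is in the pastie linked earlier; the restore script name and connection details here are placeholders:

standby_mode = 'on'
# streaming replication goes through the local spiped endpoint
primary_conninfo = 'host=localhost port=5445 user=replicator'
# archived WAL segments are fetched from S3 by a wrapper script
restore_command = '/usr/local/bin/fetch-wal-from-s3 %f "%p"'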

Is it possible that the data directory on the stuck instance is just corrupt? It's not impossible for that to have happened at some point in the past due to the race condition in fetching logs from S3 that I mentioned earlier.
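
One way to see how far the stuck instance itself believes it has replayed, independent of the walreceiver, is pg_controldata against its data directory (the path below is a placeholder for the real data dir):

pg_controldata /var/lib/postgresql/9.3/main | grep -E 'checkpoint location|REDO location|recovery ending'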
 
