Re: Detect when streaming replication stops streaming. - Mailing list pgsql-admin

From Keith Fiske
Subject Re: Detect when streaming replication stops streaming.
Date
Msg-id CAODZiv5Ytn0XPLwS65_Ow9FbjZc=O+7-5Z4yeWxLAiA6Fau_Lw@mail.gmail.com
Whole thread Raw
In response to Detect when streaming replication stops streaming.  (Anders Wegge Keller <wegge@wegge.dk>)
Responses Re: Detect when streaming replication stops streaming.  (Anders Wegge Keller <wegge@wegge.dk>)
List pgsql-admin


On Thu, Oct 4, 2018 at 6:13 AM Anders Wegge Keller <wegge@wegge.dk> wrote:
Is there a reliable way to detect, on the standby, that continuous recovery
have stopped for any reason? I'm adding a postgres 9.2(1) database to an
industrial control system that have to run unattended. Thus, I need to plan
for some really exotic failure modes, like e.g. loss of network connectivity
in an indefinitely long period.

On the master system, I can check for the existence of the proper row in
pg_stat_replication. If that is missing, I can report a degraded status on
the whole system. However, on the standby side, I've been unable to find
something similar. Especially the failure mode where the standby return from
a long period of no network connectivity, where the master no longer have
the WAL segments needed to catch up. Snooping for FATAL in
pg_log/Postgres-xxx is error prone, and I'd rather not go the route of
polling the master system for a situation that should be detectable on the
standby instance.


--
//Wegge

1. Yes, I know it's no longer supported, but that's the way of redhat.


Checking replication status from just the replica itself isn't that straight forward. This makes sense, though, since how is the replica supposed to know that it's behind if it doesn't compare itself to something else? PostgreSQL replication works by simply replaying the WAL stream from another database. When that WAL stream stops, the replication stops. This could be normal if the primary hasn't gotten any writes.

I wrote a brief blog post on monitoring replica lag:


There is a query you can run on the replica, but as I said above and in the blog post, this can cause false positives when there's no writes on the primary. But if you're always supposed to be getting writes to the primary, it's the easiest method to check, and could even help indicate when there's a problem if your primary isn't getting writes as it should be. I highly recommend also monitoring your primary for byte lag as well, so you can at least know when the replica is throwing a false positive.

Another option would be, as you hinted, to watch the replica for expected recent data. But again, if there's no writes, it can cause false positives. So it's really not much different than just checking for the last WAL replay.

If you're worried about the replica falling behind due to network outages or something similar, I recommend looking into WAL shipping in addition to streaming replication. If postgres doesn't get a successful run result from the archive_command, it keeps the WAL file around until it does. You will have to monitor for disk space usage issues on the primary if it backs up too much. If this happens a lot, you may be better off shipping them to a third-party server, preferably where you're storing your backups. You can then have the replica fetch its WAL stream from there if streaming replication fails. pgbackrest is one tool that can do this for you, as well as help manage backups. https://pgbackrest.org/

If you're able to upgrade to 9.4, postgres did introduce replication slots to help the primary actually be aware of the state of its replicas and keep WAL files around automatically until all its replicas are caught up. While 9.2 is the default version of PG that comes with Redhat/CentOS, the community does provide repositories with more recent rpm versions. https://www.postgresql.org/download/linux/redhat/  I would highly recommend looking into this since Redhat is not going to be updating their default version for a very long time.

--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com

pgsql-admin by date:

Previous
From: Joakim Lundgren
Date:
Subject: Re: Takes long time to "fail back" using pg_rewind
Next
From: Anders Wegge Keller
Date:
Subject: Re: Detect when streaming replication stops streaming.