Home > mailing lists

Re: Streaming replication question - Mailing list pgsql-admin

From	Mark Steben
Subject	Re: Streaming replication question
Date	August 4, 2019 11:06:32
Msg-id	CADyzmyyG6i9fBcBVPXKzRZLBp23Sb9OWGpYYdYQT1iPNU8dYxw@mail.gmail.com Whole thread Raw
In response to	Re: Streaming replication question (Keith <keith@keithf4.com>)
List	pgsql-admin

Tree view

Thank your for your prompt response Keith. You were correct. The root of the problem was an ssh host key change which caused all logshipping to error, and therefore pg_xlog queued up.

On Sat, Aug 3, 2019 at 11:23 AM Keith <keith@keithf4.com> wrote:

On Sat, Aug 3, 2019 at 9:36 AM Mark Steben <mark.steben@drivedominion.com> wrote:
Good morning,
We run postgres 9.4. Early Saturday morning we had a production postgres outage because our pg_xlog directory ran out of space. Tracing the cause points to a scheduled reboot of the 3 database servers by our IT team to install some monitoring software. Because we did a simple stop/restart of the database our replication slots did not get reset and pg_logs queued up until the out-of-space condition

Can someone please verify that the following WOULD HAVE BEEN the appropriate action:
(Or offer corrections)
1. Stop the postgres database (on both slaves)
2. Run pg_drop_replication_slot (rep1, rep2 slotnames) (on Master)
3. Stop the postgres database (on master)
4. Reboot Linux Server to bring in monitoring software
5. Start the postgres database (on master)
6. Run pg_create_replication_slot (rep1, rep2 slotnames) (On Master)
7. Start the postgres database (on both slaves)

Thank you

--
Mark Steben
Database Administrator
@utoRevenue | Autobase
CRM division of Dominion Dealer Solutions
95D Ashley Ave.
West Springfield, MA 01089
t: 413.327-3045
f: 413.383-9567
www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
www.drivedominion.com

Restarting your databases should not have affected the replication slots like this. If you stopped the replicas first, then the primary would have just started keeping all the WAL files until the replicas came back, resuming from where they left off. If you stopped the primary first then the replicas would have just lost their connection until the primary came back and made its slots available again.

The only time the slots don't stick around is if you do a failover from the primary to one of the replicas. In that case, yes, you do have to recreate the slots. When things were rebooted, was there any failovers kicked off?

I'd check your postgres logs to see if there's any hint as to why things turned out the way they did. If all systems were rebooted at the exact same time, there may be some edge-case bug that's not being accounted for. But without a deeper dive into what happened, that would be hard to say and seems unlikely. I would say to make sure you are on the most recent version of 9.4 to ensure any known bug fixes are in place. Also highly recommend on planning to upgrade to a newer major version. 9.4 goes out of support this fall upon the release of version 12.

Keith

Mark Steben
Database Administrator
@utoRevenue | Autobase
CRM division of Dominion Dealer Solutions
95D Ashley Ave.
West Springfield, MA 01089
t: 413.327-3045
f: 413.383-9567

www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
www.drivedominion.com

pgsql-admin by date:

From: Jānis Pūris
Date: 04 August 2019, 00:27:39
Subject: HAProxy + Patroni + pgBouncer High Availability setup

From: Wells Oliver
Date: 05 August 2019, 16:32:23
Subject: Diagnosing a massive toast file

Re: Streaming replication question - Mailing list pgsql-admin

Previous

Next