Thread: Streaming replication question

Streaming replication question

From
Mark Steben
Date:
Good morning,
We run postgres 9.4. Early Saturday morning we had a production postgres outage because our pg_xlog directory ran out of space.  Tracing the cause points to a scheduled reboot of the 3 database servers by our IT team to install some monitoring software.  Because we did a simple stop/restart of the database our replication slots did not get reset and pg_logs queued up until the out-of-space condition

Can someone please verify that the following WOULD HAVE BEEN the appropriate action:
(Or offer corrections)
  1. Stop the postgres database (on both slaves)
  2. Run pg_drop_replication_slot (rep1, rep2 slotnames)  (on Master)
  3. Stop the postgres database (on master)
  4.  Reboot Linux Server to bring in monitoring software
  5.  Start the postgres database (on master)
  6.  Run pg_create_replication_slot (rep1, rep2 slotnames) (On Master)
  7. Start the postgres database (on both slaves)

Thank you 

--
Mark Steben
 Database Administrator
@utoRevenue | Autobase 
  CRM division of Dominion Dealer Solutions 
95D Ashley Ave.
West Springfield, MA 01089

t: 413.327-3045
f: 413.383-9567

www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
 www.drivedominion.com





Re: Streaming replication question

From
Keith
Date:


On Sat, Aug 3, 2019 at 9:36 AM Mark Steben <mark.steben@drivedominion.com> wrote:
Good morning,
We run postgres 9.4. Early Saturday morning we had a production postgres outage because our pg_xlog directory ran out of space.  Tracing the cause points to a scheduled reboot of the 3 database servers by our IT team to install some monitoring software.  Because we did a simple stop/restart of the database our replication slots did not get reset and pg_logs queued up until the out-of-space condition

Can someone please verify that the following WOULD HAVE BEEN the appropriate action:
(Or offer corrections)
  1. Stop the postgres database (on both slaves)
  2. Run pg_drop_replication_slot (rep1, rep2 slotnames)  (on Master)
  3. Stop the postgres database (on master)
  4.  Reboot Linux Server to bring in monitoring software
  5.  Start the postgres database (on master)
  6.  Run pg_create_replication_slot (rep1, rep2 slotnames) (On Master)
  7. Start the postgres database (on both slaves)

Thank you 

--
Mark Steben
 Database Administrator
@utoRevenue | Autobase 
  CRM division of Dominion Dealer Solutions 
95D Ashley Ave.
West Springfield, MA 01089

t: 413.327-3045
f: 413.383-9567

www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
 www.drivedominion.com





Restarting your databases should not have affected the replication slots like this. If you stopped the replicas first, then the primary would have just started keeping all the WAL files until the replicas came back, resuming from where they left off. If you stopped the primary first then the replicas would have just lost their connection until the primary came back and made its slots available again.

The only time the slots don't stick around is if you do a failover from the primary to one of the replicas. In that case, yes, you do have to recreate the slots. When things were rebooted, was there any failovers kicked off?

I'd check your postgres logs to see if there's any hint as to why things turned out the way they did. If all systems were rebooted at the exact same time, there may be some edge-case bug that's not being accounted for. But without a deeper dive into what happened, that would be hard to say and seems unlikely. I would say to make sure you are on the most recent version of 9.4 to ensure any known bug fixes are in place. Also highly recommend on planning to upgrade to a newer major version. 9.4 goes out of support this fall upon the release of version 12.

Keith

Re: Streaming replication question

From
Mark Steben
Date:
Thank your for your prompt response Keith.  You were correct.  The root of the problem was an ssh host key change which caused all logshipping to error, and therefore pg_xlog queued up. 

On Sat, Aug 3, 2019 at 11:23 AM Keith <keith@keithf4.com> wrote:


On Sat, Aug 3, 2019 at 9:36 AM Mark Steben <mark.steben@drivedominion.com> wrote:
Good morning,
We run postgres 9.4. Early Saturday morning we had a production postgres outage because our pg_xlog directory ran out of space.  Tracing the cause points to a scheduled reboot of the 3 database servers by our IT team to install some monitoring software.  Because we did a simple stop/restart of the database our replication slots did not get reset and pg_logs queued up until the out-of-space condition

Can someone please verify that the following WOULD HAVE BEEN the appropriate action:
(Or offer corrections)
  1. Stop the postgres database (on both slaves)
  2. Run pg_drop_replication_slot (rep1, rep2 slotnames)  (on Master)
  3. Stop the postgres database (on master)
  4.  Reboot Linux Server to bring in monitoring software
  5.  Start the postgres database (on master)
  6.  Run pg_create_replication_slot (rep1, rep2 slotnames) (On Master)
  7. Start the postgres database (on both slaves)

Thank you 

--
Mark Steben
 Database Administrator
@utoRevenue | Autobase 
  CRM division of Dominion Dealer Solutions 
95D Ashley Ave.
West Springfield, MA 01089

t: 413.327-3045
f: 413.383-9567

www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
 www.drivedominion.com





Restarting your databases should not have affected the replication slots like this. If you stopped the replicas first, then the primary would have just started keeping all the WAL files until the replicas came back, resuming from where they left off. If you stopped the primary first then the replicas would have just lost their connection until the primary came back and made its slots available again.

The only time the slots don't stick around is if you do a failover from the primary to one of the replicas. In that case, yes, you do have to recreate the slots. When things were rebooted, was there any failovers kicked off?

I'd check your postgres logs to see if there's any hint as to why things turned out the way they did. If all systems were rebooted at the exact same time, there may be some edge-case bug that's not being accounted for. But without a deeper dive into what happened, that would be hard to say and seems unlikely. I would say to make sure you are on the most recent version of 9.4 to ensure any known bug fixes are in place. Also highly recommend on planning to upgrade to a newer major version. 9.4 goes out of support this fall upon the release of version 12.

Keith


--
Mark Steben
 Database Administrator
@utoRevenue | Autobase 
  CRM division of Dominion Dealer Solutions 
95D Ashley Ave.
West Springfield, MA 01089

t: 413.327-3045
f: 413.383-9567

www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
 www.drivedominion.com