Re: WIP: Failover Slots - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: WIP: Failover Slots
Msg-id CAMsr+YHYV78q_8gDKOgTNZQD9Lrfwa=5E0kOfFbrjjTDHhX+4A@mail.gmail.com
In response to Re: WIP: Failover Slots  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: WIP: Failover Slots  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On 6 April 2016 at 17:43, Simon Riggs <simon@2ndquadrant.com> wrote:
On 25 January 2016 at 14:25, Craig Ringer <craig@2ndquadrant.com> wrote:
 
I'd like to get failover slots in place for 9.6 since they're fairly self-contained and meet an immediate need: allowing replication using slots (physical or logical) to follow a failover event.

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now follow a failover event? 
 

There are many and varied users of logical decoding now and a fix is critically important for 9.6.

I agree with you, but I haven't been able to convince enough people of that.
 
 Do all decoding plugins need to write their own support code?

We'll be able to write a bgworker based extension that handles it by running in the standby. So no, I don't think so.
 
Please explain how we cope without this, so if a problem remains we can fix by the freeze.

The TL;DR: Create a slot on the master to hold catalog_xmin where the replica needs it. Advance it using client or bgworker on replica based on the catalog_xmin of the oldest slot on the replica. Copy slot state from the master using an extension that keeps the slots on the replica reasonably up to date.

All of this is ugly workaround for not having true slot failover support. I'm not going to pretend it's nice, or anything that should go anywhere near core. Petr outlined the approach we want to take for core in 9.7 on the logical timeline following thread.

 
Details:

Logical decoding on a slot can follow timeline switches now - or rather, the xlogreader knows how to follow timeline switches, and the read page callback used by logical decoding uses that functionality now.

This doesn't help by itself because slots aren't synced to replicas, so they're lost on failover promotion.

Nor can a client just create a backup slot for itself on the replica to be ready for failover:

- it has no way to create a new slot at a consistent point on the replica since logical decoding isn't supported on replicas yet;
- it can't advance a logical slot on the replica once created since decoding isn't permitted on a replica, so it can't just decode from the replica in lockstep with the master;
- it has no way to stop the master from removing catalog tuples still needed by the slot's catalog_xmin since catalog_xmin isn't propagated from standby to master.

So we have to help the client out. To do so, we have a function/worker/whatever on the replica that grabs the slot state from the master and copies it to the replica, and we have to hold the master's catalog_xmin down to the catalog_xmin required by the slots on the replica.

Holding the catalog_xmin down is the easier bit. We create a dummy logical slot on the master, maintained by a function/bgworker/whatever on the replica. It gets advanced so that its restart_lsn and catalog_xmin are those of the oldest slot on the replica. We can do that by requesting replay on it up to the lowest confirmed_lsn of any slot on the replica. Ugly, but workable. Or we can abuse the infrastructure more deeply by simply setting the catalog_xmin and restart_lsn on the slot directly, but I'd rather not.
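To illustrate the advance step (a sketch only; the slot names and dict layout are invented, and a real worker would read pg_replication_slots and then consume the master's dummy slot up to the chosen point, e.g. via pg_logical_slot_get_changes with its upto_lsn argument):

```python
# Hypothetical replica-side helper: pick the replay target for the
# master's dummy slot. Field names mirror pg_replication_slots;
# everything else here is invented for illustration.

def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN like '0/16B3748' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def oldest_confirmed_lsn(replica_slots) -> str:
    """Return the lowest confirmed_flush_lsn among the replica's logical
    slots; the master's dummy slot must not be advanced past this point,
    or catalog tuples the replica's slots still need could be vacuumed."""
    laggiest = min(replica_slots,
                   key=lambda s: parse_lsn(s["confirmed_flush_lsn"]))
    return laggiest["confirmed_flush_lsn"]

slots = [
    {"slot_name": "sub_a", "confirmed_flush_lsn": "0/3000060"},
    {"slot_name": "sub_b", "confirmed_flush_lsn": "0/2FFFFF8"},
]
print(oldest_confirmed_lsn(slots))  # prints 0/2FFFFF8, the laggiest slot
```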

Just copying slot state is pretty simple too, as at the C level you can create a physical or logical slot with whatever state you want.

However, that lets you copy/create any number of bogus ones, many of which will appear to work fine but will be subtly broken. Since the replica is an identical copy of the master we know that a slot state that was valid on the master at a given xlog insert lsn is also valid on the replica at the same replay lsn, but we've got no reliable way to ensure that when the master updates a slot at LSN A/B the replica also updates the slot at replay of LSN A/B. That's what failover slots did. Without that we need to use some external channel - but there's no way to capture knowledge of "at exactly LSN A/B, master saved a new copy of slot X" since we can't hook ReplicationSlotSave(). At least we *can* now inject slot state updates as generic WAL messages though, so we can ensure they happen at exactly the desired point in replay.
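To make the injected update concrete: the payload only needs the slot's identity plus the fields that matter for safety. A hypothetical serialization (the field names mirror pg_replication_slots, but the wire format here is entirely invented; the real thing would be a C struct written into the message):

```python
import json

# Invented wire format for a slot-state update carried in WAL.

def encode_slot_update(slot: dict) -> bytes:
    """Serialize the fields a replica needs to recreate or advance the
    slot at exactly this point in replay."""
    payload = {
        "slot_name": slot["slot_name"],
        "plugin": slot["plugin"],
        "restart_lsn": slot["restart_lsn"],
        "confirmed_flush_lsn": slot["confirmed_flush_lsn"],
        "catalog_xmin": slot["catalog_xmin"],
    }
    return json.dumps(payload, sort_keys=True).encode()

def decode_slot_update(raw: bytes) -> dict:
    """Inverse of encode_slot_update, run on the replica at replay."""
    return json.loads(raw.decode())

update = {
    "slot_name": "sub_a", "plugin": "test_decoding",
    "restart_lsn": "0/3000028", "confirmed_flush_lsn": "0/3000060",
    "catalog_xmin": 760,
}
assert decode_slot_update(encode_slot_update(update)) == update
```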

As Andres explained on the timeline following thread it's not safe for the slot on the replica to be behind the state the slot on the master was at the same LSN. At least unless we can protect catalog_xmin via some other mechanism so we can make sure no catalogs still needed by the slots on the replica are vacuumed away. It's vital that the catalog_xmin of any slots on the replica be >= the catalog_xmin the master had for the lowest catalog_xmin of any of its slots at the same LSN.
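The invariant could be checked on apply along these lines (a sketch, not the real C code; xid comparison approximates PostgreSQL's modulo-2^32 TransactionIdPrecedes semantics for normal xids):

```python
def xid_precedes(a: int, b: int) -> bool:
    """Modulo-2^32 transaction id comparison, in the spirit of
    PostgreSQL's TransactionIdPrecedes for normal xids."""
    diff = (a - b) & 0xFFFFFFFF
    return diff > 0x7FFFFFFF  # a < b iff signed 32-bit (a - b) is negative

def replica_slot_is_safe(replica_catalog_xmin: int,
                         master_catalog_xmin_at_lsn: int) -> bool:
    """The replica's slot must not need older catalogs than the master
    was protecting at the same LSN, i.e. its catalog_xmin must be >=."""
    return not xid_precedes(replica_catalog_xmin, master_catalog_xmin_at_lsn)

assert replica_slot_is_safe(760, 750)      # replica is ahead: fine
assert not replica_slot_is_safe(740, 750)  # replica needs vacuumed catalogs
```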

So what I figure we'll do is poll slot shmem on the master. When we notice that a slot has changed we'll dump it into xlog via the generic xlog mechanism to be applied on the replica, much like failover slots. The slot update might arrive a bit late on the replica, but that's OK because we're holding catalog_xmin pinned on the master using the dummy slot.
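The polling side reduces to snapshotting slot state and diffing; a hypothetical change detector (slot state as dicts keyed by name, all names invented):

```python
def changed_slots(prev: dict, curr: dict) -> list:
    """Given two snapshots {slot_name: state}, return the slots whose
    state changed (or that are new) and so need an update record
    written to WAL for the replica to apply."""
    return [name for name, state in curr.items() if prev.get(name) != state]

before = {"sub_a": {"confirmed_flush_lsn": "0/3000060"}}
after = {
    "sub_a": {"confirmed_flush_lsn": "0/3000098"},
    "sub_b": {"confirmed_flush_lsn": "0/1000000"},
}
print(changed_slots(before, after))  # prints ['sub_a', 'sub_b']
```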

I don't like it, but I don't have anything better for 9.6.

I'd really like to be able to build a more solid proof of concept that tests this with a lagging replica, but -ENOTIME before FF. 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
