Re: Logical decoding from promoted standby with same replication slot - Mailing list pgsql-hackers

From Jeremy Finzel
Subject Re: Logical decoding from promoted standby with same replication slot
Msg-id CAMa1XUgJBo5qaP5BhAobwqutx9NWX2VAc56w_mdZOmMWgPE38Q@mail.gmail.com
In response to Logical decoding from promoted standby with same replication slot  (Jeremy Finzel <finzelj@gmail.com>)
List pgsql-hackers
On Fri, Jul 13, 2018 at 2:30 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
Hello -

We are working on several DR scenarios with logical decoding.  Although we are using pglogical, I think our question applies to logical replication in general.

Say we need to drop a logical replication slot on the master for some emergency reason, but we don't want to lose the data permanently.  Before dropping it, we can take a point-in-time-recovery snapshot of the master, which we can later use to recover the data held by the slot.  Then we drop the slot on the master.

Once we promote the snapshot, we can point our logical subscription at it to pull the lost data.

The problem is that after promotion, logical decoding looks for a timeline 2 file, whereas the file is still on timeline 1.

The WAL file is 00000001000008FD0000003C, for example.  After promotion, it is still 00000001000008FD0000003C in pg_wal.  But logical decoding says ERROR: segment 00000002000008FD0000003C has already been removed (it is looking for a timeline 2 WAL file).  Simply renaming the file actually allows us to stream from the replication slot accurately and recover the data.
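To illustrate the rename described above, here is a minimal Python sketch. It relies on a WAL segment file name being 24 hexadecimal digits: 8 for the timeline ID, followed by 16 that identify the segment. The function names are hypothetical, and the sketch only computes the new name; it does not touch any files.

```python
import re

# A WAL segment file name is 24 uppercase hex digits:
# 8 for the timeline ID, then 16 identifying the segment.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def split_wal_name(name):
    """Split a WAL segment name into (timeline, log, segment) integers."""
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def with_timeline(name, new_timeline):
    """Return the name of the same segment on a different timeline."""
    _, log, seg = split_wal_name(name)
    return f"{new_timeline:08X}{log:08X}{seg:08X}"


# The segment from the example above, renamed to timeline 2.
print(with_timeline("00000001000008FD0000003C", 2))
# prints 00000002000008FD0000003C
```

The printed name matches the file that the error message says is missing.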

But all of this raises the question of whether there is an easier way to do this - why doesn't logical decoding know to look for a timeline 1 file?  It is really helpful to be able to easily recover logically replicated data from a snapshot of a replication slot in case of disaster.

All thoughts very welcome!

Thanks,
Jeremy

I'd like to bump this question with some elaboration on my original question: is it possible to do a *controlled* failover reliably with logical decoding, assuming there are unconsumed changes in the replication slot that the client still needs?

It is rather easy to do a controlled failover if we can verify there are no unconsumed changes in the slot before failover.  Then, we just recreate the slot on the promoted standby while clients are locked out, and we have not missed any data changes.
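As a rough sketch of the "no unconsumed changes" check, one conservative test is to compare the slot's confirmed_flush_lsn (from pg_replication_slots) against pg_current_wal_lsn() on the master while clients are locked out. The Python below only does the comparison; fetching the two values is left out, the function names are hypothetical, and the LSN values are made up for illustration. Treating "confirmed flush position has reached the current WAL position" as "nothing left to consume" is an assumption of this sketch.

```python
def parse_lsn(lsn):
    """Convert an LSN in 'X/Y' text form into a 64-bit integer.

    X is the high 32 bits and Y the low 32 bits, both in hexadecimal.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def slot_is_caught_up(confirmed_flush_lsn, current_wal_lsn):
    """True if the slot has confirmed everything up to the current WAL position."""
    return parse_lsn(confirmed_flush_lsn) >= parse_lsn(current_wal_lsn)


# Hypothetical values, as they might be read from pg_replication_slots
# and pg_current_wal_lsn() just before failover.
print(slot_is_caught_up("8FD/3C000140", "8FD/3C000140"))  # slot has caught up
print(slot_is_caught_up("8FD/3BFFFFF0", "8FD/3C000140"))  # slot is behind
```

Comparing the parsed integers rather than the text avoids mistakes when the two halves of the LSN have different numbers of digits.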

I am trying to figure out whether the problem of following timelines, as per this wiki for example: https://wiki.postgresql.org/wiki/Failover_slots, can be worked around in a controlled scenario.  One additional part of this is that after failover I have two WAL files for the same segment but on different timelines, and the promoted standby is only going to decode from the latter.  Does that mean I am likely to lose data?
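To make the "same segment on two timelines" situation concrete, here is a small Python sketch that groups a list of file names, such as a listing of pg_wal, by segment and reports the segments that exist on more than one timeline. The function name and the sample listing are hypothetical; non-segment files such as timeline history files are skipped.

```python
from collections import defaultdict

HEX_DIGITS = set("0123456789ABCDEF")


def segments_on_multiple_timelines(names):
    """Map each segment (last 16 hex digits) to its timelines, if more than one."""
    by_segment = defaultdict(set)
    for name in names:
        # Skip anything that is not a plain 24-hex-digit segment name.
        if len(name) != 24 or not set(name) <= HEX_DIGITS:
            continue
        by_segment[name[8:]].add(int(name[:8], 16))
    return {seg: sorted(tlis) for seg, tlis in by_segment.items() if len(tlis) > 1}


# Hypothetical directory listing after promotion.
listing = [
    "00000001000008FD0000003C",
    "00000002000008FD0000003C",
    "00000002000008FD0000003D",
    "00000002.history",
]
print(segments_on_multiple_timelines(listing))
# prints {'000008FD0000003C': [1, 2]}
```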

Part of the reason I ask is that in testing, I have NOT lost data when doing a controlled failover with unconsumed changes in the slot that I need to replay on the promoted standby.  I am trying to figure out whether I've gotten lucky or whether this method is actually reliable.  The method is to rename the WAL files to bump the timeline: these files are identical to the ones that were replayed on the master, and thus ought to yield the same logical decoding output to be consumed.


Thank you!
Jeremy
