Thread: [HACKERS] Replication origins and timelines

[HACKERS] Replication origins and timelines

From
Craig Ringer
Date:
Hi all

TL;DR: replication origins track LSN without timeline. This is
ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
can now represent more than one state due to timeline forks with
promotions. Replication origins should track timelines so we can tell
the difference. I propose to patch them accordingly for pg11.

---------

When replication origins were introduced, they deliberately left out
tracking of the upstream node's timeline. Logical decoding couldn't
follow a timeline switch anyway, and replicas (still) have no facility
for logical decoding so everything completely breaks on promotion of a
physical replica.

I'm working on fixing that so that logical decoding and logical
replication integrate properly with physical replication and
failover. But when that works we'll face the same problem in logical
rep that timelines were introduced to solve for physical rep.

To prevent undetected misreplication we'll need to keep track of the
timeline of the last-replicated LSN in our downstream replication
origin. So I propose to add a timeline field to replication origins
for pg11.

Why?

Take master A, its physical replica B, and logical decoding client C
streaming changes from A. B is lagging. A is at lsn 1/1000, B is only
at 1/500. C has replicated from A up to 1/1000, when A fails. We
promote B to replace A. Now C connects to B, and requests to resume at
LSN 1/1000.

If B has since done enough work for its insert position to pass
1/1000, C will completely skip whatever B did between 1/500 and
1/1000, thinking (incorrectly) that it already replayed it. And it
will have *extra data* from A from the 1/500 to 1/1000 range that B
lost. It'll pick up from B's 1/1000 and try to apply that on top of
A's 1/1000 state, potentially leading to a mangled mess.

In physical rep this would lead to serious data corruption and
crashes. In logical rep it'll most likely lead to conflicts, apply
errors, inconsistent data, broken FKs, etc. It could be drastic, or
quite subtle, depending on app and workload.

But we really should still detect it. To do that, we need to remember
that our last replay position was (1/1000, 1). And when we request to
start replay from 1/1000 at timeline 1 on B, it'll ERROR, telling us
that its timeline 1 ends at 1/500.

We could still *choose* to continue as if all was well, but by default
we'll detect the error.

But we can't do that unless replication origins on the downstream can
track the timeline.
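
A minimal sketch of that check in Python, using the A/B/C example above. It is an illustration only, not code from any patch: `OriginProgress`, `TimelineMismatch`, `check_start_point` and the `timeline_ends` mapping are hypothetical names, and the upstream's timeline history is assumed to be available as a simple mapping.

```python
from dataclasses import dataclass
from typing import Dict


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'hi/lo' hexadecimal LSN notation to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass(frozen=True)
class OriginProgress:
    """Hypothetical origin state: last replayed LSN plus its timeline."""
    lsn: int
    timeline: int


class TimelineMismatch(Exception):
    """Raised when the requested start point is past the end of its timeline."""


def check_start_point(progress: OriginProgress, timeline_ends: Dict[int, int]) -> None:
    """Validate a resume request against the upstream's timeline history.

    timeline_ends maps each ended timeline ID to the LSN at which the
    upstream left it. The upstream's current timeline has no entry.
    """
    end = timeline_ends.get(progress.timeline)
    if end is not None and progress.lsn > end:
        raise TimelineMismatch(
            f"timeline {progress.timeline} ends at {end:#x}, "
            f"but replay was requested from {progress.lsn:#x}"
        )


# B was promoted at 1/500, so from B's point of view timeline 1 ends there.
b_timeline_ends = {1: parse_lsn("1/500")}

# C replayed up to 1/1000 on A's timeline 1, which B never saw.
c_progress = OriginProgress(lsn=parse_lsn("1/1000"), timeline=1)

try:
    check_start_point(c_progress, b_timeline_ends)
except TimelineMismatch as exc:
    print("ERROR:", exc)
```

With only the LSN stored, the same request would be indistinguishable from a valid resume point on B's new timeline.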



-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Replication origins and timelines

From
Andres Freund
Date:
Hi,

On 2017-06-01 09:12:04 +0800, Craig Ringer wrote:
> TL;DR: replication origins track LSN without timeline. This is
> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> can now represent more than one state due to timeline forks with
> promotions. Replication origins should track timelines so we can tell
> the difference, I propose to patch them accordingly for pg11.

I'm not quite convinced that this should be tracked at the origin level.
If you fail over physically, shouldn't we also reconfigure logical
replication?

Even if we decide this is necessary, I *strongly* suggest trying to get
the existing standby decoding etc wrapped up before starting something
nontrivial afresh.


> Why?
> 
> Take master A, its physical replica B, and logical decoding client C
> streaming changes from A. B is lagging. A is at lsn 1/1000, B is only
> at 1/500. C has replicated from A up to 1/1000, when A fails. We
> promote B to replace A. Now C connects to B, and requests to resume at
> LSN 1/1000.

Wouldn't it be better to solve this by querying the new master's
timeline history, and checking whether the current replay point is
pre/post fork?
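
A rough client-side sketch of such a check, in Python. It assumes the new master's history has already been fetched and reduced to a list of `(timeline_id, end_lsn)` pairs; `timeline_of_lsn` and `replay_point_is_safe` are hypothetical helper names. The comparison still needs the client to remember which timeline it was replaying, passed here as `client_tli`.

```python
from typing import List, Tuple


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'hi/lo' hexadecimal LSN notation to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def timeline_of_lsn(lsn: int, ended: List[Tuple[int, int]], current_tli: int) -> int:
    """Return the timeline that `lsn` belongs to in the given history.

    `ended` holds (timeline_id, end_lsn) pairs in ascending order. A replay
    position at or before a timeline's end is attributed to that timeline,
    because everything replayed up to that point is shared history.
    """
    for tli, end_lsn in ended:
        if lsn <= end_lsn:
            return tli
    return current_tli


def replay_point_is_safe(lsn: int, client_tli: int,
                         ended: List[Tuple[int, int]], current_tli: int) -> bool:
    """True if the client's replay point is on a timeline the server agrees with."""
    return timeline_of_lsn(lsn, ended, current_tli) == client_tli


# B forked off timeline 1 at 1/500 and is now on timeline 2.
b_ended = [(1, parse_lsn("1/500"))]

# C replayed from A (timeline 1) up to 1/1000: post-fork, so not safe.
print(replay_point_is_safe(parse_lsn("1/1000"), 1, b_ended, current_tli=2))

# A client that had only reached 1/400 is pre-fork and can resume.
print(replay_point_is_safe(parse_lsn("1/400"), 1, b_ended, current_tli=2))
```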

I'm more than a bit doubtful that adding more overhead to every relevant
record is worth it here.

- Andres



Re: [HACKERS] Replication origins and timelines

From
Stephen Frost
Date:
Craig,

* Craig Ringer (craig@2ndquadrant.com) wrote:
> TL;DR: replication origins track LSN without timeline. This is
> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> can now represent more than one state due to timeline forks with
> promotions. Replication origins should track timelines so we can tell
> the difference, I propose to patch them accordingly for pg11.

Uh, TL;DR, wow?  Why isn't this something which needs to be addressed
before PG10 can be released?  I hope I'm missing something that makes
the current approach work in PG10, or that there's some reason that this
isn't a big deal for PG10, but I'd like a bit of info as to why that's
the case, if it is.

The further comments in your email seem to state that logical
replication will just fail if a replica is promoted.  While not ideal,
that might barely reach the point of it being releasable, but turns it
into a feature that I'd have a really hard time recommending to anyone,
and are we absolutely sure that there aren't any cases where there might
be an issue of undetected promotion, leading to the complications which
you describe?

Thanks!

Stephen

Re: [HACKERS] Replication origins and timelines

From
Craig Ringer
Date:
On 1 June 2017 at 09:27, Stephen Frost <sfrost@snowman.net> wrote:
> Craig,
>
> * Craig Ringer (craig@2ndquadrant.com) wrote:
>> TL;DR: replication origins track LSN without timeline. This is
>> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
>> can now represent more than one state due to timeline forks with
>> promotions. Replication origins should track timelines so we can tell
>> the difference, I propose to patch them accordingly for pg11.
>
> Uh, TL;DR, wow?  Why isn't this something which needs to be addressed
> before PG10 can be released?  I hope I'm missing something that makes
> the current approach work in PG10, or that there's some reason that this
> isn't a big deal for PG10, but I'd like a bit of info as to why that's
> the case, if it is.

In Pg 10, if you promote a physical replica then logical replication
falls apart entirely and stops working. So there's no corruption
hazard because it just ... stops.

This only starts becoming an issue once logical replication slots can
exist on replicas and be maintained to follow the master's slot state.
Which is incomplete in Pg10 (not exposed to users) but I plan to
finish getting in for pg11, making this a possible issue to be
addressed.




Re: [HACKERS] Replication origins and timelines

From
Stephen Frost
Date:
Craig,

* Craig Ringer (craig@2ndquadrant.com) wrote:
> On 1 June 2017 at 09:27, Stephen Frost <sfrost@snowman.net> wrote:
> > * Craig Ringer (craig@2ndquadrant.com) wrote:
> >> TL;DR: replication origins track LSN without timeline. This is
> >> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> >> can now represent more than one state due to timeline forks with
> >> promotions. Replication origins should track timelines so we can tell
> >> the difference, I propose to patch them accordingly for pg11.
> >
> > Uh, TL;DR, wow?  Why isn't this something which needs to be addressed
> > before PG10 can be released?  I hope I'm missing something that makes
> > the current approach work in PG10, or that there's some reason that this
> > isn't a big deal for PG10, but I'd like a bit of info as to why that's
> > the case, if it is.
>
> In Pg 10, if you promote a physical replica then logical replication
> falls apart entirely and stops working. So there's no corruption
> hazard because it just ... stops.

I see.

> This only starts becoming an issue once logical replication slots can
> exist on replicas and be maintained to follow the master's slot state.
> Which is incomplete in Pg10 (not exposed to users) but I plan to
> finish getting in for pg11, making this a possible issue to be
> addressed.

Fair enough.  I'm disappointed that we ended up with that as the
solution for PG10, but so be it, the main thing is that we avoid any
corruption risk.

Thanks!

Stephen

Re: [HACKERS] Replication origins and timelines

From
Andres Freund
Date:
On 2017-05-31 21:27:56 -0400, Stephen Frost wrote:
> Craig,
> 
> * Craig Ringer (craig@2ndquadrant.com) wrote:
> > TL;DR: replication origins track LSN without timeline. This is
> > ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> > can now represent more than one state due to timeline forks with
> > promotions. Replication origins should track timelines so we can tell
> > the difference, I propose to patch them accordingly for pg11.
> 
> Uh, TL;DR, wow?  Why isn't this something which needs to be addressed
> before PG10 can be released?

Huh?  Slots aren't present on replicas, ergo there's no way for the whole
issue to occur in v10.


> The further comments in your email seem to state that logical
> replication will just fail if a replica is promoted.  While not ideal,
> that might barely reach the point of it being releasable, but turns it
> into a feature that I'd have a really hard time recommending to
> anyone,

Meh^10


> and are we absolutely sure that there aren't any cases where there might
> be an issue of undetected promotion, leading to the complications which
> you describe?

Yes, unless you manipulate things by hand, copying files around or such.


Greetings,

Andres Freund



Re: [HACKERS] Replication origins and timelines

From
Andres Freund
Date:
On 2017-05-31 21:33:26 -0400, Stephen Frost wrote:
> > This only starts becoming an issue once logical replication slots can
> > exist on replicas and be maintained to follow the master's slot state.
> > Which is incomplete in Pg10 (not exposed to users) but I plan to
> > finish getting in for pg11, making this a possible issue to be
> > addressed.
> 
> Fair enough.  I'm disappointed that we ended up with that as the
> solution for PG10

This has widely been debated, and it's not exactly new that development
happens incrementally, so I don't have particularly much sympathy for
that POV.



Re: [HACKERS] Replication origins and timelines

From
Stephen Frost
Date:
Andres,

* Andres Freund (andres@anarazel.de) wrote:
> On 2017-05-31 21:27:56 -0400, Stephen Frost wrote:
> > Uh, TL;DR, wow?  Why isn't this something which needs to be addressed
> > before PG10 can be released?
>
> Huh?  Slots aren't present on replicas, ergo there's no way for the whole
> issue to occur in v10.

Ohhhh, ok, now that makes more sense, not sure how I missed that.

Thanks!

Stephen

Re: [HACKERS] Replication origins and timelines

From
Craig Ringer
Date:
On 1 June 2017 at 09:23, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-06-01 09:12:04 +0800, Craig Ringer wrote:
>> TL;DR: replication origins track LSN without timeline. This is
>> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
>> can now represent more than one state due to timeline forks with
>> promotions. Replication origins should track timelines so we can tell
>> the difference, I propose to patch them accordingly for pg11.
>
> I'm not quite convinced that this should be tracked at the origin level.
> If you fail over physically, shouldn't we also reconfigure logical
> replication?
>
> Even if we decide this is necessary, I *strongly* suggest trying to get
> the existing standby decoding etc wrapped up before starting something
> nontrivial afresh.

Yeah, I'm not thinking of leaping straight to a patch before we've got
the rep on standby stuff nailed down. I just wanted to raise early
discussion to make sure it's not entirely the wrong path and/or
totally hopeless for core.

Logical decoding output plugins would need to keep track of the
timeline and send an extra message informing the downstream of a
timeline change whenever they see a new timeline, or include it in all
messages (see: extra overhead). That's needed because we don't stop a
decoding session and force re-connection when we hit a timeline
boundary. (Nor can we, since at some point our restart_lsn will be on
the old timeline but the first commits will be on the new timeline.)
I'll need to think about if/how the decoding plugin can reliably do
that.
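
To illustrate the "extra message on timeline change" option, here is a toy Python model. It is not the output plugin API: the `(timeline, lsn, payload)` tuples and the `with_timeline_markers` name are assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional, Tuple

Change = Tuple[int, int, str]  # (timeline, lsn, payload), a hypothetical shape


def with_timeline_markers(changes: Iterable[Change]) -> Iterator[tuple]:
    """Yield the change stream, inserting a marker whenever the timeline changes.

    The downstream only pays for one extra message per timeline switch,
    instead of carrying the timeline in every message.
    """
    current: Optional[int] = None
    for timeline, lsn, payload in changes:
        if timeline != current:
            # First message of the session, or the first change decoded
            # from a new timeline: tell the downstream before sending it.
            yield ("timeline", timeline, lsn)
            current = timeline
        yield ("change", lsn, payload)


decoded = [
    (1, 0x100, "INSERT a"),
    (1, 0x200, "INSERT b"),
    (2, 0x300, "INSERT c"),  # decoded after following a timeline switch
]

for message in with_timeline_markers(decoded):
    print(message)
```

The downstream can then record the most recent marker alongside its last-applied LSN.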

>> Take master A, its physical replica B, and logical decoding client C
>> streaming changes from A. B is lagging. A is at lsn 1/1000, B is only
>> at 1/500. C has replicated from A up to 1/1000, when A fails. We
>> promote B to replace A. Now C connects to B, and requests to resume at
>> LSN 1/1000.
>
> Wouldn't it be better to solve this by querying the new master's
> timeline history, and checking whether the current replay point is
> pre/post fork?

That could work.

The decoding client would need to track the last-commit timeline in
its own metadata if we're not letting it put it in the replication
origin. Manageable, if awkward.

Clients would need to know how to fetch and parse timeline history
files, which is an irritating thing for every decoding client that
wants to support failover to have to support. But I guess it's
manageable, if not friendly. And non-Pg-based downstreams would have
to do it anyway.
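
For a sense of the parsing burden, a small Python sketch follows. It assumes the history file layout of one line per ended timeline (parent timeline ID, switch point LSN, free-text reason, whitespace-separated), with blank lines and `#` comments ignored. The sample contents and the `parse_timeline_history` name are made up for illustration, and how the client obtains the file (for example via the replication protocol's `TIMELINE_HISTORY` command) is left out.

```python
from typing import List, Tuple


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'hi/lo' hexadecimal LSN notation to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_timeline_history(text: str) -> List[Tuple[int, int]]:
    """Parse history file contents into (timeline_id, switch_lsn) pairs."""
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        fields = stripped.split(None, 2)  # the reason text may contain spaces
        if len(fields) < 2:
            raise ValueError(f"malformed timeline history line: {line!r}")
        entries.append((int(fields[0]), parse_lsn(fields[1])))
    return entries


# Hypothetical file contents, not taken from a real server.
SAMPLE_HISTORY = (
    "# example history for timeline 3\n"
    "1\t1/500\tno recovery target specified\n"
    "\n"
    "2\t2/A000\tno recovery target specified\n"
)

for tli, switch_lsn in parse_timeline_history(SAMPLE_HISTORY):
    print(f"timeline {tli} ended at {switch_lsn:#x}")
```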




Re: [HACKERS] Replication origins and timelines

From
Stephen Frost
Date:
Andres,

* Andres Freund (andres@anarazel.de) wrote:
> On 2017-05-31 21:33:26 -0400, Stephen Frost wrote:
> > > This only starts becoming an issue once logical replication slots can
> > > exist on replicas and be maintained to follow the master's slot state.
> > > Which is incomplete in Pg10 (not exposed to users) but I plan to
> > > finish getting in for pg11, making this a possible issue to be
> > > addressed.
> >
> > Fair enough.  I'm disappointed that we ended up with that as the
> > solution for PG10
>
> This has widely been debated, and it's not exactly new that development
> happens incrementally, so I don't have particularly much sympathy for
> that POV.

I do understand that, of course, but hadn't quite realized yet that
we're talking only about replication slots on replicas.  Apologies for
the noise.

Thanks!

Stephen

Re: [HACKERS] Replication origins and timelines

From
Craig Ringer
Date:
On 1 June 2017 at 09:36, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-31 21:33:26 -0400, Stephen Frost wrote:
>> > This only starts becoming an issue once logical replication slots can
>> > exist on replicas and be maintained to follow the master's slot state.
>> > Which is incomplete in Pg10 (not exposed to users) but I plan to
>> > finish getting in for pg11, making this a possible issue to be
>> > addressed.
>>
>> Fair enough.  I'm disappointed that we ended up with that as the
>> solution for PG10
>
> This has widely been debated, and it's not exactly new that development
> happens incrementally, so I don't have particularly much sympathy for
> that POV.

Yeah. Even if we'd completed support for decoding on standby,
there's no way we could've implemented the required gymnastics for
getting logical replication to actually use it to support physical
failover for pg10, so it was always going to land in pg11.

This is very much a "how do we do it right when we do do it" topic,
not a pg10 issue.




Re: [HACKERS] Replication origins and timelines

From
Craig Ringer
Date:
On 1 June 2017 at 09:23, Andres Freund <andres@anarazel.de> wrote:

> Even if we decide this is necessary, I *strongly* suggest trying to get
> the existing standby decoding etc wrapped up before starting something
> nontrivial afresh.

Speaking of such, I had a thought about how to sync logical slot state
to physical replicas without requiring all logical downstreams to know
about and be able to connect to all physical replicas. Interested in
your initial reaction. Basically, enlist the walreceiver's help.

Extend the walsender so in physical rep mode it periodically writes
the upstream's logical slot state into the stream as a message with
special lsn 0/0. Then the walreceiver uses that to make decoding calls
on the downstream to advance the downstream logical slots to the new
confirmed_flush_lsn, or hands the info off to a helper proc that does
it. It could set up a decoding context and do it via a proper decoding
session, discarding output, and later we could probably optimise that
decoding session to do even less work than ReorderBufferSkip()ing
xacts.
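
The receiving side of that idea could be modelled roughly as below. This is a toy Python sketch under heavy assumptions: slot state is reduced to a name and a confirmed_flush_lsn, the periodic message is a plain dict, and "advancing" is modelled as an assignment where the real thing would run a decoding session up to the target. `apply_slot_state` is a hypothetical name.

```python
from typing import Dict


def apply_slot_state(local_slots: Dict[str, int],
                     upstream_state: Dict[str, int]) -> Dict[str, int]:
    """Advance local logical slots toward the upstream's reported positions.

    Both arguments map slot name to confirmed_flush_lsn. Slots are only
    ever moved forward, and slots unknown on the standby are ignored
    (creating them is out of scope here). Returns the slots advanced.
    """
    advanced = {}
    for name, upstream_lsn in upstream_state.items():
        local_lsn = local_slots.get(name)
        if local_lsn is None:
            continue
        if upstream_lsn > local_lsn:
            # Stand-in for decoding up to upstream_lsn and discarding output.
            local_slots[name] = upstream_lsn
            advanced[name] = upstream_lsn
    return advanced


standby_slots = {"sub1": 0x100, "sub2": 0x300}
message_from_upstream = {"sub1": 0x200, "sub2": 0x250, "sub3": 0x400}

print(apply_slot_state(standby_slots, message_from_upstream))
print(standby_slots)
```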

The alternative at this point since we nixed writing logical slot
state to WAL seems to be a bgworker on the upstream that periodically
writes logical slot state into generic WAL messages in a special
table, then another on the downstream that processes the table and
makes appropriate decoding calls to advance the downstream slot state.
(Safely, not via directly setting catalog_xmin etc). Which is pretty
damn ugly, but has the advantage that it'd work for PITR restores,
unlike the walsender/walreceiver based approach. Failover slots in
extension-space, basically.

I'm really, really not sold on all logical downstreams having to know
about and be able to connect to all physical standbys of the upstream
to maintain slots on them. Some kind of solution that runs entirely on
the standby will be needed. It's more a question of whether it's
something built-in, easy, and nice, or some out of tree extension.
