Thread: [HACKERS] Replication origins and timelines

Hi all

TL;DR: replication origins track LSN without timeline. This is
ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
can now represent more than one state due to timeline forks with
promotions. Replication origins should track timelines so we can tell
the difference, I propose to patch them accordingly for pg11.

---------

When replication origins were introduced, they deliberately left out
tracking of the upstream node's timeline. Logical decoding couldn't follow
a timeline switch anyway, and replicas (still) have no facility for logical
decoding, so everything completely breaks on promotion of a physical
replica.

I'm working on fixing that so that logical decoding and logical replication
integrate properly with physical replication and failover. But when that
works we'll face the same problem in logical rep that timelines were
introduced to solve for physical rep. To prevent undetected misreplication
we'll need to keep track of the timeline of the last-replicated LSN in our
downstream replication origin.

So I propose to add a timeline field to replication origins for pg11.

Why?

Take master A, its physical replica B, and logical decoding client C
streaming changes from A. B is lagging. A is at lsn 1/1000, B is only
at 1/500. C has replicated from A up to 1/1000, when A fails. We
promote B to replace A. Now C connects to B, and requests to resume at
LSN 1/1000.

If B has since done enough work for its insert position to pass 1/1000, C
will completely skip whatever B did between 1/500 and 1/1000, thinking
(incorrectly) that it already replayed it. And it will have *extra data*
from A from the 1/500 to 1/1000 range that B lost. It'll pick up from B's
1/1000 and try to apply that on top of A's 1/1000 state, potentially
leading to a mangled mess.

In physical rep this would lead to serious data corruption and crashes. In
logical rep it'll most likely lead to conflicts, apply errors, inconsistent
data, broken FKs, etc. It could be drastic, or quite subtle, depending on
app and workload. But we really should still detect it.

To do that, we need to remember that our last replay position was
(1/1000, 1). And when we request to start replay from 1/1000 at timeline 1
on B, it'll ERROR, telling us that its timeline 1 ends at 1/500. We could
still *choose* to continue as if all was well, but by default we'll detect
the error.

But we can't do that unless replication origins on the downstream can track
the timeline.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

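For concreteness, here's a toy standalone sketch of the kind of check that
tracking (LSN, timeline) in the origin would allow. Everything in it
(OriginState, check_resume_point, the simplified typedefs) is invented for
illustration; it isn't proposed origin.c code, just the shape of the safety
check.

/*
 * Toy illustration only: OriginState and check_resume_point are invented
 * names, and the typedefs are simplified stand-ins for the real ones.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */
typedef uint32_t TimeLineID;    /* stand-in for the real typedef */

/* hypothetical origin state: last-replayed LSN plus the timeline it was on */
typedef struct OriginState
{
    XLogRecPtr  remote_lsn;
    TimeLineID  remote_tli;
} OriginState;

/*
 * Resuming is safe if the upstream is still on our timeline, or if our
 * replay position is at or before the point where our timeline ended
 * (forked) in the new upstream's history.
 */
static bool
check_resume_point(const OriginState *origin,
                   TimeLineID upstream_tli, XLogRecPtr fork_lsn)
{
    if (origin->remote_tli == upstream_tli)
        return true;
    if (origin->remote_lsn <= fork_lsn)
        return true;

    fprintf(stderr,
            "ERROR: requested start point %X/%X is on timeline %u, "
            "but that timeline ends at %X/%X on this server\n",
            (unsigned int) (origin->remote_lsn >> 32),
            (unsigned int) origin->remote_lsn,
            (unsigned int) origin->remote_tli,
            (unsigned int) (fork_lsn >> 32),
            (unsigned int) fork_lsn);
    return false;
}

int
main(void)
{
    /* C replayed A up to 1/1000 on timeline 1 ... */
    OriginState origin = {((uint64_t) 0x1 << 32) | 0x1000, 1};

    /* ... but B forked onto timeline 2 at 1/500, so this must complain */
    check_resume_point(&origin, 2, ((uint64_t) 0x1 << 32) | 0x500);
    return 0;
}

Run against the scenario above, it reports that timeline 1 ends at 1/500 on
the new upstream, which is exactly the error we want the downstream to hit
by default instead of silently resuming at 1/1000.
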
Hi,

On 2017-06-01 09:12:04 +0800, Craig Ringer wrote:
> TL;DR: replication origins track LSN without timeline. This is
> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> can now represent more than one state due to timeline forks with
> promotions. Replication origins should track timelines so we can tell
> the difference, I propose to patch them accordingly for pg11.

I'm not quite convinced that this should be tracked at the origin level.
If you fail over physically, shouldn't we also reconfigure logical
replication?

Even if we decide this is necessary, I *strongly* suggest trying to get
the existing standby decoding etc. wrapped up before starting something
nontrivial afresh.

> Why?
>
> Take master A, its physical replica B, and logical decoding client C
> streaming changes from A. B is lagging. A is at lsn 1/1000, B is only
> at 1/500. C has replicated from A up to 1/1000, when A fails. We
> promote B to replace A. Now C connects to B, and requests to resume at
> LSN 1/1000.

Wouldn't it be better to solve this by querying the new master's
timeline history, and checking whether the current replay point is
pre/post fork? I'm more than a bit doubtful that adding more overhead
to every relevant record is worth it here.

- Andres

Craig,

* Craig Ringer (craig@2ndquadrant.com) wrote:
> TL;DR: replication origins track LSN without timeline. This is
> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> can now represent more than one state due to timeline forks with
> promotions. Replication origins should track timelines so we can tell
> the difference, I propose to patch them accordingly for pg11.

Uh, TL;DR, wow? Why isn't this something which needs to be addressed
before PG10 can be released? I hope I'm missing something that makes
the current approach work in PG10, or that there's some reason that this
isn't a big deal for PG10, but I'd like a bit of info as to why that's
the case, if it is.

The further comments in your email seem to state that logical
replication will just fail if a replica is promoted. While not ideal,
that might barely reach the point of it being releasable, but turns it
into a feature that I'd have a really hard time recommending to
anyone, and are we absolutely sure that there aren't any cases where
there might be an issue of undetected promotion, leading to the
complications which you describe?

Thanks!

Stephen

On 1 June 2017 at 09:27, Stephen Frost <sfrost@snowman.net> wrote:
> Craig,
>
> * Craig Ringer (craig@2ndquadrant.com) wrote:
>> TL;DR: replication origins track LSN without timeline. This is
>> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
>> can now represent more than one state due to timeline forks with
>> promotions. Replication origins should track timelines so we can tell
>> the difference, I propose to patch them accordingly for pg11.
>
> Uh, TL;DR, wow? Why isn't this something which needs to be addressed
> before PG10 can be released? I hope I'm missing something that makes
> the current approach work in PG10, or that there's some reason that this
> isn't a big deal for PG10, but I'd like a bit of info as to why that's
> the case, if it is.

In Pg 10, if you promote a physical replica then logical replication
falls apart entirely and stops working. So there's no corruption
hazard because it just ... stops.

This only starts becoming an issue once logical replication slots can
exist on replicas and be maintained to follow the master's slot state.
Which is incomplete in Pg10 (not exposed to users), but I plan to
finish getting it in for pg11, making this a possible issue to be
addressed.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Craig,

* Craig Ringer (craig@2ndquadrant.com) wrote:
> On 1 June 2017 at 09:27, Stephen Frost <sfrost@snowman.net> wrote:
> > * Craig Ringer (craig@2ndquadrant.com) wrote:
> >> TL;DR: replication origins track LSN without timeline. This is
> >> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> >> can now represent more than one state due to timeline forks with
> >> promotions. Replication origins should track timelines so we can tell
> >> the difference, I propose to patch them accordingly for pg11.
> >
> > Uh, TL;DR, wow? Why isn't this something which needs to be addressed
> > before PG10 can be released? I hope I'm missing something that makes
> > the current approach work in PG10, or that there's some reason that this
> > isn't a big deal for PG10, but I'd like a bit of info as to why that's
> > the case, if it is.
>
> In Pg 10, if you promote a physical replica then logical replication
> falls apart entirely and stops working. So there's no corruption
> hazard because it just ... stops.

I see.

> This only starts becoming an issue once logical replication slots can
> exist on replicas and be maintained to follow the master's slot state.
> Which is incomplete in Pg10 (not exposed to users), but I plan to
> finish getting it in for pg11, making this a possible issue to be
> addressed.

Fair enough. I'm disappointed that we ended up with that as the
solution for PG10, but so be it; the main thing is that we avoid any
corruption risk.

Thanks!

Stephen

On 2017-05-31 21:27:56 -0400, Stephen Frost wrote:
> Craig,
>
> * Craig Ringer (craig@2ndquadrant.com) wrote:
> > TL;DR: replication origins track LSN without timeline. This is
> > ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
> > can now represent more than one state due to timeline forks with
> > promotions. Replication origins should track timelines so we can tell
> > the difference, I propose to patch them accordingly for pg11.
>
> Uh, TL;DR, wow? Why isn't this something which needs to be addressed
> before PG10 can be released?

Huh? Slots aren't present on replicas, ergo there's no way for the whole
issue to occur in v10.

> The further comments in your email seem to state that logical
> replication will just fail if a replica is promoted. While not ideal,
> that might barely reach the point of it being releasable, but turns it
> into a feature that I'd have a really hard time recommending to
> anyone,

Meh^10

> and are we absolutely sure that there aren't any cases where there might
> be an issue of undetected promotion, leading to the complications which
> you describe?

Yes, unless you manipulate things by hand, copying files around or such.

Greetings,

Andres Freund

On 2017-05-31 21:33:26 -0400, Stephen Frost wrote:
> > This only starts becoming an issue once logical replication slots can
> > exist on replicas and be maintained to follow the master's slot state.
> > Which is incomplete in Pg10 (not exposed to users), but I plan to
> > finish getting it in for pg11, making this a possible issue to be
> > addressed.
>
> Fair enough. I'm disappointed that we ended up with that as the
> solution for PG10

This has widely been debated, and it's not exactly new that development
happens incrementally, so I don't have particularly much sympathy for
that POV.

Andres,

* Andres Freund (andres@anarazel.de) wrote:
> On 2017-05-31 21:27:56 -0400, Stephen Frost wrote:
> > Uh, TL;DR, wow? Why isn't this something which needs to be addressed
> > before PG10 can be released?
>
> Huh? Slots aren't present on replicas, ergo there's no way for the whole
> issue to occur in v10.

Ohhhh, ok, now that makes more sense, not sure how I missed that.

Thanks!

Stephen

On 1 June 2017 at 09:23, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-06-01 09:12:04 +0800, Craig Ringer wrote:
>> TL;DR: replication origins track LSN without timeline. This is
>> ambiguous when physical failover is present since XXXXXXXX/XXXXXXXX
>> can now represent more than one state due to timeline forks with
>> promotions. Replication origins should track timelines so we can tell
>> the difference, I propose to patch them accordingly for pg11.
>
> I'm not quite convinced that this should be tracked at the origin level.
> If you fail over physically, shouldn't we also reconfigure logical
> replication?
>
> Even if we decide this is necessary, I *strongly* suggest trying to get
> the existing standby decoding etc. wrapped up before starting something
> nontrivial afresh.

Yeah, I'm not thinking of leaping straight to a patch before we've got
the rep on standby stuff nailed down. I just wanted to raise early
discussion to make sure it's not entirely the wrong path and/or totally
hopeless for core.

Logical decoding output plugins would need to keep track of the timeline
and send an extra message informing the downstream of a timeline change
whenever they see a new timeline. Or include it in all messages (see:
extra overhead). We need that because we don't stop a decoding session
and force re-connection when we hit a timeline boundary. (Nor can we,
since at some point our restart_lsn will be on the old timeline but the
first commits will be on the new timeline.) I'll need to think about
if/how the decoding plugin can reliably do that.

>> Take master A, its physical replica B, and logical decoding client C
>> streaming changes from A. B is lagging. A is at lsn 1/1000, B is only
>> at 1/500. C has replicated from A up to 1/1000, when A fails. We
>> promote B to replace A. Now C connects to B, and requests to resume at
>> LSN 1/1000.
>
> Wouldn't it be better to solve this by querying the new master's
> timeline history, and checking whether the current replay point is
> pre/post fork?

That could work. The decoding client would need to track the last-commit
timeline in its own metadata if we're not letting it put it in the
replication origin. Manageable, if awkward.

Clients would need to know how to fetch and parse timeline history
files, which is an irritating thing for every decoding client that wants
to support failover to have to support. But I guess it's manageable, if
not friendly. And non-Pg-based downstreams would have to do it anyway.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

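For anyone wondering what that actually asks of a client: a timeline
history file (which a client can fetch over the walsender protocol with
TIMELINE_HISTORY) is just lines of "tli <TAB> switchpoint <TAB> reason", so
the pre/post-fork check boils down to something like the standalone sketch
below. find_timeline_end() is a name made up for the example, not an
existing server or libpq API, and error handling is minimal.

/*
 * Standalone sketch: find where a given timeline ended according to a
 * timeline history file (e.g. 00000002.history, as returned by the
 * TIMELINE_HISTORY replication command).  Each non-comment line is
 * "tli <TAB> switchpoint <TAB> reason".  find_timeline_end() is an
 * invented name for this example, not an existing API.
 */
#include <stdio.h>
#include <stdint.h>

/*
 * Returns 1 and sets *end_lsn if tli is listed in the history file,
 * 0 if it isn't (e.g. it's the server's current, still-open timeline).
 */
static int
find_timeline_end(FILE *history, unsigned int tli, uint64_t *end_lsn)
{
    char        line[1024];

    while (fgets(line, sizeof(line), history) != NULL)
    {
        unsigned int file_tli, hi, lo;

        if (line[0] == '#' || line[0] == '\n')
            continue;           /* comment or blank line */
        if (sscanf(line, "%u\t%X/%X", &file_tli, &hi, &lo) != 3)
            continue;           /* not a history entry */
        if (file_tli == tli)
        {
            *end_lsn = ((uint64_t) hi << 32) | lo;
            return 1;
        }
    }
    return 0;
}

int
main(void)
{
    FILE       *f = fopen("00000002.history", "r");
    uint64_t    end_lsn;
    /* last position we replayed from the old master: timeline 1, LSN 1/1000 */
    unsigned int last_tli = 1;
    uint64_t    last_lsn = ((uint64_t) 0x1 << 32) | 0x1000;

    if (f == NULL)
    {
        perror("00000002.history");
        return 1;
    }

    if (find_timeline_end(f, last_tli, &end_lsn) && last_lsn > end_lsn)
        fprintf(stderr,
                "replay position %X/%X is past the end of timeline %u "
                "(%X/%X) on the new master: histories have diverged\n",
                (unsigned int) (last_lsn >> 32), (unsigned int) last_lsn,
                last_tli,
                (unsigned int) (end_lsn >> 32), (unsigned int) end_lsn);

    fclose(f);
    return 0;
}

So it's doable from any language; it's just one more moving part that every
failover-aware decoding client has to get right.
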
Andres,

* Andres Freund (andres@anarazel.de) wrote:
> On 2017-05-31 21:33:26 -0400, Stephen Frost wrote:
> > > This only starts becoming an issue once logical replication slots can
> > > exist on replicas and be maintained to follow the master's slot state.
> > > Which is incomplete in Pg10 (not exposed to users), but I plan to
> > > finish getting it in for pg11, making this a possible issue to be
> > > addressed.
> >
> > Fair enough. I'm disappointed that we ended up with that as the
> > solution for PG10
>
> This has widely been debated, and it's not exactly new that development
> happens incrementally, so I don't have particularly much sympathy for
> that POV.

I do understand that, of course, but hadn't quite realized yet that
we're talking only about replication slots on replicas. Apologies for
the noise.

Thanks!

Stephen

On 1 June 2017 at 09:36, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-31 21:33:26 -0400, Stephen Frost wrote:
>> > This only starts becoming an issue once logical replication slots can
>> > exist on replicas and be maintained to follow the master's slot state.
>> > Which is incomplete in Pg10 (not exposed to users), but I plan to
>> > finish getting it in for pg11, making this a possible issue to be
>> > addressed.
>>
>> Fair enough. I'm disappointed that we ended up with that as the
>> solution for PG10
>
> This has widely been debated, and it's not exactly new that development
> happens incrementally, so I don't have particularly much sympathy for
> that POV.

Yeah. Even if we'd had complete support for decoding on standby, there's
no way we could've implemented the required gymnastics for getting
logical replication to actually use it to support physical failover for
pg10, so it was always going to land in pg11.

This is very much a "how do we do it right when we do do it" topic, not
a pg10 issue.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

On 1 June 2017 at 09:23, Andres Freund <andres@anarazel.de> wrote:
> Even if we decide this is necessary, I *strongly* suggest trying to get
> the existing standby decoding etc. wrapped up before starting something
> nontrivial afresh.

Speaking of such, I had a thought about how to sync logical slot state
to physical replicas without requiring all logical downstreams to know
about and be able to connect to all physical replicas. Interested in
your initial reaction.

Basically, enlist the walreceiver's help. Extend the walsender so that
in physical rep mode it periodically writes the upstream's logical slot
state into the stream as a message with special lsn 0/0. Then the
walreceiver uses that to make decoding calls on the downstream to
advance the downstream logical slots to the new confirmed_flush_lsn, or
hands the info off to a helper proc that does it. It could set up a
decoding context and do it via a proper decoding session, discarding
output, and later we could probably optimise that decoding session to do
even less work than ReorderBufferSkip()ing xacts.

The alternative at this point, since we nixed writing logical slot state
to WAL, seems to be a bgworker on the upstream that periodically writes
logical slot state into generic WAL messages in a special table, then
another on the downstream that processes the table and makes appropriate
decoding calls to advance the downstream slot state (safely, not via
directly setting catalog_xmin etc). Which is pretty damn ugly, but has
the advantage that it'd work for PITR restores, unlike the
walsender/walreceiver based approach. Failover slots in extension-space,
basically.

I'm really, really not sold on all logical downstreams having to know
about and be able to connect to all physical standbys of the upstream to
maintain slots on them. Some kind of solution that runs entirely on the
standby will be needed. It's more a question of whether it's something
built-in, easy, and nice, or some out of tree extension.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

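To make the walsender idea slightly less hand-wavy, here's roughly what
such an embedded slot-state message might carry. The 'S' message byte, the
field layout, and every name in this sketch are invented purely for
discussion; no such message exists in the replication protocol today.

/*
 * Purely hypothetical wire format for the "slot state in the physical
 * stream" idea above.  The 'S' message byte, the field layout and all
 * names are invented for discussion; no such message exists in the
 * replication protocol today.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define SLOT_NAME_LEN 64        /* mirrors NAMEDATALEN */

typedef struct SlotStateMsg
{
    char        slot_name[SLOT_NAME_LEN];
    uint64_t    confirmed_flush_lsn;
    uint64_t    restart_lsn;
    uint32_t    catalog_xmin;
} SlotStateMsg;

/*
 * Serialise one message: the type byte, then fixed-width big-endian
 * fields.  A real implementation inside the walsender would use
 * pq_sendint64() and friends instead.
 */
static size_t
slot_state_pack(const SlotStateMsg *msg, unsigned char *buf)
{
    size_t      off = 0;

    buf[off++] = 'S';           /* invented message type byte */
    memcpy(buf + off, msg->slot_name, SLOT_NAME_LEN);
    off += SLOT_NAME_LEN;
    for (int i = 7; i >= 0; i--)
        buf[off++] = (unsigned char) (msg->confirmed_flush_lsn >> (8 * i));
    for (int i = 7; i >= 0; i--)
        buf[off++] = (unsigned char) (msg->restart_lsn >> (8 * i));
    for (int i = 3; i >= 0; i--)
        buf[off++] = (unsigned char) (msg->catalog_xmin >> (8 * i));
    return off;
}

int
main(void)
{
    SlotStateMsg msg = {"downstream_c",
                        ((uint64_t) 0x1 << 32) | 0x1000,    /* confirmed_flush 1/1000 */
                        ((uint64_t) 0x1 << 32) | 0x500,     /* restart_lsn 1/500 */
                        742};                               /* catalog_xmin */
    unsigned char buf[1 + SLOT_NAME_LEN + 8 + 8 + 4];
    size_t      len = slot_state_pack(&msg, buf);

    printf("packed %zu-byte slot state message for \"%s\"\n",
           len, msg.slot_name);
    return 0;
}

The walreceiver (or a helper proc) would then use a real decoding session
on the standby to advance its copy of the slot up to the reported
confirmed_flush_lsn, rather than blindly copying the values, as described
above.
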