Re: pglogical - logical replication contrib module - Mailing list pgsql-hackers

From: Craig Ringer
Subject: Re: pglogical - logical replication contrib module
Msg-id: CAMsr+YGTL6Vrrkidn6EKVTVt0Abofn-q+0SDYzk6aHVM5DmPxw@mail.gmail.com
In response to: Re: pglogical - logical replication contrib module (Steve Singer <steve@ssinger.info>)
List: pgsql-hackers
On 23 January 2016 at 11:17, Steve Singer <steve@ssinger.info> wrote:

2) Does this patch provide a set of logical replication features that meet many popular use-cases

Below I will review some use-cases and try to assess how pglogical meets them.

** Streaming PostgreSQL Upgrade

pg_upgrade is great for many situations, but sometimes you don't want an in-place upgrade, you want a streaming upgrade: perhaps because you don't want application downtime and instead want to point your applications at the upgraded database server in a controlled manner. Other times you might want to upgrade to a newer version of PG while keeping the option of rolling back to the older version if things go badly.

I think pglogical should be able to handle this use case pretty well (assuming the source version of PG is actually new enough to include pglogical).

Yep, it's designed significantly for that case.  That's also why support for 9.4 and 9.5 is maintained as a standalone extension, so you can get data out of 9.4 and 9.5 easily (and for that matter, upgrade 9.4 to 9.5).
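For anyone wanting to try it, the rough flow with the SQL interface in the submitted patch looks like the sketch below. The node names, DSNs and schema list are placeholders, and it's meant as an illustration of the current interface rather than anything definitive:

-- On the old (provider) server:
SELECT pglogical.create_node(
    node_name := 'provider_old',
    dsn := 'host=oldhost dbname=mydb'
);
-- Put the tables you want to carry over into a replication set:
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the new (subscriber) server, after creating the schema
-- (e.g. from a pg_dump --schema-only of the old server):
SELECT pglogical.create_node(
    node_name := 'subscriber_new',
    dsn := 'host=newhost dbname=mydb'
);
SELECT pglogical.create_subscription(
    subscription_name := 'upgrade_sub',
    provider_dsn := 'host=oldhost dbname=mydb'
);

Once the subscriber has caught up you repoint the applications at it.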
 
Support for replicating sequences would need to be added before this is really smooth, but once sequence support is added I think this would work well.

This will unfortunately have to be 9.6 only. We can work around it with some limitations in a pglogical downstream in older versions, but I really want to get time to write a v2 of the sequence decoding patch so I can get that into 9.6.
 
** Query-only replicas (with temp tables or additional indexes)

Sometimes you want a replica for long-running or heavy queries. Requirements for temp tables, additional indexes, or maybe the effect on vacuum mean that our existing WAL-based replicas are unsuitable.

I think pglogical should be able to handle this use case pretty well, with the caveat that your replica is asynchronous and will always lag the origin by some amount.

You can actually run it as a synchronous replica too, with the usual limitations that you can have only one synchronous standby at a time, etc. Or should be able to - I haven't had a chance to write proper tests for sync rep using pglogical yet.
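If anyone wants to experiment with that, it's just the usual synchronous replication setup on the provider, something like the sketch below. The assumption here, and it is an assumption, is that the subscriber's walsender connection reports an application_name you can match ('my_subscription' is a placeholder):

-- On the provider (rough sketch, untested as noted above):
ALTER SYSTEM SET synchronous_standby_names = 'my_subscription';
SELECT pg_reload_conf();

Commits on the provider will then wait for the matching subscriber to confirm, with the usual single-sync-standby caveats.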

Performance will currently hurt if you do big xacts. That's why we need interleaved xact streaming support down the track.
 
Pglogical doesn't have any facilities to rename the tables between the origin and replica but they could be added later.

Yep, we could do that with a hook. You couldn't use initial schema sync if you did that, of course.
 
** Sharding

Systems that do application-level sharding (or even sharding with an FDW) often have non-sharded tables that need to be available on all shards for relational integrity or joins. Logical replication is one way to make sure that the replicated data gets to all the shards. Sharding systems also sometimes want to take the data from individual shards and replicate it to a consolidation server for reporting purposes.

Pglogical seems to meet this use case; I guess you would have a designated origin for the shared/global data that all shards would subscribe to, with a replication set containing the designated data. For the consolidation use case you would have the consolidation server subscribe to all shards.

I am less clear about how someone would want DDL changes to work for these cases.  The DDL support in the patch is pretty limited so I am not going to think much now about how we would want DDL to work.
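For illustration, the shared-data setup described above maps onto the SQL interface roughly as in the sketch below; the set, table and connection names are placeholders, and it's only a sketch of the current patch's interface:

-- On the designated origin for the shared data:
SELECT pglogical.create_replication_set('shared_data');
SELECT pglogical.replication_set_add_table('shared_data', 'public.countries');

-- On each shard, subscribe to just that set:
SELECT pglogical.create_subscription(
    subscription_name := 'shared_data_sub',
    provider_dsn := 'host=globalhost dbname=mydb',
    replication_sets := ARRAY['shared_data']
);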

DDL support is "version 2" material, basically. 

9.5 has hooks that allow DDL deparsing to be implemented as an extension. That extension needs to be finished off (there's some work-in-progress code floating around from 9.5 dev) and needs to expose an API for other extensions. Then pglogical can register hooks with the ddl deparse extension and use that for DDL replication.

As we learned with BDR, though, DDL replication is *hard*. 

For one thing, PostgreSQL has global objects like users that we can't currently capture DDL for, while db-local objects get created with dependencies on them. So you still have to replicate the global objects manually. I can see some possible solutions for this, but nothing's really on the horizon.

Additionally, there are some operations that are a bit problematic for logical replication. Full table rewrites are the main one: they clobber replication origin information, among other issues. We really need a way to decode

ALTER TABLE blah ADD COLUMN fred integer NOT NULL DEFAULT 42;

as

BEGIN;
ALTER TABLE blah ADD COLUMN fred integer;
ALTER TABLE blah ALTER COLUMN fred SET DEFAULT 42;
UPDATE blah SET fred = 42;
ALTER TABLE blah ALTER COLUMN fred SET NOT NULL;
COMMIT;

which involves some "interesting" co-operation between DDL deparse and logical replication. The mapping of the decoded full table rewrite to the underlying table is also a bit awkward; we just get a decode stream for a synthetic table named "pg_temp_xxxx", where the xxxx is the upstream table's oid. A nicer API for that would be good.
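(Since the xxxx is the oid of the table being rewritten, it can at least be resolved back to the real table on the upstream; 16435 below is just a placeholder oid:)

SELECT 16435::regclass;  -- resolves pg_temp_16435 back to the original table's name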

** Schema changes involving rewriting big tables

Sometimes you have a DDL change on a large table that will involve a table rewrite, and the best way of deploying the change is to make the DDL change on a replica and then, once it is finished, promote the replica to the origin in some controlled fashion. This avoids having to lock the table on the origin for hours.

pglogical seems to allow minor schema changes on the replica, such as changing a type, but it doesn't seem to allow a DO INSTEAD trigger on the replica. I don't think pglogical currently meets this use case particularly well.

I'm not sure I fully understand that one.
 
** Failover

WAL replication is probably a better choice for someone just looking for failover support from replication.

"Physical" replication as I've been trying to call it, since logical rep is also WAL based.
 
I agree with you. It very definitely is.

I have a roadmap in mind for logical-rep-based failover. We need sequence advance replication (or even better, sequence access methods), an upstream<->downstream LSN mapping, failover slots, and logical decoding of logical slots. A few bits and pieces.

Someone who is looking at pglogical for failover-related use cases probably has one or more of the other use cases I mentioned and wants a logical node to take over for a failed origin. If a node fails you can take some of the remaining subscribers and have them resubscribe to one of the remaining nodes, but there is no support for a) figuring out which of the remaining nodes is furthest ahead, or b) letting the subscribers work out which updates from the old origin they are missing and getting them from a surviving node (they can truncate and re-copy the data, but that might be very expensive).

Yep. Failover slots are part of that picture, and the logical decoding of slot positions + lsn map stuff carries on from it.
 
** Geographically distributed applications

Sometimes people have databases in different geographical locations and they want to perform queries and writes locally but replicate all the data to all the other locations. This is a multi-master, eventually consistent use case.

Yep. That's what BDR aims for, and why the plan in 2ndQ is to rebuild BDR around pglogical to continue the work of streaming BDR into core. You can think of pglogical and pglogical_output as _parts of BDR_ that have been extracted to submit into core; they've just been heavily polished up, made much more general-purpose, and had things that won't work in core yet removed.
 
Hopefully we'll have full MM on top in time, but that can't all be done in one release.

The lack of sequence support would be an issue for these use cases.

That's why we need sequence access methods. There's a patch for that in the 9.6 CF too.
 
I think you could also only configure the cluster as a fully connected grid (with forward_origins='none'). In a lot of deployments you would want some amount of cascading and structure, which isn't yet supported. I also suspect that managing a grid cluster with more than a handful of nodes will be unwieldy (particularly compared to some of the eventually consistent NoSQL alternatives).

I envision a management layer on top for that, where pglogical forms an underlying component.
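For reference, the per-node setup in such a grid boils down to one subscription per peer, roughly like the sketch below; the names and DSN are placeholders, and forward_origins := '{none}' is what keeps each provider from sending on changes that didn't originate on it, so changes don't loop around the mesh:

-- On node A, repeated once per peer:
SELECT pglogical.create_subscription(
    subscription_name := 'sub_a_from_b',
    provider_dsn := 'host=nodeB dbname=mydb',
    forward_origins := '{none}'
);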

The features BDR has that were removed for pglogical are probably really useful for this use-case (which I think was the original BDR use-case).

Yep. They were removed mainly because they can't work with core until some other patches get in too. Also just to keep the first pglogical submission vaguely practical and manageable.
 
** Scaling across multiple machines

Sometimes people ask for replication systems that let them support more load than a single database server can handle, but with consistency. The other use-case applies if you want 'eventually consistent'; this one is for situations where you want something other than eventual consistency.

I don't think pglogical is intended to address this.

Correct. That's more like postgres-XL, where you have a distributed lock manager, distributed transaction manager, etc.

pglogical (or the output plugin at least) can form part of such a solution, and there's an experiment being contemplated right now to use pglogical as the data replication transport in postgres-XL. But it doesn't attempt to provide a whole solution there, only one component.
 
Metadata is not transferred between nodes. What I mean by this is that nodes don't have a global view of the cluster; they know about their own subscriptions but nothing else. This is different from a system like Slony, where sl_node and sl_subscription contain a global view of your cluster state. Not sending metadata to all nodes in the cluster simplifies a bunch of things (you don't have to worry about sending metadata around, or whether a given piece of metadata is stale), but the downside is that I think the tooling to perform a lot of cluster reconfiguration operations will need to be a lot smarter.

Yep. We're going to need a management layer on top for building and monitoring non-trivial node graphs. Whether in core or not.

Petr and I found that trying to design a schema that could fit all use cases while preserving a system-wide view of the node graph was impractical, if not outright impossible. There are quite conflicting use cases: mesh multi-master wants to see everything, whereas if you have three upstreams feeding into a data aggregator that then replicates to other nodes you don't particularly want the leaf nodes worrying about the upstream origin servers.
 
Petr and Craig, have you thought about how you might support getting the cluster back into a sane state, with minimal pain, after a node fails?

Yes.

There are really two approaches. One is having a physical standby, where you fail over to a streaming physical replica and the state of your logical slots is preserved. For that we need failover slots (per the patch to 9.6).

The other is to use logical failover, where there's a logical replica that you can switch leaf nodes to point to. For that we need a way to record slot advances on one node, send them on the wire and interpret them usefully on another node. Hence the outlined support for logical decoding of logical slot create/drop/update, and an lsn map. I haven't thought as hard about this one yet.

There's another thing you need for multimaster/mesh systems where there's a graph, not a simple tree: the ability to lazily advance a slot, so that when a node fails you can find the peer that replayed the furthest in that node's history and ask it to send you the changes from the lost node. You have to be able to go back in time on the slot to the most recent point for which you have a local copy of the other node's state. It turns out that's not hard: you just delay advancing the slot. You also have to be able to replay it again with a filter that sends you only that node's changes, which is also not hard using replication origins. There are some hairy complexities when it comes to multi-master conflict resolution, though, where changes to data come from more than one node. That's a "for later" problem.
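To illustrate the "delay advancing the slot" part with the generic SQL-level decoding functions (the slot name is a placeholder, and a real output plugin such as pglogical_output would need its usual options passed too):

-- Read pending changes without moving the slot's confirmed position:
SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);

-- Consume, and so advance, the slot only once the data is safely stored downstream:
SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);

Over the walsender protocol the equivalent is just delaying the flush position you report in feedback.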
 
I am concerned about testing; I don't think the .sql-based regression tests are going to adequately test a replication system that supports concurrent activity on different databases/servers.

I agree that we can't rely only on that.

This is part of a bigger picture in Pg where we just don't test multi-node stuff. Failover, replication, etc. are ignored in the tests. The TAP-based stuff looks set to change that, and I suspect we'd have to investigate whether it's possible to build on top of it for more comprehensive testing.
 

Thanks again for the review work, I know it takes serious time and effort and I appreciate it.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
