Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date:
Msg-id: CA+Tgmoby-5VO7uXnAmnz02JproSiZ38tg5gbp73QVbzFKtEaKg@mail.gmail.com
In response to: Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture (3 replies)
List: pgsql-hackers
On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't plan to throw in loads of conflict resolution smarts. The aim is to get
> to the place where all the infrastructure is there so that a MM solution can
> be built by basically plugging in a conflict resolution mechanism. Maybe
> providing a very simple one.
> I think without in-core support its really, really hard to build a sensible MM
> implementation. Which doesn't mean it has to live entirely in core.

Of course, several people have already done it, perhaps most notably Bucardo.

Anyway, it would be good to get opinions from more people here. I am sure I am not the only person with an opinion on the appropriateness of trying to build a multi-master replication solution in core or, indeed, the only person with an opinion on any of these other issues. It is not good for those other opinions to be saved for a later date.

> Hm. Yes, you could do that. But I have to say I don't really see a point.
> Maybe the fact that I do envision multimaster systems at some point is
> clouding my judgement though as its far less easy in that case.

Why? I don't think that particularly changes anything.

> It also complicates the wal format as you now need to specify whether you
> transport a full or a primary-key only tuple...

Why? If the schemas are in sync, the target knows what the PK is perfectly well. If not, you're probably in trouble anyway.

> I think though that we do not want to enforce that mode of operation for
> tightly coupled instances. For those I was thinking of using command triggers
> to synchronize the catalogs.
> One of the big screwups of the current replication solutions is exactly that
> you cannot sensibly do DDL which is not a big problem if you have a huge
> system with loads of different databases and very knowledgeable people et al.
> but at the beginning it really sucks. I have no problem with making one of the
> nodes the "schema master" in that case.
> Also I would like to avoid the overhead of the proxy instance for use-cases
> where you really want one node replicated as fully as possible with the slight
> exception of being able to have summing tables, different indexes et al.

In my view, a logical replication solution is precisely one in which the catalogs don't need to be in sync. If the catalogs have to be in sync, it's not logical replication. ISTM that what you're talking about is sort of a hybrid between physical replication (pages) and logical replication (tuples) - you want to ship around raw binary tuple data, but not entire pages. The problem with that is that it's going to be tough to make robust. Users could easily end up with answers that are total nonsense, or probably even crash the server.

To step back and talk about DDL more generally, you've mentioned a few times the idea of using an SR instance that has been filtered down to just the system catalogs as a means of generating logical change records. However, as things stand today, there's no reason to suppose that replicating anything less than the entire cluster is sufficient. For example, you can't translate enum labels to strings without access to the pg_enum catalog, which would be there, because enums are built-in types. But someone could supply a similar user-defined type that uses a user-defined table to do those lookups, and now you've got a problem. I think this is a contractual problem, not a technical one.
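To make the enum example concrete: the stored datum is just an OID, and the label lives only in pg_enum, so the output function has nothing to work with unless that catalog row (as it stood when the WAL was written) is available. Roughly, and only as a condensed sketch of what the backend's enum_out() does rather than the exact source:

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_enum.h"
#include "utils/syscache.h"

/* Condensed sketch of enum label output; header list abbreviated. */
Datum
enum_out(PG_FUNCTION_ARGS)
{
    Oid          enumval = PG_GETARG_OID(0);
    HeapTuple    tup;
    Form_pg_enum en;
    char        *label;

    /* The label can only be found by a syscache lookup on pg_enum. */
    tup = SearchSysCache1(ENUMOID, ObjectIdGetDatum(enumval));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "invalid internal value for enum: %u", enumval);
    en = (Form_pg_enum) GETSTRUCT(tup);

    label = pstrdup(NameStr(en->enumlabel));
    ReleaseSysCache(tup);

    PG_RETURN_CSTRING(label);
}

A user-defined type that does the same kind of lookup against a regular user table is strictly worse: that table wouldn't even be present in a catalogs-only SR instance.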
From the point of view of logical replication, it would be nice if type output functions were basically guaranteed to look at nothing but the datum they get passed as an argument, or at the very least nothing other than the system catalogs, but there is no such guarantee. And, without such a guarantee, I don't believe that we can create a high-performance, robust, in-core replication solution.

Now, the nice thing about being the people who make PostgreSQL happen is that we get to decide what the C code that people load into the server is required to guarantee; we can change the rules: before, types were allowed to do X, but now they're not. Unfortunately, in this case, I don't really find that an acceptable solution. First, it might break code that has worked with PostgreSQL for many years; worse, it won't break in any obvious way, but only when you're using logical replication, which will doubtless cause people to attribute the failure to logical replication rather than to their own code. Even if they do understand that we imposed a rule change from on high, there's no really good workaround: an enum type is a good example of something that you *can't* implement without a side-table. Second, it flies in the face of our often-stated desire to make the server extensible.

Also, even given such a restriction, you still need to run any output function that relies on catalogs with catalog contents that match what existed at the time the WAL was generated, and under the correct snapshot, which is not trivial. These are problems even for other things we might need to do while examining the WAL stream, but they're particularly acute for any application that wants to run type-output functions to generate something that can be sent to a server which doesn't necessarily have matching catalog contents.

But it strikes me that these things, really, are only a problem for a minority of data types. For text, or int4, or float8, or even timestamptz, we don't need *any catalog contents at all* to reconstruct the tuple data. Knowing the correct type alignment and which C function to call is entirely sufficient. So maybe instead of trying to cobble together a set of catalog contents that we can use for decoding any tuple whatsoever, we should instead divide the world into well-behaved types and poorly-behaved types. Well-behaved types are those that can be interpreted without the catalogs, provided that you know what type it is. Poorly-behaved types (records, enums) are those where you can't.

For well-behaved types, we only need a small amount of additional information in WAL to identify which types we're trying to decode (not the type OID, which might fail in the presence of nasty catalog hacks, but something more universal, like a UUID that means "this is text", or something that identifies the C entrypoint). And then maybe we handle poorly-behaved types by pushing some of the work into the foreground task that's generating the WAL: in the worst case, the process logs a record before each insert/update/delete containing the text representation of any values that are going to be hard to decode. In some cases (e.g. records all of whose constituent fields are well-behaved types) we could instead log enough additional information about the type to permit blind decoding.
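To sketch what I mean (purely illustrative; none of these names, structures, or identifiers exist today), the decoder could carry a small catalog-independent registry for the well-behaved cases and punt on everything else, which would then be covered by whatever the foreground process logged:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

/*
 * Hypothetical sketch: identify well-behaved types in the WAL stream by a
 * stable, catalog-independent identifier (a string here; it could just as
 * well be a UUID or the C entry point), so tuple data can be turned back
 * into text without any catalog access.
 */
typedef struct WellBehavedType
{
    const char *stable_id;      /* catalog-independent identifier */
    int16       typlen;         /* fixed length, or -1 for varlena */
    char        typalign;       /* alignment needed to walk the tuple */
    PGFunction  output_fn;      /* needs nothing but the datum itself */
} WellBehavedType;

/* A few built-in types that need no catalog contents at all to decode. */
static const WellBehavedType well_behaved_types[] =
{
    {"core.int4", 4, 'i', int4out},
    {"core.float8", 8, 'd', float8out},
    {"core.text", -1, 'i', textout},
};

/*
 * Return a palloc'd text form of the value, or NULL if the type is not
 * well behaved and the caller must fall back on out-of-line information
 * (e.g. a text representation logged by the foreground process).
 */
static char *
decode_well_behaved(const char *stable_id, Datum value)
{
    int     i;

    for (i = 0; i < lengthof(well_behaved_types); i++)
    {
        if (strcmp(well_behaved_types[i].stable_id, stable_id) == 0)
            return DatumGetCString(
                DirectFunctionCall1(well_behaved_types[i].output_fn, value));
    }
    return NULL;        /* poorly-behaved: needs the pre-logged fallback */
}

The point of keying on something other than the type OID is that the registry stays valid even in the presence of catalog hacks or diverging OID assignments on the two ends, and the extra WAL overhead stays confined to the columns that actually need it.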
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company