Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date
Msg-id 201206142213.17276.andres@2ndquadrant.com
Whole thread Raw
In response to Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
List pgsql-hackers
Hi Robert,

Thanks for your answer.

On Thursday, June 14, 2012 06:17:26 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > === Design goals for logical replication === :
> > - in core
> > - fast
> > - async
> > - robust
> > - multi-master
> > - modular
> > - as unintrusive as possible implementation wise
> > - basis for other technologies (sharding, replication into other DBMSs,
> > ...)
> 
> I agree with all of these goals except for "multi-master".  I am not
> sure that there is a need to have a multi-master replication solution
> in core.  The big tricky part of multi-master replication is conflict
> resolution, and that seems like an awful lot of logic to try to build
> into core - especially given that we will want it to be extensible.
I don't plan to throw in loads of conflict resolution smarts. The aim is to get 
to the place where all the infrastructure is there so that a MM solution can 
be built by basically plugging in a conflict resolution mechanism. Maybe 
providing a very simple one.
I think without in-core support its really, really hard to build a sensible MM 
implementation. Which doesn't mean it has to live entirely in core.

Loads of the use-cases we have seen lately have a relatively small, low-
conflict shared dataset and a far bigger sharded one. While that obviously 
isn't the only relevant use case it is a senible important one.

> More generally, I would much rather see us focus on efficiently
> extracting changesets from WAL and efficiently applying those
> changesets on another server.  IMHO, those are the things that are
> holding back the not-in-core replication solutions we have today,
> particularly the first.  If we come up with a better way of applying
> change-sets, well, Slony can implement that too; they are already
> installing C code.  What neither they nor any other non-core solution
> can implement is change-set extraction, and therefore I think that
> ought to be the focus.
It definitely is a very important focus. I don't think it is the only one.  But 
that doesn't seem to be a problem to me as long as everything is kept fairly 
modular (which I tried rather hard to).

> To put all that another way, I think it is a 100% bad idea to try to
> kill off Slony, Bucardo, Londiste, or any of the many home-grown
> solutions that are out there to do replication.  Even if there were no
> technical place for third-party replication products (and I think
> there is), we will not win many friends by making it harder to extend
> and add on to the server.  If we build an in-core replication solution
> that can be used all by itself, that is fine with me.  But I think it
> should also be able to expose its constituent parts as a toolkit for
> third-party solutions.
I agree 100%. Unfortunately I forgot to explictly make that point, but the 
plan is definitely is to make the life of other replication solutions easier 
not harder. I don't think there will ever be one replication solution that fits 
every use-case perfectly.
At pgcon I talked with some of the slony guys and they were definetly 
interested in the changeset generation and I have kept that in mind. If some 
problems that need resolving indepently of that issue is resolved (namely DDL) 
it shouldn't take much generating their output format. The 'apply' code is 
fully abstracted and separted.

> > While you may argue that most of the above design goals are already
> > provided by various trigger based replication solutions like Londiste or
> > Slony, we think that thats not enough for various reasons:
> > 
> > - not in core (and thus less trustworthy)
> > - duplication of writes due to an additional log
> > - performance in general (check the end of the above presentation)
> > - complex to use because there is no native administration interface
> 
> I think that your parenthetical note "(and thus less trustworthy)"
> gets at another very important point, which is that one of the
> standards for inclusion in core is that it must in fact be trustworthy
> enough to justify the confidence that users will place in it.  It will
> NOT benefit the project to have two replication solutions in core, one
> of which is crappy.  More, even if what we put in core is AS GOOD as
> the best third-party solutions that are available, I don't think
> that's adequate.  It has to be better.  If it isn't, there is no
> excuse for preempting what's already out there.
I aggree that it has to be very good. *But* I think it is totally acceptable 
if it doesn't have all the bells and whistles from the start. That would be a 
sure road to disaster. For one implementing all that takes time and for 
another the amount of discussions till we are there is rather huge.

> I imagine you are thinking along similar lines, but I think that it
> bears being explicit about.
Seems like were thinking along the same lines, yes.

> > The biggest problem is, that interpreting tuples in the WAL stream
> > requires an up-to-date system catalog and needs to be done in a
> > compatible backend and architecture. The requirement of an up-to-date
> > catalog could be solved by adding more data to the WAL stream but it
> > seems to be likely that that would require relatively intrusive &
> > complex changes. Instead we chose to require a synchronized catalog at
> > the decoding site. That adds some complexity to use cases like
> > replicating into a different database or cross-version replication. For
> > those it is relatively straight-forward to develop a proxy pg instance
> > that only contains the catalog and does the transformation to textual
> > changes.
> The actual requirement here is more complex than "an up-to-date
> catalog".  Suppose transaction X begins, adds a column to a table,
> inserts a row, and commits.  That tuple needs to be interpreted using
> the tuple descriptor that transaction X would see (which includes the
> new column), NOT the tuple descriptor that some other transaction
> would see at the same time (which won't include the new column).  In a
> more complicated scenario, X might (1) begin, (2) start a
> subtransaction that alters the table, (3) release the savepoint or
> roll back to the save point, (4) insert a tuple, and (5) commit.  Now,
> the correct tuple descriptor for interpreting the tuple inserted in
> step (4) depends on whether step (3) was a release savepoint or a
> rollback-to-savepoint.  How are you handling these (and similar but
> more complex) cases?
I don't handle DDL at all yet. What I posted is a working, but early prototype 
;).  Building an simple prototype seemed like a good idea to get a feeling for 
everything but solving *the* hairy problem without input from -hackers seemed 
like a bad idea.

But I have thought about the issue for quite a while... The ApplyCache module 
does reassemble wal records from one transaction into on coherent stream with 
only changes from that transaction. Aborted transactions are thrown away.
So you can apply half a transaction, detect the changed tupledesc, and then 
reapply the rest.

But all that is moot until we agree on how to handle DDL. More about that 
further down.

> Moreover, we will want in the future to allow some of the DDL changes
> that currently require AccessExclusiveLock to be performed with a
> lesser lock.  It is unclear to me that this will be practical as far
> as adding columns goes, but it would be a shame if logical replication
> were the thing standing in the way. 
Well. Should it really come to that, which I don't think it will,  I 
personally don't have a big problem of making the relaxed rules only available 
with a wal_level < WAL_LEVEL_LOGICAL.

> > This also is the solution to the other big problem, the need to work
> > around architecture/version specific binary formats. The alternative,
> > producing cross-version, cross-architecture compatible binary changes or
> > even moreso textual changes all the time seems to be prohibitively
> > expensive. Both from a cpu and a storage POV and also from the point of
> > implementation effort.
> I think that if you can't produce a textual record of changes, you're
> throwing away 75% of what people will want to do with this.  Being
> able to replicate across architectures, versions, and even into
> heterogeneous databases is the main point of having logical
> replication, IMV.   Multi-master replication is nice to have, but IME
> there is huge demand for a replication solution that doesn't need to
> be temporarily replaced with something completely different every time
> you want to do a database upgrade.
All I am talking about in the above paragraph is saying that we do not want to 
change the wal into a cross-architecture, cross-version compatible format.

> > For some operations (UPDATE, DELETE) and corner-cases (e.g. full page
> > writes) additional data needs to be logged, but the additional amount of
> > data isn't that big. Requiring a primary-key for any change but INSERT
> > seems to be a sensible thing for now. The required changes are fully
> > contained in heapam.c and are pretty simple so far.
> I think that you can handle the case where there are no primary keys
> by simply treating the whole record as a primary key.  There might be
> duplicates, but I think you can work around that by decreeing that
> when you replay an UPDATE or DELETE operation you will update or
> delete at most one record.  So if there are multiple exactly-identical
> records in the target database, then you will just UPDATE or DELETE
> exactly one of them (it doesn't matter which).  This doesn't even seem
> particularly complicated.
Hm. Yes, you could do that. But I have to say I don't really see a point. 
Maybe the fact that I do envision multimaster systems at some point is 
clouding my judgement though as its far less easy in that case.

It also complicates the wal format as you now need to specify whether you 
transport a full or a primary-key only tuple...

> > We introduced a new command that, analogous to
> > START_REPLICATION, is called START_LOGICAL_REPLICATION that will stream
> > out all xlog records that pass through a filter.
> > 
> > The on-the-wire format stays the same. The filter currently simply
> > filters out all record which are not interesting for logical replication
> > (indexes, freezing, ...) and records that did not originate on the same
> > system.
> > 
> > The requirement of filtering by 'origin' of a wal node comes from the
> > planned multimaster support. Changes replayed locally that originate
> > from another site should not replayed again there. If the wal is plainly
> > used without such a filter that would cause loops. Instead we tag every
> > wal record with the "node id" of the site that caused the change to
> > happen and changes with a nodes own "node id" won't get applied again.
> > 
> > Currently filtered records get simply replaced by NOOP records and loads
> > of zeroes which obviously is not a sensible solution. The difficulty of
> > actually removing the records is that that would change the LSNs. We
> > currently rely on those though.
> > 
> > The filtering might very well get expanded to support partial replication
> > and such in future.
> This all seems to need a lot more thought.
Yes. I didn't want to make any decisions here which would probably be the 
wrong ones anyway, so I went for the NOOP option for now.

Unless youre talking about the 'origin_id' concept - that seems to work out 
very nicely with minimal amounts of code needed.

> > To sensibly apply changes out of the WAL stream we need to solve two
> > things: Reassemble transactions and apply them to the target database.
> > 
> > The logical stream from 1. via 2. consists out of individual changes
> > identified by the relfilenode of the table and the xid of the
> > transaction. Given (sub)transactions, rollbacks, crash recovery,
> > subtransactions and the like those changes obviously cannot be
> > individually applied without fully loosing the pretence of consistency.
> > To solve that we introduced a module, dubbed ApplyCache which does the
> > reassembling. This module is *independent* of the data source and of the
> > method of applying changes so it can be reused for replicating into a
> > foreign system or similar.
> > 
> > Due to the overhead of planner/executor/toast reassembly/type conversion
> > (yes, we benchmarked!) we decided against statement generation for
> > apply. Even when using prepared statements the overhead is rather
> > noticeable.
> > 
> > Instead we decided to use relatively lowlevel heapam.h/genam.h accesses
> > to do the apply. For now we decided to use only one process to do the
> > applying, parallelizing that seems to be too complex for an introduction
> > of an already complex feature.
> > In our tests the apply process could keep up with pgbench -c/j 20+
> > generating changes. This will obviously heavily depend on the workload.
> > A fully seek bound workload will definitely not scale that well.
> > 
> > Just to reiterate: Plugging in another method to do the apply should be a
> > relatively simple matter of setting up three callbacks to a different
> > function (begin, apply_change, commit).
> 
> I think it's reasonable to do the apply using these low-level
> mechanisms, but surely people will sometimes want to extract tuples as
> text and do whatever with them.  This goes back to my comment about
> feeling that we need a toolkit approach, not a one-size-fits-all
> solution.
Yes. We definitely need a toolkit approach. If you look at the individual 
patches, especially the later ones, you can see that I tried very hard to go 
that way.
The wal decoding (Patch 09) and transaction reassembly (Patch 08) and the low 
level apply (Patch 14) parts are as separate as possible. That is 09 feeds 
into an ApplyCache without knowing anything about the side which applies the 
changes. As written above, you currently really only need to replace those 3 
callbacks to produce text output instead.

That might get slightly more complex with the choice whether toast should be 
reassembled in memory or not, but even then it should be pretty simple.

> > Another complexity in this is how to synchronize the catalogs. We plan to
> > use command/event triggers and the oid preserving features from
> > pg_upgrade to keep the catalogs in-sync. We did not start working on
> > that.
> This strikes me as completely unacceptable.  People ARE going to want
> to replicate data between non-identical schemas on systems with
> unsynchronized OIDs.  And even if they weren't, relying on triggers to
> keep things in sync is exactly the sort of kludge that has inspired
> all sorts of frustration with our existing replication solutions.
Yes. I definitely agree that people will want to do that. Hell, I want to do 
that.
The plan for that is (see my mail to merlin about it) to have catalog-only 
proxy instances. For those its even possible without much additonal problems 
to use a pretty normal HS setup + filtering. Thats the only realistic way I 
have found to do the necessary catalog lookups (TupleDescs + output 
procedures) and handle the binary compatibility.

I think though that we do not want to enforce that mode of operation for 
tightly coupled instances. For those I was thinking of using command triggers 
to synchronize the catalogs. 
One of the big screwups of the current replication solutions is exactly that 
you cannot sensibly do DDL which is not a big problem if you have a huge 
system with loads of different databases and very knowledgeable people et al. 
but at the beginning it really sucks. I have no problem with making one of the 
nodes the "schema master" in that case.
Also I would like to avoid the overhead of the proxy instance for use-cases 
where you really want one node replicated as fully as possible with the slight 
exception of being able to have summing tables, different indexes et al.

Does that make sense for you?

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: "ktm@rice.edu"
Date:
Subject: Re: libpq compression
Next
From: Andres Freund
Date:
Subject: Re: [PATCH 03/16] Add a new syscache to fetch a pg_class entry via its relfilenode