[PATCH 16/16] current version of the design document

From:           Andres Freund <andres@2ndquadrant.com>
Subject:        [PATCH 16/16] current version of the design document
Msg-id:         1339586927-13156-16-git-send-email-andres@2ndquadrant.com
In response to: [RFC][PATCH] Logical Replication/BDR prototype and architecture
                (Andres Freund <andres@2ndquadrant.com>)
List:           pgsql-hackers
From: Andres Freund <andres@anarazel.de>

---
 src/backend/replication/logical/DESIGN | 209 ++++++++++++++++++++++++++++++++
 1 file changed, 209 insertions(+)
 create mode 100644 src/backend/replication/logical/DESIGN

diff --git a/src/backend/replication/logical/DESIGN b/src/backend/replication/logical/DESIGN
new file mode 100644
index 0000000..2cf08ff
--- /dev/null
+++ b/src/backend/replication/logical/DESIGN
@@ -0,0 +1,209 @@

=== Design goals for logical replication ===
- in core
- fast
- async
- robust
- multi-master
- modular
- as unintrusive as possible, implementation-wise
- basis for other technologies (sharding, replication into other DBMSs, ...)

For reasons why we think this is an important set of features, please check
out the presentation from the in-core replication summit at pgcon:
http://wiki.postgresql.org/wiki/File:BDR_Presentation_PGCon2012.pdf

While you may argue that most of the above design goals are already provided
by various trigger-based replication solutions like Londiste or Slony, we
think that that is not enough, for several reasons:

- not in core (and thus less trustworthy)
- duplication of writes due to an additional log
- performance in general (check the end of the above presentation)
- complex to use because there is no native administration interface

We want to emphasize that this proposed architecture is based on the
experience of developing a minimal prototype which we developed with the above
goals in mind. While we obviously hope that a good part of it is reusable for
the community, we definitely do *not* expect the community to accept this
as-is. It is intended to be the basis upon which we, the community, can build
and design the future logical replication.

=== Basic architecture ===
Very broadly speaking, there are several major pieces common to most
approaches to replication:
1. Source data generation
2. Transportation of that data
3. Applying the changes
4. Conflict resolution


1.:

We need a change stream that contains all required changes in the correct
order. Since this stream has to reflect changes made across multiple
concurrent backends, it raises obvious concurrency and scalability issues.
Reusing the WAL stream for this seems a good choice: it is needed anyway, it
already addresses those issues, and it means we do not duplicate the writes
and locks already performed for its maintenance. Any other stream-generating
component would introduce additional scalability issues.

Unfortunately the WAL is mostly a physical representation of the changes and
thus does not, by itself, contain the necessary information in a convenient
format to create logical changesets.

The biggest problem is that interpreting tuples in the WAL stream requires an
up-to-date system catalog and needs to be done in a compatible backend and
architecture. The requirement of an up-to-date catalog could be solved by
adding more data to the WAL stream, but it seems likely that that would
require relatively intrusive & complex changes. Instead we chose to require a
synchronized catalog at the decoding site. That adds some complexity to use
cases like replicating into a different database or cross-version
replication. For those it is relatively straightforward to develop a proxy pg
instance that only contains the catalog and does the transformation to textual
changes.

This also is the solution to the other big problem, the need to work around
architecture/version specific binary formats. The alternative, producing
cross-version, cross-architecture compatible binary changes, or even more so
textual changes, all the time, seems to be prohibitively expensive, both from
a cpu and a storage POV, and also from the point of implementation effort.

The catalog on the site where changes originate can *not* be used for the
decoding because at the time we decode the WAL the catalog may have changed
from the state it was in when the WAL was generated. A possible solution for
this would be to have a fully versioned catalog, but that again seems to be
rather complex and intrusive.

For some operations (UPDATE, DELETE) and corner cases (e.g. full page writes)
additional data needs to be logged, but the additional amount of data isn't
that big. Requiring a primary key for any change but INSERT seems to be a
sensible thing for now. The required changes are fully contained in heapam.c
and are pretty simple so far.

2.:

For transport of the non-decoded data from the originating site to the
decoding site we decided to reuse the infrastructure already provided by
walsender/walreceiver. We introduced a new command that, analogous to
START_REPLICATION, is called START_LOGICAL_REPLICATION and streams out all
xlog records that pass through a filter.

The on-the-wire format stays the same. The filter currently simply filters out
all records which are not interesting for logical replication (indexes,
freezing, ...) and records that did not originate on the same system.

The requirement of filtering by 'origin' of a wal record comes from the
planned multi-master support. Changes replayed locally that originate from
another site should not be replayed again there; if the wal were plainly used
without such a filter, that would cause loops. Instead we tag every wal record
with the "node id" of the site that caused the change to happen, and changes
carrying a node's own "node id" won't get applied again (a sketch of the
sender-side filter follows at the end of this section).

Currently, filtered records simply get replaced by NOOP records and loads of
zeroes, which obviously is not a sensible solution. The difficulty in actually
removing the records is that doing so would change the LSNs, which we
currently rely on.

The filtering might very well get expanded to support partial replication and
such in the future.
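To make that concrete, here is a minimal sketch of such a sender-side filter.
This is not the walsender code from the patches; every name in it is invented
for illustration, and the real prototype of course works on actual WAL
records rather than on a toy struct:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the real record structures. */
    typedef uint8_t RepNodeId;

    typedef struct SketchedRecord
    {
        RepNodeId origin_node_id;        /* "node id" tag on every wal record */
        bool      logically_interesting; /* false for indexes, freezing, ... */
    } SketchedRecord;

    /* Return true if the record should be streamed out for logical replication. */
    static bool
    lr_filter_record(const SketchedRecord *rec, RepNodeId local_node_id)
    {
        /*
         * Records that were replayed from another node must not be streamed
         * out again; otherwise multi-master setups would loop.
         */
        if (rec->origin_node_id != local_node_id)
            return false;

        /*
         * Records irrelevant for logical replication are dropped as well; in
         * the prototype they are replaced by NOOP records rather than being
         * removed, to keep LSNs stable.
         */
        return rec->logically_interesting;
    }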
3.:

To sensibly apply changes out of the WAL stream we need to solve two things:
reassemble transactions and apply them to the target database.

The logical stream from 1. via 2. consists of individual changes identified by
the relfilenode of the table and the xid of the transaction. Given
subtransactions, rollbacks, crash recovery and the like, those changes
obviously cannot be applied individually without completely losing the
pretence of consistency. To solve that we introduced a module, dubbed
ApplyCache, which does the reassembling. This module is *independent* of the
data source and of the method of applying changes, so it can be reused for
replicating into a foreign system or similar.
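As a rough illustration of the reassembly idea only (this is *not* the
ApplyCache interface from patch 08; all names below are invented, and the
fixed-size table stands in for real memory management):

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t TransactionId;

    typedef struct Change
    {
        struct Change *next;
        /* ... relfilenode, change kind, tuple data ... */
    } Change;

    typedef struct TxnBuffer
    {
        TransactionId  xid;
        int            in_use;
        Change        *head;
        Change        *tail;
    } TxnBuffer;

    #define MAX_TXNS 64              /* a real implementation grows or spills */
    static TxnBuffer txns[MAX_TXNS];

    static TxnBuffer *
    txn_for_xid(TransactionId xid)
    {
        TxnBuffer *free_slot = NULL;

        for (int i = 0; i < MAX_TXNS; i++)
        {
            if (txns[i].in_use && txns[i].xid == xid)
                return &txns[i];
            if (!txns[i].in_use && free_slot == NULL)
                free_slot = &txns[i];
        }
        if (free_slot == NULL)
            abort();                 /* sketch only: no spill-to-disk here */
        free_slot->in_use = 1;
        free_slot->xid = xid;
        free_slot->head = free_slot->tail = NULL;
        return free_slot;
    }

    /* Decoding side: queue every decoded change under its transaction. */
    static void
    cache_add_change(TransactionId xid, Change *change)
    {
        TxnBuffer *txn = txn_for_xid(xid);

        change->next = NULL;
        if (txn->tail != NULL)
            txn->tail->next = change;
        else
            txn->head = change;
        txn->tail = change;
    }

    /*
     * Commit side: only once the commit record for xid has been decoded are
     * the buffered changes handed over, in order; transactions that never
     * commit (rollback, crash) are discarded without ever being applied.
     */
    static void
    cache_commit(TransactionId xid, void (*apply_change)(Change *))
    {
        TxnBuffer *txn = txn_for_xid(xid);

        for (Change *c = txn->head; c != NULL;)
        {
            Change *next = c->next;

            apply_change(c);
            free(c);                 /* assumes changes were malloc'ed */
            c = next;
        }
        txn->in_use = 0;
    }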
Due to the overhead of planner/executor/toast reassembly/type conversion (yes,
we benchmarked!) we decided against statement generation for apply. Even when
using prepared statements the overhead is rather noticeable.

Instead we decided to use relatively low-level heapam.h/genam.h accesses to do
the apply. For now we decided to use only one process to do the applying;
parallelizing that seems too complex for the introduction of an already
complex feature.
In our tests the apply process could keep up with pgbench -c/j 20+ generating
changes. This will obviously depend heavily on the workload; a fully
seek-bound workload will definitely not scale that well.

Just to reiterate: Plugging in another method to do the apply should be a
relatively simple matter of setting up three callbacks to a different function
(begin, apply_change, commit).
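For example, a plugged-in apply method could look roughly like the following
sketch (the names and signatures are hypothetical; the real definitions live
in the ApplyCache and apply module patches):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef struct Change Change;  /* opaque; as reassembled by the ApplyCache */

    /* Hypothetical bundle of the three callbacks mentioned above. */
    typedef struct ApplyCallbacks
    {
        void (*begin)(TransactionId xid);
        void (*apply_change)(TransactionId xid, Change *change);
        void (*commit)(TransactionId xid);
    } ApplyCallbacks;

    /*
     * A toy consumer that just prints; replicating into a foreign system
     * would emit that system's wire protocol here instead of going through
     * heapam.h/genam.h.
     */
    static void
    print_begin(TransactionId xid)
    {
        printf("BEGIN %u\n", (unsigned) xid);
    }

    static void
    print_change(TransactionId xid, Change *change)
    {
        (void) change;
        printf("CHANGE in xact %u\n", (unsigned) xid);
    }

    static void
    print_commit(TransactionId xid)
    {
        printf("COMMIT %u\n", (unsigned) xid);
    }

    static const ApplyCallbacks printing_apply = {
        .begin = print_begin,
        .apply_change = print_change,
        .commit = print_commit,
    };

Swapping such a structure in for the heapam-based one would be the whole
integration surface for an alternative apply method.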
Another complexity in this is how to synchronize the catalogs. We plan to use
command/event triggers and the oid-preserving features from pg_upgrade to keep
the catalogs in sync. We have not started working on that yet.


4.:

While we have started to think about conflict resolution/avoidance, we have
not started to work on it. We currently *cannot* handle conflicts. We think
that the base features/architecture should be agreed upon before starting
with it.

Multi-master tests were done with sequences set up with INCREMENT 2 and
different start values on the two nodes.

=== Current Prototype ===

The current prototype consists of a series of patches that are split into
hopefully sensible and coherent pieces to make review of the individual parts
possible.

It is also available in the 'cabal-rebasing' branch on
git.postgresql.org/users/andresfreund/postgres.git . That branch will modify
history though.

01: wakeup handling: reduces replication lag, not very interesting in this
    context

02: Add zeroRecPtr: not very interesting either

03: new syscache for relfilenode: this would benefit from some
    syscache-experienced eyes

04: embedded lists: this is a general facility, general review appreciated

05: preliminary bgworker support: this is not ready and is just posted as
    preliminary work for the other patches. Simon will post a real patch soon

06: XLogReader: review definitely appreciated

07: logical data additions for WAL: review definitely appreciated, I do not
    expect fundamental changes

08: ApplyCache: important infrastructure for the patch, review definitely
    appreciated

09: WAL decoding: decode WAL generated with wal_level=logical into an
    ApplyCache

10: WAL with 'origin node': this is another important base piece for
    logical rep

11: WAL segment handling changes: if the basic idea of adding a node_id to the
    functions and adding a pg_lcr directory is acceptable, the rest of the
    patch is fairly boring/mechanical

12: walsender/walreceiver changes: implement transport/filtering of logical
    changes. Very relevant

13: shared memory/crash recovery state handling for logical rep: very
    relevant, minus the TODOs in the commit message

14: apply module: review appreciated

15: apply process: somewhat dependent on the preliminary changes in 05; the
    general direction is visible, but loads of detail work are needed as soon
    as some design decisions are agreed upon.

16: this document. Not very interesting after you've read it ;)
--
1.7.10.rc3.3.g19a6c.dirty