Thread: [RFC][PATCH] Logical Replication/BDR prototype and architecture

[RFC][PATCH] Logical Replication/BDR prototype and architecture

From: Andres Freund
Hi everyone,

This mail contains the high-level design description of how our prototype of
in-core logical replication works. The individual patches will be posted as
replies to this email. I obviously welcome all sorts of productive comments on
both the individual patches and the architecture.

Unless somebody objects I will add most of the individual patches marked as RFC to the
current commitfest. I hope that with comments stemming from that round we can
get several of the patches into the first or second commitfest. As soon as the
design is clear/accepted we will try very hard to get the following patches
into the second or third round.

If anybody disagrees with the procedural way we are trying to do this, please
raise a hand now.

I tried to find the right balance between keeping the description short enough
that anybody will read the design docs and verbose enough that it is
understandable. I can go into much more detail in any part if wanted.

Please, keep in mind that those patches are *RFC* and a prototype and not
intended to be applied as-is. There is a short description of the individual
patches and their relevancy at the end of the email.

Greetings,

Andres

========

=== Design goals for logical replication ===
- in core
- fast
- async
- robust
- multi-master
- modular
- as unintrusive as possible, implementation-wise
- basis for other technologies (sharding, replication into other DBMSs, ...)

For reasons why we think this is an important set of features please check out
the presentation from the in-core replication summit at pgcon:
http://wiki.postgresql.org/wiki/File:BDR_Presentation_PGCon2012.pdf

While you may argue that most of the above design goals are already provided by
various trigger-based replication solutions like Londiste or Slony, we think
that that's not enough for various reasons:

- not in core (and thus less trustworthy)
- duplication of writes due to an additional log
- performance in general (check the end of the above presentation)
- complex to use because there is no native administration interface

We want to emphasize that this proposed architecture is based on the experience
of developing a minimal prototype which we developed with the above goals in
mind. While we obviously hope that a good part of it is reusable for the
community, we definitely do *not* expect that the community accepts this
as-is. It is intended to be the basis upon which we, the community, can build
and design the future logical replication.

=== Basic architecture ===
Very broadly speaking there are several major pieces common to most approaches
to replication:
1. Source data generation
2. Transportation of that data
3. Applying the changes
4. Conflict resolution


1.:

We need a change stream that contains all required changes in the correct
order. This stream has to reflect changes made across multiple concurrent
backends, which raises obvious concurrency and scalability issues. Reusing the
WAL stream for this seems a good choice since it is needed anyway and already
addresses those issues, and it further means that we don't duplicate the
writes and locks already performed for its maintenance. Any other
stream-generating component would introduce additional scalability issues.

Unfortunately, in this case, the WAL is mostly a physical representation of the
changes and thus does not, by itself, contain the necessary information in a
convenient format to create logical changesets.

The biggest problem is that interpreting tuples in the WAL stream requires an
up-to-date system catalog and needs to be done in a compatible backend and
architecture. The requirement of an up-to-date catalog could be solved by
adding more data to the WAL stream, but it seems likely that this would
require relatively intrusive & complex changes. Instead we chose to require a
synchronized catalog at the decoding site. That adds some complexity to use
cases like replicating into a different database or cross-version
replication. For those it is relatively straightforward to develop a proxy pg
instance that only contains the catalog and does the transformation to textual
changes.

This also is the solution to the other big problem: the need to work around
architecture/version-specific binary formats. The alternative of producing
cross-version, cross-architecture compatible binary changes, or even more so
textual changes, all the time seems to be prohibitively expensive, both from a
CPU and a storage POV, and also in terms of implementation effort.

The catalog on the site where changes originate can *not* be used for the
decoding because at the time we decode the WAL the catalog may have changed
from the state it was in when the WAL was generated. A possible solution for
this would be to have a fully versioned catalog but that again seems to be
rather complex and intrusive.

For some operations (UPDATE, DELETE) and corner cases (e.g. full-page writes)
additional data needs to be logged, but the additional amount of data isn't
that big. Requiring a primary key for any change but INSERT seems to be a
sensible thing for now. The required changes are fully contained in heapam.c
and are pretty simple so far.
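
To make this concrete, here is a purely illustrative sketch - the names and
layout are assumptions, not taken from the patches - of what a single decoded
change conceptually needs to carry. It shows why UPDATE/DELETE require the old
primary-key values to be logged while INSERT does not:

/* illustrative only; the includes assume a backend compilation unit */
#include "postgres.h"
#include "access/htup.h"

typedef enum ChangeKind
{
    CHANGE_INSERT,
    CHANGE_UPDATE,
    CHANGE_DELETE
} ChangeKind;

typedef struct LogicalChange
{
    ChangeKind    kind;           /* what happened */
    Oid           relfilenode;    /* which table (cf. section 3.) */
    TransactionId xid;            /* which transaction did it */
    HeapTuple     newtuple;       /* new row version; NULL for DELETE */
    HeapTuple     oldkey;         /* primary-key columns of the old row;
                                   * NULL for INSERT - this is the extra
                                   * data that has to be logged */
} LogicalChange;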

2.:

For transport of the non-decoded data from the originating site to the decoding
site we decided to reuse the infrastructure already provided by
walsender/walreceiver. We introduced a new command, START_LOGICAL_REPLICATION,
which, analogous to START_REPLICATION, streams out all xlog records that pass
through a filter.
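
From a client's perspective, starting the stream could look roughly like the
following sketch over a plain libpq replication connection (the connection
string, start position and exact command syntax are assumptions; the
prototype's actual syntax may differ):

#include <stdio.h>
#include <libpq-fe.h>

static void
start_logical_stream(void)
{
    /* replication=true selects the walsender protocol, as for physical rep */
    PGconn   *conn = PQconnectdb("host=origin user=rep replication=true");
    PGresult *res;

    /* analogous to START_REPLICATION; afterwards the filtered xlog records
     * are streamed to us in COPY BOTH mode */
    res = PQexec(conn, "START_LOGICAL_REPLICATION 0/1000000");
    if (PQresultStatus(res) != PGRES_COPY_BOTH)
        fprintf(stderr, "could not start streaming: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
}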

The on-the-wire format stays the same. The filter currently simply filters out
all records which are not interesting for logical replication (indexes,
freezing, ...) and records that did not originate on the same system.

The requirement of filtering by 'origin' of a WAL record comes from the planned
multi-master support. Changes replayed locally that originate from another site
should not be replayed again there. If the WAL is plainly used without such a
filter, that would cause loops. Instead we tag every WAL record with the "node
id" of the site that caused the change to happen, and changes with a node's own
"node id" won't get applied again.

Currently filtered records simply get replaced by NOOP records and runs of
zeroes, which obviously is not a sensible solution. The difficulty of actually
removing the records is that doing so would change the LSNs, which we currently
rely on.

The filtering might very well get expanded to support partial replication and
the like in the future.


3.:

To sensibly apply changes out of the WAL stream we need to solve two things:
Reassemble transactions and apply them to the target database.

The logical stream from 1. via 2. consists of individual changes identified by
the relfilenode of the table and the xid of the transaction. Given
subtransactions, rollbacks, crash recovery and the like, those changes
obviously cannot be applied individually without fully losing the pretense of
consistency. To solve that we introduced a module, dubbed ApplyCache, which
does the reassembling. This module is *independent* of the data source and of
the method of applying changes, so it can be reused for replicating into a
foreign system or similar.

Due to the overhead of planner/executor/toast reassembly/type conversion (yes,
we benchmarked!) we decided against statement generation for apply. Even when
using prepared statements the overhead is rather noticeable.

Instead we decided to use relatively low-level heapam.h/genam.h accesses to do
the apply. For now we decided to use only one process to do the applying;
parallelizing that seems too complex for the introduction of an already
complex feature.
In our tests the apply process could keep up with pgbench -c/j 20+ generating
changes. This will obviously depend heavily on the workload; a fully seek-bound
workload will definitely not scale that well.
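
As a rough illustration of what low-level apply means - a simplified sketch,
not the patch's actual code; index maintenance, triggers and error handling
are omitted, and RelidByRelfilenode() is a hypothetical wrapper around the new
RELFILENODE syscache from patch 03 - applying a decoded INSERT could look
like:

#include "postgres.h"
#include "access/heapam.h"

static void
apply_insert(LogicalChange *change)
{
    /* find the table the change belongs to via its relfilenode */
    Oid      reloid = RelidByRelfilenode(change->relfilenode);
    Relation rel    = heap_open(reloid, RowExclusiveLock);

    /* insert the tuple directly, bypassing parser/planner/executor */
    simple_heap_insert(rel, change->newtuple);

    heap_close(rel, RowExclusiveLock);
}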

Just to reiterate: Plugging in another method to do the apply should be a
relatively simple matter of pointing three callbacks (begin, apply_change,
commit) at different functions.
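
Summarized as a sketch (the actual ApplyCache API in patch 08 may use
different names and signatures):

typedef struct ApplyCacheCallbacks
{
    /* called once when a reassembled transaction starts to be replayed */
    void (*begin) (TransactionId xid);

    /* called once per change, in the original order within the xact */
    void (*apply_change) (TransactionId xid, LogicalChange *change);

    /* called once at the end; commits the replayed transaction */
    void (*commit) (TransactionId xid, XLogRecPtr commit_lsn);
} ApplyCacheCallbacks;

Replicating into a foreign system would then just mean supplying callbacks
that, say, emit SQL text instead of performing heapam operations.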

Another complexity in this is how to synchronize the catalogs. We plan to use
command/event triggers and the oid-preserving features from pg_upgrade to keep
the catalogs in sync. We have not started working on that yet.


4.:

While we have started to think about conflict resolution/avoidance, we have not
started to work on it. We currently *cannot* handle conflicts. We think that
the base features/architecture should be agreed upon before starting on it.

Multimaster tests were done with sequences set up with INCREMENT 2 and
different start values on the two nodes.

=== Current Prototype ===

The current prototype consists of a series of patches that are split into
hopefully sensible and coherent parts to make reviewing of individual parts
possible.

It's also available in the 'cabal-rebasing' branch on
git.postgresql.org/users/andresfreund/postgres.git . Be aware that the
branch's history will be rewritten, though.

01: wakeup handling: reduces replication lag, not very interesting in this context

02: Add zeroRecPtr: not very interesting either

03: new syscache for relfilenode. This would benefit from some syscache-experienced eyes

04: embedded lists: This is a general facility, general review appreciated

05: preliminary bgworker support: This is not ready and just posted as it is
preliminary work for the other patches. Simon will post a real patch soon

06: XLogReader: Review definitely appreciated

07: logical data additions for WAL: Review definitely appreciated, I do not expect fundamental changes

08: ApplyCache: Important infrastructure for the patch, review definitely appreciated

09: Wal Decoding: Decode WAL generated with wal_level=logical into an ApplyCache

10: WAL with 'origin node': This is another important base-piece for logical rep

11: WAL segment handling changes: If the basic idea of adding a node_id to the
functions and adding a pg_lcr directory is acceptable, the rest of the patch is
fairly boring/mechanical

12: walsender/walreceiver changes: Implement transport/filtering of logical
changes. Very relevant

13: shared memory/crash recovery state handling for logical rep: Very relevant,
minus the TODOs in the commit message

14: apply module: review appreciated

15: apply process: somewhat dependent on the preliminary changes in 05; the
general direction is visible, loads of detail work needed as soon as some
design decisions are agreed upon

16: this document. Not very interesting after you've read it ;)

-- 
Andres Freund                           http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


[PATCH 04/16] Add embedded list interface (header only)

From: Andres Freund
From: Andres Freund <andres@anarazel.de>

Adds a singly and a doubly linked list which can easily be embedded into other
datastructures and can be used without any additional allocations.

Problematic: It requires USE_INLINE to be used. It could be remade to fall back
to externally defined functions if that is not available, but that hardly seems
sensible at this day and age. Besides, the speed hit would be noticeable, and
it's only used in new code which could be disabled on machines - given they
still exist - without proper support for inline functions
---
 src/include/utils/ilist.h |  253 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 253 insertions(+)
 create mode 100644 src/include/utils/ilist.h

diff --git a/src/include/utils/ilist.h b/src/include/utils/ilist.h
new file mode 100644
index 0000000..81d540e
--- /dev/null
+++ b/src/include/utils/ilist.h
@@ -0,0 +1,253 @@
+#ifndef ILIST_H
+#define ILIST_H
+
+#ifdef __GNUC__
+#define unused_attr __attribute__((unused))
+#else
+#define unused_attr
+#endif
+
+#ifndef USE_INLINE
+#error "a compiler supporting static inlines is required"
+#endif
+
+#include <assert.h>
+
+typedef struct ilist_d_node ilist_d_node;
+
+struct ilist_d_node
+{
+    ilist_d_node* prev;
+    ilist_d_node* next;
+};
+
+typedef struct
+{
+    ilist_d_node head;
+} ilist_d_head;
+
+typedef struct ilist_s_node ilist_s_node;
+
+struct ilist_s_node
+{
+    ilist_s_node* next;
+};
+
+typedef struct
+{
+    ilist_s_node head;
+} ilist_s_head;
+
+#ifdef ILIST_DEBUG
+void ilist_d_check(ilist_d_head* head);
+#else
+static inline void ilist_d_check(ilist_d_head* head)
+{
+}
+#endif
+
+static inline void ilist_d_init(ilist_d_head *head)
+{
+    head->head.next = head->head.prev = &head->head;
+    ilist_d_check(head);
+}
+
+/*
+ * adds a node at the beginning of the list
+ */
+static inline void ilist_d_push_front(ilist_d_head *head, ilist_d_node *node)
+{
+    node->next = head->head.next;
+    node->prev = &head->head;
+    node->next->prev = node;
+    head->head.next = node;
+    ilist_d_check(head);
+}
+
+
+/*
+ * adds a node at the end of the list
+ */
+static inline void ilist_d_push_back(ilist_d_head *head, ilist_d_node *node)
+{
+    node->next = &head->head;
+    node->prev = head->head.prev;
+    node->prev->next = node;
+    head->head.prev = node;
+    ilist_d_check(head);
+}
+
+
+/*
+ * adds a node after another *in the same list*
+ */
+static inline void ilist_d_add_after(unused_attr ilist_d_head *head, ilist_d_node *after, ilist_d_node *node)
+{
+    node->prev = after;
+    node->next = after->next;
+    after->next = node;
+    node->next->prev = node;
+    ilist_d_check(head);
+}
+
+/*
+ * adds a node after another *in the same list*
+ */
+static inline void ilist_d_add_before(unused_attr ilist_d_head *head, ilist_d_node *before, ilist_d_node *node)
+{
+    node->prev = before->prev;
+    node->next = before;
+    before->prev = node;
+    node->prev->next = node;
+    ilist_d_check(head);
+}
+
+
+/*
+ * removes a node from a list
+ */
+static inline void ilist_d_remove(unused_attr ilist_d_head *head, ilist_d_node *node)
+{
+    ilist_d_check(head);
+    node->prev->next = node->next;
+    node->next->prev = node->prev;
+    ilist_d_check(head);
+}
+
+/*
+ * removes the first node from a list or returns NULL
+ */
+static inline ilist_d_node* ilist_d_pop_front(ilist_d_head *head)
+{
+    ilist_d_node* ret;
+
+    if (&head->head == head->head.next)
+        return NULL;
+
+    ret = head->head.next;
+    ilist_d_remove(head, head->head.next);
+    return ret;
+}
+
+
+static inline bool ilist_d_has_next(ilist_d_head *head, ilist_d_node *node)
+{
+    return node->next != &head->head;
+}
+
+static inline bool ilist_d_has_prev(ilist_d_head *head, ilist_d_node *node)
+{
+    return node->prev != &head->head;
+}
+
+static inline bool ilist_d_is_empty(ilist_d_head *head)
+{
+    /* the list is circular, so empty means the head points at itself;
+     * comparing next == prev would wrongly report one-element lists as
+     * empty */
+    return head->head.next == &head->head;
+}
+
+/* parenthesize the whole expansion so the ternary can't bind to surrounding
+ * operators */
+#define ilist_d_front(type, membername, ptr) ((&((ptr)->head) == (ptr)->head.next) ? \
+    NULL : ilist_container(type, membername, (ptr)->head.next))
+
+#define ilist_d_front_unchecked(type, membername, ptr) ilist_container(type, membername, (ptr)->head.next)
+
+#define ilist_d_back(type, membername, ptr)  ((&((ptr)->head) == (ptr)->head.prev) ? \
+    NULL : ilist_container(type, membername, (ptr)->head.prev))
+
+#define ilist_container(type, membername, ptr) ((type*)((char*)(ptr) - offsetof(type, membername)))
+
+#define ilist_d_foreach(name, ptr) for(name = (ptr)->head.next;    \
+                                     name != &(ptr)->head;    \
+                                     name = name->next)
+
+#define ilist_d_foreach_modify(name, nxt, ptr) for(name = (ptr)->head.next,    \
+                                                       nxt = name->next;       \
+                                                   name != &(ptr)->head        \
+                                                       ;                       \
+                                                   name = nxt, nxt = name->next)
+
+static inline void ilist_s_init(ilist_s_head *head)
+{
+    head->head.next = NULL;
+}
+
+static inline void ilist_s_push_front(ilist_s_head *head, ilist_s_node *node)
+{
+    node->next = head->head.next;
+    head->head.next = node;
+}
+
+/*
+ * fails if the list is empty
+ */
+static inline ilist_s_node* ilist_s_pop_front(ilist_s_head *head)
+{
+    ilist_s_node* front = head->head.next;
+    head->head.next = head->head.next->next;
+    return front;
+}
+
+/*
+ * removes a node from a list
+ * Attention: O(n)
+ */
+static inline void ilist_s_remove(ilist_s_head *head,
+                                  ilist_s_node *node)
+{
+    ilist_s_node *last = &head->head;
+    ilist_s_node *cur;
+#ifndef NDEBUG
+    bool found = false;
+#endif
+    while ((cur = last->next))
+    {
+        if (cur == node)
+        {
+            last->next = cur->next;
+#ifndef NDEBUG
+            found = true;
+#endif
+            break;
+        }
+        last = cur;
+    }
+    assert(found);
+}
+
+
+static inline void ilist_s_add_after(unused_attr ilist_s_head *head,
+                                     ilist_s_node *after, ilist_s_node *node)
+{
+    node->next = after->next;
+    after->next = node;
+}
+
+
+static inline bool ilist_s_is_empty(ilist_s_head *head)
+{
+    return head->head.next == NULL;
+}
+
+static inline bool ilist_s_has_next(unused_attr ilist_s_head* head,
+                                    ilist_s_node *node)
+{
+    return node->next != NULL;
+}
+
+
+/* the empty case has to yield NULL, and the head's next pointer is what
+ * holds the first node */
+#define ilist_s_front(type, membername, ptr) (ilist_s_is_empty(ptr) ? \
+    NULL : ilist_container(type, membername, (ptr)->head.next))
+
+#define ilist_s_front_unchecked(type, membername, ptr) \
+    ilist_container(type, membername, (ptr)->head.next)
+
+#define ilist_s_foreach(name, ptr) for(name = (ptr)->head.next;         \
+                                       name != NULL;                    \
+                                       name = name->next)
+
+#define ilist_s_foreach_modify(name, nxt, ptr) for(name = (ptr)->head.next, \
+                                                       nxt = name ? name->next : NULL; \
+                                                   name != NULL;            \
+                                                   name = nxt, nxt = name ? name->next : NULL)
+
+
+#endif
-- 
1.7.10.rc3.3.g19a6c.dirty
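
For illustration, here is a minimal usage sketch of the interface above (the
embedding struct is invented for this example):

#include <stdlib.h>
#include "utils/ilist.h"

typedef struct PendingChange
{
    int          data;
    ilist_d_node node;          /* embedded link, no separate allocation */
} PendingChange;

static ilist_d_head pending;

static void
example(void)
{
    PendingChange *c = malloc(sizeof(PendingChange));
    ilist_d_node  *cur;

    ilist_d_init(&pending);
    c->data = 42;
    ilist_d_push_back(&pending, &c->node);

    ilist_d_foreach(cur, &pending)
    {
        PendingChange *pc = ilist_container(PendingChange, node, cur);
        (void) pc;              /* ... use pc->data ... */
    }
}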



[PATCH 03/16] Add a new syscache for relfilenode

From: Andres Freund <andres@anarazel.de>

This patch is problematic because, formally, indexes used by syscaches need to
be unique; this one is not, though, because of 0/InvalidOid entries for
nailed/shared catalog entries. Those values aren't allowed to be queried though.

It might be nicer to add infrastructure to do this properly, I just don't have
a clue what the best way for this would be.
---
 src/backend/utils/cache/syscache.c |   11 +++++++++++
 src/include/catalog/indexing.h     |    2 ++
 src/include/utils/syscache.h       |    1 +
 3 files changed, 14 insertions(+)

diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index c365ec7..9cfb013 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -588,6 +588,17 @@ static const struct cachedesc cacheinfo[] = {
         },
         1024
     },
+    {RelationRelationId,        /* RELFILENODE */
+        ClassRelfilenodeIndexId,
+        1,
+        {
+            Anum_pg_class_relfilenode,
+            0,
+            0,
+            0
+        },
+        1024
+    },
     {RewriteRelationId,            /* RULERELNAME */
         RewriteRelRulenameIndexId,
         2,
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 450ec25..5c9419b 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,6 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_o
 #define ClassOidIndexId  2662
 DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
 #define ClassNameNspIndexId  2663
 
+DECLARE_INDEX(pg_class_relfilenode_index, 2844, on pg_class using btree(relfilenode oid_ops));
+#define ClassRelfilenodeIndexId  2844
 DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
 #define CollationNameEncNspIndexId 3164
 
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index d59dd4e..63a5042 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -73,6 +73,7 @@ enum SysCacheIdentifier
     RANGETYPE,
     RELNAMENSP,
     RELOID,
+    RELFILENODE,
     RULERELNAME,
     STATRELATTINH,
     TABLESPACEOID,
-- 
1.7.10.rc3.3.g19a6c.dirty



[PATCH 02/16] Add zeroRecPtr

From: Andres Freund <andres@anarazel.de>

This is locally defined in lots of places and would get introduced frequently
in the next commits. It is expected that this can be defined in a header-only
manner as soon as the XLogInsert scalability groundwork from Heikki gets in.
---
 src/backend/access/transam/xlogutils.c |    1 +
 src/include/access/xlogdefs.h          |    1 +
 2 files changed, 2 insertions(+)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 6ddcc59..3a2462b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -51,6 +51,7 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+XLogRecPtr zeroRecPtr = {0, 0};
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 5e6d7e6..2768427 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -35,6 +35,7 @@ typedef struct XLogRecPtr
     uint32        xrecoff;        /* byte offset of location in log file */
 } XLogRecPtr;
 
+extern XLogRecPtr zeroRecPtr;
 
 #define XLogRecPtrIsInvalid(r)    ((r).xrecoff == 0)
-- 
1.7.10.rc3.3.g19a6c.dirty



[PATCH 01/16] Overhaul walsender wakeup handling

From: Andres Freund
From: Andres Freund <andres@anarazel.de>

The previous coding could miss xlog writeouts at several places, e.g. when WAL
was written out by the background writer, or even after a commit if
synchronous_commit=off.
This could lead to delays in sending data to the standby of up to 7 seconds.

To fix this, move the responsibility for notification to the layer where the
necessary information is actually present. We take some care not to do the
notification while we hold contended locks like WALInsertLock or WALWriteLock.

Document the preexisting fact that we rely on SetLatch to be safe from within
signal handlers and critical sections.

This removes the temporary bandaid from 2c8a4e9be2730342cbca85150a2a9d876aa77ff6
---
 src/backend/access/transam/twophase.c |   21 -----------------
 src/backend/access/transam/xact.c     |    7 ------
 src/backend/access/transam/xlog.c     |   24 +++++++++++++------
 src/backend/port/unix_latch.c         |    3 +++
 src/backend/port/win32_latch.c        |    4 ++++
 src/backend/replication/walsender.c   |   41 ++++++++++++++++++++++++++++++++-
 src/include/replication/walsender.h   |    2 ++
 7 files changed, 66 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b94fae3..bdb7bcd 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1042,13 +1042,6 @@ EndPrepare(GlobalTransaction gxact)
 
     /* If we crash now, we have prepared: WAL replay will fix things */
 
-    /*
-     * Wake up all walsenders to send WAL up to the PREPARE record immediately
-     * if replication is enabled
-     */
-    if (max_wal_senders > 0)
-        WalSndWakeup();
-
     /* write correct CRC and close file */
     if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
     {
 
@@ -2045,13 +2038,6 @@ RecordTransactionCommitPrepared(TransactionId xid,
 
     /* Flush XLOG to disk */
     XLogFlush(recptr);
 
-    /*
-     * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
-     * immediately if replication is enabled
-     */
-    if (max_wal_senders > 0)
-        WalSndWakeup();
-
     /* Mark the transaction committed in pg_clog */
     TransactionIdCommitTree(xid, nchildren, children);
@@ -2133,13 +2119,6 @@ RecordTransactionAbortPrepared(TransactionId xid,
     XLogFlush(recptr);
 
     /*
-     * Wake up all walsenders to send WAL up to the ABORT PREPARED record
-     * immediately if replication is enabled
-     */
-    if (max_wal_senders > 0)
-        WalSndWakeup();
-
-    /*
      * Mark the transaction aborted in clog.  This is not absolutely necessary
      * but we may as well do it while we are here.
      */
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8f00186..3cc2bfa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1141,13 +1141,6 @@ RecordTransactionCommit(void)
         XLogFlush(XactLastRecEnd);
 
         /*
-         * Wake up all walsenders to send WAL up to the COMMIT record
-         * immediately if replication is enabled
-         */
-        if (max_wal_senders > 0)
-            WalSndWakeup();
-
-        /*
          * Now we may update the CLOG, if we wrote a COMMIT record above
          */
         if (markXidCommitted)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bcb71c4..166efb0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1017,6 +1017,8 @@ begin:;
 
         END_CRIT_SECTION();
 
+        /* wakeup the WalSnd now that we released the WALWriteLock */
+        WalSndWakeupProcess();
 
         return RecPtr;
     }
@@ -1218,6 +1220,9 @@ begin:;
 
     END_CRIT_SECTION();
 
+    /* wakeup the WalSnd now that we are outside contended locks */
+    WalSndWakeupProcess();
+
     return RecPtr;
 }
@@ -1822,6 +1827,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
             if (finishing_seg || (xlog_switch && last_iteration))
             {
                 issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
+
+                /* signal that we need to wakeup WalSnd later */
+                WalSndWakeupRequest();
+
                 LogwrtResult.Flush = LogwrtResult.Write;        /* end of page */
 
                 if (XLogArchivingActive())
@@ -1886,6 +1895,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
                 openLogOff = 0;
             }
             issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
+
+            /* signal that we need to wakeup WalSnd later */
+            WalSndWakeupRequest();
         }
         LogwrtResult.Flush = LogwrtResult.Write;
     }
@@ -2149,6 +2161,9 @@ XLogFlush(XLogRecPtr record)
 
     END_CRIT_SECTION();
 
+    /* wakeup the WalSnd now that we released the WALWriteLock */
+    WalSndWakeupProcess();
+
     /*
      * If we still haven't flushed to the request point then we have a
      * problem; most likely, the requested flush point is past end of XLOG.
 
@@ -2274,13 +2289,8 @@ XLogBackgroundFlush(void)
 
     END_CRIT_SECTION();
 
-    /*
-     * If we wrote something then we have something to send to standbys also,
-     * otherwise the replication delay become around 7s with just async
-     * commit.
-     */
-    if (wrote_something)
-        WalSndWakeup();
+    /* wakeup the WalSnd now that we released the WALWriteLock */
+    WalSndWakeupProcess();
 
     return wrote_something;
 }
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 65b2fc5..335e9f6 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -418,6 +418,9 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
  * NB: when calling this in a signal handler, be sure to save and restore
  * errno around it.  (That's standard practice in most signal handlers, of
  * course, but we used to omit it in handlers that only set a flag.)
+ *
+ * NB: this function is called from critical sections and signal handlers so
+ * throwing an error is not a good idea.
  */
 void
 SetLatch(volatile Latch *latch)
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index eb46dca..1f1ed33 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -247,6 +247,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     return result;
 }
 
+/*
+ * The comments above the unix implementation (unix_latch.c) of this function
+ * apply here as well.
+ */
 void
 SetLatch(volatile Latch *latch)
 {
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 45a3b2e..e44c734 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -107,6 +107,11 @@ static StringInfoData reply_message;
  */
 static TimestampTz last_reply_timestamp;
 
+/*
+ * State for WalSndWakeupRequest
+ */
+static bool wroteNewXlogData = false;
+
 /* Flags set by signal handlers for later service in main loop */
 static volatile sig_atomic_t got_SIGHUP = false;
 volatile sig_atomic_t walsender_shutdown_requested = false;
 
@@ -1424,7 +1429,12 @@ WalSndShmemInit(void)
     }
 }
 
-/* Wake up all walsenders */
+/*
+ * Wake up all walsenders
+ *
+ * This will be called inside critical sections, so throwing an error is not
+ * advisable.
+ */
 void
 WalSndWakeup(void)
 {
@@ -1434,6 +1444,35 @@ WalSndWakeup(void)
         SetLatch(&WalSndCtl->walsnds[i].latch);
 }
 
+/*
+ * Remember that we want to wakeup walsenders later
+ *
+ * This is separated from doing the actual wakeup because the writeout is done
+ * while holding contended locks.
+ */
+void
+WalSndWakeupRequest(void)
+{
+    wroteNewXlogData = true;
+}
+
+/*
+ * wakeup walsenders if there is work to be done
+ */
+void
+WalSndWakeupProcess(void)
+{
+    if (wroteNewXlogData)
+    {
+        wroteNewXlogData = false;
+
+        /*
+         * Wake up all walsenders to send WAL up to the point where it's
+         * flushed safely to disk.
+         */
+        if (max_wal_senders > 0)
+            WalSndWakeup();
+    }
+}
+
 /* Set state for current walsender (only called in walsender) */
 void
 WalSndSetState(WalSndState state)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 128d2db..38191e7 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -31,6 +31,8 @@ extern void WalSndSignals(void);
 extern Size WalSndShmemSize(void);
 extern void WalSndShmemInit(void);
 extern void WalSndWakeup(void);
+extern void WalSndWakeupRequest(void);
+extern void WalSndWakeupProcess(void);
 extern void WalSndRqstFileReload(void);
 
 extern Datum pg_stat_get_wal_senders(PG_FUNCTION_ARGS);
-- 
1.7.10.rc3.3.g19a6c.dirty



[PATCH 05/16] Preliminary: Introduce Bgworker process

From: Andres Freund
From: Simon Riggs <simon@2ndquadrant.com>

Early prototype that allows for just one bgworker, which calls a function
called do_logicalapply().  Expect major changes in this, but not in ways that
would affect the apply process.
---
 src/backend/postmaster/Makefile               |    4 +-
 src/backend/postmaster/bgworker.c             |  403 +++++++++++++++++++++++++
 src/backend/postmaster/postmaster.c           |   91 ++++--
 src/backend/tcop/postgres.c                   |    5 +
 src/backend/utils/init/miscinit.c             |    5 +-
 src/backend/utils/init/postinit.c             |    3 +-
 src/backend/utils/misc/guc.c                  |   37 ++-
 src/backend/utils/misc/postgresql.conf.sample |    4 +
 src/include/postmaster/bgworker.h             |   29 ++
 9 files changed, 550 insertions(+), 31 deletions(-)
 create mode 100644 src/backend/postmaster/bgworker.c
 create mode 100644 src/include/postmaster/bgworker.h

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3056b09..7b23353 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgwriter.o fork_process.o pgarch.o pgstat.o postmaster.o \
-    startup.o syslogger.o walwriter.o checkpointer.o
+OBJS = autovacuum.o bgworker.o bgwriter.o fork_process.o pgarch.o pgstat.o \
+    postmaster.o startup.o syslogger.o walwriter.o checkpointer.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
new file mode 100644
index 0000000..8144050
--- /dev/null
+++ b/src/backend/postmaster/bgworker.c
@@ -0,0 +1,403 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgworker.c
+ *
+ * PostgreSQL Integrated Worker Daemon
+ *
+ * Background workers can execute arbitrary user code. A shared library
+ * can request creation of a worker using RequestAddinBGWorkerProcess().
+ *
+ * The worker process is forked from the postmaster and then attaches
+ * to shared memory similarly to an autovacuum worker and finally begins
+ * executing the supplied WorkerMain function.
+ *
+ * If the fork() call fails in the postmaster, it will try again later.
+ * Note that the failure can only be transient (fork failure due to
+ * high load, memory pressure, too many processes, etc); more permanent
+ * problems, like failure to connect to a database, are detected later in the
+ * worker and dealt with just by having the worker exit normally.  Postmaster
+ * will launch a new worker again later.
+ *
+ * Note that there can be more than one worker in a database concurrently.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *      src/backend/postmaster/bgworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <sys/types.h>
+#include <sys/time.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "access/heapam.h"
+#include "access/reloptions.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "catalog/dependency.h"
+#include "catalog/namespace.h"
+#include "catalog/pg_database.h"
+#include "commands/dbcommands.h"
+#include "commands/vacuum.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "postmaster/fork_process.h"
+#include "postmaster/postmaster.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/sinvaladt.h"
+#include "tcop/tcopprot.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+#include "utils/rel.h"
+#include "utils/snapmgr.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+#include "utils/tqual.h"
+
+
+/*
+ * GUC parameters
+ */
+int            MaxWorkers;
+
+static int    bgworker_addin_request = 0;
+static bool bgworker_addin_request_allowed = true;
+
+/* Flags to tell if we are in a worker process */
+static bool am_bgworker = false;
+
+/* Flags set by signal handlers */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t got_SIGUSR2 = false;
+static volatile sig_atomic_t got_SIGTERM = false;
+
+static void bgworker_sigterm_handler(SIGNAL_ARGS);
+
+NON_EXEC_STATIC void BgWorkerMain(int argc, char *argv[]);
+
+static bool do_logicalapply(void);
+
+/********************************************************************
+ *                      BGWORKER CODE
+ ********************************************************************/
+
+/* SIGTERM: time to die */
+static void
+bgworker_sigterm_handler(SIGNAL_ARGS)
+{
+    int            save_errno = errno;
+
+    got_SIGTERM = true;
+    if (MyProc)
+        SetLatch(&MyProc->procLatch);
+
+    errno = save_errno;
+}
+
+/*
+ * Main entry point for background worker process, to be called from the
+ * postmaster.
+ *
+ * This code is heavily based on autovacuum.c, q.v.
+ */
+int
+StartBgWorker(void)
+{
+    pid_t        worker_pid;
+
+#ifdef EXEC_BACKEND
+    switch ((worker_pid = bgworker_forkexec()))
+#else
+    switch ((worker_pid = fork_process()))
+#endif
+    {
+        case -1:
+            ereport(LOG,
+                    (errmsg("could not fork worker process: %m")));
+            return 0;
+
+#ifndef EXEC_BACKEND
+        case 0:
+            /* in postmaster child ... */
+            /* Close the postmaster's sockets */
+            ClosePostmasterPorts(false);
+
+            /* Lose the postmaster's on-exit routines */
+            on_exit_reset();
+
+            BgWorkerMain(0, NULL);
+            break;
+#endif
+        default:
+            return (int) worker_pid;
+    }
+
+    /* shouldn't get here */
+    return 0;
+}
+
+/*
+ * BgWorkerMain
+ */
+NON_EXEC_STATIC void
+BgWorkerMain(int argc, char *argv[])
+{
+    sigjmp_buf    local_sigjmp_buf;
+    //Oid            dbid = 12037;        /* kluge to set dbid for "Postgres" */
+    bool        init = false;
+
+    /* we are a postmaster subprocess now */
+    IsUnderPostmaster = true;
+    am_bgworker = true;
+
+    /* reset MyProcPid */
+    MyProcPid = getpid();
+
+    /* record Start Time for logging */
+    MyStartTime = time(NULL);
+
+    /* Identify myself via ps */
+    init_ps_display("worker process", "", "", "");
+
+    SetProcessingMode(InitProcessing);
+
+    /*
+     * If possible, make this process a group leader, so that the postmaster
+     * can signal any child processes too.    (autovacuum probably never has any
+     * child processes, but for consistency we make all postmaster child
+     * processes do this.)
+     */
+#ifdef HAVE_SETSID
+    if (setsid() < 0)
+        elog(FATAL, "setsid() failed: %m");
+#endif
+
+    /*
+     * Set up signal handlers.    We operate on databases much like a regular
+     * backend, so we use the same signal handling.  See equivalent code in
+     * tcop/postgres.c.
+     *
+     * Currently, we don't pay attention to postgresql.conf changes that
+     * happen during a single daemon iteration, so we can ignore SIGHUP.
+     */
+    pqsignal(SIGHUP, SIG_IGN);
+
+    /*
+     * SIGINT is used to signal canceling the current action; SIGTERM
+     * means abort and exit cleanly, and SIGQUIT means abandon ship.
+     */
+    pqsignal(SIGINT, StatementCancelHandler);
+    pqsignal(SIGTERM, bgworker_sigterm_handler); // was    die);
+    pqsignal(SIGQUIT, quickdie);
+    pqsignal(SIGALRM, handle_sig_alarm);
+
+    pqsignal(SIGPIPE, SIG_IGN);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+    pqsignal(SIGUSR2, SIG_IGN);
+    pqsignal(SIGFPE, FloatExceptionHandler);
+    pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Early initialization */
+    BaseInit();
+
+    /*
+     * Create a per-backend PGPROC struct in shared memory, except in the
+     * EXEC_BACKEND case where this was done in SubPostmasterMain. We must do
+     * this before we can use LWLocks (and in the EXEC_BACKEND case we already
+     * had to do some stuff with LWLocks).
+     */
+#ifndef EXEC_BACKEND
+    InitProcess();
+#endif
+
+    /*
+     * If an exception is encountered, processing resumes here.
+     *
+     * See notes in postgres.c about the design of this coding.
+     */
+    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+    {
+        /* Prevents interrupts while cleaning up */
+        HOLD_INTERRUPTS();
+
+        /* Report the error to the server log */
+        EmitErrorReport();
+
+        /*
+         * We can now go away.    Note that because we called InitProcess, a
+         * callback was registered to do ProcKill, which will clean up
+         * necessary state.
+         */
+        proc_exit(0);
+    }
+
+    /* We can now handle ereport(ERROR) */
+    PG_exception_stack = &local_sigjmp_buf;
+
+    PG_SETMASK(&UnBlockSig);
+
+    /*
+     * Force zero_damaged_pages OFF in a worker process, even if it is set
+     * in postgresql.conf.    We don't really want such a dangerous option being
+     * applied non-interactively.
+     */
+    SetConfigOption("zero_damaged_pages", "false", PGC_SUSET, PGC_S_OVERRIDE);
+
+    /*
+     * Force statement_timeout to zero to avoid a timeout setting from
+     * preventing regular maintenance from being executed.
+     */
+    SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
+
+    /*
+     * Force default_transaction_isolation to READ COMMITTED.  We don't
+     * want to pay the overhead of serializable mode, nor add any risk
+     * of causing deadlocks or delaying other transactions.
+     */
+    SetConfigOption("default_transaction_isolation", "read committed",
+                    PGC_SUSET, PGC_S_OVERRIDE);
+
+    /*
+     * Force synchronous replication off to allow regular maintenance even if
+     * we are waiting for standbys to connect. This is important to ensure we
+     * aren't blocked from performing anti-wraparound tasks.
+     */
+    if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+        SetConfigOption("synchronous_commit", "local",
+                        PGC_SUSET, PGC_S_OVERRIDE);
+
+    for (;;)
+    {
+        bool not_idle;
+
+        /* the normal shutdown case */
+        if (got_SIGTERM)
+            break;
+
+        if (got_SIGHUP)
+        {
+            got_SIGHUP = false;
+            ProcessConfigFile(PGC_SIGHUP);
+        }
+
+        if (!init)
+        {
+            char        dbname[NAMEDATALEN] = "postgres";
+
+            /*
+             * Connect to the selected database
+             *
+             * Note: if we have selected a just-deleted database (due to using
+             * stale stats info), we'll fail and exit here.
+             *
+             * Note that MyProcPort is not setup correctly, so normal
+             * authentication will simply fail. This is bypassed by moving
+             * straight to superuser mode, using same trick as autovacuum.
+             */
+            InitPostgres(dbname, InvalidOid, NULL, NULL);
+            SetProcessingMode(NormalProcessing);
+            ereport(LOG,
+                    (errmsg("starting worker process on database \"%s\"", dbname)));
+
+            if (PostAuthDelay)
+                pg_usleep(PostAuthDelay * 1000000L);
+
+            CurrentResourceOwner = ResourceOwnerCreate(NULL, "worker process");
+
+            init = true;
+        }
+
+        /*
+         * If we're initialised correctly we can call the worker code.
+         */
+        if (init)
+            not_idle = do_logicalapply();
+
+        if (!not_idle)
+        {
+            /* just for testing; can be removed */
+            pg_usleep(100000L);
+        }
+    }
+
+    /* Normal exit from the bgworker is here */
+    ereport(LOG,
+            (errmsg("worker shutting down")));
+
+    /* All done, go away */
+    proc_exit(0);
+}
+
+bool
+IsWorkerProcess(void)
+{
+    return am_bgworker;
+}
+
+/*
+ * RequestAddinBgWorkerProcess
+ *        Request a background worker process
+ *
+ * This is only useful if called from the _PG_init hook of a library that
+ * is loaded into the postmaster via shared_preload_libraries.    Once
+ * shared memory has been allocated, calls will be ignored.  (We could
+ * raise an error, but it seems better to make it a no-op, so that
+ * libraries containing such calls can be reloaded if needed.)
+ */
+void
+RequestAddinBgWorkerProcess(const char *WorkerName,
+                            void *Main,
+                            const char *DBname)
+{
+    if (IsUnderPostmaster || !bgworker_addin_request_allowed)
+        return;                    /* too late */
+    bgworker_addin_request++;
+}
+
+/*
+ * Compute number of BgWorkers to allocate.
+ */
+int
+NumBgWorkers(void)
+{
+    return 1;
+
+#ifdef UNUSED
+    int    numWorkers;
+
+    /*
+     * Include number of workers required by server, for example,
+     * parallel query worker tasks.
+     */
+
+    /*
+     * Add any requested by loadable modules.
+     */
+    bgworker_addin_request_allowed = false;
+    numWorkers += bgworker_addin_request;
+
+    return numWorkers;
+#endif
+}
+
+static bool
+do_logicalapply(void)
+{
+    elog(LOG, "doing logical apply");
+    return false;
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eeea933..71cfd6d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "postmaster/bgworker.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/pgarch.h"
 #include "postmaster/postmaster.h"
@@ -131,7 +132,7 @@
  * children we have and send them appropriate signals when necessary.
  *
  * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.    Autovacuum worker and walsender processes are
+ * tasks are not in this list.    All worker and walsender processes are
  * in it. Also, "dead_end" children are in it: these are children launched just
  * for the purpose of sending a friendly rejection message to a would-be
  * client.    We must track them because they are attached to shared memory,
 
@@ -144,6 +145,7 @@ typedef struct bkend
     long        cancel_key;        /* cancel key for cancels for this backend */
     int            child_slot;        /* PMChildSlot for this backend, if any */
     bool        is_autovacuum;    /* is it an autovacuum process? */
+    bool        is_bgworker;    /* is it a bgworker process? */
     bool        dead_end;        /* is it going to send an error and quit? */
     Dlelem        elem;            /* list link in BackendList */
 } Backend;
 
@@ -216,6 +218,8 @@ static pid_t StartupPID = 0,
             PgStatPID = 0,
             SysLoggerPID = 0;
 
+static pid_t *BgWorkerPID;        /* Array of PIDs of bg workers */
+
 /* Startup/shutdown state */
 #define            NoShutdown        0
 #define            SmartShutdown    1
@@ -303,6 +307,8 @@ static volatile sig_atomic_t start_autovac_launcher = false;
 /* the launcher needs to be signalled to communicate some condition */
 static volatile bool avlauncher_needs_signal = false;
 
+static int    NWorkers;
+
 /*
  * State for assigning random salts and cancel keys.
  * Also, the global MyCancelKey passes the cancel key assigned to a given
 
@@ -366,12 +372,16 @@ static bool SignalSomeChildren(int signal, int targets);
 #define BACKEND_TYPE_NORMAL        0x0001    /* normal backend */
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
-#define BACKEND_TYPE_ALL        0x0007    /* OR of all the above */
+#define BACKEND_TYPE_BGWORKER    0x0008    /* general bgworker process */
+#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+
+#define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
 static int    CountChildren(int target);
+static void StartBackgroundWorkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(AuxProcType type);
-static void StartAutovacuumWorker(void);
+static int    StartWorker(bool is_autovacuum);
 static void InitPostmasterDeathWatchHandle(void);
 
 #ifdef EXEC_BACKEND
@@ -1037,7 +1047,7 @@ PostmasterMain(int argc, char *argv[])
      * handling setup of child processes.  See tcop/postgres.c,
      * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
      * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-     * postmaster/syslogger.c and postmaster/checkpointer.c.
+     * postmaster/syslogger.c, postmaster/bgworker.c and postmaster/checkpointer.c.
      */
     pqinitmask();
     PG_SETMASK(&BlockSig);
@@ -1085,6 +1095,17 @@ PostmasterMain(int argc, char *argv[])
     autovac_init();
 
     /*
+     * Allocate background workers actually required.
+     */
+    NWorkers = NumBgWorkers();
+    if (NWorkers > 0)
+    {
+        BgWorkerPID = (pid_t *) MemoryContextAlloc(TopMemoryContext,
+                                      NWorkers * sizeof(pid_t));
+        memset(BgWorkerPID, 0, NWorkers * sizeof(pid_t));
+    }
+
+    /*
      * Load configuration files for client authentication.
      */
     if (!load_hba())
@@ -1428,6 +1449,10 @@ ServerLoop(void)
                 kill(AutoVacPID, SIGUSR2);
         }
 
+        /* Check all the workers requested are running. */
+        if (pmState == PM_RUN)
+            StartBackgroundWorkers();
+
         /*
          * Touch the socket and lock file every 58 minutes, to ensure that
          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
@@ -2133,8 +2158,8 @@ pmdie(SIGNAL_ARGS)
             if (pmState == PM_RUN || pmState == PM_RECOVERY ||
                 pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
             {
-                /* autovacuum workers are told to shut down immediately */
-                SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC);
+                /* workers are told to shut down immediately */
+                SignalSomeChildren(SIGTERM, BACKEND_TYPE_WORKER);
                 /* and the autovac launcher too */
                 if (AutoVacPID != 0)
                     signal_child(AutoVacPID, SIGTERM);
 
@@ -2203,9 +2228,9 @@ pmdie(SIGNAL_ARGS)
             {
                 ereport(LOG,
                         (errmsg("aborting any active transactions")));
-                /* shut down all backends and autovac workers */
+                /* shut down all backends and workers */
                 SignalSomeChildren(SIGTERM,
-                                 BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC);
+                                 BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER);
                 /* and the autovac launcher too */
                 if (AutoVacPID != 0)
                     signal_child(AutoVacPID, SIGTERM);
 
@@ -2396,6 +2421,7 @@ reaper(SIGNAL_ARGS)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
+            StartBackgroundWorkers();
 
             /* at this point we are really open for business */
             ereport(LOG,
@@ -2963,7 +2989,7 @@ PostmasterStateMachine(void)
          * later after writing the checkpoint record, like the archiver
          * process.
          */
-        if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
+        if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
             StartupPID == 0 &&
             WalReceiverPID == 0 &&
             BgWriterPID == 0 &&
@@ -3202,6 +3228,8 @@ SignalSomeChildren(int signal, int target)
             if (bp->is_autovacuum)
                 child = BACKEND_TYPE_AUTOVAC;
+            else if (bp->is_bgworker)
+                child = BACKEND_TYPE_BGWORKER;
             else if (IsPostmasterChildWalSender(bp->child_slot))
                 child = BACKEND_TYPE_WALSND;
             else
@@ -3224,7 +3252,7 @@ SignalSomeChildren(int signal, int target)
  *
  * returns: STATUS_ERROR if the fork failed, STATUS_OK otherwise.
  *
- * Note: if you change this code, also consider StartAutovacuumWorker.
+ * Note: if you change this code, also consider StartWorker.
  */
 static int
 BackendStartup(Port *port)
@@ -3325,6 +3353,7 @@ BackendStartup(Port *port)
      */
     bn->pid = pid;
     bn->is_autovacuum = false;
+    bn->is_bgworker = false;
     DLInitElem(&bn->elem, bn);
     DLAddHead(BackendList, &bn->elem);
 #ifdef EXEC_BACKEND
@@ -4302,7 +4331,7 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER))
     {
         /* The autovacuum launcher wants us to start a worker process. */
-        StartAutovacuumWorker();
+        (void) StartWorker(true);
     }
 
     if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
@@ -4448,6 +4477,8 @@ CountChildren(int target)
             if (bp->is_autovacuum)
                 child = BACKEND_TYPE_AUTOVAC;
+            else if (bp->is_bgworker)
+                child = BACKEND_TYPE_BGWORKER;
             else if (IsPostmasterChildWalSender(bp->child_slot))
                 child = BACKEND_TYPE_WALSND;
             else
@@ -4570,16 +4601,16 @@ StartChildProcess(AuxProcType type)
 }
 
 /*
- * StartAutovacuumWorker
- *        Start an autovac worker process.
+ * StartWorker
+ *        Start a worker process either for autovacuum or more generally.
  *
  * This function is here because it enters the resulting PID into the
  * postmaster's private backends list.
  *
  * NB -- this code very roughly matches BackendStartup.
  */
-static void
-StartAutovacuumWorker(void)
+static int
+StartWorker(bool is_autovacuum)
 {
     Backend    *bn;
@@ -4608,22 +4639,26 @@ StartAutovacuumWorker(void)
             bn->dead_end = false;
             bn->child_slot = MyPMChildSlot = AssignPostmasterChildSlot();
 
-            bn->pid = StartAutoVacWorker();
+            if (is_autovacuum)
+                bn->pid = StartAutoVacWorker();
+            else
+                bn->pid = StartBgWorker();
+
             if (bn->pid > 0)
             {
-                bn->is_autovacuum = true;
+                bn->is_autovacuum = is_autovacuum;
                 DLInitElem(&bn->elem, bn);
                 DLAddHead(BackendList, &bn->elem);
 #ifdef EXEC_BACKEND
                 ShmemBackendArrayAdd(bn);
 #endif
                 /* all OK */
-                return;
+                return bn->pid;
             }
 
             /*
              * fork failed, fall through to report -- actual error message was
-             * logged by StartAutoVacWorker
+             * logged by Start...Worker
              */
             (void) ReleasePostmasterChildSlot(bn->child_slot);
             free(bn);
 
@@ -4643,11 +4678,25 @@ StartAutovacuumWorker(void)
      * quick succession between the autovac launcher and postmaster in case
      * things get ugly.
      */
-    if (AutoVacPID != 0)
+    if (is_autovacuum && AutoVacPID != 0)
     {
         AutoVacWorkerFailed();
         avlauncher_needs_signal = true;
     }
+
+    return 0;
+}
+
+static void
+StartBackgroundWorkers(void)
+{
+    int            i;
+
+    for (i = 0; i < NWorkers; i++)
+    {
+        if (BgWorkerPID[i] == 0)
+            BgWorkerPID[i] = StartWorker(false);
+    }
+}
 
 /*
@@ -4687,7 +4736,7 @@ CreateOptsFile(int argc, char *argv[], char *fullprogname)
  *
  * This reports the number of entries needed in per-child-process arrays
  * (the PMChildFlags array, and if EXEC_BACKEND the ShmemBackendArray).
- * These arrays include regular backends, autovac workers and walsenders,
+ * These arrays include regular backends, all workers and walsenders,
  * but not special children nor dead_end children.  This allows the arrays
  * to have a fixed maximum size, to wit the same too-many-children limit
  * enforced by canAcceptConnections().    The exact value isn't too critical
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 51b6df5..5aead05 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -56,6 +56,7 @@
 #include "parser/analyze.h"
 #include "parser/parser.h"
 #include "postmaster/autovacuum.h"
+#include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -2841,6 +2842,10 @@ ProcessInterrupts(void)
             ereport(FATAL,
                     (errcode(ERRCODE_ADMIN_SHUTDOWN),
                      errmsg("terminating autovacuum process due to administrator command")));
+        else if (IsWorkerProcess())
+            ereport(FATAL,
+                    (errcode(ERRCODE_ADMIN_SHUTDOWN),
+                     errmsg("terminating worker process due to administrator command")));
         else if (RecoveryConflictPending && RecoveryConflictRetryable)
         {
             pgstat_report_recovery_conflict(RecoveryConflictReason);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index fb376a0..f7ae60a 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -33,6 +33,7 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "postmaster/autovacuum.h"
+#include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -498,9 +499,9 @@ InitializeSessionUserIdStandalone(void)
 {
     /*
      * This function should only be called in single-user mode and in
-     * autovacuum workers.
+     * autovacuum or background workers.
      */
-    AssertState(!IsUnderPostmaster || IsAutoVacuumWorkerProcess());
+    AssertState(!IsUnderPostmaster || IsAutoVacuumWorkerProcess() || IsWorkerProcess());
 
     /* call only once */
     AssertState(!OidIsValid(AuthenticatedUserId));
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1baa67d..3208b5e7 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -36,6 +36,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/bgworker.h"
 #include "replication/walsender.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
@@ -584,7 +585,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,     * In standalone mode and
inautovacuum worker processes, we use a fixed     * ID, otherwise we figure it out from the authenticated user name.
*/
 
-    if (bootstrap || IsAutoVacuumWorkerProcess())
+    if (bootstrap || IsAutoVacuumWorkerProcess() || IsWorkerProcess())    {
InitializeSessionUserIdStandalone();       am_superuser = true;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b756e58..93c798b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -52,6 +52,7 @@
 #include "parser/scansup.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "postmaster/bgworker.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/syslogger.h"
@@ -107,7 +108,7 @@
  * removed, we still could not exceed INT_MAX/4 because some places compute
  * 4*MaxBackends without any overflow check.  This is rechecked in
  * check_maxconnections, since MaxBackends is computed as MaxConnections
- * plus autovacuum_max_workers plus one (for the autovacuum launcher).
+ * plus max_workers plus autovacuum_max_workers plus one (for the autovacuum launcher).
  */
 #define MAX_BACKENDS	0x7fffff
@@ -197,6 +198,8 @@
 static const char *show_tcp_keepalives_interval(void);
 static const char *show_tcp_keepalives_count(void);
 static bool check_maxconnections(int *newval, void **extra, GucSource source);
 static void assign_maxconnections(int newval, void *extra);
+static bool check_maxworkers(int *newval, void **extra, GucSource source);
+static void assign_maxworkers(int newval, void *extra);
 static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource source);
 static void assign_autovacuum_max_workers(int newval, void *extra);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 
@@ -1605,6 +1608,16 @@ static struct config_int ConfigureNamesInt[] =
    },

    {
+        {"max_workers", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
+            gettext_noop("Sets the maximum number of background worker processes."),
+            NULL
+        },
+        &MaxWorkers,
+        10, 1, MAX_BACKENDS,
+        check_maxworkers, assign_maxworkers, NULL
+    },
+
+    {
        {"superuser_reserved_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
            gettext_noop("Sets the number of connection slots reserved for superusers."),
            NULL
 
@@ -8605,7 +8618,7 @@ show_tcp_keepalives_count(void)
 static bool
 check_maxconnections(int *newval, void **extra, GucSource source)
 {
-    if (*newval + autovacuum_max_workers + 1 > MAX_BACKENDS)
+    if (*newval + MaxWorkers + autovacuum_max_workers + 1 > MAX_BACKENDS)
        return false;
    return true;
 }
@@ -8613,13 +8626,27 @@ check_maxconnections(int *newval, void **extra, GucSource source)
 static void
 assign_maxconnections(int newval, void *extra)
 {
-    MaxBackends = newval + autovacuum_max_workers + 1;
+    MaxBackends = newval + MaxWorkers + autovacuum_max_workers + 1;
+}
+
+static bool
+check_maxworkers(int *newval, void **extra, GucSource source)
+{
+    if (*newval + MaxConnections + autovacuum_max_workers + 1 > MAX_BACKENDS)
+        return false;
+    return true;
+}
+
+static void
+assign_maxworkers(int newval, void *extra)
+{
+    MaxBackends = newval + MaxConnections + autovacuum_max_workers + 1;
 }

 static bool
 check_autovacuum_max_workers(int *newval, void **extra, GucSource source)
 {
-    if (MaxConnections + *newval + 1 > MAX_BACKENDS)
+    if (MaxConnections + MaxWorkers + *newval + 1 > MAX_BACKENDS)
        return false;
    return true;
 }
@@ -8627,7 +8654,7 @@ check_autovacuum_max_workers(int *newval, void **extra, GucSource source)
 static void
 assign_autovacuum_max_workers(int newval, void *extra)
 {
-    MaxBackends = MaxConnections + newval + 1;
+    MaxBackends = MaxConnections + MaxWorkers + newval + 1;
 }

 static bool
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index fa75d00..ce3fc08 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -148,6 +148,10 @@
 #bgwriter_lru_maxpages = 100		# 0-1000 max buffers written/round
 #bgwriter_lru_multiplier = 2.0		# 0-10.0 multiplier on buffers scanned/round

+# - Background Workers -
+#max_workers = 10			# max number of general worker subprocesses
+					# (change requires restart)
+
 # - Asynchronous Behavior -

 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
diff --git a/src/include/postmaster/bgworker.h b/src/include/postmaster/bgworker.h
new file mode 100644
index 0000000..92d0a75
--- /dev/null
+++ b/src/include/postmaster/bgworker.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgworker.h
+ *      header file for integrated background worker daemon
+ *
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/postmaster/bgworker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BGWORKER_H
+#define BGWORKER_H
+
+
+/* GUC variables */
+extern int	MaxWorkers;
+
+extern int StartBgWorker(void);
+extern int NumBgWorkers(void);
+
+extern bool IsWorkerProcess(void);
+extern void RequestAddinBgWorkerProcess(const char *WorkerName,
+                                        void *Main,
+                                        const char *DBname);
+
+#endif   /* BGWORKER_H */
-- 
1.7.10.rc3.3.g19a6c.dirty
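
To make the intended use of this API concrete, here is a minimal sketch of how an extension loaded via shared_preload_libraries could register a worker; the worker name, entry point, and target database are invented for illustration:

#include "postgres.h"
#include "fmgr.h"

#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

/* hypothetical worker entry point; a real worker would loop and do work */
static void
example_worker_main(void)
{
	elog(LOG, "example worker started");
}

void
_PG_init(void)
{
	/*
	 * Each worker consumes one of the max_workers slots accounted for in
	 * MaxBackends (MaxConnections + max_workers + autovacuum_max_workers + 1)
	 * and is connected to the given database with a fixed user id.
	 */
	RequestAddinBgWorkerProcess("example worker", example_worker_main, "postgres");
}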



From: Andres Freund <andres@anarazel.de>

This adds a new wal_level value 'logical'

Missing cases:
- heap_multi_insert
- primary key changes for updates
- no primary key
- LOG_NEWPAGE
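
Before diving into the diff, the core trick in the heapam.c changes can be condensed as follows (a simplified excerpt, not the patch itself): with wal_level = logical the tuple header and data are logged in rdata entries that are not attached to the buffer, so they cannot be elided in favour of a full-page image, while an extra contentless entry keeps the backup-block machinery intact:

/* condensed from the heap_insert() changes below */
bool	need_tuple = wal_level == WAL_LEVEL_LOGICAL;

/* tuple header and data: detached from the buffer when logical, so a
 * decoder always finds them in the record, even after a full-page write */
rdata[1].buffer = need_tuple ? InvalidBuffer : buffer;
rdata[2].buffer = need_tuple ? InvalidBuffer : buffer;

/* extra, contentless entry referencing the buffer, so XLogInsert can
 * still decide whether to emit a full-page image for it */
if (need_tuple)
{
	rdata[2].next = &(rdata[3]);
	rdata[3].data = NULL;
	rdata[3].len = 0;
	rdata[3].buffer = buffer;
	rdata[3].buffer_std = true;
	rdata[3].next = NULL;
}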
---
 src/backend/access/heap/heapam.c        |  135 ++++++++++++++++++++++++++++---
 src/backend/access/transam/xlog.c       |    1 +
 src/backend/catalog/index.c             |   74 +++++++++++++++++
 src/bin/pg_controldata/pg_controldata.c |    2 +
 src/include/access/xlog.h               |    3 +-
 src/include/catalog/index.h             |    4 +
 6 files changed, 207 insertions(+), 12 deletions(-)
 

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9519e73..9149d53 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -52,6 +52,7 @@
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1937,10 +1938,19 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
        xl_heap_insert xlrec;
        xl_heap_header xlhdr;
        XLogRecPtr	recptr;
-        XLogRecData rdata[3];
+        XLogRecData rdata[4];
        Page		page = BufferGetPage(buffer);
        uint8		info = XLOG_HEAP_INSERT;
+        /*
+         * For the logical replication case we need the tuple data even if
+         * we're doing a full-page write. We could alternatively store a
+         * pointer into the full-page image instead.  For that to work we add
+         * another rdata entry for the buffer in that case.
+         */
+        bool		need_tuple = wal_level == WAL_LEVEL_LOGICAL;
+
        xlrec.all_visible_cleared = all_visible_cleared;
        xlrec.target.node = relation->rd_node;
        xlrec.target.tid = heaptup->t_self;
 
@@ -1960,18 +1970,32 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
         */
        rdata[1].data = (char *) &xlhdr;
        rdata[1].len = SizeOfHeapHeader;
-        rdata[1].buffer = buffer;
+        rdata[1].buffer = need_tuple ? InvalidBuffer : buffer;
        rdata[1].buffer_std = true;
        rdata[1].next = &(rdata[2]);

        /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
        rdata[2].data = (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits);
        rdata[2].len = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
-        rdata[2].buffer = buffer;
+        rdata[2].buffer = need_tuple ? InvalidBuffer : buffer;
        rdata[2].buffer_std = true;
        rdata[2].next = NULL;

+        /*
+         * Add a record for the buffer without actual content; it is removed
+         * if a full-page write is done for that buffer.
+         */
+        if (need_tuple)
+        {
+            rdata[2].next = &(rdata[3]);
+
+            rdata[3].data = NULL;
+            rdata[3].len = 0;
+            rdata[3].buffer = buffer;
+            rdata[3].buffer_std = true;
+            rdata[3].next = NULL;
+        }
+
        /*
         * If this is the single and first tuple on page, we can reinit the
         * page instead of restoring the whole thing.  Set flag, and hide
         * buffer references from XLogInsert.
@@ -1980,7 +2004,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
            PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
        {
            info |= XLOG_HEAP_INIT_PAGE;
-            rdata[1].buffer = rdata[2].buffer = InvalidBuffer;
+            rdata[1].buffer = rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
        }

        recptr = XLogInsert(RM_HEAP_ID, info, rdata);
 
@@ -2568,7 +2592,9 @@ l1:
    {
        xl_heap_delete xlrec;
        XLogRecPtr	recptr;
-        XLogRecData rdata[2];
+        XLogRecData rdata[4];
+
+        bool		need_tuple = wal_level == WAL_LEVEL_LOGICAL && relation->rd_id >= FirstNormalObjectId;

        xlrec.all_visible_cleared = all_visible_cleared;
        xlrec.target.node = relation->rd_node;
@@ -2584,6 +2610,73 @@ l1:
        rdata[1].buffer_std = true;
        rdata[1].next = NULL;
+        /*
+         * XXX: We could decide not to log changes when the origin is not the
+         * local node, that should reduce redundant logging.
+         */
+        if (need_tuple)
+        {
+            xl_heap_header xlhdr;
+
+            Oid indexoid = InvalidOid;
+            int16 pknratts;
+            int16 pkattnum[INDEX_MAX_KEYS];
+            Oid pktypoid[INDEX_MAX_KEYS];
+            Oid pkopclass[INDEX_MAX_KEYS];
+            TupleDesc desc = RelationGetDescr(relation);
+            Relation index_rel;
+            TupleDesc indexdesc;
+            int natt;
+
+            Datum idxvals[INDEX_MAX_KEYS];
+            bool idxisnull[INDEX_MAX_KEYS];
+            HeapTuple idxtuple;
+
+            MemSet(pkattnum, 0, sizeof(pkattnum));
+            MemSet(pktypoid, 0, sizeof(pktypoid));
+            MemSet(pkopclass, 0, sizeof(pkopclass));
+            MemSet(idxvals, 0, sizeof(idxvals));
+            MemSet(idxisnull, 0, sizeof(idxisnull));
+            relationFindPrimaryKey(relation, &indexoid, &pknratts, pkattnum, pktypoid, pkopclass);
+
+            if (!indexoid)
+            {
+                elog(WARNING, "Could not find primary key for table with oid %u",
+                     relation->rd_id);
+                goto no_index_found;
+            }
+
+            index_rel = index_open(indexoid, AccessShareLock);
+
+            indexdesc = RelationGetDescr(index_rel);
+
+            for (natt = 0; natt < indexdesc->natts; natt++)
+            {
+                idxvals[natt] =
+                    fastgetattr(&tp, pkattnum[natt], desc, &idxisnull[natt]);
+                Assert(!idxisnull[natt]);
+            }
+
+            idxtuple = heap_form_tuple(indexdesc, idxvals, idxisnull);
+
+            xlhdr.t_infomask2 = idxtuple->t_data->t_infomask2;
+            xlhdr.t_infomask = idxtuple->t_data->t_infomask;
+            xlhdr.t_hoff = idxtuple->t_data->t_hoff;
+
+            rdata[1].next = &(rdata[2]);
+            rdata[2].data = (char*)&xlhdr;
+            rdata[2].len = SizeOfHeapHeader;
+            rdata[2].buffer = InvalidBuffer;
+            rdata[2].next = NULL;
+
+            rdata[2].next = &(rdata[3]);
+            rdata[3].data = (char *) idxtuple->t_data + offsetof(HeapTupleHeaderData, t_bits);
+            rdata[3].len = idxtuple->t_len - offsetof(HeapTupleHeaderData, t_bits);
+            rdata[3].buffer = InvalidBuffer;
+            rdata[3].next = NULL;
+
+            heap_close(index_rel, NoLock);
+        no_index_found:
+            ;
+        }
+
        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE, rdata);

        PageSetLSN(page, recptr);
@@ -4413,9 +4506,14 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
    xl_heap_header xlhdr;
    uint8		info;
    XLogRecPtr	recptr;
-    XLogRecData rdata[4];
+    XLogRecData rdata[5];
    Page		page = BufferGetPage(newbuf);
+    /*
+     * Just as for XLOG_HEAP_INSERT we need to make sure the tuple data is
+     * logged even if a full-page write is performed.
+     */
+    bool		need_tuple = wal_level == WAL_LEVEL_LOGICAL;
+
    /* Caller should not call me on a non-WAL-logged relation */
    Assert(RelationNeedsWAL(reln));
@@ -4446,28 +4544,43 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
    xlhdr.t_hoff = newtup->t_data->t_hoff;

    /*
-     * As with insert records, we need not store the rdata[2] segment if we
-     * decide to store the whole buffer instead.
+     * As with insert's logging, we need not store the Datum containing the
+     * tuple separately from the buffer, unless we are doing logical
+     * replication.
     */
    rdata[2].data = (char *) &xlhdr;
    rdata[2].len = SizeOfHeapHeader;
-    rdata[2].buffer = newbuf;
+    rdata[2].buffer = need_tuple ? InvalidBuffer : newbuf;
    rdata[2].buffer_std = true;
    rdata[2].next = &(rdata[3]);

    /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
    rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
    rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
-    rdata[3].buffer = newbuf;
+    rdata[3].buffer = need_tuple ? InvalidBuffer : newbuf;
    rdata[3].buffer_std = true;
    rdata[3].next = NULL;
+    /*
+     * Separate storage for the buffer reference of the new page in the
+     * wal_level=logical case.
+     */
+    if (need_tuple)
+    {
+        rdata[3].next = &(rdata[4]);
+
+        rdata[4].data = NULL,
+        rdata[4].len = 0;
+        rdata[4].buffer = newbuf;
+        rdata[4].buffer_std = true;
+        rdata[4].next = NULL;
+    }
+
    /* If new tuple is the single and first tuple on page... */
    if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
        PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
    {
        info |= XLOG_HEAP_INIT_PAGE;
-        rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
+        rdata[2].buffer = rdata[3].buffer = rdata[4].buffer = InvalidBuffer;
    }

    recptr = XLogInsert(RM_HEAP_ID, info, rdata);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 166efb0..c6feed0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -105,6 +105,7 @@ const struct config_enum_entry wal_level_options[] = {
    {"minimal", WAL_LEVEL_MINIMAL, false},
    {"archive", WAL_LEVEL_ARCHIVE, false},
    {"hot_standby", WAL_LEVEL_HOT_STANDBY, false},
+    {"logical", WAL_LEVEL_LOGICAL, false},
    {NULL, 0, false}
 };
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9e8b1cc..4cddcac 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -49,6 +49,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "parser/parser.h"
+#include "parser/parse_relation.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -3311,3 +3312,76 @@ ResetReindexPending(void)
 {
    pendingReindexedIndexes = NIL;
 }
+
+/*
+ * relationFindPrimaryKey
+ *        Find primary key for a relation if it exists.
+ *
+ * If no primary key is found *indexOid is set to InvalidOid
+ *
+ * This is quite similar to tablecmd.c's transformFkeyGetPrimaryKey.
+ *
+ * XXX: It might be a good idea to change pg_class.relhaspkey into a bool to
+ * make this more efficient.
+ */
+void
+relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+                       int16 *nratts, int16 *attnums, Oid *atttypids,
+                       Oid *opclasses)
+{
+    List *indexoidlist;
+    ListCell *indexoidscan;
+    HeapTuple indexTuple = NULL;
+    Datum indclassDatum;
+    bool isnull;
+    oidvector  *indclass;
+    int i;
+    Form_pg_index indexStruct = NULL;
+
+    *indexOid = InvalidOid;
+
+    indexoidlist = RelationGetIndexList(pkrel);
+
+    foreach(indexoidscan, indexoidlist)
+    {
+        Oid indexoid = lfirst_oid(indexoidscan);
+
+        indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
+        if (!HeapTupleIsValid(indexTuple))
+            elog(ERROR, "cache lookup failed for index %u", indexoid);
+
+        indexStruct = (Form_pg_index) GETSTRUCT(indexTuple);
+        if (indexStruct->indisprimary && indexStruct->indimmediate)
+        {
+            *indexOid = indexoid;
+            break;
+        }
+        ReleaseSysCache(indexTuple);
+
+    }
+    list_free(indexoidlist);
+
+    if (!OidIsValid(*indexOid))
+        return;
+
+    /* Must get indclass the hard way */
+    indclassDatum = SysCacheGetAttr(INDEXRELID, indexTuple,
+                                    Anum_pg_index_indclass, &isnull);
+    Assert(!isnull);
+    indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+    *nratts = indexStruct->indnatts;
+    /*
+     * Now build the list of PK attributes from the indkey definition (we
+     * assume a primary key cannot have expressional elements)
+     */
+    for (i = 0; i < indexStruct->indnatts; i++)
+    {
+        int            pkattno = indexStruct->indkey.values[i];
+
+        attnums[i] = pkattno;
+        atttypids[i] = attnumTypeId(pkrel, pkattno);
+        opclasses[i] = indclass->values[i];
+    }
+
+    ReleaseSysCache(indexTuple);
+}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index c00183a..47715c9 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -82,6 +82,8 @@ wal_level_str(WalLevel wal_level)
            return "archive";
        case WAL_LEVEL_HOT_STANDBY:
            return "hot_standby";
+        case WAL_LEVEL_LOGICAL:
+            return "logical";
    }
    return _("unrecognized wal_level");
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index df5f232..2843aca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -199,7 +199,8 @@ typedef enum WalLevel
 {
    WAL_LEVEL_MINIMAL = 0,
    WAL_LEVEL_ARCHIVE,
-    WAL_LEVEL_HOT_STANDBY
+    WAL_LEVEL_HOT_STANDBY,
+    WAL_LEVEL_LOGICAL
 } WalLevel;

 extern int	wal_level;
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 7c8198f..2ba0ac3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -101,4 +101,8 @@ extern bool ReindexIsProcessingHeap(Oid heapOid);
 extern bool ReindexIsProcessingIndex(Oid indexOid);
 extern Oid	IndexGetRelation(Oid indexId, bool missing_ok);
 
+extern void relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+                                   int16 *nratts, int16 *attnums, Oid *atttypids,
+                                   Oid *opclasses);
+#endif   /* INDEX_H */
-- 
1.7.10.rc3.3.g19a6c.dirty
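
As a usage sketch for the relationFindPrimaryKey() function added above (the calling convention is taken from the patch; the reporting function itself is invented):

#include "postgres.h"

#include "catalog/index.h"
#include "utils/rel.h"

/* sketch: report the primary key columns of an already-opened relation */
static void
report_primary_key(Relation rel)
{
	Oid		indexoid = InvalidOid;
	int16	nratts;
	int16	attnums[INDEX_MAX_KEYS];
	Oid		atttypids[INDEX_MAX_KEYS];
	Oid		opclasses[INDEX_MAX_KEYS];
	int		i;

	relationFindPrimaryKey(rel, &indexoid, &nratts,
						   attnums, atttypids, opclasses);

	if (!OidIsValid(indexoid))
	{
		/* the delete path above warns and skips logging pkey data */
		elog(WARNING, "no primary key for relation %u",
			 RelationGetRelid(rel));
		return;
	}

	for (i = 0; i < nratts; i++)
		elog(LOG, "pk column %d: attnum %d, type oid %u",
			 i, attnums[i], atttypids[i]);
}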



From: Andres Freund <andres@anarazel.de>

This requires an up-to-date catalog and can thus only be run on a replica.

Missing:
- HEAP_NEWPAGE support
- HEAP2_MULTI_INSERT support
- DDL integration. *No* DDL, including TRUNCATE, is possible atm
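
The decoder below is meant to be driven from an xlogreader callback, roughly like this (a sketch: the callback signature is assumed; only ReaderApplyState and DecodeRecordIntoApplyCache are from the patch):

#include "postgres.h"

#include "replication/decode.h"

/*
 * Sketch of a per-record callback handed to the xlogreader.  The apply
 * cache accumulates changes per transaction; nothing is replayed until the
 * transaction's commit record is decoded.
 */
static void
decode_record_cb(XLogRecordBuffer *buf, void *private_data)
{
	ReaderApplyState *state = (ReaderApplyState *) private_data;

	DecodeRecordIntoApplyCache(state->apply_cache, buf);
}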
---
 src/backend/replication/logical/Makefile |    2 +-
 src/backend/replication/logical/decode.c |  439 ++++++++++++++++++++++++++++++
 src/include/replication/decode.h         |   23 ++
 3 files changed, 463 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/replication/logical/decode.c
 create mode 100644 src/include/replication/decode.h

diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index 2eadab8..7dd9663 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -14,6 +14,6 @@ include $(top_builddir)/src/Makefile.global

 override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)

-OBJS = applycache.o
+OBJS = applycache.o decode.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
new file mode 100644
index 0000000..7e07d50
--- /dev/null
+++ b/src/backend/replication/logical/decode.c
@@ -0,0 +1,439 @@
+/*-------------------------------------------------------------------------
+ *
+ * decode.c
+ *
+ * Decodes wal records from an xlogreader.h callback into an applycache
+ *
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *      src/backend/replication/logical/decode.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xact.h"
+
+#include "replication/applycache.h"
+#include "replication/decode.h"
+
+#include "utils/memutils.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+
+static void DecodeXLogTuple(char* data, Size len,
+                            HeapTuple table, ApplyCacheTupleBuf* tuple);
+
+static void DecodeInsert(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeUpdate(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeDelete(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeNewpage(ApplyCache *cache, XLogRecordBuffer* buf);
+static void DecodeMultiInsert(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeCommit(ApplyCache* cache, XLogRecordBuffer* buf, TransactionId xid,
+                         TransactionId *sub_xids, int nsubxacts);
+
+
+void DecodeRecordIntoApplyCache(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+    XLogRecord* r = &buf->record;
+    uint8 info = r->xl_info & ~XLR_INFO_MASK;
+
+    switch (r->xl_rmid)
+    {
+        case RM_HEAP_ID:
+        {
+            info &= XLOG_HEAP_OPMASK;
+            switch (info)
+            {
+                case XLOG_HEAP_INSERT:
+                    DecodeInsert(cache, buf);
+                    break;
+
+                /* no guarantee that we get a HOT update again, so handle it as a normal update */
+                case XLOG_HEAP_HOT_UPDATE:
+                case XLOG_HEAP_UPDATE:
+                    DecodeUpdate(cache, buf);
+                    break;
+
+                case XLOG_HEAP_NEWPAGE:
+                    DecodeNewpage(cache, buf);
+                    break;
+
+                case XLOG_HEAP_DELETE:
+                    DecodeDelete(cache, buf);
+                    break;
+                default:
+                    break;
+            }
+            break;
+        }
+        case RM_HEAP2_ID:
+        {
+            info &= XLOG_HEAP_OPMASK;
+            switch (info)
+            {
+                case XLOG_HEAP2_MULTI_INSERT:
+                    /* this also handles the XLOG_HEAP_INIT_PAGE case */
+                    DecodeMultiInsert(cache, buf);
+                    break;
+                default:
+                    /* everything else here is just physical stuff we're not interested in */
+                    break;
+            }
+            break;
+        }
+
+        case RM_XACT_ID:
+        {
+            switch (info)
+            {
+                case XLOG_XACT_COMMIT:
+                {
+                    TransactionId *sub_xids;
+                    xl_xact_commit *xlrec = (xl_xact_commit*)buf->record_data;
+
+                    /* FIXME: this is not really allowed if there are no subtransactions */
+                    sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+                    DecodeCommit(cache, buf, r->xl_xid, sub_xids, xlrec->nsubxacts);
+
+                    break;
+                }
+                case XLOG_XACT_COMMIT_PREPARED:
+                {
+                    TransactionId *sub_xids;
+                    xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared*)buf->record_data;
+
+                    sub_xids = (TransactionId *) &(xlrec->crec.xnodes[xlrec->crec.nrels]);
+
+                    DecodeCommit(cache, buf, r->xl_xid, sub_xids,
+                                 xlrec->crec.nsubxacts);
+
+                    break;
+                }
+                case XLOG_XACT_COMMIT_COMPACT:
+                {
+                    xl_xact_commit_compact *xlrec = (xl_xact_commit_compact*)buf->record_data;
+                    DecodeCommit(cache, buf, r->xl_xid, xlrec->subxacts,
+                                 xlrec->nsubxacts);
+                    break;
+                }
+                case XLOG_XACT_ABORT:
+                case XLOG_XACT_ABORT_PREPARED:
+                {
+                    TransactionId *sub_xids;
+                    xl_xact_abort *xlrec = (xl_xact_abort*)buf->record_data;
+                    int i;
+
+                    /* FIXME: this is not really allowed if there are no subtransactions */
+                    sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+
+                    for(i = 0; i < xlrec->nsubxacts; i++)
+                    {
+                        ApplyCacheAbort(cache, *sub_xids, buf->origptr);
+                        sub_xids += 1;
+                    }
+
+                    /* TODO: check that this also contains not-yet-aborted subtxns */
+                    ApplyCacheAbort(cache, r->xl_xid, buf->origptr);
+
+                    elog(WARNING, "ABORT %u", r->xl_xid);
+                    break;
+                }
+                case XLOG_XACT_ASSIGNMENT:
+                    /*
+                     * XXX: We could reassign transactions to the parent here
+                     * to save space and effort when merging transactions at
+                     * commit.
+                     */
+                    break;
+                case XLOG_XACT_PREPARE:
+                    /*
+                     * FIXME: we should replay the transaction and prepare it
+                     * as well.
+                     */
+                    break;
+                default:
+                    break;
+            }
+            break;
+        }
+        default:
+            break;
+    }
+}
+
+static void
+DecodeCommit(ApplyCache* cache, XLogRecordBuffer* buf, TransactionId xid,
+             TransactionId *sub_xids, int nsubxacts)
+{
+    int i;
+
+    for (i = 0; i < nsubxacts; i++)
+    {
+        ApplyCacheCommitChild(cache, xid, *sub_xids, buf->origptr);
+        sub_xids++;
+    }
+
+    /* replay actions of all transaction + subtransactions in order */
+    ApplyCacheCommit(cache, xid, buf->origptr);
+}
+
+static void DecodeInsert(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+    XLogRecord* r = &buf->record;
+    xl_heap_insert *xlrec = (xl_heap_insert *) buf->record_data;
+
+    Oid relfilenode = xlrec->target.node.relNode;
+
+    ApplyCacheChange* change;
+
+    if (r->xl_info & XLR_BKP_BLOCK_1 &&
+        r->xl_len < (SizeOfHeapInsert + SizeOfHeapHeader))
+    {
+        elog(FATAL, "huh, no tuple data on wal_level = logical?");
+    }
+
+    if (relfilenode == 0)
+    {
+        elog(ERROR, "nailed catalog changed");
+    }
+
+    change = ApplyCacheGetChange(cache);
+    change->action = APPLY_CACHE_CHANGE_INSERT;
+
+    /*
+     * Lookup the pg_class entry for the relfilenode to get the real oid
+     */
+    {
+        MemoryContext curctx = MemoryContextSwitchTo(TopMemoryContext);
+        change->table = SearchSysCacheCopy1(RELFILENODE,
+                                            relfilenode);
+        MemoryContextSwitchTo(curctx);
+    }
+
+    if (!HeapTupleIsValid(change->table))
+    {
+#ifdef SHOULD_BE_HANDLED_BETTER
+        elog(WARNING, "cache lookup failed for relfilenode %u, systable?",
+             relfilenode);
+#endif
+        ApplyCacheReturnChange(cache, change);
+        return;
+    }
+
+    if (HeapTupleGetOid(change->table) < FirstNormalObjectId)
+    {
+#ifdef VERBOSE_DEBUG
+        elog(LOG, "skipping change to systable");
+#endif
+        ApplyCacheReturnChange(cache, change);
+        return;
+    }
+
+#ifdef VERBOSE_DEBUG
+    {
+        /*for accessing the cache */
+        Form_pg_class class_form;
+        class_form = (Form_pg_class) GETSTRUCT(change->table);
+        elog(WARNING, "INSERT INTO \"%s\"", NameStr(class_form->relname));
+    }
+#endif
+
+    change->newtuple = ApplyCacheGetTupleBuf(cache);
+
+    DecodeXLogTuple((char*)xlrec + SizeOfHeapInsert,
+                    r->xl_len - SizeOfHeapInsert,
+                    change->table, change->newtuple);
+
+    ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+static void
+DecodeUpdate(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+    XLogRecord* r = &buf->record;
+    xl_heap_update *xlrec = (xl_heap_update *) buf->record_data;
+
+    Oid relfilenode = xlrec->target.node.relNode;
+
+    ApplyCacheChange* change;
+
+    if ((r->xl_info & XLR_BKP_BLOCK_1 || r->xl_info & XLR_BKP_BLOCK_2) &&
+        (r->xl_len < (SizeOfHeapUpdate + SizeOfHeapHeader)))
+    {
+        elog(FATAL, "huh, no tuple data on wal_level = logical?");
+    }
+
+    change = ApplyCacheGetChange(cache);
+    change->action = APPLY_CACHE_CHANGE_UPDATE;
+
+    /*
+     * Lookup the pg_class entry for the relfilenode to get the real oid
+     */
+    {
+        MemoryContext curctx = MemoryContextSwitchTo(TopMemoryContext);
+        change->table = SearchSysCacheCopy1(RELFILENODE,
+                                            relfilenode);
+        MemoryContextSwitchTo(curctx);
+    }
+
+    if (!HeapTupleIsValid(change->table))
+    {
+#ifdef SHOULD_BE_HANDLED_BETTER
+        elog(WARNING, "cache lookup failed for relfilenode %u, systable?",
+             relfilenode);
+#endif
+        ApplyCacheReturnChange(cache, change);
+        return;
+    }
+
+    if (HeapTupleGetOid(change->table) < FirstNormalObjectId)
+    {
+#ifdef VERBOSE_DEBUG
+        elog(LOG, "skipping change to systable");
+#endif
+        ApplyCacheReturnChange(cache, change);
+        return;
+    }
+
+#ifdef VERBOSE_DEBUG
+    {
+        /*for accessing the cache */
+        Form_pg_class class_form;
+        class_form = (Form_pg_class) GETSTRUCT(change->table);
+        elog(WARNING, "UPDATE \"%s\"", NameStr(class_form->relname));
+    }
+#endif
+
+    /* FIXME: need to save the old tuple as well if we want primary key updates to work. */
+    change->newtuple = ApplyCacheGetTupleBuf(cache);
+
+    DecodeXLogTuple((char*)xlrec + SizeOfHeapUpdate,
+                    r->xl_len - SizeOfHeapUpdate,
+                    change->table, change->newtuple);
+
+    ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+static void DecodeDelete(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+    XLogRecord* r = &buf->record;
+
+    xl_heap_delete *xlrec = (xl_heap_delete *) buf->record_data;
+
+    Oid relfilenode = xlrec->target.node.relNode;
+
+    ApplyCacheChange* change;
+
+    change = ApplyCacheGetChange(cache);
+    change->action = APPLY_CACHE_CHANGE_DELETE;
+
+    if (r->xl_len <= (SizeOfHeapDelete + SizeOfHeapHeader))
+    {
+        elog(FATAL, "huh, no primary key for a delete on wal_level = logical?");
+    }
+
+    /*
+     * Lookup the pg_class entry for the relfilenode to get the real oid
+     */
+    {
+        MemoryContext curctx = MemoryContextSwitchTo(TopMemoryContext);
+        change->table = SearchSysCacheCopy1(RELFILENODE,
+                                            relfilenode);
+        MemoryContextSwitchTo(curctx);
+    }
+
+    if (!HeapTupleIsValid(change->table))
+    {
+#ifdef SHOULD_BE_HANDLED_BETTER
+        elog(WARNING, "cache lookup failed for relfilenode %u, systable?",
+             relfilenode);
+#endif
+        ApplyCacheReturnChange(cache, change);
+        return;
+    }
+
+    if (HeapTupleGetOid(change->table) < FirstNormalObjectId)
+    {
+#ifdef VERBOSE_DEBUG
+        elog(LOG, "skipping change to systable");
+#endif
+        ApplyCacheReturnChange(cache, change);
+        return;
+    }
+
+#ifdef VERBOSE_DEBUG
+    {
+        /*for accessing the cache */
+        Form_pg_class class_form;
+        class_form = (Form_pg_class) GETSTRUCT(change->table);
+        elog(WARNING, "DELETE FROM \"%s\"", NameStr(class_form->relname));
+    }
+#endif
+
+    change->oldtuple = ApplyCacheGetTupleBuf(cache);
+
+    DecodeXLogTuple((char*)xlrec + SizeOfHeapDelete,
+                    r->xl_len - SizeOfHeapDelete,
+                    change->table, change->oldtuple);
+
+    ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+
+static void
+DecodeNewpage(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+    elog(WARNING, "skipping XLOG_HEAP_NEWPAGE record because we are too dumb");
+}
+
+static void
+DecodeMultiInsert(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+    elog(WARNING, "skipping XLOG_HEAP2_MULTI_INSERT record because we are too dumb");
+}
+
+
+static void DecodeXLogTuple(char* data, Size len,
+                            HeapTuple table, ApplyCacheTupleBuf* tuple)
+{
+    xl_heap_header xlhdr;
+    int datalen = len - SizeOfHeapHeader;
+
+    Assert(datalen >= 0);
+    Assert(datalen <= MaxHeapTupleSize);
+
+    tuple->tuple.t_len = datalen + offsetof(HeapTupleHeaderData, t_bits);
+
+    /* not a disk based tuple */
+    ItemPointerSetInvalid(&tuple->tuple.t_self);
+
+    /* probably not needed, but ... (is it actually valid to set it?) */
+    tuple->tuple.t_tableOid = HeapTupleGetOid(table);
+    tuple->tuple.t_data = &tuple->header;
+
+    /* data is not stored aligned */
+    memcpy((char *) &xlhdr,
+           data,
+           SizeOfHeapHeader);
+
+    memset(&tuple->header, 0, sizeof(HeapTupleHeaderData));
+
+    memcpy((char *) &tuple->header + offsetof(HeapTupleHeaderData, t_bits),
+           data + SizeOfHeapHeader,
+           datalen);
+
+    tuple->header.t_infomask = xlhdr.t_infomask;
+    tuple->header.t_infomask2 = xlhdr.t_infomask2;
+    tuple->header.t_hoff = xlhdr.t_hoff;
+}
diff --git a/src/include/replication/decode.h b/src/include/replication/decode.h
new file mode 100644
index 0000000..53088e2
--- /dev/null
+++ b/src/include/replication/decode.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ * decode.h
+ *     PostgreSQL WAL to logical transformation
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DECODE_H
+#define DECODE_H
+
+#include "access/xlogreader.h"
+#include "replication/applycache.h"
+
+void DecodeRecordIntoApplyCache(ApplyCache *cache, XLogRecordBuffer* buf);
+
+typedef struct ReaderApplyState
+{
+    ApplyCache *apply_cache;
+} ReaderApplyState;
+
+#endif
-- 
1.7.10.rc3.3.g19a6c.dirty



From: Andres Freund <andres@anarazel.de>

For that, add a 'node_id' parameter to most functions dealing with WAL
segments. A node_id of 'InvalidMultimasterNodeId' references local WAL;
every other node_id refers to WAL in the new pg_lcr directory.

Duplicating the relevant code instead would reduce the impact of this change,
but the long-term code-maintenance burden outweighs that by far.

Apart from the decision to add a 'node_id' parameter to several functions,
the changes in this patch are fairly mechanical.
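
To illustrate the resulting on-disk layout, the reworked XLogFilePath() (see xlog_internal.h below) produces paths like these (timeline, log, and seg values invented):

#include "postgres.h"

#include "access/xlog_internal.h"

static void
show_segment_paths(void)
{
	char	path[MAXPGPATH];

	/* local WAL keeps the familiar layout */
	XLogFilePath(path, 1, InvalidMultimasterNodeId, 0, 42);
	/* path is now "pg_xlog/00000001000000000000002A" */

	/* the LCR stream received from node 3 lives in a per-node directory */
	XLogFilePath(path, 1, 3, 0, 42);
	/* path is now "pg_lcr/3/00000001000000000000002A" */
}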
---
 src/backend/access/transam/xlog.c           |   54 ++++++++++++++++-----------
 src/backend/replication/basebackup.c        |    4 +-
 src/backend/replication/walreceiver.c       |    2 +-
 src/backend/replication/walsender.c         |    9 +++--
 src/bin/initdb/initdb.c                     |    1 +
 src/bin/pg_resetxlog/pg_resetxlog.c         |    2 +-
 src/include/access/xlog.h                   |    2 +-
 src/include/access/xlog_internal.h          |   13 +++++--
 src/include/replication/logical.h           |    2 +
 src/include/replication/walsender_private.h |    2 +-
 10 files changed, 56 insertions(+), 35 deletions(-)
 

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 504b4d0..0622726 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -635,8 +635,8 @@ static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
 static bool AdvanceXLInsertBuffer(bool new_segment);
 static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
-static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
-                       bool find_free, int *max_advance,
+static bool InstallXLogFileSegment(RepNodeId node_id, uint32 *log, uint32 *seg,
+                       char *tmppath, bool find_free, int *max_advance,
                        bool use_lock);
 static int	XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
             int source, bool notexistOk);
 
@@ -1736,8 +1736,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
            /* create/use new log file */
            use_existent = true;
-            openLogFile = XLogFileInit(openLogId, openLogSeg,
-                                       &use_existent, true);
+            openLogFile = XLogFileInit(InvalidMultimasterNodeId, openLogId,
+                                       openLogSeg, &use_existent, true);
            openLogOff = 0;
        }
@@ -2376,6 +2376,9 @@ XLogNeedsFlush(XLogRecPtr record)
  * place.  This should be TRUE except during bootstrap log creation.  The
  * caller must *not* hold the lock at call.
  *
+ * node_id: if != InvalidMultimasterNodeId this xlog file is actually an LCR
+ * file
+ *
  * Returns FD of opened file.
  *
  * Note: errors here are ERROR not PANIC because we might or might not be
@@ -2384,8 +2387,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(uint32 log, uint32 seg,
-             bool *use_existent, bool use_lock)
+XLogFileInit(RepNodeId node_id, uint32 log, uint32 seg,
+             bool *use_existent, bool use_lock)
 {
    char		path[MAXPGPATH];
    char		tmppath[MAXPGPATH];
@@ -2396,7 +2399,7 @@ XLogFileInit(uint32 log, uint32 seg,
    int			fd;
    int			nbytes;

-    XLogFilePath(path, ThisTimeLineID, log, seg);
+    XLogFilePath(path, ThisTimeLineID, node_id, log, seg);

    /*
     * Try to use existent file (checkpoint maker may have created it already)
 
@@ -2425,6 +2428,11 @@ XLogFileInit(uint32 log, uint32 seg,
     */
    elog(DEBUG2, "creating and filling new WAL file");

+    /*
+     * FIXME: to be safe we need to create the tempfile in the pg_lcr
+     * directory if it's actually an LCR file, because pg_lcr might be on a
+     * different partition.
+     */
    snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());

    unlink(tmppath);
@@ -2493,7 +2501,7 @@ XLogFileInit(uint32 log, uint32 seg,
    installed_log = log;
    installed_seg = seg;
    max_advance = XLOGfileslop;
-    if (!InstallXLogFileSegment(&installed_log, &installed_seg, tmppath,
+    if (!InstallXLogFileSegment(node_id, &installed_log, &installed_seg, tmppath,
                                *use_existent, &max_advance,
                                use_lock))
    {
 
@@ -2548,7 +2556,7 @@ XLogFileCopy(uint32 log, uint32 seg,
    /*
     * Open the source file
     */
-    XLogFilePath(path, srcTLI, srclog, srcseg);
+    XLogFilePath(path, srcTLI, InvalidMultimasterNodeId, srclog, srcseg);
    srcfd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
    if (srcfd < 0)
        ereport(ERROR,
 
@@ -2619,7 +2627,8 @@ XLogFileCopy(uint32 log, uint32 seg,
    /*
     * Now move the segment into place with its final name.
     */
-    if (!InstallXLogFileSegment(&log, &seg, tmppath, false, NULL, false))
+    if (!InstallXLogFileSegment(InvalidMultimasterNodeId, &log, &seg, tmppath,
+                                false, NULL, false))
        elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
@@ -2653,14 +2662,14 @@ XLogFileCopy(uint32 log, uint32 seg,
  * file into place.
  */
 static bool
-InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
+InstallXLogFileSegment(RepNodeId node_id, uint32 *log, uint32 *seg, char *tmppath,
                       bool find_free, int *max_advance,
                       bool use_lock)
 {
    char		path[MAXPGPATH];
    struct stat stat_buf;

-    XLogFilePath(path, ThisTimeLineID, *log, *seg);
+    XLogFilePath(path, ThisTimeLineID, node_id, *log, *seg);

    /*
     * We want to be sure that only one process does this at a time.
 
@@ -2687,7 +2696,7 @@ InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
            }
            NextLogSeg(*log, *seg);
            (*max_advance)--;
-            XLogFilePath(path, ThisTimeLineID, *log, *seg);
+            XLogFilePath(path, ThisTimeLineID, node_id, *log, *seg);
        }
    }
@@ -2736,7 +2745,7 @@ XLogFileOpen(uint32 log, uint32 seg)
    char		path[MAXPGPATH];
    int			fd;

-    XLogFilePath(path, ThisTimeLineID, log, seg);
+    XLogFilePath(path, ThisTimeLineID, InvalidMultimasterNodeId, log, seg);

    fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
                       S_IRUSR | S_IWUSR);
 
@@ -2783,7 +2792,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
        case XLOG_FROM_PG_XLOG:
        case XLOG_FROM_STREAM:
-            XLogFilePath(path, tli, log, seg);
+            XLogFilePath(path, tli, InvalidMultimasterNodeId, log, seg);
            restoredFromArchive = false;
            break;
 
@@ -2804,7 +2813,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
        bool		reload = false;
        struct stat statbuf;

-        XLogFilePath(xlogfpath, tli, log, seg);
+        XLogFilePath(xlogfpath, tli, InvalidMultimasterNodeId, log, seg);

        if (stat(xlogfpath, &statbuf) == 0)
        {
            if (unlink(xlogfpath) != 0)
 
@@ -2922,7 +2931,7 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
    }

    /* Couldn't find it.  For simplicity, complain about front timeline */
-    XLogFilePath(path, recoveryTargetTLI, log, seg);
+    XLogFilePath(path, recoveryTargetTLI, InvalidMultimasterNodeId, log, seg);
    errno = ENOENT;
    ereport(emode,
            (errcode_for_file_access(),
 
@@ -3366,7 +3375,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
    {
        NextLogSeg(_logId, _logSeg);
        use_existent = true;
-        lf = XLogFileInit(_logId, _logSeg, &use_existent, true);
+        lf = XLogFileInit(InvalidMultimasterNodeId, _logId, _logSeg,
+                          &use_existent, true);
        close(lf);
        if (!use_existent)
            CheckpointStats.ckpt_segs_added++;
@@ -3486,8 +3496,9 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
                 * separate archive directory.
                 */
                if (lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
-                    InstallXLogFileSegment(&endlogId, &endlogSeg, path,
-                                           true, &max_advance, true))
+                    InstallXLogFileSegment(InvalidMultimasterNodeId, &endlogId,
+                                           &endlogSeg, path, true,
+                                           &max_advance, true))
                {
                    ereport(DEBUG2,
                            (errmsg("recycled transaction log file \"%s\"",
 
@@ -5255,7 +5266,8 @@ BootStrapXLOG(void)
    /* Create first XLOG segment file */
    use_existent = false;
-    openLogFile = XLogFileInit(0, 1, &use_existent, false);
+    openLogFile = XLogFileInit(InvalidMultimasterNodeId, 0, 1,
+                               &use_existent, false);

    /* Write the first page with the initial record */
    errno = 0;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0bc88a4..47e4641 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -245,7 +245,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
            char		fn[MAXPGPATH];
            int			i;

-            XLogFilePath(fn, ThisTimeLineID, logid, logseg);
+            XLogFilePath(fn, ThisTimeLineID, InvalidMultimasterNodeId, logid, logseg);
            _tarWriteHeader(fn, NULL, &statbuf);

            /* Send the actual WAL file contents, block-by-block */
 
@@ -264,7 +264,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
                * http://lists.apple.com/archives/xcode-users/2003/Dec//msg000
                * 51.html
                */
-                XLogRead(buf, ptr, TAR_SEND_SIZE);
+                XLogRead(buf, InvalidMultimasterNodeId, ptr, TAR_SEND_SIZE);
                if (pq_putmessage('d', buf, TAR_SEND_SIZE))
                    ereport(ERROR,
                            (errmsg("base backup could not send data, aborting backup")));
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 650b74f..e97196b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -509,7 +509,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
            /* Create/use new log file */
            XLByteToSeg(recptr, recvId, recvSeg);
            use_existent = true;
-            recvFile = XLogFileInit(recvId, recvSeg, &use_existent, true);
+            recvFile = XLogFileInit(InvalidMultimasterNodeId, recvId, recvSeg, &use_existent, true);
            recvOff = 0;
        }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e44c734..8cd3a00 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -977,7 +977,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogRead(char *buf, RepNodeId node_id, XLogRecPtr startptr, Size count)
 {
    char	   *p;
    XLogRecPtr	recptr;
@@ -1009,8 +1009,8 @@ retry:
                close(sendFile);

            XLByteToSeg(recptr, sendId, sendSeg);
-            XLogFilePath(path, ThisTimeLineID, sendId, sendSeg);
-
+            XLogFilePath(path, ThisTimeLineID, node_id,
+                         sendId, sendSeg);
            sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
            if (sendFile < 0)
            {
 
@@ -1215,7 +1215,8 @@ XLogSend(char *msgbuf, bool *caughtup)
     * Read the log directly into the output buffer to avoid extra memcpy
     * calls.
     */
-    XLogRead(msgbuf + 1 + sizeof(WalDataMessageHeader), startptr, nbytes);
+    XLogRead(msgbuf + 1 + sizeof(WalDataMessageHeader), InvalidMultimasterNodeId,
+             startptr, nbytes);

    /*
     * We fill the message header last so that the send timestamp is taken as
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3789948..1f26382 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2637,6 +2637,7 @@ main(int argc, char *argv[])
        "global",
        "pg_xlog",
        "pg_xlog/archive_status",
+        "pg_lcr",
        "pg_clog",
        "pg_notify",
        "pg_serial",
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 65ba910..7ee3a3a 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -973,7 +973,7 @@ WriteEmptyXLOG(void)
    /* Write the first page */
    XLogFilePath(path, ControlFile.checkPointCopy.ThisTimeLineID,
-                 newXlogId, newXlogSeg);
+                 InvalidMultimasterNodeId, newXlogId, newXlogSeg);

    unlink(path);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index dd89cff..3b02c0b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -268,7 +268,7 @@ extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int XLogFileInit(uint32 log, uint32 seg,
+extern int XLogFileInit(RepNodeId node_id, uint32 log, uint32 seg,
             bool *use_existent, bool use_lock);
 extern int	XLogFileOpen(uint32 log, uint32 seg);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 3328a50..deadddf 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -19,6 +19,7 @@
 #include "access/xlog.h"
 #include "fmgr.h"
 #include "pgtime.h"
+#include "replication/logical.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
@@ -216,14 +217,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define MAXFNAMELEN	64

 #define XLogFileName(fname, tli, log, seg)	\
-    snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, log, seg)
+    snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, log, seg);

 #define XLogFromFileName(fname, tli, log, seg)	\
    sscanf(fname, "%08X%08X%08X", tli, log, seg)
 
-#define XLogFilePath(path, tli, log, seg)	\
-    snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli, log, seg)
-
 #define TLHistoryFileName(fname, tli)	\
    snprintf(fname, MAXFNAMELEN, "%08X.history", tli)
@@ -239,6 +237,13 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define BackupHistoryFilePath(path, tli, log, seg, offset)	\
    snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X.%08X.backup", tli, log, seg, offset)

+/* FIXME: move to xlogutils.c, needs to fix sharing with receivexlog.c first though */
+static inline int
+XLogFilePath(char *path, TimeLineID tli, RepNodeId node_id, uint32 log, uint32 seg)
+{
+    if (node_id == InvalidMultimasterNodeId)
+        return snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli, log, seg);
+    else
+        return snprintf(path, MAXPGPATH, LCRDIR "/%d/%08X%08X%08X", node_id, tli, log, seg);
+}

 /*
  * Method table for resource managers.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 0698b61..8f44fad 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -19,4 +19,6 @@ extern XLogRecPtr current_replication_origin_lsn;
 #define InvalidMultimasterNodeId 0
 #define MaxMultimasterNodeId (2<<3)

+#define LCRDIR				"pg_lcr"
+
 #endif
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 66234cd..bc58ff4 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -95,7 +95,7 @@ extern WalSndCtlData *WalSndCtl;
 extern void WalSndSetState(WalSndState state);

-extern void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+extern void XLogRead(char *buf, RepNodeId node_id, XLogRecPtr startptr, Size count);

 /*
  * Internal functions for parsing the replication grammar, in repl_gram.y and
 
-- 
1.7.10.rc3.3.g19a6c.dirty



[PATCH 12/16] Add state to keep track of logical replication

From
Andres Freund
Date:
From: Andres Freund <andres@anarazel.de>

In order to have restartable replication with minimal additional writes, it is
very useful to know up to which point we have replayed/received changes from a
foreign node.

One representation of that is the lsn of changes at the originating cluster.

We need to keep track of the point up to which we received data and up to where
we applied data.

For that we added a field 'origin_lsn' to commit records. This allows us to
keep track of the apply position across crash recovery with minimal additional
I/O. We only added the field to non-compact commit records to reduce the
overhead in case logical replication is not used.

Checkpoints need to keep track of the apply/receive positions as well because
otherwise it would be hard to determine the lsn from where to restart
receive/apply after a shutdown/crash if no changes happened since the last
shutdown/crash.

The startup process, the walreceiver, and a (future) apply process all need a
coherent picture of those two states, so we add shared memory state to keep
track of them. Currently this is represented in the walreceiver's shared
memory segment. This will likely need to change.

During crash recovery/physical replication the origin_lsn field of commit
records is used to update the shared memory, and thus the next checkpoint's,
notion of the apply state.

Missing:

- For correct crash recovery we need more state than the 'apply lsn' because
  transactions on the originating side can overlap. At the lsn we just applied,
  many other transactions can be in progress. To handle that correctly we need
  to keep track of the oldest start lsn of any transaction currently being
  reassembled (c.f. ApplyCache). Then we can start reassembling the ApplyCache
  from that point and throw away any transaction that committed before the
  recorded/recovered apply lsn. It should be sufficient to store that knowledge
  in shared memory and checkpoint records.
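
For illustration, a (future) apply process could pick its restart point from that shared state roughly like this (a sketch: WalRcv, mm_applyState, and the spinlock convention are from the patch; the function itself is invented):

#include "postgres.h"

#include "replication/logical.h"
#include "replication/walreceiver.h"
#include "storage/spin.h"

/* sketch: LSN up to which changes from node_id are known to be applied */
static XLogRecPtr
apply_restart_lsn(RepNodeId node_id)
{
	XLogRecPtr	lsn;

	SpinLockAcquire(&WalRcv->mutex);
	lsn = WalRcv->mm_applyState[node_id];
	SpinLockRelease(&WalRcv->mutex);

	return lsn;
}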
 
---
 src/backend/access/transam/xact.c          |   22 ++++++++-
 src/backend/access/transam/xlog.c          |   73 ++++++++++++++++++++++++++++
 src/backend/replication/walreceiverfuncs.c |    8 +++
 src/include/access/xact.h                  |    1 +
 src/include/catalog/pg_control.h           |   13 ++++-
 src/include/replication/walreceiver.h      |   13 +++++
 6 files changed, 128 insertions(+), 2 deletions(-)
 

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dc30a17..40ac965 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -39,11 +39,13 @@
 #include "replication/logical.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
+#include "replication/walreceiver.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "utils/combocid.h"
 #include "utils/guc.h"
 #include "utils/inval.h"
@@ -1015,7 +1017,8 @@ RecordTransactionCommit(void)
        /*
         * Do we need the long commit record? If not, use the compact format.
         */
-        if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval || forceSyncCommit)
+        if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval || forceSyncCommit ||
+            (wal_level == WAL_LEVEL_LOGICAL && current_replication_origin_id != guc_replication_origin_id))
        {
            XLogRecData rdata[4];
            int			lastrdata = 0;
 
@@ -1037,6 +1040,8 @@ RecordTransactionCommit(void)
            xlrec.nrels = nrels;
            xlrec.nsubxacts = nchildren;
            xlrec.nmsgs = nmsgs;
+            xlrec.origin_lsn = current_replication_origin_lsn;
+
            rdata[0].data = (char *) (&xlrec);
            rdata[0].len = MinSizeOfXactCommit;
            rdata[0].buffer = InvalidBuffer;
 
@@ -4575,6 +4580,21 @@ xact_redo_commit_internal(TransactionId xid, RepNodeId originating_node,
        LWLockRelease(XidGenLock);
    }
 
+    /*
+     * Record where we're at wrt recovery. We need that to know from where on
+     * to restart applying logical change records.
+     */
+    if (LogicalWalReceiverActive() && !XLByteEQ(origin_lsn, zeroRecPtr))
+    {
+        /*
+         * probably we don't need the locking because no lcr receiver can run
+         * yet.
+         */
+        SpinLockAcquire(&WalRcv->mutex);
+        WalRcv->mm_applyState[originating_node] = origin_lsn;
+        SpinLockRelease(&WalRcv->mutex);
+    }
+
    if (standbyState == STANDBY_DISABLED)
    {
        /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0622726..20a4611 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5183,6 +5183,7 @@ BootStrapXLOG(void)
    uint64		sysidentifier;
    struct timeval tv;
    pg_crc32	crc;
+    int			i;

    /*
     * Select a hopefully-unique system identifier code for this installation.
@@ -5229,6 +5230,13 @@ BootStrapXLOG(void)
    checkPoint.time = (pg_time_t) time(NULL);
    checkPoint.oldestActiveXid = InvalidTransactionId;

+    for (i = InvalidMultimasterNodeId + 1; i < MaxMultimasterNodeId; i++)
+    {
+        checkPoint.logicalReceiveState[i] = zeroRecPtr;
+        checkPoint.logicalApplyState[i] = zeroRecPtr;
+    }
+
    ShmemVariableCache->nextXid = checkPoint.nextXid;
    ShmemVariableCache->nextOid = checkPoint.nextOid;
    ShmemVariableCache->oidCount = 0;
 
@@ -6314,6 +6322,53 @@ StartupXLOG(void)
        InRecovery = true;
    }
+    /*
+     * Set up shared memory state for the logical wal receiver.
+     *
+     * Do this unconditionally so enabling/disabling/enabling logical replay
+     * doesn't lose information due to rewriting pg_control.
+     */
+    {
+        int i;
+
+        Assert(WalRcv);
+        /* locking is not really required here afaics, but ... */
+        SpinLockAcquire(&WalRcv->mutex);
+
+        for (i = InvalidMultimasterNodeId + 1; i < MaxMultimasterNodeId - 1;
+            i++)
+        {
+            XLogRecPtr* receiveState = &ControlFile->checkPointCopy.logicalReceiveState[i];
+            XLogRecPtr* applyState = &ControlFile->checkPointCopy.logicalApplyState[i];
+            if (i == guc_replication_origin_id &&
+                (!XLByteEQ(*receiveState, zeroRecPtr) ||
+                 !XLByteEQ(*applyState, zeroRecPtr)))
+            {
+                elog(WARNING, "logical recovery state for own db. apply: %X/%X, receive %X/%X, origin %d",
+                     applyState->xlogid, applyState->xrecoff,
+                     receiveState->xlogid, receiveState->xrecoff,
+                     guc_replication_origin_id);
+                WalRcv->mm_receiveState[i] = zeroRecPtr;
+                WalRcv->mm_applyState[i] = zeroRecPtr;
+            }
+            else
+            {
+                WalRcv->mm_receiveState[i] = *receiveState;
+                WalRcv->mm_applyState[i] = *applyState;
+            }
+        }
+        SpinLockRelease(&WalRcv->mutex);
+
+        /* FIXME: remove at some point */
+        for (i = InvalidMultimasterNodeId + 1; i < MaxMultimasterNodeId - 1; i++)
+        {
+            elog(LOG, "restored apply state for node %d to %X/%X, receive %X/%X",
+                 i,
+                 WalRcv->mm_applyState[i].xlogid, WalRcv->mm_applyState[i].xrecoff,
+                 WalRcv->mm_receiveState[i].xlogid, WalRcv->mm_receiveState[i].xrecoff);
+        }
+    }
+
     /* REDO */
     if (InRecovery)
     {
@@ -7906,6 +7961,24 @@ CreateCheckPoint(int flags)
                              &checkPoint.nextMultiOffset);
 
+    /*
+     * fill out where we are at wrt logical replay. Do this unconditionally so
+     * we don't lose information due to rewriting pg_control when toggling
+     * logical replay
+     */
+    {
+        int i;
+        SpinLockAcquire(&WalRcv->mutex);
+
+        for (i = InvalidMultimasterNodeId + 1; i < MaxMultimasterNodeId - 1; i++)
+        {
+            checkPoint.logicalApplyState[i] = WalRcv->mm_applyState[i];
+            checkPoint.logicalReceiveState[i] = WalRcv->mm_receiveState[i];
+        }
+        SpinLockRelease(&WalRcv->mutex);
+        elog(LOG, "updated logical checkpoint data");
+    }
+
+
     /*
      * Having constructed the checkpoint record, ensure all shmem disk buffers
      * and commit-log buffers are flushed to disk.
      *
 
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 876196f..cb49282 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -64,6 +64,14 @@ WalRcvShmemInit(void)
         MemSet(WalRcv, 0, WalRcvShmemSize());
         WalRcv->walRcvState = WALRCV_STOPPED;
         SpinLockInit(&WalRcv->mutex);
 
+
+        memset(&WalRcv->mm_receiveState,
+               0, sizeof(WalRcv->mm_receiveState));
+        memset(&WalRcv->mm_applyState,
+               0, sizeof(WalRcv->mm_applyState));
+
+        memset(&WalRcv->mm_receiveLatch,
+               0, sizeof(WalRcv->mm_receiveLatch));
     }
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b12d2a0..2757782 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -137,6 +137,7 @@ typedef struct xl_xact_commit
     int            nmsgs;            /* number of shared inval msgs */
     Oid            dbId;            /* MyDatabaseId */
     Oid            tsId;            /* MyDatabaseTableSpace */
+    XLogRecPtr    origin_lsn;        /* location of originating commit */
 
     /* Array of RelFileNode(s) to drop at commit */
     RelFileNode xnodes[1];        /* VARIABLE LENGTH ARRAY */
     /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
 
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 5cff396..bc6316e 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -16,12 +16,13 @@
 #define PG_CONTROL_H
 
 #include "access/xlogdefs.h"
+#include "replication/logical.h"
 #include "pgtime.h"                /* for pg_time_t */
 #include "utils/pg_crc.h"
 
 /* Version identifier for this pg_control format */
 
-#define PG_CONTROL_VERSION    922
+#define PG_CONTROL_VERSION    923
 
 /*
  * Body of CheckPoint XLOG records.  This is declared here because we keep
@@ -50,6 +51,13 @@ typedef struct CheckPoint
      * it's set to InvalidTransactionId.
      */
     TransactionId oldestActiveXid;
+
+    /*
+     * The replay state from every other node. This is only needed if wal_level
+     * >= logical and thus is only filled then.
+     */
+    XLogRecPtr logicalApplyState[MaxMultimasterNodeId - 1];
+    XLogRecPtr logicalReceiveState[MaxMultimasterNodeId - 1];
 } CheckPoint;
 
 /* XLOG info values for XLOG rmgr */
@@ -85,6 +93,9 @@ typedef enum DBState
  * NOTE: try to keep this under 512 bytes so that it will fit on one physical
  * sector of typical disk drives.  This reduces the odds of corruption due to
  * power failure midway through a write.
+ *
+ * FIXME: in order to allow many nodes in mm (which increases checkpoint size)
+ * we should change the writing of this to
+ * write(temp_file); fsync(); rename(); fsync();
  */
 typedef struct ControlFileData
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index d21ec94..c9ab1be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -14,6 +14,8 @@
 #include "access/xlog.h"
 #include "access/xlogdefs.h"
+#include "replication/logical.h"
+#include "storage/latch.h"
 #include "storage/spin.h"
 #include "pgtime.h"
@@ -90,6 +92,17 @@ typedef struct
     char        conninfo[MAXCONNINFO];
     slock_t        mutex;            /* locks shared variables shown above */
 
+
+    /*
+     * receive/replay position for every node
+     * XXX: should possibly be dynamically sized?
+     * FIXME: should go to its own shm segment?
+     */
+    XLogRecPtr  mm_receiveState[MaxMultimasterNodeId - 1];
+    XLogRecPtr  mm_applyState[MaxMultimasterNodeId - 1];
+
+    Latch*       mm_receiveLatch[MaxMultimasterNodeId - 1];
+} WalRcvData;
 
 extern WalRcvData *WalRcv;
-- 
1.7.10.rc3.3.g19a6c.dirty



From: Andres Freund <andres@anarazel.de>

The individual changes need to be identified by an xid. The xid can be a
subtransaction or a toplevel one; at commit those can be reintegrated by doing
a k-way mergesort between the individual transactions.

Callbacks for apply_begin, apply_change and apply_commit are provided to
retrieve complete transactions.
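
A minimal sketch of how a consumer of this module might wire those callbacks
up (the callback bodies and setup_apply_cache() are made up for illustration;
ApplyCacheAllocate() and the begin/apply_change/commit members are the ones
declared in applycache.h below):

    #include "replication/applycache.h"

    static void
    my_begin(ApplyCache *cache, ApplyCacheTXN *txn)
    {
        elog(LOG, "begin of xid %u", txn->xid);
    }

    static void
    my_change(ApplyCache *cache, ApplyCacheTXN *txn,
              ApplyCacheTXN *subtxn, ApplyCacheChange *change)
    {
        /* apply one INSERT/UPDATE/DELETE from change->newtuple/oldtuple */
    }

    static void
    my_commit(ApplyCache *cache, ApplyCacheTXN *txn)
    {
        elog(LOG, "commit of xid %u", txn->xid);
    }

    static ApplyCache *
    setup_apply_cache(void)
    {
        ApplyCache *cache = ApplyCacheAllocate();

        cache->begin = my_begin;
        cache->apply_change = my_change;
        cache->commit = my_commit;
        return cache;
    }

The decoding side then only has to feed changes in via ApplyCacheAddChange()
and signal transaction end via ApplyCacheCommit()/ApplyCacheAbort().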

Missing:
- spill-to-disk
- correct subtransaction merge, current behaviour is simple/wrong
- DDL handling (?)
- resource usage controls
---
 src/backend/replication/Makefile              |    2 +
 src/backend/replication/logical/Makefile      |   19 ++
 src/backend/replication/logical/applycache.c  |  380 ++++++++++++++++++++++++++
 src/include/replication/applycache.h          |  185 +++++++++++++
 4 files changed, 586 insertions(+)
 create mode 100644 src/backend/replication/logical/Makefile
 create mode 100644 src/backend/replication/logical/applycache.c
 create mode 100644 src/include/replication/applycache.h

diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 9d9ec87..ae7f6b1 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -17,6 +17,8 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
 
 OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
     repl_gram.o syncrep.o
 
+SUBDIRS = logical
+
 include $(top_srcdir)/src/backend/common.mk
 
 # repl_scanner is compiled as part of repl_gram
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
new file mode 100644
index 0000000..2eadab8
--- /dev/null
+++ b/src/backend/replication/logical/Makefile
@@ -0,0 +1,19 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/replication/logical
+#
+# IDENTIFICATION
+#    src/backend/replication/logical/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/logical
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
+
+OBJS = applycache.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/applycache.c b/src/backend/replication/logical/applycache.c
new file mode 100644
index 0000000..b73b0ba
--- /dev/null
+++ b/src/backend/replication/logical/applycache.c
@@ -0,0 +1,380 @@
+/*-------------------------------------------------------------------------
+ *
+ * applycache.c
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *      src/backend/replication/logical/applycache.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/xact.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "replication/applycache.h"
+
+#include "utils/ilist.h"
+#include "utils/memutils.h"
+#include "utils/relcache.h"
+#include "utils/syscache.h"
+
+const Size max_memtries = 1<<16;
+
+const size_t max_cached_changes = 1024;
+const size_t max_cached_tuplebufs = 1024; /* ~8MB */
+const size_t max_cached_transactions = 512;
+
+typedef struct ApplyCacheTXNByIdEnt
+{
+    TransactionId xid;
+    ApplyCacheTXN* txn;
+} ApplyCacheTXNByIdEnt;
+
+static ApplyCacheTXN* ApplyCacheGetTXN(ApplyCache *cache);
+static void ApplyCacheReturnTXN(ApplyCache *cache, ApplyCacheTXN* txn);
+
+static ApplyCacheTXN* ApplyCacheTXNByXid(ApplyCache*, TransactionId xid, bool create);
+
+
+ApplyCache*
+ApplyCacheAllocate(void)
+{
+    ApplyCache* cache = (ApplyCache*)malloc(sizeof(ApplyCache));
+    HASHCTL         hash_ctl;
+
+    if (!cache)
+        elog(ERROR, "Could not allocate the ApplyCache");
+
+    memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+    cache->context = AllocSetContextCreate(TopMemoryContext,
+                                           "ApplyCache",
+                                           ALLOCSET_DEFAULT_MINSIZE,
+                                           ALLOCSET_DEFAULT_INITSIZE,
+                                           ALLOCSET_DEFAULT_MAXSIZE);
+
+    hash_ctl.keysize = sizeof(TransactionId);
+    hash_ctl.entrysize = sizeof(ApplyCacheTXNByIdEnt);
+    hash_ctl.hash = tag_hash;
+    hash_ctl.hcxt = cache->context;
+
+    cache->by_txn = hash_create("ApplyCacheByXid", 1000, &hash_ctl,
+                                HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+
+    cache->nr_cached_transactions = 0;
+    cache->nr_cached_changes = 0;
+    cache->nr_cached_tuplebufs = 0;
+
+    ilist_d_init(&cache->cached_transactions);
+    ilist_d_init(&cache->cached_changes);
+    ilist_s_init(&cache->cached_tuplebufs);
+
+    return cache;
+}
+
+void ApplyCacheFree(ApplyCache* cache)
+{
+    /* FIXME: check for in-progress transactions */
+    /* FIXME: clean up cached transaction */
+    /* FIXME: clean up cached changes */
+    /* FIXME: clean up cached tuplebufs */
+    hash_destroy(cache->by_txn);
+    free(cache);
+}
+
+static ApplyCacheTXN* ApplyCacheGetTXN(ApplyCache *cache)
+{
+    ApplyCacheTXN* txn;
+
+    if (cache->nr_cached_transactions)
+    {
+        cache->nr_cached_transactions--;
+        txn = ilist_container(ApplyCacheTXN, node,
+                              ilist_d_pop_front(&cache->cached_transactions));
+    }
+    else
+    {
+        txn = (ApplyCacheTXN*)
+            malloc(sizeof(ApplyCacheTXN));
+
+        if (!txn)
+            elog(ERROR, "Could not allocate a ApplyCacheTXN struct");
+    }
+
+    memset(txn, 0, sizeof(ApplyCacheTXN));
+    ilist_d_init(&txn->changes);
+    ilist_d_init(&txn->subtxns);
+    return txn;
+}
+
+void ApplyCacheReturnTXN(ApplyCache *cache, ApplyCacheTXN* txn)
+{
+    if (cache->nr_cached_transactions < max_cached_transactions)
+    {
+        cache->nr_cached_transactions++;
+        ilist_d_push_front(&cache->cached_transactions, &txn->node);
+    }
+    else
+    {
+        free(txn);
+    }
+}
+
+ApplyCacheChange*
+ApplyCacheGetChange(ApplyCache* cache)
+{
+    ApplyCacheChange* change;
+
+    if (cache->nr_cached_changes)
+    {
+        cache->nr_cached_changes--;
+        change = ilist_container(ApplyCacheChange, node,
+                                 ilist_d_pop_front(&cache->cached_changes));
+    }
+    else
+    {
+        change = (ApplyCacheChange*)malloc(sizeof(ApplyCacheChange));
+
+        if (!change)
+            elog(ERROR, "Could not allocate a ApplyCacheChange struct");
+    }
+
+
+    memset(change, 0, sizeof(ApplyCacheChange));
+    return change;
+}
+
+void
+ApplyCacheReturnChange(ApplyCache* cache, ApplyCacheChange* change)
+{
+    if (change->newtuple)
+        ApplyCacheReturnTupleBuf(cache, change->newtuple);
+    if (change->oldtuple)
+        ApplyCacheReturnTupleBuf(cache, change->oldtuple);
+
+    if (change->table)
+        heap_freetuple(change->table);
+
+    if (cache->nr_cached_changes < max_cached_changes)
+    {
+        cache->nr_cached_changes++;
+        ilist_d_push_front(&cache->cached_changes, &change->node);
+    }
+    else
+    {
+        free(change);
+    }
+}
+
+ApplyCacheTupleBuf*
+ApplyCacheGetTupleBuf(ApplyCache* cache)
+{
+    ApplyCacheTupleBuf* tuple;
+
+    if (cache->nr_cached_tuplebufs)
+    {
+        cache->nr_cached_tuplebufs--;
+        tuple = ilist_container(ApplyCacheTupleBuf, node,
+                                ilist_s_pop_front(&cache->cached_tuplebufs));
+    }
+    else
+    {
+        tuple =
+            (ApplyCacheTupleBuf*)malloc(sizeof(ApplyCacheTupleBuf));
+
+        if (!tuple)
+            elog(ERROR, "Could not allocate a ApplyCacheTupleBuf struct");
+    }
+
+    return tuple;
+}
+
+void
+ApplyCacheReturnTupleBuf(ApplyCache* cache, ApplyCacheTupleBuf* tuple)
+{
+    if (cache->nr_cached_tuplebufs < max_cached_tuplebufs)
+    {
+        cache->nr_cached_tuplebufs++;
+        ilist_s_push_front(&cache->cached_tuplebufs, &tuple->node);
+    }
+    else
+    {
+        free(tuple);
+    }
+}
+
+
+static
+ApplyCacheTXN*
+ApplyCacheTXNByXid(ApplyCache* cache, TransactionId xid, bool create)
+{
+    ApplyCacheTXNByIdEnt* ent;
+    bool found;
+
+    ent = (ApplyCacheTXNByIdEnt*)
+        hash_search(cache->by_txn,
+                    (void *)&xid,
+                    (create ? HASH_ENTER : HASH_FIND),
+                    &found);
+
+    if (found)
+    {
+#ifdef VERBOSE_DEBUG
+        elog(LOG, "found cache entry for %u at %p", xid, ent);
+#endif
+    }
+    else
+    {
+#ifdef VERBOSE_DEBUG
+        elog(LOG, "didn't find cache entry for %u in %p at %p, creating %u",
+             xid, cache, ent, create);
+#endif
+    }
+
+    if (!found && !create)
+        return NULL;
+
+    if (!found)
+    {
+        ent->txn = ApplyCacheGetTXN(cache);
+    }
+
+    return ent->txn;
+}
+
+void
+ApplyCacheAddChange(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn,
+                    ApplyCacheChange* change)
+{
+    ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, true);
+    txn->lsn = lsn;
+    ilist_d_push_back(&txn->changes, &change->node);
+}
+
+
+void
+ApplyCacheCommitChild(ApplyCache* cache, TransactionId xid,
+                      TransactionId subxid, XLogRecPtr lsn)
+{
+    ApplyCacheTXN* txn;
+    ApplyCacheTXN* subtxn;
+
+    subtxn = ApplyCacheTXNByXid(cache, subxid, false);
+
+    /*
+     * No need to do anything if that subtxn didn't contain any changes
+     */
+    if (!subtxn)
+        return;
+
+    subtxn->lsn = lsn;
+
+    txn = ApplyCacheTXNByXid(cache, xid, true);
+
+    ilist_d_push_back(&txn->subtxns, &subtxn->node);
+}
+
+void
+ApplyCacheCommit(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn)
+{
+    ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, false);
+    ilist_d_node* cur_change, *next_change;
+    ilist_d_node* cur_txn, *next_txn;
+    bool found;
+
+    if (!txn)
+        return;
+
+    txn->lsn = lsn;
+
+    cache->begin(cache, txn);
+
+    /*
+     * FIXME:
+     * do a k-way mergesort of all changes ordered by xid
+     *
+     * For now we just iterate through all subtransactions and then through the
+     * main txn. But that's *WRONG*.
+     *
+     * The best way to do it is probably to model the current heads of all TXNs
+     * as a heap, always consuming from the one with the smallest lsn, until
+     * all are exhausted.
+     */
+    ilist_d_foreach_modify (cur_txn, next_txn, &txn->subtxns)
+    {
+        ApplyCacheTXN* subtxn = ilist_container(ApplyCacheTXN, node, cur_txn);
+
+        ilist_d_foreach_modify (cur_change, next_change, &subtxn->changes)
+        {
+            ApplyCacheChange* change =
+                ilist_container(ApplyCacheChange, node, cur_change);
+            cache->apply_change(cache, txn, subtxn, change);
+
+            ApplyCacheReturnChange(cache, change);
+        }
+        ApplyCacheReturnTXN(cache, subtxn);
+    }
+
+    ilist_d_foreach_modify (cur_change, next_change, &txn->changes)
+    {
+        ApplyCacheChange* change =
+            ilist_container(ApplyCacheChange, node, cur_change);
+        cache->apply_change(cache, txn, NULL, change);
+
+        ApplyCacheReturnChange(cache, change);
+    }
+
+    cache->commit(cache, txn);
+
+    /* now remove reference from cache */
+    hash_search(cache->by_txn,
+                (void *)&xid,
+                HASH_REMOVE,
+                &found);
+    Assert(found);
+
+    ApplyCacheReturnTXN(cache, txn);
+}
+
+void
+ApplyCacheAbort(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn)
+{
+    ilist_d_node* cur_change, *next_change;
+    ilist_d_node* cur_txn, *next_txn;
+    ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, false);
+    bool found;
+
+    /* no changes in this transaction */
+    if (!txn)
+        return;
+
+    /* iterate through all subtransactions and free memory */
+    ilist_d_foreach_modify (cur_txn, next_txn, &txn->subtxns)
+    {
+        ApplyCacheTXN* subtxn = ilist_container(ApplyCacheTXN, node, cur_txn);
+        ilist_d_foreach_modify (cur_change, next_change, &subtxn->changes)
+        {
+            ApplyCacheChange* change =
+                ilist_container(ApplyCacheChange, node, cur_change);
+            ApplyCacheReturnChange(cache, change);
+        }
+        ApplyCacheReturnTXN(cache, subtxn);
+    }
+
+    ilist_d_foreach_modify (cur_change, next_change, &txn->changes)
+    {
+        ApplyCacheChange* change =
+            ilist_container(ApplyCacheChange, node, cur_change);
+        ApplyCacheReturnChange(cache, change);
+    }
+
+    /* now remove reference from cache */
+    hash_search(cache->by_txn,
+                (void *)&xid,
+                HASH_REMOVE,
+                &found);
+    Assert(found);
+
+    ApplyCacheReturnTXN(cache, txn);
+}
diff --git a/src/include/replication/applycache.h b/src/include/replication/applycache.h
new file mode 100644
index 0000000..4ceba63
--- /dev/null
+++ b/src/include/replication/applycache.h
@@ -0,0 +1,185 @@
+/*
+ * applycache.h
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/replication/applycache.h
+ */
+#ifndef APPLYCACHE_H
+#define APPLYCACHE_H
+
+#include "access/htup.h"
+#include "utils/hsearch.h"
+#include "utils/ilist.h"
+
+typedef struct ApplyCache ApplyCache;
+
+enum ApplyCacheChangeType
+{
+    APPLY_CACHE_CHANGE_INSERT,
+    APPLY_CACHE_CHANGE_UPDATE,
+    APPLY_CACHE_CHANGE_DELETE
+};
+
+typedef struct ApplyCacheTupleBuf
+{
+    /* position in preallocated list */
+    ilist_s_node node;
+
+    HeapTupleData tuple;
+    HeapTupleHeaderData header;
+    char data[MaxHeapTupleSize];
+} ApplyCacheTupleBuf;
+
+typedef struct ApplyCacheChange
+{
+    XLogRecPtr lsn;
+    enum ApplyCacheChangeType action;
+
+    ApplyCacheTupleBuf* newtuple;
+
+    ApplyCacheTupleBuf* oldtuple;
+
+    HeapTuple table;
+
+    /*
+     * While in use this is how a change is linked into a transaction,
+     * otherwise it's the position in the list of preallocated changes.
+     */
+    ilist_d_node node;
+} ApplyCacheChange;
+
+typedef struct ApplyCacheTXN
+{
+    TransactionId xid;
+
+    XLogRecPtr lsn;
+
+    /*
+     * How many ApplyCacheChange's do we have in this txn.
+     *
+     * Subtransactions are *not* included.
+     */
+    Size nentries;
+
+    /*
+     * How many of the above entries are stored in memory in contrast to being
+     * spilled to disk.
+     */
+    Size nentries_mem;
+
+    /*
+     * List of actual changes
+     */
+    ilist_d_head changes;
+
+    /*
+     * non-hierarchical list of subtransactions that are *not* aborted
+     */
+    ilist_d_head subtxns;
+
+    /*
+     * our position in a list of subtransactions while the TXN is in
+     * use. Otherwise it's the position in the list of preallocated
+     * transactions.
+     */
+    ilist_d_node node;
+} ApplyCacheTXN;
+
+
+/* XXX: we're currently passing the originating subtxn. Not sure that's necessary */
+typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, ApplyCacheTXN* txn,
+                                        ApplyCacheTXN* subtxn,
+                                        ApplyCacheChange* change);
 
+typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+
+/*
+ * The max number of concurrent top-level transactions, or transactions where
+ * we don't know whether they are top-level, can be calculated by:
+ * (max_connections + max_prepared_xacts + ?) * PGPROC_MAX_CACHED_SUBXIDS
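+ * (As a purely illustrative example, assuming PGPROC_MAX_CACHED_SUBXIDS = 64
+ * and default-ish settings of max_connections = 100, max_prepared_xacts = 5,
+ * that would be on the order of (100 + 5) * 64 = 6720 entries.)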
+ */
+struct ApplyCache
+{
+    TransactionId last_txn;
+    ApplyCacheTXN *last_txn_cache;
+    HTAB *by_txn;
+
+    ApplyCacheBeginCB begin;
+    ApplyCacheApplyChangeCB apply_change;
+    ApplyCacheCommitCB commit;
+
+    void* private_data;
+
+    MemoryContext context;
+
+    /*
+     * we don't want to repeatedly (de-)allocate those structs, so cache them
+     * for reuse.
+     */
+    ilist_d_head cached_transactions;
+    size_t nr_cached_transactions;
+
+    ilist_d_head cached_changes;
+    size_t nr_cached_changes;
+
+    ilist_s_head cached_tuplebufs;
+    size_t nr_cached_tuplebufs;
+};
+
+
+ApplyCache*
+ApplyCacheAllocate(void);
+
+void
+ApplyCacheFree(ApplyCache*);
+
+ApplyCacheTupleBuf*
+ApplyCacheGetTupleBuf(ApplyCache*);
+
+void
+ApplyCacheReturnTupleBuf(ApplyCache* cache, ApplyCacheTupleBuf* tuple);
+
+/*
+ * Returns a (potentially preallocated) change struct. Its lifetime is managed
+ * by the applycache module.
+ *
+ * If not added to a transaction with ApplyCacheAddChange it needs to be
+ * returned via ApplyCacheReturnChange
+ *
+ * FIXME: better name
+ */
+ApplyCacheChange*
+ApplyCacheGetChange(ApplyCache*);
+
+/*
+ * Return an unused ApplyCacheChange struct
+ */
+void
+ApplyCacheReturnChange(ApplyCache*, ApplyCacheChange*);
+
+
+/*
+ * record the transaction as in-progress if not already done, add the current
+ * change.
+ *
+ * We have a one-entry cache for looking up the current ApplyCacheTXN so we
+ * don't need to do a full hash-lookup if the same xid is used
+ * sequentially. The same xid being used multiple times in a row is rather
+ * frequent.
+ */
+void
+ApplyCacheAddChange(ApplyCache*, TransactionId, XLogRecPtr lsn, ApplyCacheChange*);
+
+/*
+ * Assemble and apply the complete transaction: call begin(), pass every
+ * change of the transaction and its subtransactions to apply_change(), then
+ * call commit() and release the transaction's resources.
+ */
+void
+ApplyCacheCommit(ApplyCache*, TransactionId, XLogRecPtr lsn);
+
+void
+ApplyCacheCommitChild(ApplyCache*, TransactionId, TransactionId, XLogRecPtr lsn);
+
+void
+ApplyCacheAbort(ApplyCache*, TransactionId, XLogRecPtr lsn);
+
+#endif
-- 
1.7.10.rc3.3.g19a6c.dirty



[PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
From: Andres Freund <andres@anarazel.de>

One solution to avoid loops when doing wal based logical replication in
topologies which are more complex than one unidirectional transport is
introducing the concept of a 'origin_id' into the wal stream. Luckily there is
some padding in the XLogRecord struct that allows us to add that field without
further bloating the struct.
This solution was chosen because it allows for just about any topology and is
unobtrusive.

This adds a new configuration parameter multimaster_node_id which determines
the id used for wal originating in one cluster.

When applying changes from another cluster's wal, code can set the variable
current_replication_origin_id. This is a global variable because passing it
through everything that can generate wal would be far too intrusive.
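
A minimal sketch of the intended usage on the apply side, assuming a
hypothetical apply_remote_record() driver (only the globals and zeroRecPtr
below come from the patch, the driver itself is made up):

    #include "replication/logical.h"

    static void
    apply_remote_record(RepNodeId remote_node, XLogRecPtr remote_lsn)
    {
        /* all WAL generated from here on is tagged with the remote origin */
        current_replication_origin_id = remote_node;
        current_replication_origin_lsn = remote_lsn;

        /* ... apply the decoded change, generating local WAL ... */

        /* back to locally originated WAL */
        current_replication_origin_id = guc_replication_origin_id;
        current_replication_origin_lsn = zeroRecPtr;
    }

With that, every XLogRecord carries the origin in xl_origin_id, so a
downstream node can filter out records originating at a node it already
receives changes from directly, which breaks replication loops.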
---
 src/backend/access/transam/xact.c              |   48 +++++++++++++++++++------
 src/backend/access/transam/xlog.c              |    3 +-
 src/backend/access/transam/xlogreader.c        |    2 ++
 src/backend/replication/logical/Makefile       |    2 +-
 src/backend/replication/logical/logical.c      |   19 ++++++++++
 src/backend/utils/misc/guc.c                   |   19 ++++++++++
 src/backend/utils/misc/postgresql.conf.sample  |    3 ++
 src/include/access/xlog.h                      |    4 +--
 src/include/access/xlogdefs.h                  |    2 ++
 src/include/replication/logical.h              |   22 ++++++++++
 10 files changed, 110 insertions(+), 14 deletions(-)
 create mode 100644 src/backend/replication/logical/logical.c
 create mode 100644 src/include/replication/logical.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3cc2bfa..dc30a17 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,8 +36,9 @@#include "libpq/be-fsstubs.h"#include "miscadmin.h"#include "pgstat.h"
-#include "replication/walsender.h"
+#include "replication/logical.h"#include "replication/syncrep.h"
+#include "replication/walsender.h"#include "storage/lmgr.h"#include "storage/predicate.h"#include
"storage/procarray.h"
@@ -4545,12 +4546,13 @@ xactGetCommittedChildren(TransactionId **ptr)
  * actions for which the order of execution is critical.
  */
 static void
 
-xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
-                          TransactionId *sub_xids, int nsubxacts,
-                          SharedInvalidationMessage *inval_msgs, int nmsgs,
-                          RelFileNode *xnodes, int nrels,
-                          Oid dbId, Oid tsId,
-                          uint32 xinfo)
+xact_redo_commit_internal(TransactionId xid, RepNodeId originating_node,
+                          XLogRecPtr lsn, XLogRecPtr origin_lsn,
+                          TransactionId *sub_xids, int nsubxacts,
+                          SharedInvalidationMessage *inval_msgs, int nmsgs,
+                          RelFileNode *xnodes, int nrels,
+                          Oid dbId, Oid tsId,
+                          uint32 xinfo)
 {
     TransactionId max_xid;
     int            i;
@@ -4659,8 +4661,13 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
  * Utility function to call xact_redo_commit_internal after breaking down xlrec
  */
 static void
 
-xact_redo_commit(xl_xact_commit *xlrec,
-                 TransactionId xid, XLogRecPtr lsn)
+xact_redo_commit(xl_xact_commit *xlrec, RepNodeId originating_node,
+                 TransactionId xid, XLogRecPtr lsn)
 {
     TransactionId *subxacts;
     SharedInvalidationMessage *inval_msgs;
@@ -4670,18 +4677,26 @@ xact_redo_commit(xl_xact_commit *xlrec,
     /* invalidation messages array follows subxids */
     inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-    xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
-                              inval_msgs, xlrec->nmsgs,
-                              xlrec->xnodes, xlrec->nrels,
-                              xlrec->dbId, xlrec->tsId,
-                              xlrec->xinfo);
+    xact_redo_commit_internal(xid, originating_node, lsn, xlrec->origin_lsn,
+                              subxacts, xlrec->nsubxacts, inval_msgs,
+                              xlrec->nmsgs, xlrec->xnodes, xlrec->nrels,
+                              xlrec->dbId, xlrec->tsId, xlrec->xinfo);
 }
 
 /*
  * Utility function to call xact_redo_commit_internal for compact form of
  * message.
  */
 static void
 
@@ -4691,6 +4706,18 @@ xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
-xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
-                         TransactionId xid, XLogRecPtr lsn)
+xact_redo_commit_compact(xl_xact_commit_compact *xlrec, RepNodeId originating_node,
+                         TransactionId xid, XLogRecPtr lsn)
 {
-    xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
-                              NULL, 0,        /* inval msgs */
-                              NULL, 0,        /* relfilenodes */
-                              InvalidOid,        /* dbId */
-                              InvalidOid,        /* tsId */
-                              0);            /* xinfo */
+    xact_redo_commit_internal(xid, originating_node, lsn, zeroRecPtr,
+                              xlrec->subxacts, xlrec->nsubxacts,
+                              NULL, 0,        /* inval msgs */
+                              NULL, 0,        /* relfilenodes */
+                              InvalidOid,        /* dbId */
+                              InvalidOid,        /* tsId */
+                              0);                /* xinfo */
 }
 
 /*
@@ -4786,17 +4813,18 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record)
     /* Backup blocks are not used in xact records */
     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
 
+    /* FIXME: we probably shouldn't pass xl_origin_id at multiple places, hm */
     if (info == XLOG_XACT_COMMIT_COMPACT)
     {
         xl_xact_commit_compact *xlrec = (xl_xact_commit_compact *) XLogRecGetData(record);
 
-        xact_redo_commit_compact(xlrec, record->xl_xid, lsn);
+        xact_redo_commit_compact(xlrec, record->xl_origin_id, record->xl_xid, lsn);
     }
     else if (info == XLOG_XACT_COMMIT)
     {
         xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);
 
-        xact_redo_commit(xlrec, record->xl_xid, lsn);
+        xact_redo_commit(xlrec, record->xl_origin_id, record->xl_xid, lsn);
     }
     else if (info == XLOG_XACT_ABORT)
     {
 
@@ -4814,7 +4842,7 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record)
     {
         xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record);
 
-        xact_redo_commit(&xlrec->crec, xlrec->xid, lsn);
+        xact_redo_commit(&xlrec->crec, record->xl_origin_id, xlrec->xid, lsn);
 
         RemoveTwoPhaseFile(xlrec->xid, false);
     }
     else if (info == XLOG_XACT_ABORT_PREPARED)
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c6feed0..504b4d0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@#include "postmaster/startup.h"#include "replication/walreceiver.h"#include "replication/walsender.h"
+#include "replication/logical.h"#include "storage/bufmgr.h"#include "storage/fd.h"#include "storage/ipc.h"
@@ -1032,7 +1033,7 @@ begin:;
     record->xl_len = len;        /* doesn't include backup blocks */
     record->xl_info = info;
     record->xl_rmid = rmid;
-
+    record->xl_origin_id = current_replication_origin_id;
 
     /* Now we can finish computing the record's CRC */
     COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
                SizeOfXLogRecord - sizeof(pg_crc32));
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 6f15d66..bacd31e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -24,6 +24,7 @@#include "access/xlogreader.h"/* FIXME */
+#include "replication/logical.h" /* InvalidMultimasterNodeId */#include "replication/walsender_private.h"#include
"replication/walprotocol.h"
@@ -563,6 +564,7 @@ XLogReaderRead(XLogReaderState* state)
                 spacer.xl_len = temp_record->xl_tot_len - SizeOfXLogRecord;
                 spacer.xl_rmid = RM_XLOG_ID;
                 spacer.xl_info = XLOG_NOOP;
+                spacer.xl_origin_id = InvalidMultimasterNodeId;
 
                 state->writeout_data(state,
                                      (char*)&spacer,
 
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index 7dd9663..c2d6d82 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -14,6 +14,6 @@ include $(top_builddir)/src/Makefile.global
 
 override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
 
-OBJS = applycache.o decode.o
+OBJS = applycache.o decode.o logical.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
new file mode 100644
index 0000000..4f34488
--- /dev/null
+++ b/src/backend/replication/logical/logical.c
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * logical.c
+ *
+ * Support functions for logical/multimaster replication
+ *
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *      src/backend/replication/logical/logical.c
+ *
+ */
+#include "postgres.h"
+#include "replication/logical.h"
+int guc_replication_origin_id = InvalidMultimasterNodeId;
+RepNodeId current_replication_origin_id = InvalidMultimasterNodeId;
+XLogRecPtr current_replication_origin_lsn = {0, 0};
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 93c798b..46b0657 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -60,6 +60,7 @@#include "replication/syncrep.h"#include "replication/walreceiver.h"#include
"replication/walsender.h"
+#include "replication/logical.h"#include "storage/bufmgr.h"#include "storage/standby.h"#include "storage/fd.h"
@@ -198,6 +199,7 @@ static const char *show_tcp_keepalives_interval(void);
 static const char *show_tcp_keepalives_count(void);
 static bool check_maxconnections(int *newval, void **extra, GucSource source);
 static void assign_maxconnections(int newval, void *extra);
+static void assign_replication_node_id(int newval, void *extra);
 static bool check_maxworkers(int *newval, void **extra, GucSource source);
 static void assign_maxworkers(int newval, void *extra);
 static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource source);
 
@@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] =
     },
     {
+        {"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER,
+            gettext_noop("node id for multimaster."),
+            NULL
+        },
+        &guc_replication_origin_id,
+        InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId,
+        NULL, assign_replication_node_id, NULL
+    },
+
+    {        {"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,            gettext_noop("Sets the maximum number
ofconcurrent connections."),            NULL
 
@@ -8629,6 +8641,13 @@ assign_maxconnections(int newval, void *extra)
     MaxBackends = newval + MaxWorkers + autovacuum_max_workers + 1;
 }
 
+static void
+assign_replication_node_id(int newval, void *extra)
+{
+    guc_replication_origin_id = newval;
+    current_replication_origin_id = newval;
+}
+
 static bool
 check_maxworkers(int *newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ce3fc08..12f8a3f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,9 @@
 #hot_standby_feedback = off        # send info from standby to prevent
                                    # query conflicts
 
+# - Multi Master Servers -
+
+#multimaster_node_id = 0        # invalid node id
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2843aca..dd89cff 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -47,8 +47,8 @@ typedef struct XLogRecord
     uint32        xl_len;            /* total len of rmgr data */
     uint8        xl_info;        /* flag bits, see below */
     RmgrId        xl_rmid;        /* resource manager for this record */
-
-    /* Depending on MAXALIGN, there are either 2 or 6 wasted bytes here */
+    RepNodeId    xl_origin_id;    /* node that originally caused this record to be written */
+    /* Depending on MAXALIGN, there are either 0 or 4 wasted bytes here */
 
     /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */
 
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 2768427..6b6700a 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -84,6 +84,8 @@ extern XLogRecPtr zeroRecPtr;
  */
 typedef uint32 TimeLineID;
 
+typedef uint16 RepNodeId;
+
 /*
  *    Because O_DIRECT bypasses the kernel buffers, and because we never
  *    read those buffers except during crash recovery or if wal_level != minimal,
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
new file mode 100644
index 0000000..0698b61
--- /dev/null
+++ b/src/include/replication/logical.h
@@ -0,0 +1,22 @@
+/*
+ * logical.h
+ *
+ * PostgreSQL logical replication support
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/replication/logical.h
+ */
+#ifndef LOGICAL_H
+#define LOGICAL_H
+
+#include "access/xlogdefs.h"
+
+extern int guc_replication_origin_id;
+extern RepNodeId current_replication_origin_id;
+extern XLogRecPtr current_replication_origin_lsn;
+
+#define InvalidMultimasterNodeId 0
+#define MaxMultimasterNodeId (2<<3)
+#endif
-- 
1.7.10.rc3.3.g19a6c.dirty



From: Andres Freund <andres@anarazel.de>

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard; it would require
changes to rather integral parts of the recovery code, which doesn't seem to be
a good idea.
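
To illustrate the intended use, here is a minimal, hypothetical sketch of
driving the reader (XLogReaderAllocate() and the struct members come from the
patch below; the callback bodies and the driver function are made up):

    #include "access/xlogreader.h"

    static void
    my_read_page(XLogReaderState *state, char *buf, XLogRecPtr at)
    {
        /* copy one XLOG_BLCKSZ sized page of WAL starting at 'at' into buf */
    }

    static bool
    my_is_record_interesting(XLogReaderState *state, XLogRecord *r)
    {
        /* filtering: e.g. only keep heap changes, NOOP out everything else */
        return r->xl_rmid == RM_HEAP_ID;
    }

    static void
    my_writeout_data(XLogReaderState *state, char *data, Size len)
    {
        /* append the (possibly filtered) stream to an output buffer */
    }

    static void
    stream_filtered_wal(XLogRecPtr from, XLogRecPtr to)
    {
        XLogReaderState *state = XLogReaderAllocate();

        state->read_page = my_read_page;
        state->is_record_interesting = my_is_record_interesting;
        state->writeout_data = my_writeout_data;

        state->startptr = from;
        state->curptr = from;
        state->endptr = to;

        /*
         * May stop early with needs_input/needs_output set; provide more
         * input/output space and call it again in that case.
         */
        XLogReaderRead(state);
    }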

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- validating CRCs
- separating reader/writer
---
 src/backend/access/transam/Makefile     |    2 +-
 src/backend/access/transam/xlogreader.c |  914 +++++++++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  173 ++++++
 3 files changed, 1088 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
-    twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+    twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..6f15d66
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,914 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ *
+ * A somewhat generic xlog read interface
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *      src/backend/access/transam/xlogreader.c
+ *
+ *-------------------------------------------------------------------------
+ *
+ * FIXME:
+ * * CRC computation
+ * * separation of reader/writer
+ */
+
+#include "postgres.h"
+
+#include "access/xlog_internal.h"
+#include "access/transam.h"
+#include "catalog/pg_control.h"
+#include "access/xlogreader.h"
+
+/* FIXME */
+#include "replication/walsender_private.h"
+#include "replication/walprotocol.h"
+
+//#define VERBOSE_DEBUG
+
+XLogReaderState* XLogReaderAllocate(void)
+{
+    XLogReaderState* state = (XLogReaderState*)malloc(sizeof(XLogReaderState));
+    int i;
+
+    if (!state)
+        goto oom;
+
+    memset(&state->buf.record, 0, sizeof(XLogRecord));
+    state->buf.record_data_size = XLOG_BLCKSZ*8;
+    state->buf.record_data =
+            malloc(state->buf.record_data_size);
+
+    if (!state->buf.record_data)
+        goto oom;
+
+    for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+    {
+        state->buf.bkp_block_data[i] =
+            malloc(BLCKSZ);
+
+        if (!state->buf.bkp_block_data[i])
+            goto oom;
+    }
+    XLogReaderReset(state);
+    return state;
+
+oom:
+    elog(ERROR, "could not allocate memory for XLogReaderState");
+    return 0;
+}
+
+void XLogReaderReset(XLogReaderState* state)
+{
+    state->in_record = false;
+    state->in_bkp_blocks = 0;
+    state->in_bkp_block_header = false;
+    state->in_skip = false;
+    state->remaining_size = 0;
+    state->nbytes = 0;
+    state->incomplete = false;
+    state->initialized = false;
+    state->needs_input = false;
+    state->needs_output = false;
+}
+
+static inline bool
+XLogReaderHasInput(XLogReaderState* state, Size size)
+{
+    XLogRecPtr tmp = state->curptr;
+    XLByteAdvance(tmp, size);
+    if (XLByteLE(state->endptr, tmp))
+        return false;
+    return true;
+}
+
+static inline bool
+XLogReaderHasOutput(XLogReaderState* state, Size size)
+{
+    if (state->nbytes + size > MAX_SEND_SIZE)
+        return false;
+    return true;
+}
+
+static inline bool
+XLogReaderHasSpace(XLogReaderState* state, Size size)
+{
+    XLogRecPtr tmp = state->curptr;
+    XLByteAdvance(tmp, size);
+    if (XLByteLE(state->endptr, tmp))
+        return false;
+    else if (state->nbytes + size > MAX_SEND_SIZE)
+        return false;
+    return true;
+}
+
+void
+XLogReaderRead(XLogReaderState* state)
+{
+    XLogRecord* temp_record;
+
+    state->needs_input = false;
+    state->needs_output = false;
+
+    /*
+     * Do some basic sanity checking and setup if we're starting anew.
+     */
+    if (!state->initialized)
+    {
+        state->initialized = true;
+        /*
+         * we need to start reading at the beginning of the page to understand
+         * what we are currently reading. We will skip over that because we
+         * check curptr < startptr later.
+         */
+        state->curptr.xrecoff = state->curptr.xrecoff - state->curptr.xrecoff % XLOG_BLCKSZ;
+        Assert(state->curptr.xrecoff % XLOG_BLCKSZ == 0);
+        elog(LOG, "start reading from %X/%X, scrolled back to %X/%X",
+             state->startptr.xlogid, state->startptr.xrecoff,
+             state->curptr.xlogid, state->curptr.xrecoff);
+
+    }
+    else
+    {
+        /*
+         * We didn't finish reading the last time round. Since then new data
+         * could have been appended to the current page. So we need to update
+         * our copy of that.
+         *
+         * XXX: We could tie that to state->needs_input but that doesn't seem
+         * worth the complication atm.
+         */
+        XLogRecPtr rereadptr = state->curptr;
+        rereadptr.xrecoff -= rereadptr.xrecoff % XLOG_BLCKSZ;
+
+        XLByteAdvance(rereadptr, SizeOfXLogShortPHD);
+
+        if(!XLByteLE(rereadptr, state->endptr))
+            goto not_enough_input;
+
+        rereadptr.xrecoff -= rereadptr.xrecoff % XLOG_BLCKSZ;
+
+        state->read_page(state, state->cur_page, rereadptr);
+
+        state->page_header = (XLogPageHeader)state->cur_page;
+        state->page_header_size = XLogPageHeaderSize(state->page_header);
+
+    }
+
+#ifdef VERBOSE_DEBUG
+    elog(LOG, "starting reading for %X from %X",
+         state->startptr.xrecoff, state->curptr.xrecoff);
+#endif
+    while (XLByteLT(state->curptr, state->endptr))
+    {
+        uint32 len_in_block;
+        /* did we read a partial xlog record due to input/output constraints */
+        bool partial_read = false;
+        bool partial_write = false;
+
+#ifdef VERBOSE_DEBUG
+        elog(LOG, "one loop start: record: %u skip: %u bkb_block: %d in_bkp_header: %u xrecoff: %X/%X remaining: %u,
off:%u",
 
+             state->in_record, state->in_skip,
+             state->in_bkp_blocks, state->in_bkp_block_header,
+             state->curptr.xlogid, state->curptr.xrecoff,
+             state->remaining_size,
+             state->curptr.xrecoff % XLOG_BLCKSZ);
+#endif
+
+        if (state->curptr.xrecoff % XLOG_BLCKSZ == 0)
+        {
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "reading page header, at %X/%X",
+                 state->curptr.xlogid, state->curptr.xrecoff);
+#endif
+            /* check whether we can read enough to see the short header */
+            if (!XLogReaderHasInput(state, SizeOfXLogShortPHD))
+                goto not_enough_input;
+
+            state->read_page(state, state->cur_page, state->curptr);
+            state->page_header = (XLogPageHeader)state->cur_page;
+            state->page_header_size = XLogPageHeaderSize(state->page_header);
+
+            /* check whether we have enough to read/write the full header */
+            if (!XLogReaderHasInput(state, state->page_header_size))
+                goto not_enough_input;
+
+            /* writeout page header only if we're somewhere interesting */
+            if (!XLByteLT(state->curptr, state->startptr))
+            {
+                if (!XLogReaderHasOutput(state, state->page_header_size))
+                    goto not_enough_output;
+
+                state->writeout_data(state, state->cur_page, state->page_header_size);
+            }
+
+            XLByteAdvance(state->curptr, state->page_header_size);
+
+            if (XLByteLT(state->curptr, state->startptr))
+            {
+                /* don't interpret anything if we're before the startpoint */
+            }
+            else if (state->page_header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+            {
+                XLogContRecord* temp_contrecord;
+
+                if(!XLogReaderHasInput(state, SizeOfXLogContRecord))
+                    goto not_enough_input;
+
+                if(!XLogReaderHasOutput(state, SizeOfXLogContRecord))
+                    goto not_enough_output;
+
+                temp_contrecord =
+                    (XLogContRecord*)(state->cur_page
+                                      + state->curptr.xrecoff % XLOG_BLCKSZ);
+
+
+                state->writeout_data(state, (char*)temp_contrecord, SizeOfXLogContRecord);
+
+                XLByteAdvance(state->curptr, SizeOfXLogContRecord);
+
+                if (!state->in_record)
+                {
+                    /* we need to support this case for initializing a cluster... */
+                    elog(WARNING, "contrecord although were not in a record at %X/%X, starting at %X/%X",
+                         state->curptr.xlogid, state->curptr.xrecoff,
+                         state->startptr.xlogid, state->startptr.xrecoff);
+                    state->in_record = true;
+                    state->in_skip = true;
+                    state->remaining_size = temp_contrecord->xl_rem_len;
+                    continue;
+                }
+
+
+                if(temp_contrecord->xl_rem_len < state->remaining_size)
+                    elog(PANIC, "remaining length is smaller than to be read data: %u %u",
+                         temp_contrecord->xl_rem_len, state->remaining_size
+                        );
+
+            }
+            else
+            {
+                if (state->in_record)
+                {
+                    elog(PANIC, "no contrecord although were in a record");
+                }
+            }
+        }
+
+        if (!state->in_record)
+        {
+            /*
+             * a record must be stored aligned. So skip as far as we need to
+             * comply with that.
+             */
+            Size skiplen;
+            skiplen = MAXALIGN(state->curptr.xrecoff)
+                - state->curptr.xrecoff;
+
+            if (skiplen)
+            {
+                if (!XLogReaderHasSpace(state, skiplen))
+                {
+#ifdef VERBOSE_DEBUG
+                    elog(LOG, "not aligning bc of space");
+#endif
+                    /*
+                     * We don't have enough space to read/write the alignment
+                     * bytes, so fake up a skip-state
+                     */
+                    state->in_record = true;
+                    state->in_skip = true;
+                    state->remaining_size = skiplen;
+
+                    if (!XLogReaderHasInput(state, skiplen))
+                        goto not_enough_input;
+                    goto not_enough_output;
+                }
+#ifdef VERBOSE_DEBUG
+                elog(LOG, "aligning from %X/%X to %X/%X",
+                     state->curptr.xlogid, state->curptr.xrecoff,
+                     state->curptr.xlogid, state->curptr.xrecoff + (uint32)skiplen);
+#endif
+                if (!XLByteLT(state->curptr, state->startptr))
+                    state->writeout_data(state, NULL, skiplen);
+                XLByteAdvance(state->curptr, skiplen);
+            }
+        }
+
+        /* skip until we reach the part of the page we're interested in */
+        if (XLByteLT(state->curptr, state->startptr))
+        {
+
+            if (state->in_skip)
+            {
+                /* the code already handles that, we expect a contrecord */
+            }
+            else if ((state->curptr.xrecoff % XLOG_BLCKSZ) == state->page_header_size &&
+                     state->page_header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+            {
+
+                XLogContRecord* temp_contrecord = (XLogContRecord*)
+                    (state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ);
+
+                /*
+                 * we know we have enough space here because we didn't start
+                 * writing out data yet because were < startptr
+                 */
+                Assert(XLogReaderHasSpace(state, SizeOfXLogContRecord));
+
+                XLByteAdvance(state->curptr, SizeOfXLogContRecord);
+
+#ifdef VERBOSE_DEBUG
+                elog(LOG, "skipping contrecord before start");
+#endif
+                state->in_skip = true;
+                state->in_record = true;
+                state->in_bkp_blocks = 0;
+                state->remaining_size = temp_contrecord->xl_rem_len;
+            }
+            else
+            {
+                Assert(!state->in_record);
+
+                /* read how much space we have left on the current page */
+                if(state->curptr.xrecoff % XLOG_BLCKSZ == 0)
+                    len_in_block = 0;
+                else
+                    len_in_block = XLOG_BLCKSZ - state->curptr.xrecoff % XLOG_BLCKSZ;
+
+                if(len_in_block < SizeOfXLogRecord)
+                {
+                    XLByteAdvance(state->curptr, len_in_block);
+                    continue;
+                }
+
+                /*
+                 * now read the record information and start skipping till the
+                 * record is over
+                 */
+                temp_record = (XLogRecord*)(state->cur_page + (state->curptr.xrecoff % XLOG_BLCKSZ));
+
+#ifdef VERBOSE_DEBUG
+                elog(LOG, "skipping record before start %lu, tot %u at %X/%X off %d ",
+                     temp_record->xl_tot_len - SizeOfXLogRecord,
+                     temp_record->xl_tot_len,
+                     state->curptr.xlogid, state->curptr.xrecoff,
+                     state->curptr.xrecoff % XLOG_BLCKSZ);
+#endif
+
+                Assert(XLogReaderHasSpace(state, SizeOfXLogRecord));
+
+                XLByteAdvance(state->curptr, SizeOfXLogRecord);
+
+                state->in_skip = true;
+                state->in_record = true;
+                state->in_bkp_blocks = 0;
+                state->remaining_size = temp_record->xl_tot_len
+                    - SizeOfXLogRecord;
+            }
+        }
+
+        /*
+         * ----------------------------------------
+         * start to read a record
+         *
+         * This will only happen if we're already past state->startptr
+         * ----------------------------------------
+         */
+        if (!state->in_record)
+        {
+            /*
+             * if we're at the beginning of a page (after the page header) it
+             * could be that we're starting in a continuation of an earlier
+             * record. It's debatable whether that's a valid use-case. Support
+             * it for now but cry loudly.
+             */
+            if ((state->curptr.xrecoff % XLOG_BLCKSZ) == state->page_header_size &&
+               state->page_header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+            {
+                XLogContRecord* temp_contrecord = (XLogContRecord*)
+                    (state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ);
+
+                if (!XLogReaderHasInput(state, SizeOfXLogContRecord))
+                    goto not_enough_input;
+
+                if (!XLogReaderHasOutput(state, SizeOfXLogContRecord))
+                    goto not_enough_output;
+
+                state->writeout_data(state,
+                                     (char*)temp_contrecord,
+                                     SizeOfXLogContRecord);
+                XLByteAdvance(state->curptr, SizeOfXLogContRecord);
+
+                elog(PANIC, "hum, ho, first is contrecord, but trying to read the record afterwards %X/%X",
+                     state->curptr.xlogid, state->curptr.xrecoff);
+
+                state->in_skip = true;
+                state->in_record = true;
+                state->in_bkp_blocks = 0;
+                state->remaining_size = temp_contrecord->xl_rem_len;
+                continue;
+            }
+
+            /* read how much space we have left on the current page */
+            if (state->curptr.xrecoff % XLOG_BLCKSZ == 0)
+                len_in_block = 0;
+            else
+                len_in_block = XLOG_BLCKSZ - state->curptr.xrecoff % XLOG_BLCKSZ;
+
+            /* if there is not enough space for the xlog header, skip to next page */
+            if (len_in_block < SizeOfXLogRecord)
+            {
+
+                if (!XLogReaderHasInput(state, len_in_block))
+                    goto not_enough_input;
+
+                if (!XLogReaderHasOutput(state, len_in_block))
+                    goto not_enough_output;
+
+                state->writeout_data(state,
+                                     NULL,
+                                     len_in_block);
+
+                XLByteAdvance(state->curptr, len_in_block);
+                continue;
+            }
+
+            temp_record = (XLogRecord*)(state->cur_page + (state->curptr.xrecoff % XLOG_BLCKSZ));
+
+            /*
+             * we quickly lose the original address of a record as we can skip
+             * records and such, so keep the original addresses.
+             */
+            state->buf.origptr = state->curptr;
+
+            /* we writeout data as soon as we know whether we're writing out something sensible */
+            XLByteAdvance(state->curptr, SizeOfXLogRecord);
+
+            /* ----------------------------------------
+             * normally we don't look at the content of xlog records here,
+             * XLOG_SWITCH is a special case though, as everything left in that
+             * segment won't be sensbible content.
+             * So skip to the next segment. For that we currently simply leave
+             * the loop as we don't have any mechanism to communicate that
+             * behaviour otherwise.
+             * ----------------------------------------
+             */
+            if (temp_record->xl_rmid == RM_XLOG_ID
+                && (temp_record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+            {
+
+                /*
+                 * writeout data so that this gap makes sense in the written
+                 * out data
+                 */
+                state->writeout_data(state,
+                                     (char*)temp_record,
+                                     SizeOfXLogRecord);
+
+                /*
+                 * Pretend the current data extends to end of segment
+                 *
+                 * FIXME: This logic is copied from xlog.c but seems to
+                 *    disregard xrecoff wrapping around to the next xlogid?
+                 */
+                state->curptr.xrecoff += XLogSegSize - 1;
+                state->curptr.xrecoff -= state->curptr.xrecoff % XLogSegSize;
+
+                state->in_record = false;
+                state->in_bkp_blocks = 0;
+                state->in_skip = false;
+                goto out;
+            }
+            /* ----------------------------------------
+             * Ok, we found interesting data. That means we need to do the full
+             * deal: reading the record, reading the BKP blocks afterward and
+             * then handing off the record to be processed.
+             * ----------------------------------------
+             */
+            else if (state->is_record_interesting(state, temp_record))
+            {
+                /*
+                 * the rest of the record might be on another page so we need a
+                 * copy instead of just pointing into the current page.
+                 */
+                memcpy(&state->buf.record,
+                       temp_record,
+                       sizeof(XLogRecord)); /* deliberately exactly sizeof(XLogRecord) */
+
+                state->writeout_data(state,
+                                     (char*)temp_record,
+                                     SizeOfXLogRecord);
+                /*
+                 * read until the record itself is finished; after that we will
+                 * continue with the bkp blocks et al
+                 */
+                state->remaining_size = temp_record->xl_len;
+
+                state->in_record = true;
+                state->in_bkp_blocks = 0;
+                state->in_skip = false;
+
+#ifdef VERBOSE_DEBUG
+                elog(LOG, "found record at %X/%X, tx %u, rmid %hhu, len %u tot %u",
+                     state->buf.origptr.xlogid, state->buf.origptr.xrecoff,
+                     temp_record->xl_xid, temp_record->xl_rmid, temp_record->xl_len,
+                     temp_record->xl_tot_len);
+#endif
+            }
+            /* ----------------------------------------
+             * Ok, everybody agrees: the contents of the current record are
+             * just plain boring. So fake up a record that replaces it with a
+             * NOOP record.
+             *
+             * FIXME: we should allow "compressing" the output here. That is,
+             * write something that shows how long the record would be if
+             * everything were decompressed again. This can radically reduce
+             * space-usage over the wire.
+             * It could also be very useful for traditional SR by removing
+             * unneeded BKP blocks from being transferred.
+             * For that we would need to recompute CRCs though, which we
+             * currently don't support.
+             * ----------------------------------------
+             */
+            else
+            {
+                /*
+                 * we need to fix up a fake record with correct length that can
+                 * be written out.
+                 */
+                /* needs space for padding to SizeOfXLogRecord */
+                XLogRecord spacer;
+
+                /*
+                 * xl_tot_len contains the size of the XLogRecord itself, we
+                 * read that already though.
+                 */
+                state->remaining_size = temp_record->xl_tot_len
+                    - SizeOfXLogRecord;
+
+                state->in_record = true;
+                state->in_bkp_blocks = 0;
+                state->in_skip = true;
+
+                /* FIXME: fixup the xl_prev of the next record */
+                spacer.xl_prev = state->buf.origptr;
+                spacer.xl_xid = InvalidTransactionId;
+                spacer.xl_tot_len = temp_record->xl_tot_len;
+                spacer.xl_len = temp_record->xl_tot_len - SizeOfXLogRecord;
+                spacer.xl_rmid = RM_XLOG_ID;
+                spacer.xl_info = XLOG_NOOP;
+
+                state->writeout_data(state,
+                                     (char*)&spacer,
+                                     SizeOfXLogRecord);
+            }
+        }
+        /*
+         * We read an interesting record and now want the BKP
+         * blocks. Unfortunately a bkp header is stored unaligned and can be
+         * split across pages. So we copy it to a bit more permanent location.
+         */
+        else if (state->in_bkp_blocks > 0
+                && state->remaining_size == 0)
+        {
+            Assert(!state->in_bkp_block_header);
+            Assert(state->buf.record.xl_info &
+                   XLR_SET_BKP_BLOCK(XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks));
+
+            state->in_bkp_block_header = true;
+            state->remaining_size = sizeof(BkpBlock);
+            /* in_bkp_blocks will be changed upon completion */
+            state->in_skip = false;
+        }
+
+        Assert(state->in_record);
+
+        /* compute how much space on the current page is left */
+        if (state->curptr.xrecoff % XLOG_BLCKSZ == 0)
+            len_in_block = 0;
+        else
+            len_in_block = XLOG_BLCKSZ - state->curptr.xrecoff % XLOG_BLCKSZ;
+
+        /* we have more data available than we need, so read only as much as needed */
+        if (len_in_block > state->remaining_size)
+            len_in_block = state->remaining_size;
+
+        /*
+         * Handle constraints set by endptr and the size of the output buffer.
+         *
+         * Normally we use XLogReaderHasSpace for that, but that's not
+         * convenient because we want to read data in parts. So, open-code the
+         * logic for that here.
+         */
+        if (state->curptr.xlogid == state->endptr.xlogid &&
+           state->curptr.xrecoff + len_in_block > state->endptr.xrecoff)
+        {
+            Size cur_len = len_in_block;
+            len_in_block = state->endptr.xrecoff - state->curptr.xrecoff;
+            partial_read = true;
+            elog(LOG, "truncating len_in_block due to endptr %X/%X %lu to %i at %X/%X",
+                 state->startptr.xlogid, state->startptr.xrecoff,
+                 cur_len, len_in_block,
+                 state->curptr.xlogid, state->curptr.xrecoff);
+        }
+        else if (len_in_block > (MAX_SEND_SIZE - state->nbytes))
+        {
+            Size cur_len = len_in_block;
+            len_in_block = MAX_SEND_SIZE - state->nbytes;
+            partial_write = true;
+            elog(LOG, "truncating len_in_block due to nbytes %lu to %i",
+                 cur_len, len_in_block);
+        }
+
+        /* ----------------------------------------
+         * copy data into whatever we're currently reading.
+         * ----------------------------------------
+         */
+
+        /* nothing to do if we're skipping */
+        if (state->in_skip)
+        {
+            /* writeout zero data */
+            if (!XLByteLT(state->curptr, state->startptr))
+                state->writeout_data(state, NULL, len_in_block);
+        }
+        /* copy data into the current bkp block */
+        else if (state->in_bkp_block_header)
+        {
+            int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+            BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+            Assert(state->in_bkp_blocks);
+
+            memcpy((char*)bkpb + sizeof(BkpBlock) - state->remaining_size,
+                   state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ,
+                   len_in_block);
+
+            state->writeout_data(state,
+                                 state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ,
+                                 len_in_block);
+
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "copying bkp header for %d of %u complete %lu at %X/%X rem %u",
+                 blockno, len_in_block, sizeof(BkpBlock),
+                 state->curptr.xlogid, state->curptr.xrecoff,
+                 state->remaining_size);
+            if (state->remaining_size == len_in_block)
+            {
+                elog(LOG, "block off %u len %u", bkpb->hole_offset, bkpb->hole_length);
+            }
+#endif
+        }
+        else if (state->in_bkp_blocks)
+        {
+            int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+            BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+            char* data = state->buf.bkp_block_data[blockno];
+
+            memcpy(data + BLCKSZ - bkpb->hole_length - state->remaining_size,
+                   state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ,
+                   len_in_block);
+
+            state->writeout_data(state,
+                                 state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ,
+                                 len_in_block);
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "copying data for %d of %u complete %u",
+                 blockno, len_in_block, state->remaining_size);
+#endif
+        }
+        /* read the (rest) of the XLogRecord's data */
+        else if (state->in_record)
+        {
+            if (state->buf.record_data_size < state->buf.record.xl_len)
+            {
+                state->buf.record_data_size = state->buf.record.xl_len;
+                state->buf.record_data =
+                    realloc(state->buf.record_data, state->buf.record_data_size);
+                if (!state->buf.record_data)
+                    elog(ERROR, "could not enlarge the record buffer");
+            }
+
+            memcpy(state->buf.record_data
+                   + state->buf.record.xl_len
+                   - state->remaining_size,
+                   state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ,
+                   len_in_block);
+
+            state->writeout_data(state,
+                                 state->cur_page + state->curptr.xrecoff % XLOG_BLCKSZ,
+                                 len_in_block);
+        }
+
+        /* should handle wrapping around to next page */
+        XLByteAdvance(state->curptr, len_in_block);
+
+        state->remaining_size -= len_in_block;
+
+        /*
+         * ----------------------------------------
+         * we completed whatever we were reading. So, handle going to the next
+         * state.
+         * ----------------------------------------
+         */
+
+        if (state->remaining_size == 0)
+        {
+            /*
+             * in the in_skip case the skipped data already covered the backup
+             * blocks, so everything is finished.
+             */
+            if (state->in_skip)
+            {
+                state->in_record = false;
+                state->in_bkp_blocks = 0;
+                state->in_skip = false;
+                /* alignment is handled when starting to read a record */
+            }
+            /*
+             * We read the header of the current block. Start reading the
+             * content of that now.
+             */
+            else if (state->in_bkp_block_header)
+            {
+                BkpBlock* bkpb;
+                int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+
+                Assert(state->in_bkp_blocks);
+
+                bkpb = &state->buf.bkp_block[blockno];
+                state->remaining_size = BLCKSZ - bkpb->hole_length;
+                state->in_bkp_block_header = false;
+#ifdef VERBOSE_DEBUG
+                elog(LOG, "completed reading of header for %d, reading data now %u hole %u, off %u",
+                     blockno, state->remaining_size, bkpb->hole_length,
+                     bkpb->hole_offset);
+#endif
+            }
+            /*
+             * The current backup block is finished, more may be coming
+             */
+            else if (state->in_bkp_blocks)
+            {
+                int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+                BkpBlock* bkpb;
+                char* bkpb_data;
+
+                Assert(!state->in_bkp_block_header);
+
+                bkpb = &state->buf.bkp_block[blockno];
+                bkpb_data = state->buf.bkp_block_data[blockno];
+
+                /*
+                 * reassemble block to its entirety by removing the bkp_hole
+                 * "compression"
+                 */
+                if (bkpb->hole_length)
+                {
+                    /* shift the data behind the hole to its real offset ... */
+                    memmove(bkpb_data + bkpb->hole_offset + bkpb->hole_length,
+                            bkpb_data + bkpb->hole_offset,
+                            BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
+                    /* ... and zero-fill the hole itself */
+                    memset(bkpb_data + bkpb->hole_offset,
+                           0,
+                           bkpb->hole_length);
+                }
+#if 0
+                elog(LOG, "finished with bkp block %d", blockno);
+#endif
+                state->in_bkp_blocks--;
+
+                state->in_skip = false;
+
+                /*
+                 * only continue if another bkp block is actually present
+                 */
+                while (state->in_bkp_blocks)
+                {
+                    if (state->buf.record.xl_info &
+                       XLR_SET_BKP_BLOCK(XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks))
+                    {
+                        elog(LOG, "reading record %u", XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks);
+                        break;
+                    }
+                    state->in_bkp_blocks--;
+                }
+
+                if (!state->in_bkp_blocks)
+                {
+                    goto all_bkp_finished;
+                }
+                /* bkp blocks are stored without regard for alignment */
+            }
+            /*
+             * read a non-skipped record, start reading bkp blocks afterwards
+             */
+            else if (state->in_record)
+            {
+                state->in_record = true;
+                state->in_skip = false;
+                state->in_bkp_blocks = XLR_MAX_BKP_BLOCKS;
+
+                /*
+                 * only continue if another bkp block is actually present
+                 */
+                while (state->in_bkp_blocks)
+                {
+                    if (state->buf.record.xl_info &
+                        XLR_SET_BKP_BLOCK(XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks))
+                    {
+#ifdef VERBOSE_DEBUG
+                        elog(LOG, "reading bkp block %u", XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks);
+#endif
+                        break;
+                    }
+                    state->in_bkp_blocks--;
+                }
+
+                if (!state->in_bkp_blocks)
+                {
+                    goto all_bkp_finished;
+                }
+                /* bkp blocks are stored without regard for alignment */
+            }
+
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "finish with record at %X/%X",
+                 state->curptr.xlogid, state->curptr.xrecoff);
+#endif
+        }
+        /*
+         * Something could only be partially read inside a single block
+         * because of input or output space constraints. This case needs to be
+         * separate because otherwise we would treat it as a continuation,
+         * which would obviously be wrong (we don't have a continuation
+         * record).
+         */
+        else if (partial_read)
+        {
+            partial_read = false;
+            goto not_enough_input;
+        }
+        else if (partial_write)
+        {
+            partial_write = false;
+            goto not_enough_output;
+        }
+        /*
+         * Data continues into the next block. Read the continuation record
+         * there and then continue.
+         */
+        else
+        {
+            /* nothing to do here, the next loop iteration reads on */
+        }
+#ifdef VERBOSE_DEBUG
+        elog(LOG, "one loop: record: %u skip: %u bkb_block: %d in_bkp_header: %u xrecoff: %X/%X remaining: %u, off:
%u",
+             state->in_record, state->in_skip,
+             state->in_bkp_blocks, state->in_bkp_block_header,
+             state->curptr.xlogid, state->curptr.xrecoff,
+             state->remaining_size,
+             state->curptr.xrecoff % XLOG_BLCKSZ);
+#endif
+        continue;
+
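+    /*
+     * The record and all of its backup blocks (if any) have been fully
+     * reassembled; hand the completed record to the caller.
+     */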
+    all_bkp_finished:
+        {
+            Assert(!state->in_skip);
+            Assert(!state->in_bkp_block_header);
+            Assert(!state->in_bkp_blocks);
+
+            state->finished_record(state, &state->buf);
+
+            state->in_record = false;
+
+            /* alignment is handled when starting to read a record */
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "currently at %X/%X to %X/%X, wrote nbytes: %lu",
+                 state->curptr.xlogid, state->curptr.xrecoff,
+                 state->endptr.xlogid, state->endptr.xrecoff, state->nbytes);
+#endif
+        }
+    }
+
+out:
+    /* did we stop in the middle of a record (or a skipped one)? */
+    state->incomplete = state->in_skip || state->in_record;
+    return;
+
+not_enough_input:
+    state->needs_input = true;
+    goto out;
+
+not_enough_output:
+    state->needs_output = true;
+    goto out;
+}
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
new file mode 100644
index 0000000..7df98cf
--- /dev/null
+++ b/src/include/access/xlogreader.h
@@ -0,0 +1,173 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ * Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogreader.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGREADER_H
+#define XLOGREADER_H
+
+#include "access/xlog_internal.h"
+
+typedef struct XLogRecordBuffer
+{
+    /* the record itself */
+    XLogRecord record;
+
+    /* the LSN at which the record was found */
+    XLogRecPtr origptr;
+
+    /* the data for xlog record */
+    char* record_data;
+    uint32 record_data_size;
+
+    BkpBlock bkp_block[XLR_MAX_BKP_BLOCKS];
+    char* bkp_block_data[XLR_MAX_BKP_BLOCKS];
+} XLogRecordBuffer;
+
+
+struct XLogReaderState;
+
+typedef bool (*XLogReaderStateInterestingCB)(struct XLogReaderState*, XLogRecord* r);
+typedef void (*XLogReaderStateWriteoutCB)(struct XLogReaderState*, char* data, Size len);
+typedef void (*XLogReaderStateFinishedRecordCB)(struct XLogReaderState*, XLogRecordBuffer* buf);
+typedef void (*XLogReaderStateReadPageCB)(struct XLogReaderState*, char* cur_page, XLogRecPtr at);
+
+typedef struct XLogReaderState
+{
+    /* ----------------------------------------
+     * Public parameters
+     * ----------------------------------------
+     */
+
+    /* callbacks */
+
+    /*
+     * Called to decide whether an xlog record is interesting and should be
+     * assembled, analyzed (finished_record) and written out, or skipped.
+     */
+    XLogReaderStateInterestingCB is_record_interesting;
+
+    /*
+     * Write out data. This doesn't have to do anything if the data isn't
+     * needed later on.
+     */
+    XLogReaderStateWriteoutCB writeout_data;
+
+    /*
+     * Gets called after a record, including the backup blocks, has been fully
+     * reassembled.
+     */
+    XLogReaderStateFinishedRecordCB finished_record;
+
+    /*
+     * Data input function. Has to read an XLOG_BLCKSZ sized block into
+     * cur_page, although everything behind endptr does not have to be valid.
+     */
+    XLogReaderStateReadPageCB read_page;
+
+    /*
+     * this can be used by the caller to pass state to the callbacks without
+     * using global variables or such ugliness
+     */
+    void* private_data;
+
+
+    /* from where to where are we reading */
+
+    /* so we know where interesting data starts after scrolling back to the beginning of a page */
+    XLogRecPtr startptr;
+
+    /* continue up to here in this run */
+    XLogRecPtr endptr;
+
+
+    /* ----------------------------------------
+     * output parameters
+     * ----------------------------------------
+     */
+
+    /* we need new input data - a later endptr - to continue reading */
+    bool needs_input;
+
+    /* we need new output space to continue reading */
+    bool needs_output;
+
+    /* track our progress */
+    XLogRecPtr curptr;
+
+    /*
+     * are we in the middle of something? This is useful for the outside to
+     * know whether to start reading anew
+     */
+    bool incomplete;
+
+    /* ----------------------------------------
+     * private parameters
+     * ----------------------------------------
+     */
+
+    char cur_page[XLOG_BLCKSZ];
+    XLogPageHeader page_header;
+    uint32 page_header_size;
+    XLogRecordBuffer buf;
+
+
+    /* ----------------------------------------
+     * state machine variables
+     * ----------------------------------------
+     */
+
+    bool initialized;
+
+    /* are we currently reading a record */
+    bool in_record;
+
+    /* how many bkp blocks remain to be read */
+    int in_bkp_blocks;
+
+    /*
+     * the header of a bkp block can be split across pages, so we need to
+     * support reading that incrementally
+     */
+    bool in_bkp_block_header;
+
+    /* we don't want to read this block, so keep track of that */
+    bool in_skip;
+
+    /* how much more to read in the current state */
+    uint32 remaining_size;
+
+    Size nbytes; /* amount of data already written out */
+
+} XLogReaderState;
+
+/*
+ * Get a new XLogReader
+ *
+ * The 4 callbacks, startptr and endptr have to be set before the reader can be
+ * used.
+ */
+extern XLogReaderState* XLogReaderAllocate(void);
+
+/*
+ * Reset internal state so it can be used without continuing from the last
+ * state.
+ *
+ * The callbacks and private_data won't be reset
+ */
+extern void XLogReaderReset(XLogReaderState* state);
+
+/*
+ * Read the xlog and call the appropriate callbacks as far as possible within
+ * the constraints of input data (startptr, endptr) and output space.
+ */
+extern void XLogReaderRead(XLogReaderState* state);
+
+#endif
-- 
1.7.10.rc3.3.g19a6c.dirty
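
To illustrate how the callbacks in the new xlogreader.h fit together, here is
a minimal usage sketch (not part of the patch; my_read_page, start and end
are placeholders the caller has to provide):

    static bool
    keep_everything(XLogReaderState *state, XLogRecord *r)
    {
        return true;            /* reassemble every record */
    }

    static void
    discard_writeout(XLogReaderState *state, char *data, Size len)
    {
        /* this consumer doesn't forward the filtered byte stream anywhere */
    }

    static void
    log_record(XLogReaderState *state, XLogRecordBuffer *buf)
    {
        elog(LOG, "record at %X/%X, rmid %u",
             buf->origptr.xlogid, buf->origptr.xrecoff,
             (unsigned) buf->record.xl_rmid);
    }

    ...
    XLogReaderState *reader = XLogReaderAllocate();

    XLogReaderReset(reader);
    reader->is_record_interesting = keep_everything;
    reader->writeout_data = discard_writeout;
    reader->finished_record = log_record;
    reader->read_page = my_read_page;   /* fills cur_page, e.g. via XLogRead() */
    reader->startptr = start;
    reader->curptr = start;
    reader->endptr = end;               /* valid input data ends here */

    XLogReaderRead(reader);

    if (reader->needs_input)
    {
        /* wait until more WAL is available, raise endptr, call again */
    }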



[PATCH 16/16] current version of the design document

From
Andres Freund
Date:
From: Andres Freund <andres@anarazel.de>

---
 src/backend/replication/logical/DESIGN | 209 ++++++++++++++++++++++++++++++++
 1 file changed, 209 insertions(+)
 create mode 100644 src/backend/replication/logical/DESIGN


diff --git a/src/backend/replication/logical/DESIGN b/src/backend/replication/logical/DESIGN
new file mode 100644
index 0000000..2cf08ff
--- /dev/null
+++ b/src/backend/replication/logical/DESIGN
@@ -0,0 +1,209 @@
+=== Design goals for logical replication ===:
+- in core
+- fast
+- async
+- robust
+- multi-master
+- modular
+- as unintrusive as possible implementation wise
+- basis for other technologies (sharding, replication into other DBMSs, ...)
+
+For reasons why we think this is an important set of features please check out
+the presentation from the in-core replication summit at pgcon:
+http://wiki.postgresql.org/wiki/File:BDR_Presentation_PGCon2012.pdf
+
+While you may argue that most of the above design goals are already provided by
+various trigger based replication solutions like Londiste or Slony, we think
+that that's not enough for various reasons:
+
+- not in core (and thus less trustworthy)
+- duplication of writes due to an additional log
+- performance in general (check the end of the above presentation)
+- complex to use because there is no native administration interface
+
+We want to emphasize that this proposed architecture is based on the experience
+of developing a minimal prototype which we developed with the above goals in
+mind. While we obviously hope that a good part of it is reusable for the
+community we definitely do *not* expect that the community accepts this
+as-is. It is intended to be the basis upon which we, the community, can build
+and design the future logical replication.
+
+=== Basic architecture ===:
+Very broadly speaking there are several major pieces common to most approaches
+to replication:
+1. Source data generation
+2. Transportation of that data
+3. Applying the changes
+4. Conflict resolution
+
+
+1.:
+
+As we need a change stream that contains all required changes in the correct
+order, the requirement for this stream to reflect changes across multiple
+concurrent backends raises concurrency and scalability issues. Reusing the
+WAL stream for this seems a good choice since it is needed anyway and
+addresses those issues already, and it further means that we don't duplicate
+the writes and locks already performed for its maintenance. Any other
+stream-generating component would introduce additional scalability issues.
+
+Unfortunately, in this case, the WAL is mostly a physical representation of the
+changes and thus does not, by itself, contain the necessary information in a
+convenient format to create logical changesets.
+
+The biggest problem is that interpreting tuples in the WAL stream requires an
+up-to-date system catalog and needs to be done in a compatible backend and
+architecture. The requirement of an up-to-date catalog could be solved by
+adding more data to the WAL stream, but it seems likely that that would
+require relatively intrusive & complex changes. Instead we chose to require a
+synchronized catalog at the decoding site. That adds some complexity to use
+cases like replicating into a different database or cross-version
+replication. For those it is relatively straightforward to develop a proxy pg
+instance that only contains the catalog and does the transformation to
+textual changes.
+
+This also is the solution to the other big problem, the need to work around
+architecture/version specific binary formats. The alternative, producing
+cross-version, cross-architecture compatible binary changes, or even more so
+textual changes, all the time seems to be prohibitively expensive, both from
+a CPU and a storage POV, and also from the point of implementation effort.
+
+The catalog on the site where changes originate can *not* be used for the
+decoding because at the time we decode the WAL the catalog may have changed
+from the state it was in when the WAL was generated. A possible solution for
+this would be to have a fully versioned catalog but that again seems to be
+rather complex and intrusive.
+
+For some operations (UPDATE, DELETE) and corner-cases (e.g. full page writes)
+additional data needs to be logged, but the additional amount of data isn't
+that big. Requiring a primary-key for any change but INSERT seems to be a
+sensible thing for now. The required changes are fully contained in heapam.c
+and are pretty simple so far.
+
+2.:
+
+For transport of the non-decoded data from the originating site to the decoding
+site we decided to reuse the infrastructure already provided by
+walsender/walreceiver. We introduced a new command that, analogous to
+START_REPLICATION, is called START_LOGICAL_REPLICATION that will stream out all
+xlog records that pass through a filter.
+
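+For illustration, a client would issue something like the following on a
+replication connection (a sketch; the exact grammar is in the
+walsender/repl_gram.y changes of the walsender patch):
+
+    START_LOGICAL_REPLICATION 0/1000000
+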
+The on-the-wire format stays the same. The filter currently simply filters out
+all records which are not interesting for logical replication (indexes,
+freezing, ...) and records that did not originate on the same system.
+
+The requirement of filtering by the 'origin' of a wal record comes from the
+planned multimaster support. Changes replayed locally that originate from
+another site should not be replayed again there. If the wal were used plainly
+without such a filter, that would cause loops. Instead we tag every wal
+record with the "node id" of the site that caused the change to happen, and
+changes carrying a node's own "node id" won't get applied again.
+
+Currently filtered records simply get replaced by NOOP records and loads of
+zeroes, which obviously is not a sensible long-term solution. The difficulty
+of actually removing the records is that doing so would change the LSNs,
+which we currently rely on; a rough sketch of the replacement follows.
+
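+The replacement mirrors what the XLogReader patch currently does; writeout
+here stands in for the reader's writeout_data callback:
+
+    XLogRecord spacer;
+
+    spacer.xl_prev = origptr;                    /* address of the old record */
+    spacer.xl_xid = InvalidTransactionId;
+    spacer.xl_tot_len = old_record->xl_tot_len;  /* keep all LSNs unchanged */
+    spacer.xl_len = old_record->xl_tot_len - SizeOfXLogRecord;
+    spacer.xl_rmid = RM_XLOG_ID;
+    spacer.xl_info = XLOG_NOOP;
+
+    writeout(&spacer, SizeOfXLogRecord);         /* then xl_len bytes of zeroes */
+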
+The filtering might very well get expanded to support partial replication and
+such in the future.
+
+
+3.:
+
+To sensibly apply changes out of the WAL stream we need to solve two things:
+Reassemble transactions and apply them to the target database.
+
+The logical stream from 1. via 2. consists of individual changes identified
+by the relfilenode of the table and the xid of the transaction. Given
+(sub)transactions, rollbacks, crash recovery and the like, those changes
+obviously cannot be applied individually without fully losing the pretence
+of consistency. To solve that we introduced a module, dubbed ApplyCache,
+which does the reassembling. This module is *independent* of the data source
+and of the method of applying changes so it can be reused for replicating
+into a foreign system or similar.
+
+Due to the overhead of planner/executor/toast reassembly/type conversion (yes,
+we benchmarked!) we decided against statement generation for apply. Even when
+using prepared statements the overhead is rather noticeable.
+
+Instead we decided to use relatively low-level heapam.h/genam.h accesses to
+do the apply. For now we decided to use only one process to do the applying;
+parallelizing that seems too complex for the introduction of an already
+complex feature.
+In our tests the apply process could keep up with pgbench -c/j 20+ generating
+changes. This will obviously heavily depend on the workload. A fully seek bound
+workload will definitely not scale that well.
+
+Just to reiterate: Plugging in another method to do the apply should be a
+relatively simple matter of pointing the three callbacks (begin,
+apply_change, commit) at different functions, as sketched below.
+
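+A minimal sketch of that wiring, using the hooks as they exist in the apply
+patch (the my_* callbacks are placeholders):
+
+    ApplyCache *cache = ApplyCacheAllocate();
+
+    cache->begin = my_begin_txn;
+    cache->apply_change = my_apply_change;   /* e.g. generate SQL text */
+    cache->commit = my_commit_txn;
+
+    /* decoded records are collected per transaction; the callbacks only
+     * fire, in commit order, once a transaction actually commits */
+    DecodeRecordIntoApplyCache(cache, buf);
+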
+Another complexity in this is how to synchronize the catalogs. We plan to use
+command/event triggers and the oid preserving features from pg_upgrade to keep
+the catalogs in-sync. We did not start working on that.
+
+
+4.:
+
+While we started to think about conflict resolution/avoidance we did not
+start to work on it. We currently *cannot* handle conflicts. We think that
+the base features/architecture should be agreed upon before starting with
+it.
+
+Multimaster tests were done with sequences set up with INCREMENT 2 and
+different start values on the two nodes.
+
+=== Current Prototype ===
+
+The current prototype consists of a series of patches that are split in
+hopefully sensible and coherent parts to make reviewing of individual parts
+possible.
+
+It's also available in the 'cabal-rebasing' branch on
+git.postgresql.org/users/andresfreund/postgres.git. That branch will modify
+history though.
+
+01: wakeup handling: reduces replication lag, not very interesting in this context
+
+02: Add zeroRecPtr: not very interesting either
+
+03: new syscache for relfilenode. This would benefit by some syscache experienced eyes
+
+04: embedded lists: This is a general facility, general review appreciated
+
+05: preliminary bgworker support: This is not ready and just posted as it's
+    preliminary work for the other patches. Simon will post a real patch soon
+
+06: XLogReader: Review definitely appreciated
+
+07: logical data additions for WAL: Review definitely appreciated, I do not expect fundamental changes
+
+08: ApplyCache: Important infrastructure for the patch, review definitely appreciated
+
+09: Wal Decoding: Decode WAL generated with wal_level=logical into an ApplyCache
+
+10: WAL with 'origin node': This is another important base-piece for logical rep
+
+11: WAL segment handling changes: If the basic idea of adding a node_id to the
+    functions and adding a pg_lcr directory is acceptable the rest of the patch is
+    fairly boring/mechanical
+
+12: walsender/walreceiver changes: Implement transport/filtering of logical
+    changes. Very relevant
+
+13: shared memory/crash recovery state handling for logical rep: Very relevant
+    minus the TODO's in the commit message
+
+14: apply module: review appreciated
+
+15: apply process: somewhat dependent on the preliminary changes in 05, general
+    direction is visible, loads of detail work needed as soon as some design
+    decisions are agreed upon.
+
+16: this document. Not very interesting after you've read it ;)
-- 
1.7.10.rc3.3.g19a6c.dirty



From: Andres Freund <andres@anarazel.de>

One apply process currently can only apply changes from one database in another
cluster (with a specific node_id).

Currently synchronous_commit=off is statically set in the apply process because
after a crash we can safely recover all changes which we didn't apply, so there
is no point in incurring the overhead of synchronous commits. This might be
problematic in combination with synchronous replication.

Missing/Todo:
- The foreign node_id currently is hardcoded (2 or 1, depending on the local
  id), as is the database (postgres). This obviously needs to change.

- Proper mainloop with error handling, PROCESS_INTERRUPTS and everything
- Start multiple apply processes (per node_id per database)
- Possibly switch databases during runtime?
---
 src/backend/postmaster/bgworker.c          |  10 +-
 src/backend/replication/logical/logical.c  | 194 +++++++++++++++++++++++++++++
 src/include/replication/logical.h          |   3 +
 3 files changed, 198 insertions(+), 9 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 8144050..bbb7e86 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -52,6 +52,7 @@
 #include "postmaster/bgworker.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
+#include "replication/logical.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
@@ -91,8 +92,6 @@ static void bgworker_sigterm_handler(SIGNAL_ARGS);
 NON_EXEC_STATIC void BgWorkerMain(int argc, char *argv[]);
 
-static bool do_logicalapply(void);
-
 /********************************************************************
  *                      BGWORKER CODE
  ********************************************************************/
@@ -394,10 +393,3 @@ NumBgWorkers(void)
     return numWorkers;
 #endif
 }
-
-static bool
-do_logicalapply(void)
-{
-    elog(LOG, "doing logical apply");
-    return false;
-}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4f34488..7fadafe 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -13,7 +13,201 @@
  *
  */
 #include "postgres.h"
+
+#include "access/xlogreader.h"
+
+#include "replication/applycache.h"
 #include "replication/logical.h"
+#include "replication/apply.h"
+#include "replication/decode.h"
+#include "replication/walreceiver.h"
+/* FIXME: XLogRead */
+#include "replication/walsender_private.h"
+
+#include "storage/ipc.h"
+#include "storage/proc.h"
 
 int guc_replication_origin_id = InvalidMultimasterNodeId;
 RepNodeId current_replication_origin_id = InvalidMultimasterNodeId;
 XLogRecPtr current_replication_origin_lsn = {0, 0};
+
+static XLogReaderState* xlogreader_state = NULL;
+
+static bool
+replay_record_is_interesting(XLogReaderState* state, XLogRecord* r)
+{
+    /*
+     * we filtered in the sender, so no filtering necessary atm.
+     *
+     * If we want to introduce per-table filtering after a proxy or such this
+     * would be the place.
+     */
+    return true;
+}
+
+static void
+replay_writeout_data(XLogReaderState* state, char* data, Size len)
+{
+    /* no data needs to persist after this */
+    return;
+}
+
+static void
+replay_finished_record(XLogReaderState* state, XLogRecordBuffer* buf)
+{
+    ReaderApplyState* apply_state = state->private_data;
+    ApplyCache *cache = apply_state->apply_cache;
+
+    DecodeRecordIntoApplyCache(cache, buf);
+}
+
+static void
+replay_read_page(XLogReaderState* state, char* cur_page, XLogRecPtr startptr)
+{
+    XLogPageHeader page_header;
+
+    Assert((startptr.xrecoff % XLOG_BLCKSZ) == 0);
+
+    /* FIXME: more sensible/efficient implementation */
+    XLogRead(cur_page, receiving_from_node_id, startptr, XLOG_BLCKSZ);
+
+    page_header = (XLogPageHeader)cur_page;
+
+    if (page_header->xlp_magic != XLOG_PAGE_MAGIC)
+    {
+        elog(FATAL, "page header magic %x, should be %x at %X/%X", page_header->xlp_magic,
+             XLOG_PAGE_MAGIC, startptr.xlogid, startptr.xrecoff);
+    }
+}
+
+bool
+do_logicalapply(void)
+{
+    XLogRecPtr *from;
+    XLogRecPtr *to;
+    ReaderApplyState *apply_state;
+    int res;
+
+    static bool initialized = false;
+
+    if (!initialized)
+    {
+        /*
+         * FIXME: We need a sensible implementation for choosing this.
+         */
+        if (guc_replication_origin_id == 1)
+        {
+            receiving_from_node_id = 2;
+        }
+        else
+        {
+            receiving_from_node_id = 1;
+        }
+    }
+
+    ResetLatch(&MyProc->procLatch);
+
+    SpinLockAcquire(&WalRcv->mutex);
+    from = &WalRcv->mm_applyState[receiving_from_node_id];
+    to = &WalRcv->mm_receiveState[receiving_from_node_id];
+    SpinLockRelease(&WalRcv->mutex);
+
+    if (XLByteEQ(*to, zeroRecPtr))
+    {
+        /* shmem state not ready, walreceivers didn't start up yet */
+        return false;
+    }
+
+    if (!initialized)
+    {
+        ApplyCache *apply_cache;
+
+        initialized = true;
+
+        elog(LOG, "at node %u were receiving from %u",
+             guc_replication_origin_id,
+             receiving_from_node_id);
+
+        /* FIXME: do we want to set that permanently? */
+        current_replication_origin_id = receiving_from_node_id;
+
+        /* we cannot lose anything due to this as we just restart replay */
+        SetConfigOption("synchronous_commit", "off",
+                        PGC_SUSET, PGC_S_OVERRIDE);
+
+        WalRcv->mm_receiveLatch[receiving_from_node_id] = &MyProc->procLatch;
+
+        /* initialize xlogreader */
+        xlogreader_state = XLogReaderAllocate();
+        XLogReaderReset(xlogreader_state);
+
+        xlogreader_state->is_record_interesting = replay_record_is_interesting;
+        xlogreader_state->finished_record = replay_finished_record;
+        xlogreader_state->writeout_data = replay_writeout_data;
+        xlogreader_state->read_page = replay_read_page;
+        xlogreader_state->private_data = malloc(sizeof(ReaderApplyState));
+        if (!xlogreader_state->private_data)
+            elog(ERROR, "Could not allocate the ReaderApplyState struct");
+
+        xlogreader_state->startptr = *from;
+        xlogreader_state->curptr = *from;
+
+        apply_state = (ReaderApplyState*)xlogreader_state->private_data;
+
+        /*
+         * allocate an ApplyCache that will apply data using lowlevel calls
+         * without type conversion et al. This requires binary compatibility
+         * between both systems.
+         * XXX: This would be the place to hook different apply methods, like
+         * producing sql and applying it.
+         */
+        apply_cache = ApplyCacheAllocate();
+        apply_cache->begin = apply_begin_txn;
+        apply_cache->apply_change = apply_change;
+        apply_cache->commit = apply_commit_txn;
+        apply_state->apply_cache = apply_cache;
+
+        apply_cache->private_data = malloc(sizeof(ApplyApplyCacheState));
+        if (!apply_cache->private_data)
+            elog(ERROR, "Could not allocate the DecodeApplyCacheState struct");
+
+        elog(WARNING, "initialized");
+
+    }
+
+    if (XLByteLT(*to, *from))
+    {
+        goto wait;
+    }
+
+    xlogreader_state->endptr = *to;
+
+    XLogReaderRead(xlogreader_state);
+
+    SpinLockAcquire(&WalRcv->mutex);
+    /*
+     * FIXME: This is not enough to recover properly after a crash because we
+     * loose in-progress transactions. For that we need two pointers: One to
+     * remember which is the lsn we committed last and which is the lsn with
+     * the oldest, in-progress, transaction. Then we can start reading at the
+     * latter and just throw away everything which commits before the former.
+     */
+    WalRcv->mm_applyState[receiving_from_node_id] = xlogreader_state->curptr;
+    SpinLockRelease(&WalRcv->mutex);
+
+wait:
+    /*
+     * if we either need data to complete reading or have finished everything
+     * up to this point
+     */
+    if (xlogreader_state->needs_input || !xlogreader_state->incomplete)
+    {
+        res = WaitLatch(&MyProc->procLatch,
+                        WL_LATCH_SET|WL_POSTMASTER_DEATH, 0);
+        if (res & WL_POSTMASTER_DEATH)
+        {
+            elog(WARNING, "got deathsig");
+            proc_exit(0);
+        }
+    }
+    return true;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index fc9e120..aa19ab9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -25,4 +25,7 @@ extern XLogRecPtr current_replication_origin_lsn;
 #define MaxMultimasterNodeId (2<<3)
 #define LCRDIR            "pg_lcr"
 
+
+bool do_logicalapply(void);
+#endif
-- 
1.7.10.rc3.3.g19a6c.dirty



From: Andres Freund <andres@anarazel.de>

We decided to use low-level functions to do the apply instead of producing sql
statements containing the data (or using prepared statements) because both the
text conversion and the full executor introduce significant overhead which is
unnecessary if you're using the same version of pg on the same architecture.

There are loads of use cases though that require different methods of
applying - so the part doing the applying from an ApplyCache is just a bunch
of well-abstracted callbacks getting passed all the required knowledge to
change the data representation into other formats, as sketched below.
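
As an illustration, a hypothetical apply module converting changes to text
could hang off the same callback. Only the signature (from apply.h) and the
action values are taken from the patch; everything else is made up:

    static void
    textual_apply_change(ApplyCache *cache, ApplyCacheTXN *txn,
                         ApplyCacheTXN *subtxn, ApplyCacheChange *change)
    {
        switch (change->action)
        {
            case APPLY_CACHE_CHANGE_INSERT:
                /* render change->newtuple as INSERT statement text instead
                 * of calling simple_heap_insert() */
                break;
            case APPLY_CACHE_CHANGE_UPDATE:
            case APPLY_CACHE_CHANGE_DELETE:
                /* likewise, locating the old row via its primary key */
                break;
        }
    }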

Missing:

- TOAST handling. For physical apply not much needs to be done because the
  toast inserts will have been made beforehand. There needs to be an option in
  ApplyCache that helps reassembling TOAST datums to make it easier to write
  apply modules which convert to text.

---
 src/backend/replication/logical/Makefile |   2 +-
 src/backend/replication/logical/apply.c  | 313 ++++++++++++++++++++++++++++++
 src/include/replication/apply.h          |  24 +++
 3 files changed, 338 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/replication/logical/apply.c
 create mode 100644 src/include/replication/apply.h


diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index c2d6d82..d0e0b13 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -14,6 +14,6 @@ include $(top_builddir)/src/Makefile.global
 override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
 
-OBJS = applycache.o decode.o logical.o
+OBJS = apply.o applycache.o decode.o logical.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/apply.c b/src/backend/replication/logical/apply.c
new file mode 100644
index 0000000..646bd54
--- /dev/null
+++ b/src/backend/replication/logical/apply.c
@@ -0,0 +1,313 @@
+/*-------------------------------------------------------------------------
+ *
+ * apply.c
+ *
+ * Support functions for logical/multimaster replication
+ *
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *      src/backend/replication/logical/apply.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "access/heapam.h"
+#include "access/genam.h"
+
+#include "catalog/pg_control.h"
+#include "catalog/index.h"
+
+#include "executor/executor.h"
+
+#include "replication/applycache.h"
+#include "replication/apply.h"
+
+#include "utils/rel.h"
+#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
+
+
+
+static void
+UserTableUpdateIndexes(Relation heapRel, HeapTuple heapTuple);
+
+
+void
+apply_begin_txn(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+    ApplyApplyCacheState *state = cache->private_data;
+
+    state->original_resource_owner = CurrentResourceOwner;
+
+    PreventTransactionChain(true, "Apply Process cannot be started inside a txn");
+
+    StartTransactionCommand();
+
+    PushActiveSnapshot(GetTransactionSnapshot());
+}
+
+void
+apply_commit_txn(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+    ApplyApplyCacheState *state = cache->private_data;
+
+    current_replication_origin_lsn = txn->lsn;
+
+    PopActiveSnapshot();
+    CommitTransactionCommand();
+
+
+    /*
+     * for some reason after (Start|Commit)TransactionCommand we lose our
+     * resource owner, restore it.
+     * XXX: is that correct?
+     */
+    CurrentResourceOwner = state->original_resource_owner;
+
+    current_replication_origin_lsn.xlogid = 0;
+    current_replication_origin_lsn.xrecoff = 0;
+}
+
+
+void
+apply_change(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change)
+{
+    /* for inserting */
+    Relation    tuple_rel;
+
+    tuple_rel = heap_open(HeapTupleGetOid(change->table), RowExclusiveLock);
+
+    switch (change->action)
+    {
+        case APPLY_CACHE_CHANGE_INSERT:
+        {
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "INSERT");
+#endif
+            simple_heap_insert(tuple_rel, &change->newtuple->tuple);
+
+            UserTableUpdateIndexes(tuple_rel, &change->newtuple->tuple);
+            break;
+        }
+        case APPLY_CACHE_CHANGE_UPDATE:
+        {
+            Oid indexoid = InvalidOid;
+            int16 pknratts;
+            int16 pkattnum[INDEX_MAX_KEYS];
+            Oid pktypoid[INDEX_MAX_KEYS];
+            Oid pkopclass[INDEX_MAX_KEYS];
+
+            ScanKeyData cur_skey[INDEX_MAX_KEYS];
+            int i;
+            bool isnull;
+            TupleDesc desc = RelationGetDescr(tuple_rel);
+
+            Relation index_rel;
+
+            HeapTuple old_tuple;
+            bool found = false;
+
+            IndexScanDesc scan;
+
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "UPDATE");
+#endif
+            MemSet(pkattnum, 0, sizeof(pkattnum));
+            MemSet(pktypoid, 0, sizeof(pktypoid));
+            MemSet(pkopclass, 0, sizeof(pkopclass));
+
+            relationFindPrimaryKey(tuple_rel, &indexoid, &pknratts, pkattnum, pktypoid, pkopclass);
+
+            if (!OidIsValid(indexoid))
+               ereport(ERROR,
+                       (errcode(ERRCODE_UNDEFINED_OBJECT),
+                        errmsg("there is no primary key for table \"%s\"",
+                               RelationGetRelationName(tuple_rel))));
+
+            index_rel = index_open(indexoid, AccessShareLock);
+
+            for (i = 0; i < pknratts; i++)
+            {
+                Oid operator;
+                Oid opfamily;
+                RegProcedure regop;
+
+                opfamily = get_opclass_family(pkopclass[i]);
+
+                operator = get_opfamily_member(opfamily, pktypoid[i], pktypoid[i], BTEqualStrategyNumber);
+
+                regop = get_opcode(operator);
+
+                ScanKeyInit(&cur_skey[i],
+                            pkattnum[i],
+                            BTEqualStrategyNumber,
+                            regop,
+                            fastgetattr(&change->newtuple->tuple, pkattnum[i], desc, &isnull));
+
+                Assert(!isnull);
+            }
+
+            scan = index_beginscan(tuple_rel, index_rel, GetTransactionSnapshot(),
+                                   pknratts, 0);
+            index_rescan(scan, cur_skey, pknratts, NULL, 0);
+
+            while ((old_tuple = index_getnext(scan, ForwardScanDirection)) != NULL)
+            {
+                if (found)
+                {
+                    elog(ERROR, "WTF, more than one tuple found via pk???");
+                }
+                found = true;
+
+                simple_heap_update(tuple_rel, &old_tuple->t_self, &change->newtuple->tuple);
+            }
+
+            if (!found)
+                elog(ERROR, "could not find tuple to update");
+
+            index_endscan(scan);
+
+            if (!HeapTupleIsHeapOnly(&change->newtuple->tuple))
+                UserTableUpdateIndexes(tuple_rel, &change->newtuple->tuple);
+
+            heap_close(index_rel, NoLock);
+
+            break;
+        }
+        case APPLY_CACHE_CHANGE_DELETE:
+        {
+            Oid indexoid = InvalidOid;
+            int16 pknratts;
+            int16 pkattnum[INDEX_MAX_KEYS];
+            Oid pktypoid[INDEX_MAX_KEYS];
+            Oid pkopclass[INDEX_MAX_KEYS];
+
+            ScanKeyData cur_skey[INDEX_MAX_KEYS];
+            int i;
+            bool isnull;
+
+            Relation index_rel;
+
+            HeapTuple old_tuple;
+            bool found = false;
+
+            TupleDesc index_desc;
+
+            IndexScanDesc scan;
+
+#ifdef VERBOSE_DEBUG
+            elog(LOG, "DELETE comming");
+#endif
+            MemSet(pkattnum, 0, sizeof(pkattnum));
+            MemSet(pktypoid, 0, sizeof(pktypoid));
+            MemSet(pkopclass, 0, sizeof(pkopclass));
+
+            relationFindPrimaryKey(tuple_rel, &indexoid, &pknratts, pkattnum, pktypoid, pkopclass);
+
+            if (!OidIsValid(indexoid))
+               ereport(ERROR,
+                       (errcode(ERRCODE_UNDEFINED_OBJECT),
+                        errmsg("there is no primary key for table \"%s\"",
+                               RelationGetRelationName(tuple_rel))));
+
+            index_rel = index_open(indexoid, AccessShareLock);
+            index_desc = RelationGetDescr(index_rel);
+
+            for (i = 0; i < pknratts; i++)
+            {
+                Oid operator;
+                Oid opfamily;
+                RegProcedure regop;
+
+                opfamily = get_opclass_family(pkopclass[i]);
+
+                operator = get_opfamily_member(opfamily, pktypoid[i], pktypoid[i], BTEqualStrategyNumber);
+
+                regop = get_opcode(operator);
+
+                ScanKeyInit(&cur_skey[i],
+                            pkattnum[i],
+                            BTEqualStrategyNumber,
+                            regop,
+                            fastgetattr(&change->oldtuple->tuple, i + 1, index_desc, &isnull));
+
+                Assert(!isnull);
+            }
+
+            scan = index_beginscan(tuple_rel, index_rel, GetTransactionSnapshot(),
+                                   pknratts, 0);
+            index_rescan(scan, cur_skey, pknratts, NULL, 0);
+
+
+            while ((old_tuple = index_getnext(scan, ForwardScanDirection)) != NULL)
+            {
+                if (found)
+                {
+                    elog(ERROR, "WTF, more than one tuple found via pk???");
+                }
+                found = true;
+                simple_heap_delete(tuple_rel, &old_tuple->t_self);
+            }
+
+            if (!found)
+                elog(ERROR, "could not find tuple to update");
+
+            index_endscan(scan);
+
+            heap_close(index_rel, NoLock);
+
+            break;
+        }
+    }
+    /* FIXME: locking */
+
+    heap_close(tuple_rel, NoLock);
+    CommandCounterIncrement();
+}
+
+/*
+ * The state object used by CatalogOpenIndexes and friends is actually the
+ * same as the executor's ResultRelInfo, but we give it another type name
+ * to decouple callers from that fact.
+ */
+typedef struct ResultRelInfo *UserTableIndexState;
+
+static void
+UserTableUpdateIndexes(Relation heapRel, HeapTuple heapTuple)
+{
+    /* this is largely copied together from copy.c's CopyFrom */
+    EState *estate = CreateExecutorState();
+    ResultRelInfo *resultRelInfo;
+    List *recheckIndexes = NIL;
+    TupleDesc tupleDesc = RelationGetDescr(heapRel);
+
+    resultRelInfo = makeNode(ResultRelInfo);
+    resultRelInfo->ri_RangeTableIndex = 1;        /* dummy */
+    resultRelInfo->ri_RelationDesc = heapRel;
+    resultRelInfo->ri_TrigInstrument = NULL;
+
+    ExecOpenIndices(resultRelInfo);
+
+    estate->es_result_relations = resultRelInfo;
+    estate->es_num_result_relations = 1;
+    estate->es_result_relation_info = resultRelInfo;
+
+    if (resultRelInfo->ri_NumIndices > 0)
+    {
+        TupleTableSlot *slot = ExecInitExtraTupleSlot(estate);
+        ExecSetSlotDescriptor(slot, tupleDesc);
+        ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);
+
+        recheckIndexes = ExecInsertIndexTuples(slot, &heapTuple->t_self,
+                                               estate);
+    }
+
+    ExecResetTupleTable(estate->es_tupleTable, false);
+
+    ExecCloseIndices(resultRelInfo);
+
+    FreeExecutorState(estate);
+    /* FIXME: recheck the indexes */
+    list_free(recheckIndexes);
+}
diff --git a/src/include/replication/apply.h b/src/include/replication/apply.h
new file mode 100644
index 0000000..3b818c0
--- /dev/null
+++ b/src/include/replication/apply.h
@@ -0,0 +1,24 @@
+/*
+ * apply.h
+ *
+ * PostgreSQL logical replay
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/replication/apply.h
+ */
+#ifndef APPLY_H
+#define APPLY_H
+
+#include "utils/resowner.h"
+
+typedef struct ApplyApplyCacheState
+{
+    ResourceOwner original_resource_owner;
+} ApplyApplyCacheState;
+
+void apply_begin_txn(ApplyCache* cache, ApplyCacheTXN* txn);
+void apply_commit_txn(ApplyCache* cache, ApplyCacheTXN* txn);
+void apply_change(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
+
+#endif
-- 
1.7.10.rc3.3.g19a6c.dirty



[PATCH 13/16] Introduction of pair of logical walreceiver/sender

From
Andres Freund
Date:
From: Andres Freund <andres@anarazel.de>

A logical WALReceiver is started directly by Postmaster when we enter PM_RUN
state and the new parameter multimaster_conninfo is set. For now only one of
those is started, but the code doesn't rely on that. In the future, multiple
ones should be allowed.

To transfer that data a new command, START_LOGICAL_REPLICATION is introduced in
the walsender reusing most of the infrastructure for START_REPLICATION. The
former uses the same on-the-wire format as the latter.

To make initialization possible, IDENTIFY_SYSTEM returns two new columns:
node_id returning the multimaster_node_id, and last_checkpoint returning the
RedoRecPtr.
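
For illustration, the extended IDENTIFY_SYSTEM reply would look roughly like
this (made-up values; column labels assumed):

    systemid             timeline  xlogpos    node_id  last_checkpoint
    -------------------  --------  ---------  -------  ---------------
    6012345678901234567  1         0/16D68D0  1        0/16D6838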

The walreceiver writes that data into the previously introduced
pg_lcr/$node_id directory.
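
With two configured nodes, the receiving side would for example end up with
(layout assumed from the pg_lcr description in patch 11):

    $PGDATA/pg_lcr/1/...   stream received from node 1
    $PGDATA/pg_lcr/2/...   stream received from node 2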

Future Directions/TODO:
- pass node_ids we're interested in to START_LOGICAL_REPLICATION to allow complex topologies
- allow to pass filters to reduce the transfer volume
- compress the transferred data by actually removing uninteresting records instead of replacing them by NOOP records.
  This adds some complexities because we still need to map the received lsn to the requested lsn so we know where to
  restart transferring data and such.

- check that wal on the sending side was generated with WAL_LEVEL_LOGICAL
---
 src/backend/postmaster/postmaster.c               |  10 +-
 .../libpqwalreceiver/libpqwalreceiver.c           | 104 ++++-
 src/backend/replication/repl_gram.y               |  19 +-
 src/backend/replication/repl_scanner.l            |   1 +
 src/backend/replication/walreceiver.c             | 165 +++++++-
 src/backend/replication/walreceiverfuncs.c        |   1 +
 src/backend/replication/walsender.c               | 422 +++++++++++++++-----
 src/backend/utils/misc/guc.c                      |   9 +
 src/backend/utils/misc/postgresql.conf.sample     |   1 +
 src/include/nodes/nodes.h                         |   1 +
 src/include/nodes/replnodes.h                     |  10 +
 src/include/replication/logical.h                 |   4 +
 src/include/replication/walreceiver.h             |   9 +-
 13 files changed, 624 insertions(+), 132 deletions(-)


diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 71cfd6d..13e9592 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1449,6 +1449,11 @@ ServerLoop(void)
                 kill(AutoVacPID, SIGUSR2);
         }
 
+        /* Restart walreceiver process in certain states only. */
+        if (WalReceiverPID == 0 && pmState == PM_RUN &&
+            LogicalWalReceiverActive())
+            WalReceiverPID = StartWalReceiver();
+
         /* Check all the workers requested are running. */
         if (pmState == PM_RUN)
             StartBackgroundWorkers();
@@ -2169,7 +2174,8 @@ pmdie(SIGNAL_ARGS)
                 /* and the walwriter too */
                 if (WalWriterPID != 0)
                     signal_child(WalWriterPID, SIGTERM);
-
+                if (WalReceiverPID != 0)
+                    signal_child(WalReceiverPID, SIGTERM);
                 /*
                  * If we're in recovery, we can't kill the startup process
                  * right away, because at present doing so does not release
@@ -2421,6 +2427,8 @@ reaper(SIGNAL_ARGS)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
+            if (WalReceiverPID == 0 && LogicalWalReceiverActive())
+                WalReceiverPID = StartWalReceiver();
             StartBackgroundWorkers();
             /* at this point we are really open for business */
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 979b66b..0ea3fce 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -46,7 +46,8 @@ static PGconn *streamConn = NULL;
 static char *recvBuf = NULL;
 
 /* Prototypes for interface functions */
-static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
+static bool libpqrcv_connect(char *conninfo, XLogRecPtr* redo, XLogRecPtr* where_at, bool startedDuringRecovery);
+static bool libpqrcv_start(char *conninfo, XLogRecPtr* startpoint, bool startedDuringRecovery);
 static bool libpqrcv_receive(int timeout, unsigned char *type,
 				 char **buffer, int *len);
 static void libpqrcv_send(const char *buffer, int nbytes);
 
@@ -63,10 +64,12 @@ void
 _PG_init(void)
 {
 	/* Tell walreceiver how to reach us */
-	if (walrcv_connect != NULL || walrcv_receive != NULL ||
-		walrcv_send != NULL || walrcv_disconnect != NULL)
+	if (walrcv_connect != NULL || walrcv_start != NULL ||
+		walrcv_receive != NULL || walrcv_send != NULL ||
+		walrcv_disconnect != NULL)
 		elog(ERROR, "libpqwalreceiver already loaded");
 	walrcv_connect = libpqrcv_connect;
+	walrcv_start = libpqrcv_start;
 	walrcv_receive = libpqrcv_receive;
 	walrcv_send = libpqrcv_send;
 	walrcv_disconnect = libpqrcv_disconnect;
 
@@ -76,7 +79,7 @@ _PG_init(void)
  * Establish the connection to the primary server for XLOG streaming
  */
 static bool
-libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
+libpqrcv_connect(char *conninfo, XLogRecPtr* redo, XLogRecPtr* where_at, bool startedDuringRecovery)
 {
 	char		conninfo_repl[MAXCONNINFO + 75];
 	char	   *primary_sysid;
 
@@ -84,7 +87,8 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 	TimeLineID	primary_tli;
 	TimeLineID	standby_tli;
 	PGresult   *res;
-	char		cmd[64];
+
+	elog(LOG, "wal receiver connecting");
 
 	/*
 	 * Connect using deliberately undocumented parameter: replication. The
@@ -96,10 +100,16 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 			 conninfo);
 	streamConn = PQconnectdb(conninfo_repl);
-	if (PQstatus(streamConn) != CONNECTION_OK)
+	if (PQstatus(streamConn) != CONNECTION_OK){
+		/*
+		 * FIXME: its very annoying for development if the whole buffer is
+		 * immediately filled. We need a better solution.
+		 */
+		pg_usleep(1000000);
 		ereport(ERROR,
 				(errmsg("could not connect to the primary server: %s",
 						PQerrorMessage(streamConn))));
+	}
 
 	/*
 	 * Get the system identifier and timeline ID as a DataRow message from the
@@ -114,7 +124,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 						"the primary server: %s",
 						PQerrorMessage(streamConn))));
 	}
 
-	if (PQnfields(res) != 3 || PQntuples(res) != 1)
+	if (PQnfields(res) != 5 || PQntuples(res) != 1)
 	{
 		int			ntuples = PQntuples(res);
 		int			nfields = PQnfields(res);
 
@@ -122,14 +132,40 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 		PQclear(res);
 		ereport(ERROR,
 				(errmsg("invalid response from primary server"),
-				 errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
+				 errdetail("Expected 1 tuple with 5 fields, got %d tuples with %d fields.",
 						   ntuples, nfields)));
 	}
 	primary_sysid = PQgetvalue(res, 0, 0);
 
+    primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);
+    /* FIXME: this should be already implemented nicely somewhere? */
+    if(sscanf(PQgetvalue(res, 0, 2),
+              "%X/%X", &where_at->xlogid, &where_at->xrecoff) != 2){
+        elog(FATAL, "couldn't parse the xlog address from the other side: %s",
+             PQgetvalue(res, 0, 2));
+    }
+
+    elog(LOG, "other end is currently at %X/%X",
+         where_at->xlogid, where_at->xrecoff);
+
+    receiving_from_node_id = pg_atoi(PQgetvalue(res, 0, 3), 4, 0);
+
+    /* FIXME: this should be already implemented nicely somewhere? */
+    if(sscanf(PQgetvalue(res, 0, 4),
+              "%X/%X", &redo->xlogid, &redo->xrecoff) != 2){
+        elog(FATAL, "couldn't parse the xlog address from the other side: %s",
+             PQgetvalue(res, 0, 4));
+    }
+
+    elog(LOG, "other end's redo is currently at %X/%X",
+         redo->xlogid, redo->xrecoff);
+
+
 	/*
 	 * Confirm that the system identifier of the primary is the same as ours.
+	 *
+	 * FIXME: do we want that restriction for mm?
 	 */
 	snprintf(standby_sysid, sizeof(standby_sysid), UINT64_FORMAT,
 			 GetSystemIdentifier());
 
@@ -142,21 +178,49 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 						   primary_sysid, standby_sysid)));
 	}
 
-	/*
-	 * Confirm that the current timeline of the primary is the same as the
-	 * recovery target timeline.
-	 */
-	standby_tli = GetRecoveryTargetTLI();
 	PQclear(res);
-	if (primary_tli != standby_tli)
-		ereport(ERROR,
-				(errmsg("timeline %u of the primary does not match recovery target timeline %u",
-						primary_tli, standby_tli)));
-	ThisTimeLineID = primary_tli;
 
-	/* Start streaming from the point requested by startup process */
-	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
-			 startpoint.xlogid, startpoint.xrecoff);
+    if (startedDuringRecovery)
+    {
+        /*
+         * Confirm that the current timeline of the primary is the same as the
+         * recovery target timeline.
+         */
+        standby_tli = GetRecoveryTargetTLI();
+        if (primary_tli != standby_tli)
+            ereport(ERROR,
+                    (errmsg("timeline %u of the primary does not match recovery target timeline %u",
+                            primary_tli, standby_tli)));
+        ThisTimeLineID = primary_tli;
+    }
+
+    return true;
+}
+
+/*
+ * start streaming data
+ */
+static bool
+libpqrcv_start(char *conninfo, XLogRecPtr* startpoint, bool startedDuringRecovery)
+{
+    PGresult   *res;
+    char        cmd[64];
+
+    if(startedDuringRecovery)
+    {
+        snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
+             startpoint->xlogid, startpoint->xrecoff);
+    }
+    else
+    {
+        /* ignore the timeline */
+        elog(LOG, "receiving_from_node_id: %u at %X/%X", receiving_from_node_id,
+             startpoint->xlogid, startpoint->xrecoff);
+        snprintf(cmd, sizeof(cmd), "START_LOGICAL_REPLICATION %X/%X",
+             startpoint->xlogid, startpoint->xrecoff);
+    }
+	res = libpqrcv_PQexec(cmd);
 	if (PQresultStatus(res) != PGRES_COPY_BOTH)
 	{
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index b6cfdac..b49ae6f 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,9 +76,10 @@ Node *replication_parse_result;
 %token K_NOWAIT
 %token K_WAL
 %token K_START_REPLICATION
+%token K_START_LOGICAL_REPLICATION
 
 %type <node>	command
-%type <node>	base_backup start_replication identify_system
+%type <node>	base_backup start_replication start_logical_replication identify_system
 %type <list>	base_backup_opt_list
 %type <defelt>	base_backup_opt
 
 %%
 
@@ -97,6 +98,7 @@ command:
 			identify_system
 			| base_backup
 			| start_replication
+			| start_logical_replication
 			;
 
 /*
@@ -166,6 +168,21 @@ start_replication:
 					$$ = (Node *) cmd;
 				}
 			;
+
+/*
+ * START_LOGICAL_REPLICATION %X/%X
+ */
+start_logical_replication:
+            K_START_LOGICAL_REPLICATION RECPTR
+                {
+                    StartLogicalReplicationCmd *cmd;
+
+                    cmd = makeNode(StartLogicalReplicationCmd);
+                    cmd->startpoint = $2;
+
+                    $$ = (Node *) cmd;
+                }
+			;
 
 %%
 
 #include "repl_scanner.c"
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 9d4edcf..f8be982 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -64,6 +64,7 @@ NOWAIT			{ return K_NOWAIT; }
 PROGRESS			{ return K_PROGRESS; }
 WAL				{ return K_WAL; }
 START_REPLICATION	{ return K_START_REPLICATION; }
+START_LOGICAL_REPLICATION	{ return K_START_LOGICAL_REPLICATION; }
 ","				{ return ','; }
 ";"				{ return ';'; }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e97196b..73a3021 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -44,6 +44,7 @@
 #include "replication/walprotocol.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
+#include "replication/logical.h"
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
@@ -58,9 +59,12 @@ bool		am_walreceiver;
 
 /* GUC variable */
 int			wal_receiver_status_interval;
 bool		hot_standby_feedback;
+char	   *mm_conninfo = 0;
+RepNodeId	receiving_from_node_id = InvalidMultimasterNodeId;
 
 /* libpqreceiver hooks to these when loaded */
 walrcv_connect_type walrcv_connect = NULL;
+walrcv_start_type walrcv_start = NULL;
 walrcv_receive_type walrcv_receive = NULL;
 walrcv_send_type walrcv_send = NULL;
 walrcv_disconnect_type walrcv_disconnect = NULL;
 
@@ -93,9 +97,13 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
+XLogRecPtr	curRecv;
+
 static StandbyReplyMessage reply_message;
 static StandbyHSFeedbackMessage feedback_message;
 
+static bool startedDuringRecovery;	/* are we going to receive WAL data */
+
 /*
  * About SIGTERM handling:
  *
@@ -122,6 +130,9 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+
+static void LogicalWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
+
 static void XLogWalRcvSendReply(void);
 static void XLogWalRcvSendHSFeedback(void);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 
@@ -170,13 +181,17 @@ void
 WalReceiverMain(void)
 {
 	char		conninfo[MAXCONNINFO];
-	XLogRecPtr	startpoint;
+	XLogRecPtr	startpoint = {0, 0};
+	XLogRecPtr	other_end_at;
+	XLogRecPtr	other_end_redo;
 
 	/* use volatile pointer to prevent code rearrangement */
 	volatile WalRcvData *walrcv = WalRcv;
 
 	am_walreceiver = true;
 
+	elog(LOG, "wal receiver starting");
+
 	/*
 	 * WalRcv should be set up already (if we are a backend, we inherit this
 	 * by fork() or EXEC_BACKEND mechanism from the postmaster).
 
@@ -200,8 +215,11 @@ WalReceiverMain(void)
 			/* fall through */
 		case WALRCV_STOPPED:
-			SpinLockRelease(&walrcv->mutex);
-			proc_exit(1);
+			if (startedDuringRecovery)
+			{
+				SpinLockRelease(&walrcv->mutex);
+				proc_exit(1);
+			}
 			break;
 
 		case WALRCV_STARTING:
@@ -212,13 +230,35 @@ WalReceiverMain(void)
 			/* Shouldn't happen */
 			elog(PANIC, "walreceiver still running according to shared memory state");
 	}
 
-	/* Advertise our PID so that the startup process can kill us */
+	/* Advertise our PID so that we can be killed */
 	walrcv->pid = MyProcPid;
 	walrcv->walRcvState = WALRCV_RUNNING;
-    /* Fetch information required to start streaming */
-    strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
-    startpoint = walrcv->receiveStart;
+    /*
+     * Fetch information required to start streaming.
+     *
+     * During recovery the WALReceiver is started from the Startup process,
+     * by sending a postmaster signal. In normal running the Postmaster
+     * starts the WALReceiver directly. In that case the walrcv shmem struct
+     * is simply zeroed, so walrcv->startedDuringRecovery will show as false.
+     *
+     * The connection info required to access the upstream master comes from
+     * the multimaster_conninfo parameter, stored in the mm_conninfo variable.
+     *
+     * XXX The starting point for logical replication is not yet determined.
+     */
+    startedDuringRecovery = walrcv->startedDuringRecovery;
+    if (startedDuringRecovery)
+    {
+        strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
+        startpoint = walrcv->receiveStart;
+    }
+    else
+    {
+        elog(LOG, "logical replication starting: %s", mm_conninfo);
+        strlcpy(conninfo, (char *) mm_conninfo, MAXCONNINFO);
+        /* The startpoint for logical replay can only be determined after connecting */
+	}
 
 	/* Initialise to a sanish value */
 	walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime = GetCurrentTimestamp();
@@ -262,8 +302,9 @@ WalReceiverMain(void)
 	/* Load the libpq-specific functions */
 	load_file("libpqwalreceiver", false);
-	if (walrcv_connect == NULL || walrcv_receive == NULL ||
-		walrcv_send == NULL || walrcv_disconnect == NULL)
+	if (walrcv_connect == NULL || walrcv_start == NULL ||
+		walrcv_receive == NULL || walrcv_send == NULL ||
+		walrcv_disconnect == NULL)
 		elog(ERROR, "libpqwalreceiver didn't initialize correctly");
 
 	/*
@@ -277,7 +318,58 @@ WalReceiverMain(void)
 	/* Establish the connection to the primary for XLOG streaming */
 	EnableWalRcvImmediateExit();
-	walrcv_connect(conninfo, startpoint);
+	walrcv_connect(conninfo, &other_end_redo, &other_end_at, startedDuringRecovery);
+
+    if(LogicalWalReceiverActive()){
+        char buf[MAXPGPATH];
+
+        if(RecoveryInProgress()){
+            elog(FATAL, "cannot have the logical receiver running while recovery is ongoing");
+        }
+
+        if(receiving_from_node_id == InvalidMultimasterNodeId)
+            elog(FATAL, "didn't setup/derive other node id");
+
+        Assert(WalRcv);
+
+        startpoint = WalRcv->mm_receiveState[receiving_from_node_id];
+
+        /*
+         * in this case we connect to some master we haven't yet received data
+         * from yet.
+         * FIXME: This means we would need to initialize the local cluster!
+         */
+        if(XLByteEQ(startpoint, zeroRecPtr)){
+            startpoint = other_end_redo;
+
+            /* we need to scroll back to the begin of the segment */
+            startpoint.xrecoff -= startpoint.xrecoff % XLogSegSize;
+
+            WalRcv->mm_receiveState[receiving_from_node_id] = startpoint;
+
+            WalRcv->mm_applyState[receiving_from_node_id] = other_end_redo;
+
+            /* FIXME: this should be an ereport */
+            elog(LOG, "initializing recovery from logical node %d to %X/%X, transfer from %X/%X",
+                 receiving_from_node_id,
+                 other_end_at.xlogid, other_end_at.xrecoff,
+                 startpoint.xlogid, startpoint.xrecoff);
+        }
+        else if(XLByteLT(other_end_at, startpoint)){
+            elog(FATAL, "something went wrong, the other side has a too small xlogid/xlrecoff. Other: %X/%X, self:
%X/%X",
+                 other_end_at.xlogid, other_end_at.xrecoff,
+                 startpoint.xlogid, startpoint.xrecoff);
+        }
+
+        /*
+         * the set of foreign nodes can increase all the time, so we just make
+         * sure the particular one we need exists.
+         */
+        snprintf(buf, MAXPGPATH-1, "%s/%u", LCRDIR, receiving_from_node_id);
+        pg_mkdir_p(buf, S_IRWXU);
+    }
+
+	walrcv_start(conninfo, &startpoint, startedDuringRecovery);
 	DisableWalRcvImmediateExit();
 
 	/* Loop until end-of-streaming or error */
 
@@ -298,7 +390,7 @@ WalReceiverMain(void)
 		 * Exit walreceiver if we're not in recovery. This should not happen,
 		 * but cross-check the status here.
 		 */
-		if (!RecoveryInProgress())
+		if (!RecoveryInProgress() && !LogicalWalReceiverActive())
 			ereport(FATAL,
 					(errmsg("cannot continue WAL streaming, recovery has already ended")));
 
@@ -443,7 +535,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				buf += sizeof(WalDataMessageHeader);
 				len -= sizeof(WalDataMessageHeader);
 
-				XLogWalRcvWrite(buf, len, msghdr.dataStart);
+
+				/*
+				 * The WALReceiver connects either during recovery or during
+				 * normal running. During recovery pure WAL data is
+				 * received, whereas during normal running we send Logical
+				 * Change Records (LCRs) which are stored differently.
+				 */
+				if (LogicalWalReceiverActive())
+					XLogWalRcvWrite(buf, len, msghdr.dataStart);
+				else
+					LogicalWalRcvWrite(buf, len, msghdr.dataStart);
 				break;
 			}
 		case 'k':				/* Keepalive */
 
@@ -477,6 +579,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	int			startoff;
 	int			byteswritten;
 
+#ifdef VERBOSE_DEBUG
+	elog(LOG, "received data len %lu, at %X/%X",
+		 nbytes, recptr.xlogid, recptr.xrecoff);
+#endif
 	while (nbytes > 0)
 	{
 		int			segbytes;
@@ -509,7 +615,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvId, recvSeg);
 			use_existent = true;
-			recvFile = XLogFileInit(InvalidMultimasterNodeId, recvId, recvSeg, &use_existent, true);
+			recvFile = XLogFileInit(receiving_from_node_id, recvId, recvSeg, &use_existent, true);
 			recvOff = 0;
 		}
 
@@ -585,6 +691,27 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->receivedUpto;
 			walrcv->receivedUpto = LogstreamResult.Flush;
 
+
+            /* FIXME */
+            if(LogicalWalReceiverActive()){
+                if(XLByteLE(curRecv, LogstreamResult.Write)){
+                    WalRcv->mm_receiveState[receiving_from_node_id] = curRecv;
+
+                    if(WalRcv->mm_receiveLatch[receiving_from_node_id])
+                        SetLatch(WalRcv->mm_receiveLatch[receiving_from_node_id]);
+#if 0
+                    elog(LOG, "confirming flush to %X/%X",
+                         curRecv.xlogid, curRecv.xrecoff);
+#endif
+                }
+                else{
+#if 0
+                    elog(LOG, "not conf flush to %X/%X, wrote to %X/%X",
+                         curRecv.xlogid, curRecv.xrecoff,
+                         LogstreamResult.Write.xlogid, LogstreamResult.Write.xrecoff);
+#endif
+                }
+			}
 		}
 		SpinLockRelease(&walrcv->mutex);
@@ -614,6 +741,15 @@ XLogWalRcvFlush(bool dying)
 }
 
 /*
+ * Handle LCR data.
+ */
+static void
+LogicalWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
+{
+    elog(LOG, "received msg of length %u", (uint) nbytes);
+}
+
+/*
  * Send reply message to primary, indicating our current XLOG positions and
  * the current time.
  */
@@ -750,6 +886,9 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
+	/* we need to store that in shmem */
+	curRecv = walEnd;
+
 	if (log_min_messages <= DEBUG2)
 		elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d ms transfer latency %d ms",
 			 timestamptz_to_str(sendTime),
 
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index cb49282..aa07746 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -207,6 +207,7 @@ RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
 		walrcv->conninfo[0] = '\0';
 	walrcv->walRcvState = WALRCV_STARTING;
 	walrcv->startTime = now;
+	walrcv->startedDuringRecovery = true;
 
 	/*
 	 * If this is the first startup of walreceiver, we initialize receivedUpto
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8cd3a00..d2e1c76 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -37,9 +37,13 @@
 #include <signal.h>
 #include <unistd.h>
 
+#include "access/xlogreader.h"
 #include "access/transam.h"
+#include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_type.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
 #include "funcapi.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -64,6 +68,7 @@
 #include "utils/ps_status.h"
 #include "utils/resowner.h"
 #include "utils/timestamp.h"
+#include "utils/syscache.h"
 
 /* Array of WalSnds in shared memory */
@@ -74,9 +79,12 @@ WalSnd	   *MyWalSnd = NULL;
 
 /* Global state */
 bool		am_walsender = false;		/* Am I a walsender process ? */
 bool		am_cascading_walsender = false;		/* Am I cascading WAL to
 												 * another standby ? */
+bool		am_doing_logical = false;	/* Am I sending logical changes
+										 * instead of physical ones */
+
 /* User-settable parameters for walsender */
 int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
 int			replication_timeout = 60 * 1000;	/* maximum time to send one
@@ -112,6 +120,12 @@ static TimestampTz last_reply_timestamp;
  */
 static bool wroteNewXlogData = false;
 
+/*
+ * state for continuous reading of the local servers wal for sending logical
+ * wal
+ */
+static XLogReaderState* xlogreader_state = 0;
+
 /* Flags set by signal handlers for later service in main loop */
 static volatile sig_atomic_t got_SIGHUP = false;
 volatile sig_atomic_t walsender_shutdown_requested = false;
 
@@ -131,8 +145,19 @@ static void InitWalSnd(void);
 static void WalSndHandshake(void);
 static void WalSndKill(int code, Datum arg);
 static void XLogSend(char *msgbuf, bool *caughtup);
+static Size XLogSendPhysical(char *msgbuf, bool *caughtup, XLogRecPtr startptr,
+							 XLogRecPtr endptr);
+static Size XLogSendLogical(char *msgbuf, bool *caughtup, XLogRecPtr startptr,
+							XLogRecPtr endptr);
 static void IdentifySystem(void);
 static void StartReplication(StartReplicationCmd *cmd);
+static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+
+static bool RecordRelevantForLogicalReplication(XLogReaderState* state, XLogRecord* r);
+static void ProcessRecord(XLogReaderState* state, XLogRecordBuffer* buf);
+static void WriteoutData(XLogReaderState* state, char* data, Size len);
+static void XLogReadPage(XLogReaderState* state, char *buf, XLogRecPtr startptr);
+
 static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
@@ -293,8 +318,10 @@ IdentifySystem(void)
 	char		sysid[32];
 	char		tli[11];
 	char		xpos[MAXFNAMELEN];
+	char		node_id[MAXFNAMELEN];	//FIXME
+	char		redoptr_s[MAXFNAMELEN];
 	XLogRecPtr	logptr;
-
+	XLogRecPtr	redoptr = GetRedoRecPtr();
 
 	/*
 	 * Reply with a result set with one row, three columns. First col is
 	 * system ID, second is timeline ID, and third is current xlog location.
@@ -309,9 +336,14 @@ IdentifySystem(void)
 	snprintf(xpos, sizeof(xpos), "%X/%X",
 			 logptr.xlogid, logptr.xrecoff);
 
+	snprintf(node_id, sizeof(node_id), "%i", guc_replication_origin_id);
+
+	snprintf(redoptr_s, sizeof(redoptr_s), "%X/%X",
+			 redoptr.xlogid, redoptr.xrecoff);
+
 	/* Send a RowDescription message */
 	pq_beginmessage(&buf, 'T');
-	pq_sendint(&buf, 3, 2);		/* 3 fields */
+	pq_sendint(&buf, 5, 2);		/* 5 fields */
 
 	/* first field */
 	pq_sendstring(&buf, "systemid");	/* col name */
 
@@ -332,24 +364,47 @@ IdentifySystem(void)
 	pq_sendint(&buf, 0, 2);		/* format code */
 
 	/* third field */
-    pq_sendstring(&buf, "xlogpos");
-    pq_sendint(&buf, 0, 4);
-    pq_sendint(&buf, 0, 2);
-    pq_sendint(&buf, TEXTOID, 4);
-    pq_sendint(&buf, -1, 2);
-    pq_sendint(&buf, 0, 4);
-    pq_sendint(&buf, 0, 2);
+    pq_sendstring(&buf, "xlogpos");    /* col name */
+    pq_sendint(&buf, 0, 4);        /* table oid */
+    pq_sendint(&buf, 0, 2);        /* attnum */
+    pq_sendint(&buf, TEXTOID, 4);        /* type oid */
+    pq_sendint(&buf, -1, 2);        /* typlen */
+    pq_sendint(&buf, 0, 4);        /* typmod */
+    pq_sendint(&buf, 0, 2);        /* format code */
+
+    /* fourth field */
+    pq_sendstring(&buf, "node_id");    /* col name */
+    pq_sendint(&buf, 0, 4);        /* table oid */
+    pq_sendint(&buf, 0, 2);        /* attnum */
+    pq_sendint(&buf, INT4OID, 4);        /* type oid */
+    pq_sendint(&buf, 4, 2);        /* typlen */
+    pq_sendint(&buf, 0, 4);        /* typmod */
+    pq_sendint(&buf, 0, 2);        /* format code */
+
+    /* fifth field */
+    pq_sendstring(&buf, "last_checkpoint");    /* col name */
+    pq_sendint(&buf, 0, 4);        /* table oid */
+    pq_sendint(&buf, 0, 2);        /* attnum */
+    pq_sendint(&buf, TEXTOID, 4);        /* type oid */
+    pq_sendint(&buf, -1, 2);        /* typlen */
+    pq_sendint(&buf, 0, 4);        /* typmod */
+    pq_sendint(&buf, 0, 2);        /* format code */
+	pq_endmessage(&buf);
 
 	/* Send a DataRow message */
 	pq_beginmessage(&buf, 'D');
-	pq_sendint(&buf, 3, 2);		/* # of columns */
+	pq_sendint(&buf, 5, 2);		/* # of columns */
 	pq_sendint(&buf, strlen(sysid), 4); /* col1 len */
 	pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
 	pq_sendint(&buf, strlen(tli), 4);	/* col2 len */
 	pq_sendbytes(&buf, (char *) tli, strlen(tli));
 	pq_sendint(&buf, strlen(xpos), 4);	/* col3 len */
 	pq_sendbytes(&buf, (char *) xpos, strlen(xpos));
 
+    pq_sendint(&buf, strlen(node_id), 4);    /* col4 len */
+    pq_sendbytes(&buf, (char *)node_id, strlen(node_id));
+    pq_sendint(&buf, strlen(redoptr_s), 4);    /* col5 len */
+	pq_sendbytes(&buf, (char *)redoptr_s, strlen(redoptr_s));
 
 	pq_endmessage(&buf);
@@ -432,6 +487,8 @@ StartReplication(StartReplicationCmd *cmd)
 	pq_endmessage(&buf);
 	pq_flush();
 
+	am_doing_logical = false;
+
 	/*
 	 * Initialize position to the received one, then the xlog records begin to
 	 * be shipped from that position
@@ -440,6 +497,56 @@ StartReplication(StartReplicationCmd *cmd)
 }
 
 /*
+ * START_LOGICAL_REPLICATION
+ */
+static void
+StartLogicalReplication(StartLogicalReplicationCmd *cmd)
+{
+    StringInfoData buf;
+
+    /* XXX: see above */
+    MarkPostmasterChildWalSender();
+    SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+    /* XXX: see above*/
+    if (am_cascading_walsender && !RecoveryInProgress())
+    {
+        ereport(LOG,
+                (errmsg("terminating walsender process to force cascaded standby "
+                        "to update timeline and reconnect")));
+        walsender_ready_to_stop = true;
+    }
+
+    /* XXX: see above*/
+    WalSndSetState(WALSNDSTATE_CATCHUP);
+
+    /* Send a CopyBothResponse message, and start streaming */
+    pq_beginmessage(&buf, 'W');
+    pq_sendbyte(&buf, 0);
+    pq_sendint(&buf, 0, 2);
+    pq_endmessage(&buf);
+    pq_flush();
+
+    am_doing_logical = true;
+
+    sentPtr = cmd->startpoint;
+
+    if(!xlogreader_state){
+        xlogreader_state = XLogReaderAllocate();
+        xlogreader_state->is_record_interesting = RecordRelevantForLogicalReplication;
+        xlogreader_state->finished_record = ProcessRecord;
+        xlogreader_state->writeout_data = WriteoutData;
+        xlogreader_state->read_page = XLogReadPage;
+
+        /* FIXME: it would probably better to handle this */
+        XLogReaderReset(xlogreader_state);
+    }
+
+    xlogreader_state->startptr = cmd->startpoint;
+    xlogreader_state->curptr = cmd->startpoint;
+}
+
+/*
  * Execute an incoming replication command.
  */
 static bool
@@ -483,6 +590,13 @@ HandleReplicationCommand(const char *cmd_string)
 			replication_started = true;
 			break;
 
+		case T_StartLogicalReplicationCmd:
+			StartLogicalReplication((StartLogicalReplicationCmd *) cmd_node);
+
+			/* break out of the loop */
+			replication_started = true;
+			break;
+
 		case T_BaseBackupCmd:
 			SendBaseBackup((BaseBackupCmd *) cmd_node);
@@ -1071,54 +1185,142 @@ retry:
 		p += readbytes;
 	}
 
-    /*
-     * After reading into the buffer, check that what we read was valid. We do
-     * this after reading, because even though the segment was present when we
-     * opened it, it might get recycled or removed while we read it. The
-     * read() succeeds in that case, but the data we tried to read might
-     * already have been overwritten with new WAL records.
-     */
-    XLogGetLastRemoved(&lastRemovedLog, &lastRemovedSeg);
-    XLByteToSeg(startptr, log, seg);
-    if (log < lastRemovedLog ||
-        (log == lastRemovedLog && seg <= lastRemovedSeg))
-    {
-        char        filename[MAXFNAMELEN];
+    if(node_id == InvalidMultimasterNodeId){
+        /*
+         * After reading into the buffer, check that what we read was valid. We
+         * do this after reading, because even though the segment was present
+         * when we opened it, it might get recycled or removed while we read
+         * it. The read() succeeds in that case, but the data we tried to read
+         * might already have been overwritten with new WAL records.
+         */
+        XLogGetLastRemoved(&lastRemovedLog, &lastRemovedSeg);
+        XLByteToSeg(startptr, log, seg);
+        if (log < lastRemovedLog ||
+            (log == lastRemovedLog && seg <= lastRemovedSeg))
+        {
+            char        filename[MAXFNAMELEN];
-        XLogFileName(filename, ThisTimeLineID, log, seg);
-        ereport(ERROR,
-                (errcode_for_file_access(),
-                 errmsg("requested WAL segment %s has already been removed",
-                        filename)));
+            XLogFileName(filename, ThisTimeLineID, log, seg);
+            ereport(ERROR,
+                    (errcode_for_file_access(),
+                     errmsg("requested WAL segment %s has already been removed",
+                            filename)));
+        }
+
+        /*
+         * During recovery, the currently-open WAL file might be replaced with
+         * the file of the same name retrieved from archive. So we always need
+         * to check what we read was valid after reading into the buffer. If
+         * it's invalid, we try to open and read the file again.
+         */
+        if (am_cascading_walsender)
+        {
+            /* use volatile pointer to prevent code rearrangement */
+            volatile WalSnd *walsnd = MyWalSnd;
+            bool        reload;
+
+            SpinLockAcquire(&walsnd->mutex);
+            reload = walsnd->needreload;
+            walsnd->needreload = false;
+            SpinLockRelease(&walsnd->mutex);
+
+            if (reload && sendFile >= 0)
+            {
+                close(sendFile);
+                sendFile = -1;
+
+                goto retry;
+            }
+		}
+	}
+    else{
+        /* FIXME: check shm? */
+    }
+}
+static bool
+RecordRelevantForLogicalReplication(XLogReaderState* state, XLogRecord* r){
 	/*
-     * During recovery, the currently-open WAL file might be replaced with the
-     * file of the same name retrieved from archive. So we always need to
-     * check what we read was valid after reading into the buffer. If it's
-     * invalid, we try to open and read the file again.
+	 * For now we only send out data that is originating locally which implies
+	 * a star topology between all nodes. Later we might support more
+	 * complicated models. For that filtering positively by wanted IDs sounds
+	 * like a better idea.
 	 */
-    if (am_cascading_walsender)
-    {
-        /* use volatile pointer to prevent code rearrangement */
-        volatile WalSnd *walsnd = MyWalSnd;
-        bool        reload;
+    if(r->xl_origin_id != current_replication_origin_id)
+        return false;
+
+    switch(r->xl_rmid){
+        case RM_HEAP_ID:
+        case RM_HEAP2_ID:
+        case RM_XACT_ID:
+        case RM_XLOG_ID:
+            /* FIXME: filter additionally */
+            return true;
+        default:
+            return false;
+    }
+}
-        SpinLockAcquire(&walsnd->mutex);
-        reload = walsnd->needreload;
-        walsnd->needreload = false;
-        SpinLockRelease(&walsnd->mutex);
-        if (reload && sendFile >= 0)
-        {
-            close(sendFile);
-            sendFile = -1;
+static void
+XLogReadPage(XLogReaderState* state, char *buf, XLogRecPtr startptr)
+{
+    XLogPageHeader page_header;
-            goto retry;
-        }
+    Assert((startptr.xrecoff % XLOG_BLCKSZ) == 0);
+
+    /* elog(LOG, "Reading from %X/%X", startptr.xlogid, startptr.xrecoff); */
+
+    /* FIXME: more sensible implementation */
+    XLogRead(buf, InvalidMultimasterNodeId, startptr, XLOG_BLCKSZ);
+
+    page_header = (XLogPageHeader)buf;
+
+    if(page_header->xlp_magic != XLOG_PAGE_MAGIC){
+        elog(FATAL, "page header magic %x, should be %x", page_header->xlp_magic,
+			 XLOG_PAGE_MAGIC);
+	}
+}
+static void
+ProcessRecord(XLogReaderState* state, XLogRecordBuffer* buf){
+    //FIXME: process table relfilenode reassignments here
+}
+
+static void WriteoutData(XLogReaderState* state, char* data, Size len){
+    //FIXME: state->nbytes shouldn't be used in here
+    /* we want to writeout zeros */
+    if(data == 0)
+        memset((char*)state->private_data + state->nbytes, 0, len);
+    else
+        memcpy((char*)state->private_data + state->nbytes, data, len);
+    state->nbytes += len;
+}
+
+static Size
+XLogSendLogical(char *msgbuf, bool *caughtup, XLogRecPtr startptr,
+                XLogRecPtr endptr)
+{
+#ifdef BUGGY
+    if(!xlogreader_state->incomplete){
+        XLogReaderReset(xlogreader_state);
+        xlogreader_state->startptr = startptr;
+        xlogreader_state->curptr = startptr;
+    }
+#endif
+
+    xlogreader_state->endptr = endptr;
+    xlogreader_state->private_data = msgbuf;
+    xlogreader_state->nbytes = 0;//FIXME: this should go
+
+    XLogReaderRead(xlogreader_state);
+
+    //FIXME
+    sentPtr = xlogreader_state->curptr;
+
+    return xlogreader_state->nbytes;
+}
+
 /*
  * Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
  * but not yet sent to the client, and buffer it in the libpq output
 
@@ -1136,10 +1338,11 @@ static void
 XLogSend(char *msgbuf, bool *caughtup)
 {
 	XLogRecPtr	SendRqstPtr;
-	XLogRecPtr	startptr;
-	XLogRecPtr	endptr;
-	Size		nbytes;
+	XLogRecPtr	startptr = sentPtr;
+	XLogRecPtr	endptr = sentPtr;
+	WalDataMessageHeader msghdr;
+	Size		nbytes = 0;
 
 	/*
 	 * Attempt to send all data that's already been written out and fsync'd to
@@ -1155,44 +1358,17 @@ XLogSend(char *msgbuf, bool *caughtup)
 	if (XLByteLE(SendRqstPtr, sentPtr))
 	{
 		*caughtup = true;
+#if 0
+		elog(LOG, "caughtup %X/%X", SendRqstPtr.xlogid, SendRqstPtr.xrecoff);
+#endif
 		return;
 	}
-    /*
-     * Figure out how much to send in one message. If there's no more than
-     * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
-     * MAX_SEND_SIZE bytes, but round back to logfile or page boundary.
-     *
-     * The rounding is not only for performance reasons. Walreceiver relies on
-     * the fact that we never split a WAL record across two messages. Since a
-     * long WAL record is split at page boundary into continuation records,
-     * page boundary is always a safe cut-off point. We also assume that
-     * SendRqstPtr never points to the middle of a WAL record.
-     */
-    startptr = sentPtr;
-    if (startptr.xrecoff >= XLogFileSize)
-    {
-        /*
-         * crossing a logid boundary, skip the non-existent last log segment
-         * in previous logical log file.
-         */
-        startptr.xlogid += 1;
-        startptr.xrecoff = 0;
-    }
-
-    endptr = startptr;
+	/* FIXME: this is duplicated in physical transport */
 	XLByteAdvance(endptr, MAX_SEND_SIZE);
-	if (endptr.xlogid != startptr.xlogid)
-	{
-		/* Don't cross a logfile boundary within one message */
-		Assert(endptr.xlogid == startptr.xlogid + 1);
-		endptr.xlogid = startptr.xlogid;
-		endptr.xrecoff = XLogFileSize;
-	}
 
 	/* if we went beyond SendRqstPtr, back off */
-	if (XLByteLE(SendRqstPtr, endptr))
-	{
+	if (XLByteLE(SendRqstPtr, endptr)){
 		endptr = SendRqstPtr;
 		*caughtup = true;
 	}
@@ -1203,34 +1379,39 @@ XLogSend(char *msgbuf, bool *caughtup)
 		*caughtup = false;
 	}
 
-	nbytes = endptr.xrecoff - startptr.xrecoff;
-	Assert(nbytes <= MAX_SEND_SIZE);
-
 	/*
 	 * OK to read and send the slice.
 	 */
 	msgbuf[0] = 'w';
-    /*
-     * Read the log directly into the output buffer to avoid extra memcpy
-     * calls.
-     */
-    XLogRead(msgbuf + 1 + sizeof(WalDataMessageHeader), InvalidMultimasterNodeId,
-             startptr, nbytes);
+    nbytes += 1 + sizeof(WalDataMessageHeader);
+
+    if(am_doing_logical)
+        nbytes += XLogSendLogical(msgbuf + nbytes, caughtup, sentPtr, endptr);
+    else
+        nbytes += XLogSendPhysical(msgbuf + nbytes, caughtup, sentPtr, endptr);
+
+#if 0
+    elog(LOG, "setting sentPtr to %X/%X, SendRqstPtr %X/%X, endptr %X/%X",
+         sentPtr.xlogid, sentPtr.xrecoff,
+         SendRqstPtr.xlogid, SendRqstPtr.xrecoff,
+         endptr.xlogid, endptr.xrecoff);
+#endif
 
 	/*
 	 * We fill the message header last so that the send timestamp is taken as
 	 * late as possible.
 	 */
 	msghdr.dataStart = startptr;
-	msghdr.walEnd = SendRqstPtr;
+	msghdr.walEnd = sentPtr;
 	msghdr.sendTime = GetCurrentTimestamp();
+	memcpy(msgbuf + 1, &msghdr, sizeof(WalDataMessageHeader));
 
-	pq_putmessage_noblock('d', msgbuf, 1 + sizeof(WalDataMessageHeader) + nbytes);
+	pq_putmessage_noblock('d', msgbuf,
+						  nbytes);
 
-	sentPtr = endptr;
 
 	/* Update shared memory status */
 	{
@@ -1251,8 +1432,59 @@ XLogSend(char *msgbuf, bool *caughtup)
 				 sentPtr.xlogid, sentPtr.xrecoff);
 		set_ps_display(activitymsg, false);
 	}
 
+}
+
+static Size
+XLogSendPhysical(char *msgbuf, bool *caughtup, XLogRecPtr startptr, XLogRecPtr endptr){
+    Size        nbytes;
+
+    /*
+     * Figure out how much to send in one message. If there's no more than
+     * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
+     * MAX_SEND_SIZE bytes, but round back to logfile or page boundary.
+     *
+     * The rounding is not only for performance reasons. Walreceiver relies on
+     * the fact that we never split a WAL record across two messages. Since a
+     * long WAL record is split at page boundary into continuation records,
+     * page boundary is always a safe cut-off point. We also assume that
+     * endptr never points to the middle of a WAL record.
+     */
+    startptr = sentPtr;
+    if (startptr.xrecoff >= XLogFileSize)
+    {
+        /*
+         * crossing a logid boundary, skip the non-existent last log segment
+         * in previous logical log file.
+         *
+         * FIXME: Isn't getting to that point a bug in the XLByte arithmetic?
+         */
+        startptr.xlogid += 1;
+        startptr.xrecoff = 0;
+    }
+
+    endptr = startptr;
+    XLByteAdvance(endptr, MAX_SEND_SIZE);
+    if (endptr.xlogid != startptr.xlogid)
+    {
+        /* Don't cross a logfile boundary within one message */
+        Assert(endptr.xlogid == startptr.xlogid + 1);
+        endptr.xlogid = startptr.xlogid;
+        endptr.xrecoff = XLogFileSize;
+    }
+
+
+    nbytes = endptr.xrecoff - startptr.xrecoff;
+    Assert(nbytes <= MAX_SEND_SIZE);
+
+    /*
+     * Read the log directly into the output buffer to avoid extra memcpy
+     * calls.
+     */
+    XLogRead(msgbuf, InvalidMultimasterNodeId, startptr, nbytes);
+
+    sentPtr = endptr;
-    return;
+	return nbytes;
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 46b0657..6a58f96 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3058,6 +3058,15 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+        {"multimaster_conninfo", PGC_POSTMASTER, REPLICATION_MASTER,
+            gettext_noop("Connection string to upstream master."),
+            NULL
+        },
+        &mm_conninfo,
+        0, NULL, NULL, NULL
+    },
+
+	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
 			NULL
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 12f8a3f..240c13d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -243,6 +243,7 @@
 # - Multi Master Servers -
 
+#multimaster_conninfo = 'host=myupstreammaster'
 #multimaster_node_id = 0	# invalid node id
 
 #------------------------------------------------------------------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 1e16088..78b2f5f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -403,6 +403,7 @@ typedef enum NodeTag
 	T_IdentifySystemCmd,
 	T_BaseBackupCmd,
 	T_StartReplicationCmd,
+	T_StartLogicalReplicationCmd,
 
 	/*
 	 * TAGS FOR RANDOM OTHER STUFF
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 236a36d..fee111c 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -49,4 +49,14 @@ typedef struct StartReplicationCmd
 	XLogRecPtr	startpoint;
 } StartReplicationCmd;
 
+/* ----------------------
+ *        START_LOGICAL_REPLICATION command
+ * ----------------------
+ */
+typedef struct StartLogicalReplicationCmd
+{
+    NodeTag        type;
+    XLogRecPtr    startpoint;
+} StartLogicalReplicationCmd;
+
 #endif   /* REPLNODES_H */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 8f44fad..fc9e120 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -13,6 +13,10 @@
 #include "access/xlogdefs.h"
 
+/* user settable parameters for multi-master in postmaster */
+extern char *mm_conninfo;	/* copied in walreceiver.h also */
+#define LogicalWalReceiverActive() (mm_conninfo != NULL)
+
 extern int	guc_replication_origin_id;
 extern RepNodeId current_replication_origin_id;
 extern XLogRecPtr current_replication_origin_lsn;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c9ab1be..b565190 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -22,6 +22,7 @@
 extern bool am_walreceiver;
 extern int	wal_receiver_status_interval;
 extern bool hot_standby_feedback;
+extern RepNodeId receiving_from_node_id;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -38,9 +39,9 @@ extern bool hot_standby_feedback;
  */
 typedef enum
 {
-	WALRCV_STOPPED,				/* stopped and mustn't start up again */
 	WALRCV_STARTING,			/* launched, but the process hasn't
 								 * initialized yet */
+	WALRCV_STOPPED,				/* stopped and mustn't start up again */
 	WALRCV_RUNNING,				/* walreceiver is running */
 	WALRCV_STOPPING				/* requested to stop, but still running */
 } WalRcvState;
 
@@ -55,6 +56,7 @@ typedef struct
 	 */
 	pid_t		pid;
 	WalRcvState walRcvState;
+	bool		startedDuringRecovery;
 	pg_time_t	startTime;
 
 	/*
@@ -108,9 +110,12 @@ typedef struct
 extern WalRcvData *WalRcv;
 
 /* libpqwalreceiver hooks */
-typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr startpoint);
+typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr* redo, XLogRecPtr* where_at, bool startedDuringRecovery);
 extern PGDLLIMPORT walrcv_connect_type walrcv_connect;
 
+typedef bool (*walrcv_start_type) (char *conninfo, XLogRecPtr* startpoint, bool startedDuringRecovery);
+extern PGDLLIMPORT walrcv_start_type walrcv_start;
+
 typedef bool (*walrcv_receive_type) (int timeout, unsigned char *type,
 									 char **buffer, int *len);
 extern PGDLLIMPORT walrcv_receive_type walrcv_receive;
 
-- 
1.7.10.rc3.3.g19a6c.dirty
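
For anyone who wants to poke at the protocol additions above by hand, here is a
minimal client sketch. Assumptions: a server carrying this patch, reached over
a replication connection; "host=master" is a placeholder, 0/0 is a placeholder
start LSN, and error handling is abbreviated.

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn;
    PGresult   *res;

    conn = PQconnectdb("host=master dbname=postgres replication=true");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "could not connect: %s", PQerrorMessage(conn));
        exit(1);
    }

    /* the patched IDENTIFY_SYSTEM returns 5 columns instead of 3 */
    res = PQexec(conn, "IDENTIFY_SYSTEM");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQnfields(res) == 5)
        printf("node_id %s, last_checkpoint %s\n",
               PQgetvalue(res, 0, 3), PQgetvalue(res, 0, 4));
    PQclear(res);

    /* a real client would start from the reported redo pointer */
    res = PQexec(conn, "START_LOGICAL_REPLICATION 0/0");
    if (PQresultStatus(res) != PGRES_COPY_BOTH)
        fprintf(stderr, "could not start streaming: %s",
                PQerrorMessage(conn));
    PQclear(res);

    PQfinish(conn);
    return 0;
}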



Re: [PATCH 16/16] current version of the design document

From
Merlin Moncure
Date:
On Wed, Jun 13, 2012 at 6:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> +synchronized catalog at the decoding site. That adds some complexity to use
> +cases like replicating into a different database or cross-version
> +replication. For those it is relatively straight-forward to develop a proxy pg
> +instance that only contains the catalog and does the transformation to textual
> +changes.

wow.  Anyways, could you elaborate a little on how this proxy
instance concept would work?  Let's take the case where I have N
small-ish schema identical database shards that I want to aggregate
into a single warehouse -- something that HS/SR currently can't do.
There's a lot of ways to do that obviously but assuming the warehouse
would have to have a unique schema, could it be done in your
architecture?

merlin


Re: [PATCH 16/16] current version of the design document

From
Andres Freund
Date:
Hi Merlin,

On Wednesday, June 13, 2012 04:21:12 PM Merlin Moncure wrote:
> On Wed, Jun 13, 2012 at 6:28 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > +synchronized catalog at the decoding site. That adds some complexity to
> > use +cases like replicating into a different database or cross-version
> > +replication. For those it is relatively straight-forward to develop a
> > proxy pg +instance that only contains the catalog and does the
> > transformation to textual +changes.
> wow.  Anyways, could you elaborate on a little on how this proxy
> instance concept would work?
To do the decoding into another form you need an up2date catalog + correct 
binaries. So the idea would be to have a minimal instance which is just a copy 
of the database with all the tables with an oid < FirstNormalObjectId i.e. 
only the catalog tables. Then you can apply all xlog changes on system tables 
using the existing infrastructure for HS (or use the command trigger 
equivalent we need to build for BDR) and decode everything else into the 
ApplyCache just as done in the patch. Then you would fill out the callbacks 
for the ApplyCache (see patch 14/16 and 15/16 for an example) to do whatever 
you want with the data. I.e. generate plain sql statements or run some 
transform procedure.
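
[To illustrate the callback part -- a rough sketch, with names paraphrased
rather than taken verbatim from patches 14/16 and 15/16, so treat every
identifier below as an assumption:]

/* hypothetical output buffer the consumer writes SQL text into */
static StringInfo out;

static void
my_begin_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
    appendStringInfoString(out, "BEGIN;\n");
}

static void
my_apply_change(ApplyCache *cache, ApplyCacheTXN *txn,
                ApplyCacheChange *change)
{
    /* look up the relation in the (proxy) catalog, emit textual SQL */
    emit_change_as_sql(out, change);    /* hypothetical helper */
}

static void
my_commit_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
    appendStringInfoString(out, "COMMIT;\n");
}

/* wiring it up somewhere in the consumer's startup code */
static void
setup_apply_cache(ApplyCache *cache)
{
    cache->begin = my_begin_txn;
    cache->apply_change = my_apply_change;
    cache->commit = my_commit_txn;
}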

> Let's take the case where I have N small-ish schema identical database
> shards that I want to aggregate into a single warehouse -- something that
> HS/SR currently can't do.
> There's a lot of ways to do that obviously but assuming the warehouse
> would have to have a unique schema, could it be done in your
> architecture?
Not sure what you mean by the warehouse having a unique schema? It has the 
same schema as the OLTP counterparts? That would obviously be the easy case if 
you take care and guarantee uniqueness of keys upfront. That basically would 
be trivial ;)
It gets a bit more complex if you need to transform the data for the 
warehouse. I don't plan to put in work to make that possible without some C 
coding (filling out the callbacks and doing the work in there). It shouldn't 
need much though.

Does that answer your question?

Andres

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Andres Freund <andres@2ndquadrant.com> wrote:
> This adds a new wal_level value 'logical'
> 
> Missing cases:
> - heap_multi_insert
> - primary key changes for updates
> - no primary key
> - LOG_NEWPAGE
First, Wow!
I look forward to the point where we can replace our trigger-based
replication with this!  Your "missing cases" for primary key issues
would not cause us any pain for our current system, since we require
a primary key and don't support updates to PKs for replicated
tables. While I don't expect that the first cut of this will be able
to replace our replication-related functionality, I'm interested in
making sure it can be extended in that direction, so I have a couple
things to consider:
(1)  For our usage, with dozens of source databases feeding into
multiple aggregate databases and interfaces, DDL replication is not
of much if any interest.  It should be easy enough to ignore as long
as it is low volume, so that doesn't worry me too much; but if I'm
missing something any you run across any logical WAL logging for DDL
which does generate a lot of WAL traffic, it would be nice to have a
way to turn that off at generation time rather than filtering it or
ignoring it later.  (Probably won't be an issue, just a head-up.)
(2)  To match the functionality we now have, we would need the
logical stream to include the *before* image of the whole tuple for
each row updated or deleted.  I understand that this is not needed
for the use cases you are initially targeting; I just hope the
design leaves this option open without needing to disturb other use
cases.  Perhaps this would require yet another wal_level value. 
Perhaps rather than testing the current value directly for
determining whether to log something, the GUC processing could set
some booleans for faster testing and less code churn when the
initial implementation is expanded to support other use cases (like
ours).
(3)  Similar to point 2, it would be extremely desirable to be able
to determine table name and columns names for the tuples in a stream
from that stream, without needing to query a hot standby or similar
digging into other sources of information.  Not only will the
various source databases all have different OID values for the same
objects, and the aggregate targets have different values from each
other and the sources, but some targets don't have the tables at
all.  I'm talking about our database transaction repository and the
interfaces to business partners which we currently drive off of the
same transaction stream which drives replication.
Would it be helpful or just a distraction if I were to provide a
more detailed description of our whole replication / transaction
store / interface area?
If it would be useful, I could also describe some other replication
patterns I have seen over the years.  In particular, one which might
be interesting is where subsets of the data are distributed to
multiple standalone machines which have intermittent or unreliable
connections to a central site, which periodically collects data from
all the remote sites, recalculates distribution, and sends
transactions back out to those remote sites to add, remove, and
update rows based on the distribution rules and the new data.
-Kevin


Re: [PATCH 16/16] current version of the design document

From
Merlin Moncure
Date:
On Wed, Jun 13, 2012 at 9:40 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi Merlin,
>
> On Wednesday, June 13, 2012 04:21:12 PM Merlin Moncure wrote:
>> On Wed, Jun 13, 2012 at 6:28 AM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>> > +synchronized catalog at the decoding site. That adds some complexity to
>> > use +cases like replicating into a different database or cross-version
>> > +replication. For those it is relatively straight-forward to develop a
>> > proxy pg +instance that only contains the catalog and does the
>> > transformation to textual +changes.
>> wow.  Anyways, could you elaborate on a little on how this proxy
>> instance concept would work?
> To do the decoding into another form you need an up2date catalog + correct
> binaries. So the idea would be to have a minimal instance which is just a copy
> of the database with all the tables with an oid < FirstNormalObjectId i.e.
> only the catalog tables. Then you can apply all xlog changes on system tables
> using the existing infrastructure for HS (or use the command trigger
> equivalent we need to build for BDR) and decode everything else into the
> ApplyCache just as done in the patch. Then you would fill out the callbacks
> for the ApplyCache (see patch 14/16 and 15/16 for an example) to do whatever
> you want with the data. I.e. generate plain sql statements or run some
> transform procedure.
>
>> Let's take the case where I have N small-ish schema identical database
>> shards that I want to aggregate into a single warehouse -- something that
>> HS/SR currently can't do.
>> There's a lot of ways to do that obviously but assuming the warehouse
>> would have to have a unique schema, could it be done in your
>> architecture?
> Not sure what you mean by the warehouse having a unique schema? It has the
> same schema as the OLTP counterparts? That would obviously be the easy case if
> you take care and guarantee uniqueness of keys upfront. That basically would
> be trivial ;)

by unique I meant 'not the same as the shards' -- presumably this
would mean one of
a) each shard's data would be in a private schema folder
or
b) you'd have one set of tables but decorated with an extra shard
identifying column that would have to be present in all keys to get around
uniqueness issues

> It gets a bit more complex if you need to transform the data for the
> warehouse. I don't plan to put in work to make that possible without some C
> coding (filling out the callbacks and doing the work in there). It shouldn't
> need much though.
>
> Does that answer your question?

yes.  Do you envision it would be possible to wrap the ApplyCache
callbacks in a library that could be exposed as an extension?  For
example, a library that would stick the replication data into a queue
that a userland (non C) process could walk, transform, etc?   I know
that's vague -- my general thrust here is that I find the
transformation features particularly interesting and I'm wondering how
much C coding would be needed to access them in the long term.

merlin


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Robert Haas
Date:
On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Unless somebody objects I will add most of the individual marked as RFC to the
> current commitfest. I hope that with comments stemming from that round we can
> get several of the patches into the first or second commitfest. As soon as the
> design is clear/accepted we will try very hard to get the following patches
> into the second or third round.

I made a "logical replication" topic within the CommitFest for this
patch series.  I think you should add them all there.  I have some
substantive thoughts about the design as well, which I will write up
in a separate email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wednesday, June 13, 2012 05:27:06 PM Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > This adds a new wal_level value 'logical'
> > 
> > Missing cases:
> > - heap_multi_insert
> > - primary key changes for updates
> > - no primary key
> > - LOG_NEWPAGE
> 
> First, Wow!
Thanks ;) I hope you will still be convinced after reading some of the code :P

> I look forward to the point where we can replace our trigger-based
> replication with this!  Your "missing cases" for primary key issues
> would not cause us any pain for our current system, since we require
> a primary key and don't support updates to PKs for replicated
> tables. While I don't expect that the first cut of this will be able
> to replace our replication-related functionality, I'm interested in
> making sure it can be extended in that direction, so I have a couple
> things to consider:
Ok.

> (1)  For our usage, with dozens of source databases feeding into
> multiple aggregate databases and interfaces, DDL replication is not
> of much if any interest.  It should be easy enough to ignore as long
> as it is low volume, so that doesn't worry me too much; but if I'm
> missing something any you run across any logical WAL logging for DDL
> which does generate a lot of WAL traffic, it would be nice to have a
> way to turn that off at generation time rather than filtering it or
> ignoring it later.  (Probably won't be an issue, just a head-up.)
I don't really see a problem there. I don't yet have a mental image of the 
API/Parameters to START_LOGICAL_REPLICATION to specify filters on the source 
side, but this should be possible.

> (2)  To match the functionality we now have, we would need the
> logical stream to include the *before* image of the whole tuple for
> each row updated or deleted.  I understand that this is not needed
> for the use cases you are initially targeting; I just hope the
> design leaves this option open without needing to disturb other use
> cases.  Perhaps this would require yet another wal_level value.
> Perhaps rather than testing the current value directly for
> determining whether to log something, the GUC processing could set
> some booleans for faster testing and less code churn when the
> initial implementation is expanded to support other use cases (like
> ours).
Hm. I don't see a big problem implementing this although I have to say that I 
am a bit hesitant to do this without in-core users of it for fear of silent 
breakage. WAL is kind of a central thing... ;). But then, the implementation 
should be relatively easy.
I don't see a need to break the wal level down into some booleans: changing 
the test from wal_level >= WAL_LEVEL_LOGICAL into something else shouldn't 
result in any measurable difference in that codepath.
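For illustration, the kind of fragment I have in mind in heapam.c (purely
illustrative; the helper name is invented, only the wal_level test itself is
real):

/* e.g. in heap_update(); illustrative fragment only */
if (wal_level >= WAL_LEVEL_LOGICAL)
{
    /*
     * Log the extra information logical decoding needs on top of what
     * physical replay requires, e.g. the old primary-key columns of the
     * updated tuple.
     */
    log_logical_old_key(relation, &oldtup);     /* invented helper */
}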

I definitely have the use-case of replicating into databases where you need to 
do some transformation in mind.

> (3)  Similar to point 2, it would be extremely desirable to be able
> to determine table name and columns names for the tuples in a stream
> from that stream, without needing to query a hot standby or similar
> digging into other sources of information.  Not only will the
> various source databases all have different OID values for the same
> objects, and the aggregate targets have different values from each
> other and the sources, but some targets don't have the tables at
> all.  I'm talking about our database transaction repository and the
> interfaces to business partners which we currently drive off of the
> same transaction stream which drives replication.
I don't foresee this as a realistic thing. I think the required changes would 
be way too intrusive for too little gain. The performance and space 
requirements would probably also be rather noticeable. I think you will have 
to live with the mentioned 'proxy' pg instances which only contain the catalog 
(which normally isn't very big anyway) doing the decoding into your target 
database (which then doesn't need the same oids).
For how I imagine those proxy instances, check my mail to merlin earlier today; 
I described it in some more detail there.
If I can find the time I should possibly develop (another) prototype of such a 
proxy instance. I don't foresee it needing much more infrastructure/code.

Is that a problem for your use-case? If yes, why?

> Would it be helpful or just a distraction if I were to provide a
> more detailed description of our whole replication / transaction
> store / interface area?
If you have it ready, yes. Otherwise I think I have a good enough image 
already. I tried to listen to you at pgcon and I have developed similar things 
before.
To be honest, if not, I would prefer you spend the time checking some of the 
code ;)

> If it would be useful, I could also describe some other replication
> patterns I have seen over the years.  In particular, one which might
> be interesting is where subsets of the data are distributed to
> multiple standalone machines which have intermittent or unreliable
> connections to a central site, which periodically collects data from
> all the remote sites, recalculates distribution, and sends
> transactions back out to those remote sites to add, remove, and
> update rows based on the distribution rules and the new data.
I don't think it's that relevant for now. I think that time is spent more 
wisely once we get to conflict resolution and the user interface...

Thanks!

Andres

-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 16/16] current version of the design document

From
Andres Freund
Date:
Hi,

On Wednesday, June 13, 2012 05:39:36 PM Merlin Moncure wrote:
> On Wed, Jun 13, 2012 at 9:40 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Let's take the case where I have N small-ish schema identical database
> >> shards that I want to aggregate into a single warehouse -- something
> >> that HS/SR currently can't do.
> >> There's a lot of ways to do that obviously but assuming the warehouse
> >> would have to have a unique schema, could it be done in your
> >> architecture?
> > 
> > Not sure what you mean by the warehouse having a unique schema? It has
> > the same schema as the OLTP counterparts? That would obviously be the
> > easy case if you take care and guarantee uniqueness of keys upfront.
> > That basically would be trivial ;)
> 
> by unique I meant 'not the same as the shards' -- presumably this
> would mean one of
> a) each shard's data would be in a private schema folder
> or
> b) you'd have one set of tables but decorated with an extra shard
> identifying column that would have to be present in all keys to get around
> uniqueness issues
I think it would have to mean a) and that you have N of those logical import 
processes hanging around. We really need an identical TupleDesc to do the 
decoding.

> > It gets a bit more complex if you need to transform the data for the
> > warehouse. I don't plan to put in work to make that possible without some
> > C coding (filling out the callbacks and doing the work in there). It
> > shouldn't need much though.
> > 
> > Does that answer your question?
> yes.  Do you envision it would be possible to wrap the ApplyCache
> callbacks in a library that could be exposed as an extension?  For
> example, a library that would stick the replication data into a queue
> that a userland (non C) process could walk, transform, etc?   I know
> that's vague -- my general thrust here is that I find the
> transformation features particularly interesting and I'm wondering how
> much C coding would be needed to access them in the long term.
I can definitely imagine the callbacks calling some wrapper around a higher-
level language. Not sure how that fits into an extension (if you mean it as in 
CREATE EXTENSION) though. I don't think you will be able to start the 
replication process from inside a normal backend. I imagine something like 
specifying a shared object + parameters in the config or such.
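
To sketch what I mean (all type and member names here are invented; the real 
definitions live in the ApplyCache patch), such a shared object would 
essentially just fill in the three callbacks:

typedef struct ApplyCache ApplyCache;                /* opaque */
typedef struct ApplyCacheTXN ApplyCacheTXN;          /* one reassembled xact */
typedef struct ApplyCacheChange ApplyCacheChange;    /* one INSERT/UPDATE/DELETE */

typedef struct ApplyCacheCallbacks
{
    void (*begin)(ApplyCache *cache, ApplyCacheTXN *txn);
    void (*apply_change)(ApplyCache *cache, ApplyCacheTXN *txn,
                         ApplyCacheChange *change);
    void (*commit)(ApplyCache *cache, ApplyCacheTXN *txn);
} ApplyCacheCallbacks;

A higher-level-language wrapper would then just translate each callback 
invocation into whatever the queueing mechanism wants.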

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
On Wednesday, June 13, 2012 05:53:31 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Unless somebody objects I will add most of the individual patches marked as RFC
> > to the current commitfest. I hope that with comments stemming from that
> > round we can get several of the patches into the first or second
> > commitfest. As soon as the design is clear/accepted we will try very
> > hard to get the following patches into the second or third round.
> 
> I made a "logical replication" topic within the CommitFest for this
> patch series.  I think you should add them all there.  
Thanks. Added all but the bgworker patch (which is not ready) and the 
WalSndWakeup one (different category), which is not really relevant to the 
topic.

I have the feeling I am due quite some reviewing this round...

> I have some substantive thoughts about the design as well, which I will
> write up in a separate email.
Thanks. Looking forward to it. At least now, before I have read it.

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Steve Singer
Date:
On 12-06-13 07:27 AM, Andres Freund wrote:
> Its also available in the 'cabal-rebasing' branch on
> git.postgresql.org/users/andresfreund/postgres.git . That branch will modify
> history though.
>

That branch has a merge error in f685a11ce43b9694cbe61ffa42e396c9fbc32b05


gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include 
-D_GNU_SOURCE -c -o xact.o xact.c
xact.c:4684: error: expected identifier or ‘(’ before ‘<<’ token
xact.c:4690:46: warning: character constant too long for its type
xact.c:4712:46: warning: character constant too long for its type
xact.c:4719: error: expected identifier or ‘(’ before ‘<<’ token
xact.c:4740:46: warning: character constant too long for its type
make[4]: *** [xact.o] Error 1




Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
On Wednesday, June 13, 2012 07:11:40 PM Steve Singer wrote:
> On 12-06-13 07:27 AM, Andres Freund wrote:
> > Its also available in the 'cabal-rebasing' branch on
> > git.postgresql.org/users/andresfreund/postgres.git . That branch will
> > modify history though.
>
> That branch has a merge error in f685a11ce43b9694cbe61ffa42e396c9fbc32b05
>
>
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
> -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include
> -D_GNU_SOURCE -c -o xact.o xact.c
> xact.c:4684: error: expected identifier or ‘(’ before ‘<<’ token
> xact.c:4690:46: warning: character constant too long for its type
> xact.c:4712:46: warning: character constant too long for its type
> xact.c:4719: error: expected identifier or ‘(’ before ‘<<’ token
> xact.c:4740:46: warning: character constant too long for its type
> make[4]: *** [xact.o] Error 1
Aw crap. Will fix that. Sorry.

Andres


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
On Wednesday, June 13, 2012 07:11:40 PM Steve Singer wrote:
> On 12-06-13 07:27 AM, Andres Freund wrote:
> > Its also available in the 'cabal-rebasing' branch on
> > git.postgresql.org/users/andresfreund/postgres.git . That branch will
> > modify history though.
>
> That branch has a merge error in f685a11ce43b9694cbe61ffa42e396c9fbc32b05
>
>
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
> -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include
> -D_GNU_SOURCE -c -o xact.o xact.c
> xact.c:4684: error: expected identifier or ‘(’ before ‘<<’ token
> xact.c:4690:46: warning: character constant too long for its type
> xact.c:4712:46: warning: character constant too long for its type
> xact.c:4719: error: expected identifier or ‘(’ before ‘<<’ token
> xact.c:4740:46: warning: character constant too long for its type
> make[4]: *** [xact.o] Error 1
Hrmpf. I reordered the patch series one last time to be clearer and I somehow
didn't notice this. I compiled & tested the now pushed head
(7e0340a3bef927f79b3d97a11f94ede4b911560c) and will submit an updated patch
[10/16].

Andres


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
The previous mail contained a patch with a mismerge caused by reordering 
commits. Corrected version attached.

Thanks to Steve Singer for noticing this quickly.

Andres

-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
Hi,

The patches as of yet don't describe how you actually use the prototype... 
Obviously at this point it's not very convenient:

I have two config files:
Node 1:
port = 5501
wal_level = logical
max_wal_senders = 10
wal_keep_segments = 200
multimaster_conninfo = 'port=5502 host=/tmp'
multimaster_node_id = 1

Node 2:
port = 5502
wal_level = logical
max_wal_senders = 10
wal_keep_segments = 200
multimaster_conninfo = 'port=5501 host=/tmp'
multimaster_node_id = 2

after initdb'ing the first cluster (initdb required):
$ ~/src/postgresql/build/assert/src/backend/postgres -D 
~/tmp/postgres/bdr/1/datadir/ -c 
config_file=~/tmp/postgres/bdr/1/postgresql.conf -c 
hba_file=~/tmp/postgres/bdr/1/pg_hba.conf -c 
ident_file=~/tmp/postgres/bdr/1/pg_ident.conf

$ psql -p 5501 -U andres postgres
CREATE TABLE data(id serial primary key, data bigint);
ALTER SEQUENCE data_id_seq INCREMENT 2;
SELECT setval('data_id_seq', 1);

shut down the cluster

$ rsync -raxv --delete /home/andres/tmp/postgres/bdr/1/datadir/* 
/home/andres/tmp/postgres/bdr/2/datadir

start both clusters, which should sync after some output.

$ psql -p 5501 -U andres postgres
SELECT setval('data_id_seq', 2);


On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> From: Andres Freund <andres@anarazel.de>
>
> We decided to use low level functions to do the apply instead of producing sql
> statements containing the data (or using prepared statements) because both the
> text conversion and the full executor introduce significant overhead that is
> unnecessary if you're using the same version of pg on the same
> architecture.
>
> There are loads of use cases that require different methods of applying
> though - so the part doing the applying from an ApplyCache is just a bunch of
> well abstracted callbacks getting passed all the required knowledge to change
> the data representation into other formats.

It's important to make sure that it's not going to be *too* difficult
to "jump through the hoops" necessary to apply changes on a different
version.

While pg_upgrade has diminished the need to use replication to handle
cross-version/architecture upgrades, I don't think it has brought that
to zero.

One other complication I'll observe...  The assumption is being made
that UPDATE/DELETE will be handled via The Primary Key.  For the most
part, I approve of this.  Once upon a time, Slony had a misfeature
where you could tell it to add in a surrogate primary key, and that
caused no end of trouble.  However, the alternative, that *does* seem
to work alright, is to allow selecting a candidate primary key, that
is, a set of columns that have UNIQUE + NOT NULL characteristics.  I
could see people having a problem switching over to use this system if
they MUST begin with a 'Right Proper Primary Key.'  If they start with
a system with a 2TB table full of data that lacks that particular
constraint, that could render them unable to use the facility.

> Missing:
>
> - TOAST handling. For physical apply not much needs to be done because the
>  toast inserts will have been made beforehand. There needs to be an option in
>  ApplyCache that helps reassembling TOAST datums to make it easier to write
>  apply modules which convert to text.

Dumb question: Is it possible that two nodes would have a different
idea as to which columns need to get toasted?  I should think it
possible for nodes to be configured with different values for TOAST
policies, and while it's likely pretty dumb to set them to have
different handling, it would seem undesirable to not bother looking, 
and find the backend crashing due to an unnoticed mismatch.

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


On Wednesday, June 13, 2012 08:50:42 PM Christopher Browne wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > From: Andres Freund <andres@anarazel.de>
> > 
> > We decided to use low level functions to do the apply instead of
> > producing sql statements containing the data (or using prepared
> > statements) because both the text conversion and the full executor
> > introduce significant overhead that is unnecessary if you're using the
> > same version of pg on the same architecture.
> > 
> > There are loads of use cases that require different methods of
> > applying though - so the part doing the applying from an ApplyCache is
> > just a bunch of well abstracted callbacks getting passed all the
> > required knowledge to change the data representation into other formats.
> 
> It's important to make sure that it's not going to be *too* difficult
> to "jump through the hoops" necessary to apply changes on a different
> version.
I agree. But I don't see it as a feature of the first version for the moment. 
Getting a base set of features into 9.3 is going to be hard enough as-is. But 
I think there is enough interest from all sides to make cross-version support 
possible later.

> While pg_upgrade has diminished the need to use replication to handle
> cross-version/architecture upgrades, I don't think it has brought that
> to zero.
Agreed.

> One other complication I'll observe...  The assumption is being made
> that UPDATE/DELETE will be handled via The Primary Key.  For the most
> part, I approve of this.  Once upon a time, Slony had a misfeature
> where you could tell it to add in a surrogate primary key, and that
> caused no end of trouble.  However, the alternative, that *does* seem
> to work alright, is to allow selecting a candidate primary key, that
> is, a set of columns that have UNIQUE + NOT NULL characteristics.  I
> could see people having a problem switching over to use this system if
> they MUST begin with a 'Right Proper Primary Key.'  If they start with
> a system with a 2TB table full of data that lacks that particular
> constraint, that could render them unable to use the facility.
It wouldn't need that much code to allow candidate keys. The data 
representation in the catalogs is a bit unfriendly for that, but there has 
been talk about changing that for some time now. I am not convinced that it's 
worth the cpu cycles though.

Btw, since 9.1 you can convert a unique index into a primary key; the unique 
index can previously be created CONCURRENTLY.

> > Missing:
> > 
> > - TOAST handling. For physical apply not much needs to be done because
> > the toast inserts will have been made beforehand. There needs to be an
> > option in ApplyCache that helps reassembling TOAST datums to make it
> > easier to write apply modules which convert to text.
> Dumb question: Is it possible that two nodes would have a different
> idea as to which columns need to get toasted?  I should think it
> possible for nodes to be configured with different values for TOAST
> policies, and while it's likely pretty dumb to set them to have
> different handling, it would seem undesirable to not bother looking,
> and find the backend crashing due to an unnoticed mismatch.
I don't think it should be possible to configure the TOAST policies 
differently if you use the "binary apply" mode. But even if it were, a value 
which is toasted although the local policy says it should not be wouldn't 
cause any problems as far as I can see.
The one thing that could cause problems here is different page sizes et 
al, but that needs to be prohibited anyway.

Andres

-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> This is locally defined in lots of places and would get introduced frequently
> in the next commits. It is expected that this can be defined in a header-only
> manner as soon as the XLogInsert scalability groundwork from Heikki gets in.

This appears to be redundant with the existing InvalidXLogRecPtr,
defined in access/transam.h.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thursday, June 14, 2012 03:50:28 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > This is locally defined in lots of places and would get introduced
> > frequently in the next commits. It is expected that this can be defined
> > in a header-only manner as soon as the XLogInsert scalability groundwork
> > from Heikki gets in.
> 
> This appears to be redundant with the existing InvalidXLogRecPtr,
> defined in access/transam.h.
Oh. I didn't find that one. Judging from all the code defining local variants 
of it I am not alone in that... Why is it in transam.h and not xlogdefs.h?

Obviously that patch is void then. Doesn't warrant rebasing the other patches 
yet though...

Thanks!

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


On Thu, Jun 14, 2012 at 9:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Thursday, June 14, 2012 03:50:28 PM Robert Haas wrote:
>> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > This is locally defined in lots of places and would get introduced
>> > frequently in the next commits. It is expected that this can be defined
>> > in a header-only manner as soon as the XLogInsert scalability groundwork
>> > from Heikki gets in.
>>
>> This appears to be redundant with the existing InvalidXLogRecPtr,
>> defined in access/transam.h.
> Oh. I didn't find that one. Judging from all the code defining local variants
> of it I am not alone in that... Why is it in transam.h and not xlogdefs.h?

Uh, not sure.  We used to have a variable by that name defined in a
bunch of places, and I cleaned it up some in commit
503c7305a1e379f95649eef1a694d0c1dbdc674a.  But if there are still more
redundant definitions floating around, it would be nice to clean those
up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thursday, June 14, 2012 04:04:22 PM Robert Haas wrote:
> On Thu, Jun 14, 2012 at 9:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On Thursday, June 14, 2012 03:50:28 PM Robert Haas wrote:
> >> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > This is locally defined in lots of places and would get introduced
> >> > frequently in the next commits. It is expected that this can be
> >> > defined in a header-only manner as soon as the XLogInsert scalability
> >> > groundwork from Heikki gets in.
> >> 
> >> This appears to be redundant with the existing InvalidXLogRecPtr,
> >> defined in access/transam.h.
> > 
> > Oh. I didn't find that one. Judging from all the code defining local
> > variants of it I am not alone in that... Why is it in transam.h and not
> > xlogdefs.h?
> 
> Uh, not sure.  We used to have a variable by that name defined in a
> bunch of places, and I cleaned it up some in commit
> 503c7305a1e379f95649eef1a694d0c1dbdc674a.  But if there are still more
> redundant definitions floating around, it would be nice to clean those
> up.
Forget it, they are in things that don't link to the backend... /me looks 
forward to the 64bit conversion of XLogRecPtr's.

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Robert Haas
Date:
On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> === Design goals for logical replication === :
> - in core
> - fast
> - async
> - robust
> - multi-master
> - modular
> - as unintrusive as possible implementation wise
> - basis for other technologies (sharding, replication into other DBMSs, ...)

I agree with all of these goals except for "multi-master".  I am not
sure that there is a need to have a multi-master replication solution
in core.  The big tricky part of multi-master replication is conflict
resolution, and that seems like an awful lot of logic to try to build
into core - especially given that we will want it to be extensible.

More generally, I would much rather see us focus on efficiently
extracting changesets from WAL and efficiently applying those
changesets on another server.  IMHO, those are the things that are
holding back the not-in-core replication solutions we have today,
particularly the first.  If we come up with a better way of applying
change-sets, well, Slony can implement that too; they are already
installing C code.  What neither they nor any other non-core solution
can implement is change-set extraction, and therefore I think that
ought to be the focus.

To put all that another way, I think it is a 100% bad idea to try to
kill off Slony, Bucardo, Londiste, or any of the many home-grown
solutions that are out there to do replication.  Even if there were no
technical place for third-party replication products (and I think
there is), we will not win many friends by making it harder to extend
and add on to the server.  If we build an in-core replication solution
that can be used all by itself, that is fine with me.  But I think it
should also be able to expose its constituent parts as a toolkit for
third-party solutions.

> While you may argue that most of the above design goals are already provided by
> various trigger based replication solutions like Londiste or Slony, we think
> that thats not enough for various reasons:
>
> - not in core (and thus less trustworthy)
> - duplication of writes due to an additional log
> - performance in general (check the end of the above presentation)
> - complex to use because there is no native administration interface

I think that your parenthetical note "(and thus less trustworthy)"
gets at another very important point, which is that one of the
standards for inclusion in core is that it must in fact be trustworthy
enough to justify the confidence that users will place in it.  It will
NOT benefit the project to have two replication solutions in core, one
of which is crappy.  More, even if what we put in core is AS GOOD as
the best third-party solutions that are available, I don't think
that's adequate.  It has to be better.  If it isn't, there is no
excuse for preempting what's already out there.

I imagine you are thinking along similar lines, but I think that it
bears being explicit about.

> As we need a change stream that contains all required changes in the correct
> order, the requirement for this stream to reflect changes across multiple
> concurrent backends raises concurrency and scalability issues. Reusing the
> WAL stream for this seems a good choice since it is needed anyway and addresses
> those issues already, and it further means that we don't incur duplicate
> writes. Any other stream generating component would introduce additional
> scalability issues.

Agreed.

> We need a change stream that contains all required changes in the correct order
> which thus needs to be synchronized across concurrent backends which introduces
> obvious concurrency/scalability issues.
> Reusing the WAL stream for this seems a good choice since it is needed anyway
> and addresses those issues already, and it further means we don't duplicate the
> writes and locks already performed for its maintenance.

Agreed.

> Unfortunately, in this case, the WAL is mostly a physical representation of the
> changes and thus does not, by itself, contain the necessary information in a
> convenient format to create logical changesets.

Agreed.

> The biggest problem is, that interpreting tuples in the WAL stream requires an
> up-to-date system catalog and needs to be done in a compatible backend and
> architecture. The requirement of an up-to-date catalog could be solved by
> adding more data to the WAL stream but it seems to be likely that that would
> require relatively intrusive & complex changes. Instead we chose to require a
> synchronized catalog at the decoding site. That adds some complexity to use
> cases like replicating into a different database or cross-version
> replication. For those it is relatively straight-forward to develop a proxy pg
> instance that only contains the catalog and does the transformation to textual
> changes.

The actual requirement here is more complex than "an up-to-date
catalog".  Suppose transaction X begins, adds a column to a table,
inserts a row, and commits.  That tuple needs to be interpreted using
the tuple descriptor that transaction X would see (which includes the
new column), NOT the tuple descriptor that some other transaction
would see at the same time (which won't include the new column).  In a
more complicated scenario, X might (1) begin, (2) start a
subtransaction that alters the table, (3) release the savepoint or
roll back to the save point, (4) insert a tuple, and (5) commit.  Now,
the correct tuple descriptor for interpreting the tuple inserted in
step (4) depends on whether step (3) was a release savepoint or a
rollback-to-savepoint.  How are you handling these (and similar but
more complex) cases?

Moreover, we will want in the future to allow some of the DDL changes
that currently require AccessExclusiveLock to be performed with a
lesser lock.  It is unclear to me that this will be practical as far
as adding columns goes, but it would be a shame if logical replication
were the thing standing in the way.  In that scenario, you might have:
transaction X begins a transaction, and adds a column; transaction Y
inserts a tuple which must be interpreted using the old tuple
descriptor (or maybe it's harmless to use the new one, since the extra
column will be interpreted as NULL anyway); transaction X inserts a
tuple (which MUST be interpreted using the new tuple descriptor); and
then X either commits or rolls back.  This might be too hypothetical
to worry about in detail, but it would at least be nice to have the
sense that we're not totally screwed if the locking rules for DDL
change someday.

> This also is the solution to the other big problem, the need to work around
> architecture/version specific binary formats. The alternative, producing
> cross-version, cross-architecture compatible binary changes or even more so
> textual changes all the time seems to be prohibitively expensive. Both from a
> cpu and a storage POV and also from the point of implementation effort.

I think that if you can't produce a textual record of changes, you're
throwing away 75% of what people will want to do with this.  Being
able to replicate across architectures, versions, and even into
heterogeneous databases is the main point of having logical
replication, IMV.   Multi-master replication is nice to have, but IME
there is huge demand for a replication solution that doesn't need to
be temporarily replaced with something completely different every time
you want to do a database upgrade.

> The catalog on the site where changes originate can *not* be used for the
> decoding because at the time we decode the WAL the catalog may have changed
> from the state it was in when the WAL was generated. A possible solution for
> this would be to have a fully versioned catalog but that again seems to be
> rather complex and intrusive.

Yes, that seems like a non-starter.

> For some operations (UPDATE, DELETE) and corner-cases (e.g. full page writes)
> additional data needs to be logged, but the additional amount of data isn't
> that big. Requiring a primary-key for any change but INSERT seems to be a
> sensible thing for now. The required changes are fully contained in heapam.c
> and are pretty simple so far.

I think that you can handle the case where there are no primary keys
by simply treating the whole record as a primary key.  There might be
duplicates, but I think you can work around that by decreeing that
when you replay an UPDATE or DELETE operation you will update or
delete at most one record.  So if there are multiple exactly-identical
records in the target database, then you will just UPDATE or DELETE
exactly one of them (it doesn't matter which).  This doesn't even seem
particularly complicated.
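
To illustrate (everything below is hypothetical, nothing from your patches; 
tuples_are_identical() would compare all user columns via the TupleDesc):

#include "postgres.h"
#include "access/heapam.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/* apply a DELETE for a table without a primary key: treat the whole
 * old tuple as the key and delete at most one matching row */
static void
apply_delete_without_pkey(Relation rel, HeapTuple old_tuple)
{
    HeapScanDesc scan = heap_beginscan(rel, GetActiveSnapshot(), 0, NULL);
    HeapTuple    tup;

    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        if (tuples_are_identical(RelationGetDescr(rel), tup, old_tuple))
        {
            simple_heap_delete(rel, &tup->t_self);
            break;              /* at most one of the duplicates */
        }
    }
    heap_endscan(scan);
}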

> For transport of the non-decoded data from the originating site to the decoding
> site we decided to reuse the infrastructure already provided by
> walsender/walreceiver.

I think this is reasonable.  Actually, I think we might want to
generalize this to a bunch of other stuff, too.  There could be many
reasons to want a distributed cluster of PostgreSQL servers with some
amount of interconnect.  We might want to think about renaming the
processes at some point, but that's probably putting the cart before
the horse just right now.

> We introduced a new command that, analogous to
> START_REPLICATION, is called START_LOGICAL_REPLICATION that will stream out all
> xlog records that pass through a filter.
>
> The on-the-wire format stays the same. The filter currently simply filters out
> all records which are not interesting for logical replication (indexes,
> freezing, ...) and records that did not originate on the same system.
>
> The requirement of filtering by 'origin' of a wal node comes from the planned
> multimaster support. Changes replayed locally that originate from another site
> should not be replayed again there. If the wal is plainly used without such a
> filter that would cause loops. Instead we tag every wal record with the "node
> id" of the site that caused the change to happen and changes with a nodes own
> "node id" won't get applied again.
>
> Currently filtered records get simply replaced by NOOP records and loads of
> zeroes which obviously is not a sensible solution. The difficulty of actually
> removing the records is that that would change the LSNs. We currently rely on
> those though.
>
> The filtering might very well get expanded to support partial replication and
> such in future.

This all seems to need a lot more thought.

> To sensibly apply changes out of the WAL stream we need to solve two things:
> Reassemble transactions and apply them to the target database.
>
> The logical stream from 1. via 2. consists of individual changes identified
> by the relfilenode of the table and the xid of the transaction. Given
> (sub)transactions, rollbacks, crash recovery, subtransactions and the like
> those changes obviously cannot be individually applied without fully losing
> the pretence of consistency. To solve that we introduced a module, dubbed
> ApplyCache which does the reassembling. This module is *independent* of the
> data source and of the method of applying changes so it can be reused for
> replicating into a foreign system or similar.
>
> Due to the overhead of planner/executor/toast reassembly/type conversion (yes,
> we benchmarked!) we decided against statement generation for apply. Even when
> using prepared statements the overhead is rather noticeable.
>
> Instead we decided to use relatively lowlevel heapam.h/genam.h accesses to do
> the apply. For now we decided to use only one process to do the applying,
> parallelizing that seems to be too complex for an introduction of an already
> complex feature.
> In our tests the apply process could keep up with pgbench -c/j 20+ generating
> changes. This will obviously heavily depend on the workload. A fully seek bound
> workload will definitely not scale that well.
>
> Just to reiterate: Plugging in another method to do the apply should be a
> relatively simple matter of setting up three callbacks to a different function
> (begin, apply_change, commit).

I think it's reasonable to do the apply using these low-level
mechanisms, but surely people will sometimes want to extract tuples as
text and do whatever with them.  This goes back to my comment about
feeling that we need a toolkit approach, not a one-size-fits-all
solution.

> Another complexity in this is how to synchronize the catalogs. We plan to use
> command/event triggers and the oid preserving features from pg_upgrade to keep
> the catalogs in-sync. We did not start working on that.

This strikes me as completely unacceptable.  People ARE going to want
to replicate data between non-identical schemas on systems with
unsynchronized OIDs.  And even if they weren't, relying on triggers to
keep things in sync is exactly the sort of kludge that has inspired
all sorts of frustration with our existing replication solutions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> This patch is problematic because formally indexes used by syscaches needs to
> be unique, this one is not though because of 0/InvalidOids entries for
> nailed/shared catalog entries. Those values aren't allowed to be queried though.

That's not the only reason it's not unique.  Take a look at
GetRelFileNode().  We really only guarantee that <database OID,
tablespace OID, relfilenode, backend-ID>, taken as a four-tuple, is
unique.  You could have the same relfilenode in different tablespaces,
or even within the same tablespace with different backend-IDs.  The
latter might not matter for you because you're presumably disregarding
temp tables, but the former probably does.  It's an uncommon scenario
because we normally set relid = relfilenode, and of course relid is
unique across the database; but if the table gets rewritten then you end
up with relid != relfilenode, and I don't think there's anything at
that point that will prevent the new relfilenode from being chosen as
some other relation's relfilenode, as long as it's in a different
tablespace.

I think the solution may be to create a specialized cache for this,
rather than relying on the general syscache infrastructure.  You might
look at, e.g., attoptcache.c for an example.  That would allow you to
build a cache that is aware of things like the relmapper
infrastructure, and the fact that temp tables are ignorable for your
purposes.  But I think you will need to include at least the
tablespace OID in the key along with the relfilenode to make it
bullet-proof.
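
Something along these lines, modeled loosely on attoptcache.c (all names here 
are invented, and I've left out the invalidation-callback wiring a real 
version would need):

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct RelfilenodeCacheKey
{
    Oid     spcNode;            /* tablespace */
    Oid     relNode;            /* relfilenode */
} RelfilenodeCacheKey;

typedef struct RelfilenodeCacheEntry
{
    RelfilenodeCacheKey key;    /* hash key; must be first */
    Oid     relid;              /* pg_class OID, InvalidOid if not found */
} RelfilenodeCacheEntry;

static HTAB *RelfilenodeCache = NULL;

static void
InitializeRelfilenodeCache(void)
{
    HASHCTL ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(RelfilenodeCacheKey);
    ctl.entrysize = sizeof(RelfilenodeCacheEntry);
    ctl.hash = tag_hash;
    RelfilenodeCache = hash_create("Relfilenode cache", 256, &ctl,
                                   HASH_ELEM | HASH_FUNCTION);
    /* a real version would register relcache inval callbacks here */
}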

I haven't read through the patch series far enough to know what this
is being used for yet, but my fear is that you're using it to handle
mapping a relfilenode extracted from the WAL stream back to a relation
OID.  The problem with that is that relfilenode assignments obey
transaction semantics.  So, if someone begins a transaction, truncates
a table, inserts a tuple, and commits, the heap_insert record is going
to refer to a relfilenode that, according to the system catalogs,
doesn't exist.  This is similar to one of the worries in my other
email, so I won't belabor the point too much more here...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
Hi Robert,

Thanks for your answer.

On Thursday, June 14, 2012 06:17:26 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > === Design goals for logical replication === :
> > - in core
> > - fast
> > - async
> > - robust
> > - multi-master
> > - modular
> > - as unintrusive as possible implementation wise
> > - basis for other technologies (sharding, replication into other DBMSs,
> > ...)
> 
> I agree with all of these goals except for "multi-master".  I am not
> sure that there is a need to have a multi-master replication solution
> in core.  The big tricky part of multi-master replication is conflict
> resolution, and that seems like an awful lot of logic to try to build
> into core - especially given that we will want it to be extensible.
I don't plan to throw in loads of conflict resolution smarts. The aim is to get 
to the place where all the infrastructure is there so that a MM solution can 
be built by basically plugging in a conflict resolution mechanism. Maybe 
providing a very simple one.
I think without in-core support it's really, really hard to build a sensible MM 
implementation. Which doesn't mean it has to live entirely in core.
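
Just to show the granularity I am thinking of (purely illustrative; nothing 
like this exists in the posted patches):

typedef enum ConflictDecision
{
    RESOLVE_KEEP_LOCAL,         /* discard the remote change */
    RESOLVE_APPLY_REMOTE        /* overwrite with the remote change */
} ConflictDecision;

/* hypothetical hook the apply process would call on an UPDATE conflict */
typedef ConflictDecision (*conflict_resolution_hook)(Relation rel,
                                                     HeapTuple local_tup,
                                                     HeapTuple remote_tup);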

Loads of the use-cases we have seen lately have a relatively small, low-
conflict shared dataset and a far bigger sharded one. While that obviously 
isn't the only relevant use case, it is a sensible and important one.

> More generally, I would much rather see us focus on efficiently
> extracting changesets from WAL and efficiently applying those
> changesets on another server.  IMHO, those are the things that are
> holding back the not-in-core replication solutions we have today,
> particularly the first.  If we come up with a better way of applying
> change-sets, well, Slony can implement that too; they are already
> installing C code.  What neither they nor any other non-core solution
> can implement is change-set extraction, and therefore I think that
> ought to be the focus.
It definitely is a very important focus. I don't think it is the only one. But 
that doesn't seem to be a problem to me as long as everything is kept fairly 
modular (which I tried rather hard to do).

> To put all that another way, I think it is a 100% bad idea to try to
> kill off Slony, Bucardo, Londiste, or any of the many home-grown
> solutions that are out there to do replication.  Even if there were no
> technical place for third-party replication products (and I think
> there is), we will not win many friends by making it harder to extend
> and add on to the server.  If we build an in-core replication solution
> that can be used all by itself, that is fine with me.  But I think it
> should also be able to expose its constituent parts as a toolkit for
> third-party solutions.
I agree 100%. Unfortunately I forgot to explicitly make that point, but the 
plan definitely is to make the life of other replication solutions easier, 
not harder. I don't think there will ever be one replication solution that fits 
every use-case perfectly.
At pgcon I talked with some of the slony guys and they were definitely 
interested in the changeset generation and I have kept that in mind. Once some 
problems that need resolving independently of that are solved (namely DDL), 
it shouldn't take much to generate their output format. The 'apply' code is 
fully abstracted and separated.

> > While you may argue that most of the above design goals are already
> > provided by various trigger based replication solutions like Londiste or
> > Slony, we think that thats not enough for various reasons:
> > 
> > - not in core (and thus less trustworthy)
> > - duplication of writes due to an additional log
> > - performance in general (check the end of the above presentation)
> > - complex to use because there is no native administration interface
> 
> I think that your parenthetical note "(and thus less trustworthy)"
> gets at another very important point, which is that one of the
> standards for inclusion in core is that it must in fact be trustworthy
> enough to justify the confidence that users will place in it.  It will
> NOT benefit the project to have two replication solutions in core, one
> of which is crappy.  More, even if what we put in core is AS GOOD as
> the best third-party solutions that are available, I don't think
> that's adequate.  It has to be better.  If it isn't, there is no
> excuse for preempting what's already out there.
I agree that it has to be very good. *But* I think it is totally acceptable 
if it doesn't have all the bells and whistles from the start. That would be a 
sure road to disaster. For one thing, implementing all that takes time; for 
another, the amount of discussion until we are there is rather huge.

> I imagine you are thinking along similar lines, but I think that it
> bears being explicit about.
Seems like we're thinking along the same lines, yes.

> > The biggest problem is, that interpreting tuples in the WAL stream
> > requires an up-to-date system catalog and needs to be done in a
> > compatible backend and architecture. The requirement of an up-to-date
> > catalog could be solved by adding more data to the WAL stream but it
> > seems to be likely that that would require relatively intrusive &
> > complex changes. Instead we chose to require a synchronized catalog at
> > the decoding site. That adds some complexity to use cases like
> > replicating into a different database or cross-version replication. For
> > those it is relatively straight-forward to develop a proxy pg instance
> > that only contains the catalog and does the transformation to textual
> > changes.
> The actual requirement here is more complex than "an up-to-date
> catalog".  Suppose transaction X begins, adds a column to a table,
> inserts a row, and commits.  That tuple needs to be interpreted using
> the tuple descriptor that transaction X would see (which includes the
> new column), NOT the tuple descriptor that some other transaction
> would see at the same time (which won't include the new column).  In a
> more complicated scenario, X might (1) begin, (2) start a
> subtransaction that alters the table, (3) release the savepoint or
> roll back to the save point, (4) insert a tuple, and (5) commit.  Now,
> the correct tuple descriptor for interpreting the tuple inserted in
> step (4) depends on whether step (3) was a release savepoint or a
> rollback-to-savepoint.  How are you handling these (and similar but
> more complex) cases?
I don't handle DDL at all yet. What I posted is a working, but early, prototype 
;). Building a simple prototype seemed like a good idea to get a feeling for 
everything, but solving *the* hairy problem without input from -hackers seemed 
like a bad idea.

But I have thought about the issue for quite a while... The ApplyCache module 
does reassemble wal records from one transaction into one coherent stream with 
only changes from that transaction. Aborted transactions are thrown away.
So you can apply half a transaction, detect the changed tupledesc, and then 
reapply the rest.

But all that is moot until we agree on how to handle DDL. More about that 
further down.

> Moreover, we will want in the future to allow some of the DDL changes
> that currently require AccessExclusiveLock to be performed with a
> lesser lock.  It is unclear to me that this will be practical as far
> as adding columns goes, but it would be a shame if logical replication
> were the thing standing in the way. 
Well, should it really come to that, which I don't think it will, I 
personally don't have a big problem with making the relaxed rules only 
available with a wal_level < WAL_LEVEL_LOGICAL.

> > This also is the solution to the other big problem, the need to work
> > around architecture/version specific binary formats. The alternative,
> > producing cross-version, cross-architecture compatible binary changes or
> > even more so textual changes all the time seems to be prohibitively
> > expensive. Both from a cpu and a storage POV and also from the point of
> > implementation effort.
> I think that if you can't produce a textual record of changes, you're
> throwing away 75% of what people will want to do with this.  Being
> able to replicate across architectures, versions, and even into
> heterogeneous databases is the main point of having logical
> replication, IMV.   Multi-master replication is nice to have, but IME
> there is huge demand for a replication solution that doesn't need to
> be temporarily replaced with something completely different every time
> you want to do a database upgrade.
All I am saying in the above paragraph is that we do not want to change the 
wal into a cross-architecture, cross-version compatible format.

> > For some operations (UPDATE, DELETE) and corner-cases (e.g. full page
> > writes) additional data needs to be logged, but the additional amount of
> > data isn't that big. Requiring a primary-key for any change but INSERT
> > seems to be a sensible thing for now. The required changes are fully
> > contained in heapam.c and are pretty simple so far.
> I think that you can handle the case where there are no primary keys
> by simply treating the whole record as a primary key.  There might be
> duplicates, but I think you can work around that by decreeing that
> when you replay an UPDATE or DELETE operation you will update or
> delete at most one record.  So if there are multiple exactly-identical
> records in the target database, then you will just UPDATE or DELETE
> exactly one of them (it doesn't matter which).  This doesn't even seem
> particularly complicated.
Hm. Yes, you could do that. But I have to say I don't really see the point. 
Maybe the fact that I do envision multimaster systems at some point is 
clouding my judgement though, as it's far less easy in that case.

It also complicates the wal format, as you now need to specify whether you 
transport a full or a primary-key-only tuple...

> > We introduced a new command that, analogous to
> > START_REPLICATION, is called START_LOGICAL_REPLICATION that will stream
> > out all xlog records that pass through a filter.
> > 
> > The on-the-wire format stays the same. The filter currently simply
> > filters out all records which are not interesting for logical replication
> > (indexes, freezing, ...) and records that did not originate on the same
> > system.
> > 
> > The requirement of filtering by 'origin' of a wal node comes from the
> > planned multimaster support. Changes replayed locally that originate
> > from another site should not be replayed again there. If the wal is plainly
> > used without such a filter that would cause loops. Instead we tag every
> > wal record with the "node id" of the site that caused the change to
> > happen and changes with a nodes own "node id" won't get applied again.
> > 
> > Currently filtered records get simply replaced by NOOP records and loads
> > of zeroes which obviously is not a sensible solution. The difficulty of
> > actually removing the records is that that would change the LSNs. We
> > currently rely on those though.
> > 
> > The filtering might very well get expanded to support partial replication
> > and such in future.
> This all seems to need a lot more thought.
Yes. I didn't want to make any decisions here which would probably be the 
wrong ones anyway, so I went for the NOOP option for now.

Unless you're talking about the 'origin_id' concept - that seems to work out 
very nicely with a minimal amount of code needed.
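
For illustration, the loop-avoidance part of the filter is essentially just 
the following (field and function names are invented here):

/* on the sending side: only forward records that originated locally,
 * so a change is never streamed out again by a node that merely
 * replayed it */
static bool
forward_record_p(XLogRecord *record, uint8 local_node_id)
{
    if (record->xl_origin_id != local_node_id)
        return false;           /* replayed from elsewhere, don't re-send */
    return true;
}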

> > To sensibly apply changes out of the WAL stream we need to solve two
> > things: Reassemble transactions and apply them to the target database.
> > 
> > The logical stream from 1. via 2. consists of individual changes
> > identified by the relfilenode of the table and the xid of the
> > transaction. Given (sub)transactions, rollbacks, crash recovery,
> > subtransactions and the like those changes obviously cannot be
> > individually applied without fully losing the pretence of consistency.
> > To solve that we introduced a module, dubbed ApplyCache which does the
> > reassembling. This module is *independent* of the data source and of the
> > method of applying changes so it can be reused for replicating into a
> > foreign system or similar.
> > 
> > Due to the overhead of planner/executor/toast reassembly/type conversion
> > (yes, we benchmarked!) we decided against statement generation for
> > apply. Even when using prepared statements the overhead is rather
> > noticeable.
> > 
> > Instead we decided to use relatively lowlevel heapam.h/genam.h accesses
> > to do the apply. For now we decided to use only one process to do the
> > applying, parallelizing that seems to be too complex for an introduction
> > of an already complex feature.
> > In our tests the apply process could keep up with pgbench -c/j 20+
> > generating changes. This will obviously heavily depend on the workload.
> > A fully seek bound workload will definitely not scale that well.
> > 
> > Just to reiterate: Plugging in another method to do the apply should be a
> > relatively simple matter of setting up three callbacks to a different
> > function (begin, apply_change, commit).
> 
> I think it's reasonable to do the apply using these low-level
> mechanisms, but surely people will sometimes want to extract tuples as
> text and do whatever with them.  This goes back to my comment about
> feeling that we need a toolkit approach, not a one-size-fits-all
> solution.
Yes. We definitely need a toolkit approach. If you look at the individual 
patches, especially the later ones, you can see that I tried very hard to go 
that way.
The wal decoding (Patch 09), transaction reassembly (Patch 08) and low-level 
apply (Patch 14) parts are as separate as possible. That is, 09 feeds 
into an ApplyCache without knowing anything about the side which applies the 
changes. As written above, you currently really only need to replace those 3 
callbacks to produce text output instead.

That might get slightly more complex with the choice of whether toast should be 
reassembled in memory or not, but even then it should be pretty simple.
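
E.g. a minimal textual apply_change callback could look roughly like this, 
using the struct names from the sketch upthread (change_to_text() is an 
invented helper that would look up the TupleDesc and run the column type 
output functions):

static void
text_apply_change(ApplyCache *cache, ApplyCacheTXN *txn,
                  ApplyCacheChange *change)
{
    char *s = change_to_text(change);   /* invented helper */

    elog(LOG, "decoded change: %s", s);
    pfree(s);
}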

> > Another complexity in this is how to synchronize the catalogs. We plan to
> > use command/event triggers and the oid preserving features from
> > pg_upgrade to keep the catalogs in-sync. We did not start working on
> > that.
> This strikes me as completely unacceptable.  People ARE going to want
> to replicate data between non-identical schemas on systems with
> unsynchronized OIDs.  And even if they weren't, relying on triggers to
> keep things in sync is exactly the sort of kludge that has inspired
> all sorts of frustration with our existing replication solutions.
Yes. I definitely agree that people will want to do that. Hell, I want to do 
that.
The plan for that is (see my mail to merlin about it) to have catalog-only 
proxy instances. For those it's even possible, without many additional 
problems, to use a pretty normal HS setup + filtering. That's the only 
realistic way I have found to do the necessary catalog lookups (TupleDescs + 
output procedures) and handle the binary compatibility.

I think though that we do not want to enforce that mode of operation for 
tightly coupled instances. For those I was thinking of using command triggers 
to synchronize the catalogs. 
One of the big screwups of the current replication solutions is exactly that 
you cannot sensibly do DDL. That is not a big problem if you have a huge 
system with loads of different databases and very knowledgeable people et al., 
but at the beginning it really sucks. I have no problem with making one of the 
nodes the "schema master" in that case.
Also I would like to avoid the overhead of the proxy instance for use-cases 
where you really want one node replicated as fully as possible, with the 
slight exception of being able to have summing tables, different indexes et al.

Does that make sense for you?

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


On Thursday, June 14, 2012 08:50:51 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > This patch is problematic because formally indexes used by syscaches
> > needs to be unique, this one is not though because of 0/InvalidOids
> > entries for nailed/shared catalog entries. Those values aren't allowed
> > to be queried though.
> That's not the only reason it's not unique.  Take a look at
> GetRelFileNode().  We really only guarantee that <database OID,
> tablespace OID, relfilenode, backend-ID>, taken as a four-tuple, is
> unique.  You could have the same relfilenode in different tablespaces,
> or even within the same tablespace with different backend-IDs.  The
> latter might not matter for you because you're presumably disregarding
> temp tables, but the former probably does.  It's an uncommon scenario
> because we normally set relid = relfilenode, and of course relid is
> unique across the database; but if the table gets rewritten then you end
> up with relid != relfilenode, and I don't think there's anything at
> that point that will prevent the new relfilenode from being chosen as
> some other relation's relfilenode, as long as it's in a different
> tablespace.
> 
> I think the solution may be to create a specialized cache for this,
> rather than relying on the general syscache infrastructure.  You might
> look at, e.g., attoptcache.c for an example.  That would allow you to
> build a cache that is aware of things like the relmapper
> infrastructure, and the fact that temp tables are ignorable for your
> purposes.  But I think you will need to include at least the
> tablespace OID in the key along with the relfilenode to make it
> bullet-proof.
Yes, the tablespace OID should definitely be in there. I need to read up on 
the details of a dedicated cache. Once more I didn't want to put in more work 
before discussing it here.

> I haven't read through the patch series far enough to know what this
> is being used for yet, but my fear is that you're using it to handle
> mapping a relflenode extracted from the WAL stream back to a relation
> OID.  The problem with that is that relfilenode assignments obey
> transaction semantics.  So, if someone begins a transaction, truncates
> a table, inserts a tuple, and commits, the heap_insert record is going
> to refer to a relfilenode that, according to the system catalogs,
> doesn't exist.  This is similar to one of the worries in my other
> email, so I won't belabor the point too much more here...
Well, yes. We *need* to do the mapping back from the relfilenode to a table. 
The idea is that because the receiving side, be it a full cluster or just a 
catalog-only one, has a fully synchronized catalog in which DDL gets applied 
correctly, inside the transaction just as on the sending side, it should never 
be wrong to do that mapping.
It probably is necessary to make the syscache lookup/infrastructure use an 
MVCC-ish snapshot though. No idea how hard that would be yet. Might be a good 
argument for your suggestion of using a specialized cache.
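
(For concreteness, a minimal sketch of what such a specialized cache could 
look like, modeled loosely on attoptcache.c; all names below are invented for 
illustration, and a real version would also register invalidation callbacks:)

#include "postgres.h"
#include "utils/hsearch.h"

/* key per the suggestion above: <tablespace, relfilenode> */
typedef struct RelfilenodeCacheKey
{
    Oid         reltablespace;
    Oid         relfilenode;
} RelfilenodeCacheKey;

typedef struct RelfilenodeCacheEntry
{
    RelfilenodeCacheKey key;    /* hash key, must be first */
    Oid         relid;          /* pg_class OID, InvalidOid if unknown */
} RelfilenodeCacheEntry;

static HTAB *RelfilenodeCacheHash = NULL;

static void
InitializeRelfilenodeCache(void)
{
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(RelfilenodeCacheKey);
    ctl.entrysize = sizeof(RelfilenodeCacheEntry);
    ctl.hash = tag_hash;
    RelfilenodeCacheHash = hash_create("Relfilenode cache", 256, &ctl,
                                       HASH_ELEM | HASH_FUNCTION);
}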

Let's sidetrack this till we have a tentative agreement on how to handle DDL ;). I 
am aware of the issues with rollbacks, truncate et al...

Thanks,

Andres

-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


On Thu, Jun 14, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Let's sidetrack this till we have a tentative agreement on how to handle DDL ;). I
> am aware of the issues with rollbacks, truncate et al...

Agreed; I will write up my thoughts about DDL on the other thread.  I
think that's a key thing we need to figure out; once we understand how
we're handling that, the correct design for this will probably fall
out pretty naturally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 03/16] Add a new syscache to fetch a pg_class entry via its relfilenode

From
Christopher Browne
Date:
On Thu, Jun 14, 2012 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 14, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Let's sidetrack this till we have a tentative agreement on how to handle DDL ;). I
>> am aware of the issues with rollbacks, truncate et al...
>
> Agreed; I will write up my thoughts about DDL on the other thread.  I
> think that's a key thing we need to figure out; once we understand how
> we're handling that, the correct design for this will probably fall
> out pretty naturally.

I wonder if *an* answer (if not forcibly a perfect one) is to provide
a capturable injection point.

A possible shape of this would be to have a function to which you pass
a DDL statement, at which point, it does two things:
a) Runs the DDL, to make the requested change, and
b) Captures the DDL in a convenient form in the WAL log so that it may
be detected and replayed at the right point in processing.
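
(A sketch of such a wrapper, assuming a hypothetical LogCapturedDDL() that 
writes the statement text into WAL; none of these names exist today:)

#include "postgres.h"
#include "fmgr.h"
#include "executor/spi.h"
#include "utils/builtins.h"

PG_FUNCTION_INFO_V1(replicate_ddl);

Datum
replicate_ddl(PG_FUNCTION_ARGS)
{
    char       *ddl = text_to_cstring(PG_GETARG_TEXT_PP(0));

    /* a) run the DDL, to make the requested change */
    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "SPI_connect failed");
    if (SPI_execute(ddl, false, 0) < 0)
        elog(ERROR, "could not execute: %s", ddl);
    SPI_finish();

    /* b) capture the statement text in WAL, so that replay happens at
     * the right point; LogCapturedDDL() is hypothetical */
    LogCapturedDDL(ddl);

    PG_RETURN_VOID();
}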

That's not a solution for capturing DDL automatically, but it's
something that would be useful-ish even today, for systems like Slony
and Londiste, and this would be natural to connect to Dimitri's "Event
Triggers."

That also fits with the desire to have components that are (at least
hopefully) usable to other existing replication systems.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [PATCH 06/16] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
On 13.06.2012 14:28, Andres Freund wrote:
> Features:
> - streaming reading/writing
> - filtering
> - reassembly of records
>
> Reusing the ReadRecord infrastructure in situations where the code that wants
> to do so is not tightly integrated into xlog.c is rather hard and would require
> changes to rather integral parts of the recovery code which doesn't seem to be
> a good idea.

It would be nice to refactor ReadRecord and its subroutines out of xlog.c. 
That file has grown over the years to be really huge, and separating the 
code to read WAL sounds like it should be a pretty natural split. I 
don't want to duplicate all the WAL reading code, so we really should 
find a way to reuse that. I'd suggest rewriting ReadRecord into a thin 
wrapper that just calls the new xlogreader code.

> Missing:
> - "compressing" the stream when removing uninteresting records
> - writing out correct CRCs
> - validating CRCs
> - separating reader/writer

- comments.

At a quick glance, I couldn't figure out how this works. There seem to 
be some callback functions? If you want to read an xlog stream using 
this facility, what do you do? Can this be used for writing WAL, as well 
as reading? If so, what do you need the write support for?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote:
> On 13.06.2012 14:28, Andres Freund wrote:
> > Features:
> > - streaming reading/writing
> > - filtering
> > - reassembly of records
> > 
> > Reusing the ReadRecord infrastructure in situations where the code that
> > wants to do so is not tightly integrated into xlog.c is rather hard and
> > would require changes to rather integral parts of the recovery code
> > which doesn't seem to be a good idea.
> It would be nice to refactor ReadRecord and its subroutines out of xlog.c.
> That file has grown over the years to be really huge, and separating the
> code to read WAL sounds like it should be a pretty natural split. I
> don't want to duplicate all the WAL reading code, so we really should
> find a way to reuse that. I'd suggest rewriting ReadRecord into a thin
> wrapper that just calls the new xlogreader code.
I agree that it is not very nice to duplicate it. But I also don't want to go 
the route of replacing ReadRecord with it for a while; we can replace 
ReadRecord later if we want. As long as it is in flux like it is right now I 
don't really see the point in investing energy in it.
Also I am not that sure how a callback-oriented API fits into the xlog.c 
workflow?

> > Missing:
> > - "compressing" the stream when removing uninteresting records
> > - writing out correct CRCs
> > - validating CRCs
> > - separating reader/writer
> 
> - comments.
> At a quick glance, I couldn't figure out how this works. There seem to
> be some callback functions? If you want to read an xlog stream using
> this facility, what do you do?
You currently have to fill out 4 callbacks:

XLogReaderStateInterestingCB is_record_interesting;
XLogReaderStateWriteoutCB writeout_data;
XLogReaderStateFinishedRecordCB finished_record;
XLogReaderStateReadPageCB read_page;

As an example of how to use it (from the walsender support for 
START_LOGICAL_REPLICATION):

if (!xlogreader_state)
{
    xlogreader_state = XLogReaderAllocate();
    xlogreader_state->is_record_interesting = RecordRelevantForLogicalReplication;
    xlogreader_state->finished_record = ProcessRecord;
    xlogreader_state->writeout_data = WriteoutData;
    xlogreader_state->read_page = XLogReadPage;

    /* startptr is the current XLog position */
    xlogreader_state->startptr = startptr;
    XLogReaderReset(xlogreader_state);
}

/* how far does valid data go */
xlogreader_state->endptr = endptr;

XLogReaderRead(xlogreader_state);

The last step will then call the above callbacks till it reaches endptr. I.e. 
it first reads a page with "read_page"; then checks whether a record is 
interesting for the use-case ("is_record_interesting"); in case it is 
interesting, it gets reassembled and passed to the "finished_record" callback. 
Then the bytestream gets written out again with "writeout_data".

In this case it gets written to the buffer the walsender has allocated. In 
others it might just get thrown away.

> Can this be used for writing WAL, as well as reading? If so, what do you
> need the write support for?
It currently can replace records which are not interesting (e.g. index changes 
in the case of logical rep). Filtered records are replaced with XLOG_NOOP 
records with correct length currently. In future the actual amount of data 
should really be reduced. I don't yet know how to map LSNs of the 
uncompressed/compressed streams onto each other...
The filtered data is then passed to a writeout callback (in a streaming 
fashion).
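
(To illustrate, the is_record_interesting callback from the example above 
could look roughly like this; the signature is guessed, not taken from the 
patch:)

/* hypothetical filter: keep heap and transaction records, drop the rest
 * (e.g. index changes); dropped records get replaced by XLOG_NOOP */
static bool
RecordRelevantForLogicalReplication(XLogRecord *record)
{
    switch (record->xl_rmid)
    {
        case RM_HEAP_ID:
        case RM_HEAP2_ID:
        case RM_XACT_ID:
            return true;
        default:
            return false;
    }
}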

The whole writing-out part is pretty ugly at the moment and I just bolted it 
on top because it was convenient for the moment. I am not yet sure how the API 
for that should look....

Andres

-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote:
> On 13.06.2012 14:28, Andres Freund wrote:
> > Features:
> > - streaming reading/writing
> > - filtering
> > - reassembly of records
> > 
> > Reusing the ReadRecord infrastructure in situations where the code that
> > wants to do so is not tightly integrated into xlog.c is rather hard and
> > would require changes to rather integral parts of the recovery code
> > which doesn't seem to be a good idea.
> 
> It would be nice to refactor ReadRecord and its subroutines out of xlog.c.
> That file has grown over the years to be really huge, and separating the
> code to read WAL sounds like it should be a pretty natural split. I
> don't want to duplicate all the WAL reading code, so we really should
> find a way to reuse that. I'd suggest rewriting ReadRecord into a thin
> wrapper that just calls the new xlogreader code.
> 
> > Missing:
> > - "compressing" the stream when removing uninteresting records
> > - writing out correct CRCs
> > - validating CRCs
> > - separating reader/writer
> 
> - comments.
> 
> At a quick glance, I couldn't figure out how this works. There seem to
> be some callback functions? If you want to read an xlog stream using
> this facility, what do you do? Can this be used for writing WAL, as well
> as reading? If so, what do you need the write support for?
Oh, btw, the callbacks and parameters are somewhat documented in the 
xlogreader.h header in the XLogReaderState struct.
Still needs improvement though.

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Robert Haas
Date:
On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't plan to throw in loads of conflict resolution smarts. The aim is to get
> to the place where all the infrastructure is there so that a MM solution can
> be built by basically plugging in a conflict resolution mechanism. Maybe
> providing a very simple one.
> I think without in-core support it's really, really hard to build a sensible MM
> implementation. Which doesn't mean it has to live entirely in core.

Of course, several people have already done it, perhaps most notably Bucardo.

Anyway, it would be good to get opinions from more people here.  I am
sure I am not the only person with an opinion on the appropriateness
of trying to build a multi-master replication solution in core or,
indeed, the only person with an opinion on any of these other issues.
It is not good for those other opinions to be saved for a later date.

> Hm. Yes, you could do that. But I have to say I don't really see a point.
> Maybe the fact that I do envision multimaster systems at some point is
> clouding my judgement though, as it's far less easy in that case.

Why?  I don't think that particularly changes anything.

> It also complicates the wal format as you now need to specify whether you
> transport a full or a primary-key only tuple...

Why?  If the schemas are in sync, the target knows what the PK is
perfectly well.  If not, you're probably in trouble anyway.

> I think though that we do not want to enforce that mode of operation for
> tightly coupled instances. For those I was thinking of using command triggers
> to synchronize the catalogs.
> One of the big screwups of the current replication solutions is exactly that
> you cannot sensibly do DDL which is not a big problem if you have a huge
> system with loads of different databases and very knowledgeable people et al.
> but at the beginning it really sucks. I have no problem with making one of the
> nodes the "schema master" in that case.
> Also I would like to avoid the overhead of the proxy instance for use-cases
> where you really want one node replicated as fully as possible with the slight
> exception of being able to have summing tables, different indexes et al.

In my view, a logical replication solution is precisely one in which
the catalogs don't need to be in sync.  If the catalogs have to be in
sync, it's not logical replication.  ISTM that what you're talking
about is sort of a hybrid between physical replication (pages) and
logical replication (tuples) - you want to ship around raw binary
tuple data, but not entire pages.  The problem with that is it's going
to be tough to make robust.  Users could easily end up with answers
that are total nonsense, or probably even crash the server.

To step back and talk about DDL more generally, you've mentioned a few
times the idea of using an SR instance that has been filtered down to
just the system catalogs as a means of generating logical change
records.  However, as things stand today, there's no reason to suppose
that replicating anything less than the entire cluster is sufficient.
For example, you can't translate enum labels to strings without access
to the pg_enum catalog, which would be there, because enums are
built-in types.  But someone could supply a similar user-defined type
that uses a user-defined table to do those lookups, and now you've got
a problem.  I think this is a contractual problem, not a technical
one.  From the point of view of logical replication, it would be nice
if type output functions were basically guaranteed to look at nothing
but the datum they get passed as an argument, or at the very least
nothing other than the system catalogs, but there is no such
guarantee.  And, without such a guarantee, I don't believe that we can
create a high-performance, robust, in-core replication solution.

Now, the nice thing about being the people who make PostgreSQL happen
is we get to decide what the C code that people load into the server
is required to guarantee; we can change the rules.  Before, types were
allowed to do X, but now they're not.  Unfortunately, in this case, I
don't really find that an acceptable solution.  First, it might break
code that has worked with PostgreSQL for many years; but worse, it
won't break it in any obvious way, but rather only if you're using
logical replication, which will doubtless cause people to attribute
the failure to logical replication rather than to their own code.
Even if they do understand that we imposed a rule-change from on high,
there's no really good workaround: an enum type is a good example of
something that you *can't* implement without a side-table.  Second, it
flies in the face of our often-stated desire to make the server
extensible.  Also, even given the existence of such a restriction, you
still need to run any output function that relies on catalogs with
catalog contents that match what existed at the time that WAL was
generated, and under the correct snapshot, which is not trivial.
These things are problems even for other things that we might need to
do while examining the WAL stream, but they're particularly acute for
any application that wants to run type-output functions to generate
something that can be sent to a server which doesn't necessarily
have matching catalog contents.

But it strikes me that these things, really, are only a problem for a
minority of data types.  For text, or int4, or float8, or even
timestamptz, we don't need *any catalog contents at all* to
reconstruct the tuple data.  Knowing the correct type alignment and
which C function to call is entirely sufficient.  So maybe instead of
trying to cobble together a set of catalog contents that we can use
for decoding any tuple whatsoever, we should instead divide the world
into well-behaved types and poorly-behaved types.  Well-behaved types
are those that can be interpreted without the catalogs, provided that
you know what type it is.  Poorly-behaved types (records, enums) are
those where you can't.  For well-behaved types, we only need a small
amount of additional information in WAL to identify which types we're
trying to decode (not the type OID, which might fail in the presence
of nasty catalog hacks, but something more universal, like a UUID that
means "this is text", or something that identifies the C entrypoint).
And then maybe we handle poorly-behaved types by pushing some of the
work into the foreground task that's generating the WAL: in the worst
case, the process logs a record before each insert/update/delete
containing the text representation of any values that are going to be
hard to decode.  In some cases (e.g. records all of whose constituent
fields are well-behaved types) we could instead log enough additional
information about the type to permit blind decoding.
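
(For illustration, a sketch of what decoding a couple of well-behaved types 
without any catalog access could look like; the tag enum and function are 
invented, but int4out and textout genuinely look only at the datum they are 
passed:)

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

/* hypothetical stable, catalog-independent type tags carried in WAL */
typedef enum BlindTypeTag
{
    BLIND_TYPE_INT4,
    BLIND_TYPE_TEXT
} BlindTypeTag;

static char *
blind_datum_out(BlindTypeTag tag, Datum value)
{
    switch (tag)
    {
        case BLIND_TYPE_INT4:
            /* int4out needs no catalog access at all */
            return DatumGetCString(DirectFunctionCall1(int4out, value));
        case BLIND_TYPE_TEXT:
            /* likewise for textout */
            return DatumGetCString(DirectFunctionCall1(textout, value));
        default:
            /* poorly-behaved type: needs catalogs, or a pre-logged
             * text representation from the foreground task */
            elog(ERROR, "cannot decode datum without catalog access");
            return NULL;        /* keep compiler quiet */
    }
}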

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> So maybe instead of trying to cobble together a set of catalog
> contents that we can use for decoding any tuple whatsoever, we
> should instead divide the world into well-behaved types and
> poorly-behaved types.  Well-behaved types are those that can be
> interpreted without the catalogs, provided that you know what type
> it is.  Poorly-behaved types (records, enums) are those where you
> can't.  For well-behaved types, we only need a small amount of
> additional information in WAL to identify which types we're trying
> to decode (not the type OID, which might fail in the presence of
> nasty catalog hacks, but something more universal, like a UUID
> that means "this is text", or something that identifies the C
> entrypoint).  And then maybe we handle poorly-behaved types by
> pushing some of the work into the foreground task that's
> generating the WAL: in the worst case, the process logs a record
> before each insert/update/delete containing the text
> representation of any values that are going to be hard to decode. 
> In some cases (e.g. records all of whose constituent fields are
> well-behaved types) we could instead log enough additional
> information about the type to permit blind decoding.
What about matching those values up to the correct table name and
the respective column names?
-Kevin


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
Hi Robert,

On Friday, June 15, 2012 10:03:38 PM Robert Haas wrote:
> On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > I don't plan to throw in loads of conflict resolution smarts. The aim is
> > to get to the place where all the infrastructure is there so that a MM
> > solution can be built by basically plugging in a conflict resolution
> > mechanism. Maybe providing a very simple one.
> > I think without in-core support its really, really hard to build a
> > sensible MM implementation. Which doesn't mean it has to live entirely
> > in core.
> 
> Of course, several people have already done it, perhaps most notably
> Bucardo.
Bucardo certainly is nice but it's not usable for many things, just from an 
overhead perspective.

> Anyway, it would be good to get opinions from more people here.  I am
> sure I am not the only person with an opinion on the appropriateness
> of trying to build a multi-master replication solution in core or,
> indeed, the only person with an opinion on any of these other issues.
> It is not good for those other opinions to be saved for a later date.
Agreed.

> > Hm. Yes, you could do that. But I have to say I don't really see a point.
> > Maybe the fact that I do envision multimaster systems at some point is
> > clouding my judgement though, as it's far less easy in that case.
> Why?  I don't think that particularly changes anything.
Because it makes conflict detection very hard. I also don't think it's a 
feature worth supporting. What's the use-case of updating records you cannot 
properly identify?

> > It also complicates the wal format as you now need to specify whether you
> > transport a full or a primary-key only tuple...
> Why?  If the schemas are in sync, the target knows what the PK is
> perfectly well.  If not, you're probably in trouble anyway.
True. There already was the wish (from Kevin) of having the option of 
transporting full before/after images anyway, so the wal format might want to 
be able to represent that.

> > I think though that we do not want to enforce that mode of operation for
> > tightly coupled instances. For those I was thinking of using command
> > triggers to synchronize the catalogs.
> > One of the big screwups of the current replication solutions is exactly
> > that you cannot sensibly do DDL which is not a big problem if you have a
> > huge system with loads of different databases and very knowledgeable
> > people et al. but at the beginning it really sucks. I have no problem
> > with making one of the nodes the "schema master" in that case.
> > Also I would like to avoid the overhead of the proxy instance for
> > use-cases where you really want one node replicated as fully as possible
> > with the slight exception of being able to have summing tables,
> > different indexes et al.
> In my view, a logical replication solution is precisely one in which
> the catalogs don't need to be in sync.  If the catalogs have to be in
> sync, it's not logical replication.  ISTM that what you're talking
> about is sort of a hybrid between physical replication (pages) and
> logical replication (tuples) - you want to ship around raw binary
> tuple data, but not entire pages.
Ok, that's a valid point. Simon argued at the cluster summit that everything 
that's not physical is logical. Which has some appeal, because it seems hard to 
agree what exactly logical rep is. So definition by exclusion makes kind of 
sense ;)

I think what you categorized as "hybrid logical/physical" rep solves an 
important use-case that's very hard to solve at the moment. Before my 
2ndQuadrant days I had several clients which had huge problems using the 
trigger-based solutions because their overhead simply was too big a burden on 
the master. They couldn't use SR either because every consuming database kept 
loads of local data.
I think such scenarios are getting more and more common.

> The problem with that is it's going to be tough to make robust.  Users could
> easily end up with answers that are total nonsense, or probably even crash
> the server.
Why?

> To step back and talk about DDL more generally, you've mentioned a few
> times the idea of using an SR instance that has been filtered down to
> just the system catalogs as a means of generating logical change
> records.  However, as things stand today, there's no reason to suppose
> that replicating anything less than the entire cluster is sufficient.
> For example, you can't translate enum labels to strings without access
> to the pg_enum catalog, which would be there, because enums are
> built-in types.  But someone could supply a similar user-defined type
> that uses a user-defined table to do those lookups, and now you've got
> a problem.  I think this is a contractual problem, not a technical
> one.  From the point of view of logical replication, it would be nice
> if type output functions were basically guaranteed to look at nothing
> but the datum they get passed as an argument, or at the very least
> nothing other than the system catalogs, but there is no such
> guarantee.  And, without such a guarantee, I don't believe that we can
> create a high-performance, robust, in-core replication solution.
I don't think that's a valid argument. Any such solution existing today fails 
to work properly with dump/restore and such because it implies dependencies 
that they do not know about. The "internal" tables will possibly be restored 
later than the tables using them. So your data format *has* to deal with 
loading/outputting data without such lookups anyway.

Some of that can be ameliorated using extensions + configuration tables, but 
even then you have to be *very* careful and plan your backup/restore 
procedures much more carefully than when not.

> Now, the nice thing about being the people who make PostgreSQL happen
> is we get to decide what the C code that people load into the server
> is required to guarantee; we can change the rules.  Before, types were
> allowed to do X, but now they're not.  Unfortunately, in this case, I
> don't really find that an acceptable solution.  First, it might break
> code that has worked with PostgreSQL for many years; but worse, it
> won't break it in any obvious way, but rather only if you're using
> logical replication, which will doubtless cause people to attribute
> the failure to logical replication rather than to their own code.
As I said above, anybody using code like that has to be aware of the problem 
anyway. Should there really be real cases of that, marking configuration tables 
in the catalog as to-be-shared would be a relatively uncomplicated solution.

> Even if they do understand that we imposed a rule-change from on high,
> there's no really good workaround: an enum type is a good example of
> something that you *can't* implement without a side-table.
Enums are a good example of an intrusive feature which breaks features we have 
come to rely on (transactional DDL) and which could not be implemented outside 
of core pg.

And yes, you obviously can implement it without needing a side-table for 
output: just as a string which is checked during input.

> Second, it flies in the face of our often-stated desire to make the server
> extensible.
While I generally see that desire as something worthwhile, I don't see it being 
violated in this case. The amount of extensibility you're removing here is 
minimal in my opinion and solutions for it aren't that hard. Solving them (aka 
marking those tables as some form of secondary system tables) would make that 
code actually reliable.

Sorry, I remain highly unconvinced of the above argumentation.

> Also, even given the existence of such a restriction, you
> still need to run any output function that relies on catalogs with
> catalog contents that match what existed at the time that WAL was
> generated, and under the correct snapshot, which is not trivial.
> These things are problems even for other things that we might need to
> do while examining the WAL stream, but they're particularly acute for
> any application that wants to run type-output functions to generate
> something that can be sent to a server which doesn't necessarily
> have matching catalog contents.
I don't think it's that hard. And as you say, we need to solve it anyway.

> But it strikes me that these things, really, are only a problem for a
> minority of data types.  For text, or int4, or float8, or even
> timestamptz, we don't need *any catalog contents at all* to
> reconstruct the tuple data.  Knowing the correct type alignment and
> which C function to call is entirely sufficient.  So maybe instead of
> trying to cobble together a set of catalog contents that we can use
> for decoding any tuple whatsoever, we should instead divide the world
> into well-behaved types and poorly-behaved types. Well-behaved types
> are those that can be interpreted without the catalogs, provided that
> you know what type it is.  Poorly-behaved types (records, enums) are
> those where you can't.  For well-behaved types, we only need a small
> amount of additional information in WAL to identify which types we're
> trying to decode (not the type OID, which might fail in the presence
> of nasty catalog hacks, but something more universal, like a UUID that
> means "this is text", or something that identifies the C entrypoint).
This would essentially double the size of the WAL even for rows containing 
only simple types and would mean running quite a bit of additional non-trivial 
code in relatively critical parts of the code. Both, imnsho, are unacceptable.
You could reduce the space overhead by adding that information only the 
first time after a table has changed (and then regularly after a checkpoint or 
so) but doing so seems to introduce too much complexity.


> And then maybe we handle poorly-behaved types by pushing some of the
> work into the foreground task that's generating the WAL: in the worst
> case, the process logs a record before each insert/update/delete
> containing the text representation of any values that are going to be
> hard to decode.  In some cases (e.g. records all of whose constituent
> fields are well-behaved types) we could instead log enough additional
> information about the type to permit blind decoding.
I think this is prohibitively expensive from a development, runtime, space and 
maintenance standpoint.
For databases using types whose decoding is rather expensive (e.g. postgis) 
you wouldn't really improve much on the old trigger-based solutions. It's a 
return to "log everything twice".

Sorry if I seem pigheaded here, but I fail to see why all that would buy us 
anything but loads of complexity while losing many potential advantages.

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Steve Singer
Date:
On 12-06-15 04:03 PM, Robert Haas wrote:
> On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund<andres@2ndquadrant.com>  wrote:
>> I don't plan to throw in loads of conflict resolution smarts. The aim is to get
>> to the place where all the infrastructure is there so that a MM solution can
>> be built by basically plugging in a conflict resolution mechanism. Maybe
>> providing a very simple one.
>> I think without in-core support it's really, really hard to build a sensible MM
>> implementation. Which doesn't mean it has to live entirely in core.
> Of course, several people have already done it, perhaps most notably Bucardo.
>
> Anyway, it would be good to get opinions from more people here.  I am
> sure I am not the only person with an opinion on the appropriateness
> of trying to build a multi-master replication solution in core or,
> indeed, the only person with an opinion on any of these other issues.

This sounds like a good place for me to chime in.

I feel that in-core support to capture changes and turn them into change 
records that can be replayed on other databases, without relying on 
triggers and log tables, would be good to have.

I think we want something flexible enough that people can write consumers of 
the LCRs to do conflict resolution for multi-master, but I am not sure that 
the conflict resolution support actually belongs in core.

Most of the complexity of slony (both in terms of lines of code, and 
issues people encounter using it) comes not from the log triggers or 
replay of the logged data, but from the configuration of the cluster.
Controlling things like

* Which tables replicate from a node to which other nodes
* How do you change the cluster configuration on a running system 
(adding nodes, removing nodes, moving the origin of a table, adding 
tables to replication etc...)

This is the harder part of the problem. I think we need to first get the 
infrastructure for capturing, transporting and translating the LCRs into the 
system committed (which is what the current patch set deals with) before we 
get too caught up in the configuration aspects.  I think we will have a 
hard time agreeing on behaviours for some of that other stuff that are 
both flexible enough for enough use cases and simple enough for 
administrators.  I'd like to see in-core support for a lot of that stuff 
but I'm not holding my breath.

> It is not good for those other opinions to be saved for a later date.
>
>> Hm. Yes, you could do that. But I have to say I don't really see a point.
>> Maybe the fact that I do envision multimaster systems at some point is
>> clouding my judgement though, as it's far less easy in that case.
> Why?  I don't think that particularly changes anything.
>
>> It also complicates the wal format as you now need to specify whether you
>> transport a full or a primary-key only tuple...
> Why?  If the schemas are in sync, the target knows what the PK is
> perfectly well.  If not, you're probably in trouble anyway.
>


>> I think though that we do not want to enforce that mode of operation for
>> tightly coupled instances. For those I was thinking of using command triggers
>> to synchronize the catalogs.
>> One of the big screwups of the current replication solutions is exactly that
>> you cannot sensibly do DDL which is not a big problem if you have a huge
>> system with loads of different databases and very knowledgeable people et al.
>> but at the beginning it really sucks. I have no problem with making one of the
>> nodes the "schema master" in that case.
>> Also I would like to avoid the overhead of the proxy instance for use-cases
>> where you really want one node replicated as fully as possible with the slight
>> exception of being able to have summing tables, different indexes et al.
> In my view, a logical replication solution is precisely one in which
> the catalogs don't need to be in sync.  If the catalogs have to be in
> sync, it's not logical replication.  ISTM that what you're talking
> about is sort of a hybrid between physical replication (pages) and
> logical replication (tuples) - you want to ship around raw binary
> tuple data, but not entire pages.  The problem with that is it's going
> to be tough to make robust.  Users could easily end up with answers
> that are total nonsense, or probably even crash the server.
>

I see three catalogs in play here.
1. The catalog on the origin
2. The catalog on the proxy system (this is the catalog used to 
translate the WAL records to LCR's).  The proxy system will need 
essentially the same pgsql binaries (same architecture, important 
complie flags etc..) as the origin
3. The catalog on the destination system(s).

Catalog 2 must be in sync with catalog 1; catalog 3 shouldn't need 
to be in sync with catalog 1.  I think catalogs 2 and 3 are combined in 
the current patch set (though I haven't yet looked at the code 
closely).  I think the performance optimizations Andres has implemented 
to update tuples through low-level functions should be left for later 
and that we should be generating SQL in the apply cache so we don't 
start assuming much about catalog 3.

> guarantee.  And, without such a guarantee, I don't believe that we can
> create a high-performance, robust, in-core replication solution.
>
>
Part of what people expect from a robust in-core solution is that it 
should work with the other in-core features.  If we have to list a 
bunch of in-core types as being incompatible with logical replication 
then people will look at logical replication with the same 'there be 
dragons here' attitude that scare many people away from the existing 
third party replication solutions.   Non-core or third party user 
defined types are a slightly different matter because we can't control 
what they do.


Steve



Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Steve Singer
Date:
On 12-06-13 01:27 PM, Andres Freund wrote:
> The previous mail contained a patch with a mismerge caused by reordering
> commits. Corrected version attached.
> 
> Thanks to Steve Singer for noticing this quickly.

Attached is a more complete review of this patch.

I agree that we will need to identify the node a change originated at. 
We will not only want this for multi-master support but it might also be 
very helpful once we introduce things like cascaded replicas. Using a 16 
bit integer for this purpose makes sense to me.

This patch (with the previous numbered patches already applied), still 
doesn't compile.

gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
-Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include
-D_GNU_SOURCE   -c -o xact.o xact.c
xact.c: In function 'xact_redo_commit':
xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn'
make[4]: *** [xact.o] Error 1

Your complete patch set did compile.  origin_lsn gets added as part of 
your 12'th patch.  Managing so many related patches is going to be a 
pain, but it beats one big patch.  I don't think this patch actually 
requires the origin_lsn change.

Code Review
-------------------------
src/backend/utils/misc/guc.c
@@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] =
     },

     {
+        {"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER,
+            gettext_noop("node id for multimaster."),
+            NULL
+        },
+        &guc_replication_origin_id,
+        InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId,
+        NULL, assign_replication_node_id, NULL
+    },

I'd rather see us refer to this as the 'node id for logical replication' 
over the multimaster node id.  I think that terminology will be less 
controversial.  Multi-master means different things to different people 
and it is still unclear what forms of multi-master we will have in-core. 
For example, most people don't consider slony to be multi-master 
replication.  If a future version of slony were to feed off logical 
replication (instead of triggers) then I think it would need this node 
id to determine which node a particular change has come from.

The description inside the gettext call should probably be "Sets the 
node id for ....." to be consistent with the descriptions of the rest of 
the GUCs.

BootStrapXLOG in xlog.c creates a XLogRecord structure; should it set 
xl_origin_id to InvalidMultimasterNodeId?

WriteEmptyXLOG in pg_resetxlog.c might also need to set xl_origin_id to 
a well defined value.  I think InvalidMultimasterNodeId should be safe 
even for a no-op record from a node that actually has a node_id set on 
real records.

backend/replication/logical/logical.c:
XLogRecPtr current_replication_origin_lsn = {0, 0};

This variable isn't used/referenced by this patch; it probably belongs 
as part of the later patch.

Steve

Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Monday, June 18, 2012 02:43:26 AM Steve Singer wrote:
> On 12-06-13 01:27 PM, Andres Freund wrote:
> > The previous mail contained a patch with a mismerge caused by reordering
> > commits. Corrected version attached.
> > 
> > Thanks to Steve Singer for noticing this quickly.
> 
> Attached is a more complete review of this patch.
> 
> I agree that we will need to identify the node a change originated at.
> We will not only want this for multi-master support but it might also be
> very helpful once we introduce things like cascaded replicas. Using a 16
> bit integer for this purpose makes sense to me.
Good.

> This patch (with the previous numbered patches already applied), still
> doesn't compile.
> 
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
> -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include
> -D_GNU_SOURCE   -c -o xact.o xact.c
> xact.c: In function 'xact_redo_commit':
> xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn'
> make[4]: *** [xact.o] Error 1
> 
> Your complete patch set did compile.  origin_lsn gets added as part of
> your 12'th patch.  Managing so many related patches is going to be a
> pain, but it beats one big patch.  I don't think this patch actually
> requires the origin_lsn change.
Hrmpf #666. I will go through the series commit-by-commit again to make sure 
everything compiles. Reordering this late definitely wasn't a good idea...

I pushed a rebased version with all those fixups (and removal of the 
zeroRecPtr patch).
> 
> Code Review
> -------------------------
> src/backend/utils/misc/guc.c
> @@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] =
>       },
> 
>       {
> +        {"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER,
> +            gettext_noop("node id for multimaster."),
> +            NULL
> +        },
> +        &guc_replication_origin_id,
> +        InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId,
> +        NULL, assign_replication_node_id, NULL
> +    },
> 
> I'd rather see us refer to this as the 'node id for logical replication'
> over the multimaster node id.  I think that terminology will be less
> controversial. 
You're right. 'replication_node_id' or such should be ok?

> BootStrapXLOG in xlog.c creates a XLogRecord structure; should it set
> xl_origin_id to InvalidMultimasterNodeId?
> WriteEmptyXLOG in pg_resetxlog.c might also need to set xl_origin_id to a
> well defined value.  I think InvalidMultimasterNodeId should be safe
> even for a no-op record from a node that actually has a node_id set
> on real records.
Good catches.


> backend/replication/logical/logical.c:
> XLogRecPtr current_replication_origin_lsn = {0, 0};
> 
> This variable isn't used/referenced by this patch; it probably belongs as
> part of the later patch.
Yea, just as the usage of origin_lsn in the above compile failure.


-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 06/16] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
On 15.06.2012 00:38, Andres Freund wrote:
> On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote:
>> On 13.06.2012 14:28, Andres Freund wrote:
>>> Features:
>>> - streaming reading/writing
>>> - filtering
>>> - reassembly of records
>>>
>>> Reusing the ReadRecord infrastructure in situations where the code that
>>> wants to do so is not tightly integrated into xlog.c is rather hard and
>>> would require changes to rather integral parts of the recovery code
>>> which doesn't seem to be a good idea.
>> It would be nice to refactor ReadRecord and its subroutines out of xlog.c.
>> That file has grown over the years to be really huge, and separating the
>> code to read WAL sounds like it should be a pretty natural split. I
>> don't want to duplicate all the WAL reading code, so we really should
>> find a way to reuse that. I'd suggest rewriting ReadRecord into a thin
>> wrapper that just calls the new xlogreader code.
> I agree that it is not very nice to duplicate it. But I also don't want to go
> the route of replacing ReadRecord with it for a while; we can replace
> ReadRecord later if we want. As long as it is in flux like it is right now I
> don't really see the point in investing energy in it.
> Also I am not that sure how a callback-oriented API fits into the xlog.c
> workflow?

If the API doesn't fit the xlog.c workflow, I think that's a pretty good 
indication that the API is not good. The xlog.c workflow is basically:

while (!end-of-wal)
{
    record = ReadRecord();
    redo(record);
}

There's some pretty complicated logic within the bowels of ReadRecord 
(see XLogPageRead(), for example), but it seems to me those parts 
actually fit your callback API pretty well. The biggest change is 
that the xlog-reader thingie should return to the caller after each 
record, instead of calling a callback for each read record and only 
returning at end-of-wal.

Or we could put the code to call the redo-function into the callback, 
but there's also some logic there to exit early if you reach the 
desired recovery point, for example, so the callback API would need to 
be extended to allow such early exit. I think a return-after-each-record 
API would be better, though.
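
(In other words, something like the following, where XLogReaderReadOne() and 
reached_recovery_target() are invented names for the one-record-at-a-time 
entry point and the target check:)

XLogRecord *record;

/* reader state is set up once, as before; each call returns one
 * assembled record, or NULL at end of valid WAL */
while ((record = XLogReaderReadOne(xlogreader_state)) != NULL)
{
    redo(record);

    /* early exit (e.g. recovery target reached) stays in the caller */
    if (reached_recovery_target(record))
        break;
}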

>> Can this be used for writing WAL, as well as reading? If so, what do you
>> need the write support for?
> It currently can replace records which are not interesting (e.g. index changes
> in the case of logical rep). Filtered records are replaced with XLOG_NOOP
> records with correct length currently. In future the actual amount of data
> should really be reduced. I don't yet know how to map LSNs of
> the uncompressed/compressed streams onto each other...
> The filtered data is then passed to a writeout callback (in a streaming
> fashion).
>
> The whole writing-out part is pretty ugly at the moment and I just bolted it
> on top because it was convenient for the moment. I am not yet sure how the API
> for that should look....

Can we just leave that out for now?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Monday, June 18, 2012 01:51:45 PM Heikki Linnakangas wrote:
> On 15.06.2012 00:38, Andres Freund wrote:
> > On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote:
> >> On 13.06.2012 14:28, Andres Freund wrote:
> >>> Features:
> >>> - streaming reading/writing
> >>> - filtering
> >>> - reassembly of records
> >>> 
> >>> Reusing the ReadRecord infrastructure in situations where the code that
> >>> wants to do so is not tightly integrated into xlog.c is rather hard and
> >>> would require changes to rather integral parts of the recovery code
> >>> which doesn't seem to be a good idea.
> >> 
> >> It would be nice to refactor ReadRecord and its subroutines out of xlog.c.
> >> That file has grown over the years to be really huge, and separating the
> >> code to read WAL sounds like it should be a pretty natural split. I
> >> don't want to duplicate all the WAL reading code, so we really should
> >> find a way to reuse that. I'd suggest rewriting ReadRecord into a thin
> >> wrapper that just calls the new xlogreader code.
> > 
> > I agree that it is not very nice to duplicate it. But I also don't want
> > to go the route of replacing ReadRecord with it for a while; we can
> > replace ReadRecord later if we want. As long as it is in flux like it is
> > right now I don't really see the point in investing energy in it.
> > Also I am not that sure how a callback-oriented API fits into the xlog.c
> > workflow?
> 
> If the API doesn't fit the xlog.c workflow, I think that's a pretty good
> indication that the API is not good. The xlog.c workflow is basically:
> 
> while (!end-of-wal)
> {
>    record = ReadRecord();
>    redo(record);
> }
> 
> There's some pretty complicated logic within the bowels of ReadRecord
> (see XLogPageRead(), for example), but it seems to me those parts
> actually fit the your callback API pretty well. The biggest change is
> that the xlog-reader thingie should return to the caller after each
> record, instead of calling a callback for each read record and only
> returning at end-of-wal.
I did it that way at the start but it seemed to move too much knowledge to the 
callsites.
It's also something of an efficiency/complexity tradeoff: either you always 
reread the current page on entering XLogReader (because it could have changed 
if it was the last one at some point), which is noticeable performance-wise, 
or you make the contract more complex by requiring callers to notify the 
XLogReader that the end of valid data has changed, which doesn't seem to be a 
good idea because I think we will rather quickly grow more callsites.

> Or we could put the code to call the redo-function into the callback,
> but there's some also some logic there to exit early if you reach the
> desired recovery point, for example, so the callback API would need to
> be extended to allow such early exit. I think a return-after-each-record
> API would be better, though.
Adding an early exit would be easy.

> >> Can this be used for writing WAL, as well as reading? If so, what do you
> >> need the write support for?
> > 
> > It currently can replace records which are not interesting (e.g. index
> > changes in the case of logical rep). Filtered records are replaced with
> > XLOG_NOOP records with correct length currently. In future the actual
> > amount of data should really be reduced. I don't yet know how to
> > map LSNs of the uncompressed/compressed streams onto each other...
> > The filtered data is then passed to a writeout callback (in a streaming
> > fashion).
> > 
> > The whole writing-out part is pretty ugly at the moment and I just bolted
> > it on top because it was convenient for the moment. I am not yet sure how
> > the API for that should look....
>
> Can we just leave that out for now?
Not easily I think. I should probably get busy writing up a PoC for separating 
that.

Thanks,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 13 June 2012 19:28, Andres Freund <andres@2ndquadrant.com> wrote:

> This adds a new configuration parameter multimaster_node_id which determines
> the id used for wal originating in one cluster.

Looks good and it seems this aspect at least is commitable in this CF.

Design decisions I think we need to review are

* Naming of field. I think origin is the right term, borrowing from Slony.

* Can we add the origin_id dynamically to each WAL record? Probably no
need, but let's consider why and document that.

* Size of field. 16 bits is enough for 32,000 master nodes, which is
quite a lot. Do we need that many? I think we may have need for a few
flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
8 bits. Even if we don't need them in this release, I'd like to have
them. If they remain unused after a few releases, we may choose to
redeploy some of them as additional nodeids in future. I don't foresee
complaints that 256 master nodes is too few anytime soon, so we can
defer that decision.

* Do we want origin_id as a parameter or as a setting in pgcontrol?
IIRC we go to a lot of trouble elsewhere to avoid problems with
changing on/off parameter values. I think we need some discussion to
validate where that should live.

* Is there any overhead from CRC of WAL record because of this? I'd
guess not, but just want to double check thinking.

Presumably there is no issue wrt Heikki's WAL changes? I assume not,
but ask since I know you're reviewing that also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
Hi Simon,

On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote:
> On 13 June 2012 19:28, Andres Freund <andres@2ndquadrant.com> wrote:
> > This adds a new configuration parameter multimaster_node_id which
> > determines the id used for wal originating in one cluster.
> 
> Looks good and it seems this aspect at least is commitable in this CF.
I think we need to agree on the parameter name. It currently is 
'multimaster_node_id'. In the discussion with Steve we got to 
"replication_node_id". I don't particularly like either.

Other suggestions?

> Design decisions I think we need to review are
> 
> * Naming of field. I think origin is the right term, borrowing from Slony.
I think it fits as well.

> * Can we add the origin_id dynamically to each WAL record? Probably no
> need, but let's consider why and document that.
Not sure what you mean? It's already set in XLogInsert to 
current_replication_origin_id, which defaults to the value of the GUC?
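
(Roughly, inside XLogInsert(); rechdr stands in for the record header being 
assembled, and only current_replication_origin_id is a name from the patch:)

/* every record header gets stamped with the session's origin id,
 * which defaults to the GUC's value */
rechdr->xl_origin_id = current_replication_origin_id;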

> * Size of field. 16 bits is enough for 32,000 master nodes, which is
> quite a lot. Do we need that many? I think we may have need for a few
> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
> 8 bits. Even if we don't need them in this release, I'd like to have
> them. If they remain unused after a few releases, we may choose to
> redeploy some of them as additional nodeids in future. I don't foresee
> complaints that 256 master nodes is too few anytime soon, so we can
> defer that decision.
I wished we had some flag bits available before as well. I find 256 nodes a 
pretty low value to start with though; 4096 sounds better, so I would 
be happy with 4 flag bits. I think for cascading setups and such you want to 
add node ids for every node, not only masters...
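
(To illustrate the arithmetic, one possible split of the 16-bit field; these 
macros are a sketch, not from the patch:)

#define XLR_ORIGIN_FLAG_BITS    4
#define XLR_ORIGIN_ID_BITS      (16 - XLR_ORIGIN_FLAG_BITS) /* 12 */
#define XLR_MAX_ORIGIN_ID       ((1 << XLR_ORIGIN_ID_BITS) - 1)    /* 4095 */

#define XLogRecOriginId(xlr)    ((xlr)->xl_origin_id & XLR_MAX_ORIGIN_ID)
#define XLogRecOriginFlags(xlr) ((xlr)->xl_origin_id >> XLR_ORIGIN_ID_BITS)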

Any opinions from others on this?

> * Do we want origin_id as a parameter or as a setting in pgcontrol?
> IIRC we go to a lot of trouble elsewhere to avoid problems with
> changing on/off parameter values. I think we need some discussion to
> validate where that should live.
Hm. I don't really foresee any need to have it in pg_control. What do you want 
to protect against with that?
It would need to be changeable anyway, because otherwise it would need to 
become a parameter for initdb which would suck for anybody migrating to use 
replication at some point.

Do you want to protect against problems in replication setups after changing 
the value?

> * Is there any overhead from CRC of WAL record because of this? I'd
> guess not, but just want to double check thinking.
I cannot imagine that there is. The actual size of the record didn't change 
because of alignment padding (both on 32 and 64 bit systems). 

> Presumably there is no issue wrt Heikki's WAL changes? I assume not,
> but ask since I know you're reviewing that also.
It might clash minimally because of offset changes or such, but otherwise 
there shouldn't be much.

Thanks,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


On Thursday, June 14, 2012 03:50:28 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > This is locally defined in lots of places and would get introduced
> > frequently in the next commits. It is expected that this can be defined
> > in a header-only manner as soon as the XLogInsert scalability groundwork
> > from Heikki gets in.
> 
> This appears to be redundant with the existing InvalidXLogRecPtr,
> defined in access/transam.h.
I dropped the patch from the series in the git repo and replaced every usage 
with the version in transam.h

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Daniel Farina
Date:
On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:

>> * Size of field. 16 bits is enough for 32,000 master nodes, which is
>> quite a lot. Do we need that many? I think we may have need for a few
>> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
>> 8 bits. Even if we don't need them in this release, I'd like to have
>> them. If they remain unused after a few releases, we may choose to
>> redeploy some of them as additional nodeids in future. I don't foresee
>> complaints that 256 master nodes is too few anytime soon, so we can
>> defer that decision.
> I wished we had some flag bits available before as well. I find 256 nodes a
> pretty low value to start with though; 4096 sounds better, so I would
> be happy with 4 flag bits. I think for cascading setups and such you want to
> add node ids for every node, not only masters...
>
> Any opinions from others on this?

What's the cost of going a lot higher?  Because if one makes enough
numerical space available, one can assign node identities without a
coordinator, a massive decrease in complexity.
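
(To sketch what that could look like: with a large enough id space a node can 
just pick a random id at setup time, since the chance of any collision among 
n nodes is roughly n^2 / 2^(k+1) for k random bits. A toy example:)

#include <stdint.h>
#include <stdlib.h>

/* illustrative only: combine two 31-bit random() values into ~62 random
 * bits; a real system would use a proper entropy source */
static uint64_t
choose_node_id(void)
{
    return ((uint64_t) random() << 31) | (uint64_t) random();
}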

-- 
fdr


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
> On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> * Size of field. 16 bits is enough for 32,000 master nodes, which is
> >> quite a lot. Do we need that many? I think we may have need for a few
> >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
> >> 8 bits. Even if we don't need them in this release, I'd like to have
> >> them. If they remain unused after a few releases, we may choose to
> >> redeploy some of them as additional nodeids in future. I don't foresee
> >> complaints that 256 master nodes is too few anytime soon, so we can
> >> defer that decision.
> > 
> > I wished we had some flag bits available before as well. I find 256 nodes
> > a pretty low value to start with though; 4096 sounds better, so I
> > would be happy with 4 flag bits. I think for cascading setups and such
> > you want to add node ids for every node, not only masters...
> > 
> > Any opinions from others on this?
> 
> What's the cost of going a lot higher?  Because if one makes enough
> numerical space available, one can assign node identities without a
> coordinator, a massive decrease in complexity.
It would increase the size of every WAL record. We just have 16 bits left 
there by chance...

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Christopher Browne
Date:
On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi Simon,
>
> On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote:
>> On 13 June 2012 19:28, Andres Freund <andres@2ndquadrant.com> wrote:
>> > This adds a new configuration parameter multimaster_node_id which
>> > determines the id used for wal originating in one cluster.
>>
>> Looks good and it seems this aspect at least is commitable in this CF.
> I think we need to agree on the parameter name. It currently is
> 'multimaster_node_id'. In the discussion with Steve we got to
> "replication_node_id". I don't particularly like either.
>
> Other suggestions?

I wonder if it should be origin_node_id?  That is the term Slony uses.

>> Design decisions I think we need to review are
>>
>> * Naming of field. I think origin is the right term, borrowing from Slony.
> I think it fits as well.
>
>> * Can we add the origin_id dynamically to each WAL record? Probably no
>> need, but lets consider why and document that.
> Not sure what you mean? It's already set in XLogInsert to
> current_replication_origin_id, which defaults to the value of the GUC?
>
>> * Size of field. 16 bits is enough for 32,000 master nodes, which is
>> quite a lot. Do we need that many? I think we may have need for a few
>> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
>> 8 bits. Even if we don't need them in this release, I'd like to have
>> them. If they remain unused after a few releases, we may choose to
>> redeploy some of them as additional nodeids in future. I don't foresee
>> complaints that 256 master nodes is too few anytime soon, so we can
>> defer that decision.
> I have wished we had some flag bits available before as well. I find 256
> nodes a pretty low value to start with, though; 4096 sounds better, so I
> would be happy with 4 flag bits. I think for cascading setups and such you
> want to add node ids for every node, not only masters...

Even though the number of nodes that can reasonably participate in
replication is likely to be not too terribly large, it might be good
to allow larger values, in case someone is keen on encoding something
descriptive in the node number.

If you restrict the number to a tiny range, then you'll be left
wanting some other mapping.  At one point, I did some work trying to
get a notion of named nodes implemented in Slony; gave up on it, as
the coordination process was wildly too bug-prone.

In our environment, at Afilias, we have used quasi-symbolic node
numbers that encoded something somewhat meaningful about the
environment.  That seems better to me than the risky "kludge" of
saying:
- The first node I created is node #1
- The second one is node #2.
- The third and fourth are #3 and #4
- I dropped node #2 due to a problem, and thus the "new node 2" is #5.

That numbering scheme gets pretty anti-intuitive fairly quickly, from
whence we took the approach of having a couple digits indicating data
centre followed by a digit indicating which node in that data centre.

If that all sounds incoherent, well, the more nodes you have around,
the more difficult it becomes to make sure you *do* have a coherent
picture of your cluster.

I recall the Slony-II project having a notion of attaching a permanent
UUID-based node ID to each node.  As long as there is somewhere decent
to find a symbolically significant node "name," I like the idea of the
ID *not* being in a tiny range, and being UUID/OID-like...

> Any opinions from others on this?
>
>> * Do we want origin_id as a parameter or as a setting in pgcontrol?
>> IIRC we go to a lot of trouble elsewhere to avoid problems with
>> changing on/off parameter values. I think we need some discussion to
>> validate where that should live.
> Hm. I don't really foresee any need to have it in pg_control. What do you
> want to protect against with that?
> It would need to be changeable anyway, because otherwise it would need to
> become a parameter for initdb, which would suck for anybody migrating to
> use replication at some point.
>
> Do you want to protect against problems in replication setups after changing
> the value?

In Slony, changing the node ID is Not Something That Is Done.  The ID
is captured in *way* too many places to be able to have any hope of
updating it in a coordinated way.  I should be surprised if it wasn't
similarly troublesome here.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Robert Haas
Date:
On Sat, Jun 16, 2012 at 3:03 PM, Steve Singer <steve@ssinger.info> wrote:
> I feel that in-core support to capture changes and turn them into change
> records that can be replayed on other databases, without relying on triggers
> and log tables, would be good to have.
>
> I think we want something flexible enough that people can write consumers of
> the LCRs to do conflict resolution for multi-master, but I am not sure that
> the conflict resolution support actually belongs in core.

I agree, on both counts.  Anyone else want to chime in here?

> Most of the complexity of slony (both in terms of lines of code, and issues
> people encounter using it) comes not from the log triggers or replay of the
> logged data but comes from the configuration of the cluster.
> Controlling things like
>
> * Which tables replicate from a node to which other nodes
> * How do you change the cluster configuration on a running system (adding
> nodes, removing nodes, moving the origin of a table, adding tables to
> replication etc...)

Not being as familiar with Slony as I probably ought to be, I hadn't
given this much thought, but it's an interesting point.  The number of
logical replication policies that someone might want to implement, and
the ways in which they might want to change them as the situation
develops, is very large.  Whole cluster, whole database, one or
several schemas, individual tables, perhaps even more fine-grained
than per-table.  Trying to figure all of that out is going to require
a lot of work and, frankly, I question the value of having that stuff
in core anyway.

> I see three catalogs in play here.
> 1. The catalog on the origin
> 2. The catalog on the proxy system (this is the catalog used to translate
> the WAL records to LCR's).  The proxy system will need essentially the same
> pgsql binaries (same architecture, important compile flags etc..) as the
> origin
> 3. The catalog on the destination system(s).
>
> The catalog 2 must be in sync with catalog 1, catalog 3 shouldn't need to be
> in-sync with catalog 1.   I think catalogs 2 and 3 are combined in the
> current patch set (though I haven't yet looked at the code closely).   I
> think the performance optimizations Andres has implemented to update tuples
> through low-level functions should be left for later and that we should  be
> generating SQL in the apply cache so we don't start assuming much about
> catalog 3.

+1.  Although there is a lot of performance benefit to be had there,
it seems better to me to get the basics working and then do
performance optimization later.  That is, if we can detect that the
catalogs are in sync, then by all means ship around the binary tuple
to make things faster.  But requiring that (without having any way to
know whether it actually holds) strikes me as a mess.

> Part of what people expect from a robust in-core solution is that it should
> work with the other in-core features.  If we have to list a bunch of
> in-core types as being incompatible with logical replication then people will
> look at logical replication with the same 'there be dragons here' attitude
> that scares many people away from the existing third party replication
> solutions.   Non-core or third party user defined types are a slightly
> different matter because we can't control what they do.

I agree, although I don't think either Andres or I are saying anything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Robert Haas
Date:
On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Hm. Yes, you could do that. But I have to say I don't really see a point.
>> > Maybe the fact that I do envision multimaster systems at some point is
>> > clouding my judgement though, as it's far less easy in that case.
>> Why?  I don't think that particularly changes anything.
> Because it makes conflict detection very hard. I also don't think it's a
> feature worth supporting. What's the use-case of updating records you cannot
> properly identify?

Don't ask me; I just work here.  I think it's something that some
people want, though.  I mean, if you don't support replicating a table
without a primary key, then you can't even run pgbench in a
replication environment.  Is that an important workload?  Well,
objectively, no.  But I guarantee you that other people with more
realistic workloads than that will complain if we don't have it.
Absolutely required on day one?  Probably not.  Completely useless
appendage that no one wants?  Not that, either.

>> In my view, a logical replication solution is precisely one in which
>> the catalogs don't need to be in sync.  If the catalogs have to be in
>> sync, it's not logical replication.  ISTM that what you're talking
>> about is sort of a hybrid between physical replication (pages) and
>> logical replication (tuples) - you want to ship around raw binary
>> tuple data, but not entire pages.
> Ok, thats a valid point. Simon argued at the cluster summit that everything
> thats not physical is logical. Which has some appeal because it seems hard to
> agree what exactly logical rep is. So definition by exclusion makes kind of
> sense ;)

Well, the words are fuzzy, but I would define logical replication to
be something which is independent of the binary format in which stuff
gets stored on disk.  If it's not independent of the disk format, then
you can't do heterogeneous replication (between versions, or between
products).  That precise limitation is the main thing that drives
people to use anything other than SR in the first place, IME.

> I think what you categorized as "hybrid logical/physical" rep solves an
> important use-case that's very hard to solve at the moment. Before my
> 2ndquadrant days I had several clients which had huge problems using the
> trigger based solutions because their overhead simply was too big a burden
> on the master. They couldn't use SR either because every consuming database
> kept loads of local data.
> I think such scenarios are getting more and more common.

I think this is to some extent true, but I also think you're
conflating two different things.  Change extraction via triggers
introduces overhead that can be eliminated by reconstructing tuples
from WAL in the background rather than forcing them to be inserted
into a shadow table (and re-WAL-logged!) in the foreground.  I will
grant that shipping the tuple as a binary blob rather than as text
eliminates additional overhead on both ends, but it also closes off a
lot of important use cases.  As I noted in my previous email, I think
that ought to be a performance optimization that we do, if at all,
when it's been proven safe, not a baked-in part of the design.  Even a
solution that decodes WAL to text tuples and ships those around and
reinserts them via pure SQL should be significantly faster than the
replication solutions we have today; if it isn't, something's wrong.

>> The problem with that is it's going to be tough to make robust.  Users could
>> easily end up with answers that are total nonsense, or probably even crash
>> the server.
> Why?

Because the routines that decode tuples don't include enough sanity
checks to prevent running off the end of the block, or even the end of
memory completely.  Consider a corrupt TOAST pointer that indicates
that there is a gigabyte of data stored in an 8kB block.  One of the
common symptoms of corruption IME is TOAST requests for -3 bytes of
memory.

And, of course, even if you could avoid crashing, interpreting what
was originally intended as a series of int4s as a varlena isn't likely
to produce anything terribly meaningful.  Tuple data isn't
self-identifying; that's why this is such a hard problem.
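
For illustration, the sort of defensive check the decoding path would need
before trusting an on-disk datum - a minimal sketch, not code from the patch,
using the usual varatt_external layout from tuptoaster.h:

    /*
     * Minimal sketch of a sanity check before detoasting during decoding.
     * A corrupt external TOAST pointer can claim an absurd raw size;
     * bounding it avoids the runaway requests described above.
     */
    static void
    sanity_check_toast_pointer(struct varlena *attr)
    {
        if (VARATT_IS_EXTERNAL(attr))
        {
            struct varatt_external toast_ptr;

            VARATT_EXTERNAL_GET_POINTER(toast_ptr, attr);

            /* catches the "-3 bytes" and "1GB in an 8kB block" symptoms */
            if (toast_ptr.va_rawsize < (int32) VARHDRSZ ||
                toast_ptr.va_rawsize > (int32) MaxAllocSize)
                elog(ERROR, "corrupt TOAST pointer: claimed rawsize %d",
                     toast_ptr.va_rawsize);
        }
    }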

>> To step back and talk about DDL more generally, you've mentioned a few
>> times the idea of using an SR instance that has been filtered down to
>> just the system catalogs as a means of generating logical change
>> records.  However, as things stand today, there's no reason to suppose
>> that replicating anything less than the entire cluster is sufficient.
>> For example, you can't translate enum labels to strings without access
>> to the pg_enum catalog, which would be there, because enums are
>> built-in types.  But someone could supply a similar user-defined type
>> that uses a user-defined table to do those lookups, and now you've got
>> a problem.  I think this is a contractual problem, not a technical
>> one.  From the point of view of logical replication, it would be nice
>> if type output functions were basically guaranteed to look at nothing
>> but the datum they get passed as an argument, or at the very least
>> nothing other than the system catalogs, but there is no such
>> guarantee.  And, without such a guarantee, I don't believe that we can
>> create a high-performance, robust, in-core replication solution.
> I don't think that's a valid argument. Any such solution existing today
> fails to work properly with dump/restore and such because it implies
> dependencies that they do not know about. The "internal" tables will
> possibly be restored later than the tables using them and such. So your
> data format *has* to deal with loading/outputting data without such anyway.

Do you know for certain that PostGIS doesn't do anything of this type?
Or what about something like an SE-Linux label cache, where we might
arrange to create labels as they are used and associate them with
integer tags?

> And yes, you obviously can implement it without needing a side-table for
> output. Just as a string which is checked during input.

That misses the point - if people wanted labels represented by a
string rather than an integer, they would have just used a string and
stuffed a check or foreign key constraint in there.

> You could reduce the space overhead by adding that information only the
> first time after a table has changed (and then regularly after a checkpoint
> or so) but doing so seems to be introducing too much complexity.

Well, I dunno: it is complicated, but I'm worried that the design
you've got is awfully complicated, too.  Requiring an extra PG
instance with a very specific configuration that furthermore uses an
untested WAL-filtering methodology that excludes everything but the
system catalogs seems like an administrative nightmare, and I remain
unconvinced that it is safe.  In fact, I have a strong feeling that it
isn't safe, but if you're not convinced by the argument already laid
out then I'm not sure I can convince you of it right this minute.
What happens if you have a crash on the WAL generation machine?
You'll have to rewind to the most recent restartpoint, and you can't
use the catalogs until you've reached the minimum recovery point.  Is
that going to mess you up?

>> And then maybe we handle poorly-behaved types by pushing some of the
>> work into the foreground task that's generating the WAL: in the worst
>> case, the process logs a record before each insert/update/delete
>> containing the text representation of any values that are going to be
>> hard to decode.  In some cases (e.g. records all of whose constituent
>> fields are well-behaved types) we could instead log enough additional
>> information about the type to permit blind decoding.
> I think this is prohibitively expensive from a development, runtime, space and
> maintenance standpoint.
> For databases using types where decoding is rather expensive (e.g. postgis)
> you wouldn't really improve much over the old trigger based solutions. It's
> a return to "log everything twice".

Well, if the PostGIS types are poorly behaved under the definition I
proposed, that implies they won't work at all under your scheme.  I
think putting a replication solution into core that won't support
PostGIS is dead on arrival.  If they're well-behaved, then there's no
double-logging.

> Sorry if I seem pigheaded here, but I fail to see why all that would buy us
> anything but loads of complexity while losing many potential advantages.

The current system that you are proposing is very complex and has a
number of holes at present, some of which you've already mentioned in
previous emails.  There's a lot of advantage in picking a design that
allows you to put together a working prototype relatively quickly, but
I have a sinking feeling that your chosen design is going to be very
hard to bullet-proof and not very user-friendly.  If we could find a
way to run it all inside a single server I think we would be way ahead
on both fronts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Steve Singer
Date:
On 12-06-18 07:30 AM, Andres Freund wrote:
>
> Hrmpf #666. I will go through the series commit-by-commit again to
> make sure everything compiles again. Reordering this late definitely wasn't a
> good idea...
>
> I pushed a rebased version with all those fixups (and removal of the
> zeroRecPtr patch).

Where did you push that rebased version to? I don't see an attachment, 
or an updated patch in the commitfest app and your repo at 
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary 
hasn't been updated in 5 days.




Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Steve Singer
Date:
On 12-06-18 11:50 AM, Andres Freund wrote:
> Hi Simon,

> I think we need to agree on the parameter name. It currently is
> 'multimaster_node_id'. In the discussion with Steve we got to
> "replication_node_id". I don't particularly like either.
>
> Other suggestions?
>
Other things that come to mind (for naming this parameter in the 
postgresql.conf)

node_id
origin_node_id
local_node_id
> I have wished we had some flag bits available before as well. I find 256
> nodes a pretty low value to start with, though; 4096 sounds better, so I
> would be happy with 4 flag bits. I think for cascading setups and such you
> want to add node ids for every node, not only masters...
>
> Any opinions from others on this?
>

256 sounds a bit low to me as well.  Sometimes the use case of a retail 
chain comes up where people want each store to have a postgresql 
instance and replicate back to a central office.  I can think of many 
chains with more than 256 stores.



> Thanks,
>
> Andres



Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
>> What's the cost of going a lot higher?  Because if one makes enough
>> numerical space available, one can assign node identities without a
>> coordinator, a massive decrease in complexity.

> It would increase the size of every WAL record. We just have 16 bits left
> there by chance...

"Every WAL record"?  Why in heck would you attach it to every record?
Surely putting it in WAL page headers would be sufficient.  We could
easily afford to burn a page switch (if not a whole segment switch)
when changing masters.
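
For illustration, a sketch of what that variant could look like; the existing
fields follow access/xlog_internal.h, the added field is hypothetical:

    /*
     * Sketch of the per-page variant: stamp the origin once per WAL page.
     * xlp_origin_id is hypothetical; changing masters would then force at
     * least a page switch so one page never mixes origins.
     */
    typedef struct XLogPageHeaderData
    {
        uint16      xlp_magic;      /* magic value for correctness checks */
        uint16      xlp_info;       /* flag bits */
        TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
        XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
        uint16      xlp_origin_id;  /* hypothetical: origin of this page */
    } XLogPageHeaderData;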

I'm against the idea of eating any spare space we have in WAL record
headers for this purpose, anyway; there are likely to be more pressing
needs in future.
        regards, tom lane


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Tuesday, June 19, 2012 12:47:54 AM Christopher Browne wrote:
> On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Hi Simon,
> > 
> > On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote:
> >> On 13 June 2012 19:28, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > This adds a new configuration parameter multimaster_node_id which
> >> > determines the id used for wal originating in one cluster.
> >> 
> >> Looks good and it seems this aspect at least is commitable in this CF.
> > 
> > I think we need to agree on the parameter name. It currently is
> > 'multimaster_node_id'. In the discussion with Steve we got to
> > "replication_node_id". I don't particularly like either.
> > 
> > Other suggestions?
> I wonder if it should be origin_node_id?  That is the term Slony uses.
I think in the Slony context it's clear that that's related to replication;
less so in a general postgres GUC. So maybe replication_origin_node_id?

> >> * Size of field. 16 bits is enough for 32,000 master nodes, which is
> >> quite a lot. Do we need that many? I think we may have need for a few
> >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
> >> 8 bits. Even if we don't need them in this release, I'd like to have
> >> them. If they remain unused after a few releases, we may choose to
> >> redeploy some of them as additional nodeids in future. I don't foresee
> >> complaints that 256 master nodes is too few anytime soon, so we can
> >> defer that decision.
> > 
> > I have wished we had some flag bits available before as well. I find 256
> > nodes a pretty low value to start with, though; 4096 sounds better, so I
> > would be happy with 4 flag bits. I think for cascading setups and such
> > you want to add node ids for every node, not only masters...
> Even though the number of nodes that can reasonably participate in
> replication is likely to be not too terribly large, it might be good
> to allow larger values, in case someone is keen on encoding something
> descriptive in the node number.
Well, having a large number space makes the WAL records bigger, which is not
something I can see us getting through with. People have gone to considerable
lengths to avoid that.
Perhaps we can have a mapping system catalog at some point that attaches
additional information to each node id, like a name and where it is wrt
replication...

> I recall the Slony-II project having a notion of attaching a permanent
> UUID-based node ID to each node.  As long as there is somewhere decent
> to find a symbolically significant node "name," I like the idea of the
> ID *not* being in a tiny range, and being UUID/OID-like...
I think adding 14 bytes (the 16 bytes of a UUID minus the 2 bytes available)
is out of the question...

> > Any opinions from others on this?
> > 
> >> * Do we want origin_id as a parameter or as a setting in pgcontrol?
> >> IIRC we go to a lot of trouble elsewhere to avoid problems with
> >> changing on/off parameter values. I think we need some discussion to
> >> validate where that should live.
> > 
> > Hm. I don't really foresee any need to have it in pg_control. What do you
> > want to protect against with that?
> > It would need to be changeable anyway, because otherwise it would need to
> > become a parameter for initdb, which would suck for anybody migrating to
> > use replication at some point.
> > 
> > Do you want to protect against problems in replication setups after
> > changing the value?
> In Slony, changing the node ID is Not Something That Is Done.  The ID
> is captured in *way* too many places to be able to have any hope of
> updating it in a coordinated way.  I should be surprised if it wasn't
> similarly troublesome here.
If you update it you will need to reset consuming nodes, yes. Imo that's
still something else than disallowing changing the parameter entirely.
Requiring initdb for that seems like it would make experimentation too hard.

We need to allow at least changing the setting from no node id to an initial 
one.

Greetings,

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
Hi,

On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
> >> What's the cost of going a lot higher?  Because if one makes enough
> >> numerical space available, one can assign node identities without a
> >> coordinator, a massive decrease in complexity.
> > 
> > It would increase the size of every WAL record. We just have 16 bits left
> > there by chance...
> 
> "Every WAL record"?  Why in heck would you attach it to every record?
> Surely putting it in WAL page headers would be sufficient.  We could
> easily afford to burn a page switch (if not a whole segment switch)
> when changing masters.
The idea is that you can have cascading, circular and whatever other
replication topologies if you include the "logical origin" of the action
causing a WAL record in it.
That is, if you have nodes A(1) and B(2) and an insert happens on A, the WAL
records generated by that will get xl_origin_id = 1, and when they are
decoded and replayed on B they will *also* get the id 1. Only when a change
is originally generated on B will it get xl_origin_id = 2.
That way you can easily have circular or hierarchical replication topologies,
including diamonds.
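
To make that concrete, a minimal sketch of the apply side:
current_replication_origin_id is what XLogInsert consults in the patch;
the helper and its arguments are invented here for illustration.

    /*
     * Sketch: an apply process replaying a change that originated on
     * another node sets the origin before doing the writes, so the WAL it
     * generates is stamped with the *original* node id, not the local one.
     */
    extern uint16 current_replication_origin_id;

    static void
    apply_remote_change(uint16 origin_id, uint16 local_id,
                        void (*apply_fn) (void *arg), void *change)
    {
        /* stamp WAL written while replaying with the original node's id */
        current_replication_origin_id = origin_id;  /* e.g. 1 for node A */

        apply_fn(change);   /* insert/update/delete; its WAL carries origin_id */

        /* back to tagging locally originated changes */
        current_replication_origin_id = local_id;   /* e.g. 2 for node B */
    }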

> I'm against the idea of eating any spare space we have in WAL record
> headers for this purpose, anyway; there are likely to be more pressing
> needs in future.
Every other solution to allowing this seems to be far more complicated than
this; that's why I arrived at the conclusion that it's a good idea.

Greetings,

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
> On 12-06-18 07:30 AM, Andres Freund wrote:
> > Hrmpf #666. I will go through the series commit-by-commit again
> > to make sure everything compiles again. Reordering this late definitely
> > wasn't a good idea...
> > 
> > I pushed a rebased version with all those fixups (and removal of the
> > zeroRecPtr patch).
> 
> Where did you push that rebased version to? I don't see an attachment,
> or an updated patch in the commitfest app and your repo at
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
> hasn't been updated in 5 days.
To the 2ndquadrant internal repo. Which strangely doesn't help you. 
*Headdesk*. Pushed to the correct repo and manually verified.

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
>> "Every WAL record"?  Why in heck would you attach it to every record?
>> Surely putting it in WAL page headers would be sufficient.

> The idea is that you can have cascading, circular and whatever other
> replication topologies if you include the "logical origin" of the action
> causing a WAL record in it.
> That is, if you have nodes A(1) and B(2) and an insert happens on A, the WAL
> records generated by that will get xl_origin_id = 1, and when they are
> decoded and replayed on B they will *also* get the id 1. Only when a change
> is originally generated on B will it get xl_origin_id = 2.

None of this explains to me why each individual WAL record would need to
carry a separate copy of the indicator.  We don't have multiple masters
putting records into the same WAL file, nor does it seem to me that it
could possibly be workable to merge WAL streams.  (If you are thinking
of something sufficiently high-level that merging could possibly work,
then it's not WAL, and we shouldn't be trying to make the WAL
representation cater for it.)
        regards, tom lane


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
> >> "Every WAL record"?  Why in heck would you attach it to every record?
> >> Surely putting it in WAL page headers would be sufficient.
> > 
> > The idea is that you can have cascading, circular and whatever other
> > replication topologies if you include the "logical origin" of the action
> > causing a WAL record in it.
> > That is, if you have nodes A(1) and B(2) and an insert happens on A, the
> > WAL records generated by that will get xl_origin_id = 1, and when they
> > are decoded and replayed on B they will *also* get the id 1. Only when a
> > change is originally generated on B will it get xl_origin_id = 2.
> 
> None of this explains to me why each individual WAL record would need to
> carry a separate copy of the indicator.  We don't have multiple masters
> putting records into the same WAL file, nor does it seem to me that it
> could possibly be workable to merge WAL streams.  (If you are thinking
> of something sufficiently high-level that merging could possibly work,
> then it's not WAL, and we shouldn't be trying to make the WAL
> representation cater for it.)
The idea is that if you're replaying changes on node A originating from node
B, you set the origin to *B* in the WAL records that are generated during
that. So when B, in a bidirectional setup, replays the changes that A has
made, it can simply ignore all changes which originated on itself.

That works rather simply and performs well if you have a conflict avoidance
scheme.

For many scenarios you need to be able to separate locally generated changes
from changes that have been replayed from another node. Without support for
something like this that is really hard to achieve.
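
As a minimal sketch of the consumer loop on B - only xl_origin_id is from the
patch; WalStream and the helpers are invented here:

    /*
     * Sketch: skip records whose origin is the local node, apply
     * everything else.
     */
    static void
    replay_from_peer(WalStream *stream, uint16 local_node_id)
    {
        XLogRecord *record;

        while ((record = read_next_record(stream)) != NULL)
        {
            if (record->xl_origin_id == local_node_id)
                continue;   /* our own change coming back around the loop */

            decode_and_apply(record);
        }
    }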

Greetings,

Andres

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
>> ...  (If you are thinking
>> of something sufficiently high-level that merging could possibly work,
>> then it's not WAL, and we shouldn't be trying to make the WAL
>> representation cater for it.)

> The idea is that if you're replaying changes on node A originating from
> node B, you set the origin to *B* in the WAL records that are generated
> during that. So when B, in a bidirectional setup, replays the changes that
> A has made, it can simply ignore all changes which originated on itself.

This is most certainly not possible at the level of WAL.  As I said
above, we shouldn't be trying to shoehorn high level logical-replication
commands into WAL streams.  No good can come of confusing those concepts.
        regards, tom lane


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
> >> ...  (If you are thinking
> >> of something sufficiently high-level that merging could possibly work,
> >> then it's not WAL, and we shouldn't be trying to make the WAL
> >> representation cater for it.)
> > 
> > The idea is that if you're replaying changes on node A originating from
> > node B, you set the origin to *B* in the WAL records that are generated
> > during that. So when B, in a bidirectional setup, replays the changes
> > that A has made, it can simply ignore all changes which originated on
> > itself.
> 
> This is most certainly not possible at the level of WAL. 
Huh? This isn't used during normal crash-recovery replay. The information is
used when decoding the WAL into logical changes and applying those. It's just
a common piece of information that's needed for a large number of records.

Alternatively it could be added to all the records that need it, but that
would smear the necessary logic - which is currently trivial - over more of
the backend. And it would increase the actual size of WAL, which this
approach did not.

> As I said above, we shouldn't be trying to shoehorn high level logical-
> replication commands into WAL streams.  No good can come of confusing those
> concepts.
It's not doing anything high-level in there? All that patch does is embed one
single piece of information in previously unused space.

I can follow the argument that you do not want *any* logical information in
the WAL. But as I said in the patchset-introducing email: I don't really see
an alternative. Otherwise we would just duplicate all the locking/scalability
issues of xlog as well as the amount of writes.
This - besides logging some more information when wal_level = logical in some
particular records (HEAP_UPDATE|DELETE), and ensuring fpw's don't remove the
record data in HEAP_(INSERT|UPDATE|DELETE) in patch 07/16 - is the only
change that I really foresee being needed for doing the logical stuff.

Do you really see this as such a big problem?

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
>>> ...  (If you are thinking
>>> of something sufficiently high-level that merging could possibly work,
>>> then it's not WAL, and we shouldn't be trying to make the WAL
>>> representation cater for it.)

> Do you really see this as such a big problem?

It looks suspiciously like "I have a hammer, therefore every problem
must be a nail".  I don't like the design concept of cramming logical
replication records into WAL in the first place.

However, if we're dead set on doing it that way, let us put information
that is only relevant to logical replication records into only the
logical replication records.  Saving a couple bytes in each such record
is penny-wise and pound-foolish, I'm afraid; especially when you're
nailing down hard, unexpansible limits at the very beginning of the
development process in order to save those bytes.
        regards, tom lane


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
Hi,

On Tuesday, June 19, 2012 06:11:20 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
> >>> ...  (If you are thinking
> >>> of something sufficiently high-level that merging could possibly work,
> >>> then it's not WAL, and we shouldn't be trying to make the WAL
> >>> representation cater for it.)
> > 
> > Do you really see this as such a big problem?
> It looks suspiciously like "I have a hammer, therefore every problem
> must be a nail".  I don't like the design concept of cramming logical
> replication records into WAL in the first place.
There are - so far - no specific "logical replication records". It's a
relatively minor amount of additional data under wal_level=logical for
existing records. HEAP_UPDATE gets the old primary key on updates changing
the pkey, and HEAP_DELETE always has the pkey. HEAP_INSERT|UPDATE|DELETE and
HEAP2_MULTI_INSERT put their information in another XLogRecData block than
the page, to handle full page writes. That's it.
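
As a rough sketch of the full-page-write part, modelled loosely on
heap_insert()'s rdata chain - simplified, patch 07/16 has the real change:

    /*
     * The tuple payload goes into an XLogRecData block that is not
     * attached to the buffer, so a full-page image cannot replace it and
     * decoding can always read the tuple back out of the record.
     */
    XLogRecData rdata[2];
    xl_heap_insert xlrec;               /* filled in as usual */

    rdata[0].data = (char *) &xlrec;
    rdata[0].len = SizeOfHeapInsert;
    rdata[0].buffer = InvalidBuffer;
    rdata[0].next = &rdata[1];

    rdata[1].data = (char *) heaptup->t_data;   /* heaptup as in heap_insert() */
    rdata[1].len = heaptup->t_len;
    rdata[1].buffer = InvalidBuffer;    /* not buffer-attached: an fpw cannot
                                         * elide this block */
    rdata[1].next = NULL;

    (void) XLogInsert(RM_HEAP_ID, XLOG_HEAP_INSERT, rdata);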

I can definitely understand hesitation about that, but I simply see no 
realistic way to solve the issues of existing replication solutions otherwise.
Do you have a better idea to solve those than the above? Without significant 
complications of the backend code and without loads of additional writes going 
on?
I *really* would like to hear them if you do.

> However, if we're dead set on doing it that way, let us put information
> that is only relevant to logical replication records into only the
> logical replication records. 
I found, and still find, the idea of having the origin_id in there rather
elegant. If people prefer adding the same block to all of the above xlog
records: I can live with that and will then do so. It makes some things more
complicated, but it's not too bad.

Greetings,

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
>>>> ...  (If you are thinking
>>>> of something sufficiently high-level that merging could possibly work,
>>>> then it's not WAL, and we shouldn't be trying to make the WAL
>>>> representation cater for it.)
>
>> Do you really see this as such a big problem?
>
> It looks suspiciously like "I have a hammer, therefore every problem
> must be a nail".  I don't like the design concept of cramming logical
> replication records into WAL in the first place.

Me, neither.  I think it's necessary to try to find a way of
generating logical replication records from WAL.  But once generated,
I think those records should form their own stream, independent of
WAL.  If you take the contrary position that they should be included
in WAL, then when you filter the WAL stream down to just the records
of interest to logical replication, you end up with a WAL stream with
holes in it, which is one of the things that Andres listed as an
unresolved design problem in his original email.

Moreover, this isn't necessary at all for single-master replication,
or even multi-source replication where each table has a single master.
It's only necessary for full multi-master replication, which we have
no consensus to include in core, and even if we did have a consensus
to include it in core, it certainly shouldn't be the first feature we
design.

> However, if we're dead set on doing it that way, let us put information
> that is only relevant to logical replication records into only the
> logical replication records.

Right.  If we decide we need this, and if we did decide to conflate
the WAL stream, both of which I disagree with as noted above, then we
still don't need it on every record.  It would probably be sufficient
for local transactions to do nothing at all (and we can implicitly
assume that they have master node ID = local node ID) and transactions
which are replaying remote changes to emit one record per XID per
checkpoint cycle containing the remote node ID.
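
A minimal sketch of what that could look like; the record type, rmgr usage
and helpers are all invented here for illustration:

    /*
     * One tiny record per replayed transaction per checkpoint cycle,
     * instead of 2 bytes on every record.
     */
    typedef struct xl_origin_assignment
    {
        TransactionId xid;          /* local xact applying the remote change */
        uint16        origin_id;    /* node the change came from */
    } xl_origin_assignment;

    static void
    maybe_log_origin(TransactionId xid, uint16 origin_id)
    {
        xl_origin_assignment xlrec;
        XLogRecData rdata;

        if (origin_logged_this_checkpoint(xid))
            return;

        xlrec.xid = xid;
        xlrec.origin_id = origin_id;

        rdata.data = (char *) &xlrec;
        rdata.len = sizeof(xlrec);
        rdata.buffer = InvalidBuffer;
        rdata.next = NULL;

        (void) XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);   /* stand-ins */
        remember_origin_logged(xid);
    }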

> Saving a couple bytes in each such record
> is penny-wise and pound-foolish, I'm afraid; especially when you're
> nailing down hard, unexpansible limits at the very beginning of the
> development process in order to save those bytes.

I completely agree.  I think that, as Dan said upthread, having a 64
or 128 bit ID so that it can be generated automatically rather than
configured by an administrator who must be careful not to duplicate
node IDs in any pair of systems that could ever end up talking to each
other would be a vast usability improvement.  Perhaps systems A, B,
and C are replicating to each other today, as are systems D and E.
But now suppose that someone decides they want to replicate one table
between A and D.  Suddenly the node IDs have to be distinct where they
didn't before, and now there's potentially a problem to hassle with
that wouldn't have been an issue if the node IDs had been wide enough
to begin with.  It is not unusual for people to decide after-the-fact
to begin replicating between machines where this wasn't originally
anticipated, and which may even be under different administrative control.
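
For illustration, a minimal sketch of coordinator-free ID assignment with a
wide field (the random source is arbitrary):

    /*
     * With a wide id space a node can pick its identity at setup time
     * without any coordinator; collisions are vanishingly unlikely.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint64_t
    choose_node_id(void)
    {
        uint64_t id;
        FILE   *f = fopen("/dev/urandom", "rb");

        if (f == NULL || fread(&id, sizeof(id), 1, f) != 1)
        {
            perror("could not generate node id");
            exit(EXIT_FAILURE);
        }
        fclose(f);
        return id;
    }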

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
Hi,

The most important part, even for people not following my discussion with
Robert, is at the bottom, where the possible WAL decoding strategies are laid
out.

On Tuesday, June 19, 2012 03:20:58 AM Robert Haas wrote:
> On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > Hm. Yes, you could do that. But I have to say I don't really see a
> >> > point. Maybe the fact that I do envision multimaster systems at some
> >> > point is clouding my judgement though, as it's far less easy in that
> >> > case.
> >> 
> >> Why?  I don't think that particularly changes anything.
> > 
> > Because it makes conflict detection very hard. I also don't think it's a
> > feature worth supporting. What's the use-case of updating records you
> > cannot properly identify?
> Don't ask me; I just work here.  I think it's something that some
> people want, though.  I mean, if you don't support replicating a table
> without a primary key, then you can't even run pgbench in a
> replication environment.
Well, I have no problem with INSERT-only tables not having a PK. And
pgbench_history is the only pgbench table that doesn't have a pkey? And
that's only truncated...

> Is that an important workload?  Well,
> objectively, no.  But I guarantee you that other people with more
> realistic workloads than that will complain if we don't have it.
> Absolutely required on day one?  Probably not.  Completely useless
> appendage that no one wants?  Not that, either.
Maybe. And I don't really care, so if others see that as important I am happy 
to appease them ;). 

> >> In my view, a logical replication solution is precisely one in which
> >> the catalogs don't need to be in sync.  If the catalogs have to be in
> >> sync, it's not logical replication.  ISTM that what you're talking
> >> about is sort of a hybrid between physical replication (pages) and
> >> logical replication (tuples) - you want to ship around raw binary
> >> tuple data, but not entire pages.
> > Ok, that's a valid point. Simon argued at the cluster summit that
> > everything that's not physical is logical. Which has some appeal because
> > it seems hard to agree what exactly logical rep is. So definition by
> > exclusion makes kind of sense ;)
> Well, the words are fuzzy, but I would define logical replication to
> be something which is independent of the binary format in which stuff
> gets stored on disk.  If it's not independent of the disk format, then
> you can't do heterogeneous replication (between versions, or between
> products).  That precise limitation is the main thing that drives
> people to use anything other than SR in the first place, IME.
Not in mine. The main limitation I see is that you cannot write anything on
the standby. Which sucks majorly for many things. It's pretty much impossible
to "fix" that for SR outside of very limited cases.
While many scenarios don't need multimaster, *many* need to write outside of
the standby's replication set.

> > I think what you categorized as "hybrid logical/physical" rep solves an
> > important use-case that's very hard to solve at the moment. Before my
> > 2ndquadrant days I had several clients which had huge problems using the
> > trigger based solutions because their overhead simply was too big a
> > burden on the master. They couldn't use SR either because every
> > consuming database kept loads of local data.
> > I think such scenarios are getting more and more common.

> I think this is to some extent true, but I also think you're
> conflating two different things.  Change extraction via triggers
> introduces overhead that can be eliminated by reconstructing tuples
> from WAL in the background rather than forcing them to be inserted
> into a shadow table (and re-WAL-logged!) in the foreground.  I will
> grant that shipping the tuple as a binary blob rather than as text
> eliminates additional overhead on both ends, but it also closes off a
> lot of important use cases.  As I noted in my previous email, I think
> that ought to be a performance optimization that we do, if at all,
> when it's been proven safe, not a baked-in part of the design.  Even a
> solution that decodes WAL to text tuples and ships those around and
> reinserts them via pure SQL should be significantly faster than the
> replication solutions we have today; if it isn't, something's wrong.
It's not only the logging side which is a limitation in today's replication
scenarios. The apply side scales even worse, because it's *very* hard to
distribute it between multiple backends.

> >> The problem with that is it's going to be tough to make robust.  Users
> >> could easily end up with answers that are total nonsense, or probably
> >> even crash the server.
> > Why?
> Because the routines that decode tuples don't include enough sanity
> checks to prevent running off the end of the block, or even the end of
> memory completely.  Consider a corrupt TOAST pointer that indicates
> that there is a gigabyte of data stored in an 8kB block.  One of the
> common symptoms of corruption IME is TOAST requests for -3 bytes of
> memory.
Yes, but we need to put in safeguards against that sort of thing anyway. So
sure, we can have bugs, but this is not a fundamental limitation.

> And, of course, even if you could avoid crashing, interpreting what
> was originally intended as a series of int4s as a varlena isn't likely
> to produce anything terribly meaningful.  Tuple data isn't
> self-identifying; that's why this is such a hard problem.
Yes, sure.

> >> To step back and talk about DDL more generally, you've mentioned a few
> >> times the idea of using an SR instance that has been filtered down to
> >> just the system catalogs as a means of generating logical change
> >> records.  However, as things stand today, there's no reason to suppose
> >> that replicating anything less than the entire cluster is sufficient.
> >> For example, you can't translate enum labels to strings without access
> >> to the pg_enum catalog, which would be there, because enums are
> >> built-in types.  But someone could supply a similar user-defined type
> >> that uses a user-defined table to do those lookups, and now you've got
> >> a problem.  I think this is a contractual problem, not a technical
> >> one.  From the point of view of logical replication, it would be nice
> >> if type output functions were basically guaranteed to look at nothing
> >> but the datum they get passed as an argument, or at the very least
> >> nothing other than the system catalogs, but there is no such
> >> guarantee.  And, without such a guarantee, I don't believe that we can
> >> create a high-performance, robust, in-core replication solution.
> > 
> > I don't think that's a valid argument. Any such solution existing today
> > fails to work properly with dump/restore and such because it implies
> > dependencies that they do not know about. The "internal" tables will
> > possibly be restored later than the tables using them and such. So
> > your data format *has* to deal with loading/outputting data without such
> > anyway.
> Do you know for certain that PostGIS doesn't do anything of this type?
>  Or what about something like an SE-Linux label cache, where we might
> arrange to create labels as they are used and associate them with
> integer tags?
Postgis uses one information table in a few more complex functions, but not
in anything low-level. Evidenced by the fact that it was totally normal for
that to go out of sync before 2.0.

But even if such a thing were needed, it wouldn't be problematic to make
extension configuration tables be replicated as well.

> > You could reduce the space overhead by adding that information only
> > the first time after a table has changed (and then regularly after a
> > checkpoint or so) but doing so seems to be introducing too much
> > complexity.
> Well, I dunno: it is complicated, but I'm worried that the design
> you've got is awfully complicated, too. Requiring an extra PG
> instance with a very specific configuration that furthermore uses an
> untested WAL-filtering methodology that excludes everything but the
> system catalogs seems like an administrative nightmare, and I remain
> unconvinced that it is safe.
Well, I assumed we would have a utility to create such an instance via the
replication protocol which does all the necessary things.

I personally don't want to do this in the very first version of an applied
patch. I think solving this on a binary level, without decoding, is the
implementation-wise smallest already usable set of features. Yes, it doesn't
provide many things people want (cross arch, replication into text, ...), but
it provides most of the infrastructure for that. If we try to do everything
at once we will *never* get anywhere.

The WAL filtering machinery is a very nice building block for improving the
SR experience as well, btw, by removing unneeded full page writes, filtering
on one database, and such.

> In fact, I have a strong feeling that it isn't safe, but if you're not
> convinced by the argument already laid out then I'm not sure I can convince
> you of it right this minute.

> What happens if you have a crash on the WAL generation machine?
> You'll have to rewind to the most recent restartpoint, and you can't
> use the catalogs until you've reached the minimum recovery point.  Is
> that going to mess you up?
With "Wal generation machine" you mean the catalog-only instance?

If yes: Very good point.

The only real problem I know of with that is checkpoints with overflowed
subxid snapshots. Otherwise you can always restart with the last checkpoint
and replay exactly to the former location. The minimum recovery point cannot
be ahead of the point you just replayed. Every time the recovery point is
updated you need to serialize the applycache to disk, but that's already
required to be possible, because otherwise you would need unbounded amounts
of memory.
For the overflowed snapshot problem I had toyed with the idea of simply
serializing the state of the KnownAssignedXids machinery when replaying
checkpoints and at RecoveryPoints, to solve that issue after the initial
clone for HS in general.

> >> And then maybe we handle poorly-behaved types by pushing some of the
> >> work into the foreground task that's generating the WAL: in the worst
> >> case, the process logs a record before each insert/update/delete
> >> containing the text representation of any values that are going to be
> >> hard to decode.  In some cases (e.g. records all of whose constituent
> >> fields are well-behaved types) we could instead log enough additional
> >> information about the type to permit blind decoding.
> > 
> > I think this is prohibitively expensive from a development, runtime,
> > space and maintenance standpoint.
> > For databases using types where decoding is rather expensive (e.g.
> > postgis) you wouldn't really improve much over the old trigger based
> > solutions. It's a return to "log everything twice".
> 
> Well, if the PostGIS types are poorly behaved under the definition I
> proposed, that implies they won't work at all under your scheme.  I
> think putting a replication solution into core that won't support
> PostGIS is dead on arrival.  If they're well-behaved, then there's no
> double-logging.
I am pretty sure it's not badly behaved. But how should the code know that?
You want each type to explicitly say that it's unsafe if it is?
In that case you can just as well mark all such tables, which would also make
reliable pg_dump/restore possible. Assuming you buy the duplicated catalog
idea. Which you do not ;)

> > Sorry if I seem pigheaded here, but I fail to see why all that would buy
> > us anything but loads of complexity while losing many potential
> > advantages.
> The current system that you are proposing is very complex and has a
> number of holes at present, some of which you've already mentioned in
> previous emails.  There's a lot of advantage in picking a design that
> allows you to put together a working prototype relatively quickly, but
> I have a sinking feeling that your chosen design is going to be very
> hard to bullet-proof and not very user-friendly.  If we could find a
> way to run it all inside a single server I think we would be way ahead
> on both fronts.
Don't get me wrong: if I believed another approach were realistic I would
definitely prefer not going for this one. I do find the requirement of
keeping multiple copies of the catalog around pretty damn ugly.

The problem is just that to support basically arbitrary decoding requirements 
you need to provide at least those pieces of information in a transactionally 
consistent manner:
* the data
* table names
* column names
* type information
* replication configuration

I have yet to see anybody point out a way to provide all that information 
without huge problems with the complexity/space-usage/performance triangle.

I have played with several ideas:

1.)
Keep the decoding catalog in sync with command/event triggers, correctly
replicating oids. If those log into some internal event table, it's easy to
keep the catalog in a correct transactional state, because the events from
that table get decoded in the transaction and replayed at exactly the right
spot *after* it has been reassembled. The locking on the generating side
takes care of the concurrency aspects.

Advantages:
* minimal overhead (space, performance)
* allows additional tables/indexes/triggers if you take care with oid
allocation
* easy transactionally correct behaviour with catalogs
* the decoding instance can be used to store all data in a highly efficient
manner (no decoding, no full detoasting, ...)
* the decoding instance is fully writable without problems if you don't
generate conflicts (separate tables, non-overlapping writes, whatever)
* implementable in a pretty unintrusive way

Disadvantages:
* the table structure of replicated tables needs to be *exactly* the same
* the type definitions + support procs need to be similar enough to read the
data
* error checking of the above isn't easy
* full version/architecture compatibility required
* a second instance is required even if you want to replicate into some other
system/architecture/version

2.)
Keep the decoding site up to date by replicating the catalog via normal
recovery mechanisms

Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no support for command triggers required that can keep a catalog in sync, 
including oids

Disadvantages:
* driving the catalog recovery that way requires some somewhat intricate code 
as it needs to be done in lockstep with decoding the wal-stream
* requires an additional feature to guarantee HS always has enough information 
to be queryable
* some complex logic/low-level fudging required to keep the transactional 
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will only ever be readable

3.)
Fully versioned catalog

Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in that 
path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling back 
to textual versions on mismatch

Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off
* decoding has to happen on the master which might not be what people want 
performancewise

4.)
Log enough information in the walstream to make decoding possible using only 
the walstream.

Advantages:
* Decoding can optionally be done on the master
* No catalog syncing/access required
* its possible to make this architecture independent

Disadvantages:
* high to very high implementation overhead depending on efficiency aims
* high space overhead in the wal because at least all the catalog information
needs to be logged in a transactional manner repeatedly
* misuses wal far more than other methods
* significant new complexity in somewhat critical code paths (heapam.c)
* insanely high space overhead if the decoding is to be architecture
independent

5.)
The actually good idea. Yours?

Greetings,

Andres

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Robert Haas
Date:
On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Adds a single and a double linked list which can easily embedded into other
> datastructures and can be used without any additional allocations.

dllist.h advertises that it's embeddable.  Can you use that instead,
or enhance it slightly to support what you want to do?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Tuesday, June 19, 2012 07:24:13 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
> >>>> ...  (If you are thinking
> >>>> of something sufficiently high-level that merging could possibly work,
> >>>> then it's not WAL, and we shouldn't be trying to make the WAL
> >>>> representation cater for it.)
> >> 
> >> Do you really see this as such a big problem?
> > 
> > It looks suspiciously like "I have a hammer, therefore every problem
> > must be a nail".  I don't like the design concept of cramming logical
> > replication records into WAL in the first place.
> 
> Me, neither.  I think it's necessary to try to find a way of
> generating logical replication records from WAL.  But once generated,
> I think those records should form their own stream, independent of
> WAL.  If you take the contrary position that they should be included
> in WAL, then when you filter the WAL stream down to just the records
> of interest to logical replication, you end up with a WAL stream with
> holes in it, which is one of the things that Andres listed as an
> unresolved design problem in his original email.
Yes.

> Moreover, this isn't necessary at all for single-master replication,
> or even multi-source replication where each table has a single master.
>  It's only necessary for full multi-master replication, which we have
> no consensus to include in core, and even if we did have a consensus
> to include it in core, it certainly shouldn't be the first feature we
> design.
Well, you can't blame a patch/prototype trying to implement what it claims to 
implement ;)

More seriously: Even if we don't put MM in core I think putting the basis for 
it in core so that somebody can build such a solution reusing the existing 
infrastructure is a sensible idea. Imo the only thing that requires explicit 
support which is hard to add outside of core is prevention of loops (aka this 
patch). Everything else should be doable reusing the hopefully modular pieces.

> > However, if we're dead set on doing it that way, let us put information
> > that is only relevant to logical replication records into only the
> > logical replication records.
> Right.  If we decide we need this, and if we did decide to conflate
> the WAL stream, both of which I disagree with as noted above, then we
> still don't need it on every record.  It would probably be sufficient
> for local transactions to do nothing at all (and we can implicitly
> assume that they have master node ID = local node ID) and transactions
> which are replaying remote changes to emit one record per XID per
> checkpoint cycle containing the remote node ID.
You've gone from a pretty trivial 150 line patch without any runtime/space 
overhead to something *considerably* more complex in that case though.

You suddenly need relatively complex logic to remember which information you 
got for a certain xid (and forget that information afterwards) and whether you 
already logged that xid, and you have to decide about logging that information 
at multiple places.

Btw, what do you mean by "conflating" the stream? I don't really see that 
being proposed.

Andres
--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Tuesday, June 19, 2012 09:16:41 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > Adds a single and a double linked list which can easily embedded into
> > other datastructures and can be used without any additional allocations.
> 
> dllist.h advertises that it's embeddable.  Can you use that instead,
> or enhance it slightly to support what you want to do?
Oh, wow. Didn't know that existed. I had $subject lying around from a play-
around project, so I didn't look too hard.
Why is that code not used more widely? Quite a bit of our list usage should be 
replaced by embedding the list element in larger structs imo. There are also 
open-coded inline list manipulations around (check aset.c for example).

*looks*

Not really that happy with it though:

1. dllist.h has double the element overhead by having an inline value pointer 
(which is not needed when embedding) and a pointer to the list (which I have a 
hard time seeing as being useful)
2. only a doubly linked list; mine provides singly and doubly linked ones
3. missing macros to use when embedded in a larger struct (containerof() 
wrappers and for(...) support basically; see the sketch below)
4. most things are external function calls...
5. way more branches/complexity in most of the functions. My 
implementation doesn't use any branches for the typical easy modifications 
(push, pop, remove element somewhere) and only one for the typical tests 
(empty, has-next, ...)
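
A containerof()-style wrapper of the kind point 3 refers to could look 
roughly like this (a sketch; the macro name is invented, not taken from the 
patch):

#include <stddef.h>

/* given a pointer to an embedded list node, recover the enclosing struct */
#define ilist_container(type, membername, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, membername)))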

The performance and memory aspects were crucial for the aforementioned toy 
project (slab allocator for postgres). It's not that crucial for the applycache 
where the lists currently are mostly used, although it's also relatively 
performance sensitive and obviously does a lot of list manipulation/iteration.

If I had to decide I would add the missing api from dllist.h to my 
implementation and then remove dllist.h. It's barely used - and only in an 
embedded fashion - as far as I can see.
I can understand though if that argument is met with doubt by others ;). If 
that's the way it has to go I would add some more convenience support for 
embedding data to dllist.h and settle for that.

Greetings,

Andres


--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Robert Haas
Date:
On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> From: Andres Freund <andres@anarazel.de>
>
> The previous coding could miss xlog writeouts at several places. E.g. when wal
> was written out by the background writer or even after a commit if
> synchronous_commit=off.
> This could lead to delays in sending data to the standby of up to 7 seconds.
>
> To fix this move the responsibility of notification to the layer where the
> neccessary information is actually present. We take some care not to do the
> notification while we hold conteded locks like WALInsertLock or WalWriteLock
> locks.

I am not convinced that it's a good idea to wake up every walsender
every time we do XLogInsert().  XLogInsert() is a super-hot code path,
and adding more overhead there doesn't seem warranted.  We need to
replicate commit, commit prepared, etc. quickly, but why do we need to
worry about a short delay in replicating heap_insert/update/delete,
for example?  They don't really matter until the commit arrives.  7
seconds might be a bit long, but that could be fixed by decreasing the
polling interval for walsender to, say, a second.

When I was doing some testing recently, the case that was sort of
confusing was the replay of AccessExclusiveLocks.  The lock didn't show
up on the standby for many seconds after it had showed up on the
master.  But that's more a feature than a bug when you really think
about it - postponing the lock on the slave for as long as possible
just reduces the user impact of having to take them there at all.

Parenthetically, I find it difficult to extract inline patches.  No
matter whether I try to extract them using Gmail + show original or the
web site, something always seems to get garbled.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 3:48 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Why is that code not used more widely? Quite a bit of our list usage should be
> replaced embedding list element in larger structs imo. There are also open-
> coded inline list manipulations around (check aset.c for example).

Because we've got a million+ lines of code and nobody knows where all
the bodies are buried.

> 1. dllist.h has double the element overhead by having an inline value pointer
> (which is not needed when embedding) and a pointer to the list (which I have a
> hard time seing as being useful)
> 2. only double linked list, mine provided single and double linked ones
> 3. missing macros to use when embedded in a larger struct (containerof()
> wrappers and for(...) support basically)
> 4. most things are external function calls...
> 5. way much more branches/complexity in most of the functions. My
> implementation doesn't use any branches for the typical easy modifications
> (push, pop, remove element somewhere) and only one for the typical tests
> (empty, has-next, ...)
>
> The performance and memory aspects were crucial for the aforementioned toy
> project (slab allocator for postgres). Its not that crucial for the applycache
> where the lists currently are mostly used although its also relatively
> performance sensitive and obviously does a lot of list manipulation/iteration.
>
> If I had to decide I would add the missing api in dllist.h to my
> implementation and then remove it. Its barely used - and only in an embedded
> fashion - as far as I can see.
> I can understand though if that argument is met with doubt by others ;). If
> thats the way it has to go I would add some more convenience support for
> embedding data to dllist.h and settle for that.

I think it might be simpler to leave the name as Dllist and just
overhaul the implementation along the lines you suggest, rather than
replacing it with something completely different.  Mostly, I don't
want to add a third thing if we can avoid it, given that Dllist as it
exists today is used only lightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Marko Kreen
Date:
On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> +/*
> + * removes a node from a list
> + * Attention: O(n)
> + */
> +static inline void ilist_s_remove(ilist_s_head *head,
> +                                  ilist_s_node *node)
> +{
> +       ilist_s_node *last = &head->head;
> +       ilist_s_node *cur;
> +#ifndef NDEBUG
> +       bool found = false;
> +#endif
> +       while ((cur = last->next))
> +       {
> +               if (cur == node)
> +               {
> +                       last->next = cur->next;
> +#ifndef NDEBUG
> +                       found = true;
> +#endif
> +                       break;
> +               }
> +               last = cur;
> +       }
> +       assert(found);
> +}

This looks weird.

In a cyclic list, removal is:

  node->prev->next = node->next;
  node->next->prev = node->prev;

And that's it.

--
marko


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
Hi,

On Tuesday, June 19, 2012 09:59:48 PM Marko Kreen wrote:
> On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > +/*
> > + * removes a node from a list
> > + * Attention: O(n)
> > + */
> > +static inline void ilist_s_remove(ilist_s_head *head,
> > +                                  ilist_s_node *node)
> > +{
> > +       ilist_s_node *last = &head->head;
> > +       ilist_s_node *cur;
> > +#ifndef NDEBUG
> > +       bool found = false;
> > +#endif
> > +       while ((cur = last->next))
> > +       {
> > +               if (cur == node)
> > +               {
> > +                       last->next = cur->next;
> > +#ifndef NDEBUG
> > +                       found = true;
> > +#endif
> > +                       break;
> > +               }
> > +               last = cur;
> > +       }
> > +       assert(found);
> > +}
> 
> This looks weird.
> 
> In cyclic list removal is:
> 
>   node->prev->next = node->next;
>   node->next->prev = node->prev;
> 
> And thats it.
That's the singly linked list, not the doubly linked one. That's why it has an 
O(n) warning tacked on...

The doubly linked one is just as you said:

/*
 * removes a node from a list
 */
static inline void
ilist_d_remove(unused_attr ilist_d_head *head, ilist_d_node *node)
{
    ilist_d_check(head);
    node->prev->next = node->next;
    node->next->prev = node->prev;
    ilist_d_check(head);
}
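
To illustrate the embedding, usage could look like this (a sketch; MyItem and 
its fields are invented, not from the patch):

typedef struct MyItem
{
    ilist_d_node    node;   /* embedded, so no separate allocation needed */
    int             value;
} MyItem;

/* removal is then O(1) and needs no search: */
/*     ilist_d_remove(&items, &item->node); */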

Greetings,

Andres

--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Andres Freund
Date:
On Tuesday, June 19, 2012 09:55:30 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > From: Andres Freund <andres@anarazel.de>
> > 
> > The previous coding could miss xlog writeouts at several places. E.g.
> > when wal was written out by the background writer or even after a commit
> > if synchronous_commit=off.
> > This could lead to delays in sending data to the standby of up to 7
> > seconds.
> > 
> > To fix this move the responsibility of notification to the layer where
> > the neccessary information is actually present. We take some care not to
> > do the notification while we hold conteded locks like WALInsertLock or
> > WalWriteLock locks.
> 
> I am not convinced that it's a good idea to wake up every walsender
> every time we do XLogInsert().  XLogInsert() is a super-hot code path,
> and adding more overhead there doesn't seem warranted.  We need to
> replicate commit, commit prepared, etc. quickly, by why do we need to
> worry about a short delay in replicating heap_insert/update/delete,
> for example?  They don't really matter until the commit arrives.  7
> seconds might be a bit long, but that could be fixed by decreasing the
> polling interval for walsender to, say, a second.
It's not woken up on every XLogInsert call. It's only woken up if there was an 
actual disk write + fsync in there. That's exactly the point of the patch.
The wakeup rate is actually lower for synchronous_commit=on than before, 
because previously it unconditionally did a wakeup for every commit (and 
similar) and now it only does that if something has been written + fsynced.
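
Schematically, the wakeup decision moves to the layer that knows whether a 
write happened; roughly like this (a sketch, not the patch's actual code - 
the function and parameter are invented for illustration):

#include "replication/walsender.h"

/*
 * Notify walsenders only when WAL was actually written out and fsynced;
 * called after contended locks such as WALWriteLock have been released.
 */
static void
NotifyWalSendersIfNeeded(bool wrote_and_flushed)
{
    if (wrote_and_flushed)
        WalSndWakeup();     /* existing walsender wakeup primitive */
}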

> Parenthetically, I find it difficult to extract inline patches.  No
> matter whether I try to use it using Gmail + show original or the web
> site, something always seems to get garbled.
Will use git send-email --attach next time... Btw, git am should be able to 
extract the patches for you.

Andres
--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Marko Kreen
Date:
On Tue, Jun 19, 2012 at 11:02 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Tuesday, June 19, 2012 09:59:48 PM Marko Kreen wrote:
>> On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>> > +/*
>> > + * removes a node from a list
>> > + * Attention: O(n)
>> > + */
>> > +static inline void ilist_s_remove(ilist_s_head *head,
>> > +                                  ilist_s_node *node)
>> > +{
>> > +       ilist_s_node *last = &head->head;
>> > +       ilist_s_node *cur;
>> > +#ifndef NDEBUG
>> > +       bool found = false;
>> > +#endif
>> > +       while ((cur = last->next))
>> > +       {
>> > +               if (cur == node)
>> > +               {
>> > +                       last->next = cur->next;
>> > +#ifndef NDEBUG
>> > +                       found = true;
>> > +#endif
>> > +                       break;
>> > +               }
>> > +               last = cur;
>> > +       }
>> > +       assert(found);
>> > +}
>>
>> This looks weird.
>>
>> In cyclic list removal is:
>>
>>   node->prev->next = node->next;
>>   node->next->prev = node->prev;
>>
>> And thats it.
> Thats the single linked list, not the double linked one. Thats why it has a
> O(n) warning tacked on...

Oh, you have several list implementations there.
Sorry, I was just browsing and it caught my eye.

--
marko


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Tuesday, June 19, 2012 09:58:43 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 3:48 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > Why is that code not used more widely? Quite a bit of our list usage
> > should be replaced embedding list element in larger structs imo. There
> > are also open- coded inline list manipulations around (check aset.c for
> > example).
> Because we've got a million+ lines of code and nobody knows where all
> the bodies are buried.
Heh, yes ;). I've hit several of them myself, as you uncovered at least twice ;)

> > 1. dllist.h has double the element overhead by having an inline value
> > pointer (which is not needed when embedding) and a pointer to the list
> > (which I have a hard time seing as being useful)
> > 2. only double linked list, mine provided single and double linked ones
> > 3. missing macros to use when embedded in a larger struct (containerof()
> > wrappers and for(...) support basically)
> > 4. most things are external function calls...
> > 5. way much more branches/complexity in most of the functions. My
> > implementation doesn't use any branches for the typical easy
> > modifications (push, pop, remove element somewhere) and only one for the
> > typical tests (empty, has-next, ...)
> > 
> > The performance and memory aspects were crucial for the aforementioned
> > toy project (slab allocator for postgres). Its not that crucial for the
> > applycache where the lists currently are mostly used although its also
> > relatively performance sensitive and obviously does a lot of list
> > manipulation/iteration.
> > 
> > If I had to decide I would add the missing api in dllist.h to my
> > implementation and then remove it. Its barely used - and only in an
> > embedded fashion - as far as I can see.
> > I can understand though if that argument is met with doubt by others ;).
> > If thats the way it has to go I would add some more convenience support
> > for embedding data to dllist.h and settle for that.
> 
> I think it might be simpler to leave the name as Dllist and just
> overhaul the implementation along the lines you suggest, rather than
> replacing it with something completely different.  Mostly, I don't
> want to add a third thing if we can avoid it, given that Dllist as it
> exists today is used only lightly.
Well, if it's the name, I have no problem with changing it, but I don't see how 
you can keep the api as it currently is and address my points.

If there is some buyin I can try to go either way (keeping the existing name 
and changing the api while adjusting the callers, or just adjusting the 
callers and throwing away the old implementation). I just don't want to get 
into that only to find that somebody disagrees with the fundamental idea.

The most contentious point is probably relying on USE_INLINE being available 
everywhere, which I believe is a reasonable assumption now that we have gotten 
rid of some platforms.

Andres
--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
"Kevin Grittner"
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> However, if we're dead set on doing it that way, let us put
>>> information that is only relevant to logical replication records
>>> into only the logical replication records.
>> Right.  If we decide we need this, and if we did decide to
>> conflate the WAL stream, both of which I disagree with as noted
>> above, then we still don't need it on every record.  It would
>> probably be sufficient for local transactions to do nothing at
>> all (and we can implicitly assume that they have master node ID =
>> local node ID) and transactions which are replaying remote
>> changes to emit one record per XID per checkpoint cycle
>> containing the remote node ID.
> Youve gone from a pretty trivial 150 line patch without any
> runtime/space overhead to something *considerably* more complex in
> that case though. 
I think it might be worth it.  I've done a lot of MM replication,
and so far have not had to use a topology which allowed loops.  Not
only would you be reserving space in the WAL stream which was not
useful for those not using MM replication, you would be reserving it
when even many MM configurations would not need it.  Now you could
argue that the 16 bits you want to use are already there and are not
yet used for anything; but there are two counter-arguments to that:
you lose the opportunity to use them for something else, and you
might want more than 16 bits.
-Kevin


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Marko Kreen
Date:
On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 13 June 2012 19:28, Andres Freund <andres@2ndquadrant.com> wrote:
>> This adds a new configuration parameter multimaster_node_id which determines
>> the id used for wal originating in one cluster.
>
> Looks good and it seems this aspect at least is commitable in this CF.
>
> Design decisions I think we need to review are
>
> * Naming of field. I think origin is the right term, borrowing from Slony.

I have not read too deeply here, so maybe I am missing
some important detail, but the idea that users need
to coordinate an integer config parameter globally does not
sound too attractive to me.

Why not limit integers to local storage only and map
them to string idents on config, UI and transport?

-- 
marko


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
"Kevin Grittner"
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> The problem is just that to support basically arbitrary decoding
> requirements you need to provide at least those pieces of
> information in a transactionally consistent manner:
> * the data
> * table names
> * column names
> * type information
> * replication configuration
I'm not sure that the last one needs to be in scope for the WAL
stream, but the others I definitely agree eventually need to be
available to a logical transaction stream consumer.  You lay out the
alternative ways to get all of this pretty clearly, and I don't know
what the best answer is; it seems likely that there is not one best
answer.  In the long run, more than one of those options might need
to be supported, to support different environments.
As an initial implementation, I'm leaning toward the position that
requiring a hot standby or a catalog-only proxy is acceptable.  I
think that should allow an application to be written which emits
everything except the replication configuration.  That will allow us
to hook up everything we need at our shop.
-Kevin


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 04:58, Marko Kreen <markokr@gmail.com> wrote:
> On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 13 June 2012 19:28, Andres Freund <andres@2ndquadrant.com> wrote:
>>> This adds a new configuration parameter multimaster_node_id which determines
>>> the id used for wal originating in one cluster.
>>
>> Looks good and it seems this aspect at least is commitable in this CF.
>>
>> Design decisions I think we need to review are
>>
>> * Naming of field. I think origin is the right term, borrowing from Slony.
>
> I have not read too deeply here, so maybe I am missing
> some important detail here, but idea that users need
> to coordinate a integer config parameter globally does not
> sound too attractive to me.
>
> Why not limit integers to local storage only and map
> them to string idents on config, UI and transport?

Yes, that can be done. This is simply how the numbers are used
internally, similar to oids.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> More seriously: Even if we don't put MM in core I think putting the basis for
> it in core so that somebody can build such a solution reusing the existing
> infrastructure is a sensible idea. Imo the only thing that requires explicit
> support which is hard to add outside of core is prevention of loops (aka this
> patch). Everything else should be doable reusing the hopefully modular pieces.

I don't think prevention of loops is hard to do outside of core
either, unless you insist on tying "logical" replication so closely to
WAL that anyone doing MMR is necessarily getting the change stream
from WAL.  In fact, I'd go so far as to say that the ONLY part of this
that's hard to do outside of core is change extraction.  Even
low-level apply can be implemented as a loadable module.

>> Right.  If we decide we need this, and if we did decide to conflate
>> the WAL stream, both of which I disagree with as noted above, then we
>> still don't need it on every record.  It would probably be sufficient
>> for local transactions to do nothing at all (and we can implicitly
>> assume that they have master node ID = local node ID) and transactions
>> which are replaying remote changes to emit one record per XID per
>> checkpoint cycle containing the remote node ID.
> Youve gone from a pretty trivial 150 line patch without any runtime/space
> overhead to something *considerably* more complex in that case though.
>
> You suddently need to have relatively complex logic to remember which
> information you got for a certain xid (and forget that information afterwards)
> and whether you already logged that xid and you need to have to decide about
> logging that information at multiple places.

You need a backend-local hash table inside the wal reader process, and
that hash table needs to map XIDs to node IDs.  And you occasionally
need to prune it, so that it doesn't eat too much memory.  None of
that sounds very hard.
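
For scale, such a backend-local map could be built on the existing dynahash 
facility; a minimal sketch, assuming per-transaction origin information is 
available to the reader (the struct and function names are invented):

#include "postgres.h"
#include "utils/hsearch.h"

/* maps an xid to the node its changes were replayed from */
typedef struct XidOriginEntry
{
    TransactionId   xid;            /* hash key */
    int             origin_node_id;
} XidOriginEntry;

static HTAB *xid_origin_map;

static void
init_xid_origin_map(void)
{
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(TransactionId);
    ctl.entrysize = sizeof(XidOriginEntry);
    ctl.hash = tag_hash;

    xid_origin_map = hash_create("xid origin map", 1024, &ctl,
                                 HASH_ELEM | HASH_FUNCTION);
}

Entries would be dropped again via hash_search(..., HASH_REMOVE, ...) once a 
transaction's records have been fully processed, bounding the memory use.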

> Btw, what do you mean with "conflating" the stream? I don't really see that
> being proposed.

It seems to me that you are intent on using the WAL stream as the
logical change stream.  I think that's a bad design.  Instead, you
should extract changes from WAL and then ship them around in a format
that is specific to logical replication.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 19 June 2012 14:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Every WAL record"?  Why in heck would you attach it to every record?
> Surely putting it in WAL page headers would be sufficient.  We could
> easily afford to burn a page switch (if not a whole segment switch)
> when changing masters.

This does appear to be a reasonable idea at first glance, since it
seems that each node has just a single node id, but that is not the
case.

As we pass changes around we maintain the same origin id for a change,
so there is a mix of origin node ids at the WAL record level, not the
page level. The concept of originating node id is essentially the same as
that used in Slony.

> I'm against the idea of eating any spare space we have in WAL record
> headers for this purpose, anyway; there are likely to be more pressing
> needs in future.

Not sure what those pressing needs are, but I can't see any. What we
are doing here is fairly important, just not as important as crash
recovery. But then that has worked pretty much unchanged for some time
now.

I raised the possibility of having variable length headers, but there
is no requirement for that yet.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 05:46, Robert Haas <robertmhaas@gmail.com> wrote:

> It seems to me that you are intent on using the WAL stream as the
> logical change stream.  I think that's a bad design.  Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.

The proposal is to read the WAL in order to generate LCRs. The
information needs to be in the original data in order to allow it to
be passed onwards.

In a multi-master config there will be WAL records with many origin
node ids at any time, so this information must be held at WAL record
level not page level.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Christopher Browne
Date:
On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Btw, what do you mean with "conflating" the stream? I don't really see that
>> being proposed.
>
> It seems to me that you are intent on using the WAL stream as the
> logical change stream.  I think that's a bad design.  Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.

Yeah, that seems worth elaborating on.

What has been said several times is that it's pretty necessary to
capture the logical changes into WAL.  That seems pretty needful, in
order that the replication data gets fsync()ed avidly, and so that we
don't add in the race condition of needing to fsync() something *else*
almost exactly as avidly as is the case for WAL today.

But it's undesirable to pull *all* the bulk of contents of WAL around
if it's only part of the data that is going to get applied.  On a
"physical streaming" replica, any logical data that gets captured will
be useless.  And on a "logical replica," the "physical" bits of WAL
will be useless.

What I *want* you to mean is that there would be:
a) WAL readers that pull the "physical bits", and
b) WAL readers that just pull "logical bits."

I expect it would be fine to have a tool that pulls LCRs out of WAL to
prepare that to be sent to remote locations.  Is that what you have in
mind?  Or are you feeling that the "logical bits" shouldn't get
captured in WAL altogether, so we need to fsync() them into a
different stream of files?
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Tuesday, June 19, 2012 10:58:44 PM Marko Kreen wrote:
> On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 13 June 2012 19:28, Andres Freund <andres@2ndquadrant.com> wrote:
> >> This adds a new configuration parameter multimaster_node_id which
> >> determines the id used for wal originating in one cluster.
> > 
> > Looks good and it seems this aspect at least is commitable in this CF.
> > 
> > Design decisions I think we need to review are
> > 
> > * Naming of field. I think origin is the right term, borrowing from
> > Slony.
> 
> I have not read too deeply here, so maybe I am missing
> some important detail here, but idea that users need
> to coordinate a integer config parameter globally does not
> sound too attractive to me.
> 
> Why not limit integers to local storage only and map
> them to string idents on config, UI and transport?
That should be possible. I don't want to go there till we have the base stuff 
in but it shouldn't be too hard.

Greetings,

Andres
--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 04:31, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> I've done a lot of MM replication,
> and so far have not had to use a topology which allowed loops.

The proposal is to use WAL to generate the logical change stream. That
has been shown in testing to be around 4x faster than having a
separate change stream, which must also be WAL logged (as Jan noted).

If we use WAL in this way, multi-master implies that the data will
*always* be in a loop. So in any configuration we must be able to tell
difference between changes made by one node and another.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 05:59, Christopher Browne <cbbrowne@gmail.com> wrote:

> But it's undesirable to pull *all* the bulk of contents of WAL around
> if it's only part of the data that is going to get applied.  On a
> "physical streaming" replica, any logical data that gets captured will
> be useless.  And on a "logical replica," they "physical" bits of WAL
> will be useless.

Completely agree.

What we're discussing here is the basic information content of a WAL
record to enable the construction of suitable LCRs.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 00:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
>>>> ...  (If you are thinking
>>>> of something sufficiently high-level that merging could possibly work,
>>>> then it's not WAL, and we shouldn't be trying to make the WAL
>>>> representation cater for it.)
>
>> Do you really see this as such a big problem?
>
> It looks suspiciously like "I have a hammer, therefore every problem
> must be a nail".  I don't like the design concept of cramming logical
> replication records into WAL in the first place.

The evidence from prototypes shown at PgCon was that using the WAL in
this way was very efficient and that this level of efficiency is
necessary to make it work in a practical manner. Postgres would not be
the first database to follow this broad design.

> However, if we're dead set on doing it that way, let us put information
> that is only relevant to logical replication records into only the
> logical replication records.

Agreed. In this case, the origin node id information needs to be
available on the WAL record so we can generate the logical changes
(LCRs).

> Saving a couple bytes in each such record
> is penny-wise and pound-foolish, I'm afraid; especially when you're
> nailing down hard, unexpansible limits at the very beginning of the
> development process in order to save those bytes.

Restricting the number of node ids is being done so that there is no
impact on anybody not using this feature.

In later implementations, it should be possible to support greater
numbers of node ids by having a variable length header. But that is an
unnecessary complication for a first release.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Tuesday, June 19, 2012 11:46:56 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > More seriously: Even if we don't put MM in core I think putting the basis
> > for it in core so that somebody can build such a solution reusing the
> > existing infrastructure is a sensible idea. Imo the only thing that
> > requires explicit support which is hard to add outside of core is
> > prevention of loops (aka this patch). Everything else should be doable
> > reusing the hopefully modular pieces.
> I don't think prevention of loops is hard to do outside of core
> either, unless you insist on tying "logical" replication so closely to
> WAL that anyone doing MMR is necessarily getting the change stream
> from WAL.  In fact, I'd go so far as to say that the ONLY part of this
> that's hard to do outside of core is change extraction.  Even
> low-level apply can be implemented as a loadable module.
I definitely agree that low-level apply is possible as a module. Sure, change 
extraction needs core support, but I was talking about what you need to 
implement it reusing the "plain" logical support...

What I do not understand is how you want to prevent loops in a simple manner 
without in-core support:

A generates a HEAP_INSERT record. It gets decoded into the lcr stream as an 
INSERT action.
B reads the lcr stream from A and applies the changes, generating a new 
HEAP_INSERT record. That gets decoded into B's lcr stream as an INSERT action.
A reads the lcr stream from B and ???

At this point you need to prevent a loop. If you have the information where a 
change originally happened (xl_origin_id = A in this case) you can have a 
simple filter on A which ignores change records if (lcr_origin_id == 
local_replication_origin_id).
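
That filter is then trivial; roughly (a sketch - the LCR struct and the 
variable names are invented for illustration):

/* apply-side loop prevention: skip changes that originated on this node */
static bool
should_apply(const LCRecord *lcr)
{
    return lcr->origin_id != local_replication_origin_id;
}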


> >> Right.  If we decide we need this, and if we did decide to conflate
> >> the WAL stream, both of which I disagree with as noted above, then we
> >> still don't need it on every record.  It would probably be sufficient
> >> for local transactions to do nothing at all (and we can implicitly
> >> assume that they have master node ID = local node ID) and transactions
> >> which are replaying remote changes to emit one record per XID per
> >> checkpoint cycle containing the remote node ID.
> > 
> > Youve gone from a pretty trivial 150 line patch without any runtime/space
> > overhead to something *considerably* more complex in that case though.
> > 
> > You suddently need to have relatively complex logic to remember which
> > information you got for a certain xid (and forget that information
> > afterwards) and whether you already logged that xid and you need to have
> > to decide about logging that information at multiple places.
> You need a backend-local hash table inside the wal reader process, and
> that hash table needs to map XIDs to node IDs.  And you occasionally
> need to prune it, so that it doesn't eat too much memory.  None of
> that sounds very hard.
It's not very hard. It's just more complex than what I propose(d).

> > Btw, what do you mean with "conflating" the stream? I don't really see
> > that being proposed.
> It seems to me that you are intent on using the WAL stream as the
> logical change stream.  I think that's a bad design.  Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.
No, I don't want that. I think we will need some different format once we have 
agreed how changeset extraction works.

Andres
--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
"Kevin Grittner"
Date:
Simon Riggs <simon@2ndQuadrant.com> wrote:
> The proposal is to use WAL to generate the logical change stream.
> That has been shown in testing to be around x4 faster than having
> a separate change stream, which must also be WAL logged (as Jan
> noted).
Sure, that's why I want it.
> If we use WAL in this way, multi-master implies that the data will
> *always* be in a loop. So in any configuration we must be able to
> tell difference between changes made by one node and another.
Only if you assume that multi-master means identical databases all
replicating the same data to all the others.  If I have 72 masters
replicating non-conflicting data to one consolidated database, I
consider that to be multi-master, too.  Especially if I have other
types of databases replicating disjoint data to the same
consolidated database, and the 72 sources have several databases
replicating disjoint sets of data to them.  We have about 1000
replication paths, none of which create a loop which can send data
back to the originator or cause conflicts.
Of course, none of these databases have the same OID for any given
object, and there are numerous different schemas among the
replicating databases, so I need to get to table and column names
before the data is of any use to me.
-Kevin


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 12:15:03 AM Kevin Grittner wrote:
> Simon Riggs <simon@2ndQuadrant.com> wrote:
> > If we use WAL in this way, multi-master implies that the data will
> > *always* be in a loop. So in any configuration we must be able to
> > tell difference between changes made by one node and another.
> 
> Only if you assume that multi-master means identical databases all
> replicating the same data to all the others.  If I have 72 master
> replicating non-conflicting data to one consolidated database, I
> consider that to be multi-master, too.
> ...
> Of course, none of these databases have the same OID for any given
> object, and there are numerous different schemas among the
> replicating databases, so I need to get to table and column names
> before the data is of any use to me.
Yes, that's definitely a valid use-case. But that doesn't preclude the other - 
also not uncommon - use-case where you want to have different masters which 
all contain up-to-date data.

Andres
--
 Andres Freund                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
"Kevin Grittner"
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> Yes, thats definitely a valid use-case. But that doesn't preclude
> the other - also not uncommon - use-case where you want to have
> different master which all contain up2date data.
I agree.  I was just saying that while one requires an origin_id,
the other doesn't.  And those not doing MM replication definitely
don't need it.
-Kevin


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Btw, what do you mean with "conflating" the stream? I don't really see that
>>> being proposed.
>>
>> It seems to me that you are intent on using the WAL stream as the
>> logical change stream.  I think that's a bad design.  Instead, you
>> should extract changes from WAL and then ship them around in a format
>> that is specific to logical replication.
>
> Yeah, that seems worth elaborating on.
>
> What has been said several times is that it's pretty necessary to
> capture the logical changes into WAL.  That seems pretty needful, in
> order that the replication data gets fsync()ed avidly, and so that we
> don't add in the race condition of needing to fsync() something *else*
> almost exactly as avidly as is the case for WAL today..

Check.

> But it's undesirable to pull *all* the bulk of contents of WAL around
> if it's only part of the data that is going to get applied.  On a
> "physical streaming" replica, any logical data that gets captured will
> be useless.  And on a "logical replica," they "physical" bits of WAL
> will be useless.
>
> What I *want* you to mean is that there would be:
> a) WAL readers that pull the "physical bits", and
> b) WAL readers that just pull "logical bits."
>
> I expect it would be fine to have a tool that pulls LCRs out of WAL to
> prepare that to be sent to remote locations.  Is that what you have in
> mind?

Yes.  I think it should be possible to generate LCRs from WAL, but I
think that the on-the-wire format for LCRs should be different from
the WAL format.  Trying to use the same format for both things seems
like an unpleasant straightjacket.  This discussion illustrates why:
we're talking about consuming scarce bit-space in WAL records for a
feature that only a tiny minority of users will use, and it's still
not really enough bit space.  That stinks.  If LCR transmission is a
separate protocol, this problem can be engineered away at a higher
level.

Suppose we have three servers, A, B, and C, that are doing
multi-master replication in a loop.  A sends LCRs to B, B sends them
to C, and C sends them back to A.  Obviously, we need to make sure
that each server applies each set of changes just once, but it
suffices to have enough information in WAL to distinguish between
replication transactions and non-replication transactions - that is,
one bit.  So suppose a change is made on server A.  A generates LCRs
from WAL, and tags each LCR with node_id = A.  It then sends those
LCRs to B.  B applies them, flagging the apply transaction in WAL as a
replication transaction, AND ALSO sends the LCRs to C.  The LCR
generator on B sees the WAL from apply, but because it's flagged as a
replication transaction, it does not generate LCRs.  So C receives
LCRs from B just once, without any need for the node_id to be known
in WAL.  C can now also apply those LCRs (again flagging the apply
transaction as replication) and it can also skip sending them to A,
because it sees that they originated at A.
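
The generator-side check this scheme needs is just one bit per transaction; 
schematically (a sketch - the function is hypothetical, standing for a lookup 
of the single "replication apply" flag bit recorded for the xact):

static bool
should_generate_lcrs(TransactionId xid)
{
    return !XactWasReplicationApply(xid);   /* hypothetical flag lookup */
}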

Now suppose we have a more complex topology.  Suppose we have a
cluster of four servers A .. D which, for improved tolerance against
network outages, are all connected pairwise.  Normally all the links
are up, so each server sends all the LCRs it generates directly to all
other servers.  But how do we prevent cycles?  A generates a change
and sends it to B, C, and D.  B then sees that the change came from A
so it sends it to C and D.  C, receiving that change, sees that came
from A via B, so it sends it to D again, whereupon D, which got it
from C and knows that the origin is A, sends it to B, who will then
send it right back over to D.  Obviously, we've got an infinite loop
here, so this topology will not work.  However, there are several
obvious ways to patch it by changing the LCR protocol.  Most
obviously, and somewhat stupidly, you could add a TTL.  A bit smarter,
you could have each LCR carry a LIST of node_ids that it had already
visited, refusing to send it to any node it had already been to,
instead of a single node_id.  Smarter still, you could send
handshaking messages around the cluster so that each node can build up
a spanning tree and prefix each LCR it sends with the list of
additional nodes to which the recipient must deliver it.  So,
normally, A would send a message to each of B, C, and D destined only
for that node; but if the A-C link went down, A would choose either B
or D and send each LCR to that node destined for that node *and C*;
then, that node would forward the message on to C.  Or perhaps you think this is too
complex and not worth supporting anyway, and that might be true, but
the point is that if you insist that all of the identifying
information must be carried in WAL, you've pretty much ruled it out,
because we are not going to put TTL fields, or lists of node IDs, or
lists of destinations, in WAL.  But there is no reason they can't be
attached to LCRs, which is where they are actually needed.
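
As a strawman, such a wire header might look like this (purely illustrative - 
none of these names exist anywhere):

/* hypothetical on-the-wire LCR header, deliberately separate from WAL */
typedef struct LCRMessageHeader
{
    uint32  length;         /* total message length, including payload */
    uint16  origin_node_id; /* node where the change originated */
    uint16  nvisited;       /* loop prevention: nodes already visited */
    /* nvisited node ids follow, then the change payload */
} LCRMessageHeader;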

> Or are you feeling that the "logical bits" shouldn't get
> captured in WAL altogether, so we need to fsync() them into a
> different stream of files?

No, that would be ungood.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 6:14 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I definitely agree that low-level apply is possible as a module. Sure change
> extraction needs core support but I was talking about what you need to
> implement it reusing the "plain" logical support...
>
> What I do not understand is how you want to prevent loops in a simple manner
> without in core support:
>
> A generates a HEAP_INSERT record. Gets decoded into the lcr stream as a INSERT
> action.
> B reads the lcr stream from A and applies the changes. A new HEAP_INSERT
> record. Gets decoded into the lcr stream as a INSERT action.
> A reads the lcr stream from B and ???
>
> At this point you need to prevent a loop. If you have the information where a
> change originally happened (xl_origin_id = A in this case) you can have the
> simple filter on A which ignores change records if lcr_origin_id ==
> local_replication_origin_id).

See my email to Chris Browne, which I think covers this.  It needs a
bit in WAL (per txn, or, heck, if it's one bit, maybe per record) but
not a whole node ID.

>> You need a backend-local hash table inside the wal reader process, and
>> that hash table needs to map XIDs to node IDs.  And you occasionally
>> need to prune it, so that it doesn't eat too much memory.  None of
>> that sounds very hard.
> Its not very hard. Its just more complex than what I propose(d).

True, but not a whole lot more complex, and a moderate amount of
complexity to save bit-space is a good trade.  Especially when Tom has
come down against eating up the bit space.  And I agree with him.  If
we've only got 16 bits of padding to work with, we should surely be judicious
in burning them when it can be avoided for the expense of a few
hundred lines of code.

>> > Btw, what do you mean with "conflating" the stream? I don't really see
>> > that being proposed.
>> It seems to me that you are intent on using the WAL stream as the
>> logical change stream.  I think that's a bad design.  Instead, you
>> should extract changes from WAL and then ship them around in a format
>> that is specific to logical replication.
> No, I don't want that. I think we will need some different format once we have
> agreed how changeset extraction works.

I think you are saying that you agree with me that the formats should
be different, but that the LCR format is undecided as yet.  If that is
in fact what you are saying, great.  We'll need to decide that, of
course, but I think there is a lot of cool stuff that can be done that
way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Well, the words are fuzzy, but I would define logical replication to
>> be something which is independent of the binary format in which stuff
>> gets stored on disk.  If it's not independent of the disk format, then
>> you can't do heterogenous replication (between versions, or between
>> products).  That precise limitation is the main thing that drives
>> people to use anything other than SR in the first place, IME.
> Not in mine. The main limitation I see is that you cannot write anything on
> the standby. Which sucks majorly for many things. Its pretty much impossible
> to "fix" that for SR outside of very limited cases.
> While many scenarios don't need multimaster *many* need to write outside of
> the standby's replication set.

Well, that's certainly a common problem, even if it's not IME the most
common, but I don't think we need to argue about which one is more
common, because I'm not arguing against it.  The point, though, is
that if the logical format is independent of the on-disk format, the
things we can do are a strict superset of the things we can do if it
isn't.  I don't want to insist that catalogs be the same (or else you
get garbage when you decode tuples).  I want to tolerate the fact that
they may very well be different.  That will in no way preclude writing
outside the standby's replication set, nor will it prevent
multi-master replication.  It will, however, enable heterogenous
replication, which is a very important use case.  It will also mean
that innocent mistakes (like somehow ending up with a column that is
text on one server and numeric on another server) produce
comprehensible error messages, rather than garbage.

> Its not only the logging side which is a limitation in todays replication
> scenarios. The apply side scales even worse because its *very* hard to
> distribute it between multiple backends.

I don't think that making LCR format = on-disk format is going to
solve that problem.  To solve that problem, we need to track
dependencies between transactions, so that if tuple A is modified by
T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
committed after both T1 and T2 - touches none of the same data as T1
or T2 - then we can apply it in parallel, so long as we don't commit
until T1 and T2 have committed (because allowing T3 to commit early
would produce a serialization anomaly from the point of view of a
concurrent reader).
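
In pseudo-C, the scheduling constraint might look like this (a sketch; 
ReplayTxn and touches_same_data() are invented for illustration):

typedef struct ReplayTxn ReplayTxn; /* hypothetical decoded-transaction handle */

/*
 * A transaction may start applying in parallel only if it touches none of
 * the data modified by still-pending earlier transactions; even then its
 * commit must wait for those transactions to commit, to avoid serialization
 * anomalies for concurrent readers.
 */
static bool
can_start_apply(ReplayTxn *txn, ReplayTxn **pending, int npending)
{
    int     i;

    for (i = 0; i < npending; i++)
    {
        if (touches_same_data(txn, pending[i]))
            return false;
    }
    return true;
}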

>> Because the routines that decode tuples don't include enough sanity
>> checks to prevent running off the end of the block, or even the end of
>> memory completely.  Consider a corrupt TOAST pointer that indicates
>> that there is a gigabyte of data stored in an 8kB block.  One of the
>> common symptoms of corruption IME is TOAST requests for -3 bytes of
>> memory.
> Yes, but we need to put safeguards against that sort of thing anyway. So sure,
> we can have bugs but this is not a fundamental limitation.

There's a reason we haven't done that already, though: it's probably
going to stink for performance.  If it turns out that it doesn't stink
for performance, great.  But if it causes a 5% slowdown on common use
cases, I suspect we're not gonna do it, and I bet I can construct a
case where it's worse than that (think: 400 column table with lots of
varlenas, sorting by column 400 to return column 399).  I think it's
treading on dangerous ground to assume we're going to be able to "just
go fix" this.

> Postgis uses one information table in a few more complex functions but not in
> anything low-level. Evidenced by the fact that it was totally normal for that
> to go out of sync before < 2.0.
>
> But even if such a thing would be needed, it wouldn't be problematic to make
> extension configuration tables be replicated as well.

Ugh.  That's a hack on top of a hack.  Now it all works great if type
X is installed as an extension but if it isn't installed as an
extension then the world blows up.

> I am pretty sure its not bad-behaved. But how should the code know that? You
> want each type to explictly say that its unsafe if it is?

Yes, exactly.  Or maybe there are varying degrees of non-safety,
allowing varying degrees of optimization.  Like: wire format = binary
format is super-safe.  Then having to call an I/O function that
promises not to look at any catalogs is a bit less safe.  And then
there's really unsafe.
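
Such a classification could be as simple as this (a sketch; the enum is 
invented, nothing like it exists today):

/* hypothetical per-type classification of decoding safety */
typedef enum TypeDecodeSafety
{
    TYPE_SAFE_BINARY,       /* wire format == on-disk format */
    TYPE_SAFE_NO_CATALOG,   /* output function needs no catalog access */
    TYPE_UNSAFE             /* may perform arbitrary catalog lookups */
} TypeDecodeSafety;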

> I have played with several ideas:
>
> 1.)
> keep the decoding catalog in sync with command/event triggers, correctly
> replicating oids. If those log into some internal event table its easy to keep
> the catalog in a correct transactional state because the events from that
> table get decoded in the transaction and replayed at exactly the right spot in
> there *after* it has been reassembled. The locking on the generating side
> takes care of the concurrency aspects.

I am not following this one completely.

> 2.)
> Keep the decoding site up2date by replicating the catalog via normal recovery
> mechanisms

This surely seems better than #1, since it won't do amazingly weird
things if the user bypasses the event triggers.

> 3.)
> Fully versioned catalog

One possible way of doing this would be to have the LCR generator run
on the primary, but hold back RecentGlobalXmin until it's captured the
information that it needs.  It seems like as long as tuples can't get
pruned, the information you need must still be there, as long as you
can figure out which snapshot you need to read it under.  But since
you know the commit ordering, it seems like you ought to be able to
figure out what SnapshotNow would have looked like at any given point
in the WAL stream.  So you could, at that point in the WAL stream,
read the master's catalogs under what we might call SnapshotThen.

> 4.)
> Log enough information in the walstream to make decoding possible using only
> the walstream.
>
> Advantages:
> * Decoding can optionally be done on the master
> * No catalog syncing/access required
> * its possible to make this architecture independent
>
> Disadvantage:
> * high to very high implementation overhead depending on efficiency aims
> * high space overhead in the wal because at least all the catalog information
> needs to be logged in a transactional manner repeatedly
> * misuses wal far more than other methods
> * significant new complexity in somewhat critical code paths (heapam.c)
> * insanely high space overhead if the decoding should be possible architecture
> independent

I'm not really convinced that the WAL overhead has to be that much
with this method.  Most of the information you need about the catalogs
only needs to be logged when it changes, or once per checkpoint cycle,
or once per transaction, or once per transaction per checkpoint cycle.
I will concede that it looks somewhat complex, but I am not convinced
that it's undoable.

> 5.)
> The actually good idea. Yours?

Hey, look, an elephant!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > 1. dllist.h has double the element overhead by having an inline value
>> > pointer (which is not needed when embedding) and a pointer to the list
>> > (which I have a hard time seeing as being useful)
>> > 2. only double linked list, mine provided single and double linked ones
>> > 3. missing macros to use when embedded in a larger struct (containerof()
>> > wrappers and for(...) support basically)
>> > 4. most things are external function calls...
>> > 5. way more branches/complexity in most of the functions. My
>> > implementation doesn't use any branches for the typical easy
>> > modifications (push, pop, remove element somewhere) and only one for the
>> > typical tests (empty, has-next, ...)
>> >
>> > The performance and memory aspects were crucial for the aforementioned
>> > toy project (slab allocator for postgres). Its not that crucial for the
>> > applycache where the lists currently are mostly used although its also
>> > relatively performance sensitive and obviously does a lot of list
>> > manipulation/iteration.
>> >
>> > If I had to decide I would add the missing api in dllist.h to my
>> > implementation and then remove it. Its barely used - and only in an
>> > embedded fashion - as far as I can see.
>> > I can understand though if that argument is met with doubt by others ;).
>> > If thats the way it has to go I would add some more convenience support
>> > for embedding data to dllist.h and settle for that.
>>
>> I think it might be simpler to leave the name as Dllist and just
>> overhaul the implementation along the lines you suggest, rather than
>> replacing it with something completely different.  Mostly, I don't
>> want to add a third thing if we can avoid it, given that Dllist as it
>> exists today is used only lightly.
> Well, if its the name, I have no problem with changing it, but I don't see how
> you can keep the api as it currently is and address my points.
>
> If there is some buyin I can try to go either way (keeping the existing name,
> changing the api, adjusting the callers or just adjust the callers, throw away
> the old implementation) I just don't want to get into that just to see
> somebody isn't agreeing with the fundamental idea.

My guess is that it wouldn't be too hard to remove some of the extra
pointers.  Anyone who is using Dllist as a non-inline list could be
converted to List * instead.  Also, the performance-critical things
could be reimplemented as macros.  I question, though, whether we
really need both singly and doubly linked lists.  That seems like it's
almost certainly micro-optimization that we are better off not doing.
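
To make the embedded flavor concrete, a minimal sketch of such a list with a
containerof()-style wrapper (hypothetical names; this is neither dllist.h nor
the patch's actual API):

#include <stdbool.h>
#include <stddef.h>

typedef struct dlist_node
{
    struct dlist_node *prev;
    struct dlist_node *next;
} dlist_node;

typedef struct dlist_head
{
    dlist_node  head;       /* circular; head.next == &head when empty */
} dlist_head;

static inline void
dlist_init(dlist_head *h)
{
    h->head.prev = h->head.next = &h->head;
}

/* Push to the front; note: no branches needed. */
static inline void
dlist_push_head(dlist_head *h, dlist_node *n)
{
    n->next = h->head.next;
    n->prev = &h->head;
    n->next->prev = n;
    h->head.next = n;
}

/* One branch for the typical test. */
static inline bool
dlist_is_empty(const dlist_head *h)
{
    return h->head.next == &h->head;
}

/* Recover the containing struct from the embedded node. */
#define dlist_container(type, member, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* Usage: the node lives inside the payload struct, so no separate
 * list-cell allocation is needed. */
typedef struct MyItem
{
    int         value;
    dlist_node  node;
} MyItem;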

> The most contentious point is probably relying on USE_INLINE being available
> anywhere. Which I believe to be the point now that we have gotten rid of some
> platforms.

I would be hesitant to chuck that even though I realize it's unlikely
that we really need !USE_INLINE.  But see sortsupport for an example
of how we've handled this in the recent past.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Simon Riggs <simon@2ndQuadrant.com> wrote:
>> The proposal is to use WAL to generate the logical change stream.
>> That has been shown in testing to be around x4 faster than having
>> a separate change stream, which must also be WAL logged (as Jan
>> noted).
> Sure, that's why I want it.

I think this argument is basically circular.  The reason it's 4x faster
is that the WAL stream doesn't actually contain all the information
needed to generate LCRs (thus all the angst about maintaining catalogs
in sync, what to do about unfriendly datatypes, etc).  By the time the
dust has settled and you have a workable system, you will have bloated
WAL and given back a large chunk of that multiple, thereby invalidating
the design premise.  Or at least that's my prediction.
        regards, tom lane


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 11:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Simon Riggs <simon@2ndQuadrant.com> wrote:
>>> The proposal is to use WAL to generate the logical change stream.
>>> That has been shown in testing to be around x4 faster than having
>>> a separate change stream, which must also be WAL logged (as Jan
>>> noted).
>
>> Sure, that's why I want it.
>
> I think this argument is basically circular.  The reason it's 4x faster
> is that the WAL stream doesn't actually contain all the information
> needed to generate LCRs (thus all the angst about maintaining catalogs
> in sync, what to do about unfriendly datatypes, etc).  By the time the
> dust has settled and you have a workable system, you will have bloated
> WAL and given back a large chunk of that multiple, thereby invalidating
> the design premise.  Or at least that's my prediction.

The tests were conducted with the additional field added, so your
prediction is not verified.

The additional fields do not bloat WAL records - they take up exactly
the same space as before.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Heikki Linnakangas
Date:
On 20.06.2012 01:27, Kevin Grittner wrote:
> Andres Freund<andres@2ndquadrant.com>  wrote:
>
>> Yes, thats definitely a valid use-case. But that doesn't preclude
>> the other - also not uncommon - use-case where you want to have
>> different master which all contain up2date data.
>
> I agree.  I was just saying that while one requires an origin_id,
> the other doesn't.  And those not doing MM replication definitely
> don't need it.

I think it would be helpful to list a few concrete examples of
this. The stereotypical multi-master scenario is that you have a single
table that's replicated to two servers, and you can insert/update/delete
on either server. Conflict resolution strategies vary.

The reason we need an origin id in this scenario is that otherwise this 
will happen:

1. A row is updated on node A
2. Node B receives the WAL record from A, and updates the corresponding 
row in B. This generates a new WAL record.
3. Node A receives the WAL record from B, and updates the rows again.
This again generates a new WAL record, which is replicated to B, and you
loop indefinitely.

If each WAL record carries an origin id, node A can use it to refrain 
from applying the WAL record it receives from B, which breaks the loop.
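
Expressed as code, the loop-breaking check is tiny (a sketch with
hypothetical names, not the patch's actual structures):

#include <stdint.h>

/* Hypothetical: each record carries the id of the node where the change
 * originated. */
typedef struct ChangeRecord
{
    uint16_t    origin_node_id;
    /* payload follows */
} ChangeRecord;

static void
maybe_apply(const ChangeRecord *rec, uint16_t local_node_id)
{
    /*
     * The change has come back to its origin; applying it again would
     * generate yet another record and restart the loop described above.
     */
    if (rec->origin_node_id == local_node_id)
        return;

    /* apply the change, generating local WAL, and ship it onwards */
}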

However, note that in this simple scenario, if the logical log replay / 
conflict resolution is smart enough to recognize that the row has 
already been updated, because the old and the new rows are identical, 
the loop is broken at step 3 even without the origin id. That works for 
the newest-update-wins and similar strategies. So the origin id is not 
absolutely necessary in this case.

Another interesting scenario is that you maintain a global counter, like 
in an inventory system, and conflicts are resolved by accumulating the 
updates. For example, if you do "UPDATE SET counter = counter + 1" 
simultaneously on two nodes, the result is that the counter is 
incremented by two. The avoid-update-if-already-identical optimization 
doesn't help in this case; the origin id is necessary.

Now, let's take the inventory system example further. There are actually 
two ways to update a counter. One is when an item is checked in or out 
of the warehouse, ie. "UPDATE counter = counter + 1". Those updates 
should accumulate. But another operation resets the counter to a 
specific value, "UPDATE counter = 10", like when taking an inventory. 
That should not accumulate with other changes, but should be 
newest-update-wins. The origin id is not enough for that, because by
looking at the WAL record and the origin id, you don't know which type
of update it was.
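
A sketch of that distinction, under the assumption (not something the
patches propose) that the change record carried an explicit change kind:

#include <stdint.h>

/* Hypothetical change kinds; the point is that the origin id alone
 * cannot tell these two apart. */
typedef enum CounterChangeKind
{
    CHANGE_APPLY_DELTA,         /* "counter = counter + 1": accumulate */
    CHANGE_SET_VALUE            /* "counter = 10": newest update wins */
} CounterChangeKind;

static int64_t
resolve_counter(CounterChangeKind kind, int64_t local_value, int64_t arg)
{
    if (kind == CHANGE_APPLY_DELTA)
        return local_value + arg;   /* concurrent increments add up */
    return arg;                     /* reset; the newest-wins timestamp
                                     * comparison is omitted for brevity */
}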

So, I don't like the idea of adding the origin id to the record header. 
It's only required in some occasions, and on some record types. And I'm 
worried it might not even be enough in more complicated scenarios.

Perhaps we need a more generic WAL record annotation system, where a 
plugin can tack arbitrary information to WAL records. The extra 
information could be stored in the WAL record after the rmgr payload, 
similar to how backup blocks are stored. WAL replay could just ignore 
the annotations, but a replication system could use it to store the 
origin id or whatever extra information it needs.
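
Laid out, such an annotation might look roughly like this (purely a
sketch; no such structure exists in the patches):

#include <stdint.h>

/* Hypothetical trailer appended after the rmgr payload of a WAL record,
 * analogous to a backup block.  Replay would skip ann_len bytes; a
 * replication plugin would interpret them. */
typedef struct XLogRecordAnnotation
{
    uint16_t    ann_plugin_id;  /* which plugin wrote the annotation */
    uint16_t    ann_len;        /* bytes of plugin data that follow */
    /* ann_len bytes of plugin data follow, e.g. an origin id */
} XLogRecordAnnotation;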

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 08:35, Robert Haas <robertmhaas@gmail.com> wrote:

>> I expect it would be fine to have a tool that pulls LCRs out of WAL to
>> prepare that to be sent to remote locations.  Is that what you have in
>> mind?
>
> Yes.  I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from
> the WAL format.  Trying to use the same format for both things seems
> like an unpleasant straightjacket.

You're confusing things here.

Yes, we can use a different on-the-wire format. No problem.

As I've already said, the information needs to be in WAL first before
we can put it into LCRs, so the "don't use same format" argument is
not relevant as to why the info must be on the WAL record in the first
place. And as said elsewhere, doing that does not cause problems in
any of these areas:  wal bloat, performance, long term restrictions on
numbers of nodeids, as has so far been claimed on this thread.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 14:40, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

> The reason we need an origin id in this scenario is that otherwise this will
> happen:
>
> 1. A row is updated on node A
> 2. Node B receives the WAL record from A, and updates the corresponding row
> in B. This generates a new WAL record.
> 3. Node A receives the WAL record from B, and updates the rows again. This
> again generates a new WAL record, which is replicated to B, and you loop
> indefinitely.
>
> If each WAL record carries an origin id, node A can use it to refrain from
> applying the WAL record it receives from B, which breaks the loop.
>
> However, note that in this simple scenario, if the logical log replay /
> conflict resolution is smart enough to recognize that the row has already
> been updated, because the old and the new rows are identical, the loop is
> broken at step 3 even without the origin id. That works for the
> newest-update-wins and similar strategies. So the origin id is not
> absolutely necessary in this case.

Including the origin id in the WAL allows us to filter out WAL records
when we generate LCRs, so we can completely avoid step 3, including
all of the CPU, disk and network overhead that implies. Simply put, we
know the change came from A, so no need to send the change to A again.

If we do allow step 3 to exist, we still need to send the origin id.
This is because there is no a priori way to know the origin id is not
required, such as the case where we have concurrent updates, which is
effectively a race condition between actions on separate nodes. Not
sending the origin id because it is not required in some cases is
equivalent to saying we can skip locking because a race condition does
not happen in all cases. Making a case that the race condition is rare
is still not a case for skipping locking. Same here: we need the
information to avoid making errors in the general case.

> Another interesting scenario is that you maintain a global counter, like in
> an inventory system, and conflicts are resolved by accumulating the updates.
> For example, if you do "UPDATE SET counter = counter + 1" simultaneously on
> two nodes, the result is that the counter is incremented by two. The
> avoid-update-if-already-identical optimization doesn't help in this case;
> the origin id is necessary.
>
> Now, let's take the inventory system example further. There are actually two
> ways to update a counter. One is when an item is checked in or out of the
> warehouse, ie. "UPDATE counter = counter + 1". Those updates should
> accumulate. But another operation resets the counter to a specific value,
> "UPDATE counter = 10", like when taking an inventory. That should not
> accumulate with other changes, but should be newest-update-wins. The origin
> id is not enough for that, because by looking at the WAL record and the
> origin id, you don't know which type of update it was.

Yes, of course. Conflict handling in the general case requires much
additional work. This thread is one minor change among many related
changes. The patches are being submitted in smaller chunks to ease
review, so sometimes there are cross links between things.

> So, I don't like the idea of adding the origin id to the record header. It's
> only required in some occasions, and on some record types.

That conclusion doesn't follow from your stated arguments.

> And I'm worried
> it might not even be enough in more complicated scenarios.

It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


> Perhaps we need a more generic WAL record annotation system, where a plugin
> can tack arbitrary information to WAL records. The extra information could
> be stored in the WAL record after the rmgr payload, similar to how backup
> blocks are stored. WAL replay could just ignore the annotations, but a
> replication system could use it to store the origin id or whatever extra
> information it needs.

Additional information required by logical replication will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.

One question I raised in my review was whether this extra information
should be added by a variable length header, so I already asked this
very question. So far there is no evidence that the additional code
complexity would be warranted. If it became so in the future, it can
be modified again. At this stage there is no need, so the proposal is
to add the field to every WAL record without regard to the setting of
wal_level because there is no measurable downside to doing so. The
downsides of additional complexity are clear and real however, so I
wish to avoid them.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Heikki Linnakangas
Date:
On 20.06.2012 10:32, Simon Riggs wrote:
> On 20 June 2012 14:40, Heikki Linnakangas
>> And I'm worried
>> it might not even be enough in more complicated scenarios.
>
> It is not the only required conflict mechanism, and has never been
> claimed to be so. It is simply one piece of information needed, at
> various times.

So, if the origin id is not sufficient for some conflict resolution 
mechanisms, what extra information do you need for those, and where do 
you put it?

>> Perhaps we need a more generic WAL record annotation system, where a plugin
>> can tack arbitrary information to WAL records. The extra information could
>> be stored in the WAL record after the rmgr payload, similar to how backup
>> blocks are stored. WAL replay could just ignore the annotations, but a
>> replication system could use it to store the origin id or whatever extra
>> information it needs.
>
> Additional information required by logical replication will be handled
> by a new wal_level.
>
> The discussion here is about adding origin_node_id *only*, which needs
> to be added on each WAL record.

If that's all we can discuss here, and all other options are off the
table, then I'll have to just outright object to this patch. Let's 
implement what we can without the origin id, and revisit this later.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 15:45, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20.06.2012 10:32, Simon Riggs wrote:
>>
>> On 20 June 2012 14:40, Heikki Linnakangas
>>>
>>> And I'm worried
>>>
>>> it might not even be enough in more complicated scenarios.
>>
>>
>> It is not the only required conflict mechanism, and has never been
>> claimed to be so. It is simply one piece of information needed, at
>> various times.
>
>
> So, if the origin id is not sufficient for some conflict resolution
> mechanisms, what extra information do you need for those, and where do you
> put it?

As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMSs would be to represent the query UPDATE SET counter = counter + 1
by sending just the "+1" part, not the current value of counter, as
would be the case with the non-reflexive update UPDATE SET counter = 1.

Handling such things in Postgres would require some subtlety, which
would not be resolved in the first release but is pretty certain not to
require any changes to the WAL record header as a way of resolving it.
Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the OT, but if you wish to have it
here, I won't stop you.


>>> Perhaps we need a more generic WAL record annotation system, where a
>>> plugin
>>> can tack arbitrary information to WAL records. The extra information
>>> could
>>> be stored in the WAL record after the rmgr payload, similar to how backup
>>> blocks are stored. WAL replay could just ignore the annotations, but a
>>> replication system could use it to store the origin id or whatever extra
>>> information it needs.
>>
>>
>> Additional information required by logical replication will be handled
>> by a new wal_level.
>>
>> The discussion here is about adding origin_node_id *only*, which needs
>> to be added on each WAL record.
>
>
> If that's all we can discuss here, and all other options are off the
> table, then I'll have to just outright object to this patch. Let's implement
> what we can without the origin id, and revisit this later.

As explained, we can do nothing without the origin id. It is not
optional or avoidable in the way you've described.

We have the choice to add the required information as a static or as a
variable length addition to the WAL record header. Since there is no
additional requirement for expansion of the header at this point and
no measurable harm in doing so, I suggest we avoid the complexity and
lack of robustness that a variable length header would cause. Of
course, we can go there later if needed, but there is no current need.
If a variable length header had been suggested, it is certain somebody
would have said that was overcooked and to just do it as static.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Heikki Linnakangas
Date:
On 20.06.2012 11:17, Simon Riggs wrote:
> On 20 June 2012 15:45, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 20.06.2012 10:32, Simon Riggs wrote:
>>>
>>> On 20 June 2012 14:40, Heikki Linnakangas
>>>>
>>>> And I'm worried
>>>> it might not even be enough in more complicated scenarios.
>>>
>>> It is not the only required conflict mechanism, and has never been
>>> claimed to be so. It is simply one piece of information needed, at
>>> various times.
>>
>> So, if the origin id is not sufficient for some conflict resolution
>> mechanisms, what extra information do you need for those, and where do you
>> put it?
>
> As explained elsewhere, wal_level = logical (or similar) would be used
> to provide any additional logical information required.
>
> Update and Delete WAL records already need to be different in that
> mode, so additional info would be placed there, if there were any.
>
> In the case of reflexive updates you raised, a typical response in
> other DBMS would be to represent the query
>    UPDATE SET counter = counter + 1
> by sending just the "+1" part, not the current value of counter, as
> would be the case with the non-reflexive update
>    UPDATE SET counter = 1
>
> Handling such things in Postgres would require some subtlety, which
> would not be resolved in the first release but is pretty certain not to
> require any changes to the WAL record header as a way of resolving it.
> Having already thought about it, I'd estimate that is a very long
> discussion and not relevant to the OT, but if you wish to have it
> here, I won't stop you.

Yeah, I'd like to hear briefly how you would handle that without any 
further changes to the WAL record header.

>>> Additional information required by logical replication will be handled
>>> by a new wal_level.
>>>
>>> The discussion here is about adding origin_node_id *only*, which needs
>>> to be added on each WAL record.
>>
>> If that's all we can discuss here, and all other options are off the
>> table, then I'll have to just outright object to this patch. Let's implement
>> what we can without the origin id, and revisit this later.
>
> As explained, we can do nothing without the origin id. It is not
> optional or avoidable in the way you've described.

It's only needed for multi-master replication, where the same table can 
be updated from multiple nodes. Just leave that out for now. There's 
plenty of functionality and issues left even without that.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 16:23, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

> It's only needed for multi-master replication, where the same table can be
> updated from multiple nodes. Just leave that out for now. There's plenty of
> functionality and issues left even without that.

Huh? Multi-master replication is what is being built here and many
people want that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Heikki Linnakangas
Date:
On 20.06.2012 11:34, Simon Riggs wrote:
> On 20 June 2012 16:23, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>
>> It's only needed for multi-master replication, where the same table can be
>> updated from multiple nodes. Just leave that out for now. There's plenty of
>> functionality and issues left even without that.
>
> Huh? Multi-master replication is what is being built here and many
> people want that.

Sure, but presumably you're going to implement master-slave first, and 
build multi-master on top of that. What I'm saying is that we can leave 
out the origin-id for now, since we don't have agreement on it, and 
revisit it after master-slave replication is working.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne <cbbrowne@gmail.com> 
wrote:
> > On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> 
wrote:
> >>> Btw, what do you mean with "conflating" the stream? I don't really see
> >>> that being proposed.
> >> 
> >> It seems to me that you are intent on using the WAL stream as the
> >> logical change stream.  I think that's a bad design.  Instead, you
> >> should extract changes from WAL and then ship them around in a format
> >> that is specific to logical replication.
> > 
> > Yeah, that seems worth elaborating on.
> > 
> > What has been said several times is that it's pretty necessary to
> > capture the logical changes into WAL.  That seems pretty needful, in
> > order that the replication data gets fsync()ed avidly, and so that we
> > don't add in the race condition of needing to fsync() something *else*
> > almost exactly as avidly as is the case for WAL today..
> 
> Check.
> 
> > But it's undesirable to pull *all* the bulk of contents of WAL around
> > if it's only part of the data that is going to get applied.  On a
> > "physical streaming" replica, any logical data that gets captured will
> > be useless.  And on a "logical replica," the "physical" bits of WAL
> > will be useless.
> > 
> > What I *want* you to mean is that there would be:
> > a) WAL readers that pull the "physical bits", and
> > b) WAL readers that just pull "logical bits."
> > 
> > I expect it would be fine to have a tool that pulls LCRs out of WAL to
> > prepare that to be sent to remote locations.  Is that what you have in
> > mind?
> Yes.  I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from
> the WAL format.  Trying to use the same format for both things seems
> like an unpleasant straightjacket.  This discussion illustrates why:
> we're talking about consuming scarce bit-space in WAL records for a
> feature that only a tiny minority of users will use, and it's still
> not really enough bit space.  That stinks.  If LCR transmission is a
> separate protocol, this problem can be engineered away at a higher
> level.
As I said before, I definitely agree that we want to have a separate transport 
format once we have decoding nailed down. We still need to ship wal around if 
the decoding happens in a different instance, but *after* that it can be 
shipped in something more convenient/appropriate.

> Suppose we have three servers, A, B, and C, that are doing
> multi-master replication in a loop.  A sends LCRs to B, B sends them
> to C, and C sends them back to A.  Obviously, we need to make sure
> that each server applies each set of changes just once, but it
> suffices to have enough information in WAL to distinguish between
> replication transactions and non-replication transactions - that is,
> one bit.  So suppose a change is made on server A.  A generates LCRs
> from WAL, and tags each LCR with node_id = A.  It then sends those
> LCRs to B.  B applies them, flagging the apply transaction in WAL as a
> replication transaction, AND ALSO sends the LCRs to C.  The LCR
> generator on B sees the WAL from apply, but because it's flagged as a
> replication transaction, it does not generate LCRs.  So C receives
> LCRs from B just once, without any need for the node_id to be known
> in WAL.  C can now also apply those LCRs (again flagging the apply
> transaction as replication) and it can also skip sending them to A,
> because it seems that they originated at A.
One bit is fine if you have only very simple replication topologies. Once you
think about globally distributed databases it's a bit different. You describe
some of that below, but just to reiterate:
Imagine having 6 nodes, 3 on each of two continents (ABC in North America, DEF
in Europe). You may only want to have a full intercontinental interconnect
between two of those (say A and D). If you only have one bit to represent the
origin that's not going to work, because on A you won't be able to discern the
changes coming from B and C from those originating on D, E or F.

Another interesting topology is circular replication (i.e. changes get
shipped A->B, B->C, C->A), which is sensible if you only have a low change
rate and a relatively high number of nodes, because you don't need the full
combinatorial number of connections.

You can still have the origin_ids be meaningful in the local context though. As
described before, in the communication between the different nodes you can
simply replace the 16bit node id with some fancy UUID or such, and do the
reverse when replaying LCRs.

> Now suppose we have a more complex topology.  Suppose we have a
> cluster of four servers A .. D which, for improved tolerance against
> network outages, are all connected pairwise.  Normally all the links
> are up, so each server sends all the LCRs it generates directly to all
> other servers.  But how do we prevent cycles?  A generates a change
> and sends it to B, C, and D.  B then sees that the change came from A
> so it sends it to C and D.  C, receiving that change, sees that came
> from A via B, so it sends it to D again, whereupon D, which got it
> from C and knows that the origin is A, sends it to B, who will then
> send it right back over to D.  Obviously, we've got an infinite loop
> here, so this topology will not work.  However, there are several
> obvious ways to patch it by changing the LCR protocol.  Most
> obviously, and somewhat stupidly, you could add a TTL. A bit smarter,
> you could have each LCR carry a LIST of node_ids that it had already
> visited, refusing to send it to any node it had already been to,
> instead of a single node_id.  Smarter still, you could send
> handshaking messages around the cluster so that each node can build up
> a spanning tree and prefix each LCR it sends with the list of
> additional nodes to which the recipient must deliver it.  So,
> normally, A would send a message to each of B, C, and D destined only
> for that node; but if the A-C link went down, A would choose either B
> or D and send each LCR to that node destined for that node *and C*;
> then, A would forward the message.  Or perhaps you think this is too
> complex and not worth supporting anyway, and that might be true, but
> the point is that if you insist that all of the identifying
> information must be carried in WAL, you've pretty much ruled it out,
> because we are not going to put TTL fields, or lists of node IDs, or
> lists of destinations, in WAL.  But there is no reason they can't be
> attached to LCRs, which is where they are actually needed.
Most of those topologies are possible if you have the ability to retain the 
information about where a change originated. All the more complex information 
like the list of nodes you want to apply changes from and such doesn't belong 
in the wal.

> > Or are you feeling that the "logical bits" shouldn't get
> > captured in WAL altogether, so we need to fsync() them into a
> > different stream of files?
> No, that would be ungood.
Agreed.

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 16:44, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20.06.2012 11:34, Simon Riggs wrote:
>>
>> On 20 June 2012 16:23, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>
>>> It's only needed for multi-master replication, where the same table can
>>> be
>>> updated from multiple nodes. Just leave that out for now. There's plenty
>>> of
>>> functionality and issues left even without that.
>>
>>
>> Huh? Multi-master replication is what is being built here and many
>> people want that.
>
>
> Sure, but presumably you're going to implement master-slave first, and build
> multi-master on top of that. What I'm saying is that we can leave out the
> origin-id for now, since we don't have agreement on it, and revisit it after
> master-slave replication is working.

I am comfortable with the idea of deferring applying the patch, but I
don't see any need to defer agreeing the patch is OK, so it can be
applied easily later. It does raise the question of when exactly we
would defer it to, though. When would that be?

If you have a reason for disagreement, please raise it now, having
seen explanations/comments on various concerns. Of course, people have
made initial objections, which is fine, but it's not reasonable to
assume that such complaints still stand afterwards. Perhaps there are
other thoughts?

The idea that logical rep is some kind of useful end goal in itself is
slightly misleading. If the thought is to block multi-master
completely on that basis, that would be a shame. Logical rep is the
mechanism for implementing multi-master.

Deferring this could easily end up with a huge patch in last CF, and
then it will be rejected/deferred. Patch submission here is following
the requested process - as early as possible, production ready, small
meaningful patches that build towards a final goal. This is especially
true for format changes, which is why this patch is here now. Doing it
differently just makes patch wrangling and review more difficult,
which reduces overall quality and slows down development.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture

From
Andres Freund
Date:
Hi Robert, Hi All!

On Wednesday, June 20, 2012 03:08:48 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> Well, the words are fuzzy, but I would define logical replication to
> >> be something which is independent of the binary format in which stuff
> >> gets stored on disk.  If it's not independent of the disk format, then
> >> you can't do heterogenous replication (between versions, or between
> >> products).  That precise limitation is the main thing that drives
> >> people to use anything other than SR in the first place, IME.
> > 
> > Not in mine. The main limitation I see is that you cannot write anything
> > on the standby. Which sucks majorly for many things. Its pretty much
> > impossible to "fix" that for SR outside of very limited cases.
> > While many scenarios don't need multimaster *many* need to write outside
> > of the standby's replication set.
> Well, that's certainly a common problem, even if it's not IME the most
> common, but I don't think we need to argue about which one is more
> common, because I'm not arguing against it.  The point, though, is
> that if the logical format is independent of the on-disk format, the
> things we can do are a strict superset of the things we can do if it
> isn't.  I don't want to insist that catalogs be the same (or else you
> get garbage when you decode tuples).  I want to tolerate the fact that
> they may very well be different.  That will in no way preclude writing
> outside the standby's replication set, nor will it prevent
> multi-master replication.  It will, however, enable heterogenous
> replication, which is a very important use case.  It will also mean
> that innocent mistakes (like somehow ending up with a column that is
> text on one server and numeric on another server) produce
> comprehensible error messages, rather than garbage.
I agree with most of that. I think that some parts of the above need to be
optional because you lose too much for other scenarios.
I *definitely* want to build the *infrastructure* which makes it easy to
implement all of the above, but I find it a bit much to require that from the
get-go. It's important that everything is reusable for that, yes. Does a
patchset that wants to implement tightly coupled multimaster need to implement
everything for that? No.
If we raise the barrier for anything around this topic that high we will
*NEVER* get anywhere. It's a huge topic with loads of people wanting loads of
different things. And that will hurt people wanting some feature which matches
90% of the proposed goals *far* more.

> > It's not only the logging side which is a limitation in today's replication
> > scenarios. The apply side scales even worse because its *very* hard to
> > distribute it between multiple backends.

> I don't think that making LCR format = on-disk format is going to
> solve that problem.  To solve that problem, we need to track
> dependencies between transactions, so that if tuple A is modified by
> T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
> committed after both T1 and T2 - touches none of the same data as T1
> or T2 - then we can apply it in parallel, so long as we don't commit
> until T1 and T2 have committed (because allowing T3 to commit early
> would produce a serialization anomaly from the point of view of a
> concurrent reader).
Well, doing apply at such a low level, without re-encoding the data, increased
throughput nearly threefold even for trivial types. So it pushes off the point
where we need to do the above quite a bit.

> >> Because the routines that decode tuples don't include enough sanity
> >> checks to prevent running off the end of the block, or even the end of
> >> memory completely.  Consider a corrupt TOAST pointer that indicates
> >> that there is a gigabyte of data stored in an 8kB block.  One of the
> >> common symptoms of corruption IME is TOAST requests for -3 bytes of
> >> memory.
> > Yes, but we need to put safeguards against that sort of thing anyway. So
> > sure, we can have bugs but this is not a fundamental limitation.
> There's a reason we haven't done that already, though: it's probably
> going to stink for performance.  If it turns out that it doesn't stink
> for performance, great.  But if it causes a 5% slowdown on common use
> cases, I suspect we're not gonna do it, and I bet I can construct a
> case where it's worse than that (think: 400 column table with lots of
> varlenas, sorting by column 400 to return column 399).  I think it's
> treading on dangerous ground to assume we're going to be able to "just
> go fix" this.
I am talking about ensuring that the catalog is the same on the decoding site,
not about making all decoding totally safe in the face of corrupted
information.

> > Postgis uses one information table in a few more complex functions but
> > not in anything low-level. Evidenced by the fact that it was totally
> > normal for that to go out of sync before 2.0.
> > 
> > But even if such a thing would be needed, it wouldn't be problematic to
> > make extension configuration tables be replicated as well.
> Ugh.  That's a hack on top of a hack.  Now it all works great if type
> X is installed as an extension but if it isn't installed as an
> extension then the world blows up.
Then introduce a storage attribute (or something similar) which says the
same.


> > I have played with several ideas:
> > 
> > 1.)
> > keep the decoding catalog in sync with command/event triggers, correctly
> > replicating oids. If those log into some internal event table its easy to
> > keep the catalog in a correct transactional state because the events
> > from that table get decoded in the transaction and replayed at exactly
> > the right spot in there *after* it has been reassembled. The locking on
> > the generating side takes care of the concurrency aspects.
> I am not following this one completely.
If (and yes, that's a somewhat big if) we had event triggers which can
reconstruct equivalent DDL statements with some additions to preserve oids, you
can keep a 2nd catalog in sync. That catalog can be part of a full database or
just a decoding instance.
If those event triggers log into some system table, the wal entries of those
INSERTs will be at exactly the right point in the wal stream *and* at the
right point in the transaction to apply that change exactly when you decode
(or apply) the wal contents after reassembling transactions.
That makes most (all?) of the syscache/snapshot catalog consistency problems
go away.

> > 2.)
> > Keep the decoding site up2date by replicating the catalog via normal
> > recovery mechanisms
> This surely seems better than #1, since it won't do amazingly weird
> things if the user bypasses the event triggers.
It has the disadvantage that it cannot be used to keep tightly coupled
instances in sync without a proxy instance in between.

Why should the user be able to bypass event triggers? If we design event
triggers to be bypassable by anything but explicit actions from a superuser, we
have made grave design errors.
Sure, a superuser can screw that up, but then, he already has *loads* of things
he can do to corrupt an instance.

> > 3.)
> > Fully versioned catalog
> One possible way of doing this would be to have the LCR generator run
> on the primary, but hold back RecentGlobalXmin until it's captured the
> information that it needs.  It seems like as long as tuples can't get
> pruned, the information you need must still be there, as long as you
> can figure out which snapshot you need to read it under.  But since
> you know the commit ordering, it seems like you ought to be able to
> figure out what SnapshotNow would have looked like at any given point
> in the WAL stream.  So you could, at that point in the WAL stream,
> read the master's catalogs under what we might call SnapshotThen.
Yes, I considered it before but never got comfortable enough with the idea. 
Sounds like it would involve some trickery but it might be possible. I am 
happy to go that route if people tentatively agree that the resulting
ugliness/intricate code is acceptable. Sounds like fun.

> > 4.)
> > Log enough information in the walstream to make decoding possible using
> > only the walstream.
> > 
> > Advantages:
> > * Decoding can optionally be done on the master
> > * No catalog syncing/access required
> > * its possible to make this architecture independent
> > 
> > Disadvantage:
> > * high to very high implementation overhead depending on efficiency aims
> > * high space overhead in the wal because at least all the catalog
> > information needs to be logged in a transactional manner repeatedly
> > * misuses wal far more than other methods
> > * significant new complexity in somewhat critical code paths (heapam.c)
> > * insanely high space overhead if the decoding should be possible
> > architecture independent
> 
> I'm not really convinced that the WAL overhead has to be that much
> with this method.  Most of the information you need about the catalogs
> only needs to be logged when it changes, or once per checkpoint cycle,
> or once per transaction, or once per transaction per checkpoint cycle.
>  I will concede that it looks somewhat complex, but I am not convinced
> that it's undoable.
I am not saying it's impossible to achieve only moderate space overhead, but I
have a hard time believing it's possible to do this in a manner that's
realistically implementable *and* acceptable to Tom.
I think I am more worried about the complexities introduced than the space
overhead...

> > 5.)
> > The actually good idea. Yours?
> Hey, look, an elephant!
One can dream...

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Wednesday, June 20, 2012 05:01:16 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> > 1. dllist.h has double the element overhead by having an inline value
> >> > pointer (which is not needed when embedding) and a pointer to the list
> >> > (which I have a hard time seeing as being useful)
> >> > 2. only double linked list, mine provided single and double linked
> >> > ones 3. missing macros to use when embedded in a larger struct
> >> > (containerof() wrappers and for(...) support basically)
> >> > 4. most things are external function calls...
> >> > 5. way more branches/complexity in most of the functions. My
> >> > implementation doesn't use any branches for the typical easy
> >> > modifications (push, pop, remove element somewhere) and only one for
> >> > the typical tests (empty, has-next, ...)
> >> > 
> >> > The performance and memory aspects were crucial for the aforementioned
> >> > toy project (slab allocator for postgres). Its not that crucial for
> >> > the applycache where the lists currently are mostly used although its
> >> > also relatively performance sensitive and obviously does a lot of
> >> > list manipulation/iteration.
> >> > 
> >> > If I had to decide I would add the missing api in dllist.h to my
> >> > implementation and then remove it. Its barely used - and only in an
> >> > embedded fashion - as far as I can see.
> >> > I can understand though if that argument is met with doubt by others
> >> > ;). If thats the way it has to go I would add some more convenience
> >> > support for embedding data to dllist.h and settle for that.
> >> 
> >> I think it might be simpler to leave the name as Dllist and just
> >> overhaul the implementation along the lines you suggest, rather than
> >> replacing it with something completely different.  Mostly, I don't
> >> want to add a third thing if we can avoid it, given that Dllist as it
> >> exists today is used only lightly.
> > 
> > Well, if its the name, I have no problem with changing it, but I don't
> > see how you can keep the api as it currently is and address my points.
> > 
> > If there is some buyin I can try to go either way (keeping the existing
> > name, changing the api, adjusting the callers or just adjust the
> > callers, throw away the old implementation) I just don't want to get
> > into that just to see somebody isn't agreeing with the fundamental idea.
> My guess is that it wouldn't be too hard to remove some of the extra
> pointers.  Anyone who is using Dllist as a non-inline list could be
> converted to List * instead. 
There are only three users of the whole dllist.h: catcache, autovacuum and
postmaster. The latter two just keep a list of databases around. So any change
will only be moderately intrusive.

> Also, the performance-critical things
> could be reimplemented as macros. 

> I question, though, whether we really need both singly and doubly linked
> lists.  That seems like it's almost certainly micro-optimization that we are
> better off not doing.
It was certainly worthwhile for the memory manager (lower per-allocation
overhead). You might be right that it's not worth it for many other possible
usecases in pg. It's not much code though.

*looks around*

A quick grep found:

single linked list like code:  guc_private.h, aset.c, elog.h, regguts.h (ok, 
irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
Double linked like code: shmqueue.c, lwlock.c, dynahash.c, xact.c

I stopped at that point because, while surely not all of the above usecases
could be replaced by a common implementation, several could from a quick look.
Also, several pg_list.h users could benefit from a conversion. So I think
adding a single linked list implementation is worthwhile.

> > The most contentious point is probably relying on USE_INLINE being
> > available anywhere. Which I believe to be the point now that we have
> > gotten rid of some platforms.
> I would be hesitant to chuck that even though I realize it's unlikely
> that we really need !USE_INLINE.  But see sortsupport for an example
> of how we've handled this in the recent past.
I agree it's possible to resolve this. But hell, do we really need to add all
that ugliness in 2012? I don't think it's worth the effort of supporting
ancient compilers that don't support inline anymore. If we could stop catering
for probably non-existing compilers we could remove some *very* ugly long
macros (e.g. in htup.h).

If support for !USE_INLINE is required I would prefer to have a header define
the functions like

#ifdef USE_INLINE
#define OPTIONALLY_INLINE static inline
#define USE_LINKED_LIST_IMPL
#endif

#ifdef USE_LINKED_LIST_IMPL

OPTIONALLY_INLINE void myFuncCall(){
...
}
#endif

which then gets included with #define USE_LINKED_LIST_IMPL by some c file 
defining OPTIONALLY_INLINE to something empty if !USE_INLINE.
It's too much code to duplicate imo.
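
Concretely, the !USE_INLINE fallback would then be a single C file along
these lines (a sketch using the hypothetical names from the example above):

/* list_impl.c -- hypothetical fallback for compilers without inline.
 * When USE_INLINE is defined, the header emits static inline bodies
 * into every includer, this file included, and nothing extra is
 * requested here. */
#ifndef USE_INLINE
#define OPTIONALLY_INLINE           /* plain external linkage */
#define USE_LINKED_LIST_IMPL        /* ask the header for the bodies */
#endif

#include "list_impl.h"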

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> My guess is that it wouldn't be too hard to remove some of the extra
>> pointers.  Anyone who is using Dllist as a non-inline list could be
>> converted to List * instead.
> There are only three users of the whole dllist.h: catcache, autovacuum and
> postmaster. The latter two just keep a list of databases around. So any change
> will only be moderately intrusive.

Yeah.

>> Also, the performance-critical things
>> could be reimplemented as macros.
>
>> I question, though, whether we really need both singly and doubly linked
>> lists.  That seems like it's almost certainly micro-optimization that we are
>> better off not doing.
> It was certainly worthwhile for the memory manager (lower per-allocation
> overhead). You might be right that it's not worth it for many other possible
> usecases in pg. It's not much code though.
>
> *looks around*
>
> A quick grep found:
>
> single linked list like code:  guc_private.h, aset.c, elog.h, regguts.h (ok,
> irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
> Double linked like code: shmqueue.c, lwlock.c, dynahash.c, xact.c
>
> I stopped at that point because, while surely not all of the above usecases
> could be replaced by a common implementation, several could from a quick look.
> Also, several pg_list.h users could benefit from a conversion. So I think
> adding a single linked list implementation is worthwhile.

I can believe that, although I fear it may be a distraction in the
grand scheme of getting logical replication implemented.  There should
be very few places where this is actually performance-critical, and
Tom will complain about large amounts of code churn that don't improve
performance.

If we're going to do that, how about transforming dllist.h into the
doubly-linked list and adding sllist.h for the singly-linked list?

>> > The most contentious point is probably relying on USE_INLINE being
>> > available anywhere. Which I believe to be the point now that we have
>> > gotten rid of some platforms.
>> I would be hesitant to chuck that even though I realize it's unlikely
>> that we really need !USE_INLINE.  But see sortsupport for an example
>> of how we've handled this in the recent past.
> I agree it's possible to resolve this. But hell, do we really need to add all
> that ugliness in 2012? I don't think it's worth the effort of supporting
> ancient compilers that don't support inline anymore. If we could stop catering
> for probably non-existing compilers we could remove some *very* ugly long
> macros (e.g. in htup.h).

I don't feel qualified to make a decision on this one, so will defer
to the opinions of others.

> If support for !USE_INLINE is required I would prefer to have a header define
> the functions like
>
> #ifdef USE_INLINE
> #define OPTIONALLY_INLINE static inline
> #define USE_LINKED_LIST_IMPL
> #endif
>
> #ifdef USE_LINKED_LIST_IMPL
>
> OPTIONALLY_INLINE void myFuncCall(){
> ...
> }
> #endif
>
> which then gets included with #define USE_LINKED_LIST_IMPL by some c file
> defining OPTIONALLY_INLINE to something empty if !USE_INLINE.
> Its too much code to duplicate imo.

Neat trick.  Maybe we should revise the sortsupport stuff to do it that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> As I said before, I definitely agree that we want to have a separate transport
> format once we have decoding nailed down. We still need to ship wal around if
> the decoding happens in a different instance, but *after* that it can be
> shipped in something more convenient/appropriate.

Right, OK, we agree on this then.

> One bit is fine if you have only very simple replication topologies. Once you
> think about globally distributed databases it's a bit different. You describe
> some of that below, but just to reiterate:
> Imagine having 6 nodes, 3 on each of two continents (ABC in North America, DEF
> in Europe). You may only want to have a full intercontinental interconnect
> between two of those (say A and D). If you only have one bit to represent the
> origin that's not going to work, because on A you won't be able to discern the
> changes coming from B and C from those originating on D, E or F.

I don't see the problem.  A certainly knows via which link the LCRs arrived.

So: change happens on A.  A sends the change to B, C, and D.  B and C
apply the change.  One bit is enough to keep them from regenerating
new LCRs that get sent back to A.  So they're fine.  D also receives
the changes (from A) and applies them, but it also does not need to
regenerate LCRs.  Instead, it can take the LCRs that it has already
got (from A) and send those to E and F.

Or: change happens on B.  B sends the changes to A.  Since A knows the
network topology, it sends the changes to C and D.  D sends them to E
and F.  Nobody except B needs to *generate* LCRs.  All any other node
needs to do is suppress *redundant* LCR generation.

> Another interesting topology is circular replication (i.e. changes get
> shipped A->B, B->C, C->A), which is sensible if you only have a low change
> rate and a relatively high number of nodes, because you don't need the full
> combinatorial number of connections.

I think this one is OK too.  You just generate LCRs on the origin node
and then pass them around the ring at every step.  When the next hop
would be the origin node then you're done.

I think you may be imagining that A generates LCRs and sends them to
B.  B applies them, and then from the WAL just generated, it produces
new LCRs which then get sent to C.  If you do that, then, yes,
everything that you need to disentangle various network topologies
must be present in WAL.  But what I'm saying is: don't do it like
that.  Generate the LCRs just ONCE, at the origin node, and then pass
them around the network, applying them at every node.  Then, the
information that is needed in WAL is confined to one bit: the
knowledge of whether or not a particular transaction is local (and
thus LCRs should be generated) or non-local (and thus they shouldn't,
because the origin already generated them and thus we're just handing
them around to apply everywhere).
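
In code, the WAL-side requirement of this scheme really is just one flag
(a sketch with hypothetical names):

#include <stdbool.h>

/* Hypothetical per-transaction flag recorded in WAL. */
typedef struct TxnFlags
{
    bool        is_replication_txn;     /* set while applying remote LCRs */
} TxnFlags;

static bool
should_generate_lcrs(const TxnFlags *txn)
{
    /*
     * LCRs are generated exactly once, at the origin.  An apply
     * transaction replays LCRs that already exist, and forwards those
     * same LCRs downstream instead of regenerating them from its WAL.
     */
    return !txn->is_replication_txn;
}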

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Wednesday, June 20, 2012 02:51:30 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> Also, the performance-critical things
> >> could be reimplemented as macros.
> >> 
> >> I question, though, whether we really need both singly and doubly linked
> >> lists.  That seems like it's almost certainly micro-optimization that we
> >> are better off not doing.
> > 
> > It was certainly worthwhile for the memory manager (lower per-allocation
> > overhead). You might be right that it's not worth it for many other
> > possible use cases in pg. It's not much code though.
> > 
> > *looks around*
> > 
> > A quick grep found:
> > 
> > singly-linked-list-like code:  guc_private.h, aset.c, elog.h, regguts.h
> > (ok, irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c
> > doubly-linked-list-like code: shmqueue.c, lwlock.c, dynahash.c, xact.c
> > 
> > I stopped at that point because, while surely not all of the above
> > use cases could be replaced by a common implementation, several could,
> > from a quick look. Also, several pg_list.h users could benefit from a
> > conversion. So I think adding a singly linked list implementation is
> > worthwhile.
> 
> I can believe that, although I fear it may be a distraction in the
> grand scheme of getting logical replication implemented.  There should
> be very few places where this is actually performance-critical, and
> Tom will complain about large amounts of code churn that don't improve
> performance.
Uh. I don't want to just go around and replace anything randomly. Actually I
don't want to change anything for now except what's necessary to get the patch
in. The point I tried to make was just that the relatively widespread usage of
similar structures makes it likely that this can be used in more places in the
future.

> If we're going to do that, how about transforming dllist.h into the
> doubly-linked list and adding sllist.h for the singly-linked list?
I would be fine with that.

I will go and try to cook up a patch, assuming for now that we can rely on
inline; the ugliness can be added back afterwards.

> >> > The most contentious point is probably relying on USE_INLINE being
> >> > available everywhere. Which I believe to be the case now that we have
> >> > gotten rid of some platforms.

> >> I would be hesitant to chuck that even though I realize it's unlikely
> >> that we really need !USE_INLINE.  But see sortsupport for an example
> >> of how we've handled this in the recent past.
> > 
> > I agree it's possible to resolve this. But hell, do we really need to add
> > all that ugliness in 2012? I don't think it's worth the effort to support
> > ancient compilers that don't provide inline anymore. If we could stop
> > catering for probably non-existent compilers we could remove some *very*
> > ugly long macros (e.g. in htup.h).
> 
> I don't feel qualified to make a decision on this one, so will defer
> to the opinions of others.
Ok.

> > If support for !USE_INLINE is required I would prefer to have a header
> > define the functions like
> > 
> > #ifdef USE_INLINE
> > #define OPTIONALLY_INLINE static inline
> > #define USE_LINKED_LIST_IMPL
> > #endif
> > 
> > #ifdef USE_LINKED_LIST_IMPL
> > 
> > OPTIONALLY_INLINE void myFuncCall(void)
> > {
> > ...
> > }
> > #endif
> > 
> > which then gets included, with USE_LINKED_LIST_IMPL defined, by some C
> > file that defines OPTIONALLY_INLINE to empty if !USE_INLINE.
> > It's too much code to duplicate imo.
> 
> Neat trick.  Maybe we should revise the sortsupport stuff to do it that
> way.
Either that or at least add a comment to both that it's duplicated...
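
For reference, a minimal sketch of the complete arrangement, with made-up
file and function names (a trivial singly linked list; include guards
omitted, and nothing here is taken from the actual patch):

    /* ilist.h (sketch) */
    typedef struct slist_node { struct slist_node *next; } slist_node;
    typedef struct slist_head { slist_node head; } slist_head;

    #ifdef USE_INLINE
    #define OPTIONALLY_INLINE static inline
    #define ILIST_INCLUDE_DEFINITIONS
    #else
    extern void slist_push_head(slist_head *head, slist_node *node);
    #endif

    #ifdef ILIST_INCLUDE_DEFINITIONS
    OPTIONALLY_INLINE void
    slist_push_head(slist_head *head, slist_node *node)
    {
        node->next = head->head.next;
        head->head.next = node;
    }
    #endif

    /* ilist.c (sketch): emits out-of-line definitions when inlining is
     * unavailable; with USE_INLINE the header already provided them. */
    #ifndef USE_INLINE
    #define OPTIONALLY_INLINE       /* expands to nothing */
    #define ILIST_INCLUDE_DEFINITIONS
    #endif
    #include "ilist.h"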

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The idea that logical rep is some kind of useful end goal in itself is
> slightly misleading. If the thought is to block multi-master
> completely on that basis, that would be a shame. Logical rep is the
> mechanism for implementing multi-master.

If you're saying that single-master logical replication isn't useful,
I disagree.  Of course, having both single-master and multi-master
replication together is even more useful.  But I think getting even
single-master logical replication working well in a single release
cycle is going to be a job and a half.  Thinking that we're going to
get MMR in one release is not realistic.  The only way to make it
realistic is to put MMR ahead of every other goal that people have for
logical replication, including robustness and stability.  It's
entirely premature to be designing features for MMR when we don't even
have the design for SMR nailed down yet.  And that's even assuming we
EVER want MMR in core, which has not even really been argued, let
alone agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 03:19:55 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The idea that logical rep is some kind of useful end goal in itself is
> > slightly misleading. If the thought is to block multi-master
> > completely on that basis, that would be a shame. Logical rep is the
> > mechanism for implementing multi-master.
> 
> If you're saying that single-master logical replication isn't useful,
> I disagree.  Of course, having both single-master and multi-master
> replication together is even more useful.  But I think getting even
> single-master logical replication working well in a single release
> cycle is going to be a job and a half.  Thinking that we're going to
> get MMR in one release is not realistic.  The only way to make it
> realistic is to put MMR ahead of every other goal that people have for
> logical replication, including robustness and stability.  It's
> entirely premature to be designing features for MMR when we don't even
> have the design for SMR nailed down yet.  And that's even assuming we
> EVER want MMR in core, which has not even really been argued, let
> alone agreed.
I agree it has not been agreed upon, but I certainly would consider
submitting a prototype implementing it an argument for doing it ;)

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Uh. I don't want to just go around and replace anything randomly. Actually I
> don't want to change anything for now except whats necessary to get the patch
> in. The point I tried to make was just that the relatively widespread usage of
> similar structure make it likely that it can be used in more places in future.

Well, the question is, for anywhere you might be thinking of using
this: why not just use List?  We do that in a lot of other places, and
there's not much reason to invent something new unless there is a
problem with what we already have.  I assume this is related to
logical replication somehow, but it's not clear to me exactly what
problem you hit doing this in the obvious way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 21:19, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The idea that logical rep is some kind of useful end goal in itself is
>> slightly misleading. If the thought is to block multi-master
>> completely on that basis, that would be a shame. Logical rep is the
>> mechanism for implementing multi-master.
>
> If you're saying that single-master logical replication isn't useful,
> I disagree.  Of course, having both single-master and multi-master
> replication together is even more useful.

> But I think getting even
> single-master logical replication working well in a single release
> cycle is going to be a job and a half.

OK, so your estimate is 1.5 people to do that. And if we have more
people, should they sit around doing nothing?

> Thinking that we're going to
> get MMR in one release is not realistic.

If you block it, then the above becomes true, whether or not it starts true.

You may not want MMR, but others do. I see no reason to prevent people
from having it, which is what you suggest.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Wednesday, June 20, 2012 03:24:58 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Uh. I don't want to just go around and replace anything randomly.
> > Actually I don't want to change anything for now except what's necessary
> > to get the patch in. The point I tried to make was just that the
> > relatively widespread usage of similar structures makes it likely that
> > this can be used in more places in the future.
> 
> Well, the question is, for anywhere you might be thinking of using
> this: why not just use List?  We do that in a lot of other places, and
> there's not much reason to invent something new unless there is a
> problem with what we already have.  I assume this is related to
> logical replication somehow, but it's not clear to me exactly what
> problem you hit doing this in the obvious way.
It incurs a rather high performance overhead due to added memory allocations
and added pointer indirections. That's fine for most of the current users of
the List interface, but certainly not for all. In other places you cannot even
have memory allocations because the list lives in shared memory.

E.g. in the ApplyCache, where I use the submitted ilist.h stuff, when
reconstructing transactions you add to a potentially really long linked list
of individual changes for every interesting WAL record. Before I prevented
memory allocations in that path, they took about 12-14% of the time when
applying changes in the same backend. Afterwards they weren't visible in the
profile anymore.

Several of the pieces of code I pointed out in a previous email use open-coded
list implementations exactly to avoid those problems.

If you look at the parsing, planning & execution of trivial statements you
will also notice the overhead of memory allocations. A good bit of that is
caused by list manipulation. Check Stephen Frost's "Pre-alloc ListCell's
optimization" thread for workarounds.

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 June 2012 21:19, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> The idea that logical rep is some kind of useful end goal in itself is
>>> slightly misleading. If the thought is to block multi-master
>>> completely on that basis, that would be a shame. Logical rep is the
>>> mechanism for implementing multi-master.
>>
>> If you're saying that single-master logical replication isn't useful,
>> I disagree.  Of course, having both single-master and multi-master
>> replication together is even more useful.
>
>> But I think getting even
>> single-master logical replication working well in a single release
>> cycle is going to be a job and a half.
>
> OK, so your estimate is 1.5 people to do that. And if we have more
> people, should they sit around doing nothing?

Oh, give me a break.  You're willfully missing my point.  And to quote
Fred Brooks, nine women can't make a baby in one month.

>> Thinking that we're going to
>> get MMR in one release is not realistic.
>
> If you block it, then the above becomes true, whether or not it starts true.

If I had no rational basis for my objections, that would be true.
You've got four people objecting to this patch now, all of whom happen
to be committers.  Whether or not MMR goes into core, who knows, but
it doesn't seem that this patch is going to fly.

My main point in bringing this up is that if you pick a project that
is too large, you will fail.  As I would rather see this project
succeed, I recommend that you don't do that.  Both you and Andres seem
to believe that MMR is a reasonable first target to shoot at, but I
don't think anyone else - including the Slony developers who have
commented on this issue - endorses that position.  At PGCon, you were
talking about getting a new set of features into PG over the next 3-5
years.  Now, it seems like you want to compress that timeline to a
year.  I don't think that's going to work.  You also requested that
people tell you sooner when large patches were in danger of not making
the release.  Now I'm doing that, VERY early, and you're apparently
angry about it.  If the only satisfactory outcome of this conversation
is that everyone agrees with the design pattern you've already decided
on, then you haven't left yourself very much room to walk away
satisfied.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 03:02:28 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > One bit is fine if you have only very simple replication topologies. Once
> > you think about globally distributed databases it's a bit different. You
> > describe some of that below, but just to reiterate:
> > Imagine having 6 nodes, 3 on each of two continents (ABC in North America,
> > DEF in Europe). You may only want to have full intercontinental
> > interconnect between two of those (say A and D). If you only have one
> > bit to represent the origin, that's not going to work, because on A you
> > won't be able to discern the changes from B and C from those originating
> > on D, E, or F.
> 
> I don't see the problem.  A certainly knows via which link the LCRs
> arrived.

> So: change happens on A.  A sends the change to B, C, and D.  B and C
> apply the change.  One bit is enough to keep them from regenerating
> new LCRs that get sent back to A.  So they're fine.  D also receives
> the changes (from A) and applies them, but it also does not need to
> regenerate LCRs.  Instead, it can take the LCRs that it has already
> got (from A) and send those to E and F.

> Or: change happens on B.  B sends the changes to A.  Since A knows the
> network topology, it sends the changes to C and D.  D sends them to E
> and F.  Nobody except B needs to *generate* LCRs.  All any other node
> needs to do is suppress *redundant* LCR generation.
> 
> > Another topology which is interesting is circular replication (i.e.
> > changes get shipped A->B, B->C, C->A), which is a sensible topology if
> > you only have a low change rate and a relatively high number of nodes,
> > because you don't need the full combinatorial number of connections.
> 
> I think this one is OK too.  You just generate LCRs on the origin node
> and then pass them around the ring at every step.  When the next hop
> would be the origin node, you're done.
> 
> I think you may be imagining that A generates LCRs and sends them to
> B.  B applies them, and then from the WAL just generated, it produces
> new LCRs which then get sent to C. 
Yes, that's what I am proposing.

> If you do that, then, yes,
> everything that you need to disentangle various network topologies
> must be present in WAL.  But what I'm saying is: don't do it like
> that.  Generate the LCRs just ONCE, at the origin node, and then pass
> them around the network, applying them at every node.  Then, the
> information that is needed in WAL is confined to one bit: the
> knowledge of whether or not a particular transaction is local (and
> thus LCRs should be generated) or non-local (and thus they shouldn't,
> because the origin already generated them and thus we're just handing
> them around to apply everywhere).
Sure, you can do it that way, but I don't think it's a good idea. If you do it
my way you *guarantee* that when replaying changes from node B on node C you
have replayed changes from A at least as far as B has. That's a really nice
property for MM.
You *can* get the same with your solution, but it starts to get complicated
rather fast, while my/our proposed solution is trivial to implement.

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Hannu Valtonen
Date:

On 06/19/2012 01:47 AM, Christopher Browne wrote:
> That numbering scheme gets pretty anti-intuitive fairly quickly,
> from whence we took the approach of having a couple digits
> indicating data centre followed by a digit indicating which node in
> that data centre.
> 
> If that all sounds incoherent, well, the more nodes you have
> around, the more difficult it becomes to make sure you *do* have a
> coherent picture of your cluster.
> 
> I recall the Slony-II project having a notion of attaching a
> permanent UUID-based node ID to each node.  As long as there is
> somewhere decent to find a symbolically significant node "name," I
> like the idea of the ID *not* being in a tiny range, and being
> UUID/OID-like...
> 

Just as a side note, MySQL's new global transaction IDs use a UUID for
the server-id part of it. [1]

- Hannu Valtonen

[1]
http://d2-systems.blogspot.fi/2012/04/global-transaction-identifiers-are-in.html


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> If you do that, then, yes,
>> everything that you need to disentangle various network topologies
>> must be present in WAL.  But what I'm saying is: don't do it like
>> that.  Generate the LCRs just ONCE, at the origin node, and then pass
>> them around the network, applying them at every node.  Then, the
>> information that is needed in WAL is confined to one bit: the
>> knowledge of whether or not a particular transaction is local (and
>> thus LCRs should be generated) or non-local (and thus they shouldn't,
>> because the origin already generated them and thus we're just handing
>> them around to apply everywhere).
> Sure, you can do it that way, but I don't think it's a good idea. If you do it
> my way you *guarantee* that when replaying changes from node B on node C you
> have replayed changes from A at least as far as B has. That's a really nice
> property for MM.
> You *can* get the same with your solution, but it starts to get complicated
> rather fast, while my/our proposed solution is trivial to implement.

That's an interesting point.  I agree that's a useful property, and
might be a reason not to just use a single-bit flag, but I still think
you'd be better off handling that requirement via some other method,
like logging the node ID once per transaction or something.  That lets
you have as much metadata as you end up needing, which is a lot more
flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
said.
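
A sketch of what "logging the node ID once per transaction" might look
like; the record layout and names here are invented purely for
illustration, not taken from any patch:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /*
     * Hypothetical payload of a WAL record written once at the start of
     * replaying a remotely-originated transaction, instead of tagging
     * every record's header.  Being an ordinary record payload, the node
     * id can be as wide as it ever needs to be.
     */
    typedef struct xl_txn_origin
    {
        TransactionId   xid;            /* transaction being replayed    */
        uint64_t        origin_node_id; /* node where it originated      */
        uint64_t        origin_lsn;     /* origin progress, for ordering */
    } xl_txn_origin;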

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 03:42:39 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 20 June 2012 21:19, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >>> The idea that logical rep is some kind of useful end goal in itself is
> >>> slightly misleading. If the thought is to block multi-master
> >>> completely on that basis, that would be a shame. Logical rep is the
> >>> mechanism for implementing multi-master.
> >> 
> >> If you're saying that single-master logical replication isn't useful,
> >> I disagree.  Of course, having both single-master and multi-master
> >> replication together is even more useful.
> >> 
> >> But I think getting even
> >> single-master logical replication working well in a single release
> >> cycle is going to be a job and a half.
> >> Thinking that we're going to
> >> get MMR in one release is not realistic.
> > If you block it, then the above becomes true, whether or not it starts
> > true.
> My main point in bringing this up is that if you pick a project that
> is too large, you will fail.
We're not the only ones here performing scope creep though... I think
just about all the people who have posted in the whole thread, except maybe
Tom and Marko, are guilty of doing so.

I still think it's rather sensible to focus on exactly duplicated schemas in a
very first version, just because that leaves out some of the complexity while
paving the road for other nice things.

> You've got four people objecting to this patch now, all of whom happen
> to be committers.  Whether or not MMR goes into core, who knows, but
> it doesn't seem that this patch is going to fly.
I find that a bit too early to say. Sure, it won't fly exactly as proposed, but
hell, who cares? What I want to get in is a solution to the specific problem
the patch targets. At least you have (not sure about others) accepted that the
problem needs a solution.
We do not agree yet what that solution should look like, but that's not exactly
surprising, as we started discussing the problem only a good day ago.

If people agree that your proposed way of just one flag bit is the way to go,
we will have to live with that. But that's different from saying the whole
thing is dead.

> As I would rather see this project
> succeed, I recommend that you don't do that.  Both you and Andres seem
> to believe that MMR is a reasonable first target to shoot at, but I
> don't think anyone else - including the Slony developers who have
> commented on this issue - endorses that position. 
I don't think we will get full MMR into 9.3. What I am proposing is that we
build in the few pieces that are required to implement MMR *on top* of what's
hopefully in 9.3.
And I think that's a realistic goal.

> At PGCon, you were
> talking about getting a new set of features into PG over the next 3-5
> years.  Now, it seems like you want to compress that timeline to a
> year.
Well, I obviously would like everything to be done in one release, but I would
also like to go hiking for a year, have a restored sailing boat and some
more. That doesn't make it reality...
To make it absolutely clear: I definitely don't think it's realistic to have
everything in 9.3, and I don't think Simon does either. What I want is to
have the basic building blocks in 9.3.
The difference probably is just what counts as a building block.

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 03:54:43 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> If you do that, then, yes,
> >> everything that you need to disentangle various network topologies
> >> must be present in WAL.  But what I'm saying is: don't do it like
> >> that.  Generate the LCRs just ONCE, at the origin node, and then pass
> >> them around the network, applying them at every node.  Then, the
> >> information that is needed in WAL is confined to one bit: the
> >> knowledge of whether or not a particular transaction is local (and
> >> thus LCRs should be generated) or non-local (and thus they shouldn't,
> >> because the origin already generated them and thus we're just handing
> >> them around to apply everywhere).
> > 
> > Sure, you can do it that way, but I don't think it's a good idea. If you
> > do it my way you *guarantee* that when replaying changes from node B on
> > node C you have replayed changes from A at least as far as B has. That's
> > a really nice property for MM.
> > You *can* get the same with your solution, but it starts to get
> > complicated rather fast, while my/our proposed solution is trivial to
> > implement.
> 
> That's an interesting point.  I agree that's a useful property, and
> might be a reason not to just use a single-bit flag, but I still think
> you'd be better off handling that requirement via some other method,
> like logging the node ID once per transaction or something.  That lets
> you have as much metadata as you end up needing, which is a lot more
> flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
> said.
If it comes down to that, I can definitely live with it. I still think making
the filtering trivial, so it can be done at a low level without any logic, is a
very desirable property, but if not, so be it.

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 21:42, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 20 June 2012 21:19, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> The idea that logical rep is some kind of useful end goal in itself is
>>>> slightly misleading. If the thought is to block multi-master
>>>> completely on that basis, that would be a shame. Logical rep is the
>>>> mechanism for implementing multi-master.
>>>
>>> If you're saying that single-master logical replication isn't useful,
>>> I disagree.  Of course, having both single-master and multi-master
>>> replication together is even more useful.
>>
>>> But I think getting even
>>> single-master logical replication working well in a single release
>>> cycle is going to be a job and a half.
>>
>> OK, so your estimate is 1.5 people to do that. And if we have more
>> people, should they sit around doing nothing?
>
> Oh, give me a break.  You're willfully missing my point.  And to quote
> Fred Brooks, nine women can't make a baby in one month.


No, I'm not. The question is not how quickly can N people achieve a
single thing, but how long will it take a few skilled people working
on carefully selected tasks that have few dependencies between them to
achieve something.

We have significantly more preparation, development time and resources
than any other project previously undertaken for PostgreSQL that I am
aware of.

Stating that it is impossible to perform a task in a certain period of
time without even considering those points is clearly rubbish. I've
arrived at my thinking based upon detailed project planning of what
was possible in the time.

How exactly did you arrive at your conclusion? Why is yours right and
mine wrong?



>>> Thinking that we're going to
>>> get MMR in one release is not realistic.
>>
>> If you block it, then the above becomes true, whether or not it starts true.
>
> If I had no rational basis for my objections, that would be true.
> You've got four people objecting to this patch now, all of whom happen
> to be committers.  Whether or not MMR goes into core, who knows, but
> it doesn't seem that this patch is going to fly.

No, I have four people who had initial objections and who have not
commented on the fact that the points made are regrettably incorrect.
I don't expect everybody commenting on the design to have perfect
knowledge of the whole design, so I expect people to make errors in
their comments. I also expect people to take note of what has been
said before making further objections or drawing conclusions.

Since at least 3 of the people making such comments did not attend the
full briefing meeting in Ottawa, I am not particularly surprised.
However, I do expect people that didn't come to the meeting to
recognise that they are likely to be missing information and to listen
closely, as I listen to them.

"When the facts change, I change my mind. What do you do, sir?"


> My main point in bringing this up is that if you pick a project that
> is too large, you will fail.  As I would rather see this project
> succeed, I recommend that you don't do that.  Both you and Andres seem
> to believe that MMR is a reasonable first target to shoot at, but I
> don't think anyone else - including the Slony developers who have
> commented on this issue - endorses that position.  At PGCon, you were
> talking about getting a new set of features into PG over the next 3-5
> years.  Now, it seems like you want to compress that timeline to a
> year.  I don't think that's going to work.  You also requested that
> people tell you sooner when large patches were in danger of not making
> the release.  Now I'm doing that, VERY early, and you're apparently
> angry about it.  If the only satisfactory outcome of this conversation
> is that everyone agrees with the design pattern you've already decided
> on, then you haven't left yourself very much room to walk away
> satisfied.

Note that I have already given review feedback to Andres myself, and
that change has visibly occurred during this thread via public debate.

Claiming that I only stick to what has already been decided is
patently false, with me at least.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 16:23, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20.06.2012 11:17, Simon Riggs wrote:
>>
>> On 20 June 2012 15:45, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> On 20.06.2012 10:32, Simon Riggs wrote:
>>>>
>>>>
>>>> On 20 June 2012 14:40, Heikki Linnakangas
>>>>>
>>>>>
>>>>> And I'm worried
>>>>> it might not even be enough in more complicated scenarios.
>>>>
>>>>
>>>> It is not the only required conflict mechanism, and has never been
>>>> claimed to be so. It is simply one piece of information needed, at
>>>> various times.
>>>
>>>
>>> So, if the origin id is not sufficient for some conflict resolution
>>> mechanisms, what extra information do you need for those, and where do
>>> you
>>> put it?
>>
>>
>> As explained elsewhere, wal_level = logical (or similar) would be used
>> to provide any additional logical information required.
>>
>> Update and Delete WAL records already need to be different in that
>> mode, so additional info would be placed there, if there were any.
>>
>> In the case of reflexive updates you raised, a typical response in
>> other DBMSs would be to represent the query
>>   UPDATE SET counter = counter + 1
>> by sending just the "+1" part, not the current value of counter, as
>> would be the case with the non-reflexive update
>>   UPDATE SET counter = 1
>>
>> Handling such things in Postgres would require some subtlety, which
>> would not be resolved in the first release but is pretty certain not to
>> require any changes to the WAL record header as a way of resolving it.
>> Having already thought about it, I'd estimate that is a very long
>> discussion and not relevant to the OT, but if you wish to have it
>> here, I won't stop you.
>
>
> Yeah, I'd like to hear briefly how you would handle that without any further
> changes to the WAL record header.

I already did:

>> Update and Delete WAL records already need to be different in that
>> mode, so additional info would be placed there, if there were any.

The case you mentioned relates to UPDATEs only, so I would suggest
that we add that information to a new form of update record only.

That has nothing to do with the WAL record header.
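
For illustration only, a sketch of how such an extended update record might
distinguish the two forms; the layout and names are invented here, not taken
from any patch:

    #include <stdint.h>

    typedef enum
    {
        LOGICAL_UPDATE_NEW_VALUE,   /* carries the new column value */
        LOGICAL_UPDATE_DELTA        /* carries only the "+1" part   */
    } LogicalUpdateKind;

    typedef struct xl_logical_update
    {
        LogicalUpdateKind kind;
        int16_t     attno;          /* which column is being set    */
        int64_t     value;          /* new value, or delta to apply */
        /* ... key columns identifying the row would follow ... */
    } xl_logical_update;

An applier would execute "counter = counter + value" for the DELTA form,
which commutes across nodes, instead of overwriting the column with a
possibly stale absolute value.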

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Were not the only ones here that are performing scope creep though... I think
> about all people who have posted in the whole thread except maybe Tom and
> Marko are guilty of doing so.
>
> I still think its rather sensible to focus on exactly duplicated schemas in a
> very first version just because that leaves out some of the complexity while
> paving the road for other nice things.

Well, I guess what I want to know is: what does focusing on exactly
duplicated schemas mean?  If it means we'll disable DDL for tables
when we turn on replication, that's basically the Slony approach: when
you want to make a DDL change, you have to quiesce replication, do it,
and then resume replication.  I would possibly be OK with that
approach.  If it means that we'll hope that the schemas are duplicated
and start spewing garbage data when they're not, then I'm definitely
not OK with that approach.  If it means using event triggers to keep
the catalogs synchronized, then I don't think that's adequately
robust.  The user could add more event triggers that run before or
after the ones the replication system adds, and then you are back to
garbage decoding (or crashes).  They could also modify the catalogs
directly, although it's possible we don't care quite as much about that
case (but on the other hand people do sometimes need to do it to solve
real problems).  Although I am 100% OK with paring back the initial
feature set - indeed, I strongly recommend it - I think that robustness
is not a feature which can be left out in v1 and added in later.  All
the robustness has to be designed in at the start, or we will never
have it.

On the whole, I think we're spending far too much time talking about
code and far too little time talking about what the overall design
should look like.  We are having a discussion about whether or not MMR
should be supported by sticking a 16-bit node ID into every WAL record
without having first decided whether we should support MMR, whether
that requires node IDs, whether they should be integers, whether those
integers should be 16 bits in size, whether they should be present in
WAL, and whether or not the record header is the right place to put
them.  There's a right order in which to resolve those questions, and
this isn't it.  More generally, I think there is a ton of complexity
that we're probably overlooking here in focusing in on specific coding
details.  I think the most interesting comment made to date is Steve
Singer's observation that very little of Slony is concerned with
changeset extraction or apply.  Now, on the flip side, all of these
patches seem to be concerned with changeset extraction and apply.
That suggests that we're missing some pretty significant pieces
somewhere in this design.  I think those pieces are things like error
recovery, fault tolerance, user interface design, and control logic.
Slony has spent years trying to get those things right.  Whether or
not they actually have gotten them right is of course an arguable
point, but we're unlikely to do better by ignoring all of those issues
and implementing whatever is most technically expedient.

>> You've got four people objecting to this patch now, all of whom happen
>> to be committers.  Whether or not MMR goes into core, who knows, but
>> it doesn't seem that this patch is going to fly.
> I find that a bit too early to say. Sure, it won't fly exactly as proposed, but
> hell, who cares? What I want to get in is a solution to the specific problem
> the patch targets. At least you have (not sure about others) accepted that the
> problem needs a solution.
> We do not agree yet what that solution should look like, but that's not exactly
> surprising, as we started discussing the problem only a good day ago.

Oh, no argument with any of that.  I strongly object to the idea of
shoving this patch through as-is, but I don't object to solving the
problem in some other, more appropriate way.  I think that won't look
much like this patch, though; it will be some new patch.

> If people agree that your proposed way of just one flag bit is the way to go,
> we will have to live with that. But that's different from saying the whole
> thing is dead.

I think you've convinced me that a single flag-bit is not enough, but
I don't think you've convinced anyone that it belongs in the record
header.

>> As I would rather see this project
>> succeed, I recommend that you don't do that.  Both you and Andres seem
>> to believe that MMR is a reasonable first target to shoot at, but I
>> don't think anyone else - including the Slony developers who have
>> commented on this issue - endorses that position.
> I don't think we will get full MMR into 9.3. What I am proposing is that we
> build in the few pieces that are required to implement MMR *on top* of what's
> hopefully in 9.3.
> And I think that's a realistic goal.

I can't quite follow that sentence, but my general sense is that,
while you're saying that this infrastructure will be reusable by other
projects, you don't actually intend to expose APIs that they can use.
IOW, we'll give you an apply cache - which we believe to be necessary
to extract tuples as text - but we're leaving the exercise of actually
generating those tuples as text as an exercise for the reader.  I find
that a highly undesirable plan.  First, if we don't actually have the
infrastructure to extract tuples as text, then the contention that the
infrastructure is adequate for that purpose can't be proven or
disproven.  Second, until someone from one of those other projects (or
elsewhere in the community) actually goes and builds it, the built-in
logical replication will be the only thing that can get benefit out of
the new infrastructure.  I think it's completely unacceptable to give
an unproven built-in logical replication technology that kind of pride
of place out of the gate.  That potentially allows it to supplant
systems such as Slony and Bucardo even if it is in many respects
inferior, just because it's been given an inside track.  They have
lived without core support for years, and if we're going to start
adding core support for replication, we ought to start by adding the
things that they think, on the basis of their knowledge and
experience, are the most important places where core support is
needed, not going off in a completely new and untested direction.
Third, when logical replication fails, which it will, because even
simple things fail and replication is complicated, how am I going to
debug it?  A raw dump of the tuple data that's being shipped around?
No thanks.

IOW, what I see you proposing is, basically, let's short-cut the hard
problems so we can get to MMR faster.  I oppose that.  That's not the
Postgres way of building features.  We start slow and incremental and
we make each thing as solid as we possibly can before going on to the
next thing.  It's not always the fastest way of building technology,
but the results are of very high quality, and we rarely have to throw
out features completely and start over.  When we do misdesign a
feature (rules, contrib/xml2) the bug reports persist for years, if not
decades, and it's often hard to get even the most obvious bugs (e.g.
crashes) fixed, because the original developers have moved on to other
things, and as a community we don't get to tell people what projects
to work on.  If we mess up with logical replication, the results will
be exponentially worse, so I think it is 100% appropriate to be
extremely conservative.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> But I think getting even
>>>> single-master logical replication working well in a single release
>>>> cycle is going to be a job and a half.
>>>
>>> OK, so your estimate is 1.5 people to do that. And if we have more
>>> people, should they sit around doing nothing?
>>
>> Oh, give me a break.  You're willfully missing my point.  And to quote
>> Fred Brooks, nine women can't make a baby in one month.
>
> No, I'm not. The question is not how quickly can N people achieve a
> single thing, but how long will it take a few skilled people working
> on carefully selected tasks that have few dependencies between them to
> achieve something.

The bottleneck is getting the design right, not writing the code.
Selecting tasks for people to work on without an agreement on the
design will not advance the process, unless we just accept whatever
code you choose to write based on whatever design you happen to pick.

> How exactly did you arrive at your conclusion? Why is yours right and
> mine wrong?

I estimated the amount of work that would be required to do this right
and compared it to other large projects that have been successfully
done in the past.  I think you are looking at something on the order
of magnitude of the Windows port, which took about four releases to
become stable, or the SE-Linux project, which still isn't
feature-complete.  Even if it's only a HS-sized project, that took two
releases, as did streaming replication.  SSI got committed within one
release cycle, but there were several years of design and advocacy
work before any code was written, so that, too, was really a
multi-year project.

I'll confine my comments on the second part of the question to the
observation that it is a bit early to know who is right and who is
wrong, but the question could just as easily be turned on its head.

> No, I have four people who had initial objections and who have not
> commented on the fact that the points made are regrettably incorrect.

I think Kevin addressed this point better than I can.  Asserting
something doesn't make it true, and you haven't offered any rational
argument against the points that have been made, probably because
there isn't one.  We *cannot* justify stealing 100% of the available
bit space for a feature that many people won't use and may not be
enough to address the real requirement anyway.

> Since at least 3 of the people making such comments did not attend the
> full briefing meeting in Ottawa, I am not particularly surprised.
> However, I do expect people that didn't come to the meeting to
> recognise that they are likely to be missing information and to listen
> closely, as I listen to them.

Participation in the community development process is not contingent
on having flown to Ottawa in May, or on having decided to spend that
evening at your briefing meeting.  Attributing to ignorance what is
adequately explained by honest disagreement is impolite.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
Hi,

On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > We're not the only ones here performing scope creep though... I
> > think just about all the people who have posted in the whole thread,
> > except maybe Tom and Marko, are guilty of doing so.
> > 
> > I still think it's rather sensible to focus on exactly duplicated schemas
> > in a very first version, just because that leaves out some of the
> > complexity while paving the road for other nice things.
> 
> Well, I guess what I want to know is: what does focusing on exactly
> duplicated schemas mean?  If it means we'll disable DDL for tables
> when we turn on replication, that's basically the Slony approach: when
> you want to make a DDL change, you have to quiesce replication, do it,
> and then resume replication.  I would possibly be OK with that
> approach.  If it means that we'll hope that the schemas are duplicated
> and start spewing garbage data when they're not, then I'm definitely
> not OK with that approach.  If it means using event triggers to keep
> the catalogs synchronized, then I don't think that's adequately
> robust.  The user could add more event triggers that run before or
> after the ones the replication system adds, and then you are back to
> garbage decoding (or crashes).
I would prefer the event trigger way because that seems to be the most
extensible/reusable. It would allow fully replicated databases and catalog-only
instances.
I think we need to design event triggers in a way you cannot simply circumvent
them. We already have the case that if users try to screw around with system
triggers we give back wrong answers, with the planner relying on foreign keys,
btw.
If the problem is having user triggers after system triggers: let's make that
impossible. Forbidding DDL on the other instances once we have that isn't that
hard.

Perhaps all that will get simpler if we can make reading the catalog via
custom-built snapshots work, as you proposed elsewhere in this thread. That
would make checking for errors much easier, even if you just want to apply to
a database with exactly the same schema. That's the next thing I plan to work
on.

> They could also modify the catalogs directly, although it's possible we
> don't care quite as much about that case (but on the other hand people
> do sometimes need to do it to solve real problems).
With that you can already crash the database perfectly fine today. I think
trying to cater for that is a waste of time.

> Although I am
> 100% OK with paring back the initial feature set - indeed, I strongly
> recommend it - I think that robustness is not a feature which can be
> left out in v1 and added in later.  All the robustness has to be
> designed in at the start, or we will never have it.
I definitely don't intend to cut down on robustness.

> On the whole, I think we're spending far too much time talking about
> code and far too little time talking about what the overall design
> should look like.
Agreed.

> We are having a discussion about whether or not MMR
> should be supported by sticking a 16-bit node ID into every WAL record
> without having first decided whether we should support MMR, whether
> that requires node IDs, whether they should be integers, whether those
> integers should be 16 bits in size, whether they should be present in
> WAL, and whether or not the record header is the right place to put
> them.  There's a right order in which to resolve those questions, and
> this isn't it.  More generally, I think there is a ton of complexity
> that we're probably overlooking here in focusing in on specific coding
> details.  I think the most interesting comment made to date is Steve
> Singer's observation that very little of Slony is concerned with
> changeset extraction or apply.  Now, on the flip side, all of these
> patches seem to be concerned with changeset extraction and apply.
> That suggests that we're missing some pretty significant pieces
> somewhere in this design.  I think those pieces are things like error
> recovery, fault tolerance, user interface design, and control logic.
> Slony has spent years trying to get those things right.  Whether or
> not they actually have gotten them right is of course an arguable
> point, but we're unlikely to do better by ignoring all of those issues
> and implementing whatever is most technically expedient.
I agree that the focus isn't 100% optimal and that there are *loads* of issues
we haven't even started to look at. But you need a point to start, and
extraction & apply seems to be a good one, because you can actually test it
without the other issues being solved, which is not really the case the other
way round.
Also it's possible to plug the newly built changeset extraction into
existing solutions to make them more efficient while retaining most of their
respective frameworks.

So I disagree that that's the wrong part to start with.

> >> You've got four people objecting to this patch now, all of whom happen
> >> to be committers.  Whether or not MMR goes into core, who knows, but
> >> it doesn't seem that this patch is going to fly.
> > 
> > I find that a bit too early to say. Sure, it won't fly exactly as
> > proposed, but hell, who cares? What I want to get in is a solution to
> > the specific problem the patch targets. At least you have (not sure
> > about others) accepted that the problem needs a solution.
> > We do not agree yet what that solution should look like, but that's not
> > exactly surprising, as we started discussing the problem only a good day
> > ago.
> Oh, no argument with any of that.  I strongly object to the idea of
> shoving this patch through as-is, but I don't object to solving the
> problem in some other, more appropriate way.  I think that won't look
> much like this patch, though; it will be some new patch.
No problem with that.

> > If people agree that your proposed way of just one flag bit is the way to
> > go, we will have to live with that. But that's different from saying the
> > whole thing is dead.
> I think you've convinced me that a single flag-bit is not enough, but
> I don't think you've convinced anyone that it belongs in the record
> header.
Not totally happy, but also OK with it. As I just wrote to Kevin, that just
makes things harder because you need to reassemble transactions before
filtering, which is a shame in my opinion.

> >> As I would rather see this project
> >> succeed, I recommend that you don't do that.  Both you and Andres seem
> >> to believe that MMR is a reasonable first target to shoot at, but I
> >> don't think anyone else - including the Slony developers who have
> >> commented on this issue - endorses that position.
> > 
> > I don't think we will get full MMR into 9.3. What I am proposing is that
> > we build in the few pieces that are required to implement MMR *on top* of
> > what's hopefully in 9.3.
> > And I think that's a realistic goal.
> 
> I can't quite follow that sentence, but my general sense is that,
> while you're saying that this infrastructure will be reusable by other
> projects, you don't actually intend to expose APIs that they can use.
> IOW, we'll give you an apply cache - which we believe to be necessary
> to extract tuples as text - but we're leaving the exercise of actually
> generating those tuples as text as an exercise for the reader.  I find
> that a highly undesirable plan.  First, if we don't actually have the
> infrastructure to extract tuples as text, then the contention that the
> infrastructure is adequate for that purpose can't be proven or
> disproven.  Second, until someone from one of those other projects (or
> elsewhere in the community) actually goes and builds it, the built-in
> logical replication will be the only thing that can get benefit out of
> the new infrastructure.  I think it's completely unacceptable to give
> an unproven built-in logical replication technology that kind of pride
> of place out of the gate.  That potentially allows it to supplant
> systems such as Slony and Bucardo even if it is in many respects
> inferior, just because it's been given an inside track.  They have
> lived without core support for years, and if we're going to start
> adding core support for replication, we ought to start by adding the
> things that they think, on the basis of their knowledge and
> experience, are the most important places where core support is
> needed, not going off in a completely new and untested direction.
> Third, when logical replication fails, which it will, because even
> simple things fail and replication is complicated, how am I going to
> debug it?  A raw dump of the tuple data that's being shipped around?
> No thanks.
> IOW, what I see you proposing is, basically, let's short-cut the hard
> problems so we can get to MMR faster.  I oppose that.  That's not the
> Postgres way of building features.  We start slow and incremental and
> we make each thing as solid as we possibly can before going on to the
> next thing.
No, I am not saying that I just want to provide some untested base modules and
leave it at that. I am saying that I don't think providing a full-fledged
framework for implementing basically arbitrary replication solutions from the
get-go is a sane goal for something that should be finished someday. Even less
so if that implementation is something that will be discussed on -hackers and
needs people to agree.
I definitely do want to provide code that generates a textual representation
of the changes. As you say, even if it's not used for anything else, it's
needed for debugging. Not sure if it should be SQL or maybe the new Slony
representation. If that's provided and reusable, it should ensure that other
solutions can be built on top of it.
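
Just to make "textual representation" concrete, the output could look
something like this (an invented format, for illustration only):

    BEGIN xid=1271 origin=node_a
    UPDATE "public"."accounts": id[int4]=42 balance[int8]=100 -> 101
    COMMIT xid=1271

Whatever the final format, something human-readable along those lines would
serve both debugging and as a base representation for other solutions.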

I find your supposition that I/we just want to get MMR without regard for
anything else a bit offensive. I wrote at least three times in this thread
that I do think it's likely that we will not get more than the minimal basis
for implementing MMR into 9.3. I wrote multiple times that I want to provide
the basis for multiple solutions. The prototype - while obviously being
incomplete - tried hard to be modular.
You cannot blame us for wanting the work we do to *also* be usable for
one of our major aims.
What can I do to convince you/others that I am not planning to do something
"evil", but that I am trying to reach as many goals at once as possible?


Greetings,

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I would prefer the event trigger way because that seems to be the most
> extensible/reusable. It would allow fully replicated databases and
> catalog-only instances.
> I think we need to design event triggers in a way you cannot simply circumvent
> them. We already have the case that if users try to screw around with system
> triggers we give back wrong answers, with the planner relying on foreign keys,
> btw.
> If the problem is having user triggers after system triggers: let's make that
> impossible. Forbidding DDL on the other instances once we have that isn't that
> hard.

So, this is interesting.  I think something like this could solve the
problem, but then why not just make it built-in code that runs from
the same place as the event trigger rather than using the trigger
mechanism per se?  Presumably the "trigger" code's purpose is going to
be to inject additional data into the WAL stream (am I wrong?) which
is not something you're going to be able to do from PL/pgsql anyway,
so you don't really need a trigger, just a call to some C function -
which not only has the advantage of being not bypassable, but is also
faster.
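
A sketch of that alternative, with invented names (this hook does not
exist; it only illustrates the shape of the idea):

    /* Hypothetical: a single, hard-wired hook invoked from the same spot
     * where DDL event triggers would fire.  Being plain C and not
     * catalog-driven, users cannot reorder or drop it, and there is no
     * PL invocation overhead. */
    typedef struct Node Node;   /* parse-tree node, treated as opaque */

    typedef void (*ddl_replication_hook_type) (Node *parsetree);

    static ddl_replication_hook_type ddl_replication_hook = NULL;

    static void
    fire_ddl_replication_hook(Node *parsetree)
    {
        if (ddl_replication_hook != NULL)
            ddl_replication_hook(parsetree);    /* e.g. emit extra WAL */
    }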

> Perhaps all that will get simpler if we can make reading the catalog via
> custom-built snapshots work, as you proposed elsewhere in this thread. That
> would make checking for errors much easier, even if you just want to apply to
> a database with exactly the same schema. That's the next thing I plan to work
> on.

I realized a problem with that idea this morning: it might work for
reading things, but if anyone attempts to write data you've got big
problems.  Maybe we could get away with forbidding that, not sure.
Would be nice to get some input from other hackers on this.

>> They could also modify the catalogs directly, although it's possible we
>> don't care quite as much about that case (but on the other hand people
>> do sometimes need to do it to solve real problems).
> With that you can already crash the database perfectly fine today. I think
> trying to guard against that is a waste of time.

You're probably right.

> I agree that the focus isn't 100% optimal and that there are *loads* of issues
> we haven't even started to look at. But you need a point to start, and
> extraction & apply seems to be a good one because you can actually test it
> without the other issues solved, which is not really the case the other way
> round.
> Also it's possible to plug the newly built changeset extraction into
> existing solutions to make them more efficient while retaining most of their
> respective frameworks.
>
> So I disagree that that's the wrong part to start with.

I think extraction is a very sensible place to start; actually, I
think it's the best possible place to start.  But this particular
thread is about adding origin_ids to WAL, which I think is definitely
not the best place to start.

> I definitely do want to provide code that generates a textual representation
> of the changes. As you say, even if it's not used for anything else, it's needed
> for debugging. Not sure if it should be SQL or maybe the new Slony representation.
> If that's provided and reusable, it should ensure that other
> solutions can be built on top of it.

Oh, yeah.  If we can get that, I will throw a party.

> I find your supposition that I/we just want to get MMR without regard for
> anything else a bit offensive. I wrote at least three times in this thread
> that I do think it's likely that we will not get more than the minimal basis
> for implementing MMR into 9.3. I wrote multiple times that I want to provide
> the basis for multiple solutions. The prototype - while obviously being
> incomplete - tried hard to be modular.
> You cannot blame us for wanting the work we do to *also* be usable for one of
> our major aims, can you?
> What can I do to convince you/others that I am not planning to do something
> "evil" but that I am trying to reach as many goals at once as possible?

Sorry.  I don't think you're planning to do something evil, but earlier
I thought you said you did NOT want to write the code to extract
changes as text or something similar.  I think that would be a really
bad thing to skip for all kinds of reasons.  I think we need that as a
foundational technology before we do much else.  Now, once we have
that, if we can safely detect cases where it's OK to bypass decoding
to text and skip it in just those cases, I think that's great
(although possibly difficult to implement correctly).  I basically
feel that without decode-to-text, this can't possibly be a basis for
multiple solutions; it will be a basis only for itself, and extremely
difficult to debug, too.  No other replication solution can even
theoretically have any use for the raw on-disk tuple, at least not
without horrible kludgery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I would prefer the event trigger way because that seems to be the most
> > extensible/reusable. It would allow fully replicated databases and
> > catalog-only instances.
> > I think we need to design event triggers in a way that you cannot simply
> > circumvent them. We already have the case that if users screw
> > around with system triggers we give back wrong answers, with the planner
> > relying on foreign keys, btw.
> > If the problem is having user triggers fire after system triggers: let's make
> > that impossible. Forbidding DDL on the other instances once we have that
> > isn't that hard.
> 
> So, this is interesting.  I think something like this could solve the
> problem, but then why not just make it built-in code that runs from
> the same place as the event trigger rather than using the trigger
> mechanism per se?  Presumably the "trigger" code's purpose is going to
> be to inject additional data into the WAL stream (am I wrong?), which
> is not something you're going to be able to do from PL/pgSQL anyway,
> so you don't really need a trigger, just a call to some C function -
> which not only has the advantage of not being bypassable, but is also
> faster.
I would be totally fine with that. As long as event triggers provide the 
infrastructure, that shouldn't be a big problem.

> > Perhaps all that will get simpler if we can make reading the catalog via
> > custom-built snapshots work, as you proposed elsewhere in this thread.
> > That would make checking for errors much easier even if you just want to
> > apply to a database with exactly the same schema. That's the next thing I
> > plan to work on.
> I realized a problem with that idea this morning: it might work for
> reading things, but if anyone attempts to write data you've got big
> problems.  Maybe we could get away with forbidding that, not sure.
Hm, why is writing a problem? You mean I/O conversion routines writing data? 
Yes, that will be a problem. I am fine with simply forbidding that; we should 
be able to catch it and provide a sensible error message, since SSI we have 
the support for that.

> Would be nice to get some input from other hackers on this.
Oh, yes!


> > I agree that the focus isn't 100% optimal and that there are *loads* of
> > issues we haven't even started to look at. But you need a point to
> > start, and extraction & apply seems to be a good one because you can
> > actually test it without the other issues solved, which is not really the
> > case the other way round.
> > Also it's possible to plug the newly built changeset extraction into
> > existing solutions to make them more efficient while retaining most of
> > their respective frameworks.
> > 
> > So I disagree that that's the wrong part to start with.
> 
> I think extraction is a very sensible place to start; actually, I
> think it's the best possible place to start.  But this particular
> thread is about adding origin_ids to WAL, which I think is definitely
> not the best place to start.
Yep. I think the reason everyone started with it is that the patch was actually 
really simple ;).
Note that the WAL enrichment & decoding patches came before the origin_id 
patch in the patch series ;)

> > I definitely do want to provide code that generates a textual
> > representation of the changes. As you say, even if it's not used for
> > anything else, it's needed for debugging. Not sure if it should be SQL or maybe
> > the new Slony representation. If that's provided and reusable, it should
> > ensure that other solutions can be built on top of it.
> Oh, yeah.  If we can get that, I will throw a party.
Good ;)

> > I find your supposition that I/we just want to get MMR without regard for
> > anything else a bit offensive. I wrote at least three times in this
> > thread that I do think it's likely that we will not get more than the
> > minimal basis for implementing MMR into 9.3. I wrote multiple times that
> > I want to provide the basis for multiple solutions. The prototype -
> > while obviously being incomplete - tried hard to be modular.
> > You cannot blame us for wanting the work we do to *also* be usable for
> > one of our major aims, can you?
> > What can I do to convince you/others that I am not planning to do
> > something "evil" but that I am trying to reach as many goals at once as
> > possible?
> Sorry.  I don't think you're planning to do something evil, but earlier
> I thought you said you did NOT want to write the code to extract
> changes as text or something similar.
Hm. I might have been a bit ambiguous when saying that I do not want to 
provide everything for that use-case.
Once we have a callpoint that has a correct catalog snapshot for exactly the 
tuple in question, text conversion is damn near trivial. The point where you 
get passed all that information (action, tuple, table, snapshot) is the one I 
think the patch should mainly provide.
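
To illustrate, a rough sketch of such a callpoint (hypothetical names, not the 
actual patch API); the textual conversion is then just one possible consumer 
of it:

    /*
     * Hypothetical sketch only. A decoding callback gets the action, the
     * relation, the tuple and a catalog snapshot valid for exactly that
     * tuple; producing a textual representation on top of this is the
     * easy part.
     */
    typedef enum ChangeAction
    {
        CHANGE_INSERT,
        CHANGE_UPDATE,
        CHANGE_DELETE
    } ChangeAction;

    typedef void (*decode_change_cb) (ChangeAction action,
                                      Relation relation,
                                      HeapTuple tuple,
                                      Snapshot snapshot,
                                      StringInfo out);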

> I think that would be a really
> bad thing to skip for all kinds of reasons.  I think we need that as a
> foundational technology before we do much else.  Now, once we have
> that, if we can safely detect cases where it's OK to bypass decoding
> to text and skip it in just those cases, I think that's great
> (although possibly difficult to implement correctly).  I basically
> feel that without decode-to-text, this can't possibly be a basis for
> multiple solutions; it will be a basis only for itself, and extremely
> difficult to debug, too.  No other replication solution can even
> theoretically have any use for the raw on-disk tuple, at least not
> without horrible kludgery.
We need a simple decode-to-text feature. Agreed.

Andres
--
 Andres Freund                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 20 June 2012 23:56, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>> But I think getting even
>>>>> single-master logical replication working well in a single release
>>>>> cycle is going to be a job and a half.
>>>>
>>>> OK, so your estimate is 1.5 people to do that. And if we have more
>>>> people, should they sit around doing nothing?
>>>
>>> Oh, give me a break.  You're willfully missing my point.  And to quote
>>> Fred Brooks, nine women can't make a baby in one month.
>>
>> No, I'm not. The question is not how quickly can N people achieve a
>> single thing, but how long will it take a few skilled people working
>> on carefully selected tasks that have few dependencies between them to
>> achieve something.
>
> The bottleneck is getting the design right, not writing the code.
> Selecting tasks for people to work on without an agreement on the
> design will not advance the process, unless we just accept whatever
> code you choose to write based on whatever design you happen to pick.

Right. Which is why there are multiple people doing design on
different aspects, and much work going into prototyping to feed some
useful information into debates that would otherwise be bikeshedding.


>> How exactly did you arrive at your conclusion? Why is yours right and
>> mine wrong?
>
> I estimated the amount of work that would be required to do this right
> and compared it to other large projects that have been successfully
> done in the past.  I think you are looking at something on the order
> of magnitude of the Windows port, which took about four releases to
> become stable, or the SE-Linux project, which still isn't
> feature-complete.  Even if it's only a HS-sized project, that took two
> releases, as did streaming replication.  SSI got committed within one
> release cycle, but there were several years of design and advocacy
> work before any code was written, so that, too, was really a
> multi-year project.

The current BDR project uses concepts that have been in discussion for
many years, something like 6-7 years. I first wrote multi-master for
Postgres in userspace in 2007, and the current designs build upon that.
My detailed design for WAL-based logical replication was produced in
2009. I'm not really breaking new ground in the design, and we're
reusing significant parts of existing code. You don't know that because
you didn't think to ask, which is a little surprising.

On top of that, Andres has refined many of the basic ideas and what he
presents here is a level above my initial design work. I have every
confidence that he'll work independently and produce useful code while
other aspects are considered.


> I'll confine my comments on the second part of the question to the
> observation that it is a bit early to know who is right and who is
> wrong, but the question could just as easily be turned on its head.
>
>> No, I have four people who had initial objections and who have not
>> commented on the fact that the points made are regrettably incorrect.
>
> I think Kevin addressed this point better than I can.  Asserting
> something doesn't make it true, and you haven't offered any rational
> argument against the points that have been made, probably because
> there isn't one.  We *cannot* justify stealing 100% of the available
> bit space for a feature that many people won't use and may not be
> enough to address the real requirement anyway.


You've made some assertions above that aren't correct, so it's good that you
recognise assertions may not be true, on both sides. I ask that you
listen a little before casting judgements. We won't get anywhere
otherwise.

The current wasted space on 64-bit systems is 6 bytes per record. The
current suggestion is to use 2 of those, in a flexible manner that
allows future expansion if it is required, and only if it is
required. That flexibility covers anything else we need. Asserting
that "100% of the available space" is being used is wrong, and to
continue to assert that even when told otherwise multiple times is
something else.


>> Since at least 3 of the people making such comments did not attend the
>> full briefing meeting in Ottawa, I am not particularly surprised.
>> However, I do expect people that didn't come to the meeting to
>> recognise that they are likely to be missing information and to listen
>> closely, as I listen to them.
>
> Participation in the community development process is not contingent
> on having flown to Ottawa in May, or on having decided to spend that
> evening at your briefing meeting.  Attributing to ignorance what is
> adequately explained by honest disagreement is impolite.

I don't think people need to have attended that briefing. But if they
did not, then they do need to be aware they missed hours of community
discussion and explanation, built on top of months of careful
investigation, all of which was aimed at helping everybody understand.

Jumping straight into a discussion is what the community is about
though and I welcome people's input. It seems reasonable that we all
listen to each other.

By responding to each point made I'm showing that I listen. And I hope
that people read what is written in reply, and consider it, rather
than just discard it and repeat the same objection as if nothing had
been said.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Robert Haas
Date:
On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I realized a problem with that idea this morning: it might work for
>> reading things, but if anyone attempts to write data you've got big
>> problems.  Maybe we could get away with forbidding that, not sure.
> Hm, why is writing a problem? You mean I/O conversion routines writing data?
> Yes, that will be a problem. I am fine with simply forbidding that; we should
> be able to catch it and provide a sensible error message, since SSI we have
> the support for that.

I think we could do something a little more vigorous than that, like
maybe error out if anyone tries to do anything that would write WAL or
acquire an XID.  Of course, then the question becomes: then what?  We
probably need to think about what happens after that - we don't want
an error replicating one row in one table to permanently break
replication for the entire system.
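
As a sketch of the first half of that (the hook points and names here are 
invented, not an actual API): code running under a decoding snapshot could set 
a flag, and anything that would write WAL or assign an XID would error out 
cleanly instead of corrupting state:

    /* Sketch only; names invented. */
    static bool in_decoding_snapshot = false;

    static void
    forbid_write_in_decoding(void)
    {
        if (in_decoding_snapshot)
            ereport(ERROR,
                    (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
                     errmsg("cannot write WAL or assign an XID while decoding changes")));
    }

The "then what" question - how apply recovers after such an error - is the 
part a sketch like this doesn't answer.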

>> Sorry.  I don't think you're planning to do something evil, but earlier
>> I thought you said you did NOT want to write the code to extract
>> changes as text or something similar.
> Hm. I might have been a bit ambiguous when saying that I do not want to
> provide everything for that use-case.
> Once we have a callpoint that has a correct catalog snapshot for exactly the
> tuple in question, text conversion is damn near trivial. The point where you
> get passed all that information (action, tuple, table, snapshot) is the one I
> think the patch should mainly provide.

This is actually a very interesting list.  We could rephrase the
high-level question about the design of this feature as "what is the
best way to make sure that you have these things available?".  Action
and tuple are trivial to get, and table isn't too hard either.  It's
really the snapshot - and all the downstream information that can only
be obtained by using that snapshot - that is the hard part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 21 June 2012 01:40, Andres Freund <andres@2ndquadrant.com> wrote:

>> I think extraction is a very sensible place to start; actually, I
>> think it's the best possible place to start.  But this particular
>> thread is about adding origin_ids to WAL, which I think is definitely
>> not the best place to start.

> Yep. I think the reason everyone started with it is that the patch was actually
> really simple ;).
> Note that the WAL enrichment & decoding patches came before the origin_id
> patch in the patch series ;)

Actually, I thought it would be non-contentious...

A format change, so early in the release cycle. Already openly discussed
with no comment. Using wasted space for efficiency, for information
already in use by Slony for similar reasons.

Oh well.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Heikki Linnakangas
Date:
On 20.06.2012 17:35, Simon Riggs wrote:
> On 20 June 2012 16:23, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 20.06.2012 11:17, Simon Riggs wrote:
>>>
>>> On 20 June 2012 15:45, Heikki Linnakangas
>>> <heikki.linnakangas@enterprisedb.com>    wrote:
>>>>
>>>> So, if the origin id is not sufficient for some conflict resolution
>>>> mechanisms, what extra information do you need for those, and where do
>>>> you put it?
>>>
>>> As explained elsewhere, wal_level = logical (or similar) would be used
>>> to provide any additional logical information required.
>>>
>>> Update and Delete WAL records already need to be different in that
>>> mode, so additional info would be placed there, if there were any.
>>>
>>> In the case of reflexive updates you raised, a typical response in
>>> other DBMS would be to represent the query
>>>    UPDATE SET counter = counter + 1
>>> by sending just the "+1" part, not the current value of counter, as
>>> would be the case with the non-reflexive update
>>>    UPDATE SET counter = 1
>>>
>>> Handling such things in Postgres would require some subtlety, which
>>> would not be resolved in the first release but is pretty certain not to
>>> require any changes to the WAL record header as a way of resolving it.
>>> Having already thought about it, I'd estimate that is a very long
>>> discussion and not relevant to the OT, but if you wish to have it
>>> here, I won't stop you.
>>
>>
>> Yeah, I'd like to hear briefly how you would handle that without any further
>> changes to the WAL record header.
>
> I already did:
>
>>> Update and Delete WAL records already need to be different in that
>>> mode, so additional info would be placed there, if there were any.
>
> The case you mentioned relates to UPDATEs only, so I would suggest
> that we add that information to a new form of update record only.
>
> That has nothing to do with the WAL record header.

Hmm, so you need the origin id in the WAL record header to do filtering. 
Except when that's not enough, you add some more fields to heap update 
and delete records.

Don't you think it would be simpler to only add the extra fields to heap 
insert, update and delete records, and leave the WAL record header 
alone? Do you ever need extra information on other record types?

The other question is, *what* information do you need to put there? We 
don't necessarily need to have a detailed design of that right now, but 
it seems quite clear to me that we need more flexibility there than this 
patch provides, to support more complicated conflict resolution.

I'm not saying that we need to implement all possible conflict 
resolution algorithms right now - on the contrary I think conflict 
resolution belongs outside core - but if we're going to change the WAL 
record format to support such conflict resolution, we better make sure 
the foundation we provide for it is solid.

BTW, one way to work around the lack of origin id in the WAL record 
header is to just add an origin-id column to the table, indicating the 
last node that updated the row. That would be a kludge, but I thought 
I'd mention it..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 07:50:37 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I realized a problem with that idea this morning: it might work for
> >> reading things, but if anyone attempts to write data you've got big
> >> problems.  Maybe we could get away with forbidding that, not sure.
> > 
> > Hm, why is writing a problem? You mean I/O conversion routines writing
> > data? Yes, that will be a problem. I am fine with simply forbidding
> > that; we should be able to catch it and provide a sensible error
> > message, since SSI we have the support for that.
> I think we could do something a little more vigorous than that, like
> maybe error out if anyone tries to do anything that would write WAL or
> acquire an XID.
I would go for all of them ;). The read-only transaction warnings will 
probably result in the best error messages.

> Of course, then the question becomes: then what?  We
> probably need to think about what happens after that - we don't want
> an error replicating one row in one table to permanently break
> replication for the entire system.
Interesting problem, yes. My first reaction was to just warn, log, and 
throw the transaction away... But that's not likely to work very well on the 
apply side...

I don't think it's a particularly likely problem though. An I/O function doing 
that would probably fail in lots of other situations (HS, read-only 
transactions, permission problems).

> >> Sorry.  I don't think you're planning to do something evil, but earlier
> >> I thought you said you did NOT want to write the code to extract
> >> changes as text or something similar.
> > Hm. I might have been a bit ambiguous when saying that I do not want to
> > provide everything for that use-case.
> > Once we have a callpoint that has a correct catalog snapshot for exactly
> > the tuple in question, text conversion is damn near trivial. The point
> > where you get passed all that information (action, tuple, table,
> > snapshot) is the one I think the patch should mainly provide.
> This is actually a very interesting list.  We could rephrase the
> high-level question about the design of this feature as "what is the
> best way to make sure that you have these things available?".  Action
> and tuple are trivial to get, and table isn't too hard either.  It's
> really the snapshot - and all the downstream information that can only
> be obtained via using that snapshot - that is the hard part.
For others, a sensible entry point into this discussion before switching 
subthreads probably is 
http://archives.postgresql.org/message-id/201206192023.20589.andres@2ndquadrant.com

The table isn't as easy as you might think, as the WAL record only contains the 
relfilenode. Unless you want to log more, it's basically solved by solving the 
snapshot issue.

And yes, agreeing on how to do that is the thing we need to solve next. Patch 
07 (add enough information to WAL) and this one are the first that should get 
committed imo. And they need to be ready for that, obviously.

Just noticed that I failed to properly add Patch 07 to the commitfest. I have 
done so now, hope that's OK.


Greetings,

Andres

--
 Andres Freund                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 21 June 2012 02:32, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
>>
>> On 20 June 2012 16:23, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> On 20.06.2012 11:17, Simon Riggs wrote:
>>>>
>>>>
>>>> On 20 June 2012 15:45, Heikki Linnakangas
>>>> <heikki.linnakangas@enterprisedb.com>    wrote:
>>>>>
>>>>>
>>>>> So, if the origin id is not sufficient for some conflict resolution
>>>>> mechanisms, what extra information do you need for those, and where do
>>>>> you put it?
>>>>
>>>>
>>>> As explained elsewhere, wal_level = logical (or similar) would be used
>>>> to provide any additional logical information required.
>>>>
>>>> Update and Delete WAL records already need to be different in that
>>>> mode, so additional info would be placed there, if there were any.
>>>>
>>>> In the case of reflexive updates you raised, a typical response in
>>>> other DBMS would be to represent the query
>>>>   UPDATE SET counter = counter + 1
>>>> by sending just the "+1" part, not the current value of counter, as
>>>> would be the case with the non-reflexive update
>>>>   UPDATE SET counter = 1
>>>>
>>>> Handling such things in Postgres would require some subtlety, which
>>>> would not be resolved in the first release but is pretty certain not to
>>>> require any changes to the WAL record header as a way of resolving it.
>>>> Having already thought about it, I'd estimate that is a very long
>>>> discussion and not relevant to the OT, but if you wish to have it
>>>> here, I won't stop you.
>>>
>>>
>>>
>>> Yeah, I'd like to hear briefly how you would handle that without any
>>> further
>>> changes to the WAL record header.
>>
>>
>> I already did:
>>
>>>> Update and Delete WAL records already need to be different in that
>>>> mode, so additional info would be placed there, if there were any.
>>
>>
>> The case you mentioned relates to UPDATEs only, so I would suggest
>> that we add that information to a new form of update record only.
>>
>> That has nothing to do with the WAL record header.
>
>
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update and
> delete records.

Yes

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header alone? Do
> you ever need extra information on other record types?

No extra info on other record types, in general at least. Putting it
there is the most logical place, just not the most efficient.

> The other question is, *what* information do you need to put there? We don't
> necessarily need to have a detailed design of that right now, but it seems
> quite clear to me that we need more flexibility there than this patch
> provides, to support more complicated conflict resolution.

Another patch already covers that for the common non-reflexive case.
More on conflict resolution soon, I guess. I was going to begin
working on that in July. Starting with the design discussed on list,
of course. This patch has thrown up stuff I thought was
compartmentalised, but I was wrong.

> I'm not saying that we need to implement all possible conflict resolution
> algorithms right now - on the contrary I think conflict resolution belongs
> outside core

It's a pretty standard requirement to have user-defined conflict
handling, if that's what you mean.

I'm OK with conflict handling being outside of core as a module, if
that's what people think. We just need a commit hook. That is more
likely to allow a workable solution with 9.3 as well, ISTM. It's
likely to be contentious...

> - but if we're going to change the WAL record format to support
> such conflict resolution, we better make sure the foundation we provide for
> it is solid.

Agreed

> BTW, one way to work around the lack of origin id in the WAL record header
> is to just add an origin-id column to the table, indicating the last node
> that updated the row. That would be a kludge, but I thought I'd mention it..

err, I hope you mean that to be funny. (It wouldn't actually work either)

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Heikki Linnakangas
Date:
On 20.06.2012 21:51, Simon Riggs wrote:
> On 21 June 2012 02:32, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> I'm not saying that we need to implement all possible conflict resolution
>> algorithms right now - on the contrary I think conflict resolution belongs
>> outside core
>
> It's a pretty standard requirement to have user-defined conflict
> handling, if that's what you mean.
>
> I'm OK with conflict handling being outside of core as a module, if
> that's what people think. We just need a commit hook. That is more
> likely to allow a workable solution with 9.3 as well, ISTM. It's
> likely to be contentious...

Hmm, what do you need to happen at commit time?

We already have RegisterXactCallback(), if that's enough...
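
For reference, a minimal sketch of how a module could hook commit with that 
existing API (whether a plain commit-time callback is sufficient for the 
conflict handling you have in mind is exactly the question):

    #include "access/xact.h"

    /* Sketch: commit-time hook via the existing callback API. */
    static void
    conflict_xact_callback(XactEvent event, void *arg)
    {
        if (event == XACT_EVENT_COMMIT)
        {
            /* conflict-handling work would go here */
        }
    }

    void
    _PG_init(void)
    {
        RegisterXactCallback(conflict_xact_callback, NULL);
    }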

>> BTW, one way to work around the lack of origin id in the WAL record header
>> is to just add an origin-id column to the table, indicating the last node
>> that updated the row. That would be a kludge, but I thought I'd mention it..
>
> err, I hope you mean that to be funny. (It wouldn't actually work either)

No, I wasn't serious that we should implement it that way. But now you 
made me curious; why would it not work? If there's an origin-id column 
in a table, it's included in every heap insert/delete/update WAL record. 
Just set it to the current node's id on a local modification, and to the 
origin's id when replaying changes from another node, and you have the 
exact same information as you would with the extra field in the WAL record 
header.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
Hi,

On Wednesday, June 20, 2012 08:32:53 PM Heikki Linnakangas wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
> > On 20 June 2012 16:23, Heikki Linnakangas
> > 
> > <heikki.linnakangas@enterprisedb.com>  wrote:
> >> On 20.06.2012 11:17, Simon Riggs wrote:
> >>> On 20 June 2012 15:45, Heikki Linnakangas
> >>> 
> >>> <heikki.linnakangas@enterprisedb.com>    wrote:
> >>>> So, if the origin id is not sufficient for some conflict resolution
> >>>> mechanisms, what extra information do you need for those, and where do
> >>>> you put it?
> >>> 
> >>> As explained elsewhere, wal_level = logical (or similar) would be used
> >>> to provide any additional logical information required.
> >>> 
> >>> Update and Delete WAL records already need to be different in that
> >>> mode, so additional info would be placed there, if there were any.
> >>> 
> >>> In the case of reflexive updates you raised, a typical response in
> >>> other DBMS would be to represent the query
> >>> 
> >>>    UPDATE SET counter = counter + 1
> >>> 
> >>> by sending just the "+1" part, not the current value of counter, as
> >>> would be the case with the non-reflexive update
> >>> 
> >>>    UPDATE SET counter = 1
> >>> 
> >>> Handling such things in Postgres would require some subtlety, which
> >>> would not be resolved in the first release but is pretty certain not to
> >>> require any changes to the WAL record header as a way of resolving it.
> >>> Having already thought about it, I'd estimate that is a very long
> >>> discussion and not relevant to the OT, but if you wish to have it
> >>> here, I won't stop you.
> >> 
> >> Yeah, I'd like to hear briefly how you would handle that without any
> >> further changes to the WAL record header.
> > 
> > I already did:
> >>> Update and Delete WAL records already need to be different in that
> >>> mode, so additional info would be placed there, if there were any.
> > 
> > The case you mentioned relates to UPDATEs only, so I would suggest
> > that we add that information to a new form of update record only.
> > 
> > That has nothing to do with the WAL record header.
> 
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update
> and delete records.
Imo the whole +1 stuff doesn't have anything to do with the origin_id proposal 
and should be ignored for quite a while. We might go to something like it 
sometime in the future but it's nothing we are working on (as far as I know ;)).

wal_level=logical (in patch 07) currently only changes the following things 
about the WAL stream:

For HEAP_(INSERT|(HOT_)?UPDATE|DELETE):
* prevent full-page writes from removing the row data (could be optimized at 
some point to just store the tuple slot)

For HEAP_DELETE:
* add the primary key of the changed row

HEAP2_MULTI_INSERT obviously needs to get the same treatment in the future.

The only real addition that I foresee in the near future is logging the old 
primary key when the primary key changes in HEAP_UPDATE.

Kevin wants an option for full pre-images of rows in HEAP_(UPDATE|DELETE)

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header
> alone? Do you ever need extra information on other record types?
It's needed in some more locations: HEAP_HOT_UPDATE, HEAP2_MULTI_INSERT, 
HEAP_NEWPAGE, XACT_(ASSIGN, COMMIT, COMMIT_PREPARED, COMMIT_COMPACT, 
ABORT, ABORT_PREPARED) and probably some I can't remember right now.

Sure, we can add it to all those, but then you need to have individual 
knowledge about *all* of those, because the location where it's stored will be 
different for each of them.

To recap why we think origin_id is a sensible design choice:

There are many sensible replication topologies where it does make sense that 
you want to receive changes (on node C) from one node (say B) that originated 
from some other node (say A).
Reasons include:
* the order of applying changes should be as similar as possible on all nodes. 
That means when applying a change on C that originated on B, and changes 
replicated faster from A->B than from A->C, you want to be at least as far along 
with the replication from A as B was; otherwise the conflict ratio will increase. 
If you can recreate the stream from the WAL of every node and still detect 
where an individual change originated, that's easy.
* the interconnects between some nodes may be more expensive than others
* an interconnect between two nodes may fail while others don't

Because of that we think it's sensible to be able to generate the full LCR stream 
with all changes, local and remote ones, on each individual node. If you can then 
filter on individual origin_ids you can build complex replication 
topologies without much additional complexity.
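
As an illustration (names invented, not patch code), the apply/forwarding side 
could then decide per change based on nothing but the origin:

    /* Sketch only; names invented. */
    typedef uint16 RepNodeId;

    static bool
    should_apply_change(RepNodeId origin, RepNodeId local_node,
                        const bool *origins_wanted_from_this_peer)
    {
        if (origin == local_node)
            return false;       /* never re-apply our own changes */

        return origins_wanted_from_this_peer[origin];
    }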

> I'm not saying that we need to implement all possible conflict
> resolution algorithms right now - on the contrary I think conflict
> resolution belongs outside core - but if we're going to change the WAL
> record format to support such conflict resolution, we better make sure
> the foundation we provide for it is solid.
I think this already provides a lot. At some point we probably want to have 
support for looking up on which node a certain local xid originated and when it 
was originally executed. While querying that efficiently requires additional 
support, we already have all the information needed for that.

There are some more complexities with consistently determining conflicts on 
changes that happened in a very small time window on different nodes, but that's 
something for another day.

> BTW, one way to work around the lack of origin id in the WAL record
> header is to just add an origin-id column to the table, indicating the
> last node that updated the row. That would be a kludge, but I thought
> I'd mention it..
Yuck. The aim is to improve on what's done today ;)

--
 Andres Freund                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Aidan Van Dyk
Date:
On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> To recap why we think origin_id is a sensible design choice:
>
> There are many sensible replication topologies where it does make sense that
> you want to receive changes (on node C) from one node (say B) that originated
> from some other node (say A).
> Reasons include:
> * the order of applying changes should be as similar as possible on all nodes.
> That means when applying a change on C that originated on B, and changes
> replicated faster from A->B than from A->C, you want to be at least as far along
> with the replication from A as B was; otherwise the conflict ratio will increase.
> If you can recreate the stream from the WAL of every node and still detect
> where an individual change originated, that's easy.

OK, so in this case, I still don't see how the "origin_id" is even enough.

C applies the change originally from A (routed through B, because it's
faster).  But when it gets the change directly from A, how does it
know to *not* apply it again?




> * the interconnects between some nodes may be more expensive than others
> * an interconnect between two nodes may fail while others don't
>
> Because of that we think it's sensible to be able to generate the full LCR stream
> with all changes, local and remote ones, on each individual node. If you can
> then filter on individual origin_ids you can build complex replication
> topologies without much additional complexity.
>
>> I'm not saying that we need to implement all possible conflict
>> resolution algorithms right now - on the contrary I think conflict
>> resolution belongs outside core - but if we're going to change the WAL
>> record format to support such conflict resolution, we better make sure
>> the foundation we provide for it is solid.
> I think this already provides a lot. At some point we probably want to have
> support for looking up on which node a certain local xid originated and when it
> was originally executed. While querying that efficiently requires additional
> support, we already have all the information needed for that.
>
> There are some more complexities with consistently determining conflicts on
> changes that happened in a very small time window on different nodes, but
> that's something for another day.
>
>> BTW, one way to work around the lack of origin id in the WAL record
>> header is to just add an origin-id column to the table, indicating the
>> last node that updated the row. That would be a kludge, but I thought
>> I'd mention it..
> Yuck. The aim is to improve on what's done today ;)
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>



--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Simon Riggs
Date:
On 21 June 2012 03:13, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20.06.2012 21:51, Simon Riggs wrote:
>>
>> On 21 June 2012 02:32, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> I'm not saying that we need to implement all possible conflict resolution
>>>
>>> algorithms right now - on the contrary I think conflict resolution
>>> belongs
>>> outside core
>>
>>
>> It's a pretty standard requirement to have user-defined conflict
>> handling, if that's what you mean.
>>
>> I'm OK with conflict handling being outside of core as a module, if
>> that's what people think. We just need a commit hook. That is more
>> likely to allow a workable solution with 9.3 as well, ISTM. It's
>> likely to be contentious...
>
>
> Hmm, what do you need to happen at commit time?
>
> We already have RegisterXactCallback(), if that's enough...

Let me refer back to my notes and discuss this later; I'm just going
from vague memory here, which isn't helpful.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 09:24:29 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > To recap why we think origin_id is a sensible design choice:
> > 
> > There are many sensible replication topologies where it does make sense
> > that you want to receive changes (on node C) from one node (say B) that
> > originated from some other node (say A).
> > Reasons include:
> > * the order of applying changes should be as similar as possible on all
> > nodes. That means when applying a change on C that originated on B, and
> > changes replicated faster from A->B than from A->C, you want to be at
> > least as far along with the replication from A as B was; otherwise the
> > conflict ratio will increase. If you can recreate the stream from the
> > WAL of every node and still detect where an individual change
> > originated, that's easy.
> 
> OK, so in this case, I still don't see how the "origin_id" is even enough.
> 
> C applies the change originally from A (routed through B, because it's
> faster).  But when it gets the change directly from A, how does it
> know to *not* apply it again?
The lsn of the change.
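
I.e. (a sketch with invented names) the apply side keeps per-origin replay 
progress, so a change identified by (origin id, lsn on the origin) is applied 
at most once:

    /* Sketch only; names invented. */
    static XLogRecPtr replay_progress[MAX_NODE_ID + 1];

    static bool
    change_already_applied(RepNodeId origin, XLogRecPtr origin_lsn)
    {
        return origin_lsn <= replay_progress[origin];
    }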

Andres
--
 Andres Freund                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Aidan Van Dyk
Date:
On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:

>> OK, so in this case, I still don't see how the "origin_id" is even enough.
>>
>> C applies the change originally from A (routed through B, because it's
>> faster).  But when it gets the change directly from A, how does it
>> know to *not* apply it again?
> The lsn of the change.

So why isn't the LSN good enough for when C propagates the change back to A?

Why does A need more information than C?

a.


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> OK, so in this case, I still don't see how the "origin_id" is even
> >> enough.
> >> 
> >> C applies the change originally from A (routed through B, because it's
> >> faster).  But when it gets the change directly from A, how does it
> >> know to *not* apply it again?
> > 
> > The lsn of the change.
> 
> So why isn't the LSN good enough for when C propagates the change back to
> A?
Because every node has individual progress in the WAL, so the LSN doesn't mean 
anything unless you know which node it originally came from.

> Why does A need more information than C?
Didn't I explain that two mails up?

Perhaps Chris' phrasing explains the basic idea better:

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> The case where it would be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
>  - This INSERT was replicated from node #1, so should be ignored downstream
>  - That INSERT was replicated from node #2, so should be ignored downstream
>  - This UPDATE came from the local node, so needs to be passed to downstream users

Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in North 
America.
#1 wants changes from #3 and #4 when talking to #3, but not those from #2 and 
itself (because those would be cheaper to get locally).

Andres

--
 Andres Freund                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Aidan Van Dyk
Date:
On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
>> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> OK, so in this case, I still don't see how the "origin_id" is even
>> >> enough.
>> >>
>> >> C applies the change originally from A (routed through B, because it's
>> >> faster).  But when it gets the change directly from A, how does it
>> >> know to *not* apply it again?
>> >
>> > The lsn of the change.
>>
>> So why isn't the LSN good enough for when C propagates the change back to
>> A?
> Because every node has individual progress in the WAL, so the LSN doesn't mean
> anything unless you know which node it originally came from.
>
>> Why does A need more information than C?
> Didn't I explain that two mails up?

Probably, but that didn't mean I understood it... I'm trying to keep up here ;-)

So the origin_id isn't strictly for the origin node to know to filter an
LCR it has applied already, but it is also to correlate the LSNs,
because the LSNs of the re-generated LCRs are meant to contain the
originating node's LSN, and every node applying LCRs needs to
be able to know where it is in every node's LSN progress.

I had assumed any LCRs generated on a node would be relative to the
LSN sequencing of that node.

> Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in North
> America.
> #1 wants changes from #3 and #4 when talking to #3, but not those from #2 and
> itself (because those would be cheaper to get locally).

Right, but if the link between #1 and #2 ever "slows down", changes
from #3 and #4 may very well already have #2's changes, and even
require them.

#1 has to apply them, or is it going to stop applying LCRs from #3
when it sees LCRs from #3 coming in that originated on #2 and have
LSNs greater than what it has so far received from #2?


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Wednesday, June 20, 2012 10:12:46 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> >> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> >> OK, so in this case, I still don't see how the "origin_id" is even
> >> >> enough.
> >> >> 
> >> >> C applies the change originally from A (routed through B, because
> >> >> it's faster).  But when it gets the change directly from A, how
> >> >> does it know to *not* apply it again?
> >> > 
> >> > The lsn of the change.
> >> 
> >> So why isn't the LSN good enough for when C propagates the change back
> >> to A?
> > 
> > Because every node has individual progress in the WAL, so the LSN doesn't
> > mean anything unless you know which node it originally came from.
> > 
> >> Why does A need more information than C?
> > 
> > Didn't I explain that two mails up?
> 
> Probably, but that didn't mean I understood it... I'm trying to keep up
> here ;-)
Heh. Yes. This developed into a huge mess already ;)

> So the origin_id isn't strictly for the origin node to know to filter an
> LCR it has applied already, but it is also to correlate the LSNs,
> because the LSNs of the re-generated LCRs are meant to contain the
> originating node's LSN, and every node applying LCRs needs to
> be able to know where it is in every node's LSN progress.
There are multiple use-cases for it; this is one of them.

> I had assumed any LCRs generated on a node would be relative to the
> LSN sequencing of that node.
We have the original LSN in the commit record. That's needed to support crash 
recovery, because you need to know where to restart applying changes again 
from.
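
Roughly (a sketch, not the actual record layout in the patches):

    /* Sketch only: what a commit record carries so apply can restart. */
    typedef struct xl_origin_info
    {
        RepNodeId   origin_node;    /* node the transaction originated on */
        XLogRecPtr  origin_lsn;     /* its commit LSN on that node */
    } xl_origin_info;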

> > Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in
> > North America.
> > #1 wants changes from #3 and #4 when talking to #3, but not those from #2
> > and itself (because those would be cheaper to get locally).
> 
> Right, but if the link between #1 and #2 ever "slows down", changes
> from #3 and #4 may very well already have #2's changes, and even
> require them.
> 
> #1 has to apply them, or is it going to stop applying LCRs from #3
> when it sees LCRs from #3 coming in that originated on #2 and have
> LSNs greater than what it has so far received from #2?
We will see ;). There are several solutions possible, possibly even depending 
on the use-case. This patch just wants to provide the very basic information 
required to implement such solutions...

Greetings,

Andres
--
 Andres Freund                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 12-06-13 07:28 AM, Andres Freund wrote:
> From: Andres Freund <andres@anarazel.de>
>
> The individual changes need to be identified by an xid. The xid can be a
> subtransaction or a toplevel one; at commit those can be reintegrated by doing
> a k-way mergesort between the individual transactions.
>
> Callbacks for apply_begin, apply_change and apply_commit are provided to
> retrieve complete transactions.
>
> Missing:
> - spill-to-disk
> - correct subtransaction merge, current behaviour is simple/wrong
> - DDL handling (?)
> - resource usage controls
Here is an initial review of the ApplyCache patch.

This patch provides a module that takes actions in the WAL stream, 
groups the actions by transaction, and then passes these change records to 
a set of plugin functions.

For each transaction it encounters it keeps a list of the actions in 
that transaction. The ilist included in an earlier patch is used; 
changes resulting from that patch review would affect the code here, but 
not in a way that changes the design.  When the module sees a commit for 
a transaction it calls the apply_change callback for each change.

I can think of three ways that a replication system like this could try 
to apply transactions.

1) Each time it sees a new transaction it could open up a new 
transaction on the replica and make that change.  It leaves the 
transaction open and goes on applying the next change (which might be 
for the current transaction or might be for another one).
When it comes across a commit record it would then commit the 
transaction.   If 100 concurrent transactions were open on the origin 
then 100 concurrent transactions will be open on the replica.

2) Determine the commit order of the transactions, group all the changes 
for a particular transaction together and apply them in that order for 
the transaction that committed first, commit that transaction and then 
move onto the transaction that committed second.

3) Group the transactions in a way that you move the replica from one 
consistent snapshot to another.  This is what Slony and Londiste do 
because they don't have the commit order or commit timestamps. Built-in 
replication can do better.

This patch implements option (2).   If we had a way of implementing 
option (1) efficiently, would we be better off?

Option (2) requires us to put unparsed WAL data (HeapTuples) in the 
apply cache.  You can't translate this to an independent LCR until you 
call the apply_change callback (which happens once the commit is 
encountered).  The reason for this is that some of the changes might 
be DDL (or things generated by a DDL trigger) that will change the 
translation catalog, so you can't translate the HeapData to LCRs until 
you're at a stage where you can update the translation catalog.  In both 
cases you might need to see later WAL records before you can convert an 
earlier one into an LCR (e.g. TOAST).

Some of my concerns with the apply cache are:

Big transactions (bulk loads, mass updates) will be cached in the apply 
cache until the commit comes along.  One issue Slony has with bulk 
operations is that the replicas can't start processing the bulk INSERT 
until after it has committed.  If it takes 10 hours to load the data on 
the master it will take another 10 hours (at best) to load the data into 
the replica (20 hours after you start the process).  With binary 
streaming replication your replica is done processing the bulk update 
shortly after the master is.

Long-running transactions can sit in the cache for a long time.  When 
you spill to disk we would want the long-running but inactive ones 
spilled to disk first.  This is solvable but adds to the complexity of 
this module; how were you planning on managing which items of the list 
get spilled to disk?

The idea that we can safely reorder the commands into transactional 
groupings works (as far as I know) today because DDL commands get big 
heavy locks that are held until the end of the transaction.  I think 
Robert mentioned earlier in the parent thread that maybe some of that 
will be changed one day.

The downsides of (1) that I see are:

We would want a single backend to keep open multiple transactions at 
once. How hard would that be to implement? Would subtransactions be good 
enough here?

Applying (or even translating WAL to LCRs) the changes in parallel 
across transactions might complicate the catalog structure because each 
concurrent transaction might need its own version of the catalog (or can 
you depend on the locking at the master for this? I think you can today).

With approach (1), changes that are part of a rolled-back transaction 
would have more overhead because you would call apply_change on them.

With approach (1), a later component could still group the LCRs by 
transaction before applying, by running the LCRs through a data 
structure very similar to the ApplyCache.


I think I need more convincing that approach (2), what this patch 
implements, is the best way of doing things compared to (1). I will hold off 
on a more detailed review of the code until I get a better sense of whether 
the design will change.

Steve



Hi Steve,

On Thursday, June 21, 2012 02:16:57 AM Steve Singer wrote:
> On 12-06-13 07:28 AM, Andres Freund wrote:
> > From: Andres Freund <andres@anarazel.de>
> > 
> > The individual changes need to be identified by an xid. The xid can be a
> > subtransaction or a toplevel one; at commit those can be reintegrated by
> > doing a k-way mergesort between the individual transactions.
> > 
> > Callbacks for apply_begin, apply_change and apply_commit are provided to
> > retrieve complete transactions.
> > 
> > Missing:
> > - spill-to-disk
> > - correct subtransaction merge, current behaviour is simple/wrong
> > - DDL handling (?)
> > - resource usage controls
> 
> Here is an initial review of the ApplyCache patch.
Thanks!

> This patch provides a module that takes actions in the WAL stream,
> groups the actions by transaction, and then passes these change records to
> a set of plugin functions.
> 
> For each transaction it encounters it keeps a list of the actions in
> that transaction. The ilist included in an earlier patch is used;
> changes resulting from that patch review would affect the code here, but
> not in a way that changes the design.  When the module sees a commit for
> a transaction it calls the apply_change callback for each change.
> 
> I can think of three ways that a replication system like this could try
> to apply transactions.
> 
> 1) Each time it sees a new transaction it could open up a new
> transaction on the replica and make that change.  It leaves the
> transaction open and goes on applying the next change (which might be
> for the current transaction or might be for another one).
> When it comes across a commit record it would then commit the
> transaction.   If 100 concurrent transactions were open on the origin
> then 100 concurrent transactions will be open on the replica.
> 
> 2) Determine the commit order of the transactions, group all the changes
> for a particular transaction together and apply them in that order for
> the transaction that committed first, commit that transaction and then
> move onto the transaction that committed second.
> 
> 3) Group the transactions in a way that you move the replica from one
> consistent snapshot to another.  This is what Slony and Londiste do
> because they don't have the commit order or commit timestamps. Built-in
> replication can do better.
> 
> This patch implements option (2).   If we had a way of implementing
> option (1) efficiently would we be better off?
> Option (2) requires us to put unparsed WAL data (HeapTuples) in the
> apply cache.  You can't translate this to an independent LCR until you
> call the apply_change record (which happens once the commit is
> encountered).  The reason for this is because some of the changes might
> be DDL (or things generated by a DDL trigger) that will change the
> translation catalog so you can't translate the HeapData to LCR's until
> you're at a stage where you can update the translation catalog.  In both
> cases you might need to see later WAL records before you can convert an
> earlier one into an LCR (ie TOAST).
Very good analysis, thanks!

Another reason why we cannot easily do 1) is that subtransactions aren't 
discernible from top-level transactions before the top-level commit happens; 
we can only properly merge in the right order (by "sorting" via lsn) once we 
have seen the commit record, which includes a list of all committed 
subtransactions.
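
For illustration, a minimal sketch of that k-way merge step in C; the struct
and all names here are assumptions made for the example, not the patch's code.
Each committed (sub)transaction contributes a change stream already ordered by
lsn, and we repeatedly emit the head with the smallest lsn:

#include <stdint.h>
#include <stddef.h>

/* Assumed minimal change struct, for illustration only. */
typedef struct Change
{
    uint64_t       lsn;     /* WAL position of this change */
    struct Change *next;    /* next change in the same (sub)transaction */
} Change;

/*
 * Given one already-sorted change stream per committed (sub)transaction,
 * return the index of the stream whose head has the smallest lsn, or -1
 * once every stream is exhausted.  Calling this repeatedly (and advancing
 * the chosen head) yields all changes in lsn order.
 */
static int
pick_min_stream(Change *const heads[], int nstreams)
{
    int i;
    int best = -1;

    for (i = 0; i < nstreams; i++)
    {
        if (heads[i] != NULL &&
            (best == -1 || heads[i]->lsn < heads[best]->lsn))
            best = i;
    }
    return best;
}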

I also don't think 1) would be particularly welcome by people trying to 
replicate into foreign systems.

> Some of my concerns with the apply cache are
> 
> Big transactions (bulk loads, mass updates) will be cached in the apply
> cache until the commit comes along.  One issue Slony has to do with bulk
> operations is that the replicas can't start processing the bulk INSERT
> until after it has commited.  If it takes 10 hours to load the data on
> the master it will take another 10 hours (at best) to load the data into
> the replica(20 hours after you start the process).  With binary
> streaming replication your replica is done processing the bulk update
> shortly after the master is.
One reason why we have the low-level apply stuff planned is that that way the 
apply on the standby actually takes less time than on the master. If you do it 
right, there is significantly lower overhead there.
Still, a 10h bulk load will definitely cause problems. I don't think there is 
a way around that for now.

> Long running transactions can sit in the cache for a long time.  When
> you spill to disk we would want the long running but inactive ones
> spilled to disk first.  This is solvable but adds to the complexity of
> this module, how were you planning on managing which items of the list
> get spilled to disk?
I planned to have some cutoff 'max_changes_in_memory_per_txn' value. If it has 
been reached for one transaction, all of its existing changes are spilled to 
disk; new changes can again be kept in memory until the limit is reached again.

We need to support serializing the cache for crash recovery + shutdown of the 
receiving side as well. Depending on how we do the wal decoding we will need 
it more frequently...

> The idea that we can safely reorder the commands into transactional
> groupings works (as far as I know) today because DDL commands get big
> heavy locks that are held until the end of the transaction.  I think
> Robert mentioned earlier in the parent thread that maybe some of that
> will be changed one day.
I think we shouldn't worry about that overly much today.

> The downsides of (1) that I see are:
> 
> We would want a single backend to keep open multiple transactions at
> once. How hard would that be to implement? Would subtransactions be good
> enough here?
Subtransactions wouldn't be good enough (they cannot ever be concurrent 
anyway). Implementing multiple concurrent top-level transactions is a major 
project on its own imo.

> Applying (or even translating WAL to LCR's) the changes in parallel
> across transactions might complicate the catalog structure because each
> concurrent transaction might need its own version of the catalog (or can
> you depend on the locking at the master for this? I think you can today)
The locking should be enough, yes.

> I think I need more convincing that approach (2), what this patch
> implements, is the best way of doing things compared to (1). I will hold
> off on a more detailed review of the code until I get a better sense of
> whether the design will change.
I don't think 1) is really possible without much, much more work.
* Multiple autonomous transactions in one backend are a major feature in 
their own right
* correct decoding of tuples likely requires some form of reassembling txns
* toast reassembly requires some minimal transaction reassembly
* subtransaction handling requires full transaction reassembly

I have some ideas how we can parallelize some cases of apply in the future, but 
imo that's an incremental improvement on top of this.

Thanks for the review!

Andres

--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Peter Geoghegan
Date:
On 20 June 2012 14:38, Andres Freund <andres@2ndquadrant.com> wrote:
> It incurs a rather high performance overhead due to added memory allocations
> and added pointer indirections. Thats fine for most of the current users of
> the List interface, but certainly not for all. In other places you cannot even
> have memory allocations because the list lives in shared memory.

Yes, in general lists interact horribly with the memory hierarchy. I
think I pointed out to you once a rant of mine on -hackers a while
back in which I made various points about just how badly they do these
days.

On modern architectures, with many layers of cache, the cost of the
linear search to get an insertion point is very large. So this:

/*
 * removes a node from a list
 * Attention: O(n)
 */
static inline void ilist_s_remove(ilist_s_head *head,
                                  ilist_s_node *node)


is actually even worse than you might suspect.
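
To illustrate why (a sketch with field names assumed, not the actual patch
code): removal from a singly linked list has to chase next-pointers from the
head to find the predecessor, and with nodes scattered across memory each hop
is a potential cache miss.

static inline void
ilist_s_remove_sketch(ilist_s_head *head, ilist_s_node *node)
{
    ilist_s_node *prev = &head->head;

    /* O(n) walk; every iteration may touch a cold cache line */
    while (prev->next != node)
        prev = prev->next;

    prev->next = node->next;
}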

> E.g. in the ApplyCache, where I use the submitted ilist.h stuff, when
> reconstructing transactions you add to a potentially really long linked list
> of individual changes for every interesting wal record. Before I prevented
> memory allocations in that path it took about 12-14% of the time when applying
> changes in the same backend. Afterwards it wasn't visible in the profile
> anymore.

I find that very easy to believe.

> Several of the pieces of code I pointed out in a previous email use open-coded
> list implementation exactly to prevent those problems.

Interesting.

So, it seems like this list implementation could be described as a
minimal embeddable list implementation that requires the user to do
all the memory allocation, and offers a doubly-linked list too. Not an
unreasonable idea. I do think that the constraints you have are not
well served by any existing implementation, including List and Dllist.
Are you planning on just overhauling the Dllist interface in your next
iteration?

As to the question of inlining, the C99 standard (where inlining is
standardised by ANSI, but inspired by earlier extensions to C), unlike
the C89 standard, seems to be well respected by vendors as far as it
goes, with some compilers going to pains to implement it correctly,
like ICC and Clang. We can't really switch to C99, because MSVC
doesn't support it, and it is patently obvious that Microsoft have
zero interest in it.

Funnily enough, if anyone takes C89 as a standard seriously still,
it's Microsoft, if only due to indifference to later standards. This
hack exists purely for the benefit of their strict interpretation of
C89, I think:

/* Define to `__inline__' or `__inline' if that's what the C compiler
   calls it, or to nothing if 'inline' is not supported under any name.  */
#ifndef __cplusplus
/* #undef inline */
#endif

If anyone today is using PostgreSQL binaries in production that were
built with a compiler that does not USE_INLINE, I would be very
surprised indeed. The idea that anyone intends to build 9.3 with a
compiler that doesn't support inline functions is very difficult to
believe. Other C open source projects like Linux freely use inline
functions. Now, granted, it was only possible to build the kernel for
a long time using gcc, but inline had nothing to do with the problem
of building the kernel.

My point is that broadly it makes more practical sense to talk about
GNU C as a standard than C89, and GNU C supports inline functions (C99
is a different matter, but that isn't going to happen in the
foreseeable future). Don't believe me? This is from our configure
script:

# Check if it's Intel's compiler, which (usually) pretends to be gcc,
# but has idiosyncrasies of its own.  We assume icc will define
# __INTEL_COMPILER regardless of CFLAGS.

All of the less popular compilers we support we support precisely
because they pretend to be GCC, with the sole exception, as always, of
the Microsoft product, in this case MSVC. So my position is that I'm
in broad agreement that we should freely allow the use of inline
without macro hacks, since we generally resist using macro hacks when
they make code ugly, which USE_INLINE certainly does, and for a
benefit that is indistinguishable from zero, at least to me.
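
For readers who haven't seen it, the pattern under discussion looks roughly
like this (a sketch, not the actual tree; the function and field names are
invented):

#ifdef USE_INLINE
static inline bool
ilist_s_is_empty(ilist_s_head *head)
{
    return head->head.next == NULL;
}
#else
/* ...declaration only; a second copy of the body lives in some .c file */
extern bool ilist_s_is_empty(ilist_s_head *head);
#endif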

Why are you using the stdlib's <assert.h>? Why have you used the
NDEBUG macro rather than USE_ASSERT_CHECKING? This might make sense if
the header was intended to live in port, but it isn't, right?

Why have you done this:

#ifdef __GNUC__
#define unused_attr __attribute__((unused))
#else
#define unused_attr
#endif

and then gone on to use this unused_attr macro all over the place?
Firstly, that isn't going to suppress the warnings on many platforms
that we support, and we do make an effort to build warning free on at
least 3 compilers these days - GCC, Clang and MSVC. Secondly,
compilers give these warnings because it doesn't make any sense to
have an unused parameter - so why have you used one? At the very
least, if you require this exact interface, use compatibility macros.
I can't imagine why that would be important though. And even if you
did want a standard unused_attr facility, you'd do that in c.h, where
a lot of that stuff lives.

You haven't put a copyright notice or description in this new file,
which is required.

So, I infer that by embeddable you mean that the intended interface of
this is that someone writes a struct that has as its first field a
ilist_d_head or a ilist_s_head, and through the magic of C's
guarantees about how those structures will be laid-out, type punning
can be used to traverse the unknown structure, as opposed to doing
that weird ListCell thing - IIRC, Berkeley sockets uses this kind of
inheritance. This isn't really obvious from the code, and if you
engaged me in conversation and asked what an embeddable list was, I
wouldn't have been able to tell you off the top of my head before
today. The lack of comments generally should be worked on.

I am not sure that I like this inconsistency:
    ilist_d_foreach(t, head)
        ilist_d_remove(head, t);

In the first call, the node is specified and then the head. In the
second, head and then node. This could probably stand to be made consistent.

That's all for now...

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Peter Geoghegan <peter@2ndquadrant.com> writes:
> All of the less popular compilers we support we support precisely
> because they pretend to be GCC, with the sole exception, as always, of
> the Microsoft product, in this case MSVC.

This is nonsense.  There are at least three buildfarm machines running
compilers that do not "pretend to be gcc" (at least, configure
recognizes them as not gcc) and are not MSVC either.  We ought to have
more IMO, because software monocultures are dangerous.  Of those three,
two pass the "quiet inline" test and one --- the newest of the three
if I guess correctly --- does not.  So it is not the case that
!USE_INLINE is dead code, even if you adopt the position that we don't
care about any compiler not represented in the buildfarm.
        regards, tom lane


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Peter Geoghegan
Date:
On 22 June 2012 01:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is nonsense.  There are at least three buildfarm machines running
> compilers that do not "pretend to be gcc" (at least, configure
> recognizes them as not gcc) and are not MSVC either.

So those three don't have medium to high degrees of compatibility with GCC?

> We ought to have more IMO, because software monocultures are dangerous.  Of
> those three, two pass the "quiet inline" test and one --- the newest of the three
> if I guess correctly --- does not.  So it is not the case that
> !USE_INLINE is dead code, even if you adopt the position that we don't
> care about any compiler not represented in the buildfarm.

I note that you said that it doesn't pass the "quiet inline" test, and
not that it doesn't support inline functions.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Friday, June 22, 2012 12:23:57 AM Peter Geoghegan wrote:
> On 20 June 2012 14:38, Andres Freund <andres@2ndquadrant.com> wrote:
> > It incurs a rather high performance overhead due to added memory
> > allocations and added pointer indirections. Thats fine for most of the
> > current users of the List interface, but certainly not for all. In other
> > places you cannot even have memory allocations because the list lives in
> > shared memory.
> Yes, in general lists interact horribly with the memory hierarchy. I
> think I pointed out to you once a rant of mine on -hackers a while
> back in which I made various points about just how badly they do these
> days.
Yes, but how is that relevant? It's still the best data structure for many 
use-cases. Removing one of the two indirections is still a good idea, hence 
this patch ;)

> On modern architectures, with many layers of cache, the cost of the
> linear search to get an insertion point is very large. So this:
> 
> /*
>  * removes a node from a list
>  * Attention: O(n)
>  */
> static inline void ilist_s_remove(ilist_s_head *head,
>                                   ilist_s_node *node)
> 
> 
> is actually even worse than you might suspect.
O(n) is O(n), the constant is irrelevant. Anybody who uses arbitrary node 
removal in a singly linked list in the fast path deserves the pain ;)

> > Several of the pieces of code I pointed out in a previous email use
> > open-coded list implementation exactly to prevent those problems.
> 
> Interesting.
> 
> So, it seems like this list implementation could be described as a
> minimal embeddable list implementation that requires the user to do
> all the memory allocation, and offers a doubly-linked list too. Not an
> unreasonable idea. I do think that the constraints you have are not
> well served by any existing implementation, including List and Dllist.
Yep. Note though that you normally wouldn't do extra/manual memory allocation 
because you just use the already allocated memory of the struct into which you 
embedded the list element.

> Are you planning on just overhauling the Dllist interface in your next
> iteration?
It needs to be unified. Not yet sure whether it's better to just remove Dllist 
or morph my code into it.

> All of the less popular compilers we support we support precisely
> because they pretend to be GCC, with the sole exception, as always, of
> the Microsoft product, in this case MSVC. So my position is that I'm
> in broad agreement that we should freely allow the use of inline
> without macro hacks, since we generally resists using macro hacks if
> that makes code ugly, which USE_INLINE certainly does, and for a
> benefit that is indistinguishable from zero, at least to me.
Tom already pointed out that not all compilers pretend to be gcc. I agree 
though that we should try to make all supported compilers support USE_INLINE. 
I think with some ugliness that should be possible at least for aCC. Will 
respond to Tom on that.

> Why are you using the stdlib's <assert.h>? Why have you used the
> NDEBUG macro rather than USE_ASSERT_CHECKING? This might make sense if
> the header was intended to live in port, but it isn't, right?
That should probably be removed, yes. I did it that way so that it could be 
tested independently of casserts, because the list checking code turns some 
linear algorithms into quadratic ones, which is noticeable even when 
--enable-cassert is defined.

> Why have you done this:
> 
> #ifdef __GNUC__
> #define unused_attr __attribute__((unused))
> #else
> #define unused_attr
> #endif
> 
> and then gone on to use this unused_attr macro all over the place?
> Firstly, that isn't going to suppress the warnings on many platforms
> that we support, and we do make an effort to build warning free on at
> least 3 compilers these days - GCC, Clang and MSVC. Secondly,
> compilers give these warnings because it doesn't make any sense to
> have an unused parameter - so why have you used one? At the very
> least, if you require this exact interface, use compatibility macros.
> I can't imagine why that would be important though. And even if you
> did want a standard unused_attr facility, you'd do that in c.h, where
> a lot of that stuff lives.
If you look at the places where it's mostly used, it's in functions like:

/*
 * adds a node at the beginning of the list
 */
static inline void ilist_d_push_front(ilist_d_head *head, ilist_d_node *node)
{
    node->next = head->head.next;
    node->prev = &head->head;
    node->next->prev = node;
    head->head.next = node;
    ilist_d_check(head);
}
where ilist_d_check doesn't do anything if assertions aren't enabled, which 
gcc unfortunately notices and warns about.

The other case is functions like:

static inline void ilist_s_add_after(unused_attr ilist_s_head *head,
                                     ilist_s_node *after,
                                     ilist_s_node *node)
{
    node->next = after->next;
    after->next = node;
}
where it makes sense for the API to get the head element for consistency 
reasons. It very well would be possible to add a checking function in there 
too. It's just easier to make singly linked lists work correctly than doubly 
linked ones, that's why there is no check so far ;)

> You haven't put a copyright notice or description in this new file,
> which is required.
good point.

> So, I infer that by embeddable you mean that the intended interface of
> this is that someone writes a struct that has as its first field a
> ilist_d_head or a ilist_s_head, and through the magic of C's
> guarantees about how those structures will be laid-out, type punning
> can be used to traverse the unknown structure, as opposed to doing
> that weird ListCell thing - IIRC, Berkeley sockets uses this kind of
> inheritance. This isn't really obvious from the code, and if you
> engaged me in conversation and asked what an embeddable list was, I
> wouldn't have been able to tell you off the top of my head before
> today. The lack of comments generally should be worked on.
It should contain an example + some comments, yes. Let me scribble together 
an example for you from the code using this:

First some structs, ignore the specific contents:

typedef struct ApplyCacheChange
{
    XLogRecPtr lsn;
    enum ApplyCacheChangeType action;
    ApplyCacheTupleBuf* newtuple;
    ApplyCacheTupleBuf* oldtuple;
    HeapTuple table;

    /*
     * While in use this is how a change is linked into a transactions,
     * otherwise its the preallocated list.
     */
    ilist_d_node node;
} ApplyCacheChange;

typedef struct ApplyCacheTXN
{
    TransactionId xid;
    ...
    ilist_d_head changes;
    ...
    /*
     * our position in a list of subtransactions while the TXN is in
     * use. Otherwise its the position in the list of preallocated
     * transactions.
     */
    ilist_d_node node;
} ApplyCacheTXN;

struct ApplyCache
{
    ...
    ilist_d_head cached_changes;
    size_t nr_cached_changes;
    ...
}

------------
Two example functions:

ApplyCacheChange*
ApplyCacheGetChange(ApplyCache* cache)
{
    ApplyCacheChange* change;

    if (cache->nr_cached_changes)
    {
        cache->nr_cached_changes--;
        change = ilist_container(ApplyCacheChange, node,
                                 ilist_d_pop_front(&cache->cached_changes));
    }
    else
    {
        ...
    }
    ...
}

void
ApplyCacheAddChange(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn,
                    ApplyCacheChange* change)
{
    ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, true);

    txn->lsn = lsn;
    ilist_d_push_back(&txn->changes, &change->node);
}

-----------
Explanation:

In ApplyCacheGetChange we pop an item from a list:

change = ilist_container(ApplyCacheChange, node,
                         ilist_d_pop_front(&cache->cached_changes));

If you put that into individual parts:

ilist_d_node *n = ilist_d_pop_front(&cache->cached_changes);

removes one element from the list and returns an ilist_d_node. Obviously that's 
not really all that interesting for us because we want the "content" that's 
been stored there.
So we use:
change = ilist_container(ApplyCacheChange, node, n);

What that does is use the offsetof() macro/builtin to calculate the offset
of the member 'node' in the struct 'ApplyCacheChange' and do the pointer math 
to subtract that from 'n'. Which means that, without any further pointer 
indirection, you can access the contents of the list element.
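
For reference, ilist_container can be defined along these lines (a sketch of
the usual container-of pattern; the patch's exact definition may differ):

#include <stddef.h>

#define ilist_container(type, membername, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, membername)))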

The other required example is:
ilist_d_push_back(&txn->changes, &change->node);

We push a change on the list of changes. But as the ilist machinery only deals 
with 'ilist_d_nodes' we pass it '&change->node' as a parameter.

Does that explanation make sense?

Due to the ilist_container trickery we can have multiple list membership 
nodes in one struct. The important part is that the code that uses the lists 
needs to know in which struct the list element is embedded.

> 
> I am not sure that I like this inconsistency:
> 
>     ilist_d_foreach(t, head)
>     ilist_d_remove(head, t);
> 
> In the first call, head is specified and then node. In the second,
> node and then head. This could probably stand to be made consistent.
I think that's fine because it looks more like the traditional for loops. But 
if others don't agree...

Greetings,

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Friday, June 22, 2012 02:04:02 AM Tom Lane wrote:
> Peter Geoghegan <peter@2ndquadrant.com> writes:
> > All of the less popular compilers we support we support precisely
> > because they pretend to be GCC, with the sole exception, as always, of
> > the Microsoft product, in this case MSVC.
> 
> This is nonsense.  There are at least three buildfarm machines running
> compilers that do not "pretend to be gcc" (at least, configure
> recognizes them as not gcc) and are not MSVC either.  We ought to have
> more IMO, because software monocultures are dangerous.  Of those three,
> two pass the "quiet inline" test and one --- the newest of the three
> if I guess correctly --- does not.  So it is not the case that
> !USE_INLINE is dead code, even if you adopt the position that we don't
> care about any compiler not represented in the buildfarm.
I think you can make HP-UX's aCC do the right thing with some trickery. I 
don't have access to HP-UX anymore though, so I can't test it.

Should there be no other trick - I think there is, though - we could just 
specify -W2177 as an alternative parameter to test in the 'quiet static 
inline' test.

I definitely do not want to bar any sensible compiler from compiling postgres, 
but the keyword here is 'sensible'. If it requires some modest force/trickery 
to behave sensibly, that's ok, but if we need to ship around huge unreadable 
crufty macros just to support them, I don't find it ok.

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Robert Haas
Date:
>> I am not convinced that it's a good idea to wake up every walsender
>> every time we do XLogInsert().  XLogInsert() is a super-hot code path,
>> and adding more overhead there doesn't seem warranted.  We need to
>> replicate commit, commit prepared, etc. quickly, but why do we need to
>> worry about a short delay in replicating heap_insert/update/delete,
>> for example?  They don't really matter until the commit arrives.  7
>> seconds might be a bit long, but that could be fixed by decreasing the
>> polling interval for walsender to, say, a second.
> Its not woken up every XLogInsert call. Its only woken up if there was an
> actual disk write + fsync in there. Thats exactly the point of the patch.

Sure, but it's still adding cycles to XLogInsert.  I'm not sure that
XLogBackgroundFlush() is the right place to be doing this, but at
least it's in the background rather than the foreground.

> The wakeup rate is actually lower for synchronous_commit=on than before
> because then it unconditionally did a wakeup for every commit (and similar)
> and now only does that if something has been written + fsynced.

I'm a bit confused by this, because surely if there's been a commit,
then WAL has been written and fsync'd, but the reverse is not true.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Friday, June 22, 2012 12:23:57 AM Peter Geoghegan wrote:
>> Why are you using the stdlib's <assert.h>? Why have you used the
>> NDEBUG macro rather than USE_ASSERT_CHECKING? This might make sense if
>> the header was intended to live in port, but it isn't, right?

> That should probably be removed, yes. I did it that way that it could be 
> tested independently of casserts because the list checking code turns some 
> linear algorithms into quadratic ones which is noticeable even when --enable-
> cassert is defined.

As far as that goes, I wonder whether the list-checking code hasn't
long since served its purpose.  Neil Conway put it in when he redid the
List API to help catch places that were using no-longer-supported hacks;
but it's been years since I've seen it catch anything.  I suggest that
we might want to either remove it, or enable it via something other than
USE_ASSERT_CHECKING (and not enable it by default).
        regards, tom lane


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Andres Freund
Date:
On Friday, June 22, 2012 04:09:59 PM Robert Haas wrote:
> >> I am not convinced that it's a good idea to wake up every walsender
> >> every time we do XLogInsert().  XLogInsert() is a super-hot code path,
> >> and adding more overhead there doesn't seem warranted.  We need to
> >> replicate commit, commit prepared, etc. quickly, but why do we need to
> >> worry about a short delay in replicating heap_insert/update/delete,
> >> for example?  They don't really matter until the commit arrives.  7
> >> seconds might be a bit long, but that could be fixed by decreasing the
> >> polling interval for walsender to, say, a second.
> > 
> > Its not woken up every XLogInsert call. Its only woken up if there was an
> > actual disk write + fsync in there. Thats exactly the point of the patch.
> Sure, but it's still adding cycles to XLogInsert.  I'm not sure that
> XLogBackgroundFlush() is the right place to be doing this, but at
> least it's in the background rather than the foreground.
It adds one if() if nothing was fsynced. If something was written and fsynced 
inside XLogInsert some kill() calls are surely not the problem.

> > The wakeup rate is actually lower for synchronous_commit=on than before
> > because then it unconditionally did a wakeup for every commit (and
> > similar) and now only does that if something has been written + fsynced.
> I'm a bit confused by this, because surely if there's been a commit,
> then WAL has been written and fsync'd, but the reverse is not true.
As soon as you have significant concurrency, by the time the XLogFlush in 
RecordTransactionCommit() is reached another backend or the wal writer may 
have already fsynced the WAL up to the requested point. In that case no wakeup 
will be performed by the committing backend at all. 9.2 improved the 
likelihood of that, as you know.

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Friday, June 22, 2012 04:18:35 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Friday, June 22, 2012 12:23:57 AM Peter Geoghegan wrote:
> >> Why are you using the stdlib's <assert.h>? Why have you used the
> >> NDEBUG macro rather than USE_ASSERT_CHECKING? This might make sense if
> >> the header was intended to live in port, but it isn't, right?
> > 
> > That should probably be removed, yes. I did it that way that it could be
> > tested independently of casserts because the list checking code turns
> > some linear algorithms into quadratic ones which is noticeable even when
> > --enable- cassert is defined.
> 
> As far as that goes, I wonder whether the list-checking code hasn't
> long since served its purpose.  Neil Conway put it in when he redid the
> List API to help catch places that were using no-longer-supported hacks;
> but it's been years since I've seen it catch anything.  I suggest that
> we might want to either remove it, or enable it via something other than
> USE_ASSERT_CHECKING (and not enable it by default).
Oh, Peter and I weren't talking about the pg_list.h stuff; it was about my 
'embedded list' implementation which started this subthread. The 
pg_list.h/list.c stuff isn't problematic as far as I have seen in profiles; 
its checks are pretty simple, so I do not find that surprising. We might want 
to disable it by default anyway.

In my code the list checking stuff iterates over the complete list after 
modifications and checks that all prev/next pointers are correct, so it's 
linear in itself...

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Robert Haas
Date:
On Fri, Jun 22, 2012 at 10:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Friday, June 22, 2012 04:09:59 PM Robert Haas wrote:
>> >> I am not convinced that it's a good idea to wake up every walsender
>> >> every time we do XLogInsert().  XLogInsert() is a super-hot code path,
>> >> and adding more overhead there doesn't seem warranted.  We need to
>> >> replicate commit, commit prepared, etc. quickly, but why do we need to
>> >> worry about a short delay in replicating heap_insert/update/delete,
>> >> for example?  They don't really matter until the commit arrives.  7
>> >> seconds might be a bit long, but that could be fixed by decreasing the
>> >> polling interval for walsender to, say, a second.
>> >
>> > Its not woken up every XLogInsert call. Its only woken up if there was an
>> > actual disk write + fsync in there. Thats exactly the point of the patch.
>> Sure, but it's still adding cycles to XLogInsert.  I'm not sure that
>> XLogBackgroundFlush() is the right place to be doing this, but at
>> least it's in the background rather than the foreground.
> It adds one if() if nothing was fsynced. If something was written and fsynced
> inside XLogInsert some kill() calls are surely not the problem.
>
>> > The wakeup rate is actually lower for synchronous_commit=on than before
>> > because then it unconditionally did a wakeup for every commit (and
>> > similar) and now only does that if something has been written + fsynced.
>> I'm a bit confused by this, because surely if there's been a commit,
>> then WAL has been written and fsync'd, but the reverse is not true.
> As soon as you have significant concurrency by the time the XLogFlush in
> RecordTransactionCommit() is reached another backend or the wal writer may
> have already fsynced the wal up to the requested point. In that case no wakeup
> will be performed by the committing backend at all. 9.2 improved the likelihood of
> that as you know.

Hmm, well, I guess.  I'm still not sure I really understand what
benefit we're getting out of this.  If we lose a few WAL records for
an uncommitted transaction, who cares?  That transaction is gone
anyway.

As an implementation detail, I suggest rewriting WalSndWakeupRequest
and WalSndWakeupProcess as macros.  The old code does an in-line test
for max_wal_senders > 0, which suggests that somebody thought the
function call overhead might be enough to matter here.  Perhaps they
were wrong, but it shouldn't hurt anything to keep it that way.
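
One possible macro formulation of that suggestion (a sketch only; the flag
variable and its placement are assumptions, and WalSndWakeup is the existing
wakeup routine):

#define WalSndWakeupRequest() \
    do { wake_wal_senders = true; } while (0)

#define WalSndWakeupProcess() \
    do { \
        if (wake_wal_senders) \
        { \
            wake_wal_senders = false; \
            if (max_wal_senders > 0) \
                WalSndWakeup(); \
        } \
    } while (0)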

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> Oh, I and Peter weren't talking about the pg_list.h stuff, it was about my 
> 'embedded list' implementation which started this subthread. The 
> pg_list.h/list.c stuff isn't problematic as far as I have seen in profiles; 
> its checks are pretty simple so I do not find that surprising. We might want 
> to disable it by default anyway.

> In my code the list checking stuff iterates over the complete list after 
> modifications and checks that all prev/next pointers are correct so its linear 
> in itself...

Well, so does list.c, so I'd expect the performance risks to be similar.
Possibly you're testing on longer lists than are typical in the backend.
        regards, tom lane


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Andres Freund
Date:
On Friday, June 22, 2012 04:34:33 PM Robert Haas wrote:
> On Fri, Jun 22, 2012 at 10:19 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > On Friday, June 22, 2012 04:09:59 PM Robert Haas wrote:
> >> >> I am not convinced that it's a good idea to wake up every walsender
> >> >> every time we do XLogInsert().  XLogInsert() is a super-hot code
> >> >> path, and adding more overhead there doesn't seem warranted.  We
> >> >> need to replicate commit, commit prepared, etc. quickly, but why do
> >> >> we need to worry about a short delay in replicating
> >> >> heap_insert/update/delete, for example?  They don't really matter
> >> >> until the commit arrives.  7 seconds might be a bit long, but that
> >> >> could be fixed by decreasing the polling interval for walsender to,
> >> >> say, a second.
> >> > 
> >> > Its not woken up every XLogInsert call. Its only woken up if there was
> >> > an actual disk write + fsync in there. Thats exactly the point of the
> >> > patch.
> >> 
> >> Sure, but it's still adding cycles to XLogInsert.  I'm not sure that
> >> XLogBackgroundFlush() is the right place to be doing this, but at
> >> least it's in the background rather than the foreground.
> > 
> > It adds one if() if nothing was fsynced. If something was written and
> > fsynced inside XLogInsert some kill() calls are surely not the problem.
> > 
> >> > The wakeup rate is actually lower for synchronous_commit=on than
> >> > before because then it unconditionally did a wakeup for every commit
> >> > (and similar) and now only does that if something has been written +
> >> > fsynced.
> >> 
> >> I'm a bit confused by this, because surely if there's been a commit,
> >> then WAL has been written and fsync'd, but the reverse is not true.
> > 
> > As soon as you have significant concurrency by the time the XLogFlush in
> > RecordTransactionCommit() is reached another backend or the wal writer
> > may have already fsynced the wal up to the requested point. In that case
> > no wakeup will be performed by the committing backend at all. 9.2 improved
> > the likelihood of that as you know.
> Hmm, well, I guess.  I'm still not sure I really understand what
> benefit we're getting out of this.  If we lose a few WAL records for
> an uncommitted transaction, who cares?  That transaction is gone
> anyway.
Well, before the simple fix Simon applied after my initial complaint you 
didn't get wakeups *at all* in the synchronous_commit=off case.

Now, with the additional changes, the walsender is woken exactly when data is 
available to send and not always when a commit happens. I played around with 
various scenarios and it always was a win. One reason is that the walreceiver 
often is a bottleneck because it fsyncs the received data immediately, so a 
less blocky transfer pattern reduces that problem a bit.

> As an implementation detail, I suggest rewriting WalSndWakeupRequest
> and WalSndWakeupProcess as macros.  The old code does an in-line test
> for max_wal_senders > 0, which suggests that somebody thought the
> function call overhead might be enough to matter here.  Perhaps they
> were wrong, but it shouldn't hurt anything to keep it that way.
True.

Greetings,

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Friday, June 22, 2012 04:41:20 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > Oh, I and Peter weren't talking about the pg_list.h stuff, it was about
> > my 'embedded list' implementation which started this subthread. The
> > pg_list.h/list.c stuff isn't problematic as far as I have seen in
> > profiles; its checks are pretty simple so I do not find that surprising.
> > We might want to disable it by default anyway.
> > 
> > In my code the list checking stuff iterates over the complete list after
> > modifications and checks that all prev/next pointers are correct so its
> > linear in itself...
> 
> Well, so does list.c, so I'd expect the performance risks to be similar.
> Possibly you're testing on longer lists than are typical in the backend.
I don't think list.c does so:

static void
check_list_invariants(const List *list)
{
    if (list == NIL)
        return;

    Assert(list->length > 0);
    Assert(list->head != NULL);
    Assert(list->tail != NULL);

    Assert(list->type == T_List ||
           list->type == T_IntList ||
           list->type == T_OidList);

    if (list->length == 1)
        Assert(list->head == list->tail);
    if (list->length == 2)
        Assert(list->head->next == list->tail);
    Assert(list->tail->next == NULL);
}

But yes, the lists I deal with are significantly longer, so replacing O(n) by 
O(n^2) is rather painful there...

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Friday, June 22, 2012 04:41:20 PM Tom Lane wrote:
>> Well, so does list.c, so I'd expect the performance risks to be similar.

> I don't think list.c does so:

Huh, OK.  I seem to remember that the original version actually chased
down the whole list and verified that the length matched.  We must've
soon decided that that was insupportable in practice.  There might be
a lesson here for your checks.
        regards, tom lane


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Robert Haas
Date:
On Fri, Jun 22, 2012 at 10:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > the likelihood of that as you know.
>> Hmm, well, I guess.  I'm still not sure I really understand what
>> benefit we're getting out of this.  If we lose a few WAL records for
>> an uncommitted transaction, who cares?  That transaction is gone
>> anyway.
> Well, before the simple fix Simon applied after my initial complaint you
> didn't get wakeups *at all* in the synchronous_commit=off case.
>
> Now, with the additional changes, the walsender is woken exactly when data is
> available to send and not always when a commit happens. I played around with
> various scenarios and it always was a win.

Can you elaborate on that a bit?  What scenarios did you play around
with, and what does "win" mean in this context?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Andres Freund
Date:
On Friday, June 22, 2012 04:59:45 PM Robert Haas wrote:
> On Fri, Jun 22, 2012 at 10:45 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> > the likelihood of that as you know.
> >> 
> >> Hmm, well, I guess.  I'm still not sure I really understand what
> >> benefit we're getting out of this.  If we lose a few WAL records for
> >> an uncommitted transaction, who cares?  That transaction is gone
> >> anyway.
> > 
> > Well, before the simple fix Simon applied after my initial complaint you
> > didn't get wakeups *at all* in the synchronous_commit=off case.
> > 
> > Now, with the additional changes, the walsender is woken exactly when
> > data is available to send and not always when a commit happens. I played
> > around with various scenarios and it always was a win.
> 
> Can you elaborate on that a bit?  What scenarios did you play around
> with, and what does "win" mean in this context?
I had two machines connected locally and set up HS and my prototype between 
them (not at once, obviously).
The patch reduced the average latency between both nodes (measured by 
'ticker' rows arriving in a table on the standby), the jitter in latency, and 
the amount of load I had to put on the master before the standby couldn't keep 
up anymore.

I played with different loads:
* multiple concurrent ~50MB COPY's
* multiple concurrent ~50MB COPY's, pgbench
* pgbench

All three had a ticker running concurrently with synchronous_commit=off 
(because it shouldn't cause any difference in the replication pattern itself).

The differences in average lag and cutoff were smallest with just pgbench 
running alone and biggest with COPY running alone. High jitter was most 
visible with just pgbench running alone, but that's likely just because the 
average lag was smaller.

It's not that surprising, imo. On workloads that have a high WAL throughput, 
like all of the above, XLogInsert frequently has to write out data itself. If 
that happens the walsender might not get woken up in the current setup, so the 
walsender/receiver pair is inactive and starts to work like crazy afterwards 
to catch up. During that period of higher activity it does fsyncs of 
MAX_SEND_SIZE (16 * XLOG_BLKSZ) at a high rate, which reduces the throughput 
of apply...

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Peter Geoghegan <peter@2ndquadrant.com> writes:
> On 22 June 2012 01:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is nonsense.  There are at least three buildfarm machines running
>> compilers that do not "pretend to be gcc" (at least, configure
>> recognizes them as not gcc) and are not MSVC either.

> So those three don't have medium to high degrees of compatibility with GCC?

Uh, they all compile C, so perforce they have reasonable degrees of
compatibility with gcc.  That doesn't mean they implement gcc's
nonstandard extensions.

>> We ought to have more IMO, because software monocultures are
>> dangerous.  Of
>> those three, two pass the "quiet inline" test and one --- the newest of the three
>> if I guess correctly --- does not.  So it is not the case that
>> !USE_INLINE is dead code, even if you adopt the position that we don't
>> care about any compiler not represented in the buildfarm.

> I note that you said that it doesn't pass the "quiet inline" test, and
> not that it doesn't support inline functions.

What's your point?  If the compiler isn't implementing inline the same
way gcc does, we can't use the same inlining arrangements.  I will be
the first to agree that C99's definition of inline sucks, but that
doesn't mean we can assume that gcc's version is implemented everywhere.
        regards, tom lane


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Friday, June 22, 2012 02:04:02 AM Tom Lane wrote:
>> This is nonsense.  There are at least three buildfarm machines running
>> compilers that do not "pretend to be gcc" (at least, configure
>> recognizes them as not gcc) and are not MSVC either.

> Should there be no other trick - I think there is though - we could just 
> specify -W2177 as an alternative parameter to test in the 'quiet static 
> inline' test.

What is that, an MSVC switch?  If so it's rather irrelevant to non-MSVC
compilers.

> I definitely do not want to bar any sensible compiler from compiling postgres
> but the keyword here is 'sensible'. If it requires some modest force/trickery
> to behave sensible, thats ok, but if we need to ship around huge unreadable 
> crufty macros just to support them I don't find it ok.

So you propose to define any compiler that strictly implements C99 as
not sensible and not one that will be able to compile Postgres?  I do
not think that's acceptable.  I have no problem with producing better
code on gcc than elsewhere (as we already do), but being flat out broken
for compilers that don't match gcc's interpretation of "inline" is not
good enough.
        regards, tom lane


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Monday, June 25, 2012 05:15:43 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Friday, June 22, 2012 02:04:02 AM Tom Lane wrote:
> >> This is nonsense.  There are at least three buildfarm machines running
> >> compilers that do not "pretend to be gcc" (at least, configure
> >> recognizes them as not gcc) and are not MSVC either.
> > 
> > Should there be no other trick - I think there is though - we could just
> > specify -W2177 as an alternative parameter to test in the 'quiet static
> > inline' test.
> What is that, an MSVC switch?  If so it's rather irrelevant to non-MSVC
> compilers.
HP-UX/aCC, the only compiler in the buildfarm I found that seems to fall short 
in the "quiet inline" test.

MSVC seems to work fine in supported versions; USE_INLINE is defined 
these days.

> > I definitely do not want to bar any sensible compiler from compiling
> > postgres but the keyword here is 'sensible'. If it requires some modest
> > force/trickery to behave sensible, thats ok, but if we need to ship
> > around huge unreadable crufty macros just to support them I don't find
> > it ok.
> So you propose to define any compiler that strictly implements C99 as
> not sensible and not one that will be able to compile Postgres?  I do
> not think that's acceptable.  I have no problem with producing better
> code on gcc than elsewhere (as we already do), but being flat out broken
> for compilers that don't match gcc's interpretation of "inline" is not
> good enough.
I propose to treat any compiler which has no way to get to equivalent 
behaviour as not sensible. Yes. I don't think there really are many of those 
around. As you pointed out there is only one compiler in the buildfarm with 
problems and I think those can be worked around (can't test it yet though, the 
only HP-UX I could get my hands on quickly is at 11.11...).

Greetings,

Andres

--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Monday, June 25, 2012 05:15:43 PM Tom Lane wrote:
>> So you propose to define any compiler that strictly implements C99 as
>> not sensible and not one that will be able to compile Postgres?

> I propose to treat any compiler which has no way to get to equivalent 
> behaviour as not sensible. Yes.

Well, my response is "no".  I could see saying that we require (some) C99
features at this point, but not features that are in no standard, no
matter how popular gcc might be.

> I don't think there really are many of those 
> around. As you pointed out there is only one compiler in the buildfarm with 
> problems

This just means we don't have a wide enough collection of non-mainstream
machines in the buildfarm.  Deciding to break any platform with a
non-gcc-equivalent compiler isn't going to improve that.
        regards, tom lane


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Andres Freund
Date:
On Monday, June 25, 2012 05:57:51 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Monday, June 25, 2012 05:15:43 PM Tom Lane wrote:
> >> So you propose to define any compiler that strictly implements C99 as
> >> not sensible and not one that will be able to compile Postgres?
> > 
> > I propose to treat any compiler which has no way to get to equivalent
> > behaviour as not sensible. Yes.

> Well, my response is "no".  I could see saying that we require (some) C99
> features at this point, but not features that are in no standard, no
> matter how popular gcc might be.
I fail to see how gcc is the relevant point here, given that there are 
equivalent definitions available from multiple compiler vendors.

Also, 'static inline' *is* C99 conforming as far as I can see? The problem 
with it is that some compilers may warn if the function isn't used in the same 
translation unit. That doesn't make not using a function 
non-standard-conforming, though.

> > I don't think there really are many of those
> > around. As you pointed out there is only one compiler in the buildfarm
> > with problems
> This just means we don't have a wide enough collection of non-mainstream
> machines in the buildfarm.  Deciding to break any platform with a
> non-gcc-equivalent compiler isn't going to improve that.
No, it won't improve that. But neither will the contrary.

Greetings,

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Monday, June 25, 2012 05:57:51 PM Tom Lane wrote:
>> Well, my response is "no".  I could see saying that we require (some) C99
>> features at this point, but not features that are in no standard, no
>> matter how popular gcc might be.

> Also, 'static inline' *is* C99 conforming as far as I can see?

Hmm.  I went back and re-read the C99 spec, and it looks like most of
the headaches we had in the past with C99 inline are specific to the
case where you want an extern declaration to be available.  For a
function that exists *only* as a static it might be all right.  So maybe
I'm misremembering how well this would work.  We'd have to be sure we
don't need any extern declarations, though.
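
To make the distinction concrete (an illustrative example, not project code):

/* Fine everywhere a per-translation-unit copy is acceptable: */
static inline int
twice(int x)
{
    return 2 * x;
}

/*
 * The troublesome C99 case: 'inline' without 'static' in a header is only
 * an inline definition; exactly one .c file must also declare the function
 * 'extern' to emit the external definition, or links can fail.
 */
inline int thrice(int x) { return 3 * x; }  /* in the header */
extern inline int thrice(int x);            /* in exactly one .c file */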

Having said that, I'm still of the opinion that it's not so hard to deal
with that we should just blow off compilers where "inline" doesn't work
well.  I have no sympathy at all for the "we'd need two copies"
argument.  First off, if the code is at any risk whatsoever of changing
intra-major-release, it is not acceptable to inline it (there would be
inline copies in third-party modules where we couldn't ensure
recompilation).  So that's going to force us to use this only in cases
where the code is small and stable enough that two copies aren't such
a big problem.  Second, it's not that hard to set things up so there's
only one source-code copy, as was noted upthread.
        regards, tom lane


Re: [PATCH 04/16] Add embedded list interface (header only)

From
Peter Geoghegan
Date:
On 25 June 2012 20:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> Also, 'static inline' *is* C99 conforming as far as I can see?
>
> Hmm.  I went back and re-read the C99 spec, and it looks like most of
> the headaches we had in the past with C99 inline are specific to the
> case where you want an extern declaration to be available.  For a
> function that exists *only* as a static it might be all right.  So maybe
> I'm misremembering how well this would work.  We'd have to be sure we
> don't need any extern declarations, though.

Yeah, the extern inline functions sound at least superficially
similar to what happened with extern templates in C++ - exactly one
compiler vendor implemented them to the letter of the standard (they
remained completely unimplemented elsewhere), and subsequently went
bust, before they were eventually removed from the standard last year.

Note that when you build Postgres with Clang, it's implicitly and
automatically building C code as C99. There is an excellent analysis
of the situation here, under "C99 inline functions":

http://clang.llvm.org/compatibility.html

> Having said that, I'm still of the opinion that it's not so hard to deal
> with that we should just blow off compilers where "inline" doesn't work
> well.

Fair enough.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


On 12-06-21 04:37 AM, Andres Freund wrote:
> Hi Steve,
> Thanks!
>

Attached is a detailed review of the patch.

> Very good analysis, thanks!
> Another reasons why we cannot easily do 1) is that subtransactions aren't
> discernible from top-level transactions before the top-level commit happens,
> we can only properly merge in the right order (by "sorting" via lsn) once we
> have seen the commit record which includes a list of all committed
> subtransactions.
>

Based on that, and your comments further down in your reply (and that no 
one spoke up and disagreed with you), it sounds like doing (1) isn't 
going to be practical.

> I also don't think 1) would be particularly welcome by people trying to
> replicate into foreign systems.
>

They could still sort the changes into transaction groups before 
applying to the foreign system.


> I planned to have some cutoff 'max_changes_in_memory_per_txn' value. 
> If it has
> been reached for one transaction all existing changes are spilled to disk. New
> changes again can be kept in memory till its reached again.
>
Do you want max_changes_in_memory_per_txn, or do you want to put a limit 
on the total amount of memory that the cache is able to use? How are you 
going to tell a DBA to tune max_changes_in_memory_per_txn? They know how 
much memory their system has and how much of it they can devote to the apply 
cache versus other things. Giving them guidance that requires estimating how 
many open transactions they might have at a point in time, and how many 
WAL change records each transaction generates, seems like a step 
backwards from the progress we've been making in getting PostgreSQL to 
be easier to tune.  The maximum number of transactions that could be 
open at a time is governed by max_connections on the master at the 
time the WAL was generated, so I don't even see how the machine 
processing the WAL records could autotune/guess that.
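
A sketch of the whole-cache alternative being suggested here (every name is 
hypothetical): account each change against a single budget, and spill the 
largest transaction when the budget is exceeded.

static void
ApplyCacheAccountChange(ApplyCache *cache, Size change_size)
{
    cache->memory_used += change_size;

    /* hypothetical: evict whole transactions, biggest first, until we fit */
    while (cache->memory_used > apply_cache_work_mem)
        ApplyCacheSpillLargestTXN(cache);
}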



> We need to support serializing the cache for crash recovery + shutdown of the
> receiving side as well. Depending on how we do the wal decoding we will need
> it more frequently...
>

Have you described your thoughts on crash recovery on another thread?

I am thinking that this module would have to serialize some state 
every time it calls cache->commit() to ensure that consumers don't get 
invoked twice on the same transaction.

If the apply module is making changes to the same backend that the apply 
cache serializes to, then both the state of the apply cache and the 
changes that committed transactions make will be persisted (or 
not persisted) together.  What if I am replicating from x86 to x86_64 
via an apply module that does textout conversions?

x86            Proxy                              x86_64

----WAL------> apply cache
                 |   (proxy catalog)
                 |
               apply module
               textout  ---------------------> SQL statements


How do we ensure that the commits are all visible (or not visible) on 
the catalog on the proxy instance used for decoding WAL and on the destination 
database, and that the state + spill files of the apply cache stay consistent 
with them, in the event of a crash of either the proxy or the target?
I don't think you can (unless we consider two-phase commit, and I'd 
rather we didn't).  Can we come up with a way of avoiding the need for 
them to be consistent with each other?

I think apply modules will need to be able to be passed the same 
transaction twice (once before the crash and again after) and come up 
with a way of deciding if that transaction has a) been applied to the 
translation/proxy catalog and b) been applied on the replica instance.
How is the walreceiver going to decide which WAL segments it needs to 
re-process after a crash?  I would want to see more of these details 
worked out before we finalize the interface between the apply cache and 
the apply modules and how the serialization works.


Code Review
=========

applycache.h
-----------------------
+typedef struct ApplyCacheTupleBuf
+{
+    /* position in preallocated list */
+    ilist_s_node node;
+
+    HeapTupleData tuple;
+    HeapTupleHeaderData header;
+    char data[MaxHeapTupleSize];
+} ApplyCacheTupleBuf;

Each ApplyCacheTupleBuf will be about 8k (BLCKSZ) big no matter how big 
the data in the transaction is? Wouldn't workloads with inserts of lots 
of small rows in a transaction eat up lots of memory that is allocated 
but storing nothing?  The only alternative I can think of is dynamically 
allocating these, and I don't know what the cost/benefit of that overhead 
will be versus spilling to disk sooner.

+* FIXME: better name
+ */
+ApplyCacheChange*
+ApplyCacheGetChange(ApplyCache*);

How about:

ApplyCacheReserveChangeStruct(..)
ApplyCacheReserveChange(...)
ApplyCacheAllocateChange(...)

as ideas?
+/*
+ * Return an unused ApplyCacheChange struct
+ */
+void
+ApplyCacheReturnChange(ApplyCache*, ApplyCacheChange*);

ApplyCacheReleaseChange(...) ?  I keep thinking of 'Return' as us 
returning the data somewhere not the memory.


applycache.c:
-------------------

I've taken a quick look through this file and I don't see any issues 
other than the many FIXME's and other issues you've identified already, 
which I don't expect you to address in this CF.

> Andres
>



Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Robert Haas
Date:
On Fri, Jun 22, 2012 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Can you elaborate on that a bit?  What scenarios did you play around
>> with, and what does "win" mean in this context?
> I had two machines connected locally and setup HS and my prototype between
> them (not at once obviously).
> The patch reduced all the average latency between both nodes (measured by
> 'ticker' rows arriving in a table on the standby), the jitter in latency and
> the amount of load I had to put on the master before the standby couldn't keep
> up anymore.
>
> I played with different loads:
> * multiple concurrent ~50MB COPY's
> * multiple concurrent ~50MB COPY's, pgbench
> * pgbench
>
> All three had a ticker running concurrently with synchronous_commit=off
> (because it shouldn't cause any difference in the replication pattern itself).
>
> The differences in average lag and cutoff were smallest with just pgbench running
> alone and biggest with COPY running alone. High jitter was most visible with
> just pgbench running alone but that's likely just because the average lag was
> smaller.

OK, that sounds pretty promising.  I'd like to run a few performance
tests on this just to convince myself that it doesn't lead to a
significant regression in other scenarios.  Assuming that doesn't turn
up anything major, I'll go ahead and commit this.

Can you provide a rebased version?  It seems that one of the hunks in
xlog.c no longer applies.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Andres Freund
Date:
On Tuesday, June 26, 2012 04:01:26 PM Robert Haas wrote:
> On Fri, Jun 22, 2012 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> Can you elaborate on that a bit?  What scenarios did you play around
> >> with, and what does "win" mean in this context?
> > 
> > I had two machines connected locally and setup HS and my prototype
> > between them (not at once obviously).
> > The patch reduced all the average latency between both nodes (measured by
> > 'ticker' rows arriving in a table on the standby), the jitter in latency
> > and the amount of load I had to put on the master before the standby
> > couldn't keep up anymore.
> > 
> > I played with different loads:
> > * multiple concurrent ~50MB COPY's
> > * multiple concurrent ~50MB COPY's, pgbench
> > * pgbench
> > 
> > All three had a ticker running concurrently with synchronous_commit=off
> > (because it shouldn't cause any difference in the replication pattern
> > itself).
> > 
> > The differences in average lag and cutoff were smallest with just pgbench
> > running alone and biggest with COPY running alone. High jitter was most
> > visible with just pgbench running alone but that's likely just because
> > the average lag was smaller.
> 
> OK, that sounds pretty promising.  I'd like to run a few performance
> tests on this just to convince myself that it doesn't lead to a
> significant regression in other scenarios.  Assuming that doesn't turn
> up anything major, I'll go ahead and commit this.
Independent testing would be great, it's definitely possible that I overlooked 
something although I obviously don't think so ;).

> Can you provide a rebased version?  It seems that one of the hunks in
> xlog.c no longer applies.
Will do so. Not sure if I can finish it today though, I am in the midst of 
redoing the ilist and xlogreader patches. I guess tomorrow will suffice 
otherwise...

Thanks!

Andres


--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Hi Steve,

On Tuesday, June 26, 2012 02:14:22 AM Steve Singer wrote:
> I planned to have some cutoff 'max_changes_in_memory_per_txn' value.
> > If it has
> > been reached for one transaction all existing changes are spilled to
> > disk. New changes again can be kept in memory till it's reached again.
> Do you want max_changes_in_memory_per_txn, or do you want to put a limit
> on the total amount of memory that the cache is able to use? How are you
> going to tell a DBA to tune max_changes_in_memory_per_txn? They know how
> much memory their system has and how much of it they can devote to the
> apply cache versus other things; asking them to estimate how many open
> transactions they might have at a point in time and how many WAL change
> records each transaction generates seems like a step backwards from the
> progress we've been making in getting PostgreSQL to be easier to tune.
> The maximum number of transactions that could be open at a time is
> governed by max_connections on the master at the time the WAL was
> generated, so I don't even see how the machine processing the WAL records
> could autotune/guess that.
It can even be significantly higher than max_connections because 
subtransactions are only recognizable as part of their parent transaction 
upon commit.

I think max_changes_in_memory_per_txn will be the number of changes for now. 
Making memory-based accounting across multiple concurrent transactions work 
efficiently and correctly isn't easy.
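
To make the cutoff concrete, here is a minimal standalone sketch of the
bookkeeping only (the constant and all names are hypothetical; this is not
the actual applycache code):

#include <stdio.h>

#define MAX_CHANGES_IN_MEMORY_PER_TXN 4096

typedef struct TxnState
{
    unsigned long xid;
    int         nr_changes_in_memory; /* changes currently buffered */
    int         nr_spilled;           /* changes already written to disk */
} TxnState;

static void
txn_add_change(TxnState *txn)
{
    if (txn->nr_changes_in_memory >= MAX_CHANGES_IN_MEMORY_PER_TXN)
    {
        /* spill *all* buffered changes of this txn, then start over */
        txn->nr_spilled += txn->nr_changes_in_memory;
        txn->nr_changes_in_memory = 0;
    }
    txn->nr_changes_in_memory++;
}

int
main(void)
{
    TxnState    txn = {1, 0, 0};

    for (int i = 0; i < 10000; i++)
        txn_add_change(&txn);
    printf("in memory: %d, spilled: %d\n",
           txn.nr_changes_in_memory, txn.nr_spilled);
    return 0;
}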


> > We need to support serializing the cache for crash recovery + shutdown of
> > the receiving side as well. Depending on how we do the wal decoding we
> > will need it more frequently...
> Have you described your thoughts on crash recovery on another thread?
I think I have somewhere, but given how much in flux our thoughts on decoding 
are I think it's not that important yet.

> I am thinking that this module would have to serialize some state
> everytime it calls cache->commit() to ensure that consumers don't get
> invoked twice on the same transaction.
In one of the other patches I implemented it by adding the (origin_id, 
origin_lsn) pair to replicated commits. During recovery the startup process 
sets up in shared memory the point up to which we have applied.
If you then, every now and then, perform a 'logical checkpoint' writing down 
what the start LSN of the oldest in-progress transaction is, you can fully 
recover from that point on.
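
As a standalone illustration of that recovery rule (all names hypothetical,
numbers invented): decoding restarts at the logical checkpoint, and any
transaction whose origin_lsn is not beyond the recorded apply position is
skipped:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t SimLsn;        /* stand-in for XLogRecPtr */

/* persisted together with the applied data via the commit records */
static SimLsn last_applied_origin_lsn = 2000;

/* from the periodic 'logical checkpoint': start LSN of the oldest
 * transaction that was still in progress */
static SimLsn restart_decoding_from = 1500;

static bool
txn_needs_apply(SimLsn origin_lsn)
{
    /* everything up to last_applied_origin_lsn was already applied */
    return origin_lsn > last_applied_origin_lsn;
}

int
main(void)
{
    SimLsn      commits[] = {1600, 1900, 2000, 2100, 2500};

    printf("restart decoding at %llu\n",
           (unsigned long long) restart_decoding_from);
    for (int i = 0; i < 5; i++)
        printf("txn committing at %llu: %s\n",
               (unsigned long long) commits[i],
               txn_needs_apply(commits[i]) ? "apply" : "skip");
    return 0;
}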

> If the apply module is making changes to the same backend that the apply
> cache serializes to then both the state for the apply cache and the
> changes that committed changes/transactions make will be persisted (or
> not persisted) together.   What if I am replicating from x86 to x86_64
> via an apply module that does textout conversions?
> 
> x86         Proxy                                 x86_64
> ----WAL------> apply
>                       cache
> 
>                        |   (proxy catalog)
> 
>                       apply module
>                        textout  --------------------->
>                                        SQL statements
> 
> 
> How do we ensure that the commits are all visible (or not visible) on
> the catalog on the proxy instance used for decoding WAL, the destination
> database, and the state + spill files of the apply cache stay consistent
> in the event of a crash of either the proxy or the target?
> I don't think you can (unless we consider two-phase commit, and I'd
> rather we didn't).  Can we come up with a way of avoiding the need for
> them to be consistent with each other?
That's discussed in the "Catalog/Metadata consistency during changeset 
extraction from wal" thread and we haven't yet determined which solution is 
the best ;)

> Code Review
> =========
> 
> applycache.h
> -----------------------
> +typedef struct ApplyCacheTupleBuf
> +{
> +    /* position in preallocated list */
> +    ilist_s_node node;
> +
> +    HeapTupleData tuple;
> +    HeapTupleHeaderData header;
> +    char data[MaxHeapTupleSize];
> +} ApplyCacheTupleBuf;
> 
> Each ApplyCacheTupleBuf will be about 8k (BLCKSZ) big no matter how big
> the data in the transaction is? Wouldn't workloads with inserts of lots
> of small rows in a transaction eat up lots of memory that is allocated
> but storing nothing?  The only alternative I can think of is dynamically
> allocating these and I don't know what the cost/benefit of that overhead
> will be versus spilling to disk sooner.
Dynamically allocating them totally destroys performance, I tried that. I 
think at some point we should have 4 or so lists of preallocated tuple bufs of 
different sizes and then use the smallest possible one. But I think this 
solution is ok in the very first version.

If you allocate dynamically you also get a noticeable performance drop when 
you let the decoding run for a while because of fragmentation inside the 
memory allocator.
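
A standalone sketch of that size-class idea (sizes and names invented for
illustration): keep a few freelists of preallocated buffers and hand out
the smallest one that fits, instead of one BLCKSZ-sized buffer per tuple
or a malloc() per change:

#include <stddef.h>
#include <stdio.h>

static const size_t tupbuf_classes[] = {64, 512, 2048, 8192};
#define NR_TUPBUF_CLASSES \
    (sizeof(tupbuf_classes) / sizeof(tupbuf_classes[0]))

/* pick the smallest preallocated buffer class that fits the tuple */
static int
tupbuf_class_for(size_t tuple_size)
{
    for (size_t i = 0; i < NR_TUPBUF_CLASSES; i++)
        if (tuple_size <= tupbuf_classes[i])
            return (int) i;
    return -1;                  /* oversized: spill or fall back to alloc */
}

int
main(void)
{
    size_t      sizes[] = {30, 400, 1000, 8000};

    for (int i = 0; i < 4; i++)
        printf("tuple of %4zu bytes -> class %d\n",
               sizes[i], tupbuf_class_for(sizes[i]));
    return 0;
}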

> +* FIXME: better name
> + */
> +ApplyCacheChange*
> +ApplyCacheGetChange(ApplyCache*);
> 
> How about:
> 
> ApplyCacheReserveChangeStruct(..)
> ApplyCacheReserveChange(...)
> ApplyCacheAllocateChange(...)
> 
> as ideas?
> +/*
> + * Return an unused ApplyCacheChange struct
> + */
> +void
> +ApplyCacheReturnChange(ApplyCache*, ApplyCacheChange*);
> 
> ApplyCacheReleaseChange(...) ?  I keep thinking of 'Return' as us
> returning the data somewhere not the memory.
Hm. Reserve/Release doesn't sound bad. Acquire/Release is possibly even better 
because reserve could be understood as a preparatory step?

> applycache.c:
> -------------------
> 
> I've taken a quick look through this file and I don't see any issues
> other than the many FIXME's and other issues you've identified already,
> which I don't expect you to address in this CF.
Thanks for the review so far!

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Andres Freund
Date:
On Tuesday, June 26, 2012 04:06:08 PM Andres Freund wrote:
> On Tuesday, June 26, 2012 04:01:26 PM Robert Haas wrote:
> > On Fri, Jun 22, 2012 at 12:35 PM, Andres Freund <andres@2ndquadrant.com>
> 
> wrote:
> > >> Can you elaborate on that a bit?  What scenarios did you play around
> > >> with, and what does "win" mean in this context?
> > > 
> > > I had two machines connected locally and setup HS and my prototype
> > > between them (not at once obviously).
> > > The patch reduced all the average latency between both nodes (measured
> > > by 'ticker' rows arriving in a table on the standby), the jitter in
> > > latency and the amount of load I had to put on the master before the
> > > standby couldn't keep up anymore.
> > > 
> > > I played with different loads:
> > > * multiple concurrent ~50MB COPY's
> > > * multiple concurrent ~50MB COPY's, pgbench
> > > * pgbench
> > > 
> > > All three had a ticker running concurrently with synchronous_commit=off
> > > (because it shouldn't cause any difference in the replication pattern
> > > itself).
> > > 
> > > The differences in average lag and cutoff were smallest with just pgbench
> > > running alone and biggest with COPY running alone. High jitter was most
> > > visible with just pgbench running alone but that's likely just because
> > > the average lag was smaller.
> > 
> > OK, that sounds pretty promising.  I'd like to run a few performance
> > tests on this just to convince myself that it doesn't lead to a
> > significant regression in other scenarios.  Assuming that doesn't turn
> > up anything major, I'll go ahead and commit this.
> 
> Independent testing would be great, it's definitely possible that I overlooked
> something although I obviously don't think so ;).
> 
> > Can you provide a rebased version?  It seems that one of the hunks in
> > xlog.c no longer applies.
> 
> Will do so. Not sure if I can finish it today though, I am in the midst of
> redoing the ilist and xlogreader patches. I guess tomorrow will suffice
> otherwise...
Ok, attached are two patches:
The first is the rebased version of the original patch with 
WalSndWakeupProcess renamed to WalSndWakeupProcessRequests (seems clearer).

The second changes WalSndWakeupRequest and WalSndWakeupProcessRequests into 
macros as you requested before. I am not sure if it's a good idea or not.

Anything else?

Greetings,

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

On 13.06.2012 14:28, Andres Freund wrote:
> @@ -2584,6 +2610,73 @@ l1:
>           rdata[1].buffer_std = true;
>           rdata[1].next = NULL;
>
> +        /*
> +         * XXX: We could decide not to log changes when the origin is not the
> +         * local node, that should reduce redundant logging.
> +         */
> +        if(need_tuple){
> +            xl_heap_header xlhdr;
> ...
> +            relationFindPrimaryKey(relation, &indexoid, &pknratts, pkattnum, pktypoid, pkopclass);
> +
> +            if(!indexoid){
> +                elog(WARNING, "Could not find primary key for table with oid %u",
> +                     relation->rd_id);
> +                goto no_index_found;
> +            }
> +
> +            index_rel = index_open(indexoid, AccessShareLock);
> +
> +            indexdesc = RelationGetDescr(index_rel);
> +
> +            for(natt = 0; natt < indexdesc->natts; natt++){
> +                idxvals[natt] =
> +                    fastgetattr(&tp, pkattnum[natt], desc, &idxisnull[natt]);
> +                Assert(!idxisnull[natt]);
> +            }
> +
> +            idxtuple = heap_form_tuple(indexdesc, idxvals, idxisnull);
> +
> +            xlhdr.t_infomask2 = idxtuple->t_data->t_infomask2;
> +            xlhdr.t_infomask = idxtuple->t_data->t_infomask;
> +            xlhdr.t_hoff = idxtuple->t_data->t_hoff;
> +
> +            rdata[1].next = &(rdata[2]);
> +            rdata[2].data = (char*)&xlhdr;
> +            rdata[2].len = SizeOfHeapHeader;
> +            rdata[2].buffer = InvalidBuffer;
> +            rdata[2].next = NULL;
> +
> +            rdata[2].next = &(rdata[3]);
> +            rdata[3].data = (char *) idxtuple->t_data + offsetof(HeapTupleHeaderData, t_bits);
> +            rdata[3].len = idxtuple->t_len - offsetof(HeapTupleHeaderData, t_bits);
> +            rdata[3].buffer = InvalidBuffer;
> +            rdata[3].next = NULL;
> +
> +            heap_close(index_rel, NoLock);
> +        no_index_found:
> +            ;
> +        }
> +
> +
>           recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE, rdata);
>
>           PageSetLSN(page, recptr);

It's not cool to do all that primary key lookup stuff within the 
critical section, while holding a lock on the page.
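
The shape of the fix would presumably be along these lines (a standalone
mock; the stub functions merely stand in for the real backend machinery
and print what they represent):

#include <stdio.h>

/* illustrative stubs, not the real backend functions */
static void build_pk_index_tuple(void) { printf("PK lookup + heap_form_tuple\n"); }
static void start_crit_section(void)   { printf("START_CRIT_SECTION()\n"); }
static void insert_wal_record(void)    { printf("XLogInsert(...)\n"); }
static void end_crit_section(void)     { printf("END_CRIT_SECTION()\n"); }

int
main(void)
{
    /* 1. everything that can fail, allocate or take locks happens first */
    build_pk_index_tuple();

    /* 2. only then enter the critical section and emit the WAL record */
    start_crit_section();
    insert_wal_record();
    end_crit_section();
    return 0;
}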

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On Tue, Jun 26, 2012 at 8:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It can even be significantly higher than max_connections because
> subtransactions are only recognizable as part of their parent transaction
> upon commit.

I've been wondering whether sub-XID assignment was going to end up on
the list of things that need to be WAL-logged to enable logical
replication.  It would be nicer to avoid that if we can, but I have a
feeling that we may not be able to.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thursday, June 28, 2012 06:01:10 PM Robert Haas wrote:
> On Tue, Jun 26, 2012 at 8:13 PM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > It can even be significantly higher than max_connections because
> > subtransactions are only recognizable as part of their parent transaction
> > upon commit.
> 
> I've been wondering whether sub-XID assignment was going to end up on
> the list of things that need to be WAL-logged to enable logical
> replication.  It would be nicer to avoid that if we can, but I have a
> feeling that we may not be able to.
I don't think it needs to. We only need that information during commit and we 
have it there. If a subtxn aborts a separate abort is logged, so that's no 
problem. The 'merging' of the transactions would be slightly easier if we had 
the knowledge from the get-go but that would add complications again in the 
case of rollbacks.
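
A standalone sketch of that merge step (structures invented for
illustration): the commit record supplies the committed (sub)xids, and
their buffered changes are brought into the right order by sorting on LSN:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Change
{
    uint64_t    lsn;
    uint32_t    xid;            /* top-level xid or committed sub-xid */
} Change;

static int
cmp_lsn(const void *a, const void *b)
{
    uint64_t    la = ((const Change *) a)->lsn;
    uint64_t    lb = ((const Change *) b)->lsn;

    return (la > lb) - (la < lb);
}

int
main(void)
{
    /* interleaved changes of top-level xid 10 and its sub-xid 11 */
    Change      changes[] = {{100, 10}, {105, 11}, {102, 10}, {101, 11}};

    qsort(changes, 4, sizeof(Change), cmp_lsn);
    for (int i = 0; i < 4; i++)
        printf("lsn %llu (xid %u)\n",
               (unsigned long long) changes[i].lsn, changes[i].xid);
    return 0;
}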

Why do you think we need it?

Greetings,

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Boszormenyi Zoltan
Date:
On 2012-06-19 09:24, Andres Freund wrote:
> On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
>> On 12-06-18 07:30 AM, Andres Freund wrote:
>>> Hrmpf #666. I will go through the series commit-by-commit again
>>> to make sure everything compiles again. Reordering this late definitely
>>> wasn't a good idea...
>>>
>>> I pushed a rebased version with all those fixups (and removal of the
>>> zeroRecPtr patch).
>> Where did you push that rebased version to? I don't see an attachment,
>> or an updated patch in the commitfest app and your repo at
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summa
>> ry hasn't been updated in 5 days.
> To the 2ndquadrant internal repo. Which strangely doesn't help you.
> *Headdesk*. Pushed to the correct repo and manually verified.

Which repository is the correct one?
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git
was refreshed 11 days ago. The patch taken from there fails with a reject
in src/include/access/xlog.h.

>
> Andres


--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de     http://www.postgresql.at/



Hi,

trying to review this one according to
http://wiki.postgresql.org/wiki/Reviewing_a_Patch

# Is the patch in context diff format
<http://en.wikipedia.org/wiki/Diff#Context_format>?
No. (Does this requirement still apply after PostgreSQL switched to GIT?)

# Does it apply cleanly to the current git master?
No. The patches 01...09 in this series taken from the mailing list apply
cleanly, 10 and 11 fail with rejects.

Best regards,
Zoltán Böszörményi

On 2012-06-13 13:28, Andres Freund wrote:
From: Andres Freund <andres@anarazel.de>

For that add a 'node_id' parameter to most commands dealing with wal
segments. A node_id thats 'InvalidMultimasterNodeId' references local wal,
every other node_id referes to wal in a new pg_lcr directory.

Using duplicated code would reduce the impact of that change but the long-term
code-maintenance burden outweighs that by a far bit.

Besides the decision to add a 'node_id' parameter to several functions the
changes in this patch are fairly mechanical.
---
 src/backend/access/transam/xlog.c           |   54 ++++++++++++++++-----------
 src/backend/replication/basebackup.c        |    4 +-
 src/backend/replication/walreceiver.c       |    2 +-
 src/backend/replication/walsender.c         |    9 +++--
 src/bin/initdb/initdb.c                     |    1 +
 src/bin/pg_resetxlog/pg_resetxlog.c         |    2 +-
 src/include/access/xlog.h                   |    2 +-
 src/include/access/xlog_internal.h          |   13 +++++--
 src/include/replication/logical.h           |    2 +
 src/include/replication/walsender_private.h |    2 +-
 10 files changed, 56 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 504b4d0..0622726 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -635,8 +635,8 @@ static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
 static bool AdvanceXLInsertBuffer(bool new_segment);
 static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
-static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
-                       bool find_free, int *max_advance,
+static bool InstallXLogFileSegment(RepNodeId node_id, uint32 *log, uint32 *seg,
+                       char *tmppath, bool find_free, int *max_advance,
                        bool use_lock);
 static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
              int source, bool notexistOk);
@@ -1736,8 +1736,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
             /* create/use new log file */
             use_existent = true;
-            openLogFile = XLogFileInit(openLogId, openLogSeg,
-                                       &use_existent, true);
+            openLogFile = XLogFileInit(InvalidMultimasterNodeId, openLogId,
+                                       openLogSeg, &use_existent, true);
             openLogOff = 0;
         }
@@ -2376,6 +2376,9 @@ XLogNeedsFlush(XLogRecPtr record)
  * place.  This should be TRUE except during bootstrap log creation. The
  * caller must *not* hold the lock at call.
  *
+ * node_id: if != InvalidMultimasterNodeId this xlog file is actually a LCR
+ * file
+ *
  * Returns FD of opened file.
  *
  * Note: errors here are ERROR not PANIC because we might or might not be
@@ -2384,8 +2387,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(uint32 log, uint32 seg,
-             bool *use_existent, bool use_lock)
+XLogFileInit(RepNodeId node_id, uint32 log, uint32 seg,
+             bool *use_existent, bool use_lock)
 {
     char        path[MAXPGPATH];
     char        tmppath[MAXPGPATH];
@@ -2396,7 +2399,7 @@ XLogFileInit(uint32 log, uint32 seg,
     int            fd;
     int            nbytes;
-    XLogFilePath(path, ThisTimeLineID, log, seg);
+    XLogFilePath(path, ThisTimeLineID, node_id, log, seg);
     /*
      * Try to use existent file (checkpoint maker may have created it already)
@@ -2425,6 +2428,11 @@ XLogFileInit(uint32 log, uint32 seg,
      */
     elog(DEBUG2, "creating and filling new WAL file");
+    /*
+     * FIXME: to be safe we need to create tempfile in the pg_lcr directory if
+     * its actually an lcr file because pg_lcr might be in a different
+     * partition.
+     */
     snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
     unlink(tmppath);
@@ -2493,7 +2501,7 @@ XLogFileInit(uint32 log, uint32 seg,
     installed_log = log;
     installed_seg = seg;
     max_advance = XLOGfileslop;
-    if (!InstallXLogFileSegment(&installed_log, &installed_seg, tmppath,
+    if (!InstallXLogFileSegment(node_id, &installed_log, &installed_seg, tmppath,
                                 *use_existent, &max_advance,
                                 use_lock))
     {
@@ -2548,7 +2556,7 @@ XLogFileCopy(uint32 log, uint32 seg,
     /*
      * Open the source file
      */
-    XLogFilePath(path, srcTLI, srclog, srcseg);
+    XLogFilePath(path, srcTLI, InvalidMultimasterNodeId, srclog, srcseg);
     srcfd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
     if (srcfd < 0)
         ereport(ERROR,
@@ -2619,7 +2627,8 @@ XLogFileCopy(uint32 log, uint32 seg,
     /*
      * Now move the segment into place with its final name.
      */
-    if (!InstallXLogFileSegment(&log, &seg, tmppath, false, NULL, false))
+    if (!InstallXLogFileSegment(InvalidMultimasterNodeId, &log, &seg, tmppath,
+                                false, NULL, false))
         elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
@@ -2653,14 +2662,14 @@ XLogFileCopy(uint32 log, uint32 seg,
  * file into place.
  */
 static bool
-InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
+InstallXLogFileSegment(RepNodeId node_id, uint32 *log, uint32 *seg, char *tmppath,
                        bool find_free, int *max_advance,
                        bool use_lock)
 {
     char        path[MAXPGPATH];
     struct stat stat_buf;
-    XLogFilePath(path, ThisTimeLineID, *log, *seg);
+    XLogFilePath(path, ThisTimeLineID, node_id, *log, *seg);
     /*
      * We want to be sure that only one process does this at a time.
@@ -2687,7 +2696,7 @@ InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
             }
             NextLogSeg(*log, *seg);
             (*max_advance)--;
-            XLogFilePath(path, ThisTimeLineID, *log, *seg);
+            XLogFilePath(path, ThisTimeLineID, node_id, *log, *seg);
         }
     }
@@ -2736,7 +2745,7 @@ XLogFileOpen(uint32 log, uint32 seg)
     char        path[MAXPGPATH];
     int            fd;
-    XLogFilePath(path, ThisTimeLineID, log, seg);
+    XLogFilePath(path, ThisTimeLineID, InvalidMultimasterNodeId, log, seg);
     fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
                        S_IRUSR | S_IWUSR);
@@ -2783,7 +2792,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
         case XLOG_FROM_PG_XLOG:
         case XLOG_FROM_STREAM:
-            XLogFilePath(path, tli, log, seg);
+            XLogFilePath(path, tli, InvalidMultimasterNodeId, log, seg);
             restoredFromArchive = false;
             break;
@@ -2804,7 +2813,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
         bool        reload = false;
         struct stat statbuf;
-        XLogFilePath(xlogfpath, tli, log, seg);
+        XLogFilePath(xlogfpath, tli, InvalidMultimasterNodeId, log, seg);
         if (stat(xlogfpath, &statbuf) == 0)
         {
             if (unlink(xlogfpath) != 0)
@@ -2922,7 +2931,7 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
     }
     /* Couldn't find it.  For simplicity, complain about front timeline */
-    XLogFilePath(path, recoveryTargetTLI, log, seg);
+    XLogFilePath(path, recoveryTargetTLI, InvalidMultimasterNodeId, log, seg);
     errno = ENOENT;
     ereport(emode,
            (errcode_for_file_access(),
@@ -3366,7 +3375,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
     {
         NextLogSeg(_logId, _logSeg);
         use_existent = true;
-        lf = XLogFileInit(_logId, _logSeg, &use_existent, true);
+        lf = XLogFileInit(InvalidMultimasterNodeId, _logId, _logSeg,
+                          &use_existent, true);
         close(lf);
         if (!use_existent)
             CheckpointStats.ckpt_segs_added++;
@@ -3486,8 +3496,9 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
                  * separate archive directory.
                  */
                 if (lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
-                    InstallXLogFileSegment(&endlogId, &endlogSeg, path,
-                                           true, &max_advance, true))
+                    InstallXLogFileSegment(InvalidMultimasterNodeId, &endlogId,
+                                           &endlogSeg, path, true,
+                                           &max_advance, true))
                 {
                     ereport(DEBUG2,
                            (errmsg("recycled transaction log file \"%s\"",
@@ -5255,7 +5266,8 @@ BootStrapXLOG(void)
     /* Create first XLOG segment file */
     use_existent = false;
-    openLogFile = XLogFileInit(0, 1, &use_existent, false);
+    openLogFile = XLogFileInit(InvalidMultimasterNodeId, 0, 1,
+                               &use_existent, false);
     /* Write the first page with the initial record */
     errno = 0;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0bc88a4..47e4641 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -245,7 +245,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
             char        fn[MAXPGPATH];
             int            i;
-            XLogFilePath(fn, ThisTimeLineID, logid, logseg);
+            XLogFilePath(fn, ThisTimeLineID, InvalidMultimasterNodeId, logid, logseg);
             _tarWriteHeader(fn, NULL, &statbuf);
             /* Send the actual WAL file contents, block-by-block */
@@ -264,7 +264,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
                  * http://lists.apple.com/archives/xcode-users/2003/Dec//msg000
                  * 51.html
                  */
-                XLogRead(buf, ptr, TAR_SEND_SIZE);
+                XLogRead(buf, InvalidMultimasterNodeId, ptr, TAR_SEND_SIZE);
                 if (pq_putmessage('d', buf, TAR_SEND_SIZE))
                     ereport(ERROR,
                             (errmsg("base backup could not send data, aborting backup")));
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 650b74f..e97196b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -509,7 +509,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
             /* Create/use new log file */
             XLByteToSeg(recptr, recvId, recvSeg);
             use_existent = true;
-            recvFile = XLogFileInit(recvId, recvSeg, &use_existent, true);
+            recvFile = XLogFileInit(InvalidMultimasterNodeId, recvId, recvSeg, &use_existent, true);
             recvOff = 0;
         }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e44c734..8cd3a00 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -977,7 +977,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogRead(char *buf, RepNodeId node_id, XLogRecPtr startptr, Size count)
 {
     char       *p;
     XLogRecPtr    recptr;
@@ -1009,8 +1009,8 @@ retry:
                 close(sendFile);
             XLByteToSeg(recptr, sendId, sendSeg);
-            XLogFilePath(path, ThisTimeLineID, sendId, sendSeg);
-
+            XLogFilePath(path, ThisTimeLineID, node_id,
+                         sendId, sendSeg);
             sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
             if (sendFile < 0)
             {
@@ -1215,7 +1215,8 @@ XLogSend(char *msgbuf, bool *caughtup)
      * Read the log directly into the output buffer to avoid extra memcpy
      * calls.
      */
-    XLogRead(msgbuf + 1 + sizeof(WalDataMessageHeader), startptr, nbytes);
+    XLogRead(msgbuf + 1 + sizeof(WalDataMessageHeader), InvalidMultimasterNodeId,
+             startptr, nbytes);
     /*
      * We fill the message header last so that the send timestamp is taken as
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3789948..1f26382 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2637,6 +2637,7 @@ main(int argc, char *argv[])
         "global",
         "pg_xlog",
         "pg_xlog/archive_status",
+        "pg_lcr",
         "pg_clog",
         "pg_notify",
         "pg_serial",
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 65ba910..7ee3a3a 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -973,7 +973,7 @@ WriteEmptyXLOG(void)
     /* Write the first page */
     XLogFilePath(path, ControlFile.checkPointCopy.ThisTimeLineID,
-                 newXlogId, newXlogSeg);
+                 InvalidMultimasterNodeId, newXlogId, newXlogSeg);
     unlink(path);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index dd89cff..3b02c0b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -268,7 +268,7 @@ extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int XLogFileInit(uint32 log, uint32 seg,
+extern int XLogFileInit(RepNodeId node_id, uint32 log, uint32 seg,
              bool *use_existent, bool use_lock);
 extern int    XLogFileOpen(uint32 log, uint32 seg);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 3328a50..deadddf 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -19,6 +19,7 @@
 #include "access/xlog.h"
 #include "fmgr.h"
 #include "pgtime.h"
+#include "replication/logical.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
@@ -216,14 +217,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define MAXFNAMELEN        64
 #define XLogFileName(fname, tli, log, seg)    \
-    snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, log, seg)
+    snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, log, seg);
 #define XLogFromFileName(fname, tli, log, seg)    \
     sscanf(fname, "%08X%08X%08X", tli, log, seg)
-#define XLogFilePath(path, tli, log, seg)    \
-    snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli, log, seg)
-
 #define TLHistoryFileName(fname, tli)    \
     snprintf(fname, MAXFNAMELEN, "%08X.history", tli)
@@ -239,6 +237,13 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define BackupHistoryFilePath(path, tli, log, seg, offset)    \
     snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X.%08X.backup", tli, log, seg, offset)
+/* FIXME: move to xlogutils.c, needs to fix sharing with receivexlog.c first though */
+static inline int XLogFilePath(char* path, TimeLineID tli, RepNodeId node_id, uint32 log, uint32 seg){
+    if(node_id == InvalidMultimasterNodeId)
+        return snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli, log, seg);
+    else
+        return snprintf(path, MAXPGPATH, LCRDIR "/%d/%08X%08X%08X", node_id, tli, log, seg);
+}
 /*
  * Method table for resource managers.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 0698b61..8f44fad 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -19,4 +19,6 @@ extern XLogRecPtr current_replication_origin_lsn;
 #define InvalidMultimasterNodeId 0
 #define MaxMultimasterNodeId (2<<3)
+
+#define LCRDIR                "pg_lcr"
 #endif
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 66234cd..bc58ff4 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -95,7 +95,7 @@ extern WalSndCtlData *WalSndCtl;
 extern void WalSndSetState(WalSndState state);
-extern void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+extern void XLogRead(char *buf, RepNodeId node_id, XLogRecPtr startptr, Size count);
 /*
  * Internal functions for parsing the replication grammar, in repl_gram.y and

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de     http://www.postgresql.at/

Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From
Andres Freund
Date:
On Friday, June 29, 2012 02:43:49 PM Boszormenyi Zoltan wrote:
> On 2012-06-19 09:24, Andres Freund wrote:
> > On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
> >> On 12-06-18 07:30 AM, Andres Freund wrote:
> >>> Hrmpf #666. I will go through the series commit-by-commit again
> >>> to make sure everything compiles again. Reordering this late
> >>> definitely wasn't a good idea...
> >>>
> >>> I pushed a rebased version with all those fixups (and removal of the
> >>> zeroRecPtr patch).
> >>
> >> Where did you push that rebased version to? I don't see an attachment,
> >> or an updated patch in the commitfest app and your repo at
> >> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=su
> >> mma ry hasn't been updated in 5 days.
> >
> > To the 2ndquadrant internal repo. Which strangely doesn't help you.
> > *Headdesk*. Pushed to the correct repo and manually verified.
>
> Which repository is the correct one?
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git
> was refreshed 11 days ago. The patch taken from there fails with a reject
> in src/include/access/xlog.h.
That's the right repository, but Heikki's recent changes
(dfda6ebaec6763090fb78b458a979b558c50b39b and several following) changed quite
a bit on master and I am currently rebasing and addressing review
comments.

Greetings,

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Hi,

On Friday, June 29, 2012 02:43:52 PM Boszormenyi Zoltan wrote:
> trying to review this one according to
> http://wiki.postgresql.org/wiki/Reviewing_a_Patch
> 
> # Is the patch in context diff format
> <http://en.wikipedia.org/wiki/Diff#Context_format>?
> No. (Does this requirement still apply after PostgreSQL switched to GIT?)
Many people seem to send patches in unified format and just some days ago Tom 
said it doesn't matter to him. I still can't properly read context diffs, so I 
am using unified...

> # Does it apply cleanly to the current git master?
> No. The patches 01...09 in this series taken from the mailing list apply
> cleanly, 10 and 11 fail with rejects.
Yea, and even the patches before that need to be rebased; at least partially 
they won't compile even though they apply cleanly.

I will produce a rebased version soon, but we haven't fully agreed on the 
preliminary patches to this one, so there doesn't seem to be much point in 
reviewing this one before the other stuff is clear.

Marking the patch as "Returned with Feedback" for now. I have done the same 
with 13, 15. Those seem to be too much in limbo for CF-style reviews.

Thanks!

Andres


On 2012-06-29 15:01, Andres Freund wrote:
> Hi,
>
> On Friday, June 29, 2012 02:43:52 PM Boszormenyi Zoltan wrote:
>> trying to review this one according to
>> http://wiki.postgresql.org/wiki/Reviewing_a_Patch
>>
>> # Is the patch in context diff format
>> <http://en.wikipedia.org/wiki/Diff#Context_format>?
>> No. (Does this requirement still apply after PostgreSQL switched to GIT?)
> Many people seem to send patches in unified format and just some days ago Tom
> said it doesn't matter to him. I still can't properly read context diffs, so I
> am using unified...

Unified diffs are usually more readable for me after following
the Linux kernel development for years and are shorter than
context diffs.

>> # Does it apply cleanly to the current git master?
>> No. The patches 01...09 in this series taken from the mailing list apply
>> cleanly, 10 and 11 fail with rejects.
> Yea, and even the patches before that need to be rebased; at least partially
> they won't compile even though they apply cleanly.
>
> I will produce a rebased version soon, but we haven't fully agreed on the
> preliminary patches to this one, so there doesn't seem to be much point in
> reviewing this one before the other stuff is clear.
>
> Marking the patch as "Returned with Feedback" for now. I have done the same
> with 13, 15. Those seem to be too much in limbo for CF-style reviews.
>
> Thanks!

You're welcome.

> Andres


--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de     http://www.postgresql.at/



Re: [PATCH 13/16] Introduction of pair of logical walreceiver/sender

From
Heikki Linnakangas
Date:
On 13.06.2012 14:28, Andres Freund wrote:
> A logical WALReceiver is started directly by Postmaster when we enter PM_RUN
> state and the new parameter multimaster_conninfo is set. For now only one of
> those is started, but the code doesn't rely on that. In future multiple ones
> should be allowed.

Could the receiver-side of this be handled as an extra daemon: 
http://archives.postgresql.org/message-id/CADyhKSW2uyrO3zx-tohzRhN5-vaBEfKNHyvLG1yp7=cx_YH9UA@mail.gmail.com

In general, I feel that the receiver-side could live outside core. The 
sender-side needs to be at least somewhat integrated into the walsender 
stuff, and there are changes to the WAL records etc. that are hard to do 
outside, but AFAICS the stuff to receive changes is pretty high-level 
stuff. As long as the protocol between the logical replication client 
and server is well-defined, it should be possible to write all kinds of 
clients. You could replay the changes to a MySQL database instead of 
PostgreSQL, for example, or send them to a message queue, or just log 
them to a log file for auditing purposes. None of that needs to be 
implemented inside a PostgreSQL server.
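
To illustrate the kind of thin client that would enable (everything here
is hypothetical, not a proposed API), a consumer is just a set of
callbacks fed by whatever speaks the protocol:

#include <stdio.h>

typedef struct ChangeConsumer
{
    void        (*begin_txn) (unsigned long xid);
    void        (*apply_change) (unsigned long xid, const char *change);
    void        (*commit_txn) (unsigned long xid);
} ChangeConsumer;

/* a trivial consumer that just logs for auditing purposes */
static void audit_begin(unsigned long xid) { printf("BEGIN %lu\n", xid); }
static void audit_change(unsigned long xid, const char *c) { printf("  %lu: %s\n", xid, c); }
static void audit_commit(unsigned long xid) { printf("COMMIT %lu\n", xid); }

int
main(void)
{
    ChangeConsumer auditlog = {audit_begin, audit_change, audit_commit};

    /* a MySQL-replaying or message-queue consumer would differ only here */
    auditlog.begin_txn(42);
    auditlog.apply_change(42, "INSERT INTO t VALUES (1)");
    auditlog.commit_txn(42);
    return 0;
}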

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH 13/16] Introduction of pair of logical walreceiver/sender

From
"Kevin Grittner"
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 13.06.2012 14:28, Andres Freund wrote:
>> A logical WALReceiver is started directly by Postmaster when we
>> enter PM_RUN state and the new parameter multimaster_conninfo is
>> set. For now only one of those is started, but the code doesn't
>> rely on that. In future multiple ones should be allowed.
> In general, I feel that the receiver-side could live outside core.
> The sender-side needs to be at least somewhat integrated into the
> walsender stuff, and there are changes to the WAL records etc.
> that are hard to do outside, but AFAICS the stuff to receive
> changes is pretty high-level stuff.
It would be nice if there was at least a thin layer of the sender
portion which could be used by a stand-alone program.  I can think
of lots of useful reasons to "tee" the WAL stream -- passing through
the stream with little or no modification to at least one side.  As
just one example, I would like a program to write traditional WAL
files to match what an archive on the sending side would look like
while passing the stream through to an asynchronous hot standby.
> As long as the protocol between the logical replication client 
> and server is well-defined, it should be possible to write all
> kinds of clients. You could replay the changes to a MySQL database
> instead of PostgreSQL, for example, or send them to a message
> queue, or just log them to a log file for auditing purposes. None
> of that needs to be implemented inside a PostgreSQL server.
+1
-Kevin


Re: [PATCH 13/16] Introduction of pair of logical walreceiver/sender

From
Andres Freund
Date:
On Friday, June 29, 2012 05:16:11 PM Heikki Linnakangas wrote:
> On 13.06.2012 14:28, Andres Freund wrote:
> > A logical WALReceiver is started directly by Postmaster when we enter
> > PM_RUN state and the new parameter multimaster_conninfo is set. For now
> > only one of those is started, but the code doesn't rely on that. In
> > future multiple ones should be allowed.
> 
> Could the receiver-side of this be handled as an extra daemon:
> http://archives.postgresql.org/message-id/CADyhKSW2uyrO3zx-tohzRhN5-vaBEfKN
> HyvLG1yp7=cx_YH9UA@mail.gmail.com
Well, I think it depends on what the protocol turns out to be. In the 
prototype we used the infrastructure from walreceiver which reduced the 
required code considerably.

> In general, I feel that the receiver-side could live outside core.
I think it should be possible to write receivers outside core, but one 
sensible implementation should be in-core.

> The sender-side needs to be at least somewhat integrated into the walsender
> stuff, and there are changes to the WAL records etc. that are hard to do
> outside, but AFAICS the stuff to receive changes is pretty high-level
> stuff. 
> None of that needs to be in implemented inside a PostgreSQL server.
If you want robust and low-overhead crash recovery you need (at least I think 
so) tighter integration into postgres. To be sure that you pick up where you 
stopped after a crash you need to have state synchronized with the commits 
on the receiving side. So you either always write to another table and 
analyze that afterwards - which imo sucks - or you integrate it with the 
commit record. Which needs integration into pg.
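
The invariant, in a standalone mock (names invented): the applied rows and
the replication progress become visible in the same local 'commit', so a
crash can never persist one without the other:

#include <stdint.h>
#include <stdio.h>

typedef struct LocalState
{
    int         rows;               /* the applied user data */
    uint64_t    last_origin_lsn;    /* progress, committed with it */
} LocalState;

static void
apply_remote_commit(LocalState *db, int nrows, uint64_t origin_lsn)
{
    LocalState  txn = *db;          /* stand-in for a local transaction */

    txn.rows += nrows;
    txn.last_origin_lsn = origin_lsn;   /* updated in the same 'commit' */
    *db = txn;                      /* both changes become visible at once */
}

int
main(void)
{
    LocalState  db = {0, 0};

    apply_remote_commit(&db, 10, 1234);
    printf("rows=%d last_origin_lsn=%llu\n",
           db.rows, (unsigned long long) db.last_origin_lsn);
    return 0;
}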

Greetings,

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 13/16] Introduction of pair of logical walreceiver/sender

From
Heikki Linnakangas
Date:
On 29.06.2012 18:28, Kevin Grittner wrote:
> It would be nice if there was at least a thin layer of the sender
> portion which could be used by a stand-alone program.  I can think
> of lots of useful reasons to "tee" the WAL stream -- passing through
> the stream with little or no modification to at least one side.  As
> just one example, I would like a program to write traditional WAL
> files to match what an archive on the sending side would look like
> while passing the stream through to an asynchronous hot standby.

That isn't really related to the logical replication stuff, but I agree 
that would be cool. You can sort of do that with cascading replication, 
but a thin stand-alone program would be nicer.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Hi, Andres!

Here is my review of this patch.

1) Patches don't apply cleanly to head. So I used commit bed88fceac04042f0105eb22a018a4f91d64400d as the base for the patches; against it, all the patches apply cleanly. Regression tests pass OK, but it seems that the new functionality isn't covered by regression tests.

2) Patch needs more comments. I think we need at least one comment at the head of each function describing its behaviour, even if it is evident from the function name.

4) There is significant code duplication between the APPLY_CACHE_CHANGE_UPDATE and APPLY_CACHE_CHANGE_DELETE branches of the switch in the apply_change function. I think this could be refactored to reduce the duplication.

5) The apply mechanism requires a PK on each table, so throwing an error here if we don't find a PK is necessary. But we need to stop the user from running logical replication before the point where applying received messages fails. AFAICS the patch which creates the corresponding log messages is here: http://archives.postgresql.org/message-id/1339586927-13156-7-git-send-email-andres@2ndquadrant.com. And it only throws a warning if it fails to find a PK. At which stage do we prevent the user from running logical replication on tables which don't have a PK?

6) I've seen the comment /* FIXME: locking */. But you open the index with
index_rel = index_open(indexoid, AccessShareLock);
and close it with
heap_close(index_rel, NoLock);
Shouldn't we use the same level of locking on close as on open? Also, heap_close doesn't look right to me for closing an index. Why not use index_close or relation_close?

7) We find each updated and deleted tuple by PK. Imagine we update a significant part of the table (for example, 10%) in a single query and the planner chooses a sequential scan for it. Then applying these changes could be more expensive than making the original changes. That is probably ok. But we could add some heuristics: detect that a sequential scan is cheaper because of the large number of updates or deletes hitting one table.

------
With best regards,
Alexander Korotkov.
On Sun, Jul 1, 2012 at 3:11 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
1) Patches don't apply cleanly to head. So I used commit bed88fceac04042f0105eb22a018a4f91d64400d as the base for the patches; against it, all the patches apply cleanly. Regression tests pass OK, but it seems that the new functionality isn't covered by regression tests.

2) Patch needs more comments. I think we need at least one comment at the head of each function describing its behaviour, even if it is evident from the function name.

4) There is significant code duplication between the APPLY_CACHE_CHANGE_UPDATE and APPLY_CACHE_CHANGE_DELETE branches of the switch in the apply_change function. I think this could be refactored to reduce the duplication.

5) The apply mechanism requires a PK on each table, so throwing an error here if we don't find a PK is necessary. But we need to stop the user from running logical replication before the point where applying received messages fails. AFAICS the patch which creates the corresponding log messages is here: http://archives.postgresql.org/message-id/1339586927-13156-7-git-send-email-andres@2ndquadrant.com. And it only throws a warning if it fails to find a PK. At which stage do we prevent the user from running logical replication on tables which don't have a PK?

6) I've seen the comment /* FIXME: locking */. But you open the index with
index_rel = index_open(indexoid, AccessShareLock);
and close it with
heap_close(index_rel, NoLock);
Shouldn't we use the same level of locking on close as on open? Also, heap_close doesn't look right to me for closing an index. Why not use index_close or relation_close?

7) We find each updated and deleted tuple by PK. Imagine we update a significant part of the table (for example, 10%) in a single query and the planner chooses a sequential scan for it. Then applying these changes could be more expensive than making the original changes. That is probably ok. But we could add some heuristics: detect that a sequential scan is cheaper because of the large number of updates or deletes hitting one table.

8) If we can't find the tuple for an update or delete, we likely need to put the PK value into the log message.

------
With best regards,
Alexander Korotkov.

Hi,

On Sunday, July 01, 2012 05:51:54 PM Alexander Korotkov wrote:
> On Sun, Jul 1, 2012 at 3:11 PM, Alexander Korotkov 
<aekorotkov@gmail.com> wrote:
> > 1) Patches don't apply cleanly to head. So I used commit
> > bed88fceac04042f0105eb22a018a4f91d64400d as the base for the patches;
> > against it, all the patches apply cleanly. Regression tests pass OK,
> > but it seems that the new functionality isn't covered by regression tests.
I don't think the design is stable enough yet to justify regression tests. 
We also don't really have the infrastructure for testing things like 
this :(

> > 2) Patch needs more comments. I think we need at least one comment at
> > the head of each function describing its behaviour, even if it is evident
> > from the function name.
I can do that.

> > 4) There is significant code duplication between the APPLY_CACHE_CHANGE_UPDATE
> > and APPLY_CACHE_CHANGE_DELETE branches of the switch in the apply_change
> > function. I think this could be refactored to reduce the duplication.
Yes. I did a quick stab at that when writing the code but it's different enough 
in the details (i.e. fastgetattr(&change->oldtuple->tuple, i + 1, index_desc, 
&isnull) vs fastgetattr(&change->newtuple->tuple, pkattnum[i], desc, &isnull)) 
and some others that it wasn't totally obvious what a good api would look like.

> > 5) The apply mechanism requires a PK on each table, so throwing an error here
> > if we don't find a PK is necessary. But we need to stop the user from running
> > logical replication before the point where applying received messages fails.
> > AFAICS the patch which creates the corresponding log messages is here:
> > http://archives.postgresql.org/message-id/1339586927-13156-7-git-send-ema
> > il-andres@2ndquadrant.com. And it only throws a warning if it fails to find
> > a PK. At which stage do we prevent the user from running logical replication
> > on tables which don't have a PK?
We don't have that layer yet ;). As we currently don't have any of the 
configuration nailed down I cannot answer that yet.
The reason it doesn't throw an error when generating the additional data is 
that it can easily get you into a situation that you cannot get out of.

> > 6) I've seen the comment /* FIXME: locking */. But you open the index with
> > index_rel = index_open(indexoid, AccessShareLock);
> > and close it with
> > heap_close(index_rel, NoLock);
> > Shouldn't we use the same level of locking on close as on open?
We cannot release the locks before the transaction ends. If you do a (heap|
index)_close with NoLock specified, the lock on the relation will be released 
at the end of the transaction.

> > Also,
> > heap_close doesn't look right to me for closing an index. Why not use
> > index_close or relation_close?
Good point. index_close and heap_close currently do exactly the same thing but 
that doesn't mean they always will.

> > 7) We find each updated and deleted tuple by PK. Imagine we update a
> > significant part of the table (for example, 10%) in a single query and the
> > planner chooses a sequential scan for it. Then applying these changes
> > could be more expensive than making the original changes. That is probably
> > ok. But we could add some heuristics: detect that a sequential scan is
> > cheaper because of the large number of updates or deletes hitting one table.
I don't think that's a good idea. We only ever update a single row at once. You 
need *loads* of non-HOT updates to the same row for a seqscan to be faster 
than a single, one-row index lookup.
Now if we updated the whole table at once and it's smaller than RAM, it 
might be beneficial to preload the whole table into the cache. But that's 
nothing we should do at this level imo.
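
Back-of-the-envelope, with invented numbers, the break-even point looks
roughly like this:

#include <stdio.h>

int
main(void)
{
    double      seqscan_cost = 100000.0;    /* read the whole table once */
    double      index_lookup_cost = 4.0;    /* fetch one row via the PK */

    /* rows one txn must touch before a scan-based strategy could win */
    printf("break-even at ~%.0f touched rows\n",
           seqscan_cost / index_lookup_cost);
    return 0;
}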

> 8) If we can't find the tuple for an update or delete, we likely need to put
> the PK value into the log message.
Yes, that's a good idea.

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Robert Haas
Date:
On Wed, Jun 27, 2012 at 6:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Will do so. Not sure if I can finish it today though, I am in the midst of
>> redoing the ilist and xlogreader patches. I guess tomorrow will suffice
>> otherwise...
> Ok, attached are two patches:
> The first is the rebased version of the original patch with
> WalSndWakeupProcess renamed to WalSndWakeupProcessRequests (seems clearer).
>
> The second changes WalSndWakeupRequest and WalSndWakeupProcessRequests into
> macros as you requested before. I am not sure if it's a good idea or not.
>
> Anything else?

I committed these with just a bit of cleanup.  Sorry it took me a
while to get back to this.  I did run some more tests but found no
regressions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH 01/16] Overhaul walsender wakeup handling

From
Andres Freund
Date:
On Monday, July 02, 2012 03:51:08 PM Robert Haas wrote:
> On Wed, Jun 27, 2012 at 6:33 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> Will do so. Not sure if I can finish it today though, I am in the midst
> >> of redoing the ilist and xlogreader patches. I guess tomorrow will
> >> suffice otherwise...
> > 
> > Ok, attached are two patches:
> > The first is the rebased version of the original patch with
> > WalSndWakeupProcess renamed to WalSndWakeupProcessRequests (seems
> > clearer).
> > 
> > The second changes WalSndWakeupRequest and WalSndWakeupProcessRequests
> > into macros as you requested before. I am not sure if it's a good idea or
> > not.
> > 
> > Anything else?
> I committed these with just a bit of cleanup.  Sorry it took me a
> while to get back to this.  I did run some more tests but found no
> regressions.
Nothing to be sorry for. Looks good.

Thanks

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


On Thu, Jun 28, 2012 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Thursday, June 28, 2012 06:01:10 PM Robert Haas wrote:
>> On Tue, Jun 26, 2012 at 8:13 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>> > It can even be significantly higher than max_connections because
>> > subtransactions are only recognizable as part of their parent transaction
>> > upon commit.
>>
>> I've been wondering whether sub-XID assignment was going to end up on
>> the list of things that need to be WAL-logged to enable logical
>> replication.  It would be nicer to avoid that if we can, but I have a
>> feeling that we may not be able to.
> I don't think it needs to. We only need that information during commit and we
> have it there. If a subtxn aborts a separate abort is logged, so that's no
> problem. The 'merging' of the transactions would be slightly easier if we had
> the knowledge from the get-go but that would add complications again in the
> case of rollbacks.
>
> Why do you think we need it?

Well, I don't know for sure that we do.  But, ultimately, I think
we're going to find that applying the whole transaction at
transaction-end is something that we need to optimize - so that we can
apply the transaction as it happens and commit it at the end.  For
small transactions this is no big deal, but for big transactions it
could lead to hours and hours of replication delay.  I'm not sure
whether it's practical to fix that in version 1 - certainly there are
a lot of issues there - but ultimately I think it's something that our
users are going to demand.  And it seems like that might require
knowing the transaction tree at some point before the commit record
shows up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company