On Jun21, 2012, at 13:41 , Andres Freund wrote:
> 5.)
> The actually good idea. Yours?
What about a mixure of (3b) and (4), which writes the data not to the WAL
but to a separate logical replication log. More specifically:
There's a per-backend queue of change notifications.
Whenever a non-catalog tuple is modified, we queue a TUPLE_MODIFIED
record containing (xid, databaseoid, tableoid, old xmin, old ctid, new ctid)
Whenever a table (or something that a table depends on) is modified we
wait until all references to that table's oid have vanished from the queue,
then queue a DDL record containing (xid, databaseoid, tableoid, text). Other
backend cannot concurrently add further TUPLE_MODIFIED records since we alreay
hold an exclusive lock on the table at that point.
A background process continually processes these queues. If the front of the
queue is a TUPLE_MODIFIED record, it fetches the old and the new tuple
based on their ctids and writes the old tuple's PK and the full new tuple
to the logical replication log. Since table modifications always wait for
all previously queued TUPLE_MODIFIED records referencing that table to be
processes *before* altering the catalog, tuples can always be interpreted
according to the current (SnapshotNow) catalog contents.
Upon transaction COMMIT and ROLLBACK, we queue COMMIT and ROLLBACK records,
which are also written to the log by the background process. The background
process may decide to wait until a backend commits before processing that
backend's log. In that case, rolled back transaction don't leave a trace in
the logical replication log. Should a backend, however, issue a DDL statement,
the background process *must* process that backend's queue immediately, since
otherwise there's a dead lock.
The background process also maintains a value in shared memory which
contains the oldest value in any of the queue's xid or "old xmin" fields.
VACUUM and the like must not remove tuples whose xmin is >= that value.
Hit bits *may* be set for newest tuples though, provided that the background
process ignores hint bits when fetching the old and new tuples.
best regards,
Florian Pflug