logical changeset generation v3 - Mailing list pgsql-hackers

From andres@anarazel.de (Andres Freund)
Subject logical changeset generation v3
Date
Msg-id 20121115002746.GA7692@awork2.anarazel.de
Whole thread Raw
Responses [PATCH 04/14] Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 01/14] Add minimal binary heap implementation  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 03/14] Add simple xlogdump tool  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 06/14] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 11/14] Introduce wal decoding via catalog timetravel  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 14/14] design document v2.3 and snapshot building design doc v0.2  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 07/14] Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 10/14] Allow walsender's to connect to a specific database  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 12/14] Add a simple decoding module in contrib named 'test_decoding'  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader  (Andres Freund <andres@2ndquadrant.com>)
Re: logical changeset generation v3 - git repository  (Andres Freund <andres@2ndquadrant.com>)
Re: lcr - walsender integration  (Andres Freund <andres@2ndquadrant.com>)
Re: logical changeset generation v3  (Josh Berkus <josh@agliodbs.com>)
Re: logical changeset generation v3  (Michael Paquier <michael.paquier@gmail.com>)
Re: logical changeset generation v3 - comparison to Postgres-R change set format  (Markus Wanner <markus@bluegap.ch>)
Re: logical changeset generation v3  (Michael Paquier <michael.paquier@gmail.com>)
Re: logical changeset generation v3 - Source for Slony  (Steve Singer <steve@ssinger.info>)
List pgsql-hackers
Hi,

In response to this you will soon find the 14 patches that currently
implement $subject. I'll go over each one after showing off for a bit:

Start postgres:

Start postgres instance (with pg_hba.conf allowing replication cons):

$ postgres -D ~/tmp/pgdev-lcr \         -c wal_level=logical \         -c max_wal_senders=10 \         -c
max_logical_slots=10\         -c wal_keep_segments=100 \         -c log_line_prefix="[%p %x] "
 

Start the changelog receiver:
$ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v

Generate changes:
$ psql -h /tmp postgres <<EOF

DROP TABLE IF EXISTS replication_example;

CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

-- plain insert
INSERT INTO replication_example(somedata, text) VALUES (1, 1);

-- plain update
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));

-- plain delete
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));

-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
COMMIT;

-- dont write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
ROLLBACK;

-- add a column
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1);
COMMIT;

-- once more outside
INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1);


-- DDL with table rewrite
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (5, 1);
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (5, 2);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 3);
COMMIT;

EOF

And the results printed by llog:

BEGIN 16556826
COMMIT 16556826
BEGIN 16556827
table "replication_example_id_seq": INSERT: sequence_name[name]:replication_example_id_seq last_value[int8]:1
start_value[int8]:1increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1
log_cnt[int8]:0is_cycled[bool]:f is_called[bool]:f
 
COMMIT 16556827
BEGIN 16556828
table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
COMMIT 16556828
BEGIN 16556829
table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
COMMIT 16556829
BEGIN 16556830
table "replication_example": DELETE (pkey): id[int4]:1
COMMIT 16556830
BEGIN 16556833
table "replication_example": INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1
table "replication_example": INSERT: id[int4]:5 somedata[int4]:3 text[varchar]:1 bar[int4]:1
COMMIT 16556833
BEGIN 16556834
table "replication_example": INSERT: id[int4]:6 somedata[int4]:4 text[varchar]:1 bar[int4]:1
COMMIT 16556834
BEGIN 16556835
table "replication_example": INSERT: id[int4]:7 somedata[int4]:5 text[varchar]:1 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:8 somedata[int4]:5 somenum[varchar]:2 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:9 somedata[int4]:5 somenum[int4]:3 bar[int4]:(null)
COMMIT 16556835

As you can see above we can decode WAL in the presence of nearly all
forms of DDL. The plugin that outputted these changes is supposed to be
added to contrib and is fairly small and uncomplicated.

An interesting piece of information might be that in the very
preliminary benchmarking I have done on this even the textual decoding
could keep up with a full tilt pgbench -c16 -j16 -M prepared on my
(somewhat larger) workstation. The wal space overhead was less than 1%
between two freshly initdb'ed clusters, comparing
wal_level=hot_standby with =logical.
With a custom pgbench script I can saturate the decoding to the effect
that it lags a second or so, but once I write out the data in a binary
format it can keep up again.
The biggest overhead is currently the more slowly increasing
Global/RecentXmin, but that can be greatly improved by logging
xl_running_xact's more than just every checkpoint.


A short overview over the patches in this series:

* Add minimal binary heap implementation
Abhijit submitted a nicer version of this, the plan is to rebase ontop
of that once people are happy with the interface.
(unchanged)

* Add support for a generic wal reading facility dubbed XLogReader
There's some discussion about whats the best way to implement this in a
separate CF topic.
(unchanged)

* Add simple xlogdump tool
Very nice for debugging, couldn't have developed this without. Obviously
not a prerequisite for comitting this feature but still pretty worthy.
(quite a bit updated, still bad build infrastructure)

* Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
Relatively simple, somewhat contentious due to some uniqueness
issues. Would very much welcome input from somebody with syscache
experience on this. It was previously suggested to write something like
attoptcache.c for this, but to me that seems to be code-duplication. We
can go that route though.
(unchanged)

* Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
Simple. I don't even think its contentious... Just wasn't needed before.
(unchanged)

* Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs
Just a nice to have thing for debugging, not a prerequisite for the
feature.
(unchanged)

* Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
Uncomplicated and I hope uncontentious.
(new)

*Store the number of subtransactions in xl_running_xacts separately fromtoplevel xids
Increases the size of xl_running_xacts by 4bytes in the worst case,
decreases it in some others. Improves the efficiency of some HS
operations.
Should be ok?
(new)

* Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
Not sure if people will complain about this? Its rather simple due to
the fact that the HeapTupleSatisfiesVisibility wrapper already took a
HeapTupleHeader as parameter.
(new)

* Allow walsender's to connect to a specific database
This has been requested by others. I think we need to work on the
external interface a bit, should be ok otherwise.
(new)

* Introduce wal decoding via catalog timetravel
This is the meat of the feature. I think this is going in a good
direction, still needs some work, but architectural review can really
start now. (more later)
(heavily changed)

* Add a simple decoding module in contrib named 'test_decoding'
The much requested example contrib module.
(new)

* Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
Debugging tool to receive changes and write them to a file. Needs some
more options and probably shouldn't live inside pg_basebackup's
directory.
(new)

* design document v2.3 and snapshot building design doc v0.2
(unchanged)


There remains quite a bit to be done but I think the state of the patch
has improved quite a bit. The biggest thing now is to get input about
the user facing parts so we can get some aggreement there.

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside.
* more frequent xl_running_xid's so xmin can be upped more frequently

Please comment!

Happy and tired,

Andres

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Pg_upgrade speed for many tables
Next
From: Andres Freund
Date:
Subject: [PATCH 01/14] Add minimal binary heap implementation