Thread: logical changeset generation v3
Hi,

In response to this you will soon find the 14 patches that currently implement $subject. I'll go over each one after showing off for a bit.

Start a postgres instance (with pg_hba.conf allowing replication connections; a minimal sketch appears at the end of this mail):

  $ postgres -D ~/tmp/pgdev-lcr \
      -c wal_level=logical \
      -c max_wal_senders=10 \
      -c max_logical_slots=10 \
      -c wal_keep_segments=100 \
      -c log_line_prefix="[%p %x] "

Start the changelog receiver:

  $ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v

Generate changes:

  $ psql -h /tmp postgres <<EOF
  DROP TABLE IF EXISTS replication_example;
  CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

  -- plain insert
  INSERT INTO replication_example(somedata, text) VALUES (1, 1);
  -- plain update
  UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
  -- plain delete
  DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));

  -- wrapped in a transaction
  BEGIN;
  INSERT INTO replication_example(somedata, text) VALUES (1, 1);
  UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
  DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
  COMMIT;

  -- don't write out aborted data
  BEGIN;
  INSERT INTO replication_example(somedata, text) VALUES (2, 1);
  UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
  DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
  ROLLBACK;

  -- add a column
  BEGIN;
  INSERT INTO replication_example(somedata, text) VALUES (3, 1);
  ALTER TABLE replication_example ADD COLUMN bar int;
  INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1);
  COMMIT;

  -- once more outside a transaction
  INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1);

  -- DDL with table rewrite
  BEGIN;
  INSERT INTO replication_example(somedata, text) VALUES (5, 1);
  ALTER TABLE replication_example RENAME COLUMN text TO somenum;
  INSERT INTO replication_example(somedata, somenum) VALUES (5, 2);
  ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
  INSERT INTO replication_example(somedata, somenum) VALUES (5, 3);
  COMMIT;
  EOF

And the results printed by pg_receivellog:

  BEGIN 16556826
  COMMIT 16556826
  BEGIN 16556827
  table "replication_example_id_seq": INSERT: sequence_name[name]:replication_example_id_seq last_value[int8]:1 start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f is_called[bool]:f
  COMMIT 16556827
  BEGIN 16556828
  table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
  COMMIT 16556828
  BEGIN 16556829
  table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
  COMMIT 16556829
  BEGIN 16556830
  table "replication_example": DELETE (pkey): id[int4]:1
  COMMIT 16556830
  BEGIN 16556833
  table "replication_example": INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1
  table "replication_example": INSERT: id[int4]:5 somedata[int4]:3 text[varchar]:1 bar[int4]:1
  COMMIT 16556833
  BEGIN 16556834
  table "replication_example": INSERT: id[int4]:6 somedata[int4]:4 text[varchar]:1 bar[int4]:1
  COMMIT 16556834
  BEGIN 16556835
  table "replication_example": INSERT: id[int4]:7 somedata[int4]:5 text[varchar]:1 bar[int4]:(null)
  table "replication_example": INSERT: id[int4]:8 somedata[int4]:5 somenum[varchar]:2 bar[int4]:(null)
  table "pg_temp_74943": INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 bar[int4]:(null)
  table "pg_temp_74943": INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 bar[int4]:1
  table "pg_temp_74943": INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 bar[int4]:1
  table "pg_temp_74943": INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 bar[int4]:(null)
  table "pg_temp_74943": INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 bar[int4]:(null)
  table "replication_example": INSERT: id[int4]:9 somedata[int4]:5 somenum[int4]:3 bar[int4]:(null)
  COMMIT 16556835

As you can see above, we can decode WAL in the presence of nearly all forms of DDL. The plugin that produced these changes is supposed to be added to contrib and is fairly small and uncomplicated.

An interesting piece of information might be that in the very preliminary benchmarking I have done, even the textual decoding could keep up with a full-tilt pgbench -c16 -j16 -M prepared on my (somewhat larger) workstation. The WAL space overhead was less than 1% between two freshly initdb'ed clusters, comparing wal_level=hot_standby with wal_level=logical. With a custom pgbench script I can saturate the decoding to the effect that it lags a second or so, but once I write out the data in a binary format it can keep up again. The biggest overhead is currently the more slowly increasing Global/RecentXmin, but that can be greatly improved by logging xl_running_xacts more often than just at every checkpoint.

A short overview over the patches in this series:

* Add minimal binary heap implementation
  Abhijit submitted a nicer version of this; the plan is to rebase on top of that once people are happy with the interface. (unchanged)

* Add support for a generic wal reading facility dubbed XLogReader
  There's some discussion about what's the best way to implement this in a separate CF topic. (unchanged)

* Add simple xlogdump tool
  Very nice for debugging; couldn't have developed this without it. Obviously not a prerequisite for committing this feature, but still pretty worthwhile. (quite a bit updated, still bad build infrastructure)

* Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
  Relatively simple, somewhat contentious due to some uniqueness issues. Would very much welcome input from somebody with syscache experience on this. It was previously suggested to write something like attoptcache.c for this, but to me that seems to be code duplication. We can go that route though. (unchanged)

* Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
  Simple. I don't even think it's contentious... Just wasn't needed before. (unchanged)

* Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
  Just a nice-to-have thing for debugging, not a prerequisite for the feature. (unchanged)

* Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
  Uncomplicated and I hope uncontentious. (new)

* Store the number of subtransactions in xl_running_xacts separately from toplevel xids
  Increases the size of xl_running_xacts by 4 bytes in the worst case, decreases it in some others. Improves the efficiency of some HS operations. Should be ok? (new)

* Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
  Not sure if people will complain about this? It's rather simple due to the fact that the HeapTupleSatisfiesVisibility wrapper already took a HeapTupleHeader as parameter. (new)
* Allow walsenders to connect to a specific database
  This has been requested by others. I think we need to work on the external interface a bit; should be ok otherwise. (new)

* Introduce wal decoding via catalog timetravel
  This is the meat of the feature. I think this is going in a good direction, still needs some work, but architectural review can really start now. (more later) (heavily changed)

* Add a simple decoding module in contrib named 'test_decoding'
  The much-requested example contrib module. (new)

* Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
  Debugging tool to receive changes and write them to a file. Needs some more options and probably shouldn't live inside pg_basebackup's directory. (new)

* design document v2.3 and snapshot building design doc v0.2 (unchanged)

There remains quite a bit to be done, but I think the state of the patch has improved considerably. The biggest thing now is to get input on the user-facing parts so we can reach some agreement there.

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside
* more frequent xl_running_xid's so xmin can be upped more frequently

Please comment!

Happy and tired,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
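For reference, the demo above assumes a pg_hba.conf that permits both regular and replication connections over local sockets. A minimal sketch for a throwaway test cluster (trust authentication, not suitable for anything but local testing) might be:

  # pg_hba.conf (sketch, local testing only)
  local   all           all   trust
  local   replication   all   trust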
[PATCH 01/14] Add minimal binary heap implementation

Will be replaced by the "binaryheap.[ch]" from Abhijit once it's been reviewed.

---
 src/backend/lib/Makefile     |   3 +-
 src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/simpleheap.h |  91 +++++++++++++++
 3 files changed, 348 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/simpleheap.c
 create mode 100644 src/include/lib/simpleheap.h
[PATCH 04/14] Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
From: Andres Freund
This cache is somewhat problematic because formally, indexes used by syscaches need to be unique; this one is not. That is "just" because of the 0/InvalidOid values stored in pg_class.relfilenode for nailed/shared catalog relations. The syscache will never be queried for InvalidOid relfilenodes, however, so it seems to be safe even if it violates the rules somewhat.

It might be nicer to add infrastructure to do this properly, like using a partial index; it's not clear what the best way to do this is though.

Needs a CATVERSION bump.

---
 src/backend/utils/cache/syscache.c | 11 +++++++++++
 src/include/catalog/indexing.h     |  2 ++
 src/include/catalog/pg_proc.h      |  1 +
 src/include/utils/syscache.h       |  1 +
 4 files changed, 15 insertions(+)
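To illustrate the intended use, a lookup through the new cache might look like the following (a sketch only: RELFILENODE is the cache id this patch adds, the surrounding variables are assumed, and error handling is elided):

  /* Sketch: fetch the pg_class entry for (reltablespace, relfilenode).
   * Only safe because callers never pass an InvalidOid relfilenode,
   * per the uniqueness caveat above. */
  HeapTuple	tuple;

  tuple = SearchSysCache2(RELFILENODE,
                          ObjectIdGetDatum(reltablespace),
                          ObjectIdGetDatum(relfilenode));
  if (HeapTupleIsValid(tuple))
  {
      Oid		reloid = HeapTupleGetOid(tuple);

      /* ... use reloid, or GETSTRUCT(tuple) for the pg_class columns ... */
      ReleaseSysCache(tuple);
  }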
[PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
From: Andres Freund
---
 src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
 src/include/catalog/indexing.h      |  4 +--
 src/include/utils/relmapper.h       |  2 ++
 3 files changed, 57 insertions(+), 2 deletions(-)
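As a sketch of how the new function complements the existing one (assuming its signature mirrors RelationMapOidToFilenode; relid, filenode and shared are placeholder variables):

  /* existing direction: mapped relation OID -> current filenode */
  Oid		filenode = RelationMapOidToFilenode(relid, shared);

  /* new reverse direction: filenode -> relation OID; presumably
   * returns InvalidOid if the filenode doesn't belong to a mapped
   * relation */
  Oid		mappedOid = RelationMapFilenodeToOid(filenode, shared);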
[PATCH 03/14] Add simple xlogdump tool

---
 src/bin/Makefile            |   2 +-
 src/bin/xlogdump/Makefile   |  25 +++
 src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 494 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/xlogdump/Makefile
 create mode 100644 src/bin/xlogdump/xlogdump.c
[PATCH 06/14] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
From: Andres Freund
This requires the RELFILENODE syscache and the RelationMapFilenodeToOid function added in previous commits.

---
 doc/src/sgml/func.sgml         | 23 +++++++++++-
 src/backend/utils/adt/dbsize.c | 79 ++++++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 4 files changed, 104 insertions(+), 1 deletion(-)
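Usage from SQL is then straightforward; a sketch (the function name is as introduced by this patch, and relations in the default tablespace, i.e. reltablespace = 0, are assumed to be handled):

  -- round-trip a relation through its (tablespace, filenode) pair
  SELECT pg_relation_by_filenode(reltablespace, relfilenode)
  FROM pg_class
  WHERE relname = 'replication_example';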
[PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids
From: Andres Freund
To avoid complicating the logic we store both the toplevel xids and the subxids in ->xip: first the ->xcnt toplevel ones, then the ->subxcnt subxids. Also skip logging any subxids if the snapshot is suboverflowed; they aren't useful in that case anyway.

This makes some operations cheaper, and it allows faster startup for the future logical decoding feature, because that doesn't care about subtransactions/suboverflow'edness.

---
 src/backend/access/transam/xlog.c   |  2 ++
 src/backend/storage/ipc/procarray.c | 65 ++++++++++++++++++++++++-------------
 src/backend/storage/ipc/standby.c   |  8 +++--
 src/include/storage/standby.h       |  2 ++
 4 files changed, 52 insertions(+), 25 deletions(-)
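A sketch of the resulting on-disk layout (field names and ordering are assumed here, not copied from the patch; only the xcnt/subxcnt split and the array layout are what the commit message above describes):

  /* Sketch of the layout after this patch:
   *   xids[0 .. xcnt-1]              toplevel xids
   *   xids[xcnt .. xcnt+subxcnt-1]   subtransaction xids
   * subxids are omitted entirely when the snapshot suboverflowed. */
  typedef struct xl_running_xacts
  {
      int             xcnt;            /* # of toplevel xids in xids[] */
      int             subxcnt;         /* # of subxids in xids[] */
      bool            subxid_overflow; /* subxids missing? */
      TransactionId   nextXid;
      TransactionId   oldestRunningXid;
      TransactionId   latestCompletedXid;
      TransactionId   xids[1];         /* variable-length */
  } xl_running_xacts;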
[PATCH 11/14] Introduce wal decoding via catalog timetravel

This introduces several things:
* 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
* 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
* logging more data into wal to facilitate logical decoding
* wal decoding into a reorderbuffer
* shared library output plugins with callbacks for
  * init
  * begin
  * change
  * commit
* walsender infrastructure to stream out changes and to keep the global xmin low enough
* INIT_LOGICAL_REPLICATION $plugin; waits till a consistent snapshot is built and returns
  * initial LSN
  * replication slot identifier
  * id of a pg_export() style snapshot
* START_LOGICAL_REPLICATION $id $lsn; streams out changes
* uses named output plugins for output specification

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside
* more frequent xl_running_xid's so xmin can be upped more frequently
* add STOP_LOGICAL_REPLICATION $id

---
 src/backend/access/heap/heapam.c                |  280 +++++-
 src/backend/access/transam/xlog.c               |    1 +
 src/backend/catalog/index.c                     |   74 ++
 src/backend/replication/Makefile                |    2 +
 src/backend/replication/logical/Makefile        |   19 +
 src/backend/replication/logical/decode.c        |  496 ++++++++++
 src/backend/replication/logical/logicalfuncs.c  |  247 +++++
 src/backend/replication/logical/reorderbuffer.c | 1156 +++++++++++++++++++++++
 src/backend/replication/logical/snapbuild.c     | 1144 ++++++++++++++++++++++
 src/backend/replication/repl_gram.y             |   32 +-
 src/backend/replication/repl_scanner.l          |    2 +
 src/backend/replication/walsender.c             |  566 ++++++++++-
 src/backend/storage/ipc/procarray.c             |   23 +
 src/backend/storage/ipc/standby.c               |    8 +-
 src/backend/utils/cache/inval.c                 |    2 +-
 src/backend/utils/cache/relcache.c              |    3 +-
 src/backend/utils/misc/guc.c                    |   11 +
 src/backend/utils/time/tqual.c                  |  249 +++++
 src/bin/pg_controldata/pg_controldata.c         |    2 +
 src/include/access/heapam_xlog.h                |   23 +
 src/include/access/transam.h                    |    5 +
 src/include/access/xlog.h                       |    3 +-
 src/include/catalog/index.h                     |    4 +
 src/include/nodes/nodes.h                       |    2 +
 src/include/nodes/replnodes.h                   |   22 +
 src/include/replication/decode.h                |   21 +
 src/include/replication/logicalfuncs.h          |   44 +
 src/include/replication/output_plugin.h         |   76 ++
 src/include/replication/reorderbuffer.h         |  284 ++++++
 src/include/replication/snapbuild.h             |  128 +++
 src/include/replication/walsender.h             |    1 +
 src/include/replication/walsender_private.h     |   34 +-
 src/include/storage/itemptr.h                   |    3 +
 src/include/storage/sinval.h                    |    2 +
 src/include/utils/tqual.h                       |   31 +-
 35 files changed, 4966 insertions(+), 34 deletions(-)
 create mode 100644 src/backend/replication/logical/Makefile
 create mode 100644 src/backend/replication/logical/decode.c
 create mode 100644 src/backend/replication/logical/logicalfuncs.c
 create mode 100644 src/backend/replication/logical/reorderbuffer.c
 create mode 100644 src/backend/replication/logical/snapbuild.c
 create mode 100644 src/include/replication/decode.h
 create mode 100644 src/include/replication/logicalfuncs.h
 create mode 100644 src/include/replication/output_plugin.h
 create mode 100644 src/include/replication/reorderbuffer.h
 create mode 100644 src/include/replication/snapbuild.h
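To make the output plugin interface concrete, the rough shape of a plugin is sketched below. This is a sketch only: the entry point names follow the pg_decode_* convention the walsender dlsym()s (cf. the protocol discussion later in this thread), but the real signatures live in output_plugin.h and are not reproduced here.

  /*
   * Sketch of an output plugin's exported entry points (names as
   * looked up by the walsender; see output_plugin.h for the real
   * signatures):
   *
   *   pg_decode_init        -- set up plugin-private state
   *   pg_decode_begin_txn   -- called at (reassembled) transaction start
   *   pg_decode_change      -- called once per INSERT/UPDATE/DELETE,
   *                            free to emit any format it likes
   *   pg_decode_commit_txn  -- called at transaction commit
   */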
[PATCH 14/14] design document v2.3 and snapshot building design doc v0.2

---
 src/backend/replication/logical/DESIGN.txt           | 603 +++++++++++++++++++++
 src/backend/replication/logical/Makefile             |   6 +
 src/backend/replication/logical/README.SNAPBUILD.txt | 298 ++++++++++
 3 files changed, 907 insertions(+)
 create mode 100644 src/backend/replication/logical/DESIGN.txt
 create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt
[PATCH 07/14] Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
From: Andres Freund
This is useful to be able to represent a CommandId that's invalid; there was no such value before. It decreases the possible number of command ids by one, which seems unproblematic. It's also not a problem for pg_upgrade, because cmin/cmax are never looked at outside the context of their own transaction (except for timetravel access, but that's new anyway).

---
 src/backend/access/transam/xact.c | 4 ++--
 src/include/c.h                   | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)
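The definition itself is tiny; a sketch of what lands in c.h (the all-ones value is an assumption here, chosen as the natural "maximum"):

  typedef uint32 CommandId;                    /* pre-existing */

  #define FirstCommandId   ((CommandId) 0)     /* pre-existing */
  #define InvalidCommandId (~(CommandId) 0)    /* new: reserved, never
                                                * assigned to a real
                                                * command */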
[PATCH 10/14] Allow walsenders to connect to a specific database

Currently the decision whether to connect to a database or not is made by checking whether the passed "dbname" parameter is "replication". Unfortunately this makes it impossible to connect to a database actually named "replication"...

This is useful for future walsender commands which need database interaction.

---
 src/backend/postmaster/postmaster.c         |  7 ++++--
 .../libpqwalreceiver/libpqwalreceiver.c     |  4 ++--
 src/backend/replication/walsender.c         | 27 ++++++++++++++++++----
 src/backend/utils/init/postinit.c           |  5 ++++
 src/bin/pg_basebackup/pg_basebackup.c       |  4 ++--
 src/bin/pg_basebackup/pg_receivexlog.c      |  4 ++--
 src/bin/pg_basebackup/receivelog.c          |  4 ++--
 7 files changed, 41 insertions(+), 14 deletions(-)
[PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
From: Andres Freund
For the regular satisfies routines this is needed in preparation for logical decoding. I changed the non-regular ones for consistency as well. The naming between htup, tuple and similar is rather confused; I could not find any consistent naming anywhere.

This is preparatory work for the logical decoding feature, which needs to be able to get at a valid relfilenode when checking the visibility of a tuple.

---
 contrib/pgrowlocks/pgrowlocks.c      |  2 +-
 src/backend/access/heap/heapam.c     | 13 ++++++----
 src/backend/access/heap/pruneheap.c  | 16 ++++++++++--
 src/backend/catalog/index.c          |  2 +-
 src/backend/commands/analyze.c       |  3 ++-
 src/backend/commands/cluster.c       |  2 +-
 src/backend/commands/vacuumlazy.c    |  3 ++-
 src/backend/storage/lmgr/predicate.c |  2 +-
 src/backend/utils/time/tqual.c       | 50 +++++++++++++++++++++++++++++------
 src/include/utils/snapshot.h         |  4 +--
 src/include/utils/tqual.h            | 20 +++++++--------
 11 files changed, 83 insertions(+), 34 deletions(-)
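Concretely, for one representative routine the change looks like this (a sketch; the other *Satisfies routines change analogously):

  /* before: only the tuple header is available */
  bool HeapTupleSatisfiesMVCC(HeapTupleHeader tuple,
                              Snapshot snapshot, Buffer buffer);

  /* after: the whole HeapTuple, so t_tableOid/t_self are reachable
   * from inside the visibility check */
  bool HeapTupleSatisfiesMVCC(HeapTuple htup,
                              Snapshot snapshot, Buffer buffer);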
[PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
From: Andres Freund
---
 src/bin/pg_basebackup/Makefile         |   7 +-
 src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
 src/bin/pg_basebackup/streamutil.c     |   3 +-
 src/bin/pg_basebackup/streamutil.h     |   1 +
 4 files changed, 725 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_basebackup/pg_receivellog.c
[PATCH 12/14] Add a simple decoding module in contrib named 'test_decoding'
From: Andres Freund
---
 contrib/Makefile                      |   1 +
 contrib/test_decoding/Makefile        |  16 +++
 contrib/test_decoding/test_decoding.c | 192 ++++++++++++++++++++++++++++++++++
 3 files changed, 209 insertions(+)
 create mode 100644 contrib/test_decoding/Makefile
 create mode 100644 contrib/test_decoding/test_decoding.c
[PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants to do so is not tightly integrated into xlog.c is rather hard and would require changes to rather integral parts of the recovery code, which doesn't seem to be a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer

---
 src/backend/access/transam/Makefile     |    2 +-
 src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  264 ++++++++
 3 files changed, 1297 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h
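The usage pattern a generic reader aims for is roughly the following (an illustrative sketch only; these names are placeholders, not this patch's actual API):

  /* Illustrative only: allocate a reader, hand it a callback that
   * supplies WAL pages, then pull reassembled records one by one. */
  XLogReaderState *reader = XLogReaderAllocate(read_page_callback,
                                               private_data);
  XLogRecord *record;

  while ((record = XLogReadRecord(reader, startptr, &errormsg)) != NULL)
  {
      /* inspect record->xl_rmid, record->xl_info, ... */
  }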
On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
> In response to this you will soon find the 14 patches that currently
> implement $subject.

As it's not very wieldy to send around that many big patches all the time, until the next "major" version I will just update the git tree at:

Web: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3
Git: git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

The current logical walsender integration looks like the following:

  =# INIT_LOGICAL_REPLICATION 'text';
  WARNING: Initiating logical rep
  WARNING: reached consistent point, stopping!
   replication_id | consistent_point | snapshot_name | plugin
  ----------------+------------------+---------------+--------
   id-2           | 3/CACBDF98       | 0xDEADBEEF    | text
  (1 row)

  =# START_LOGICAL_REPLICATION 'id-2' 3/CACBDF98;
  ...

So the current protocol is:

INIT_LOGICAL_REPLICATION '$plugin'; returns
* slot
* first consistent point
* snapshot id

START_LOGICAL_REPLICATION '$slot' $last_received_lsn; streams changes, each wrapped in a 'w' message with (start, end) set to the same value. The content of the data is completely free-format and only depends on the output plugin. Feedback is provided from the client via the normal 'r' messages.

I think that's not a bad start, but we probably can improve it a bit:

INIT_LOGICAL_REPLICATION '$slot' '$plugin' ($key = $value, ...);
START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
STOP_LOGICAL_REPLICATION '$slot';

The options to INIT_LOGICAL_REPLICATION would then get passed to the 'pg_decode_init' output plugin function (i.e. a function of that name would get dlsym()'ed using the pg infrastructure for that).

Does that look good to you? Any suggestions?

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
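Spelled out, a session using the proposed syntax might then look like this (a sketch; the slot name and the option shown are made up for illustration):

  INIT_LOGICAL_REPLICATION 'slot-1' 'test_decoding' (some_option = 'on');
  START_LOGICAL_REPLICATION 'slot-1' 3/CACBDF98;
  STOP_LOGICAL_REPLICATION 'slot-1';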
On 11/14/12 4:27 PM, Andres Freund wrote:
> Hi,
>
> In response to this you will soon find the 14 patches that currently
> implement $subject. I'll go over each one after showing off for a bit:

Lemme be the first to say, "wow". Impressive work.

Now the debugging starts ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Looks like cool stuff @-@
I might be interested in looking at that a bit, as I think I will hopefully be able to grab some time in the next couple of weeks.
Are some of those patches already submitted to a CF?

--
Michael Paquier
http://michael.otacoo.com
Hi,

On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:
> Looks like cool stuff @-@
> I might be interested in looking at that a bit, as I think I will hopefully
> be able to grab some time in the next couple of weeks.
> Are some of those patches already submitted to a CF?

I added the patchset as one entry to the CF this time; it seems to me they are too hard to judge individually to make them really separately reviewable.

I can split it off there, but really all the complicated stuff is in one patch anyway...

Greetings,

Andres
Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids
From: Simon Riggs
On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
> To avoid complicating logic we store both, the toplevel and the subxids, in
> ->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.

That looks good, not much change. Will apply in next few days. Please add me as committer and mark ready.

> Also skip logging any subxids if the snapshot is suboverflowed, they aren't
> useful in that case anyway.

> This allows to make some operations cheaper and it allows faster startup for
> the future logical decoding feature because that doesn't care about
> subtransactions/suboverflow'edness.

...but please don't add extra touches of Andres magic along the way. Doing that will just slow down patch acceptance and it's not important. I suggest keeping note of things like that and coming back to them later.

--
Simon Riggs                         http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids
From: Andres Freund
On 2012-11-15 09:07:23 -0300, Simon Riggs wrote:
> On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > To avoid complicating logic we store both, the toplevel and the subxids, in
> > ->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.
>
> That looks good, not much change. Will apply in next few days. Please
> add me as committer and mark ready.

Cool. Will do.

> > Also skip logging any subxids if the snapshot is suboverflowed, they aren't
> > useful in that case anyway.
>
> > This allows to make some operations cheaper and it allows faster startup for
> > the future logical decoding feature because that doesn't care about
> > subtransactions/suboverflow'edness.
>
> ...but please don't add extra touches of Andres magic along the way.
> Doing that will just slow down patch acceptance and it's not important.
> I suggest keeping note of things like that and coming back to them
> later.

Which magic are you talking about? Only two parts changed in comparison to the previous situation. One is that the following in ProcArrayApplyRecoveryInfo only applies to toplevel transactions, by virtue of ->xcnt now only containing the toplevel transaction count:

> +	/*
> +	 * Remove stale locks, if any.
> +	 *
> +	 * Locks are always assigned to the toplevel xid so we don't need to care
> +	 * about subxcnt/subxids (and by extension not about ->suboverflowed).
> +	 */
> 	StandbyReleaseOldLocks(running->xcnt, running->xids);

Note that there was no code change, just a change in meaning. The other part is:

> +	/*
> +	 * Spin over procArray collecting all subxids, but only if there hasn't
> +	 * been a suboverflow.
> +	 */
> +	if (!suboverflowed)

Well, that's something that basically had to be decided either way when writing the patch...

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 15.11.2012 03:17, Andres Freund wrote:
>
> Features:
> - streaming reading/writing
> - filtering
> - reassembly of records
>
> Reusing the ReadRecord infrastructure in situations where the code that wants
> to do so is not tightly integrated into xlog.c is rather hard and would require
> changes to rather integral parts of the recovery code which doesn't seem to be
> a good idea.
>
> Missing:
> - "compressing" the stream when removing uninteresting records
> - writing out correct CRCs
> - separating reader/writer

I'm disappointed to see that there has been no progress on this patch since the last commitfest. I thought we agreed on the approach I championed here: http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There wasn't much work left to finish that, I believe.

Are you going to continue working on this?

- Heikki
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> > Features:
> > - streaming reading/writing
> > - filtering
> > - reassembly of records
> >
> > Reusing the ReadRecord infrastructure in situations where the code that wants
> > to do so is not tightly integrated into xlog.c is rather hard and would require
> > changes to rather integral parts of the recovery code which doesn't seem to be
> > a good idea.
> >
> > Missing:
> > - "compressing" the stream when removing uninteresting records
> > - writing out correct CRCs
> > - separating reader/writer
>
> I'm disappointed to see that there has been no progress on this patch since
> the last commitfest. I thought we agreed on the approach I championed here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
> wasn't much work left to finish that, I believe.

While I still think my approach is superior, at this point I have accepted that I haven't convinced anybody of that. I plan to port over what I have submitted to Alvaro's version of your patch. I have actually started that, but I simply couldn't finish it in time. The approach for porting I took didn't work all that well, and I plan to restart doing that after doing some review work.

> Are you going to continue working on this?

"This" being my version of XLogReader? No. The patch above is unchanged except for some very minor rebasing onto recent wal changes by Tom. The reason it's included in the series is simply that I haven't gotten rid of it yet and the subsequent patches needed it.

I do plan to continue working on a rebased xlogdump version if nobody beats me to it (please do beat me!).

Ok? The cover letter said:

> * Add support for a generic wal reading facility dubbed XLogReader
>   There's some discussion about what's the best way to implement this in a
>   separate CF topic. (unchanged)

I should have folded that into the patch description, sorry.

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> > Features:
> > - streaming reading/writing
> > - filtering
> > - reassembly of records
> >
> > Reusing the ReadRecord infrastructure in situations where the code that wants
> > to do so is not tightly integrated into xlog.c is rather hard and would require
> > changes to rather integral parts of the recovery code which doesn't seem to be
> > a good idea.
> >
> > Missing:
> > - "compressing" the stream when removing uninteresting records
> > - writing out correct CRCs
> > - separating reader/writer
>
> I'm disappointed to see that there has been no progress on this
> patch since the last commitfest. I thought we agreed on the approach I
> championed here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> There wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

I worked a bit more on that patch of yours, but I neglected to submit it. Did you have something in particular that you wanted changed in it?

--
Álvaro Herrera                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> > On 15.11.2012 03:17, Andres Freund wrote:
> > >
> > > Features:
> > > - streaming reading/writing
> > > - filtering
> > > - reassembly of records
> > >
> > > Reusing the ReadRecord infrastructure in situations where the code that wants
> > > to do so is not tightly integrated into xlog.c is rather hard and would require
> > > changes to rather integral parts of the recovery code which doesn't seem to be
> > > a good idea.
> > >
> > > Missing:
> > > - "compressing" the stream when removing uninteresting records
> > > - writing out correct CRCs
> > > - separating reader/writer
> >
> > I'm disappointed to see that there has been no progress on this
> > patch since the last commitfest. I thought we agreed on the approach I
> > championed here:
> > http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> > There wasn't much work left to finish that, I believe.
> >
> > Are you going to continue working on this?
>
> I worked a bit more on that patch of yours, but I neglected to submit
> it. Did you have something in particular that you wanted changed in it?

Could you push your newest version to your git repository or similar?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Andres Freund wrote:
> On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:
> > I worked a bit more on that patch of yours, but I neglected to submit
> > it. Did you have something in particular that you wanted changed in it?
>
> Could you push your newest version to your git repository or similar?

Sadly, I cannot, because I had it on my laptop only and its screen died this morning (well, actually it doesn't boot at all, so I can't use the external screen either). I'm trying to get it fixed as soon as possible, but obviously I have no idea when I will be able to get it back. Most likely I will have to go out and buy a 2.5" drive enclosure to get the valuable stuff out of it.

--
Álvaro Herrera                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 15.11.2012 16:50, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 15.11.2012 03:17, Andres Freund wrote:
>>>
>>> Features:
>>> - streaming reading/writing
>>> - filtering
>>> - reassembly of records
>>>
>>> Reusing the ReadRecord infrastructure in situations where the code that wants
>>> to do so is not tightly integrated into xlog.c is rather hard and would require
>>> changes to rather integral parts of the recovery code which doesn't seem to be
>>> a good idea.
>>>
>>> Missing:
>>> - "compressing" the stream when removing uninteresting records
>>> - writing out correct CRCs
>>> - separating reader/writer
>>
>> I'm disappointed to see that there has been no progress on this
>> patch since the last commitfest. I thought we agreed on the approach I
>> championed here:
>> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
>> There wasn't much work left to finish that, I believe.
>>
>> Are you going to continue working on this?
>
> I worked a bit more on that patch of yours, but I neglected to submit
> it. Did you have something in particular that you wanted changed in it?

Off the top of my head, there were two open items with the patch as I submitted it:

1. Need to make sure it's easy to compile outside backend code, so that it's suitable for use in an xlogdump contrib module, for example.

2. Do something about error reporting. In particular, xlogreader.c should not call emode_for_corrupt_record(), but we need to provide for that functionality somehow. I think I'd prefer xlogreader.c to not ereport() on a corrupt record. Instead, it would return an error string to the caller, which could then decide what to do with it. Translating the messages needs some thought, though.

Other than those, and cleanup of any obsoleted comments etc. and adding docs, I think it was good to go.

- Heikki
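One possible shape for that no-ereport() interface, as a sketch (this mirrors the "return an error string" idea above, not a settled API):

  /* Instead of ereport()ing, the reader returns NULL on failure and
   * hands back an error string; the caller decides severity,
   * translation, and whether to retry. */
  XLogRecord *XLogReadRecord(XLogReaderState *state,
                             XLogRecPtr RecPtr,
                             char **errormsg);    /* set on failure, else NULL */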
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Heikki Linnakangas wrote:
> On 15.11.2012 16:50, Alvaro Herrera wrote:
> > I worked a bit more on that patch of yours, but I neglected to submit
> > it. Did you have something in particular that you wanted changed in it?
>
> Off the top of my head, there were two open items with the patch
> as I submitted it:
>
> 1. Need to make sure it's easy to compile outside backend code, so
> that it's suitable for use in an xlogdump contrib module, for
> example.
>
> 2. Do something about error reporting. In particular, xlogreader.c
> should not call emode_for_corrupt_record(), but we need to provide
> for that functionality somehow. I think I'd prefer xlogreader.c to
> not ereport() on a corrupt record. Instead, it would return an error
> string to the caller, which could then decide what to do with it.
> Translating the messages needs some thought, though.
>
> Other than those, and cleanup of any obsoleted comments etc. and
> adding docs, I think it was good to go.

Thanks. I was toying with the idea that xlogreader.c should return a status code to the caller, and additionally an error string; not all error cases are equal.

Most of what I did (other than general cleanup) was moving some xlog.c global vars into a private_data struct for xlogreader.c to pass around; one problem I had was deciding what to do with curFileTLI and LastFileTLI (IIRC), because they are used outside of the reader module (they were examined after recovery finished).

--
Álvaro Herrera                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/14/12 8:17 PM, Andres Freund wrote:
> diff --git a/src/bin/Makefile b/src/bin/Makefile
> index b4dfdba..9992f7a 100644
> --- a/src/bin/Makefile
> +++ b/src/bin/Makefile
> @@ -14,7 +14,7 @@ top_builddir = ../..
>  include $(top_builddir)/src/Makefile.global
>
>  SUBDIRS = initdb pg_ctl pg_dump \
> -	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
> +	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump

should be pg_xlogdump

>
>  ifeq ($(PORTNAME), win32)
>  SUBDIRS += pgevent
> diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
> new file mode 100644
> index 0000000..d54640a
> --- /dev/null
> +++ b/src/bin/xlogdump/Makefile
> @@ -0,0 +1,25 @@
> +#-------------------------------------------------------------------------
> +#
> +# Makefile for src/bin/xlogdump
> +#
> +# Copyright (c) 1998-2012, PostgreSQL Global Development Group
> +#
> +# src/bin/pg_resetxlog/Makefile

fix that

> +#
> +#-------------------------------------------------------------------------
> +
> +PGFILEDESC = "xlogdump"
> +PGAPPICON=win32
> +
> +subdir = src/bin/xlogdump
> +top_builddir = ../../..
> +include $(top_builddir)/src/Makefile.global
> +
> +OBJS= xlogdump.o \
> +	$(WIN32RES)
> +
> +all: xlogdump
> +
> +
> +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
> +	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

This looks pretty evil, and there is no documentation about what it is supposed to do. Windows build support needs some thought.

> diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
> new file mode 100644
> index 0000000..0f984e4
> --- /dev/null
> +++ b/src/bin/xlogdump/xlogdump.c
> @@ -0,0 +1,468 @@
> +#include "postgres.h"
> +
> +#include <unistd.h>
> +
> +#include "access/xlogreader.h"
> +#include "access/rmgr.h"
> +#include "miscadmin.h"
> +#include "storage/ipc.h"
> +#include "utils/memutils.h"
> +#include "utils/guc.h"
> +
> +#include "getopt_long.h"
> +
> +/*
> + * needs to be declared because otherwise its defined in main.c which we cannot
> + * link from here.
> + */
> +const char *progname = "xlogdump";

Which may be a reason not to link with main.o. We generally don't want to hardcode the program name inside the program.

> +static void
> +usage(void)
> +{
> +	printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
> +		   progname);
> +	printf(_("Usage:\n"));
> +	printf(_("  %s [OPTION]...\n"), progname);
> +	printf(_("\nOptions:\n"));
> +	printf(_("  -v, --version  output version information, then exit\n"));
> +	printf(_("  -h, --help     show this help, then exit\n"));
> +	printf(_("  -s, --start    from where recptr onwards to read\n"));
> +	printf(_("  -e, --end      up to which recptr to read\n"));
> +	printf(_("  -t, --timeline which timeline do we want to read\n"));
> +	printf(_("  -i, --inpath   from where do we want to read? cwd/pg_xlog is the default\n"));
> +	printf(_("  -o, --output   where to write [start, end]\n"));
> +	printf(_("  -f, --file     wal file to parse\n"));
> +}

Options list should be in alphabetic order (or some other less random order). Most of these descriptions are not very intelligible (at least without additional documentation).

> +int main(int argc, char **argv)
> +{
> +	uint32		xlogid;
> +	uint32		xrecoff;
> +	XLogReaderState *xlogreader_state;
> +	XLogDumpPrivateData private;
> +	XLogRecPtr	from = InvalidXLogRecPtr;
> +	XLogRecPtr	to = InvalidXLogRecPtr;
> +	bool		bad_argument = false;
> +
> +	static struct option long_options[] = {
> +		{"help", no_argument, NULL, 'h'},
> +		{"version", no_argument, NULL, 'v'},

Standard letters for help and version are ? and V.

> +		{"start", required_argument, NULL, 's'},
> +		{"end", required_argument, NULL, 'e'},
> +		{"timeline", required_argument, NULL, 't'},
> +		{"inpath", required_argument, NULL, 'i'},
> +		{"outpath", required_argument, NULL, 'o'},
> +		{"file", required_argument, NULL, 'f'},
> +		{NULL, 0, NULL, 0}
> +	};
> +	int			c;
> +	int			option_index;
> +
> +	memset(&private, 0, sizeof(XLogDumpPrivateData));
> +
> +	while ((c = getopt_long(argc, argv, "hvs:e:t:i:o:f:",

This could also be in a less random order.

> +							long_options, &option_index)) != -1)
> +	{
> +		switch (c)
> +		{
> +			case 'h':
> +				usage();
> +				exit(0);
> +				break;
> +			case 'v':
> +				printf("Version: 0.1\n");
> +				exit(0);
> +				break;

This should be the PostgreSQL version.

Also:

no man page

no nls.mk
On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
> On 11/14/12 8:17 PM, Andres Freund wrote:
> >  SUBDIRS = initdb pg_ctl pg_dump \
> > -	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
> > +	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
>
> should be pg_xlogdump

Good point.

> > +# src/bin/pg_resetxlog/Makefile
>
> fix that

Ditto.

> > +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
> > +	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>
> This looks pretty evil, and there is no documentation about what it is
> supposed to do.

There has been some talk about this before, and this clearly isn't an acceptable solution. The previously stated idea was to split off the _desc routines so we don't need to link with the whole backend. Alvaro started to work on that a bit: http://archives.postgresql.org/message-id/1346268803-sup-9854%40alvh.no-ip.org

(What the above does is simply collect all backend object files, remove main.o from that list, and list them as dependencies.)

> Windows build support needs some thought.

I don't have the slightest clue how the windows build environment works; is there still a problem if we only link to a very selected list of backend object files? Or do we need to link them to some external location?

> > +/*
> > + * needs to be declared because otherwise its defined in main.c which we cannot
> > + * link from here.
> > + */
> > +const char *progname = "xlogdump";
>
> Which may be a reason not to link with main.o.

Well, it's not linking to main.o which causes the problem, but yes, really fixing this is definitely the goal, though not really possible yet.

> Options list should be in alphabetic order (or some other less random
> order). Most of these descriptions are not very intelligible (at
> least without additional documentation).

True, it's noticeable that this mostly was a development tool. But it shouldn't stay that way. There have been several bug reports of late where a bin/pg_xlogdump would have been very helpful...

> This should be the PostgreSQL version.
>
> Also:
>
> no man page
>
> no nls.mk

Will try to provide some actually submittable version once the xlogreader situation is finalized and the _desc routines are split out...

Thanks!

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> ---
>  src/bin/Makefile            |   2 +-
>  src/bin/xlogdump/Makefile   |  25 +++
>  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 494 insertions(+), 1 deletion(-)
>  create mode 100644 src/bin/xlogdump/Makefile
>  create mode 100644 src/bin/xlogdump/xlogdump.c

Is this intended to be the successor of https://github.com/snaga/xlogdump, which will then be deprecated?

Thanks,

Jeff
On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > ---
> >  src/bin/Makefile            |   2 +-
> >  src/bin/xlogdump/Makefile   |  25 +++
> >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 494 insertions(+), 1 deletion(-)
> >  create mode 100644 src/bin/xlogdump/Makefile
> >  create mode 100644 src/bin/xlogdump/xlogdump.c
>
> Is this intended to be the successor of
> https://github.com/snaga/xlogdump which will then be deprecated?

As-is, this is just a development tool which was sorely needed for the development of this patchset. But yes, I think that once ready (xlogreader infrastructure, *_desc routines split out) it should definitely be able to do most of what the above xlogdump can do, and it should live in bin/. I think mostly some filtering is missing.

That doesn't really "deprecate" the above, though.

Does that answer your question?

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Heikki Linnakangas wrote:

> I'm disappointed to see that there has been no progress on this
> patch since the last commitfest. I thought we agreed on the approach I
> championed here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> There wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

Here's what I have right now. It's your patch, plus some tweaks such as changing the timing for allocating readRecordBuf; I also added a struct to contain XLogReadPage's private data, instead of using global variables. (The main conclusion I get from this is that it's relatively easy to split reading of XLog out of xlog.c; there are some global variables still remaining, but AFAICS that should be relatively simple to fix.)

There is no consensus on the way to handle error reporting. Tom suggests having the hypothetical client-side code redefine ereport() somehow; as far as I can see that means we would have to reimplement errstart, errfinish, etc. That doesn't sound all that nice to me.

--
Álvaro Herrera                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature globally.
Sorry, I recall that you put a link somewhere but I cannot remember which email it was in...

On Thu, Nov 15, 2012 at 6:34 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:
> > Looks like cool stuff @-@
> > I might be interested in looking at that a bit, as I think I will hopefully
> > be able to grab some time in the next couple of weeks.
> > Are some of those patches already submitted to a CF?
>
> I added the patchset as one entry to the CF this time; it seems to me they are
> too hard to judge individually to make them really separately reviewable.
>
> I can split it off there, but really all the complicated stuff is in one patch
> anyway...
>
> Greetings,
>
> Andres

--
Michael Paquier
http://michael.otacoo.com
Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
From: Michael Paquier
Hi,

This patch looks OK.

I got 3 comments:
1) Why change the OID of pg_class_tblspc_relfilenode_index from 3171 to 3455? It does not look necessary.
2) You should perhaps change the header of RelationMapFilenodeToOid so as not to mention it as the opposite operation of RelationMapOidToFilenode, but as an operation that looks for the OID of a relation based on its relfilenode. Both functions are opposite but independent.
3) Both functions are doing similar operations. Could it be possible to wrap them in the same central function?

On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> ---
>  src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
>  src/include/catalog/indexing.h      |  4 +--
>  src/include/utils/relmapper.h       |  2 ++
>  3 files changed, 57 insertions(+), 2 deletions(-)

--
Michael Paquier
http://michael.otacoo.com
On 16/11/2012 05:34, Michael Paquier wrote:
> Do you have a git repository or something where all the 14 patches are
> applied? I would like to test the feature globally.
> Sorry, I recall that you put a link somewhere but I cannot remember which
> email it was in...

http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php

Andrea
Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
From: Andres Freund
Hi,

On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> This patch looks OK.
>
> I got 3 comments:
> 1) Why change the OID of pg_class_tblspc_relfilenode_index from 3171 to
> 3455? It does not look necessary.

It's a mismerge and should have happened in "Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)", but it seems I squashed the wrong two commits. I had originally used 3171, but that since got used up for lo_tell64...

> 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> not to mention it as the opposite operation of RelationMapOidToFilenode,
> but as an operation that looks for the OID of a relation based on its
> relfilenode. Both functions are opposite but independent.

I described it as the opposite because RelationMapOidToFilenode is the relmapper's stated goal and RelationMapFilenodeToOid is just some side business.

> 3) Both functions are doing similar operations. Could it be possible
> to wrap them in the same central function?

I don't really see how, without making both quite a bit more complicated. The number of ifs needed seems too large to me.

Thanks,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Markus Wanner
Andres, On 11/15/2012 01:27 AM, Andres Freund wrote: > In response to this you will soon find the 14 patches that currently > implement $subject. Congratulations on that piece of work. I'd like to provide a comparison of the proposed change set format to the one used in Postgres-R. I hope for this comparison to shed some light on the similarities and differences of the two projects. As the author of Postgres-R, I'm obviously biased, but I try to be as neutral as I can. Let's start with the representation: I so far considered the Postgres-R change set format to be an implementation detail and I don't intend it to be readable by humans or third party tools. It's thus binary only and doesn't offer a textual representation. The approach presented here seems to target different formats for different audiences, including binary representations. More general, less specific. Next, contents: this proposal is more verbose. In the textual representation shown, it provides (usually redundant) information about attribute names and types. Postgres-R doesn't ever transmit attribute name or type information for INSERT, UPDATE or DELETE operations. Instead, it relies on attribute numbers and pg_attributes being at some known consistent state. Let's compare by example: > table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1 > table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1 > table "replication_example": DELETE (pkey): id[int4]:1 In Postgres-R, the change sets for these same operations would carry the following information, in a binary representation: > table "replication_example": INSERT: VALUES (1, 1, '1') > table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1) > table "replication_example": DELETE: PKEY(1) COID(78) You may have noticed that there's an additional COID field. This is an identifier for the transaction that last changed this tuple. Together with the primary key, it effectively identifies the exact version of a tuple (during its lifetime, for example before vs after an UPDATE). This in turn is used by Postgres-R to detect conflicts. It may be possible to add that to the proposed format as well, for it to be able to implement a Postgres-R-like algorithm. To finish off this comparison, let's take a look at how and where the change sets are generated: in Postgres-R the change set stream is constructed directly from the heap modification routines, i.e. in heapam.c's heap_{insert,update,delete}() methods. Where as the patches proposed here parse the WAL to reconstruct the modifications and add the required meta information. To me, going via the WAL first sounded like a step that unnecessarily complicates matters. I recently talked to Andres and brought that up. Here's my current view of things: The Postgres-R approach is independent of WAL and its format, where as the approach proposed here clearly is not. Either way, there is a certain overhead - however minimal it is - which the former adds to the transaction processing itself, while the later postpones it to a separate XLogReader process. If there's any noticeable difference, it might reduce latency in case of asynchronous replication, but can only increase latency in the synchronous case. As far as I understood Andres, it was easier to collect the additional meta data from within the separate process. 
In summary, I'd say that Postgres-R is an approach specifically targeting and optimized for multi-master replication between Postgres nodes, whereas the proposed patches are kept more general. I hope you found this to be an insightful and fair comparison. Regards Markus Wanner
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Andres Freund
Date:
Hi Markus, On 2012-11-16 14:46:39 +0100, Markus Wanner wrote: > On 11/15/2012 01:27 AM, Andres Freund wrote: > > In response to this you will soon find the 14 patches that currently > > implement $subject. > > Congratulations on that piece of work. Thanks. > I'd like to provide a comparison of the proposed change set format to > the one used in Postgres-R. Uh, sorry to interrupt you right here, but that's not the "proposed format" ;) That's just an example output plugin that people wished for. For the use-case we're after, we (as in 2ndQ) also want to use binary data. It's also rather useful for debugging and such. I generally agree that the presented format is too verbose for actual replication, but it seems fine enough for showing off ;) If you look at patch 12/14, "Add a simple decoding module in contrib named 'test_decoding'", you can see that adding a different output format should be pretty straightforward. Which output plugin is used is determined by the initial INIT_LOGICAL_REPLICATION '$plugin'; command in a replication connection. > To finish off this comparison, let's take a look at how and where the > change sets are generated: in Postgres-R the change set stream is > constructed directly from the heap modification routines, i.e. in > heapam.c's heap_{insert,update,delete}() methods, whereas the patches > proposed here parse the WAL to reconstruct the modifications and add the > required meta information. > > To me, going via the WAL first sounded like a step that unnecessarily > complicates matters. I recently talked to Andres and brought that up. > Here's my current view of things: > > The Postgres-R approach is independent of WAL and its format, whereas > the approach proposed here clearly is not. Either way, there is a > certain overhead - however minimal it is - which the former adds to the > transaction processing itself, while the latter postpones it to a > separate XLogReader process. If there's any noticeable difference, it > might reduce latency in the case of asynchronous replication, but can only > increase latency in the synchronous case. As far as I understood Andres, > it was easier to collect the additional meta data from within the > separate process. There also is the point that if you do the processing inside heap_* you need to make sure the replication-targeted data is safely received & fsynced away; in "our" case that's not necessary, as WAL already provides crash safety, so should the replication connection break you can simply start from the location last confirmed as being safely sent. As we want to provide asynchronous replication, that's a rather major point. > In summary, I'd say that Postgres-R is an approach specifically > targeting and optimized for multi-master replication between Postgres > nodes, whereas the proposed patches are kept more general. One major aim definitely was to optionally be able to replicate into just about any target system, so yes, I certainly agree. > I hope you found this to be an insightful and fair comparison. Yes, input in general, and especially from other replication providers, is certainly interesting and important! Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
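To illustrate how small such an output plugin can stay, here is a C sketch of the begin/change/commit callback shape described in this thread; every type and function name below is invented for the sketch, and the real interface of this patchset lives in patch 12/14's test_decoding:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical callback table; a stand-in for the patchset's real one */
typedef struct DecodingCallbacks
{
    void (*begin_txn)(uint32_t xid);
    void (*change)(const char *table, const char *desc);
    void (*commit_txn)(uint32_t xid);
} DecodingCallbacks;

/* A CSV-ish plugin: the same decoded stream, a different surface format */
static void csv_begin(uint32_t xid) { printf("BEGIN,%u\n", xid); }
static void csv_change(const char *table, const char *desc)
{
    printf("CHANGE,%s,%s\n", table, desc);
}
static void csv_commit(uint32_t xid) { printf("COMMIT,%u\n", xid); }

/* What the walsender would invoke after INIT_LOGICAL_REPLICATION 'csv' */
void
csv_plugin_init(DecodingCallbacks *cb)
{
    cb->begin_txn = csv_begin;
    cb->change = csv_change;
    cb->commit_txn = csv_commit;
}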
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Andres Freund
Date:
Hi, On 2012-11-16 14:46:39 +0100, Markus Wanner wrote: > You may have noticed that there's an additional COID field. This is an > identifier for the transaction that last changed this tuple. Together > with the primary key, it effectively identifies the exact version of a > tuple (during its lifetime, for example before vs after an UPDATE). This > in turn is used by Postgres-R to detect conflicts. What's the data type of the "COID" in -R? In the patchset the output plugin has enough data to get the old xid and the new xid in the case of updates (not in the case of deletes, but that's a small bug and should be fixable with a single line of code), and it has enough information to extract the primary key without problems. I wonder whether we also should track the xid epoch... > It may be possible to add that to the proposed format as well, for it to > be able to implement a Postgres-R-like algorithm. I don't know the exact Postgres-R algorithm (but I queued reading some papers you referred to when we talked), but I guess what we have in mind is roughly similar - it's just not even remotely part of this patchset ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
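For readers wondering what tracking the xid epoch would buy: a 32-bit xid on its own wraps around, but paired with its epoch it becomes unique for the lifetime of the cluster. A minimal sketch (the helper name is invented here and is not part of the patchset):

#include <stdint.h>

typedef uint32_t TransactionId;

/* Combine the wrap-around xid with its epoch into a 64-bit value that
 * increases monotonically and is never reused. */
static inline uint64_t
xid_with_epoch(uint32_t epoch, TransactionId xid)
{
    return ((uint64_t) epoch << 32) | (uint64_t) xid;
}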
Re: [PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
From
Steve Singer
Date:
On 12-11-14 08:17 PM, Andres Freund wrote:
> For the regular satisfies routines this is needed in preparation of logical decoding. I changed the non-regular ones for consistency as well. The naming between htup, tuple and similar is rather confused, I could not find any consistent naming anywhere. This is preparatory work for the logical decoding feature, which needs to be able to get to a valid relfilenode when checking the visibility of a tuple.
I have taken a look at this patch. The patch does what it says: it changes a bunch of HeapTupleSatisfiesXXXX routines to take a HeapTuple structure instead of a HeapTupleHeader. It also sets the HeapTuple.t_tableOid value before calling these routines.
The patch does not modify these routines to actually do anything useful with the additional data fields, though it does add some assertions to make sure t_tableOid is set.
The patch compiles cleanly and the unit tests pass.
This patch does not seem to depend on any of the other patches in this set and applies cleanly against master. The patch doesn't actually add any functionality; unless someone sees a reason for complaining about this that I don't see, I think it can be committed.
Steve
> contrib/pgrowlocks/pgrowlocks.c | 2 +-
> src/backend/access/heap/heapam.c | 13 ++++++----
> src/backend/access/heap/pruneheap.c | 16 ++++++++++--
> src/backend/catalog/index.c | 2 +-
> src/backend/commands/analyze.c | 3 ++-
> src/backend/commands/cluster.c | 2 +-
> src/backend/commands/vacuumlazy.c | 3 ++-
> src/backend/storage/lmgr/predicate.c | 2 +-
> src/backend/utils/time/tqual.c | 50 +++++++++++++++++++++++++++++-------
> src/include/utils/snapshot.h | 4 +--
> src/include/utils/tqual.h | 20 +++++++--------
> 11 files changed, 83 insertions(+), 34 deletions(-)
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Markus Wanner
Date:
On 11/16/2012 03:05 PM, Andres Freund wrote: >> I'd like to provide a comparison of the proposed change set format to >> the one used in Postgres-R. > > Uh, sorry to interrupt you right here, but that's not the "proposed > format" ;) Understood. Sorry, I didn't mean to imply that. It's pretty obvious to me that this is more of a human readable format and that others, including binary formats, can be implemented. I apologize for the bad wording of a "proposed format", which doesn't make that clear. >> The Postgres-R approach is independent of WAL and its format, whereas >> the approach proposed here clearly is not. Either way, there is a >> certain overhead - however minimal it is - which the former adds to the >> transaction processing itself, while the latter postpones it to a >> separate XLogReader process. If there's any noticeable difference, it >> might reduce latency in the case of asynchronous replication, but can only >> increase latency in the synchronous case. As far as I understood Andres, >> it was easier to collect the additional meta data from within the >> separate process. > > There also is the point that if you do the processing inside heap_* you > need to make sure the replication-targeted data is safely received & > fsynced away; in "our" case that's not necessary, as WAL already provides > crash safety, so should the replication connection break you can simply > start from the location last confirmed as being safely sent. In the case of Postgres-R, the "safely received" part isn't really handled at the change set level at all. And regarding the fsync guarantee: you can well use the WAL to provide that, without basing change set generation on it. In that regard, Postgres-R is probably the more general approach: you can run Postgres-R with WAL turned off entirely - which may well make sense if you take into account the vast amount of cloud resources available, which don't have a BBWC. Instead of WAL, you can add more nodes at more different locations. And no, you don't want your database to ever go down, anyway :-) >> In summary, I'd say that Postgres-R is an approach specifically >> targeting and optimized for multi-master replication between Postgres >> nodes, whereas the proposed patches are kept more general. > > One major aim definitely was to optionally be able to replicate into just > about any target system, so yes, I certainly agree. I'm glad I got that correct ;-) Regards Markus Wanner
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Markus Wanner
Date:
On 11/16/2012 03:14 PM, Andres Freund wrote: > What's the data type of the "COID" in -R? It's short for CommitOrderId, a 32-bit global transaction identifier that wraps around, very much like TransactionIds do. (In that sense, it's global, but unique only for a certain amount of time.) > In the patchset the output plugin has enough data to get the old xid and > the new xid in the case of updates (not in the case of deletes, but > that's a small bug and should be fixable with a single line of code), and > it has enough information to extract the primary key without problems. It's the xmin of the old tuple that Postgres-R needs to get the COID. Regards Markus Wanner
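The "unique only for a certain amount of time" caveat is the usual wrap-around arithmetic: two such counters can only be ordered while they are less than 2^31 apart. A sketch of the comparison, modeled on how PostgreSQL orders TransactionIds; the CommitOrderId function name is hypothetical:

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t CommitOrderId;

/* True if a logically precedes b, assuming they are < 2^31 apart;
 * the subtraction wraps modulo 2^32 and the sign carries the order. */
static bool
CommitOrderIdPrecedes(CommitOrderId a, CommitOrderId b)
{
    return (int32_t) (a - b) < 0;
}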
Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
From
Michael Paquier
Date:
On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
> On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> > This patch looks OK.
> >
> > I got 3 comments:
> > 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
> > 3455? It does not look necessary.
> It's a mismerge and should have happened in "Add a new RELFILENODE
> syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but
> it seems I squashed the wrong two commits.
> I had originally used 3171 but that since got used up for lo_tell64...
> > 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> > not to describe it as the opposite operation of RelationMapOidToFilenode
> > but as an operation that looks up the OID of a relation based on its
> > relfilenode. Both functions are opposite but independent.
> I described it as the opposite because RelationMapOidToFilenode is the
> relmapper's stated goal and RelationMapFilenodeToOid is just some
> side-business.
> > 3) Both functions are doing similar operations. Could it be possible
> > to wrap them in the same central function?
> I don't really see how without making both quite a bit more
> complicated. The amount of ifs needed seems to be too large to me.
OK, thanks for your answers.
As this patch only adds a new function and is not that complicated, I think there is no problem in committing it. The only thing to remove is the diff in indexing.h. Could someone take care of that?
If other people have additional comments on the ability to perform a relfileoid->reloid operation for cached maps, of course go ahead.
Michael Paquier
http://michael.otacoo.com
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Hannu Krosing
Date:
On 11/16/2012 02:46 PM, Markus Wanner wrote: > Andres, > > On 11/15/2012 01:27 AM, Andres Freund wrote: >> In response to this you will soon find the 14 patches that currently >> implement $subject. > Congratulations on that piece of work. > > > I'd like to provide a comparison of the proposed change set format to > the one used in Postgres-R. I hope for this comparison to shed some > light on the similarities and differences of the two projects. As the > author of Postgres-R, I'm obviously biased, but I try to be as neutral > as I can. ... > Let's compare by example: > >> table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1 >> table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1 >> table "replication_example": DELETE (pkey): id[int4]:1 > In Postgres-R, the change sets for these same operations would carry the > following information, in a binary representation: > >> table "replication_example": INSERT: VALUES (1, 1, '1') >> table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1) >> table "replication_example": DELETE: PKEY(1) COID(78) Is it possible to replicate UPDATEs and DELETEs without a primary key in PostgreSQL-R? Hannu
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Markus Wanner
Date:
On 11/17/2012 02:30 PM, Hannu Krosing wrote: > Is it possible to replicate UPDATEs and DELETEs without a primary key in > PostgreSQL-R? No. There must be some way to logically identify the tuple. Note, though, that theoretically any (unconditional) unique key would suffice. In practice, that usually doesn't matter, as you rarely have one or more unique keys without a primary key. Also note that the underlying index is useful for remote application of change sets (except perhaps for very small tables). In some cases, for example for n:m linking tables, you need to add a uniqueness key that spans all columns (as opposed to a simple index on one of the columns that's usually required, anyway). I hope index-only scans will eventually mitigate this issue. Alternatively, I've been thinking about the ability to add a hidden column, which can then be used as a PRIMARY KEY without breaking legacy applications that rely on SELECT * not returning that primary key. Are there other reasons to want tables without primary keys that I'm missing? Regards Markus Wanner
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Hannu Krosing
Date:
On 11/17/2012 03:00 PM, Markus Wanner wrote: > On 11/17/2012 02:30 PM, Hannu Krosing wrote: >> Is it possible to replicate UPDATEs and DELETEs without a primary key in >> PostgreSQL-R? > No. There must be some way to logically identify the tuple. It can be done by selecting on _all_ attributes and updating/deleting just the first matching row:
create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...
This is on the distant (round 3 or 4) roadmap for this work; I just was interested if you had found any better way of doing this :) Hannu > Note, though, that theoretically any (unconditional) unique key would suffice. > In practice, that usually doesn't matter, as you rarely have one or more > unique keys without a primary key. > > Also note that the underlying index is useful for remote application of > change sets (except perhaps for very small tables). > > In some cases, for example for n:m linking tables, you need to add a > uniqueness key that spans all columns (as opposed to a simple index on > one of the columns that's usually required, anyway). I hope > index-only scans will eventually mitigate this issue. > > Alternatively, I've been thinking about the ability to add a hidden > column, which can then be used as a PRIMARY KEY without breaking legacy > applications that rely on SELECT * not returning that primary key. > > Are there other reasons to want tables without primary keys that I'm > missing? > > Regards > > Markus Wanner
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Hannu Krosing
Date:
On 11/17/2012 03:00 PM, Markus Wanner wrote: > On 11/17/2012 02:30 PM, Hannu Krosing wrote: >> Is it possible to replicate UPDATEs and DELETEs without a primary key in >> PostgreSQL-R? > No. There must be some way to logically identify the tuple. Note, > though, that theoretically any (unconditional) unique key would suffice. > In practice, that usually doesn't matter, as you rarely have one or more > unique keys without a primary key. ... > Are there other reasons to want tables without primary keys that I'm > missing? Nope. The only place a table without a primary key would be needed is a log table, but as these are (supposed to be) INSERT-only this is not a problem for them. Hannu
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From
Markus Wanner
Date:
Hannu, On 11/17/2012 03:40 PM, Hannu Krosing wrote: > On 11/17/2012 03:00 PM, Markus Wanner wrote: >> On 11/17/2012 02:30 PM, Hannu Krosing wrote: >>> Is it possible to replicate UPDATEs and DELETEs without a primary key in >>> PostgreSQL-R? >> No. There must be some way to logically identify the tuple. > It can be done by selecting on _all_ attributes and updating/deleting > just the first matching row: > > create cursor ... > select from t ... where t.* = (....) > fetch one ... > delete where current of ... That doesn't sound like it could possibly work for Postgres-R. At least not when there can be multiple rows with all the same attributes, i.e. without a unique key constraint over all columns. Otherwise, some nodes could detect two concurrent UPDATEs as a conflict, while other nodes select different rows and don't handle it as a conflict. Regards Markus Wanner
Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
From
Andres Freund
Date:
On 2012-11-17 19:14:06 +0900, Michael Paquier wrote: > On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > > Hi, > > > > On 2012-11-16 13:44:45 +0900, Michael Paquier wrote: > > > This patch looks OK. > > > > > > I got 3 comments: > > > 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to > > > 3455? It does not look necessary. > > > > It's a mismerge and should have happened in "Add a new RELFILENODE > > syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but > > it seems I squashed the wrong two commits. > > I had originally used 3171 but that since got used up for lo_tell64... > > > > > 2) You should perhaps change the header of RelationMapFilenodeToOid so as > > > not to describe it as the opposite operation of RelationMapOidToFilenode > > > but as an operation that looks up the OID of a relation based on its > > > relfilenode. Both functions are opposite but independent. > > > > I described it as the opposite because RelationMapOidToFilenode is the > > relmapper's stated goal and RelationMapFilenodeToOid is just some > > side-business. > > > > > 3) Both functions are doing similar operations. Could it be possible > > > to wrap them in the same central function? > > > > I don't really see how without making both quite a bit more > > complicated. The amount of ifs needed seems to be too large to me. > > > OK, thanks for your answers. > As this patch only adds a new function and is not that complicated, I > think there is no problem in committing it. The only thing to remove is the > diff in indexing.h. Could someone take care of that? I pushed a rebase to the git repository that fixed it... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 16, 2012 at 5:16 PM, Andrea Suisani <sickpig@opinioni.net> wrote:
> On 16/11/2012 05:34, Michael Paquier wrote:
> > Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature globally.
> > Sorry, I recall that you put a link somewhere but I cannot remember its email...
> http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php
Thanks Andrea. I am pretty sure I will be able to provide some feedback by Friday.
Michael Paquier
http://michael.otacoo.com
Re: [PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
From
Michael Paquier
Date:
Hi,
I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see patch attached.
Thanks,
--
Michael Paquier
http://michael.otacoo.com
On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
---
src/bin/pg_basebackup/Makefile | 7 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
src/bin/pg_basebackup/streamutil.c | 3 +-
src/bin/pg_basebackup/streamutil.h | 1 +
4 files changed, 725 insertions(+), 3 deletions(-)
create mode 100644 src/bin/pg_basebackup/pg_receivellog.c
Hi Andres,
I have been able to fetch your code (thanks Andrea!) and compile it. For the time being I am spending some time reading the code and understanding the whole set of features you are trying to implement inside core, even if I got some background from what you presented at PGCon and from the hackers ML.
Btw, as a first approach, I tried to run the logical log receiver plugged into a postgres server, and I am not able to make it work.
Well, I am using settings similar to yours.
# Run master
rm -r ~/bin/pgsql/master/
initdb -D ~/bin/pgsql/master/
echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
postgres -D ~/bin/pgsql/master \
-c wal_level=logical \
-c max_wal_senders=10 \
-c max_logical_slots=10 \
-c wal_keep_segments=100 \
-c log_line_prefix="[%p %x] "
# Logical log receiver
pg_receivellog -f $HOME/output.txt -d postgres -v
After launching some SQLs, the logical receiver is stuck just after sending INIT_LOGICAL_REPLICATION; please see the backtrace of the waiting process:
(gdb) bt
#0 0x00007f1bbc13b170 in __poll_nocancel () from /usr/lib/libc.so.6
#1 0x00007f1bbc43072d in pqSocketPoll (sock=3, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1089
#2 0x00007f1bbc43060d in pqSocketCheck (conn=0x1dd0290, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1031
#3 0x00007f1bbc4304dd in pqWaitTimed (forRead=1, forWrite=0, conn=0x1dd0290, finish_time=-1) at fe-misc.c:963
#4 0x00007f1bbc4304af in pqWait (forRead=1, forWrite=0, conn=0x1dd0290) at fe-misc.c:946
#5 0x00007f1bbc42c64c in PQgetResult (conn=0x1dd0290) at fe-exec.c:1709
#6 0x00007f1bbc42cd62 in PQexecFinish (conn=0x1dd0290) at fe-exec.c:1974
#7 0x00007f1bbc42c9c8 in PQexec (conn=0x1dd0290, query=0x406c60 "INIT_LOGICAL_REPLICATION 'test_decoding'") at fe-exec.c:1808
#8 0x0000000000402370 in StreamLog () at pg_receivellog.c:263
#9 0x00000000004030c9 in main (argc=6, argv=0x7fff44edb288) at pg_receivellog.c:694
So I am not able to output any results using pg_receivellog.
Also, I noticed 2 errors in your set of tests.
On Thu, Nov 15, 2012 at 9:27 AM, Andres Freund <andres@anarazel.de> wrote:
> -- wrapped in a transaction
> BEGIN;
> INSERT INTO replication_example(somedata, text) VALUES (1, 1);
> UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
In the SET clause, the column name is somedata and not somedate.
> -- dont write out aborted data
> BEGIN;
> INSERT INTO replication_example(somedata, text) VALUES (2, 1);
> UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
Same error here: somedata, not somedate. Not a big deal, though it made the transactions fail.
Michael Paquier
http://michael.otacoo.com
Hi Michael, On 2012-11-19 16:28:55 +0900, Michael Paquier wrote: > I have been able to fetch your code (thanks Andrea!) and compile it. For the > time being I am spending some time reading the code and understanding the > whole set of features you are trying to implement inside core, even if I > got some background from what you presented at PGCon and from the hackers > ML. Cool. > Btw, as a first approach, I tried to run the logical log receiver plugged > into a postgres server, and I am not able to make it work. > Well, I am using settings similar to yours. > # Run master > rm -r ~/bin/pgsql/master/ > initdb -D ~/bin/pgsql/master/ > echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf > postgres -D ~/bin/pgsql/master \ > -c wal_level=logical \ > -c max_wal_senders=10 \ > -c max_logical_slots=10 \ > -c wal_keep_segments=100 \ > -c log_line_prefix="[%p %x] " > # Logical log receiver > pg_receivellog -f $HOME/output.txt -d postgres -v > > After launching some SQLs, the logical receiver is stuck just after sending > INIT_LOGICAL_REPLICATION; please see the backtrace of the waiting process: It's waiting till it sees an initial xl_running_xacts record. The easiest way to get one is to manually issue a checkpoint. Sorry, should have included that in the description. Otherwise you can wait till the next routine checkpoint comes around... I plan to cause more xl_running_xacts records to be logged in the future. I think the timing of those currently is non-optimal; you have the same problem as here in normal streaming replication as well :( > > -- wrapped in a transaction > > BEGIN; > > INSERT INTO replication_example(somedata, text) VALUES (1, 1); > > UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT > > currval('replication_example_id_seq')); > > > In the SET clause, the column name is *somedata* and not *somedate* Crap. Sorry for that. I wrote the example in the mail client and then executed it, and I seem to have forgotten to put back some of the fixes... > I am just looking at this patch and will provide some comments. > By the way, you forgot the installation part of pg_receivellog, please see > patch attached. That actually was somewhat intended: I thought people wouldn't like the name and I didn't want a binary that's going to be replaced anyway lying around ;) Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
First, you can add me to the list of people saying 'wow', I'm impressed. The approach I am taking to reviewing this is to try and answer the following questions: 1) How might a future version of slony be able to use logical replication as described by your patch and design documents, and what would that look like? 2) What functionality is missing from the patch set that would stop me from implementing or prototyping the above? Connecting slon to the remote postgresql ======================== Today the slony remote listener thread queries a bunch of events from sl_event for a batch of SYNC events. Then the remote helper thread queries data from sl_log_1 and sl_log_2. I see this changing: instead, the slony remote listener thread would connect to the remote system and get a logical replication stream. 1) Would slony connect as a normal client connection and call something like 'select pg_slony_process_xlog(...)' to get a bunch of logical replication change records to process? OR 2) Would slony connect as a replication connection, similar to how the pg_receivelog program does today, and then process the logical changeset outputs instead of writing them to a file (as pg_receivelog does)? It seems that the second approach is what is encouraged. I think we would put a lot of the pg_receivelog functionality into slon and it would issue a command like INIT_LOGICAL_REPLICATION 'slony' to use the slony logical replication plugin. Slon would also have to provide feedback to the walsender about what it has processed so the origin database knows what catalog snapshots can be expired. Based on eyeballing pg_receivelog.c it seems that about half the code in the 700-line file is related to command line arguments etc, and the other half is related to looping over the copy out stream, sending feedback and other things that we would need to duplicate in slon. pg_receivelog.c has this comment:
/*
 * We have to use postgres.h not postgres_fe.h here, because there's so much
 * backend-only stuff in the XLOG include files we need. But we need a
 * frontend-ish environment otherwise. Hence this ugly hack.
 */
This looks like more of a carryover from pg_receivexlog.c. From what I can tell we can eliminate the postgres.h include if we also eliminate the utils/datetime.h and utils/timestamp.h includes and instead add in:
#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0
If there is a better way of getting these defines someone should speak up. I recall that in the past slon actually did include postgres.h and it caused some issues (I think with MSVC win32 builds). Since pg_receivelog.c will be used as a starting point/sample for third parties to write client programs, it would be better if it didn't encourage client programs to include postgres.h. The Slony Output Plugin ===================== Once we've modified slon to connect as a logical replication client we will need to write a slony plugin. As I understand the plugin API: * A walsender is processing through WAL records, and each time it sees a COMMIT WAL record it will call my plugin's .begin, .change (for each change in the transaction) and .commit callbacks. * The plugin for a particular stream/replication client will see one transaction at a time, passed to it in commit order. It won't see .change(t1) followed by .change(t2), followed by a second .change(t1).
The reorder buffer code hides all that complexity from me (yay). From a slony point of view I think the output of the plugin will be rows, suitable to be passed to COPY IN, of the form: origin_id, table_namespace, table_name, command_type, cmd_updatencols, command_args. This is basically the Slony 2.2 sl_log format minus a few columns we no longer need (txid, actionseq). command_args is a postgresql text array of column=value pairs, i.e. [{id=1},{name='steve'},{project='slony'}]. I don't think our output plugin will be much more complicated than the test_decoding plugin. I suspect we will want to give it the ability to filter out non-replicated tables. We will also have to filter out change records that didn't originate on the local node and aren't part of a cascaded subscription. Remember that in a two node cluster slony will have connections from A-->B and from B--->A even if user tables only flow one way. Data that is replicated from A into B will show up in the WAL stream for B. Exactly how we do this filtering is an open question; I think the output plugin will at a minimum need to know: a) What the slony node id is of the node it is running on. This is easy to figure out if the output plugin is able/allowed to query its database. Will this be possible? I would expect to be able to query the database as it exists now (at plugin invocation time), not as it existed in the past when the WAL was generated. In addition to the node ID I can see us wanting to be able to query other slony tables (sl_table, sl_set etc...) b) What the slony node id is of the node we are streaming to. It would be nice if we could pass extra, arbitrary data/parameters to the output plugins that could include that, or other things. At the moment the start_logical_replication rule in repl_gram.y doesn't allow for that, but I don't see why we couldn't make it do so. I still see some open questions about exactly how we would filter out data in this stage. <editorial> Everything above deals with the postgresql side of things, i.e. the patch in question or the plugin API we would have to work with. Much of what is below deals with slony-side changes and might be of limited interest to some on pgsql-hackers. </editorial> Slon Applying Changes ================ The next task we will have is to make slon and the replica instance be able to apply these changes. In slony 2.2 we do a COPY from sl_log and apply that stream to a table on the replica with COPY. We then have triggers on the replica that decode the command_args and apply the changes as INSERT/UPDATE/DELETE statements on the user tables. I see this continuing to work in this fashion, but there are a few special cases: 1) Changes made to sl_event on the origin will result in records in the logical replication stream that change sl_event. In many cases we won't just be inserting records into sl_event; we will instead need to do the logic in remote_worker.c for processing the different types of events. Worst case, we could parse the change records we receive from our version of pg_receivellog and split the sl_event records out into an sl_event stream and an sl_log stream. Another approach might be to have the slony apply trigger build up a list of events that the slon remote_worker code can then process through. 2) Slony is normally bi-directional even if user data only replicates one way. Confirm (sl_confirm) entries go from a replica back to an origin.
In a two node origin->replica scenario for data, the way I see this working is that the slon for the origin would connect to the replica (like it does today). It would receive the logical replication records, but since it isn't subscribed to any tables it won't receive/process the WAL for user tables; it will still receive/process sl_confirm rows. It will then insert the rows in sl_confirm that it 'replicated' from the remote node. With what I have described so far, Slony would then be receiving a stream of events that look like t1-insert into foo [id=1, name='steve'] t1-insert into bar [id=1, something='somethingelse'] t1-commit t2-insert into foo [....] t2-commit t3-insert into sl_event [ev_type=SYNC, ev_origin=1, ev_seqno=12345] t3-commit Even though, from a data-correctness point of view, slony could commit the transaction on the replica after it sees the t1 commit, we won't want it to do commits other than on a SYNC boundary. This means that the replicas will continue to move between consistent SYNC snapshots and that we can still track the state/progress of replication by knowing what events (SYNC or otherwise) have been confirmed. This also means that slony should only provide feedback to the walsender on SYNC boundaries, after the transaction has committed on the receiver. I don't see this as being an issue. Setting up Subscriptions =================== At first we have a slon cluster with just 1 node, life is good. When a second node is created and a path (or pair of paths) is defined between the nodes, I think they will each: 1. Connect to the remote node with a normal libpq connection. a. Get the current xlog recptr, b. Query any non-sync events of interest from sl_event. 2. Connect to the remote node with a logical replication connection and start streaming logical replication changes starting at the recptr we retrieved above. Slon will then receive any future events from the remote sl_event as part of the logical replication stream. It won't receive any user tables because it isn't yet subscribed to any. When a subscription is started, the SUBSCRIBE_SET and ENABLE_SUBSCRIPTION events will go through sl_event, and the INSERT INTO sl_event will be part of a change record in the replication stream and be picked up by the subscriber's slon remote_worker. The remote_worker:copy_set will then need to get a consistent COPY of the tables in the replication set such that any changes made to the tables after the copy is started get included in the replication stream. The approach proposed in the DESIGN.TXT file, with exporting a snapshot, sounds okay for this. I *think* slony could get by with something less fancy as well, but it would be ugly. 1. Make sure that the origin starts including change records for the tables in the set 2. Have the slon (copy_set) wait until any transactions on the origin that started prior to the ENABLE_SUBSCRIPTION are committed. Slony does this today as part of the copy_set logic. 3. Get/remember the snapshot visibility information for the COPY's transaction 4. When we start to process change records we need to filter out records for transactions that were already visible by the copy. Steps 1-3 are similar to how slony works today, but step 4 will be a bit awkward/ugly. This isn't an issue today because we are already using the transaction visibility information for selecting from sl_log so it works, but above I had proposed stripping the xid from the logical change records.
Cascading Replication ================= A-->B--->C The slon for B will insert records from A into B's tables. This insert will generate WAL records on B. The slon for C should be able to pull the data it needs (both sl_event entries with ev_origin=A, and user table data originating on A) from B's logical replication stream. I don't see any issues here, nor do I see a need to 'cache' the data in an sl_log type of table on B. Reshaping Replication ================= In Slony replication is reshaped by two types of events, a MOVE SET and a FAILOVER. Move Set: A replication set might be subscribed in a cascaded fashion like A--->B--->C When a MOVE SET is issued, node A will stop accepting new write transactions for tables in the set. A MOVE_SET(1,A,B) event is then put into sl_event on node A. Node B receives the MOVE_SET command in the proper order, after it has processed the last SYNC generated on A when A was still accepting write transactions to those tables. When Node B processes the MOVE_SET event, node B starts accepting write transactions on the tables. Node B will also generate an ACCEPT_SET event. Node C will then receive the MOVE SET (ev_origin=A) and the ACCEPT_SET (ev_origin=B) commands (after all SYNC events from A with data changes to the set) and then knows that it should start getting data for those tables from B. I don't see any of this changing with logical replication acting as the data source. FAILOVER: --------------- A---->B | . v . C Today with slony, if B is a valid failover target then it is a forwarding node of the set. This means that B keeps a record in sl_log of any changes originating on A until B knows that node C has received those changes. In the event of a failover, if node C is far behind, it can just get the missing data from sl_log on node B (the failover target/new origin). I see a problem with what I have discussed above: B won't explicitly store the data from A in sl_log; a cascaded node would depend on B's WAL stream. The problem is that at FAILOVER time, B might have processed some changes from A. Node C might also be processing Node B's WAL stream for events (or data from another set). Node C will discard/not receive the data for A's tables since it isn't subscribed to those tables from B. What happens then if at some later point B and C receive the FAILOVER event? What does node C do? It can't get the missing data from node A because node A has failed, and it can't get it from node B because node C has already processed the WAL changes from node B that included the data but it ignored/discarded it. Maybe node C could reprocess older WAL from node B? Maybe this forces us to keep an sl_log type structure around? Is it complete enough to build a prototype? ========================== I think so; the incomplete areas I see are the ones mentioned in the patch submission, including: * Snapshot exporting for the initial COPY * Spilling the reorder buffer to disk I think it would be possible to build a prototype without those, even though we'd need them before I could build a production system. Conclusions ============= I like this design much better than the original design from the spring that would have required keeping a catalog proxy on the decoding machine. Based on what I've seen it should be possible to make slony use logical replication as a source for events instead of triggers populating sl_log.
My thinking is that we want a way for logreceiver programs to pass arguments/parameters to the output plugins. Beyond that this looks like something slony can use.
Hi Steve! On 2012-11-17 22:50:35 -0500, Steve Singer wrote: > First, you can add me to the list of people saying 'wow', I'm impressed. Thanks! > The approach I am taking to reviewing this is to try and answer the following > questions: > > 1) How might a future version of slony be able to use logical replication as > described by your patch and design documents, > and what would that look like? > > 2) What functionality is missing from the patch set that would stop me from > implementing or prototyping the above? Sounds like a good plan to me. > > Connecting slon to the remote postgresql > ======================== > > Today the slony remote listener thread queries a bunch of events from > sl_event for a batch of SYNC events. Then the remote helper thread queries > data from sl_log_1 and sl_log_2. I see this changing: instead, the slony > remote listener thread would connect to the remote system and get a logical > replication stream. > > 1) Would slony connect as a normal client connection and call something > like 'select pg_slony_process_xlog(...)' to get a bunch of logical replication > change records to process? > OR > 2) Would slony connect as a replication connection, similar to how the > pg_receivelog program does today, and then process the logical changeset > outputs instead of writing them to a file (as pg_receivelog does)? It would need to be the latter. We need the feedback messages it sends for several purposes: - increasing the lowered xmin - implementing optionally synchronous replication at some point - using 1) would mean having transactions open... > It seems that the second approach is what is encouraged. I think we would > put a lot of the pg_receivelog functionality into slon and it would issue a > command like INIT_LOGICAL_REPLICATION 'slony' to use the slony logical > replication plugin. Slon would also have to provide feedback to the > walsender about what it has processed so the origin database knows what > catalog snapshots can be expired. Based on eyeballing pg_receivelog.c it > seems that about half the code in the 700-line file is related to command line > arguments etc, and the other half is related to looping over the copy out > stream, sending feedback and other things that we would need to duplicate in > slon. I think we should provide some glue code to do this, otherwise people will start replicating all the bugs I hacked into this... More seriously: I think we should have support code here, no user will want to learn the intricacies of feedback messages and such. Where that would live? No idea. > pg_receivelog.c has this comment: (it's pg_receivellog btw ;))
> > /* > * We have to use postgres.h not postgres_fe.h here, because there's so much > * backend-only stuff in the XLOG include files we need. But we need a > * frontend-ish environment otherwise. Hence this ugly hack. > */ > > This looks like more of a carryover from pg_receivexlog.c. From what I can > tell we can eliminate the postgres.h include if we also eliminate the > utils/datetime.h and utils/timestamp.h includes and instead add in: > > #include "postgres_fe.h" > #define POSTGRES_EPOCH_JDATE 2451545 > #define UNIX_EPOCH_JDATE 2440588 > #define SECS_PER_DAY 86400 > #define USECS_PER_SEC INT64CONST(1000000) > typedef int64 XLogRecPtr; > #define InvalidXLogRecPtr 0 > > If there is a better way of getting these defines someone should speak up. > I recall that in the past slon actually did include postgres.h and it caused > some issues (I think with MSVC win32 builds). > Since pg_receivelog.c will be > used as a starting point/sample for third parties to write client programs > it would be better if it didn't encourage client programs to include > postgres.h I wholeheartedly agree. It should also be cleaned up a fair bit before others copy it, should we not go for having some client side library. Imo the library could very roughly be something like:
state = SetupStreamingLLog(replication-slot, ...);
while ((message = StreamingLLogNextMessage(state)))
{
    write(outfd, message->data, message->length);
    if (received_100_messages)
    {
        fsync(outfd);
        StreamingLLogConfirm(message);
    }
}
Although I guess that's not good enough because StreamingLLogNextMessage would be blocking, but that shouldn't be too hard to work around. > The Slony Output Plugin > ===================== > > Once we've modified slon to connect as a logical replication client we will > need to write a slony plugin. > > As I understand the plugin API: > * A walsender is processing through WAL records, and each time it sees a COMMIT > WAL record it will call my plugin's > .begin > .change (for each change in the transaction) > .commit > > * The plugin for a particular stream/replication client will see one > transaction at a time, passed to it in commit order. It won't see > .change(t1) followed by .change(t2), followed by a second .change(t1). The > reorder buffer code hides all that complexity from me (yay) Correct. > From a slony point of view I think the output of the plugin will be rows, > suitable to be passed to COPY IN, of the form: > > origin_id, table_namespace, table_name, command_type, > cmd_updatencols, command_args > > This is basically the Slony 2.2 sl_log format minus a few columns we no > longer need (txid, actionseq). > command_args is a postgresql text array of column=value pairs, i.e. [ > {id=1},{name='steve'},{project='slony'}] It seems to me that that makes escaping unnecessarily complicated, but given you already have all the code... ;) > I don't think our output plugin will be much more complicated than the > test_decoding plugin. Good. That's the idea ;). Are you ok with the interface as it is now, or would you like to change something? > I suspect we will want to give it the ability to > filter out non-replicated tables. We will also have to filter out change > records that didn't originate on the local node and aren't part of a > cascaded subscription. Remember that in a two node cluster slony will have > connections from A-->B and from B--->A even if user tables only flow one > way. Data that is replicated from A into B will show up in the WAL stream > for B. Yes. We will also need something like that. If you remember the first prototype we sent to the list, it included the concept of an 'origin_node' in the wal record. I think you actually reviewed that one ;) That was exactly aimed at something like this... Since then my thoughts about what the origin_id looks like have changed a bit: - origin id is internally still represented as an uint32/Oid - never visible outside of wal/system catalogs - externally visible it gets - assigned a uuid - optionally assigned a user defined name - user settable (permissions?)
origin when executing sql: - SET change_origin_uuid = 'uuid'; - SET change_origin_name = 'user-settable-name'; - defaults to the local node - decoding callbacks get passed the origin of a change - txn->{origin_uuid, origin_name, origin_internal?} - the init decoding callback can set up an array of interesting origins, so the others don't even get the ReorderBuffer treatment I have to thank the discussion on -hackers and a march through Prague with Marko here... > Exactly how we do this filtering is an open question, I think the output > plugin will at a minimum need to know: > > a) What the slony node id is of the node it is running on. This is easy to > figure out if the output plugin is able/allowed to query its database. Will > this be possible? I would expect to be able to query the database as it > exists now (at plugin invocation time), not as it existed in the past when the > WAL was generated. In addition to the node ID I can see us wanting to be > able to query other slony tables (sl_table, sl_set etc...) Hm. There is no fundamental reason not to allow normal database access to the current database, but it won't be all that cheap, so doing it frequently is not a good idea. The reason it's not cheap is that you basically need to tear down the postgres internal caches if you switch the timestream in which you are working. Would go something like: TransactionContext = AllocSetCreate(...); RevertFromDecodingSnapshot(); InvalidateSystemCaches(); StartTransactionCommand(); /* do database work */ CommitTransactionCommand(); /* cleanup memory */ SetupDecodingSnapshot(snapshot, data); InvalidateSystemCaches(); Why do you need to be able to query the present? I thought it might be necessary to allow additional tables to be accessed in a timetraveling manner, but not this way round. I guess an initial round of querying during plugin initialization won't be good enough? > b) What the slony node id is of the node we are streaming to. It would be > nice if we could pass extra, arbitrary data/parameters to the output plugins > that could include that, or other things. At the moment the > start_logical_replication rule in repl_gram.y doesn't allow for that, but I > don't see why we couldn't make it do so. Yes, I think we want something like that. I even asked for input on that recently ;): http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de Input welcome! > Even though, from a data-correctness point of view, slony could commit the > transaction on the replica after it sees the t1 commit, we won't want it to > do commits other than on a SYNC boundary. This means that the replicas will > continue to move between consistent SYNC snapshots and that we can still > track the state/progress of replication by knowing what events (SYNC or > otherwise) have been confirmed. I don't know enough about slony internals, but: why? This will prohibit you from ever doing (per-transaction) synchronous replication... > This also means that slony should only provide feedback to the walsender on > SYNC boundaries, after the transaction has committed on the receiver. I don't > see this as being an issue. Yes, that's no problem. You need to give feedback more frequently (otherwise walsender kicks you off), but you don't have to increase the confirmed flush location. > Setting up Subscriptions > =================== > At first we have a slon cluster with just 1 node, life is good. When a > second node is created and a path (or pair of paths) is defined between the > nodes, I think they will each: > 1. Connect to the remote node with a normal libpq connection.
> a. Get the current xlog recptr, > b. Query any non-sync events of interest from sl_event. > 2. Connect to the remote node with a logical replication connection and > start streaming logical replication changes starting at the recptr we retrieved > above. Note that INIT_LOGICAL_REPLICATION can take some time to get to the initial consistent state (especially if there are long-running transactions). So you should do the init in 1), query all the events in the snapshot that returns, and then go over to 2). > The remote_worker:copy_set will then need to get a consistent COPY of the > tables in the replication set such that any changes made to the tables after > the copy is started get included in the replication stream. The approach > proposed in the DESIGN.TXT file, with exporting a snapshot, sounds okay for > this. I *think* slony could get by with something less fancy as well, but > it would be ugly. The snapshot exporting isn't really that much additional work, as we already need to support most of it for keeping state across restarts. > FAILOVER: > --------------- > A---->B > | . > v . > C > > Today with slony, if B is a valid failover target then it is a forwarding > node of the set. This means that B keeps a record in sl_log of any changes > originating on A until B knows that node C has received those changes. In > the event of a failover, if node C is far behind, it can just get the > missing data from sl_log on node B (the failover target/new origin). > > I see a problem with what I have discussed above: B won't explicitly store > the data from A in sl_log; a cascaded node would depend on B's WAL stream. > The problem is that at FAILOVER time, B might have processed some changes > from A. Node C might also be processing Node B's WAL stream for events (or > data from another set). Node C will discard/not receive the data for A's > tables since it isn't subscribed to those tables from B. What happens then > if at some later point B and C receive the FAILOVER event? > What does node C do? It can't get the missing data from node A because node > A has failed, and it can't get it from node B because node C has already > processed the WAL changes from node B that included the data but it > ignored/discarded it. Maybe node C could reprocess older WAL from node B? > Maybe this forces us to keep an sl_log type structure around? I fear you've left me behind here, sorry, can't give you any input. > Is it complete enough to build a prototype? > ========================== > I think so; the incomplete areas I see are the ones mentioned in the > patch submission, including: > * Snapshot exporting for the initial COPY > * Spilling the reorder buffer to disk > > I think it would be possible to build a prototype without those, even though > we'd need them before I could build a production system. > Conclusions > ============= > I like this design much better than the original design from the spring that > would have required keeping a catalog proxy on the decoding machine. Based > on what I've seen it should be possible to make slony use logical > replication as a source for events instead of triggers populating sl_log. > My thinking is that we want a way for logreceiver programs to pass > arguments/parameters to the output plugins. Beyond that this looks like > something slony can use. > > Cool! Don't hesitate to mention anything that you think would make your life easier, chances are that you're not the only one who could benefit from it...
Thanks, Andres --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
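For readers following the teardown/rebuild dance described above, here is a
slightly fleshed-out sketch of that pseudocode. It is only a sketch:
RevertFromDecodingSnapshot() and SetupDecodingSnapshot() are names used in
this thread, not functions that exist in core, while the memory-context,
transaction, and cache-invalidation calls are the standard backend ones.

    /* Hypothetical sketch only; Revert/SetupDecodingSnapshot do not exist
     * in core. Relevant headers would be postgres.h, access/xact.h,
     * utils/inval.h, utils/memutils.h and utils/snapshot.h. */
    extern void RevertFromDecodingSnapshot(void);                 /* hypothetical */
    extern void SetupDecodingSnapshot(Snapshot snap, void *data); /* hypothetical */

    static void
    plugin_query_present(Snapshot decoding_snap, void *decoding_data)
    {
        MemoryContext tx_ctx = AllocSetContextCreate(TopMemoryContext,
                                                     "plugin query",
                                                     ALLOCSET_DEFAULT_MINSIZE,
                                                     ALLOCSET_DEFAULT_INITSIZE,
                                                     ALLOCSET_DEFAULT_MAXSIZE);

        /* leave the timetravel snapshot; caches built under it are stale */
        RevertFromDecodingSnapshot();
        InvalidateSystemCaches();

        StartTransactionCommand();
        /* ... normal catalog/SPI access against the current database ... */
        CommitTransactionCommand();

        MemoryContextDelete(tx_ctx);

        /* switch the timestream back; caches must be torn down again */
        SetupDecodingSnapshot(decoding_snap, decoding_data);
        InvalidateSystemCaches();
    }

The point of the two InvalidateSystemCaches() calls is that syscache entries
are only valid relative to one snapshot; whichever direction you cross,
everything cached on the other side is garbage.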
On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi Michael,
>
> On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > I have been able to fetch your code (thanks Andres!) and run some of it.
> > For the time being I am spending some time reading the code and
> > understanding the whole set of features you are trying to implement
> > inside core, even if I got some background from what you presented at
> > PGCon and from the hackers ML.
>
> Cool.
>
> > Btw, as a first approach, I tried to run the logical log receiver plugged
> > on a postgres server, and I am not able to make it work.
> > Well, I am using settings similar to yours.
> > # Run master
> > rm -r ~/bin/pgsql/master/
> > initdb -D ~/bin/pgsql/master/
> > echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
> > postgres -D ~/bin/pgsql/master \
> >   -c wal_level=logical \
> >   -c max_wal_senders=10 \
> >   -c max_logical_slots=10 \
> >   -c wal_keep_segments=100 \
> >   -c log_line_prefix="[%p %x] "
> > # Logical log receiver
> > pg_receivellog -f $HOME/output.txt -d postgres -v
> >
> > After launching some SQLs, the logical receiver is stuck just after
> > sending INIT_LOGICAL_REPLICATION, please see bt of process waiting:
>
> It's waiting till it sees an initial xl_running_xacts record. The
> easiest way to do that is to manually issue a checkpoint. Sorry, should
> have included that in the description.
> Otherwise you can wait till the next routine checkpoint comes around...
> I plan to cause more xl_running_xacts records to be logged in the
> future. I think the timing of those currently is non-optimal; you have
> the same problem as here in normal streaming replication as well :(
>
> > I am just looking at this patch and will provide some comments.
> > By the way, you forgot the installation part of pg_receivellog, please
> > see patch attached.
>
> That actually was somewhat intended, I thought people wouldn't like the
> name and I didn't want a binary that's going to be replaced anyway lying
> around ;)

OK no problem. For sure this is going to happen, I was wondering myself
if it could be possible to merge pg_receivexlog and pg_receivellog into
a single utility with multiple modes :)
Btw, here are some extra comments based on my progress, hope it will be
useful for other people playing around with your patches.
1) It is necessary to install the contrib module test_decoding on the
server side or the test case will not work.
2) The following logs show up on the server:
LOG: forced to assume catalog changes for xid 1370 because it was running
to early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
3) Assertion failure while running pgbench; I was just curious to see how
it reacted when logical replication was put under a little bit of load:
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c", Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
4) As mentioned by Andres above, logical replication begins only once
there is an xl_running_xacts record. I just enforced a checkpoint
manually (see the snippet below).
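For anyone replaying these steps, forcing that record is just the ordinary
CHECKPOINT command, for instance:

    psql postgres -c 'CHECKPOINT;'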
With all those things done, I have been able to set up the system; for example, these queries:
postgres=# create table ac (a int);
CREATE TABLE
postgres=# insert into ac values (1);
INSERT 0 1
created the expected output:
BEGIN 32135
COMMIT 32135
BEGIN 32136
table "ac": INSERT: a[int4]:1
COMMIT 32136
Now it is time to dig into the code...
--
Michael Paquier
http://michael.otacoo.com
On 12-11-18 11:07 AM, Andres Freund wrote:
> Hi Steve!
>
> I think we should provide some glue code to do this, otherwise people
> will start replicating all the bugs I hacked into this... More
> seriously: I think we should have support code here, no user will want
> to learn the intricacies of feedback messages and such. Where that would
> live? No idea.

libpglogicalrep.so ?

> I wholeheartedly agree. It should also be cleaned up a fair bit before
> others copy it, should we not go for having some client side library.
>
> Imo the library could very roughly be something like:
>
> state = SetupStreamingLLog(replication-slot, ...);
> while((message = StreamingLLogNextMessage(state))
> {
>     write(outfd, message->data, message->length);
>     if (received_100_messages)
>     {
>          fsync(outfd);
>          StreamingLLogConfirm(message);
>     }
> }
>
> Although I guess that's not good enough because StreamingLLogNextMessage
> would be blocking, but that shouldn't be too hard to work around.

How about we pass a timeout value to StreamingLLogNextMessage(..) where
it returns if no data is available after the timeout, to give the caller
a chance to do something else?

> > This is basically the Slony 2.2 sl_log format minus a few columns we no
> > longer need (txid, actionseq).
> > command_args is a postgresql text array of column=value pairs. Ie [
> > {id=1},{name='steve'},{project='slony'}]
>
> It seems to me that that makes escaping unnecessarily complicated, but
> given you already have all the code... ;)

When I look at the actual code/representation we picked, it is closer to
{column1,value1,column2,value2...}

> > I don't think our output plugin will be much more complicated than the
> > test_decoding plugin.
>
> Good. That's the idea ;). Are you ok with the interface as it is now or
> would you like to change something?

I'm going to think about this some more and maybe try to write an example
plugin before I can say anything with confidence.

> Yes. We will also need something like that. If you remember the first
> prototype we sent to the list, it included the concept of an
> 'origin_node' in the wal record. I think you actually reviewed that one ;)
>
> That was exactly aimed at something like this...
>
> Since then my thoughts about how the origin_id looks like have changed a
> bit:
> - origin id is internally still represented as an uint32/Oid
>   - never visible outside of wal/system catalogs
> - externally visible it gets
>   - assigned an uuid
>   - optionally assigned a user defined name
> - user settable (permissions?) origin when executing sql:
>   - SET change_origin_uuid = 'uuid';
>   - SET change_origin_name = 'user-settable-name';
>   - defaults to the local node
> - decoding callbacks get passed the origin of a change
>   - txn->{origin_uuid, origin_name, origin_internal?}
> - the init decoding callback can setup an array of interesting origins,
>   so the others don't even get the ReorderBuffer treatment
>
> I have to thank the discussion on -hackers and a march through Prague
> with Marko here...

So would the uuid and optional name assignment be done in the output
plugin or somewhere else? When/how does the uuid get generated, and where
do we store it so the same uuid gets returned when postgres restarts?
Slony today stores all this type of stuff in user-level tables and
user-level functions (because it has no other choice). What is the
connection between these values and the 'slot-id' in your proposal for
the init arguments? Does the slot-id need to be the external uuid of the
other end or is there no direct connection?

Today slony allows us to replicate between two databases in the same
postgresql cluster (I use this for testing all the time).
Slony also allows for two different 'slony clusters' to be setup in the
same database (or so I'm told, I don't think I have ever tried this
myself).

Plugin functions that let me query the local database and then return the
uuid and origin_name would work in this model.

+1 on being able to mark the 'change origin' in a SET command when the
replication process is pushing data into the replica.

> > Exactly how we do this filtering is an open question, I think the
> > output plugin will at a minimum need to know:
> >
> > a) What the slony node id is of the node it is running on. This is easy
> > to figure out if the output plugin is able/allowed to query its
> > database. Will this be possible? I would expect to be able to query the
> > database as it exists now (at plugin invocation time), not as it
> > existed in the past when the WAL was generated. In addition to the node
> > ID I can see us wanting to be able to query other slony tables
> > (sl_table, sl_set etc...)
>
> Hm. There is no fundamental reason not to allow normal database access
> to the current database but it won't be all that cheap, so doing it
> frequently is not a good idea.
> The reason it's not cheap is that you basically need to tear down the
> postgres internal caches if you switch the timestream in which you are
> working.
>
> Would go something like:
>
> TransactionContext = AllocSetCreate(...);
> RevertFromDecodingSnapshot();
> InvalidateSystemCaches();
> StartTransactionCommand();
> /* do database work */
> CommitTransactionCommand();
> /* cleanup memory */
> SetupDecodingSnapshot(snapshot, data);
> InvalidateSystemCaches();
>
> Why do you need to be able to query the present? I thought it might be
> necessary to allow additional tables be accessed in a timetraveling
> manner, but not this way round.
> I guess an initial round of querying during plugin initialization won't
> be good enough?

For example my output plugin would want the list of replicated tables (or
the list of tables replicated to a particular replica). This list can
change over time. As administrators issue commands to add or remove
tables to replication or otherwise reshape the cluster, the output plugin
will need to know about this. I MIGHT be able to get away with having
slon disconnect and reconnect on reconfiguration events so only the
init() call would need this data, but I am not sure.

One of the ways slony allows you to shoot your foot off is by changing
certain configuration things (like dropping a table from a set) while a
subscription is in progress. Being able to timetravel the slony
configuration tables might make this type of foot-gun a lot harder to
encounter, but that might be asking for too much.

> > b) What the slony node id is of the node we are streaming to. It would
> > be nice if we could pass extra, arbitrary data/parameters to the output
> > plugins that could include that, or other things. At the moment the
> > start_logical_replication rule in repl_gram.y doesn't allow for that
> > but I don't see why we couldn't make it do so.
>
> Yes, I think we want something like that. I even asked input on that
> recently ;):
> http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
>
> Input welcome!

How flexible will the datatypes for the arguments be? If I wanted to pass
in a list of tables (ie an array?) could I? Above I talked about having
the init() or change() methods query the local database. Another option
might be to make the slon build up this data (by querying the database
over a normal psql connection) and just passing the data in. However that
might mean passing in a list of a few thousand table names, which doesn't
sound like a good idea.

> > Even though, from a data-correctness point of view, slony could commit
> > the transaction on the replica after it sees the t1 commit, we won't
> > want it to do commits other than on a SYNC boundary. This means that
> > the replicas will continue to move between consistent SYNC snapshots
> > and that we can still track the state/progress of replication by
> > knowing what events (SYNC or otherwise) have been confirmed.
>
> I don't know enough about slony internals, but: why? This will prohibit
> you from ever doing (per-transaction) synchronous replication...

A lot of this has to do with the stuff I discuss in the section below on
cluster reshaping that you didn't understand. Slony depends on knowing
what data has, or hasn't, been sent to a replica at a particular event
id. If 'some' transactions in between two SYNC events have committed but
not others, then slony has no idea what data it needs to get elsewhere on
a FAILOVER type event. There might be a way to make this work otherwise,
but I'm not sure what that is and how long it will take to debug out the
issues.

> Cool! Don't hesitate to mention anything that you think would make your
> life easier, chances are that you're not the only one who could benefit
> from it...

Thanks, Andres
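Andres, replying further below, suggests a (length:data, length:data)+
framing instead of escaped text arrays. As a minimal sketch of what that
framing could look like (the helper names here are invented, not from the
patchset):

    #include <stdio.h>
    #include <string.h>

    /* Append one field as "<length>:<bytes>"; returns the new write
     * position. With explicit lengths, no quoting or escaping of the
     * payload is ever needed. */
    static char *
    append_field(char *out, const char *data)
    {
        size_t len = strlen(data);

        out += sprintf(out, "%zu:", len);
        memcpy(out, data, len);
        return out + len;
    }

    int
    main(void)
    {
        char  buf[64];
        char *p = buf;

        /* encodes the pair id=1 as "2:id1:1" */
        p = append_field(p, "id");
        p = append_field(p, "1");
        *p = '\0';
        puts(buf);
        return 0;
    }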
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > I am just looking at this patch and will provide some comments.
> > > By the way, you forgot the installation part of pg_receivellog, please
> > > see patch attached.
> >
> > That actually was somewhat intended, I thought people wouldn't like the
> > name and I didn't want a binary that's going to be replaced anyway lying
> > around ;)
>
> OK no problem. For sure this is going to happen, I was wondering myself if
> it could be possible to merge pg_receivexlog and pg_receivellog into a
> single utility with multiple modes :)

Don't really see that, the differences already are significant and imo
are bound to get bigger. Shouldn't live in pg_basebackup/ either..

> Btw, here are some extra comments based on my progress, hope it will be
> useful for other people playing around with your patches.
> 1) Necessary to install the contrib module test_decoding on server side or
> the test case will not work.
> 2) Obtention of the following logs on server:
> LOG: forced to assume catalog changes for xid 1370 because it was running
> to early
> WARNING: ABORT 1370
> Actually I saw that there are many warnings like this.

Those aren't unexpected. Perhaps I should not make it a warning then...

A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed
transactions which changed the catalog. Unfortunately we can't diagnose
whether a txn changed the catalog without a snapshot, so we just assume
all committed ones do - it just costs a bit of memory. That's the
background of the "forced to assume catalog changes for ..." message.

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state, which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream, we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.

Makes sense?

> 3) Assertion failure while running pgbench, I was just curious to see how
> it reacted when logical replication was put under a little bit of load.
> TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
> Line: 877)
> => pgbench -i postgres; pgbench -T 500 -c 8 postgres

Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.
It implies that snapstate->nrrunning has lost touch with reality...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
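To picture the three-state progression just described, here is an
illustrative sketch only; the real logic lives in snapbuild.c and is
considerably more involved:

    /* Illustrative sketch of the snapshot builder state machine above;
     * names and fields are condensed, not the actual snapbuild.c code. */
    typedef enum
    {
        SNAPBUILD_START,          /* nothing known yet */
        SNAPBUILD_FULL_SNAPSHOT,  /* can decode txns *starting* after this point */
        SNAPBUILD_CONSISTENT      /* can stream txns *committing* after this point */
    } SnapBuildStateSketch;

    static SnapBuildStateSketch
    advance(SnapBuildStateSketch state, int saw_running_xacts, int nrunning)
    {
        /* an xl_running_xacts record gives us the initial running-xid list */
        if (state == SNAPBUILD_START && saw_running_xacts)
            return SNAPBUILD_FULL_SNAPSHOT;

        /* every txn that was running at that point has now ended */
        if (state == SNAPBUILD_FULL_SNAPSHOT && nrunning == 0)
            return SNAPBUILD_CONSISTENT;

        return state;
    }

Both the "forced to assume catalog changes" lines and the ABORT warnings
Michael saw come from transactions that fall into the window before
CONSISTENT is reached.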
Hi,

On 2012-11-19 19:50:32 -0500, Steve Singer wrote:
> On 12-11-18 11:07 AM, Andres Freund wrote:
> > I think we should provide some glue code to do this, otherwise people
> > will start replicating all the bugs I hacked into this... More
> > seriously: I think we should have support code here, no user will want
> > to learn the intricacies of feedback messages and such. Where that would
> > live? No idea.
>
> libpglogicalrep.so ?

Yea. We don't really have the infrastructure for that yet though...
Robert and me were just talking about that recently...

> > I wholeheartedly agree. It should also be cleaned up a fair bit before
> > others copy it should we not go for having some client side library.
> >
> > Imo the library could very roughly be something like:
> >
> > state = SetupStreamingLLog(replication-slot, ...);
> > while((message = StreamingLLogNextMessage(state))
> > {
> >     write(outfd, message->data, message->length);
> >     if (received_100_messages)
> >     {
> >          fsync(outfd);
> >          StreamingLLogConfirm(message);
> >     }
> > }
> >
> > Although I guess that's not good enough because StreamingLLogNextMessage
> > would be blocking, but that shouldn't be too hard to work around.
>
> How about we pass a timeout value to StreamingLLogNextMessage(..) where it
> returns if no data is available after the timeout to give the caller a
> chance to do something else.

Doesn't really integrate into the sort of loop that's often built around
poll(2), select(2) and similar. It probably should return NULL if
there's nothing there yet and we should have a
StreamingLLogWaitForMessage() or such.

> > > This is basically the Slony 2.2 sl_log format minus a few columns we
> > > no longer need (txid, actionseq).
> > > command_args is a postgresql text array of column=value pairs. Ie [
> > > {id=1},{name='steve'},{project='slony'}]
> >
> > It seems to me that that makes escaping unnecessarily complicated, but
> > given you already have all the code... ;)
>
> When I look at the actual code/representation we picked it is closer to
> {column1,value1,column2,value2...}

Still means you need to escape and later parse columnN, valueN values. I
would have expected something like (length:data, length:data)+

> > > I don't think our output plugin will be much more complicated than
> > > the test_decoding plugin.
> >
> > Good. That's the idea ;). Are you ok with the interface as it is now or
> > would you like to change something?
>
> I'm going to think about this some more and maybe try to write an example
> plugin before I can say anything with confidence.

That would be very good.

> > Yes. We will also need something like that. If you remember the first
> > prototype we sent to the list, it included the concept of an
> > 'origin_node' in the wal record. I think you actually reviewed that one ;)
> >
> > That was exactly aimed at something like this...
> >
> > Since then my thoughts about how the origin_id looks like have changed
> > a bit:
> > - origin id is internally still represented as an uint32/Oid
> >   - never visible outside of wal/system catalogs
> > - externally visible it gets
> >   - assigned an uuid
> >   - optionally assigned a user defined name
> > - user settable (permissions?) origin when executing sql:
> >   - SET change_origin_uuid = 'uuid';
> >   - SET change_origin_name = 'user-settable-name';
> >   - defaults to the local node
> > - decoding callbacks get passed the origin of a change
> >   - txn->{origin_uuid, origin_name, origin_internal?}
> > - the init decoding callback can setup an array of interesting origins,
> >   so the others don't even get the ReorderBuffer treatment
> >
> > I have to thank the discussion on -hackers and a march through Prague
> > with Marko here...
>
> So would the uuid and optional name assignment be done in the output
> plugin or somewhere else?

That would be postgres infrastructure. The output plugin would get
passed at least uuid and name and potentially the internal name as well
(might be useful to build some internal caching of information).

> When/how does the uuid get generated and where do we store it so the
> same uuid gets returned when postgres restarts. Slony today stores all
> this type of stuff in user-level tables and user-level functions
> (because it has no other choice).

Would need to be its own system catalog.

> What is the connection between these values and the 'slot-id' in your
> proposal for the init arguments? Does the slot-id need to be the
> external uuid of the other end or is there no direct connection?

None really. The "slot-id" really is only an identifier for a
replication connection (which should live longer than a single
postmaster run) which contains information about the point up to which
you replicated. We need to manage some local resources based on that.

> Today slony allows us to replicate between two databases in the same
> postgresql cluster (I use this for testing all the time).
> Slony also allows for two different 'slony clusters' to be setup in the
> same database (or so I'm told, I don't think I have ever tried this
> myself).

Yuck. I haven't thought about this very much. I honestly don't see
support for the first case right now. The second shouldn't be too hard,
we already have the database oid available everywhere we need it.

> Plugin functions that let me query the local database and then return
> the uuid and origin_name would work in this model.

Should be possible.

> +1 on being able to mark the 'change origin' in a SET command when the
> replication process is pushing data into the replica.

Good.

> > > Exactly how we do this filtering is an open question, I think the
> > > output plugin will at a minimum need to know:
> > >
> > > a) What the slony node id is of the node it is running on. This is
> > > easy to figure out if the output plugin is able/allowed to query its
> > > database. Will this be possible? I would expect to be able to query
> > > the database as it exists now (at plugin invocation time), not as it
> > > existed in the past when the WAL was generated. In addition to the
> > > node ID I can see us wanting to be able to query other slony tables
> > > (sl_table, sl_set etc...)
> >
> > Hm. There is no fundamental reason not to allow normal database access
> > to the current database but it won't be all that cheap, so doing it
> > frequently is not a good idea.
> > The reason it's not cheap is that you basically need to tear down the
> > postgres internal caches if you switch the timestream in which you are
> > working.
> >
> > Would go something like:
> >
> > TransactionContext = AllocSetCreate(...);
> > RevertFromDecodingSnapshot();
> > InvalidateSystemCaches();
> > StartTransactionCommand();
> > /* do database work */
> > CommitTransactionCommand();
> > /* cleanup memory */
> > SetupDecodingSnapshot(snapshot, data);
> > InvalidateSystemCaches();
> >
> > Why do you need to be able to query the present? I thought it might be
> > necessary to allow additional tables be accessed in a timetraveling
> > manner, but not this way round.
> > I guess an initial round of querying during plugin initialization won't
> > be good enough?
>
> For example my output plugin would want the list of replicated tables
> (or the list of tables replicated to a particular replica). This list
> can change over time. As administrators issue commands to add or remove
> tables to replication or otherwise reshape the cluster the output
> plugin will need to know about this. I MIGHT be able to get away with
> having slon disconnect and reconnect on reconfiguration events so only
> the init() call would need this data, but I am not sure.
>
> One of the ways slony allows you to shoot your foot off is by changing
> certain configuration things (like dropping a table from a set) while a
> subscription is in progress. Being able to timetravel the slony
> configuration tables might make this type of foot-gun a lot harder to
> encounter but that might be asking for too much.

Actually timetravel access to those tables is considerably
easier/faster. I wanted to provide such tables anyway (because you need
them to safely write your own pg_enum alike types). It means that you
log slightly (32 + sizeof(XLogRecord) afair) more per modified row.

> > > b) What the slony node id is of the node we are streaming to. It
> > > would be nice if we could pass extra, arbitrary data/parameters to
> > > the output plugins that could include that, or other things. At the
> > > moment the start_logical_replication rule in repl_gram.y doesn't
> > > allow for that but I don't see why we couldn't make it do so.
> >
> > Yes, I think we want something like that. I even asked input on that
> > recently ;):
> > http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
> >
> > Input welcome!
>
> How flexible will the datatypes for the arguments be? If I wanted to
> pass in a list of tables (ie an array?) could I?

I was thinking of just a textual (key = value, ...) style list, similar
to options to EXPLAIN, COPY et al.

> Above I talked about having the init() or change() methods query the
> local database. Another option might be to make the slon build up this
> data (by querying the database over a normal psql connection) and just
> passing the data in. However that might mean passing in a list of a few
> thousand table names, which doesn't sound like a good idea.

No, it certainly doesn't.

> > > Even though, from a data-correctness point of view, slony could
> > > commit the transaction on the replica after it sees the t1 commit,
> > > we won't want it to do commits other than on a SYNC boundary. This
> > > means that the replicas will continue to move between consistent
> > > SYNC snapshots and that we can still track the state/progress of
> > > replication by knowing what events (SYNC or otherwise) have been
> > > confirmed.
> >
> > I don't know enough about slony internals, but: why? This will
> > prohibit you from ever doing (per-transaction) synchronous
> > replication...
>
> A lot of this has to do with the stuff I discuss in the section below
> on cluster reshaping that you didn't understand. Slony depends on
> knowing what data has, or hasn't, been sent to a replica at a
> particular event id. If 'some' transactions in between two SYNC events
> have committed but not others then slony has no idea what data it needs
> to get elsewhere on a FAILOVER type event. There might be a way to make
> this work otherwise but I'm not sure what that is and how long it will
> take to debug out the issues.

Ah, it starts to make sense. The way I solved that issue in the
prototype from around pgcon was that I included the LSN from the
original commit record in the remote transaction into the commit record
of the local transaction (with the origin_id set to the remote side).
That allowed to trivially restore the exact state of replication after a
crash even with synchronous_commit=off, as during replay you could
simply ensure the replication-surely-received-lsn of every remote side
was up to date. Then you can simply do a

START_LOGICAL_REPLICATION 'slot' just-recovered/lsn;

and restart applying (*not* INIT_LOGICAL_REPLICATION).

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
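A sketch of the restart Andres describes, seen from the receiver's side.
It assumes the apply process persisted the remote commit LSN inside the
same local transaction that applied the data (read_last_origin_lsn() is a
hypothetical helper); START_LOGICAL_REPLICATION is the replication command
from this patchset:

    #include <stdint.h>
    #include <stdio.h>
    #include <libpq-fe.h>

    extern uint64_t read_last_origin_lsn(void);  /* hypothetical helper */

    static PGresult *
    resume_apply(PGconn *repl_conn, const char *slot)
    {
        uint64_t lsn = read_last_origin_lsn();
        char     cmd[256];

        /* reuse the existing slot; INIT_LOGICAL_REPLICATION is setup-only */
        snprintf(cmd, sizeof(cmd), "START_LOGICAL_REPLICATION '%s' %X/%X",
                 slot,
                 (unsigned int) (lsn >> 32), (unsigned int) lsn);
        return PQexec(repl_conn, cmd);
    }

Because the stored LSN moves forward in the same commit as the applied
data, a crash can never leave the two out of sync, which is exactly why
this works even with synchronous_commit=off.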
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Those aren't unexpected. Perhaps I should not make it a warning then...
>
> A short explanation:
>
> We can only decode tuples we see in the WAL when we already have a
> timetravel catalog snapshot before that transaction started. To build
> such a snapshot we need to collect information about committed
> transactions which changed the catalog. Unfortunately we can't diagnose
> whether a txn changed the catalog without a snapshot so we just assume
> all committed ones do - it just costs a bit of memory. That's the
> background of the "forced to assume catalog changes for ..." message.
>
> The reason for the ABORTs is related but different. We start out in the
> "SNAPBUILD_START" state when we try to build a snapshot. When we find
> initial information about running transactions (i.e. xl_running_xacts)
> we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> decode all changes in transactions that start *after* the current
> lsn. Earlier transactions might have tuples on a catalog state we can't
> query.
> Only when all transactions we observed as running before the
> FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> As we want a consistent/reproducible set of transactions to produce
> output via the logstream we only pass transactions to the output plugin
> if they commit *after* CONSISTENT (they can start earlier though!). This
> allows us to produce a pg_dump compatible snapshot in the moment we get
> consistent that contains exactly the changes we won't stream out.
>
> Makes sense?
>
> > 3) Assertion failure while running pgbench, I was just curious to see how
> > it reacted when logical replication was put under a little bit of load.
> > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
> > Line: 877)
> > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
>
> Can you reproduce this one? I would be interested in log output. Because
> I did run pgbench for quite some time and I haven't seen that one after
> fixing some issues last week.
>
> It implies that snapstate->nrrunning has lost touch with reality...
Yes, I can reproduce in 10-20 seconds on one of my linux boxes. I haven't captured anything in the logs, but here is the backtrace of the core file produced.
#2 0x0000000000865145 in ExceptionalCondition (conditionName=0xa15100 "!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", errorType=0xa14f3b "FailedAssertion",
fileName=0xa14ed0 "snapbuild.c", lineNumber=877) at assert.c:54
#3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10, xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80, snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
#6 0x00000000007080b9 in DecodeRecordIntoReorderBuffer (reader=0x1a08300, state=0x19e4e20, buf=0x1a0a368) at decode.c:84
#7 0x0000000000708cad in replay_finished_record (state=0x1a08300, buf=0x1a0a368) at logicalfuncs.c:54
#8 0x00000000004d8033 in XLogReaderRead (state=0x1a08300) at xlogreader.c:965
#9 0x000000000070f7c3 in XLogSendLogical (caughtup=0x7fffb22c35fb "") at walsender.c:1721
#10 0x000000000070ea05 in WalSndLoop (send_data=0x70f6e2 <XLogSendLogical>) at walsender.c:1184
#11 0x000000000070e0eb in StartLogicalReplication (cmd=0x190eaa8) at walsender.c:726
#12 0x000000000070e3ac in exec_replication_command (cmd_string=0x19a65c8 "START_LOGICAL_REPLICATION 'id-0' 0/7E1855C") at walsender.c:853
#13 0x0000000000753ee0 in PostgresMain (argc=2, argv=0x18f63d8, username=0x18f62a8 "michael") at postgres.c:3974
#14 0x00000000006f13ea in BackendRun (port=0x1912600) at postmaster.c:3668
#15 0x00000000006f0b76 in BackendStartup (port=0x1912600) at postmaster.c:3352
#16 0x00000000006ed900 in ServerLoop () at postmaster.c:1431
#17 0x00000000006ed208 in PostmasterMain (argc=13, argv=0x18f40a0) at postmaster.c:1180
#18 0x0000000000657517 in main (argc=13, argv=0x18f40a0) at main.c:197
I'm keeping this core and the binary btw.
Michael Paquier
http://michael.otacoo.com
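Decoding the TRAP message for readers: FirstNormalTransactionId is 3 in
core, so the failed condition is just the following check, restated
standalone (macros copied so it compiles on its own). Frame #3 of the
backtrace shows SnapBuildTxnIsRunning() being called with xid=0, i.e.
InvalidTransactionId, so the first conjunct is the one that fails.

    #include <assert.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define FirstNormalTransactionId ((TransactionId) 3)
    #define TransactionIdIsNormal(x) ((x) >= FirstNormalTransactionId)

    /* the assertion at snapbuild.c:877, spelled out */
    static void
    snapbuild_877_check(TransactionId xid, TransactionId xmin_running)
    {
        assert(TransactionIdIsNormal(xid) &&
               TransactionIdIsNormal(xmin_running));
    }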
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > > I am just looking at this patch and will provide some comments.
> > > > By the way, you forgot the installation part of pg_receivellog,
> > > > please see patch attached.
> > >
> > > That actually was somewhat intended, I thought people wouldn't like
> > > the name and I didn't want a binary that's going to be replaced anyway
> > > lying around ;)
> > >
> > OK no problem. For sure this is going to happen, I was wondering myself
> > if it could be possible to merge pg_receivexlog and pg_receivellog into
> > a single utility with multiple modes :)
>
> Don't really see that, the differences already are significant and imo
> are bound to get bigger. Shouldn't live in pg_basebackup/ either...

I am sure that this will be the object of many future discussions.

> > Btw, here are some extra comments based on my progress, hope it will be
> > useful for other people playing around with your patches.
> > 1) Necessary to install the contrib module test_decoding on server side
> > or the test case will not work.
> > 2) Obtention of the following logs on server:
> > LOG: forced to assume catalog changes for xid 1370 because it was
> > running to early
> > WARNING: ABORT 1370
> > Actually I saw that there are many warnings like this.
>
> Those aren't unexpected. Perhaps I should not make it a warning then...

OK got it, thanks for your explanation.
A NOTICE would be more adapted: a WARNING means that something that may
endanger the system has happened, but as far as I understand from your
explanation this is not the case.

> A short explanation:
>
> We can only decode tuples we see in the WAL when we already have a
> timetravel catalog snapshot before that transaction started. To build
> such a snapshot we need to collect information about committed
> transactions which changed the catalog. Unfortunately we can't diagnose
> whether a txn changed the catalog without a snapshot so we just assume
> all committed ones do - it just costs a bit of memory. That's the
> background of the "forced to assume catalog changes for ..." message.

OK, so this snapshot only needs to include the XIDs of transactions that
have modified the catalogs. Do I get it right? This way you are able to
fetch the correct relation definition for replication decoding.
Just thinking but... It looks to be a waste to store the transaction XIDs
of all the committed transactions, but on the other hand there is no way
to track the XIDs of transactions that modified a catalog in current core
code. So yes this approach is better, as refining the transaction XID
tracking for snapshot reconstruction is something that could be improved
later. Those are only thoughts though...

> The reason for the ABORTs is related but different. We start out in the
> "SNAPBUILD_START" state when we try to build a snapshot. When we find
> initial information about running transactions (i.e. xl_running_xacts)
> we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> decode all changes in transactions that start *after* the current
> lsn. Earlier transactions might have tuples on a catalog state we can't
> query.

Just to be clear, lsn means the log-sequence number associated with each
xlog record?

> Only when all transactions we observed as running before the
> FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> As we want a consistent/reproducible set of transactions to produce
> output via the logstream we only pass transactions to the output plugin
> if they commit *after* CONSISTENT (they can start earlier though!). This
> allows us to produce a pg_dump compatible snapshot in the moment we get
> consistent that contains exactly the changes we won't stream out.
>
> Makes sense?

So, once again coming to it, we need in the built snapshot only the XIDs
of transactions that modified the catalogs, to get a consistent view of
relation info for decoding.
Really, I think that refining the XID tracking to minimize the size of
the snapshot built for decoding would be key for performance improvement,
especially for OLTP-type applications (lots of transactions involved, few
of them touching catalogs).

--
Michael Paquier
http://michael.otacoo.com
On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > Btw, here are some extra comments based on my progress, hope it will
> > > be useful for other people playing around with your patches.
> > > 1) Necessary to install the contrib module test_decoding on server
> > > side or the test case will not work.
> > > 2) Obtention of the following logs on server:
> > > LOG: forced to assume catalog changes for xid 1370 because it was
> > > running to early
> > > WARNING: ABORT 1370
> > > Actually I saw that there are many warnings like this.
> >
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> A NOTICE would be more adapted, a WARNING means that something that may
> endanger the system has happened, but as far as I understand from your
> explanation this is not the case.

I think it should go DEBUG2 or so once we're a bit more confident about
the code.

> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed
> > transactions which changed the catalog. Unfortunately we can't diagnose
> > whether a txn changed the catalog without a snapshot so we just assume
> > all committed ones do - it just costs a bit of memory. That's the
> > background of the "forced to assume catalog changes for ..." message.
> >
> OK, so this snapshot only needs to include the XIDs of transactions that
> have modified the catalogs. Do I get it right? This way you are able to
> fetch the correct relation definition for replication decoding.

Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
so it's not all of them. Normal snapshots carry all in-progress
transactionids instead of the committed ones, but that would have been
far more in our case (only a minority of txns touch the catalog) and it
has problems with subtransaction tracking.

> Just thinking but... It looks to be a waste to store the transaction XIDs
> of all the committed transactions, but on the other hand there is no way
> to track the XIDs of transactions that modified a catalog in current core
> code. So yes this approach is better as refining the transaction XID
> tracking for snapshot reconstruction is something that could be improved
> later. Those are only thoughts though...

We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.

> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> >
> Just to be clear, lsn means the log-sequence number associated with each
> xlog record?

Yes. And that number is just the position in the stream.

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
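To illustrate the inversion Andres describes (carrying *committed*
catalog-modifying xids between recentXmin and the newest catalog-modifying
txn, instead of the usual in-progress list), here is a toy visibility test.
It is only a sketch: real snapbuild snapshots also deal with
subtransactions, xid wraparound and much more, all ignored here.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t TransactionId;

    typedef struct
    {
        TransactionId  xmin;       /* everything below this has committed */
        TransactionId  xmax;       /* everything at/above this is invisible */
        TransactionId *committed;  /* sorted catalog-modifying commits */
        size_t         ncommitted;
    } ToySnapshot;

    static int
    cmp_xid(const void *a, const void *b)
    {
        TransactionId xa = *(const TransactionId *) a;
        TransactionId xb = *(const TransactionId *) b;

        return (xa > xb) - (xa < xb);
    }

    static bool
    xid_visible(const ToySnapshot *snap, TransactionId xid)
    {
        if (xid < snap->xmin)
            return true;           /* committed before the tracked window */
        if (xid >= snap->xmax)
            return false;          /* started after the snapshot was taken */

        /* inside the window: visible only if we saw it commit */
        return bsearch(&xid, snap->committed, snap->ncommitted,
                       sizeof(TransactionId), cmp_xid) != NULL;
    }

Storing the committed set is cheap precisely because, as the mail says,
only a minority of transactions ever touch the catalog.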
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed
> > transactions which changed the catalog. Unfortunately we can't diagnose
> > whether a txn changed the catalog without a snapshot so we just assume
> > all committed ones do - it just costs a bit of memory. That's the
> > background of the "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench, I was just curious to see
> > > how it reacted when logical replication was put under a little bit of
> > > load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > > "snapbuild.c", Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> > It implies that snapstate->nrrunning has lost touch with reality...
>
> Yes, I can reproduce in 10-20 seconds on one of my linux boxes. I haven't
> captured anything in the logs, but here is the backtrace of the core file
> produced.

Could you run it with log_level=DEBUG2?

Do you run pgbench after you've reached a consistent state (by issuing a
manual checkpoint)?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 21, 2012 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > It implies that snapstate->nrrunning has lost touch with reality...
> > >
> > Yes, I can reproduce in 10-20 seconds on one of my linux boxes. I haven't
> > captured anything in the logs, but here is the backtrace of the core file
> > produced.
>
> Could you run it with log_level=DEBUG2?

Let me try.

> Do you run pgbench after you've reached a consistent state (by issuing a
> manual checkpoint)?

Yes. I issue a manual checkpoint to initialize the replication.

--
Michael Paquier
http://michael.otacoo.com
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed
> > transactions which changed the catalog. Unfortunately we can't diagnose
> > whether a txn changed the catalog without a snapshot so we just assume
> > all committed ones do - it just costs a bit of memory. That's the
> > background of the "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench, I was just curious to see
> > > how it reacted when logical replication was put under a little bit of
> > > load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > > "snapbuild.c", Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> > It implies that snapstate->nrrunning has lost touch with reality...
>
> Yes, I can reproduce in 10-20 seconds on one of my linux boxes. I haven't
> captured anything in the logs, but here is the backtrace of the core file
> produced.

Ah, I see. Could you try the following diff?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index df24b33..797a126 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
 Snapstate *snapstate,
 	 */
 	snapstate->transactions_after = buf->origptr;
 
+	snapstate->nrrunning = running->xcnt;
 	snapstate->xmin_running = InvalidTransactionId;
 	snapstate->xmax_running = InvalidTransactionId;

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > > Btw, here are some extra comments based on my progress, hope it will
> > > > be useful for other people playing around with your patches.
> > > > 1) Necessary to install the contrib module test_decoding on server
> > > > side or the test case will not work.
> > > > 2) Obtention of the following logs on server:
> > > > LOG: forced to assume catalog changes for xid 1370 because it was
> > > > running to early
> > > > WARNING: ABORT 1370
> > > > Actually I saw that there are many warnings like this.
> > >
> > > Those aren't unexpected. Perhaps I should not make it a warning
> > > then...
> > >
> > A NOTICE would be more adapted, a WARNING means that something that may
> > endanger the system has happened, but as far as I understand from your
> > explanation this is not the case.
>
> I think it should go DEBUG2 or so once we're a bit more confident about
> the code.
>
> > > A short explanation:
> > >
> > > We can only decode tuples we see in the WAL when we already have a
> > > timetravel catalog snapshot before that transaction started. To build
> > > such a snapshot we need to collect information about committed
> > > transactions which changed the catalog. Unfortunately we can't
> > > diagnose whether a txn changed the catalog without a snapshot so we
> > > just assume all committed ones do - it just costs a bit of memory.
> > > That's the background of the "forced to assume catalog changes for
> > > ..." message.
> > >
> > OK, so this snapshot only needs to include the XIDs of transactions
> > that have modified the catalogs. Do I get it right? This way you are
> > able to fetch the correct relation definition for replication decoding.
>
> Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
> so it's not all of them. Normal snapshots carry all in-progress
> transactionids instead of the committed ones, but that would have been
> far more in our case (only a minority of txns touch the catalog) and it
> has problems with subtransaction tracking.

Hum. I might have missed something, but what is the variable tracking the
newest XID that modified catalogs?
I can see of course recentXmin in snapmgr.c but nothing related to what
you describe.

> > Just thinking but... It looks to be a waste to store the transaction
> > XIDs of all the committed transactions, but on the other hand there is
> > no way to track the XIDs of transactions that modified a catalog in
> > current core code. So yes this approach is better as refining the
> > transaction XID tracking for snapshot reconstruction is something that
> > could be improved later. Those are only thoughts though...
>
> We actually only track xids of catalog modifying transactions once we
> hit the CONSISTENT state. Before the initial snapshot we can't detect
> that.

How do you track them? I think I need to go deeper in the code before
asking more...

--
Michael Paquier
http://michael.otacoo.com
On 2012-11-21 16:47:11 +0900, Michael Paquier wrote:
> On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> > > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > > > Btw, here are some extra comments based on my progress, hope it
> > > > > will be useful for other people playing around with your patches.
> > > > > 1) Necessary to install the contrib module test_decoding on
> > > > > server side or the test case will not work.
> > > > > 2) Obtention of the following logs on server:
> > > > > LOG: forced to assume catalog changes for xid 1370 because it was
> > > > > running to early
> > > > > WARNING: ABORT 1370
> > > > > Actually I saw that there are many warnings like this.
> > > >
> > > > Those aren't unexpected. Perhaps I should not make it a warning
> > > > then...
> > > >
> > > A NOTICE would be more adapted, a WARNING means that something that
> > > may endanger the system has happened, but as far as I understand from
> > > your explanation this is not the case.
> >
> > I think it should go DEBUG2 or so once we're a bit more confident about
> > the code.
> >
> > > > A short explanation:
> > > >
> > > > We can only decode tuples we see in the WAL when we already have a
> > > > timetravel catalog snapshot before that transaction started. To
> > > > build such a snapshot we need to collect information about
> > > > committed transactions which changed the catalog. Unfortunately we
> > > > can't diagnose whether a txn changed the catalog without a snapshot
> > > > so we just assume all committed ones do - it just costs a bit of
> > > > memory. That's the background of the "forced to assume catalog
> > > > changes for ..." message.
> > > >
> > > OK, so this snapshot only needs to include the XIDs of transactions
> > > that have modified the catalogs. Do I get it right? This way you are
> > > able to fetch the correct relation definition for replication
> > > decoding.
> >
> > Yes. We only carry those between (recentXmin,
> > newestCatalogModifyingTxn), so it's not all of them. Normal snapshots
> > carry all in-progress transactionids instead of the committed ones,
> > but that would have been far more in our case (only a minority of txns
> > touch the catalog) and it has problems with subtransaction tracking.
> >
> Hum. I might have missed something but what is the variable tracking the
> newest XID that modified catalogs?
> I can see of course recentXmin in snapmgr.c but nothing related to what
> you describe.

We determine that ourselves:

SnapBuildCommitTxn(Snapstate *snapstate, ReorderBuffer *reorder,
                   XLogRecPtr lsn, TransactionId xid,
                   int nsubxacts, TransactionId *subxacts)
{
    ...
    if (forced_timetravel || top_does_timetravel || sub_does_timetravel)
    {
        if (!TransactionIdIsValid(snapstate->xmax) ||
            NormalTransactionIdFollows(xid, snapstate->xmax))
        {
            snapstate->xmax = xid;
            TransactionIdAdvance(snapstate->xmax);
        }

> > > Just thinking but... It looks to be a waste to store the transaction
> > > XIDs of all the committed transactions, but on the other hand there
> > > is no way to track the XIDs of transactions that modified a catalog
> > > in current core code. So yes this approach is better as refining the
> > > transaction XID tracking for snapshot reconstruction is something
> > > that could be improved later. Those are only thoughts though...
> >
> > We actually only track xids of catalog modifying transactions once we
> > hit the CONSISTENT state. Before the initial snapshot we can't detect
> > that.
> >
> How do you track them? I think I need to go deeper in the code before
> asking more...

You mean, how do I detect they are catalog modifying? By asking the
reorderbuffer (ReorderBufferXidDoesTimetravel(...)). That one knows
because we told him so (ReorderBufferXidSetTimetravel()) and we do that
by looking at the type of xid records we've seen incoming (HEAP_INPLACE,
HEAP2_NEW_CID tell us it's doing timetravel).

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
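In sketch form, that detection is a small dispatch like the following (the
ReorderBufferXid* calls are the ones Andres names; the record-type enum is
a stand-in for the real WAL record ids in decode.c):

    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef struct ReorderBuffer ReorderBuffer;   /* patchset type, opaque here */

    extern void ReorderBufferXidSetTimetravel(ReorderBuffer *rb,
                                              TransactionId xid);

    /* stand-ins for the real WAL record types */
    typedef enum { REC_OTHER, REC_HEAP_INPLACE, REC_HEAP2_NEW_CID } RecTypeSketch;

    static void
    note_catalog_change(ReorderBuffer *reorder, TransactionId xid,
                        RecTypeSketch rec)
    {
        /* only catalog-modifying txns produce these record types */
        if (rec == REC_HEAP_INPLACE || rec == REC_HEAP2_NEW_CID)
            ReorderBufferXidSetTimetravel(reorder, xid);
    }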
On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
>
> Ah, I see. Could you try the following diff?
>
> diff --git a/src/backend/replication/logical/snapbuild.c
> b/src/backend/replication/logical/snapbuild.c
> index df24b33..797a126 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
>  Snapstate *snapstate,
>  	 */
>  	snapstate->transactions_after = buf->origptr;
>
> +	snapstate->nrrunning = running->xcnt;
>  	snapstate->xmin_running = InvalidTransactionId;
>  	snapstate->xmax_running = InvalidTransactionId;

I am still getting the same assertion failure even with this diff
included.

--
Michael Paquier
http://michael.otacoo.com
On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
> > On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > ---
> > >  src/bin/Makefile            |   2 +-
> > >  src/bin/xlogdump/Makefile   |  25 +++
> > >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 494 insertions(+), 1 deletion(-)
> > >  create mode 100644 src/bin/xlogdump/Makefile
> > >  create mode 100644 src/bin/xlogdump/xlogdump.c
> >
> > Is this intended to be the successor of
> > https://github.com/snaga/xlogdump which will then be deprecated?
>
> As-is this is just a development tool which was sorely needed for the
> development of this patchset. But yes I think that once ready
> (xlogreader infrastructure, *_desc routines splitted) it should
> definitely be able to do most of what the above xlogdump can do and it
> should live in bin/. I think mostly some filtering is missing.
>
> That doesn't really "deprecate" the above though.
>
> Does that answer your question?

Yes, I think so. Thanks.

(I've just recently gotten the original xlogdump to work for me in 9.2,
and I had been wondering if back-porting yours to 9.2 would have been an
easier way to go.)

Cheers,

Jeff
On 2012-11-21 14:57:14 -0800, Jeff Janes wrote: > On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2012-11-15 09:06:23 -0800, Jeff Janes wrote: > >> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > --- > >> > src/bin/Makefile | 2 +- > >> > src/bin/xlogdump/Makefile | 25 +++ > >> > src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++ > >> > 3 files changed, 494 insertions(+), 1 deletion(-) > >> > create mode 100644 src/bin/xlogdump/Makefile > >> > create mode 100644 src/bin/xlogdump/xlogdump.c > >> > >> Is this intended to be the successor of > >> https://github.com/snaga/xlogdump which will then be deprecated? > > > > As-is this is just a development tool which was sorely needed for the > > development of this patchset. But yes I think that once ready > > (xlogreader infrastructure, *_desc routines splitted) it should > > definitely be able to do most of what the above xlogdump can do and it > > should live in bin/. I think mostly some filtering is missing. > > > > That doesn't really "deprecate" the above though. > > > > Does that answer your question? > > Yes, I think so. Thanks. > > (I've just recently gotten the original xlogdump to work for me in > 9.2, and I had been wonder if back-porting yours to 9.2 would have > been an easier way to go.) I don't think you would have much fun doing so - the WAL format changes between 9.2 and 9.3 make this larger than one might think. I had a version that worked with the previous format but there have been some interface changes since then... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-21 18:35:34 +0900, Michael Paquier wrote: > On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <andres@2ndquadrant.com>wrote: > > > On 2012-11-21 14:57:08 +0900, Michael Paquier wrote: > > > > Ah, I see. Could you try the following diff? > > > > diff --git a/src/backend/replication/logical/snapbuild.c > > b/src/backend/replication/logical/snapbuild.c > > index df24b33..797a126 100644 > > --- a/src/backend/replication/logical/snapbuild.c > > +++ b/src/backend/replication/logical/snapbuild.c > > @@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder, > > Snapstate *snapstate, > > */ > > snapstate->transactions_after = buf->origptr; > > > > + snapstate->nrrunning = running->xcnt; > > snapstate->xmin_running = InvalidTransactionId; > > snapstate->xmax_running = InvalidTransactionId; > > > I am still getting the same assertion failure even with this diff included. I really don't understand whats going on here then. Youve said you made sure that there is a catalog snapshot. Which means you would need something like: WARNING: connecting to postgres WARNING: Initiating logical rep LOG: computed new xmin: 16566894 LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000 LOG: found initial snapshot (via running xacts). Done: 1 WARNING: reached consistent point, stopping! WARNING: Starting logical replication LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000 LOG: found initial snapshot (via running xacts). Done: 1 in the log *and* it means that snapbuild->state has to be CONSISTENT. But the backtrace youve posted: #3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10,xid=0) at snapbuild.c:877 #4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450)at snapbuild.c:388 #5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732 shows pretty clearly that snapstate *can't* be consistent because line 387ff is: else if (snapstate->state < SNAPBUILD_CONSISTENT&& SnapBuildTxnIsRunning(snapstate, xid)) ; so #3 #4 can't happen at those line numbers with state == CONSISTENT. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I really don't understand what's going on here then. You've said you made
sure that there is a catalog snapshot. Which means you would need
something like:
WARNING: connecting to postgres
WARNING: Initiating logical rep
LOG: computed new xmin: 16566894
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace youve posted:
#3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10, xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80, snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732

shows pretty clearly that snapstate *can't* be consistent because line 387ff is:
else if (snapstate->state < SNAPBUILD_CONSISTENT &&
SnapBuildTxnIsRunning(snapstate, xid))
;
so #3 #4 can't happen at those line numbers with state == CONSISTENT.
Still this *impossible* thing happens.
Here is some more information on the logs I get on the server side:
Yes I have the logical replication correctly initialized:
[629 0] LOG: database system was shut down at 2012-11-22 09:02:42 JST
[628 0] LOG: database system is ready to accept connections
[633 0] LOG: autovacuum launcher started
[648 0] WARNING: connecting to postgres
[648 0] WARNING: Initiating logical rep
[648 0] LOG: computed new xmin: 684
[648 0] LOG: start reading from 0/178C1B8, scrolled back to 0/178C000
And I am also getting logs of this type with pg_receivellog:
BEGIN 698
table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8 abalance[int4]:-3641 filler[bpchar]:
table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10 tbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559 delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651 filler[bpchar]:(null)
COMMIT 698
Until the assertion failure:
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c", Line: 878)
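(Decoded back through the macros: FirstNormalTransactionId is 3, and TransactionIdIsNormal(x) expands to (x) >= FirstNormalTransactionId, so the failing check amounts to the following assertion in snapbuild.c:

    Assert(TransactionIdIsNormal(xid) &&
           TransactionIdIsNormal(snapstate->xmin_running));

which is consistent with the xid=0, i.e. InvalidTransactionId, visible in the backtrace above.)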
I still have the core file and its binary at hand if you want, so I can send them at will.
I have not been able to read your code yet, but there should be something you are missing.
Thanks,
--
Michael Paquier
http://michael.otacoo.com
On 2012-11-22 09:13:30 +0900, Michael Paquier wrote: > On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com>wrote: > > > I really don't understand whats going on here then. Youve said you made > > sure that there is a catalog snapshot. Which means you would need > > something like: > > WARNING: connecting to postgres > > WARNING: Initiating logical rep > > LOG: computed new xmin: 16566894 > > LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000 > > LOG: found initial snapshot (via running xacts). Done: 1 > > WARNING: reached consistent point, stopping! > > WARNING: Starting logical replication > > LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000 > > LOG: found initial snapshot (via running xacts). Done: 1 > > > > in the log *and* it means that snapbuild->state has to be > > CONSISTENT. But the backtrace youve posted: > > > > #3 0x000000000070c409 in SnapBuildTxnIsRunning > > (snapstate=0x19e4f10,xid=0) at snapbuild.c:877 > > #4 0x000000000070b8e4 in SnapBuildProcessChange > > (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368, > > relfilenode=0x1a0a450) at snapbuild.c:388 > > #5 0x000000000070c088 in SnapBuildDecodeCallback > > (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732 > > > > shows pretty clearly that snapstate *can't* be consistent because line > > 387ff is: > > else if (snapstate->state < SNAPBUILD_CONSISTENT && > > SnapBuildTxnIsRunning(snapstate, xid)) > > ; > > so #3 #4 can't happen at those line numbers with state == CONSISTENT. > > > Still this *impossible* thing happens. > Here are some more information on the logs I get on server side: > > Yes I have the logical replication correctly initialized: > [629 0] LOG: database system was shut down at 2012-11-22 09:02:42 JST > [628 0] LOG: database system is ready to accept connections > [633 0] LOG: autovacuum launcher started > [648 0] WARNING: connecting to postgres > [648 0] WARNING: Initiating logical rep > [648 0] LOG: computed new xmin: 684 > [648 0] LOG: start reading from 0/178C1B8, scrolled back to 0/178C000 Ok, so youve not yet reached a consistent point. Which means this shouldn't yet be written out: > And I am also getting logs of this type with pg_receivellog: > BEGIN 698 > table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8 > abalance[int4]:-3641 filler[bpchar]: > table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10 > tbalance[int4]:-3641 filler[bpchar]:(null) > table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641 > filler[bpchar]:(null) > table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559 > delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651 > filler[bpchar]:(null) > COMMIT 698 that could already be good enough of a hint, let me check tomorrow. > I still have the core file and its binary at hand if you want, so can send > them at will. If those aren't too big, its worth a try... Greetings, Andres --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Alvaro Herrera
Date:
This is very much the same as the previous patch, except it has been rebased to the latest master. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids
From
Simon Riggs
Date:
On 15 November 2012 12:07, Simon Riggs <simon@2ndquadrant.com> wrote: > On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote: > >> To avoid complicating logic we store both, the toplevel and the subxids, in >> ->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids. > > That looks good, not much change. Will apply in next few days. Please > add me as committer and mark ready. I tried improving this, but couldn't. So I've committed it as is. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
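So a consumer of xl_running_xacts walks one array with two logical halves, something like this (a sketch assuming the field names from the quoted description; the ProcessTopLevelXid/ProcessSubXid helpers are illustrative):

    int i;

    /* first ->xcnt entries are toplevel xids, the next ->subxcnt are subxids */
    for (i = 0; i < running->xcnt; i++)
        ProcessTopLevelXid(running->xip[i]);    /* illustrative helper */
    for (; i < running->xcnt + running->subxcnt; i++)
        ProcessSubXid(running->xip[i]);         /* illustrative helper */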
On 12-11-14 08:17 PM, Andres Freund wrote:

I am getting errors like the following when I try to use either your test_decoding plugin or my own (which does even less than yours):

LOG: database system is ready to accept connections
LOG: autovacuum launcher started
WARNING: connecting to
WARNING: Initiating logical rep
LOG: computed new xmin: 773
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: got new xmin 773 at 25124280
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: found initial snapshot (via running xacts). Done: 1
FATAL: cannot read pg_class without having selected a database
TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: "proc.c", Line: 759)

This seems to be happening under the calls at reorderbuffer.c:832
if (!SnapBuildHasCatalogChanges(NULL, xid, &change->relnode))

The sequence of events I do is:
1. start pg_receivellog
2. run a checkpoint
3. Attach to the walsender process with gdb
4. Start a new client connection with psql and do 'INSERT INTO a values (1)' twice.

(skipping step 3 doesn't make a difference)

> This introduces several things:
> * 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
> * 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
> * logging more data into wal to facilitate logical decoding
> * wal decoding into a reorderbuffer
> * shared library output plugins with 5 callbacks
>   * init
>   * begin
>   * change
>   * commit
> * walsender infrastructure to stream out changes and to keep the global xmin low enough
>   * INIT_LOGICAL_REPLICATION $plugin; waits till a consistent snapshot is built and returns
>     * initial LSN
>     * replication slot identifier
>     * id of a pg_export() style snapshot
>   * START_LOGICAL_REPLICATION $id $lsn; streams out changes
>   * uses named output plugins for output specification
>
> Todo:
> * testing infrastructure (isolationtester)
> * persistence/spilling to disk of built snapshots, longrunning transactions
> * user docs
> * more frequent lowering of xmins
> * more docs about the internals
> * support for user declared catalog tables
> * actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
> * own shared memory segment instead of piggybacking on walsender's
> * nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside.
> * more frequent xl_running_xid's so xmin can be upped more frequently
> * add STOP_LOGICAL_REPLICATION $id
> ---
>  src/backend/access/heap/heapam.c                 |  280 +++-
>  src/backend/access/transam/xlog.c                |    1 +
>  src/backend/catalog/index.c                      |   74 ++
>  src/backend/replication/Makefile                 |    2 +
>  src/backend/replication/logical/Makefile         |   19 +
>  src/backend/replication/logical/decode.c         |  496 ++++++++++
>  src/backend/replication/logical/logicalfuncs.c   |  247 +++++
>  src/backend/replication/logical/reorderbuffer.c  | 1156 +++++++++++++++++++++++
>  src/backend/replication/logical/snapbuild.c      | 1144 ++++++++++++++++++++++
>  src/backend/replication/repl_gram.y              |   32 +-
>  src/backend/replication/repl_scanner.l           |    2 +
>  src/backend/replication/walsender.c              |  566 ++++++++++-
>  src/backend/storage/ipc/procarray.c              |   23 +
>  src/backend/storage/ipc/standby.c                |    8 +-
>  src/backend/utils/cache/inval.c                  |    2 +-
>  src/backend/utils/cache/relcache.c               |    3 +-
>  src/backend/utils/misc/guc.c                     |   11 +
>  src/backend/utils/time/tqual.c                   |  249 +++++
>  src/bin/pg_controldata/pg_controldata.c          |    2 +
>  src/include/access/heapam_xlog.h                 |   23 +
>  src/include/access/transam.h                     |    5 +
>  src/include/access/xlog.h                        |    3 +-
>  src/include/catalog/index.h                      |    4 +
>  src/include/nodes/nodes.h                        |    2 +
>  src/include/nodes/replnodes.h                    |   22 +
>  src/include/replication/decode.h                 |   21 +
>  src/include/replication/logicalfuncs.h           |   44 +
>  src/include/replication/output_plugin.h          |   76 ++
>  src/include/replication/reorderbuffer.h          |  284 ++++++
>  src/include/replication/snapbuild.h              |  128 +++
>  src/include/replication/walsender.h              |    1 +
>  src/include/replication/walsender_private.h      |   34 +-
>  src/include/storage/itemptr.h                    |    3 +
>  src/include/storage/sinval.h                     |    2 +
>  src/include/utils/tqual.h                        |   31 +-
>  35 files changed, 4966 insertions(+), 34 deletions(-)
>  create mode 100644 src/backend/replication/logical/Makefile
>  create mode 100644 src/backend/replication/logical/decode.c
>  create mode 100644 src/backend/replication/logical/logicalfuncs.c
>  create mode 100644 src/backend/replication/logical/reorderbuffer.c
>  create mode 100644 src/backend/replication/logical/snapbuild.c
>  create mode 100644 src/include/replication/decode.h
>  create mode 100644 src/include/replication/logicalfuncs.h
>  create mode 100644 src/include/replication/output_plugin.h
>  create mode 100644 src/include/replication/reorderbuffer.h
>  create mode 100644 src/include/replication/snapbuild.h
Hi Steve, On 2012-12-02 21:52:08 -0500, Steve Singer wrote: > On 12-11-14 08:17 PM, Andres Freund wrote: > > I am getting errors like the following when I try to use either your > test_decoding plugin or my own (which does even less than yours) > > > LOG: database system is ready to accept connections > LOG: autovacuum launcher started > WARNING: connecting to > WARNING: Initiating logical rep > LOG: computed new xmin: 773 > LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000 > LOG: got new xmin 773 at 25124280 > LOG: found initial snapshot (via running xacts). Done: 1 > WARNING: reached consistent point, stopping! > WARNING: Starting logical replication > LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000 > LOG: found initial snapshot (via running xacts). Done: 1 > FATAL: cannot read pg_class without having selected a database > TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: > "proc.c", Line: 759) > > This seems to be happening under the calls at > reorderbuffer.c:832 if (!SnapBuildHasCatalogChanges(NULL, xid, > &change->relnode)) Two things: 1) Which exact options are you using for pg_receivellog? Not "-d replication" by any chance? 2) Could you check that you really have a fully clean build? That has hit me in the past as well. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-03 13:22:12 +0100, Andres Freund wrote: > Hi Steve, > > On 2012-12-02 21:52:08 -0500, Steve Singer wrote: > > On 12-11-14 08:17 PM, Andres Freund wrote: > > > > I am getting errors like the following when I try to use either your > > test_decoding plugin or my own (which does even less than yours) > > > > > > LOG: database system is ready to accept connections > > LOG: autovacuum launcher started > > WARNING: connecting to > > WARNING: Initiating logical rep > > LOG: computed new xmin: 773 > > LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000 > > LOG: got new xmin 773 at 25124280 > > LOG: found initial snapshot (via running xacts). Done: 1 > > WARNING: reached consistent point, stopping! > > WARNING: Starting logical replication > > LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000 > > LOG: found initial snapshot (via running xacts). Done: 1 > > FATAL: cannot read pg_class without having selected a database > > TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: > > "proc.c", Line: 759) > > > > This seems to be happening under the calls at > > reorderbuffer.c:832 if (!SnapBuildHasCatalogChanges(NULL, xid, > > &change->relnode)) > > Two things: > 1) Which exact options are you using for pg_receivellog? Not "-d > replication" by any chance? This seems to produce exactly that kind off error messages. I added some error checking around that. Pushed. Thanks! Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12-12-03 07:42 AM, Andres Freund wrote:
> Two things:
> 1) Which exact options are you using for pg_receivellog? Not "-d
> replication" by any chance?

Yes, that is exactly what I'm doing. Using a real database name instead makes this go away. Thanks

> This seems to produce exactly that kind off error messages. I added some
> error checking around that. Pushed.
>
> Thanks!
>
> Andres Freund
>
> --
> Andres Freund   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-03 09:35:55 -0500, Steve Singer wrote: > On 12-12-03 07:42 AM, Andres Freund wrote: > >Two things: > >1) Which exact options are you using for pg_receivellog? Not "-d > >replication" by any chance? > > Yes that is exactly what I'md doing. Using a real database name instead > makes this go away. Was using "replication" an accident or do you think it makes sense in some way? Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12-12-03 09:48 AM, Andres Freund wrote: > On 2012-12-03 09:35:55 -0500, Steve Singer wrote: >> On 12-12-03 07:42 AM, Andres Freund wrote: >> Yes that is exactly what I'md doing. Using a real database name >> instead makes this go away. > Was using "replication" an accident or do you think it makes sense in > some way? The 'replication' line in pg_hba.conf made me think that the database name had to be replication for walsender connections. > Greetings, > > Andres Freund > > -- > Andres Freund http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > >
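That reading of pg_hba.conf is understandable but wrong here: the special "replication" keyword only has meaning in the database column of pg_hba.conf lines, while pg_receivellog's -d option takes the name of an actual database to decode changes from. Compare (options as used elsewhere in this thread):

    $ pg_receivellog -f test.log -d replication    # rejected: "replication" is not a database
    $ pg_receivellog -f test.log -d postgres       # works: decodes changes from "postgres"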
Hi,

I tried to address most (all?) of your comments in the version from http://archives.postgresql.org/message-id/20121204175212.GB12055%40awork2.anarazel.de .

On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
> > +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
> > +	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>
> This looks pretty evil, and there is no documentation about what it is
> supposed to do.
>
> Windows build support needs some thought.

Ok, since Alvaro made it possible the build now only has rules like:

xlogreader.c: % : $(top_srcdir)/src/backend/access/transam/%
	rm -f $@ && $(LN_S) $< .

clogdesc.c: % : $(top_srcdir)/src/backend/access/rmgrdesc/%
	rm -f $@ && $(LN_S) $< .

and

OBJS = \
	clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
	mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
	standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o assert.o \
	$(WIN32RES) \
	pg_xlogdump.o pqexpbuf_strinfo.o compat.o tables.o xlogreader.o

pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
	$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)

That's easier to integrate into the windows build?

> > +static void
> > +usage(void)
> > +{
> > +	printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"), progname);
> > +	printf(_("Usage:\n"));
> > +	printf(_("  %s [OPTION]...\n"), progname);
> > +	printf(_("\nOptions:\n"));
> > +	printf(_("  -v, --version  output version information, then exit\n"));
> > +	printf(_("  -h, --help     show this help, then exit\n"));
> > +	printf(_("  -s, --start    from where recptr onwards to read\n"));
> > +	printf(_("  -e, --end      up to which recptr to read\n"));
> > +	printf(_("  -t, --timeline which timeline do we want to read\n"));
> > +	printf(_("  -i, --inpath   from where do we want to read? cwd/pg_xlog is the default\n"));
> > +	printf(_("  -o, --output   where to write [start, end]\n"));
> > +	printf(_("  -f, --file     wal file to parse\n"));
> > +}
>
> Options list should be in alphabetic order (or some other less random
> order). Most of these descriptions are not very intelligible (at least
> without additional documentation).

I tried to improve the help, it's now:

pg_xlogdump: reads/writes postgres transaction logs for debugging.

Usage:
  pg_xlogdump [OPTION]...

Options:
  -b, --bkp-details  output detailed information about backup blocks
  -e, --end RECPTR   read wal up to RECPTR
  -f, --file FILE    wal file to parse, cannot be specified together with -p
  -h, --help         show this help, then exit
  -p, --path PATH    from where do we want to read? cwd/pg_xlog is the default
  -s, --start RECPTR read wal in directory indicated by -p starting at RECPTR
  -t, --timeline TLI which timeline do we want to read, defaults to 1
  -v, --version      output version information, then exit

I wonder whether it would make sense to split the help into different sections? It seems likely we will gain some more options...

> no nls.mk

Do I need to do anything for that besides:

# src/bin/pg_xlogdump/nls.mk
CATALOG_NAME    = pg_xlogdump
AVAIL_LANGUAGES =
GETTEXT_FILES   = pg_xlogdump.c

?

Greetings,

Andres Freund

--
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-15 02:26:53 +0100, Andres Freund wrote:
> On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
> > In response to this you will soon find the 14 patches that currently
> > implement $subject.
>
> As it's not very wieldy to send around that many/big patches all the
> time, until the next "major" version I will just update the git tree at:
>
> Web:
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3
>
> Git:
> git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3

I pushed a new version which

- is rebased on top of master
- is based on top of the new xlogreader (biggest part)
- is based on top of the new binaryheap.h
- some fixes
- some more comments

Greetings,

Andres Freund

--
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Andres Freund
Date:
On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> > Features:
> > - streaming reading/writing
> > - filtering
> > - reassembly of records
> >
> > Reusing the ReadRecord infrastructure in situations where the code that wants
> > to do so is not tightly integrated into xlog.c is rather hard and would require
> > changes to rather integral parts of the recovery code which doesn't seem to be
> > a good idea.
> >
> > Missing:
> > - "compressing" the stream when removing uninteresting records
> > - writing out correct CRCs
> > - separating reader/writer
>
> I'm disappointed to see that there has been no progress on this patch since
> last commitfest. I thought we agreed on the approach I championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
> wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

Patch (git repo) is now based on top of my version of your xlogreader version...

Greetings,

Andres Freund

--
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2012-11-19 09:50:30 +0100, Andres Freund wrote:
> On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > After launching some SQLs, the logical receiver is stuck just after sending
> > INIT_LOGICAL_REPLICATION, please see bt of process waiting:
>
> It's waiting till it sees an initial xl_running_xacts record. The
> easiest way to do that is manually issue a checkpoint. Sorry, should
> have included that in the description.
> Otherwise you can wait till the next routine checkpoint comes around...
>
> I plan to cause more xl_running_xacts records to be logged in the
> future. I think the timing of those currently is non-optimal, you have
> the same problem as here in normal streaming replication as well :(

This is "fixed" now with the changes I pushed a second ago. Unless some longrunning transactions are around we now immediately jump to a ready state. This is achieved by:

1. regularly logging xl_running_xacts (in the background writer)
2. logging xl_running_xacts at the beginning of INIT_LOGICAL_REPLICATION

This also has the advantage that the xmin horizon can be increased much more frequently.

Greetings,

Andres Freund

--
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
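A minimal sketch of point (1), assuming it hangs off the background writer's main loop; the interval constant and timestamp bookkeeping are illustrative, while LogStandbySnapshot() is the existing routine that emits an xl_running_xacts record:

    /* sketch: periodically log a running-xacts snapshot so logical
     * decoding (and standbys) can reach a consistent state quickly */
    if (XLogStandbyInfoActive() &&
        TimestampDifferenceExceeds(last_snapshot_ts, GetCurrentTimestamp(),
                                   LOG_SNAPSHOT_INTERVAL_MS))  /* assumed constant */
    {
        LogStandbySnapshot();
        last_snapshot_ts = GetCurrentTimestamp();
    }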
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Heikki Linnakangas
Date:
(Offlist) Just a quick note that I'm working on this patch now. I pushed some trivial fixes to my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3 branch. - Heikki
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Heikki Linnakangas
Date:
On 10.12.2012 22:22, Heikki Linnakangas wrote: > (Offlist) > > Just a quick note that I'm working on this patch now. I pushed some > trivial fixes to my git repository at > git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3 > branch. Oops, wasn't offlist :-). Well, if anyone wants to take a look, feel free. - Heikki
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Heikki Linnakangas
Date:
I've been molding this patch for a while now, here's what I have this far (also available in my git repository). The biggest change is in the error reporting. A stand-alone program that wants to use xlogreader.c no longer has to provide a full-blown replacement for ereport(). The only thing that xlogreader.c used ereport() was when it encounters an invalid record. And even there we had the emode_for_corrupt_record hack. I think it's a much better API that XLogReadRecord just returns NULL on an invalid record, and an error string, and the caller can do what it wants with that. In xlog.c, we'll pass the error string to ereport(), with the right emode as determined by emode_for_corrupt_record. xlog.c is no longer concerned with emode_for_corrupt_record, or error levels in general. We talked about this earlier, and Tom Lane argued that "it's basically insane to imagine that you can carve out a non-trivial piece of the backend that doesn't contain any elog calls." (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but having done just that, it doesn't seem insane to me. xlogreader.c really is a pretty well contained piece of code. All the complicated stuff that contains elog calls and pallocs and more is in the callback, which can freely use all the normal backend infrastructure. Now, here's some stuff that still need to be done: * A stand-alone program using xlogreader.c has to provide an implementation of tliInHistory(). Need to find a better way to do that. Perhaps "#ifndef FRONTEND" the tliInHistory checks in xlogreader. * In xlog.c, some of the variables that used to be statics like readFile, readOff etc. are now in the XLogPageReadPrivate struct. But there's still plenty of statics left in there - it would certainly not work correctly if xlog.c tried to open two xlog files at the same time. I think it's just confusing to have some stuff in the XLogPageReadPrivate struct, and others as static, so I think we should get rid of XLogPageReadPrivate struct altogether and put back the static variables. At least it would make the diff smaller, which might help with reviewing. xlog.c probably doesn't need to provide a "private" struct to xlogreader.c at all, which is okay. * It's pretty ugly that to use the rm_desc functions, you have to provide dummy implementations of a bunch of backend functions, including pfree() and timestamptz_to_str(). Should find a better way to do that. * It's not clear to me how we'd handle translating the strings in xlogreader.c, when xlogreader.c is used in a stand-alone program like pg_xlogdump. Maybe we can just punt on that... * How about we move pg_xlogdump to contrib? It doesn't feel like the kind of essential tool that deserves to be in src/bin. - Heikki
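Under that API the caller side in xlog.c presumably ends up looking roughly like this (a sketch based on the description above, not the patch itself):

    XLogRecord *record;
    char       *errormsg;

    record = XLogReadRecord(xlogreader, RecPtr, &errormsg);
    if (record == NULL)
    {
        /* invalid record: the reader reports why, the caller picks the emode */
        if (errormsg)
            ereport(emode_for_corrupt_record(emode, RecPtr),
                    (errmsg_internal("%s", errormsg)));
        return NULL;
    }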
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Heikki Linnakangas
Date:
Forgot attachment.. On 11.12.2012 15:55, Heikki Linnakangas wrote: > I've been molding this patch for a while now, here's what I have this > far (also available in my git repository). > > The biggest change is in the error reporting. A stand-alone program that > wants to use xlogreader.c no longer has to provide a full-blown > replacement for ereport(). The only thing that xlogreader.c used > ereport() was when it encounters an invalid record. And even there we > had the emode_for_corrupt_record hack. I think it's a much better API > that XLogReadRecord just returns NULL on an invalid record, and an error > string, and the caller can do what it wants with that. In xlog.c, we'll > pass the error string to ereport(), with the right emode as determined > by emode_for_corrupt_record. xlog.c is no longer concerned with > emode_for_corrupt_record, or error levels in general. > > We talked about this earlier, and Tom Lane argued that "it's basically > insane to imagine that you can carve out a non-trivial piece of the > backend that doesn't contain any elog calls." > (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but > having done just that, it doesn't seem insane to me. xlogreader.c really > is a pretty well contained piece of code. All the complicated stuff that > contains elog calls and pallocs and more is in the callback, which can > freely use all the normal backend infrastructure. > > Now, here's some stuff that still need to be done: > > * A stand-alone program using xlogreader.c has to provide an > implementation of tliInHistory(). Need to find a better way to do that. > Perhaps "#ifndef FRONTEND" the tliInHistory checks in xlogreader. > > * In xlog.c, some of the variables that used to be statics like > readFile, readOff etc. are now in the XLogPageReadPrivate struct. But > there's still plenty of statics left in there - it would certainly not > work correctly if xlog.c tried to open two xlog files at the same time. > I think it's just confusing to have some stuff in the > XLogPageReadPrivate struct, and others as static, so I think we should > get rid of XLogPageReadPrivate struct altogether and put back the static > variables. At least it would make the diff smaller, which might help > with reviewing. xlog.c probably doesn't need to provide a "private" > struct to xlogreader.c at all, which is okay. > > * It's pretty ugly that to use the rm_desc functions, you have to > provide dummy implementations of a bunch of backend functions, including > pfree() and timestamptz_to_str(). Should find a better way to do that. > > * It's not clear to me how we'd handle translating the strings in > xlogreader.c, when xlogreader.c is used in a stand-alone program like > pg_xlogdump. Maybe we can just punt on that... > > * How about we move pg_xlogdump to contrib? It doesn't feel like the > kind of essential tool that deserves to be in src/bin. > > - Heikki -- - Heikki
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Andres Freund
Date:
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote: > I've been molding this patch for a while now, here's what I have this far > (also available in my git repository). On a very quick this looks good. I will try to rebase the decoding stuff and read a bit around in the course of that... > The biggest change is in the error reporting. A stand-alone program that > wants to use xlogreader.c no longer has to provide a full-blown replacement > for ereport(). The only thing that xlogreader.c used ereport() was when it > encounters an invalid record. And even there we had the > emode_for_corrupt_record hack. I think it's a much better API that > XLogReadRecord just returns NULL on an invalid record, and an error string, > and the caller can do what it wants with that. In xlog.c, we'll pass the > error string to ereport(), with the right emode as determined by > emode_for_corrupt_record. xlog.c is no longer concerned with > emode_for_corrupt_record, or error levels in general. > We talked about this earlier, and Tom Lane argued that "it's basically > insane to imagine that you can carve out a non-trivial piece of the backend > that doesn't contain any elog calls." > (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but > having done just that, it doesn't seem insane to me. xlogreader.c really is > a pretty well contained piece of code. All the complicated stuff that > contains elog calls and pallocs and more is in the callback, which can > freely use all the normal backend infrastructure. This is pretty good. I was a bit afraid of making this change - thus my really ugly emode callback hack - but this is way better. > Now, here's some stuff that still need to be done: > > * A stand-alone program using xlogreader.c has to provide an implementation > of tliInHistory(). Need to find a better way to do that. Perhaps "#ifndef > FRONTEND" the tliInHistory checks in xlogreader. We could just leave it in xlogreader in the first place. Having an #ifdef'ed out version in there seems to be schizophrenic to me, all the maintenance overhead, none of the fun... > * In xlog.c, some of the variables that used to be statics like > readFile, readOff etc. are now in the XLogPageReadPrivate struct. But > there's still plenty of statics left in there - it would certainly not > work correctly if xlog.c tried to open two xlog files at the same > time. I think it's just confusing to have some stuff in the > XLogPageReadPrivate struct, and others as static, so I think we should > get rid of XLogPageReadPrivate struct altogether and put back the > static variables. At least it would make the diff smaller, which might > help with reviewing. xlog.c probably doesn't need to provide a > "private" struct to xlogreader.c at all, which is okay. Fine with me. I find the grouping that the struct provides somewhat helpful when reading the code, but its more than offset by duplicating some of the variables. The reasons to have it are fewer compared when you'd introduced the struct - xlogreader does more now, so less needs to be handled outside. > * It's pretty ugly that to use the rm_desc functions, you have to provide > dummy implementations of a bunch of backend functions, including pfree() and > timestamptz_to_str(). Should find a better way to do that. I think most of the cases requiring those ugly hacks can be fixed to just use a caller-provided buffer, there's not that much left. timestamptz_to_str() is probably the most complex case. 
I just noticed there's already a second implementation in ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in a number of cases :( > * It's not clear to me how we'd handle translating the strings in > xlogreader.c, when xlogreader.c is used in a stand-alone program like > pg_xlogdump. Maybe we can just punt on that... I personally would have no problem with that. It's probably either going to be used during in-depth-debugging or when developing pg. In both cases english seems to be fine. But I seldomly use translated programs so maybe I am not the right person to ask. > * How about we move pg_xlogdump to contrib? It doesn't feel like the kind of > essential tool that deserves to be in src/bin. contrib would be fine, but I think src/bin is better. There have been quite some bugs by now where it would have been useful to have a reliable xlogdump in core so its really installed. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Andres Freund
Date:
On 2012-12-11 15:44:39 +0100, Andres Freund wrote: > On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote: > > * It's pretty ugly that to use the rm_desc functions, you have to provide > > dummy implementations of a bunch of backend functions, including pfree() and > > timestamptz_to_str(). Should find a better way to do that. > > I think most of the cases requiring those ugly hacks can be fixed to > just use a caller-provided buffer, there's not that much left. > > timestamptz_to_str() is probably the most complex case. I just noticed > there's already a second implementation in > ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in > a number of cases :( The attached (and pushed) patches change relpathbackend to use a static buffer instead. That gets rid of the pfree() requirement and looks ok otherwise as well. Unfortunately that still leaves us with the need to re-implement relpathbackend() in xlogdump, but that seems somwhat ok to me. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
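The shape of that change is presumably along these lines (a sketch of the static-buffer approach, not the committed diff; only the global-tablespace case is spelled out):

    char *
    relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
    {
        /* result lives in a static buffer, valid only until the next call;
         * no pfree() needed by callers */
        static char path[MAXPGPATH];

        if (rnode.spcNode == GLOBALTABLESPACE_OID)
            snprintf(path, sizeof(path), "global/%u", rnode.relNode);
        else
        {
            /* ... default/other tablespace, backend-local and fork cases
             * filled in as before ... */
        }
        return path;
    }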
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Andres Freund
Date:
On 2012-12-11 16:43:12 +0100, Andres Freund wrote: > On 2012-12-11 15:44:39 +0100, Andres Freund wrote: > > On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote: > > > * It's pretty ugly that to use the rm_desc functions, you have to provide > > > dummy implementations of a bunch of backend functions, including pfree() and > > > timestamptz_to_str(). Should find a better way to do that. > > > > I think most of the cases requiring those ugly hacks can be fixed to > > just use a caller-provided buffer, there's not that much left. > > > > timestamptz_to_str() is probably the most complex case. I just noticed > > there's already a second implementation in > > ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in > > a number of cases :( > > The attached (and pushed) patches change relpathbackend to use a static buffer > instead. That gets rid of the pfree() requirement and looks ok otherwise as > well. ... really attached. -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Andres Freund
Date:
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote: > I've been molding this patch for a while now, here's what I have this far > (also available in my git repository). > > The biggest change is in the error reporting. A stand-alone program that > wants to use xlogreader.c no longer has to provide a full-blown replacement > for ereport(). The only thing that xlogreader.c used ereport() was when it > encounters an invalid record. And even there we had the > emode_for_corrupt_record hack. I think it's a much better API that > XLogReadRecord just returns NULL on an invalid record, and an error string, > and the caller can do what it wants with that. In xlog.c, we'll pass the > error string to ereport(), with the right emode as determined by > emode_for_corrupt_record. xlog.c is no longer concerned with > emode_for_corrupt_record, or error levels in general. > > We talked about this earlier, and Tom Lane argued that "it's basically > insane to imagine that you can carve out a non-trivial piece of the backend > that doesn't contain any elog calls." > (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but > having done just that, it doesn't seem insane to me. xlogreader.c really is > a pretty well contained piece of code. All the complicated stuff that > contains elog calls and pallocs and more is in the callback, which can > freely use all the normal backend infrastructure. Now that I have read some of that code, I am currently unsure how the current implementation of this can cooperate with translation, even when used from the backend? Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Robert Haas
Date:
On Tue, Dec 11, 2012 at 9:44 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> * How about we move pg_xlogdump to contrib? It doesn't feel like the kind of >> essential tool that deserves to be in src/bin. > > contrib would be fine, but I think src/bin is better. There have been > quite some bugs by now where it would have been useful to have a > reliable xlogdump in core so its really installed. I think I'm with Heikki on this one. Dumping xlog data is useful, but it's really for developers and troubleshooters, not something we expect people to do on a regular basis, so contrib seems appropriate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Dimitri Fontaine
Date:
Robert Haas <robertmhaas@gmail.com> writes: > I think I'm with Heikki on this one. Dumping xlog data is useful, but > it's really for developers and troubleshooters, not something we > expect people to do on a regular basis, so contrib seems appropriate. There are two downsides for contrib rather than src/bin. First, maintainance and user trust are easier done and achieved in src/bin. Second, a lot of users won't install contribs in their production server and will then miss the tool when they need it most. In some places getting new software installed on a certified production setup is not easy. I would agree to get that piece in contrib if we were to work again on separating contribs into "production monitoring and diagnosis", "production ready extra" (hstore) and the rest, basically. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Tom Lane
Date:
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes: > Robert Haas <robertmhaas@gmail.com> writes: >> I think I'm with Heikki on this one. Dumping xlog data is useful, but >> it's really for developers and troubleshooters, not something we >> expect people to do on a regular basis, so contrib seems appropriate. > There are two downsides for contrib rather than src/bin. First, > maintainance and user trust are easier done and achieved in src/bin. User trust, maybe, but the "maintenance" argument seems bogus. We ship contrib on the same release schedule as core. > Second, a lot of users won't install contribs in their production server > and will then miss the tool when they need it most. TBH, I don't believe that ordinary users will need this tool at all, ever, and thus I don't want it in src/bin/. From a packaging standpoint it will be a lot easier if it's in contrib ... otherwise I'll probably have to invent some new sub-RPM along the lines of postgresql-extras so as to avoid bloating the core server package. regards, tom lane
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Dimitri Fontaine
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes: > User trust, maybe, but the "maintenance" argument seems bogus. > We ship contrib on the same release schedule as core. I meant maintenance as in updating the code when it needs to be, I'm not sure contrib systematically receives the same careness as core. I have no data to back my feeling, though. > TBH, I don't believe that ordinary users will need this tool at all, > ever, and thus I don't want it in src/bin/. From a packaging standpoint > it will be a lot easier if it's in contrib ... otherwise I'll probably > have to invent some new sub-RPM along the lines of postgresql-extras > so as to avoid bloating the core server package. Oh. I didn't know that the server package would be considered bloated by anyone and that would impact the way to ship our binaries. What about splitting contrib *officially* then, not just in your RH packages, and have postgresql-server-extra-diagnosis, -extra-data-types and -contrib with the things you tipically don't want in production? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Peter Geoghegan
Date:
On 11 December 2012 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > TBH, I don't believe that ordinary users will need this tool at all, > ever, and thus I don't want it in src/bin/. From a packaging standpoint > it will be a lot easier if it's in contrib ... otherwise I'll probably > have to invent some new sub-RPM along the lines of postgresql-extras > so as to avoid bloating the core server package. I happen to agree that pg_xlogdump belongs in contrib, but I think that the importance of avoiding "bloat" has been overstated. Maybe it's slightly useful to make sure that Postgres can get on the Fedora CD, but that aside, is including pg_xlogdump here, for example, really likely to make any appreciable difference package-wise? pg_xlogdump is 141K on my system. I'd hate to see us embrace the exact opposite tendency, towards including everything but the kitchen sink, but at the same time that seems like a very insignificant size. Perhaps people who live in countries with less bandwidth care about these things more. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Dimitri Fontaine
Date:
Peter Geoghegan <peter@2ndquadrant.com> writes: > Perhaps people who live in countries with less bandwidth care about > these things more. The day they will need it is not the day the bandwidth will magically increase, is all I'm saying. Better have that around just in case you get WAL corruption because of a crappy RAID controler or whatnot. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From
Andres Freund
Date:
On 2012-12-11 22:52:09 +0000, Peter Geoghegan wrote: > On 11 December 2012 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > TBH, I don't believe that ordinary users will need this tool at all, > > ever, and thus I don't want it in src/bin/. From a packaging standpoint > > it will be a lot easier if it's in contrib ... otherwise I'll probably > > have to invent some new sub-RPM along the lines of postgresql-extras > > so as to avoid bloating the core server package. > > I happen to agree that pg_xlogdump belongs in contrib Ok, I think there has been clear support for putting it into contrib, I can comfortably live with that even though I would prefer otherwise. So lets concentrate on other things ;) > pg_xlogdump is 141K on my system. I'd hate to see us embrace the exact > opposite tendency, towards including everything but the kitchen sink, > but at the same time that seems like a very insignificant size. > Perhaps people who live in countries with less bandwidth care about > these things more. Optimized and stripped - which is what most distros do - it's 40k here. Gzipped - as in packages - its only 20k on its own. So its even smaller ;) Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 9 December 2012 19:14, Andres Freund <andres@2ndquadrant.com> wrote:
> I pushed a new version which
>
> - is rebased on top of master
> - is based on top of the new xlogreader (biggest part)
> - is based on top of the new binaryheap.h
> - some fixes
> - some more comments

I decided to take another look at this, following my earlier reviews of a substantial subset of earlier versions of this patch, including my earlier reviews of WAL decoding [1] and a focused review of snapshot building [2] (itself a subset of WAL decoding). I think it was the right time to consolidate your multiple earlier patches, because some of the earlier BDR patches were committed (including "Rearrange storage of data in xl_running_xacts" [3], "Basic binary heap implementation" [4], "Embedded list interface" [5], and, though it isn't touched on here and is technically entirely distinct functionality, "Background worker processes" [6]). Furthermore, now that we've gotten past some early rounds of reviewing, it makes sense to build a *perfectly* (rather than just approximately) atomic unit, as we work towards something that is actually committable.

So what's the footprint of this big, newly rebased feature branch? Well, though some of these changes are uncommitted stuff from Heikki (i.e. XLogReader, which you've modified), and some of this is README documentation, the footprint is very large. I merged master with your dev branch (last commit of yours, 743f3af081209f784a30270bdf49301e9e242b78, made on Mon 10th Dec 15:35), and the stats are:

91 files changed, 9736 insertions(+), 956 deletions(-)

Note that there is a relatively large number of files affected in part because the tqual interface was bashed around a bit for the benefit of logical decoding - a lot of the changes to each of those 91 files are completely trivial. I'm very glad that you followed my earlier recommendation of splitting your demo logical changeset consumer into a contrib module, in the spirit of contrib/spi, etc. This module, "test_decoding", represents a logical entry point, if you will, for the entire patch. As unwieldy as it may appear to be, the patch is (or at least *should be*) ultimately reducible to some infrastructural changes to core to facilitate this example logical change-set consumer.

test_decoding contrib module
============================

 contrib/Makefile                      |   1 +
 contrib/test_decoding/Makefile        |  16 +
 contrib/test_decoding/test_decoding.c | 192 ++++

Once again, because test_decoding is a kind of "entry point", it gives me a nice point to continually refer back to when talking about this patch. (Incidentally, maybe test_decoding should be called pg_decoding?)

The regression tests pass, though this isn't all that surprising, since frankly the test coverage of this patch appears to be quite low. I know that you're working with Abhijit on improvements to the isolation tester to verify the correctness of the patch as it relates to supporting actual, practical logical replication systems. I would very much welcome any such test coverage (even if it wasn't in a committable state), since in effect you're asking me to take a leap of faith in respect of how well this infrastructure will support such systems – previously, I obliged you and didn't focus on concurrency and serializability concerns (it was sufficient to print out values/do some decoding in a toy function), but it's time to take a closer look at those now, I feel.
test_decoding is a client of the logical change-set producing infrastructure, and there appears to be broad agreement that that infrastructure needs to treat such consumers in a way that is maximally abstract. My question is, just how abstract does this interface have to be, really? How well are you going to support the use-case of a real logical replication system? Now, maybe it's just that I haven't been paying attention (in particular, to the discussion surrounding [3] – though that commit doesn't appear to have been justified in terms of commit ordering in BDR at all), but I would like you to be more demonstrative of certain things, like:

1. Just what does a logical change-set consumer look like? What things are always true of one, and never true of one?

2. Please describe in as much detail as possible the concurrency issues with respect to logical replication systems. Please make verifiable, testable claims as to how well these issues are considered here, perhaps with reference to the previous remarks of subject-matter experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9] following my earlier review.

I'm not all that impressed with where test_decoding is at right now. There is still essentially no documentation. I think it's notable that you don't really touch the ReorderBufferTXN passed by the core system in the test_decoding plugin.

test_decoding and pg_receivellog
================================

I surmised that the way that the test_decoding module is intended to be used is as a client of receivellog.c (*not* receivelog.c – that naming is *confusing*; perhaps call it receivelogiclog.c or something. Better still, make receivexlog handle the logical case rather than inventing a new tool). The reason for receivellog.c existing, as you yourself put it, is:

+ /*
+  * We have to use postgres.h not postgres_fe.h here, because there's so much
+  * backend-only stuff in the XLOG include files we need. But we need a
+  * frontend-ish environment otherwise. Hence this ugly hack.
+  */

So receivellog.c is part of a new utility called pg_receivellog, in much the same way as receivexlog.c is part of the existing pg_receivexlog utility (see commit b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're talking about these changes:

 src/backend/utils/misc/guc.c           |  11 +
 src/bin/pg_basebackup/Makefile         |   7 +-
 src/bin/pg_basebackup/pg_basebackup.c  |   4 +-
 src/bin/pg_basebackup/pg_receivellog.c | 717 ++++++++++
 src/bin/pg_basebackup/pg_receivexlog.c |   4 +-
 src/bin/pg_basebackup/receivelog.c     |   4 +-
 src/bin/pg_basebackup/streamutil.c     |   3 +-
 src/bin/pg_basebackup/streamutil.h     |   1 +

So far, so good. Incidentally, you forgot to do this:

install: all installdirs
	$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
	$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+	$(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

So this creates a new binary executable, pg_receivellog, which is described as "the pg_receivexlog equivalent for logical changes". Much like pg_receivexlog, pg_receivellog issues special new replication protocol commands for logical replication, which account for your changes to the replication protocol grammar and lexer (i.e. walsender):

 src/backend/replication/repl_gram.y    | 32 +-
 src/backend/replication/repl_scanner.l |  2 +

You say:

+ /* This is is just for demonstration, don't ever use this code for anything real! */

uh, why not? What is the purpose of a contrib module, if not to serve as a minimal example?
So, I went to play with pg_receivellog, and I got lots of output like this:

[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
pg_receivellog: could not init logical rep: got 0 rows and 0 fields, expected 1 rows and 4 fields
pg_receivellog: disconnected. Waiting 5 seconds to try again.

Evidently you expected me to see this message:

+ if (!walsnd)
+ {
+ 	elog(ERROR, "couldn't find free logical slot. free one or increase max_logical_slots");
+ }

If I did, that might have been okay. I didn't though, presumably because the “walsnd” variable was wild/uninitialised.

So, I went and set max_logical_slots to something higher than 0, and restarted. pg_receivellog behaved itself this time.

In one terminal:

[peter@peterlaptop decode]$ tty
/dev/pts/0
[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication

In another:

[peter@peterlaptop decode]$ tty
/dev/pts/1
[peter@peterlaptop decode]$ psql
Expanded display is used automatically.
psql (9.3devel)
Type "help" for help.

postgres=# insert into b values(66,64);
INSERT 0 1
postgres=# \q
[peter@peterlaptop decode]$ cat test.log
BEGIN 1910
table "b": INSERT: i[int4]:66 j[int4]:64
COMMIT 1910

We're subscribed to logical changes, and everything looks about right. We have a toy demo of a logical change-set subscriber.

I wondered how this had actually worked. Since test_decoding had done nothing more than expose some functions, without registering any callback in the conventional way (hooks, etc), how could it have worked? That brings me to the interface used by plugins like this test_decoding.

Plugin interface
===========

So test_decoding uses various types of caches and catalogs. I'm mostly worried about the core BDR interface that it uses, more-so than this other stuff. I'm talking about:

src/include/replication/output_plugin.h | 76 ++

One minor gripe is that output_plugin.h isn't going to pass muster with cpluspluscheck (private is a C++ keyword).

There are more serious problems, though. In particular, I'm quite perplexed at some of the code that “installs” the test_decoding plugin. The test_decoding module is hard-coded within pg_receivellog thusly (the SCONST token here could name an arbitrary module):

+ res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");

Furthermore, the names of particular test_decoding routines are hard-coded into core, using libdl/PG_MODULE_MAGIC introspection:

+ XLogReaderState *
+ normal_snapshot_reader(XLogRecPtr startpoint, TransactionId xmin,
+                        char *plugin, XLogRecPtr valid_after)
+ {
+ 	/* to simplify things we reuse initial_snapshot_reader */
+ 	XLogReaderState *xlogreader = initial_snapshot_reader(startpoint, xmin);

*** SNIP ***

+ 	/* lookup symbols in the shared libarary */
+
+ 	/* optional */
+ 	apply_state->init_cb = (LogicalDecodeInitCB)
+ 		load_external_function(plugin, "pg_decode_init", false, NULL);
+
+ 	/* required */
+ 	apply_state->begin_cb = (LogicalDecodeBeginCB)
+ 		load_external_function(plugin, "pg_decode_begin_txn", true, NULL);

*** SNIP ***

This seems fairly wrong-headed. Comments above this function say:

+ /*
+ * Build a snapshot reader with callbacks found in the shared library "plugin"
+ * under the symbol names found in output_plugin.h.
+ * It wraps those callbacks so they send out their changes via an logical
+ * walsender.
+ */

So the idea is that the names of all functions with public linkage within test_decoding (their symbols) have magical significance, and that the core system resolves those magic symbols dynamically. I'm not aware of this pattern appearing anywhere else within Postgres.

Furthermore, it seems kind of short-sighted. Have we not painted ourselves into a corner with regard to using multiple plugins at once? That doesn't seem terribly unreasonable: we might, for example, want to use test_decoding in production to debug a problem, while running a proper logical replication system and some other logical change-set consumer in tandem. Idiomatic use of “hooks” allows multiple plugins to be called for the same call of the authoritative hook by the core system, as for example when using auto_explain and pg_stat_statements at the same time. Why not just use hooks? It isn't obvious that you shouldn't be able to do this.

The signature of the function pg_decode_change (imposed by the function pointer typedef LogicalDecodeChangeCB) assumes that everything should go through a passed StringInfo, but I have a hard time believing that that's a good idea. It's like your plugin functions as a way of filtering reorder buffers. It's not as if the core system just passes logical change-sets off, as one might expect. It is actually the case that clients have to connect to the server in replication mode, and get their change-sets (as filtered by their plugin) streamed by a walsender over the wire protocol directly. What of making changeset subscribers generic abstractions? Again, maybe you don't have to do anything with the StringInfo, but that is far from clear from the extant code and documentation.

Snapshot builder
================

We've seen [1] that the snapshot builder is concerned with building snapshots for the purposes of timetravel. This is needed to see the contents of the catalog at a point in time when decoding (see design documents for more).

src/backend/access/transam/xact.c | 4 +-
src/backend/replication/logical/DESIGN.txt | 603 ++++++++++
src/backend/replication/logical/Makefile | 25 +
src/backend/replication/logical/README.SNAPBUILD.txt | 298 +++++
src/backend/replication/logical/snapbuild.c | 1181 +++++++++++++++++++
src/backend/utils/time/tqual.c | 299 ++++-
src/include/access/heapam_xlog.h | 23 +
src/include/c.h | 1 +
src/include/replication/snapbuild.h | 129 +++
src/include/utils/snapshot.h | 4 +-
src/include/utils/tqual.h | 51 +-

I've lumped the tqual/snapshot visibility changes under “Snapshot builder” too, and anything mostly to do with ComboCids. The README.SNAPBUILD document (and the code described by it) was previously the focus of an entire review of its own [2].

I still don't see why you're allocating snapshots within DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots would be better allocated alongside the ReadApplyState that is directly concerned with snapshots, to better encapsulate the snapshot stuff. Now, you do at least acknowledge this problem this time around:

+ /*
+ * FIXME: The existance of the snapshot builder is pretty obvious to the
+ * outside right now, that doesn't seem to be very good...
+ */

However, the fact is that this function:

+ Snapstate *
+ AllocateSnapshotBuilder(ReorderBuffer *reorder)
+ {

doesn't actually do anything with the ReorderBuffer pointer that it is passed. So I don't see why you've put off doing this, as if it's something that would require a non-trivial effort.
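The sort of thing I have in mind is as simple as this (a sketch only; the field is my invention, not the patch's):

/* Sketch of the encapsulation I'm asking for; the field name is mine. */
typedef struct Snapstate
{
	ReorderBuffer *reorder;		/* the reorder buffer we build snapshots for */
	/* ... existing snapshot building state ... */
} Snapstate;

Snapstate *
AllocateSnapshotBuilder(ReorderBuffer *reorder)
{
	Snapstate  *snapstate = palloc0(sizeof(Snapstate));

	/* actually keep the pointer we were handed */
	snapstate->reorder = reorder;

	return snapstate;
}

With that in place, callers like DecodeRecordIntoReorderBuffer() needn't know anything about snapshot allocation at all.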
One of my major concerns during that review was the need for this “peg xmin horizon” hack - you presented an example that required the use of a prepared transaction to artificially peg the global xmin horizon, and I wasn't happy about that. We were worried about catalog tables getting vacuumed in a way that prevented us from correctly interpreting data about types in the face of transactions that mix DML and DDL. If the catalog tables were vacuumed, we'd be out of luck - we needed to do something somewhat analogous to hot_standby_feedback. At the same time, we need to manage the risk of bloat on the primary due to non-availability of a standby in some speculative replication system using this infrastructure.

One proposal floated around was to have a special notion of xmin horizon - a more granular xmin horizon applicable to only the necessary catalog tables. You didn't pursue that idea yet, preferring to solve the simpler case. You say of xmin horizon handling:

+ == xmin Horizon Handling ==
+
+ Reusing MVCC for timetravel access has one obvious major problem:
+ VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
+ obviously, we want autovacuum/manual vacuum to work as before.
+
+ The idea here is to reuse the infrastrcuture built for hot_standby_feedback
+ which allows us to keep the xmin horizon of a walsender backend artificially
+ low. We keep it low enough so we can restart decoding from the last location
+ the client has confirmed to be safely received. The means that we keep it low
+ enough to contain the last checkpoints oldestXid value.
+
+ That also means we need to make that value persist across restarts/crashes in a
+ very similar manner to twophase.c's. That infrastructure actually also useful
+ to make hot_standby_feedback work properly across primary restarts.

So we jury-rig the actual xmin horizon by doing this:

+ /*
+ * inrease shared memory state, so vacuum can work
+ * on tuples we prevent from being purged.
+ */
+ IncreaseLogicalXminForSlot(buf->origptr,
+                            running->oldestRunningXid);

We switch the WAL sender proc's xmin while the walsender replies to a message, while preserving the “real” xmin horizon. Presumably this is crash safe, since we do this as part of XLOG_RUNNING_XACTS replay (iff we're doing “logical recovery”; that is, decoding is being performed as we reach SNAPBUILD_CONSISTENT):

recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata);

I continue to be quite concerned about the failure modes here. I do not accept that this is no worse than using hot_standby_feedback. hot_standby_feedback can see a standby bloat up the master because it has a long-running transaction - it's a process that the standby must actively engage in. However, what you have here will bloat up the master passively; standbys have to actively work to *prevent* that from happening. That's a *fundamental* distinction. Maybe it's actually reasonable to do that, at least for now, but I think that you should at least acknowledge the distinction as an important one.

We also use this new tqual.c infrastructure to time-travel during decoding, with the snapshot built for us by the snapshot builder:

+ /*
+ * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+ * obeys.
+ *
+ * Only usable on tuples from catalog tables!
+ *
+ * We don't need to support HEAP_MOVED_(IN|OFF) for now because we only support
+ * reading catalog pages which couldn't have been created in an older version.
+ * We don't set any hint bits in here as it seems unlikely to be beneficial as
+ * those should already be set by normal access and it seems to be too
+ * dangerous to do so as the semantics of doing so during timetravel are more
+ * complicated than when dealing "only" with the present.
+ */
+ bool
+ HeapTupleSatisfiesMVCCDuringDecoding(HeapTuple htup, Snapshot snapshot,
+                                      Buffer buffer)

Are you sure that ReorderBuffer.private_data should be a void*? Maybe we'd be better off if it was a minimal “abstract base class” pointer, that contained a MemoryContext? This whole area could use a lot more scrutiny. That's all I have for now, though.

I'm happy to note that the overhead of computing the pegged Recent(Global)Xmin is one TransactionIdIsValid, one TransactionIdPrecedes and, potentially, one assignment.

I am also pleased to see that you're invalidating system caches in a more granular fashion (for transactions that contain both DDL and DML, where we cannot rely on the usual Hot Standby approach, where sinval messages are applied for commit records). That is a subject worthy of another e-mail, though.

Decoding (“glue code”)
======================

We've seen [1] that decoding is concerned with decoding WAL records from an xlogreader.h callback into a reorder buffer.

Decoding means breaking up individual XLogRecord structs, reading them through an XLogReaderState, and storing them in a reorder buffer (reorderbuffer.c does this, and stores them as ReorderBufferChange records), while building a snapshot (which is needed in advance of adding tuples from records). It can be thought of as the small piece of glue between reorderbuffer and snapbuild that is called by XLogReader (DecodeRecordIntoReorderBuffer() is the only public function, which will be called by the WAL sender – previously, this was called by plugins directly).

An example of what belongs in decode.c is the way it ignores physical XLogRecords, because they are not of interest.

src/backend/replication/logical/decode.c | 494 ++++++++
src/backend/replication/logical/logicalfuncs.c | 224 ++++
src/backend/utils/adt/dbsize.c | 79 ++
src/include/catalog/indexing.h | 2 +
src/include/catalog/pg_proc.h | 2 +
src/include/replication/decode.h | 21 +
src/include/replication/logicalfuncs.h | 45 +
src/include/storage/itemptr.h | 3 +
src/include/utils/builtins.h | 1 +

The pg_proc accessible utility function pg_relation_by_filenode() - which you've documented - doesn't appear to be used at present (it's just a way of exposing the core infrastructure, as described under “Miscellaneous thoughts”). A new index is created on pg_class (reltablespace oid_ops, relfilenode oid_ops).

We've seen that we need a whole new infrastructure for resolving relfilenodes to relation OIDs, because only relfilenodes are available from the WAL stream, and in general the mapping isn't stable, as for example when we need to do a table rewrite. We have a new syscache for this. We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table relfilenode and combocids. I'm still not clear on how you're managing the corner cases with the relfilenode/table oid mapping that Robert spoke of previously [17]. Could you talk about that?

Reorder buffer
==============

Last time around [1], this was known as ApplyCache. It's still concerned with the management of the logical replay cache - it reassembles transactions from a stream of interspersed changes. This is what a design doc previously talked about under “4.5 - TX reassembly” [14].
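To restate the reassembly idea in code form - a toy sketch, where every name is mine rather than the patch's:

/* Toy sketch of TX reassembly; all names here are mine, not the patch's. */
#include "postgres.h"
#include "utils/hsearch.h"

typedef struct ToyReorderBuffer
{
	HTAB	   *txns;			/* in-progress transactions, keyed by xid */
} ToyReorderBuffer;

static void
toy_add_change(ToyReorderBuffer *rb, TransactionId xid, void *change)
{
	/* append the change to the per-xid list, creating the entry on demand */
}

static void
toy_commit(ToyReorderBuffer *rb, TransactionId xid)
{
	/*
	 * Only when the commit record is decoded are the buffered changes handed
	 * to the output plugin, in the order in which they were originally made;
	 * an abort simply discards them. That is how subscribers come to see
	 * committed transactions only, in commit order.
	 */
}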
src/backend/replication/logical/reorderbuffer.c | 1185 ++++++++++++++++++++
src/include/replication/reorderbuffer.h | 284 +++++

Last time around, I described spooling to disk, like a tuplestore, as a probable prerequisite to commit - I raise that now because I thought that this was the place where you'd most likely want to do that. Concerns about the crash-safety of buffered change-sets were raised too. You say this in a README:

+ * crash safety, restartability & spilling to disk
+ * consistency with the commit status of transactions
+ * only a minimal amount of synchronous work should be done inside individual
+ transactions
+
+ In our opinion those problems are restricting progress/wider distribution of
+ these class of solutions. It is our aim though that existing solutions in this
+ space - most prominently slony and londiste - can benefit from the work we are
+ doing & planning to do by incorporating at least parts of the changeset
+ generation infrastructure.

So, have I understood correctly - are you proposing that we simply outsource this to something else? I'm not sure how I feel about that, but I'd like clarity on this matter.

reorderbuffer.h should have way, way more comments for each of the structs. I want to see detailed comments, like those you see for the structs in parsenodes.h - you shouldn't have to jump to some design document to see how each struct fits within the overall design of reorder buffering.

XLog stuff (in particular, the new XLogReader)
=================================

Andres rebased on top of Heikki's XLogReader patch for the purposes of BDR, and privately identified this whole area to me as a particular concern for this review. The version that I'm reviewing here is not the version that Andres described last week, v3.0 [10], but a slight revision thereof, v3.1 [11]. See the commit message in Andres' feature branch for full details [12].

doc/src/sgml/ref/pg_xlogdump.sgml | 76 ++
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlog.c | 1084 ++++--------------
src/backend/access/transam/xlogfuncs.c | 1 +
src/backend/access/transam/xlogreader.c | 962 ++++++++++++++++
src/bin/Makefile | 2 +-
src/bin/pg_xlogdump/Makefile | 87 ++
src/bin/pg_xlogdump/compat.c | 173 +++
src/bin/pg_xlogdump/nls.mk | 4 +
src/bin/pg_xlogdump/pg_xlogdump.c | 462 ++++++++
src/bin/pg_xlogdump/pqexpbuf_strinfo.c | 76 ++
src/bin/pg_xlogdump/tables.c | 78 ++
src/include/access/heapam_xlog.h | 23 +
src/include/access/transam.h | 5 +
src/include/access/xlog.h | 3 +-
src/include/access/xlog_fn.h | 35 +
src/include/access/xlog_internal.h | 23 -
src/include/access/xlogdefs.h | 1 +
src/include/access/xlogreader.h | 159 +++

There was some controversy over the approach to implementing a “generic xlog reader” [13]. This revision of Andres' work presumably resolves that controversy, since it heavily incorporates Heikki's own work. Heikki has described the design of his original XLogReader patch [18].

pg_xlogdump is a hacker-orientated utility that has been around in various forms for quite some time (i.e. at least since the 8.3 days), concerned with reading and writing Postgres transaction logs for debugging purposes.
It has long been obvious that it would be useful to maintain it along with Postgres (there has been a tendency for xlogdump to fall behind, and only stable releases are supported), but the XLogReader-related refactoring makes adding an official xlogdump tool quite compelling (we're talking about 462 lines of wrapper code for pg_xlogdump.c, against several thousands of lines of code for the version in common use [15], which has hard-coded per-version knowledge of catalog oids and things like that). I think that some of the refactoring that Simon did to xlog.c last year [16] makes things easier here, and kind of anticipates this.

Again, with pg_xlogdump you forgot to do this:

pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
	$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)

+ install: all installdirs
+ 	$(INSTALL_PROGRAM) pg_xlogdump$(X) '$(DESTDIR)$(bindir)'/pg_xlogdump$(X)
+
+ installdirs:
+ 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+ uninstall:
+ 	rm -f $(addprefix '$(DESTDIR)$(bindir)'/, pg_xlogdump$(X))

pg_xlogdump could be considered a useful way of testing the XLogReader and decoding functionality, independent of the test_decoding plugin. It is something that I'll probably use to debug this patch over the next few weeks. Example usage:

[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
xlog record: rmgr: Heap2 , record_len: 34, tot_len: 66, tx: 1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc: new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295, combo: 4294967295
xlog record: rmgr: Heap , record_len: 175, tot_len: 207, tx: 1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc: insert: rel 1663/12933/12671; tid 7/44
xlog record: rmgr: Btree , record_len: 34, tot_len: 66, tx: 1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc: insert: rel 1663/12933/12673; tid 1/355

In another thread, Robert and Heikki remarked that pg_xlogdump ought to be in contrib, and not in src/bin. As you know, I am inclined to agree.

[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
fatal_error: requested WAL segment 012345670000000000000009 has already been removed

This error message seems a bit presumptuous to me; as it happens, there never was such a WAL segment. Saying that there was introduces the possibility of operator error.

This appears to be superfluous:

*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 18,23 ****
--- 18,24 ----
  #include "access/htup_details.h"
  #include "access/xlog.h"
+ #include "access/xlog_fn.h"
  #include "access/xlog_internal.h"

The real heavyweight here is xlogreader.c, at 962 lines. The module refactors xlog.c, moving ReadRecord and some supporting functions to xlogreader.c. Those supporting functions now operate on *generic* XLogReaderState rather than various global variables. The idea here is that the client of the API calls ReadRecord repeatedly to get each record. There is a callback of type XLogPageReadCB, which is used by the client to obtain a given page in the WAL stream. The XLogReader facility is responsible for decoding the WAL into records, but the client is responsible for supplying the physical bytes via the callback within XLogReader state. There is an error-handling callback too, added by Andres. Andres also added a new function, XLogFindNextRecord(), which is used for checking whether RecPtr is a valid XLog address for reading, and for finding the first valid address after some address when dumping records, for debugging purposes.
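In other words, the intended usage pattern appears to be roughly this (a sketch only - I have simplified the callback signature, and the API is of course still in flux between v3.0 and v3.1):

/*
 * Rough sketch of an XLogReader client; simplified approximations of the
 * patch's signatures, which are still in flux.
 */
static bool
my_read_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
			 char *readBuf, void *private_data)
{
	/* supply one page worth of physical WAL bytes in readBuf */
	return true;
}

static void
walk_records(XLogRecPtr startpoint)
{
	XLogReaderState *reader = XLogReaderAllocate(my_read_page, NULL);
	XLogRecord *record;

	/* make sure we don't start reading in the middle of a record */
	startpoint = XLogFindNextRecord(reader, startpoint);

	/* the facility decodes the records; we only supplied the raw bytes */
	while ((record = ReadRecord(reader, startpoint)) != NULL)
	{
		/* ... examine record->xl_rmid, record->xl_xid, and so on ... */
		startpoint = InvalidXLogRecPtr; /* continue from the last record */
	}
}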
Why did you move the page validation handling into XLogReader? Support was added for reading pages which are only partially valid, which seems reasonable. The callback that acts as a replacement for emode_for_corrupt_record might be a bit questionable; I'd like to have more to say on this, but I'll leave that for another day. I note that there are many mallocs in this module (see note below under “Miscellaneous thoughts”).

heapam and other executor stuff
===============================

One aspect of this patch that I feel certainly warrants another of these subsections is the changes to heapam.c and related executor changes. These are essentially changes to functions called by nodeModifyTable.c frequently, including functions like heap_hot_search_buffer, heap_insert, heap_multi_insert and heap_delete. We now have to do extra logical logging, and we need primary key values to be looked up. Files changed include:

src/backend/access/heap/heapam.c | 284 ++++-
src/backend/access/heap/pruneheap.c | 16 +-
src/backend/catalog/index.c | 76 +-
src/backend/access/rmgrdesc/heapdesc.c | 9 +
src/include/access/heapam_xlog.h | 23 +
src/include/catalog/index.h | 4 +

What of this? (I'm using the dellstore sample database, as always):

postgres=# \d+ orders
                 Table "public.orders"
*** SNIP ***
Indexes:
    "orders_pkey" PRIMARY KEY, btree (orderid)
    "ix_order_custid" btree (customerid)
*** SNIP ***
postgres=# delete from orders where orderid = 77;
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE $1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE $1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE $1 OPERATOR(pg_catalog.=) "orderid""
DELETE 1

I don't have time to figure out what this issue is right now.

Hot Standby, Replication and libpq stuff
========================================

Not forgetting existing replication infrastructure and libpq stuff affected by this patch. Files under this category that have been modified are:

src/backend/access/rmgrdesc/xlogdesc.c | 1 +
src/backend/postmaster/bgwriter.c | 35 +
src/backend/postmaster/postmaster.c | 7 +-
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 4 +-
src/backend/replication/Makefile | 2 +
src/backend/replication/walsender.c | 732 +++++++++++-
src/backend/storage/ipc/procarray.c | 23 +
src/backend/storage/ipc/standby.c | 8 +
src/backend/utils/init/postinit.c | 5 +
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/replnodes.h | 22 +
src/include/replication/walsender.h | 1 +
src/include/replication/walsender_private.h | 43 +-
src/interfaces/libpq/exports.txt | 1 +
src/interfaces/libpq/pqexpbuffer.c | 40 +
src/interfaces/libpq/pqexpbuffer.h | 5 +

I take particular interest in bgwriter.c here. You're doing this:

+ * Log a new xl_running_xacts every now and then so replication can get
+ * into a consistent state faster and clean up resources more
+ * frequently. The costs of this are relatively low, so doing it 4
+ * times a minute seems fine.

What about the power consumption of the bgwriter? I think that the way you try to interact with the existing loop logic is ill-considered. Just why is the bgwriter the compelling auxiliary process in which to do this extra work?
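For the archives, the pattern at issue amounts to something like this (my paraphrase, not the patch's actual code; I'm assuming that the existing LogStandbySnapshot() is what ultimately emits the record):

/*
 * Paraphrase of the bgwriter change, not the patch's actual code: inside the
 * bgwriter's main loop, piggyback a rate-limited xl_running_xacts record on
 * iterations that are (in theory) happening anyway.
 */
#define LOG_SNAPSHOT_INTERVAL_MS	15000	/* i.e. 4 times a minute */

static TimestampTz last_snapshot_ts = 0;

static void
LogRunningXactsIfDue(void)
{
	TimestampTz now = GetCurrentTimestamp();

	if (TimestampDifferenceExceeds(last_snapshot_ts, now,
								   LOG_SNAPSHOT_INTERVAL_MS))
	{
		LogStandbySnapshot();	/* WAL-logs an xl_running_xacts record */
		last_snapshot_ts = now;
	}
}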
Quite a lot of code has been added to walsender. This is mostly down to some new functions, responsible for initialising logical replication:

! typedef void (*WalSndSendData)(bool *);
! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
  static void InitWalSenderSlot(void);
  static void WalSndKill(int code, Datum arg);
! static void XLogSendPhysical(bool *caughtup);
! static void XLogSendLogical(bool *caughtup);
  static void IdentifySystem(void);
  static void StartReplication(StartReplicationCmd *cmd);
+ static void CheckLogicalReplicationRequirements(void);
+ static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
+ static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+ static void ComputeLogicalXmin(void);

This is mostly infrastructure for initialising and starting logical replication. Initialisation means finding a free “logical slot” from shared memory, then looping until the new magic xmin horizon for logical walsenders (stored in their “slot”) is that of the weakest link (think of it as a global xmin that is local to logical walsenders).

+ * FIXME: think about solving the race conditions in a nicer way.
+ */
+ recompute_xmin:
+ walsnd->xmin = GetOldestXmin(true, true);
+ ComputeLogicalXmin();
+ if (walsnd->xmin != GetOldestXmin(true, true))
+ 	goto recompute_xmin;

Apart from the race conditions that I'm not confident are addressed here, I think that the above could easily get stuck indefinitely in the event of contention.

Initialisation occurs due to an “INIT_LOGICAL_REPLICATION” replication command. Initialisation also means that decoding state is allocated (a snapshot reader is initialised), and we report back success or failure to the client that's using the streaming replication protocol (i.e. in our toy example, pg_receivellog).

Starting logical replication means we load the previously initialised slot, and find a snapshot reader plugin (using the “magic symbols” pattern described above, under “Plugin interface”). Why do we have to “find” a logical slot twice (both during initialisation and starting)?

Since I've already described the “peg xmin horizon” stuff under “Snapshot builder”, I won't belabour the point. I think that I have more to say about this, but not today.

Minor point: This is a terrible name for the variable in question:

+ LogicalWalSnd *walsnd;

Miscellaneous thoughts
======================

You're still using C stdlib functions like malloc, free, calloc quite a bit. My concern is that this points to a lack of thought about the memory management strategy; why are you still not using memory contexts in some places? If it's so difficult to anticipate what clients of, say, XLogReaderAllocate() want for the lifetime of their memory, then likely as not those clients should be doing their own memory allocation, and passing the allocated buffer directly. If it is obvious that the memory ought to persist indefinitely (and I think it's your contention that it is in the case of XLogReaderAllocate()), I'd just allocate it in the top memory context. Now, I am aware that there are a trivial number of backend mallocs that you can point to as precedent here, but I'm still not satisfied with your explanation for using malloc(). At the very least, you ought to be handling the case where malloc returns NULL, and you're not doing so consistently.

Memory contexts are very handy for debugging. As you know, I wrote a little Python script with GDB bindings, that walks the tree of memory contexts and prints out statistics about them using the aset.c/AllocSetStats() infrastructure. It isn't difficult to imagine that something like that could be quite useful with this patch - I'd like to be able to easily determine how many snapshot builders have been allocated from within a given backend, for example (though I see you refcount that anyway, for reasons that are not immediately apparent - just debugging?).
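Coming back to XLogReaderAllocate(), to be explicit about what I'd prefer - a sketch, with a deliberately simplified signature:

/*
 * Sketch of context-based allocation; the signature is deliberately
 * simplified relative to the patch's XLogReaderAllocate().
 */
XLogReaderState *
XLogReaderAllocate(void)
{
	/*
	 * TopMemoryContext is only appropriate if the state really should
	 * persist for the remaining lifetime of the backend.
	 */
	XLogReaderState *state = (XLogReaderState *)
		MemoryContextAllocZero(TopMemoryContext, sizeof(XLogReaderState));

	/* no explicit NULL check needed: MemoryContextAllocZero ereports on OOM */
	return state;
}

Allocations made this way also show up in the context statistics that the aforementioned GDB script walks, which malloc()'d memory never will.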
Minor gripes:

* There is no need to use a *.txt extension for README files; we don't currently use those anywhere else.

* If you only credit the PGDG and not the Berkeley guys (as you should, for the most part), there is no need to phrase the notice “Portions Copyright...”. You should just say “Copyright...”.

* You're still calling function pointer typedefs things like LogicalDecodeInitCB. As I've already pointed out, you should prefer the existing conventions (call it something like LogicalDecodeInit_hook_type).

Under this section are all modifications to files that are not separately described under some dedicated section header. I'll quickly pass remark on them. System caches were knocked around a bit:

// LocalExecuteInvalidationMessage now exposed:
src/backend/utils/cache/inval.c | 2 +-
// relcache.c has stray whitespace:
src/backend/utils/cache/relcache.c | 1 -
// New RelationMapFilenodeToOid() function:
src/backend/utils/cache/relmapper.c | 53 +
// New RELFILENODE syscache added:
src/backend/utils/cache/syscache.c | 11 +
// Headers:
src/include/storage/sinval.h | 2 +
src/include/utils/relmapper.h | 2 +
src/include/utils/syscache.h | 1 +

These are only of tangential interest to snapshot building, and so are not described separately. Essentially, just “add new syscache” boilerplate.

There's also a little documentation, covering only the pg_relation_by_filenode() utility function (this exposes RelationMapFilenodeToOid()/the RELFILENODE syscache):

doc/src/sgml/func.sgml | 23 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +

The following files were only changed due to the change in the tqual.c interfaces of HeapTupleSatisfies*():

contrib/pgrowlocks/pgrowlocks.c | 2 +-
src/backend/commands/analyze.c | 3 +-
src/backend/commands/cluster.c | 2 +-
src/backend/commands/vacuumlazy.c | 3 +-
src/backend/storage/lmgr/predicate.c | 2 +-

That's all the feedback that I have for now. I'd have liked to have gone into more detail in many cases, but I can only do so much. I always like to start off rounds of review with “this is the current state of play as I see it” type e-mails. There will be more to follow, now that I have that out of the way.
References
==========

[1] Earlier WAL decoding review:
http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com

[2] Earlier snapshot building doc review:
http://archives.postgresql.org/message-id/CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com

[3] "Rearrange storage of data in xl_running_xacts" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5c11725867ac3cb06db065f7940143114280649c

[4] "Basic binary heap implementation" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7a2fe9bd0371b819aacc97a007ec1d955237d207

[5] "Embedded list interface" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a66ee69add6e129c7674a59f8c3ba010ed4c9386

[6] "Background worker processes" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da07a1e856511dca59cbb1357616e26baa64428e

[7] Chris Browne on Slony and ordering conflicts:
http://archives.postgresql.org/message-id/CAFNqd5VY9aKZtPSEyzOTMsGAhfFHKaGNCgY0D0wZvqjC0Dtt1g@mail.gmail.com

[8] Steve Singer on Slony and transaction isolation level:
http://archives.postgresql.org/message-id/BLU0-SMTP6402AA6F3A1F850EDFA1B2DC8D0@phx.gbl

[9] Kevin Grittner on commit ordering:
http://archives.postgresql.org/message-id/20121022141701.224550@gmx.com

[10] v3.0 of the XLogReader (Andres' revision):
http://archives.postgresql.org/message-id/20121204175212.GB12055@awork2.anarazel.de

[11] v3.1 of the XLogReader (Andres' slight tweak of [10]):
http://archives.postgresql.org/message-id/20121209190532.GD4694@awork2.anarazel.de

[12] Andres' XLogReader commit:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=3ea7ec5eea2cf890c14075b559e77a25a4130efc

[13] Heikki objects to XLogReader approach, proposes alternative:
http://archives.postgresql.org/message-id/5056D3E1.3060108@vmware.com

[14] “WAL decoding, attempt #2” design documents:
http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com

[15] xlogdump satellite project:
https://github.com/snaga/xlogdump

[16] Numerous refactoring commits. Main split was commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9aceb6ab3c202a5bf00d5f00436bb6ad285fc0bf

[17] Robert on relfilenodes:
http://archives.postgresql.org/message-id/CA+TgmoZXkCo5FAbU=3JHuXXF0Op2SLhGJcVuFM3tkmcBnmhBMQ@mail.gmail.com

[18] Heikki on his XLogReader:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php
Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
Date:
On 11.12.2012 21:11, Andres Freund wrote:
> Now that I have read some of that code, I am currently unsure how the
> current implementation of this can cooperate with translation, even when
> used from the backend?

Hmm, there was a gettext() call missing from report_invalid_record. That's where the translation needs to happen. Fixed now.

- Heikki
Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> On 11.12.2012 21:11, Andres Freund wrote:
> > Now that I have read some of that code, I am currently unsure how the
> > current implementation of this can cooperate with translation, even when
> > used from the backend?
>
> Hmm, there was a gettext() call missing from report_invalid_record.
> That's where the translation needs to happen. Fixed now.

You need to call gettext_noop() in the string literals as well, unless you've added the function and argument number to the gettext trigger list in nls.mk.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Peter! Thanks for the review; you raise many noteworthy points. This is going to be a long mail...

On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
> I'm very glad that you followed my earlier recommendation of splitting
> your demo logical changeset consumer into a contrib module, in the
> spirit of contrib/spi, etc. This module, "test_decoding", represents a
> logical entry point, if you will, for the entire patch. As unwieldy as
> it may appear to be, the patch is (or at least *should be*) ultimately
> reducible to some infrastructural changes to core to facilitate this
> example logical change-set consumer.

To be fair, that point was first brought up by Robert and Kevin. But yes, it's now included. Which is totally sensible.

> Once again, because test_decoding is a kind of "entry point", it gives
> me a nice point to continually refer back to when talking about this
> patch. (Incidentally, maybe test_decoding should be called
> pg_decoding?).

I am not particularly happy with the current name; I just named it akin to test_parser/. I don't really like pg_decoding though; ISTM the pg_ prefix doesn't serve a point there, since it's not a binary or such which will lie around in some general namespace. Other suggestions?

> The regression tests pass, though this isn't all that surprising,
> since frankly the test coverage of this patch appears to be quite low.

Yes, that certainly needs to be addressed.

> I obliged you and didn't focus on concurrency
> and serializability concerns (it was sufficient to print out values/do
> some decoding in a toy function), but it's time to take a closer look
> at those now, I feel.

Agreed.

> test_decoding is a client of the logical
> change-set producing infrastructure, and there appears to be broad
> agreement that that infrastructure needs to treat such consumers in a
> way that is maximally abstract. My question is, just how abstract does
> this interface have to be, really? How well are you going to support
> the use-case of a real logical replication system?
>
> Now, maybe it's just that I haven't been paying attention (in
> particular, to the discussion surrounding [3] – though that commit
> doesn't appear to have been justified in terms of commit ordering in
> BDR at all), but I would like you to be more demonstrative of certain
> things, like:

That commit was basically just about being able to discern which xids are toplevel and which are subtransaction xids. snapbuild.c only needs to wait for toplevel xids now and doesn't care about subtransaction xids, which made the code significantly simpler.

> 1. Just what does a logical change-set consumer look like? What things
> are always true of one, and never true of one?
>
> 2. Please describe in as much detail as possible the concurrency
> issues with respect to logical replication systems. Please make
> verifiable, testable claims as to how well these issues are considered
> here, perhaps with reference to the previous remarks of subject-matter
> experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9]
> following my earlier review.

Not sure what you want to hear here, to be honest. Let me try anyway:

Transactions (and the contained changes) are guaranteed to be replayed in commit order, where the order is defined by the LSN/position in the xlog stream of the commit record [1]. That's the same ordering that Hot Standby uses.
The code achieves that order by reading the xlog records sequentially in-order and replaying the begin/changes/commit "events" every time it reads a commit record, and never at a different time. Several people in the thread you referenced seemed to agree that commit ordering is a sensible choice.

[1]: Note that there are potential visibility differences between the order in which transactions are marked as visible in WAL, in the clog and in memory (procarray), since that's not done while holding a lock over the whole period. That's an existing property with HS.

> I'm not all that impressed with where test_decoding is at right now.
> There is still essentially no documentation.

I will add comments.

> I think it's notable that you don't really touch the ReorderBufferTXN
> passed by the core system in the test_decoding plugin.

Don't think that's saying very much, except that 1) we don't pass on enough information about transactions yet (e.g. commit timestamp), and 2) the output plugin is simple.

> test_decoding and pg_receivellog
> ========================
>
> I surmised that the way that the test_decoding module is intended to
> be used is as a client of receivellog.c (*not* receivelog.c – that
> naming is *confusing*; perhaps call it receivelogiclog.c or something.

I am happy to name it any way people want. Once decided, I think it should move out of bin/pg_basebackup, and the code (which is mostly copied from streamutil.c and pg_receivexlog) should be cleaned up considerably.

> Better still, make receivexlog handle the logical case rather than
> inventing a new tool). The reason for receivellog.c existing, as you
> yourself put it, is:

I don't think they really can be merged; the differences are notable already and are going to get bigger.

> + /*
> + * We have to use postgres.h not postgres_fe.h here, because there's so much
> + * backend-only stuff in the XLOG include files we need. But we need a
> + * frontend-ish environment otherwise. Hence this ugly hack.
> + */
>
> So receivellog.c is part of a new utility called pg_receivellog, in
> much the same way as receivexlog.c is part of the existing
> pg_receivexlog utility (see commit
> b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're
> talking about these changes:

receivelog.c is old code; I didn't change anything nontrivial there. The one change is gone now since Heikki committed the xlog_internal.h/xlog_fn.h change.

> src/backend/utils/misc/guc.c | 11 +
> src/bin/pg_basebackup/Makefile | 7 +-
> src/bin/pg_basebackup/pg_basebackup.c | 4 +-
> src/bin/pg_basebackup/pg_receivellog.c | 717 ++++++++++++
> src/bin/pg_basebackup/pg_receivexlog.c | 4 +-
> src/bin/pg_basebackup/receivelog.c | 4 +-
> src/bin/pg_basebackup/streamutil.c | 3 +-
> src/bin/pg_basebackup/streamutil.h | 1 +
>
> So far, so good. Incidentally, you forgot to do this:
>
> install: all installdirs
> 	$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
> 	$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
> +	$(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

I actually didn't forget to do this, but I didn't want to install binaries that probably won't survive under the current name. That seems to have been a bad idea since Michael and you noticed it as missing ;)

> So this creates a new binary executable, pg_receivellog, which is
> described as “the pg_receivexlog equivalent for logical changes”.
> Much like pg_receivexlog, pg_receivellog issues special new replication
> protocol commands for logical replication, which account for your
> changes to the replication protocol grammar and lexer (i.e.
> walsender):
>
> src/backend/replication/repl_gram.y | 32 +-
> src/backend/replication/repl_scanner.l | 2 +
>
> You say:
>
> + /* This is is just for demonstration, don't ever use this code for anything real! */
>
> uh, why not? What is the purpose of a contrib module, if not to serve
> as a minimal example?

Stupid copy & paste error from the old example code. The code should probably grow a call to some escape functionality and more comments to serve as a good example, but otherwise it's ok.

> Evidently you expected me to see this message:
>
> + if (!walsnd)
> + {
> + 	elog(ERROR, "couldn't find free logical slot. free one or increase max_logical_slots");
> + }
>
> If I did, that might have been okay. I didn't though, presumably
> because the “walsnd” variable was wild/uninitialised.

The problem was earlier: CheckLogicalReplicationRequirements() should have checked for a reasonable max_logical_slots value, but only checked for wal_level. Fix pushed.

> So, I went and set max_logical_slots to something higher than 0, and
> restarted. pg_receivellog behaved itself this time.
>
> In one terminal:
>
> [peter@peterlaptop decode]$ tty
> /dev/pts/0
> [peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
> WARNING: Initiating logical rep
> WARNING: reached consistent point, stopping!
> WARNING: Starting logical replication

Those currently are WARNINGs to make them easier to see; they obviously need to be demoted at some point.

> One minor gripe is that output_plugin.h isn't going to pass muster
> with cpluspluscheck (private is a C++ keyword).

Fix pushed.

> Plugin interface
> ===========
>
> So test_decoding uses various types of caches and catalogs. I'm mostly
> worried about the core BDR interface that it uses, more-so than this
> other stuff. I'm talking about:

I have asked for input on the interface in a short email (http://archives.postgresql.org/message-id/20121115014250.GA5844%40awork2.anarazel.de) but didn't get responses so far. I am happy to change the interface; I just did the first thing that made sense to me. Steve Singer - who I believe played a bit with writing his own output plugin - seemed to be ok with it.

> The test_decoding module is hard-coded within pg_receivellog thusly
> (the SCONST token here could name an arbitrary module):
>
> + res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");

pg_receivellog will/should grow a --output-plugin parameter at some point.

> + /* optional */
> + apply_state->init_cb = (LogicalDecodeInitCB)
> + 	load_external_function(plugin, "pg_decode_init", false, NULL);
>
> So the idea is that the names of all functions with public linkage
> within test_decoding (their symbols) have magical significance, and
> that the core system resolves those magic symbols dynamically.
> I'm not aware of this pattern appearing anywhere else within Postgres.

There's _PG_init/fini...

> Furthermore, it seems kind of short-sighted. Have we not painted
> ourselves into a corner with regard to using multiple plugins at once?
> That doesn't seem terribly unreasonable: we might, for example, want to
> use test_decoding in production to debug a problem, while running a
> proper logical replication system and some other logical change-set
> consumer in tandem.

How does the scheme prevent you from doing that?
Simply open up another replication connection and specify a different output plugin there? Not sure how two output plugins in one process would make sense?

> Idiomatic use of “hooks” allows multiple plugins
> to be called for the same call of the authoritative hook by the core
> system, as for example when using auto_explain and pg_stat_statements
> at the same time. Why not just use hooks? It isn't obvious that you
> shouldn't be able to do this.

I considered using hooks but it seemed not to be a good fit. Let me describe my thought process:

1) We want different output formats to be available in the same server & database.
2) The wished-for plugin should be specified via the replication connection.
3) Thus shared_preload_libraries and such aren't really helpful.
4) We need to load the plugin ourselves.
5) We could simply load it and let the object's _PG_init() call something like OutputPluginInitialize(begin_callback, change_callback, commit_callback), but then we would need to handle the case where that wasn't called, and such.
6) Going the OutputPluginInitialize route didn't seem to offer any benefits, thus the hardcoded symbol names.

> The signature of the function
> pg_decode_change (imposed by the function pointer typedef
> LogicalDecodeChangeCB) assumes that everything should go through a
> passed StringInfo, but I have a hard time believing that that's a good
> idea.

I don't particularly like passing a StringInfo either, but what would you rather pass? Note that StringInfos are what's currently used in normal fe/be communication.

Doing the sending out directly in the output plugin seems to be a bad idea because:

1) We need to handle receiving replies from the receiving side, like keepalives and such; also, we need to terminate the connection if no reply has come inside wal_sender_timeout.
2) The output plugins imo shouldn't know they are sending out to a walsender; we might want to allow sending from inside a function, to disk, or anything at some point.

Does the reasoning make sense to you?

> It's like your plugin functions as a way of filtering reorder buffers.
> It's not as if the core system just passes logical change-sets off, as
> one might expect. It is actually the case that clients have to connect
> to the server in replication mode, and get their change-sets (as
> filtered by their plugin) streamed by a walsender over the wire
> protocol directly. What of making changeset subscribers generic
> abstractions?

Sorry, I cannot follow you here. What kind of architecture are you envisioning?

> Snapshot builder
> ================
>
> I still don't see why you're allocating snapshots within
> DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots
> would be better allocated alongside the ReadApplyState that is
> directly concerned with snapshots, to better encapsulate the snapshot
> stuff. Now, you do at least acknowledge this problem this time around:
>
> + /*
> + * FIXME: The existance of the snapshot builder is pretty obvious to the
> + * outside right now, that doesn't seem to be very good...
> + */

I think that comment was there in the last round as well ;)

> However, the fact is that this function:
>
> + Snapstate *
> + AllocateSnapshotBuilder(ReorderBuffer *reorder)
> + {
>
> doesn't actually do anything with the ReorderBuffer pointer that it is
> passed. So I don't see why you've put off doing this, as if it's
> something that would require a non-trivial effort.

Well, there simply are a lot of things that need a little bit of effort.
In total that's still a nontrivial amount. And I wasn't sure how much of all that needed to change due to changes in the actual snapshot building and the xlogreader swap. Turned out not too many...

[The xmin handling deserves its own mail; I'll respond to that separately.]

> I am also pleased to see that you're invalidating system caches in a
> more granular fashion (for transactions that contain both DDL and DML,
> where we cannot rely on the usual Hot Standby approach, where sinval messages
> are applied for commit records). That is a subject worthy of another
> e-mail, though.

There still are two issues worth improving with this though:

1) Clear the whole cache when entering/leaving timetravel.
2) Don't replay normal "present day" inval messages while in timetravel.
   * That may actually be able to cause errors when trying to reload the relcache...

1) seems pretty uncontroversial to me, since it should happen really infrequently and it seems to be semantically correct. I have to think some more about 2); there are some interesting things with relmap updates due to CLUSTER et al. on nailed tables...

> Decoding (“glue code”)
> ======================
>
> We've seen [1] that decoding is concerned with decoding WAL records
> from an xlogreader.h callback into a reorder buffer.
>
> Decoding means breaking up individual XLogRecord structs, reading them
> through an XLogReaderState, and storing them in a reorder buffer
> (reorderbuffer.c does this, and stores them as ReorderBufferChange
> records), while building a snapshot (which is needed in advance of
> adding tuples from records). It can be thought of as the small piece
> of glue between reorderbuffer and snapbuild that is called by
> XLogReader (DecodeRecordIntoReorderBuffer() is the only public
> function, which will be called by the WAL sender – previously, this
> was called by plugins directly).
>
> An example of what belongs in decode.c is the way it ignores physical
> XLogRecords, because they are not of interest.
>
> src/backend/replication/logical/decode.c | 494 ++++++++
> src/backend/replication/logical/logicalfuncs.c | 224 ++++
> src/backend/utils/adt/dbsize.c | 79 ++
> src/include/catalog/indexing.h | 2 +
> src/include/catalog/pg_proc.h | 2 +
> src/include/replication/decode.h | 21 +
> src/include/replication/logicalfuncs.h | 45 +
> src/include/storage/itemptr.h | 3 +
> src/include/utils/builtins.h | 1 +
>
> The pg_proc accessible utility function pg_relation_by_filenode() -
> which you've documented - doesn't appear to be used at present (it's
> just a way of exposing the core infrastructure, as described under
> “Miscellaneous thoughts”)

It's not required for anything (and I don't think it ever will be). It was handy during development of this though, and I could have used it earlier during DBA-ish work. Hm. We could use it to add a regression test for the new syscache though...

> A new index is created on pg_class
> (reltablespace oid_ops, relfilenode oid_ops).
>
> We've seen that we need a whole new infrastructure for resolving
> relfilenodes to relation OIDs, because only relfilenodes are available
> from the WAL stream, and in general the mapping isn't stable, as for
> example when we need to do a table rewrite. We have a new syscache for
> this.
>
> We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table
> relfilenode and combocids. I'm still not clear on how you're managing
> the corner cases with the relfilenode/table oid mapping that Robert spoke of
> previously [17]. Could you talk about that?

Sure.
(Found another potential bug due to this already.) Robert talks about two dangers:

1) The relfilenode => oid mapping is not unique.
2) The relfilenode => oid mapping changes over time.

1) is solved by only looking up relfilenodes by (reltablespace, oid) (which is why the syscache is over those two, thanks to Robert's observations). We can recognize shared relations via spcNode == GLOBALTABLESPACE_OID, and we can recognize nailed tables by the fact that they cannot be looked up in pg_class (there's an InvalidOid stored in the pg_class for them). Shared and nailed tables are then looked up via the new RelationMapFilenodeToOid function. As the decoding is now per-database (we don't have the other catalogs) we skip processing tuples when dbNode != MyDatabaseId. So I think 1) is handled by that?

2) is solved by the fact that the syscache now works properly time-relativized as well. That is, if you look up the (reltablespace, relfilenode) => oid mapping in the syscache, you get the correct result for the current moment in time (what's the correct term for "current" when it's only current from the POV of timetravelling?). Due to the proper cache invalidation handling, old mappings are purged correctly as well.

> Reorder buffer
> ==============
>
> Last time around [1], this was known as ApplyCache. It's still
> concerned with the management of the logical replay cache - it reassembles
> transactions from a stream of interspersed changes. This is what a
> design doc previously talked about under “4.5 - TX reassembly” [14].

Happier with the new name?

> src/backend/replication/logical/reorderbuffer.c | 1185 ++++++++++++++++++++
> src/include/replication/reorderbuffer.h | 284 +++++
>
> Last time around, I described spooling to disk, like a tuplestore, as
> a probable prerequisite to commit - I raise that now because I thought
> that this was the place where you'd most likely want to do that.
> Concerns about the crash-safety of buffered change-sets were raised
> too.

Yes, this certainly is a prerequisite to commit.

> You say this in a README:
>
> + * crash safety, restartability & spilling to disk
> + * consistency with the commit status of transactions
> + * only a minimal amount of synchronous work should be done inside individual
> + transactions
> +
> + In our opinion those problems are restricting progress/wider distribution of
> + these class of solutions. It is our aim though that existing solutions in this
> + space - most prominently slony and londiste - can benefit from the work we are
> + doing & planning to do by incorporating at least parts of the changeset
> + generation infrastructure.
>
> So, have I understood correctly - are you proposing that we simply
> outsource this to something else? I'm not sure how I feel about that,
> but I'd like clarity on this matter.

No, this needs to be implemented in the reorderbuffer. That's the next task I will work on after committing the actual snapshot export.

> reorderbuffer.h should have way, way more comments for each of the
> structs. I want to see detailed comments, like those you see for the
> structs in parsenodes.h - you shouldn't have to jump to some design
> document to see how each struct fits within the overall design of
> reorder buffering.

Will go over it. I am wondering whether it makes sense to split most of the ones in the header into private/public parts...

> XLog stuff (in particular, the new XLogReader)
> =================================
>
> There was some controversy over the approach to implementing a
> “generic xlog reader” [13].
> This revision of Andres' work presumably
> resolves that controversy, since it heavily incorporates Heikki's own
> work. Heikki has described the design of his original XLogReader patch
> [18].

I hope it's resolved. I won't believe it until any version is committed ;)

> pg_xlogdump could be considered a useful way of testing the XLogReader
> and decoding functionality, independent of the test_decoding plugin.
> It is something that I'll probably use to debug this patch over the
> next few weeks. Example usage:
>
> [peter@peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
> xlog record: rmgr: Heap2 , record_len: 34, tot_len: 66, tx: 1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc: new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295, combo: 4294967295
> xlog record: rmgr: Heap , record_len: 175, tot_len: 207, tx: 1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc: insert: rel 1663/12933/12671; tid 7/44
> xlog record: rmgr: Btree , record_len: 34, tot_len: 66, tx: 1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc: insert: rel 1663/12933/12673; tid 1/355
>
> In another thread, Robert and Heikki remarked that pg_xlogdump ought
> to be in contrib, and not in src/bin. As you know, I am inclined to
> agree.

Moved in Heikki's worktree by now.

> [peter@peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
> fatal_error: requested WAL segment 012345670000000000000009 has already been removed
>
> This error message seems a bit presumptuous to me; as it happens, there
> never was such a WAL segment. Saying that there was introduces the
> possibility of operator error.

FWIW it's the "historical" error message for that ;). I am happy to change it to something else.

> The real heavyweight here is xlogreader.c, at 962 lines. The module
> refactors xlog.c, moving ReadRecord and some supporting functions to
> xlogreader.c. Those supporting functions now operate on *generic*
> XLogReaderState rather than various global variables. The idea here is
> that the client of the API calls ReadRecord repeatedly to get each
> record.
>
> There is a callback of type XLogPageReadCB, which is used by the
> client to obtain a given page in the WAL stream. The XLogReader
> facility is responsible for decoding the WAL into records, but the
> client is responsible for supplying the physical bytes via the
> callback within XLogReader state. There is an error-handling callback
> too, added by Andres.

Gone again; solved much better by Heikki.

> Andres also added a new function,
> XLogFindNextRecord(), which is used for checking whether RecPtr is a
> valid XLog address for reading, and for finding the first valid address
> after some address when dumping records, for debugging purposes.

And that's a much-needed safety feature for logical decoding. We need to make sure LSNs specified by the user don't point into the middle of a record. That would make some ugly things possible.

> Why did you move the page validation handling into XLogReader?

Because it's needed from xlogdump and WAL decoding as well. Reimplementing it there doesn't seem to be a good idea, and neither does skimping on the checks. Any arguments against?

> heapam and other executor stuff
> ===============================
>
> One aspect of this patch that I feel certainly warrants another of
> these subsections is the changes to heapam.c and related executor
> changes.
> heapam and other executor stuff
> ===============================
>
> One aspect of this patch that I feel certainly warrants another of
> these subsections is the changes to heapam.c and related executor
> changes. These are essentially changes to functions called by
> nodeModifyTable.c frequently, including functions like
> heap_hot_search_buffer, heap_insert, heap_multi_insert and
> heap_delete. We now have to do extra logical logging, and we need
> primary key values to be looked up.

The amount of extra logging should be relatively small though - some preliminary tests seem to confirm that for me. But it certainly needs some more validation.

I think it would be sensible to add the primary/candidate key as a relcache/RelationData attribute. Do others agree?

heap_hot_search_buffer was changed in the course of the *Satisfies changes; that's not related to this part.

> Files changed include:
>
> src/backend/access/heap/heapam.c | 284 ++++-
> src/backend/access/heap/pruneheap.c | 16 +-
> src/backend/catalog/index.c | 76 +-
> src/backend/access/rmgrdesc/heapdesc.c | 9 +
> src/include/access/heapam_xlog.h | 23 +
> src/include/catalog/index.h | 4 +
>
> What of this? (I'm using the dellstore sample database, as always):
>
> WARNING: Could not find primary key for table with oid 16406
> CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
> $1 OPERATOR(pg_catalog.=) "orderid""
> DELETE 1
>
> I don't have time to figure out what this issue is right now.

It's just a development debugging message that should go away in the near future. There's no primary key on orderlines, so we currently cannot safely replicate DELETEs. It's recognizable from the record that that's the case, so we should be able to handle it "safely" during decoding, that is, we can print a warning there.

> Hot Standby, Replication and libpq stuff
> ========================================
>
> I take particular interest in bgwriter.c here. You're doing this:
>
> + * Log a new xl_running_xacts every now and then so replication can get
> + * into a consistent state faster and clean up resources more
> + * frequently. The costs of this are relatively low, so doing it 4
> + * times a minute seems fine.
>
> What about the power consumption of the bgwriter?

I think we are not doing any additional wakeups due to this; the complete sleeping logic is unaffected. The maximum sleep duration currently is "BgWriterDelay * HIBERNATE_FACTOR", which is lower than the interval in which we log new snapshots. So I don't think this should make a measurable difference?

> I think that the way you try to interact with the existing loop logic is
> ill-considered. Just why is the bgwriter the compelling auxiliary
> process in which to do this extra work?

Which process would be a good idea otherwise? Bgwriter seemed best suited to me, but I am certainly open to reconsideration. It really was a process of elimination, and I don't really see a downside. Moving that code somewhere else should be no problem, so I am open to suggestions?
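For illustration, the piggy-backing under discussion amounts to something like the following sketch, called once per iteration of the bgwriter's existing loop so that no wakeups are added. The function name and the interval constant are assumptions (15s matches "4 times a minute"), as is LogStandbySnapshot() returning the LSN it wrote:

    #include "postgres.h"

    #include "access/xlog.h"
    #include "storage/standby.h"
    #include "utils/timestamp.h"

    #define LOG_SNAPSHOT_INTERVAL_MS 15000

    /* Sketch: log xl_running_xacts now and then, without extra wakeups. */
    static void
    maybe_log_running_xacts(void)
    {
        static TimestampTz last_log_ts = 0;
        static XLogRecPtr  last_log_lsn = InvalidXLogRecPtr;
        TimestampTz now;

        if (!XLogStandbyInfoActive() || RecoveryInProgress())
            return;

        now = GetCurrentTimestamp();

        /* Skip if we logged recently, or if nothing hit WAL since last time. */
        if (!TimestampDifferenceExceeds(last_log_ts, now,
                                        LOG_SNAPSHOT_INTERVAL_MS) ||
            last_log_lsn == GetXLogInsertRecPtr())
            return;

        last_log_lsn = LogStandbySnapshot();
        last_log_ts = now;
    }

Because the check only runs when the loop is awake anyway, the hibernation behaviour Andres mentions is untouched.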
> Quite a lot of code has been added to walsender. This is mostly down
> to some new functions, responsible for initialising logical
> replication:
>
> ! typedef void (*WalSndSendData)(bool *);
> ! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
>   static void InitWalSenderSlot(void);
>   static void WalSndKill(int code, Datum arg);
> ! static void XLogSendPhysical(bool *caughtup);
> ! static void XLogSendLogical(bool *caughtup);
>   static void IdentifySystem(void);
>   static void StartReplication(StartReplicationCmd *cmd);
> + static void CheckLogicalReplicationRequirements(void);
> + static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
> + static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
> + static void ComputeLogicalXmin(void);
>
> This is mostly infrastructure for initialising and starting logical replication.
>
> Initialisation means finding a free “logical slot” from shared memory,
> then looping until the new magic xmin horizon for logical walsenders
> (stored in their “slot”) is that of the weakest link (think local
> global xmin).
>
> + * FIXME: think about solving the race conditions in a nicer way.
> + */
> + recompute_xmin:
> + walsnd->xmin = GetOldestXmin(true, true);
> + ComputeLogicalXmin();
> + if (walsnd->xmin != GetOldestXmin(true, true))
> +     goto recompute_xmin;
>
> Apart from the race conditions that I'm not confident are addressed
> here, I think that the above could easily get stuck indefinitely in
> the event of contention.

I don't like that part the slightest bit, but I don't think it's actually in danger of looping forever. In fact I think it's so broken it won't ever loop ;). (ComputeLogicalXmin() will set the current global minimum, which will then be returned by GetOldestXmin().)

I would like to solve this properly without copying GetOldestXmin once more (so we can compute and set the logical xmin while holding ProcArrayLock), but I am not yet sure what a nice way to do that would look like.

I guess GetOldestXminNoLock? That gets called while we already hold the procarray lock? Yuck. If we have it we should also use it for hot standby feedback.

> Initialisation occurs due to a “INIT_LOGICAL_REPLICATION” replication
> command. Initialisation also means that decoding state is allocated (a
> snapshot reader is initialised), and we report back success or failure
> to the client that's using the streaming replication protocol (i.e. in
> our toy example, pg_receivellog).
>
> Starting logical replication means we load the previously initialised
> slot, and find a snapshot reader plugin (using the “magic symbols”
> pattern described above, under “Plugin interface”).
>
> Why do we have to “find” a logical slot twice (both during
> initialisation and starting)?

Because they can happen in totally different walsenders, even after a restart. Finding a consistent point to start decoding from can take some time (basically you need to wait for any old snapshots to finish); you don't want to do that every time you disconnect, as you would lose updates in between. So what you do is INIT_LOGICAL_REPLICATION *once* when you set up a new replica, and then you only do START_LOGICAL_REPLICATION 'slot-id' 'position' afterwards.

Obviously that needs some work, since we're not yet persisting enough between restarts... As I said above, that's what I am working on next.

> Minor point: This is a terrible name for the variable in question:
>
> + LogicalWalSnd *walsnd;

Why? As long as the struct is called "LogicalWalSnd" it seems accurate enough. I think LogicalWalSnds should emancipate themselves and be separate from walsender; when that has happened it's obviously a bad name.
> Miscellaneous thoughts
> ======================
>
> You're still using C stdlib functions like malloc, free, calloc quite
> a bit. My concern is that this points to a lack of thought about the
> memory management strategy; why are you still not using memory
> contexts in some places? If it's so difficult to anticipate what
> clients of, say, XLogReaderAllocate() want for the lifetime of their
> memory, then likely as not those clients should be doing their own
> memory allocation, and passing the allocated buffer directly. If it is
> obvious that the memory ought to persist indefinitely (and I think
> it's your contention that it is in the case of XLogReaderAllocate()),
> I'd just allocate it in the top memory context. Now, I am aware that
> there are a trivial number of backend mallocs that you can point to as
> precedent here, but I'm still not satisfied with your explanation for
> using malloc(). At the very least, you ought to be handling the case
> where malloc returns NULL, and you're not doing so consistently.

There are different categories here. XLogReader *has* to use malloc instead of the pg infrastructure, since it needs to be usable by xlogdump, which doesn't have the pg memory infrastructure. I would like reorderbuffer.c to stay usable outside the backend as well (primarily for a printing tool for the spooled changes). In contrast, the use-case for snapbuild.c outside the backend is pretty slim, so it should probably grow its own memory context and use that.

> Minor gripes:
>
> * There is no need to use a *.txt extension for README files; we don't
>   currently use those anywhere else.

It makes it easier for me to have a generic rule to transcode them into a different format; that's why they have it...

> * If you only credit the PGDG and not the Berkeley guys (as you
>   should, for the most part), there is no need to phrase the notice
>   “Portions Copyright...”. You should just say “Copyright...”.

Ok.

> * You're still calling function pointer typedefs things like
>   LogicalDecodeInitCB. As I've already pointed out, you should prefer
>   the existing conventions (call it something like
>   LogicalDecodeInit_hook_type).

I still think CB is much better than _hook_type, because it's not a hook in this case, it's a callback. A hook intercepts normal operation; that's not what happens here. Note that we already use *CallBack in various places. If you prefer the longer form, ok, I can do that.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
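A sketch of the snapbuild.c direction mentioned above, with a dedicated memory context instead of raw malloc(); the context name, parent, and helper are illustrative, not the patch's code:

    #include "postgres.h"
    #include "utils/memutils.h"

    static MemoryContext SnapBuildContext = NULL;

    /* Sketch: give the snapshot builder its own allocation-set context. */
    static void
    snapbuild_init_context(void)
    {
        SnapBuildContext = AllocSetContextCreate(TopMemoryContext,
                                                 "snapshot builder",
                                                 ALLOCSET_DEFAULT_MINSIZE,
                                                 ALLOCSET_DEFAULT_INITSIZE,
                                                 ALLOCSET_DEFAULT_MAXSIZE);
    }

    /*
     * Allocations then go through the context and can be released wholesale:
     *     ptr = MemoryContextAlloc(SnapBuildContext, size);
     *     MemoryContextDelete(SnapBuildContext);
     * Unlike malloc(), these ereport() on out-of-memory, so no NULL checks
     * are needed at the call sites.
     */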
On 2012-12-13 18:29:00 +0100, Andres Freund wrote:
> On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
> > Initialisation means finding a free “logical slot” from shared memory,
> > then looping until the new magic xmin horizon for logical walsenders
> > (stored in their “slot”) is that of the weakest link (think local
> > global xmin).
> >
> > + * FIXME: think about solving the race conditions in a nicer way.
> > + */
> > + recompute_xmin:
> > + walsnd->xmin = GetOldestXmin(true, true);
> > + ComputeLogicalXmin();
> > + if (walsnd->xmin != GetOldestXmin(true, true))
> > +     goto recompute_xmin;
> >
> > Apart from the race conditions that I'm not confident are addressed
> > here, I think that the above could easily get stuck indefinitely in
> > the event of contention.
>
> I don't like that part the slightest bit, but I don't think it's actually
> in danger of looping forever. In fact I think it's so broken it won't
> ever loop ;). (ComputeLogicalXmin() will set the current global minimum,
> which will then be returned by GetOldestXmin().)
>
> I would like to solve this properly without copying GetOldestXmin once
> more (so we can compute and set the logical xmin while holding
> ProcArrayLock), but I am not yet sure what a nice way to do that would
> look like.
>
> I guess GetOldestXminNoLock? That gets called while we already hold the
> procarray lock? Yuck.

Does anybody have an opinion on the attached patches? Especially 0001, which contains the procarray changes?

It moves a computation of the sort of:

    result -= vacuum_defer_cleanup_age;
    if (!TransactionIdIsNormal(result))
        result = FirstNormalTransactionId;

inside ProcArrayLock. But I can't really imagine that to be relevant...

Another alternative to this would be to get a snapshot with GetSnapshotData(), copy the xmin to the logical slot, then call ProcArrayEndTransaction(). But that doesn't really seem to be nicer to me.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It moves a computation of the sort of:
>
>     result -= vacuum_defer_cleanup_age;
>     if (!TransactionIdIsNormal(result))
>         result = FirstNormalTransactionId;
>
> inside ProcArrayLock. But I can't really imagine that to be relevant...

I can. Go look at some of the 9.2 optimizations around GetSnapshotData(). Those made a BIG difference under heavy concurrency, and they were definitely micro-optimizations. For example, the introduction of NormalTransactionIdPrecedes() was shockingly effective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > It moves a computation of the sort of:
> >
> >     result -= vacuum_defer_cleanup_age;
> >     if (!TransactionIdIsNormal(result))
> >         result = FirstNormalTransactionId;
> >
> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>
> I can. Go look at some of the 9.2 optimizations around
> GetSnapshotData(). Those made a BIG difference under heavy
> concurrency, and they were definitely micro-optimizations. For example,
> the introduction of NormalTransactionIdPrecedes() was shockingly
> effective.

But GetOldestXmin() should be called less frequently than GetSnapshotData() by several orders of magnitude. I don't really see it being used in any really hot code paths?

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 13 December 2012 22:37, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
>> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > It moves a computation of the sort of:
>> >
>> >     result -= vacuum_defer_cleanup_age;
>> >     if (!TransactionIdIsNormal(result))
>> >         result = FirstNormalTransactionId;
>> >
>> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>>
>> I can. Go look at some of the 9.2 optimizations around
>> GetSnapshotData(). Those made a BIG difference under heavy
>> concurrency, and they were definitely micro-optimizations. For example,
>> the introduction of NormalTransactionIdPrecedes() was shockingly
>> effective.
>
> But GetOldestXmin() should be called less frequently than
> GetSnapshotData() by several orders of magnitude. I don't really see
> it being used in any really hot code paths?

Maybe, but that calculation doesn't *need* to be inside the lock; that is just a consequence of the current coding.

--
Simon Riggs                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 14, 2012 at 2:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > It moves a computation of the sort of:
> >
> >     result -= vacuum_defer_cleanup_age;
> >     if (!TransactionIdIsNormal(result))
> >         result = FirstNormalTransactionId;
> >
> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>
> I can. Go look at some of the 9.2 optimizations around
> GetSnapshotData(). Those made a BIG difference under heavy
> concurrency, and they were definitely micro-optimizations. For example,
> the introduction of NormalTransactionIdPrecedes() was shockingly
> effective.

The two commits coming to my mind are:
- ed0b409 (Separate PGPROC into PGPROC and PGXACT)
- 0d76b60 (introduction of NormalTransactionIdPrecedes)

Those ones really improved concurrency performance.

--
Michael Paquier
http://michael.otacoo.com
On 2012-12-13 23:35:00 +0000, Simon Riggs wrote:
> On 13 December 2012 22:37, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
> >> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > It moves a computation of the sort of:
> >> >
> >> >     result -= vacuum_defer_cleanup_age;
> >> >     if (!TransactionIdIsNormal(result))
> >> >         result = FirstNormalTransactionId;
> >> >
> >> > inside ProcArrayLock. But I can't really imagine that to be relevant...
> >>
> >> I can. Go look at some of the 9.2 optimizations around
> >> GetSnapshotData(). Those made a BIG difference under heavy
> >> concurrency, and they were definitely micro-optimizations. For example,
> >> the introduction of NormalTransactionIdPrecedes() was shockingly
> >> effective.
> >
> > But GetOldestXmin() should be called less frequently than
> > GetSnapshotData() by several orders of magnitude. I don't really see
> > it being used in any really hot code paths?
>
> Maybe, but that calculation doesn't *need* to be inside the lock; that
> is just a consequence of the current coding.

I am open to suggestions for how to do that in a way where a) we can hold the lock already (to safely nail the global xmin to the current value) and b) we don't duplicate all the code.

Just moving that tidbit inside the lock seems to be the pragmatic choice. GetOldestXmin is called

* once per checkpoint
* once per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages

None of that occurs frequently enough that 5 instructions will make a difference. I would be happy to go an alternative path, but right now I don't see any nice one. An "already_locked" parameter to GetOldestXmin seems to be a cure worse than the disease.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Just moving that tidbit inside the lock seems to be the pragmatic
> choice. GetOldestXmin is called
>
> * once per checkpoint
> * once per index build
> * once in analyze
> * twice per vacuum
> * once for HS feedback messages
>
> None of that occurs frequently enough that 5 instructions will make a
> difference. I would be happy to go an alternative path, but right now I
> don't see any nice one. An "already_locked" parameter to GetOldestXmin
> seems to be a cure worse than the disease.

I'm not sure that would be so bad, but I guess I question the need to do it this way at all. Most of the time, if you need to advertise your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and I guess I'm not seeing why that wouldn't also work here. Am I dumb?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Just moving that tidbit inside the lock seems to be the pragmatic
> > choice. GetOldestXmin is called
> >
> > * once per checkpoint
> > * once per index build
> > * once in analyze
> > * twice per vacuum
> > * once for HS feedback messages
> >
> > None of that occurs frequently enough that 5 instructions will make a
> > difference. I would be happy to go an alternative path, but right now I
> > don't see any nice one. An "already_locked" parameter to GetOldestXmin
> > seems to be a cure worse than the disease.
>
> I'm not sure that would be so bad, but I guess I question the need to
> do it this way at all. Most of the time, if you need to advertise
> your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
> I guess I'm not seeing why that wouldn't also work here. Am I dumb?

I wondered upthread whether that would be better:

On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
> Another alternative to this would be to get a snapshot with
> GetSnapshotData(), copy the xmin to the logical slot, then call
> ProcArrayEndTransaction(). But that doesn't really seem to be nicer
> to me.

Not sure why I considered it ugly anymore, but it actually has a noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData, as the latter sets a fairly new xid as xmin, whereas GetOldestXmin returns the actual current xmin horizon. That's preferable because it allows us to start up more quickly: snapbuild.c can only start building a snapshot once it has seen an xl_running_xact with oldestRunningXid >= own_xmin. Otherwise we cannot be sure that no relevant catalog tuples have been removed.

This also made me notice that my changes to GetSnapshotData were quite pessimal... I set the xmin of the new snapshot to the "logical xmin" instead of doing that only to globalxmin/RecentGlobalXmin.

Andres

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-15 01:19:26 +0100, Andres Freund wrote:
> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
> > I'm not sure that would be so bad, but I guess I question the need to
> > do it this way at all. Most of the time, if you need to advertise
> > your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
> > I guess I'm not seeing why that wouldn't also work here. Am I dumb?
>
> I wondered upthread whether that would be better:
>
> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
> > Another alternative to this would be to get a snapshot with
> > GetSnapshotData(), copy the xmin to the logical slot, then call
> > ProcArrayEndTransaction(). But that doesn't really seem to be nicer
> > to me.
>
> Not sure why I considered it ugly anymore, but it actually has a
> noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData,
> as the latter sets a fairly new xid as xmin, whereas GetOldestXmin
> returns the actual current xmin horizon. That's preferable because it
> allows us to start up more quickly: snapbuild.c can only start building
> a snapshot once it has seen an xl_running_xact with oldestRunningXid >=
> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
> have been removed.

Hm. One way that could work with fewer changes is to exploit the fact that a) it seems to be possible to acquire a shared lwlock twice in the same backend and b) both GetOldestXmin & GetSnapshotData acquire only a shared lwlock. Are we willing to guarantee that recursive acquisition of shared lwlocks continues to work?

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:
> Does anybody have an opinion on the attached patches? Especially 0001,
> which contains the procarray changes?
>
> It moves a computation of the sort of:
>
>     result -= vacuum_defer_cleanup_age;
>     if (!TransactionIdIsNormal(result))
>         result = FirstNormalTransactionId;
>
> inside ProcArrayLock. But I can't really imagine that to be relevant...

I don't see why this is hard.

Just make the lock acquisition/release conditional on another parameter.

That way the only thing you'll be moving inside the lock is an if test on a constant boolean.

--
Simon Riggs                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
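Concretely, Simon's suggestion amounts to something like the following sketch. This is not the committed signature; "already_locked" is the name floated earlier in the thread, and the proc array scan (plus the allDbs/ignoreVacuum handling) is elided:

    #include "postgres.h"

    #include "access/transam.h"
    #include "storage/lwlock.h"
    #include "storage/standby.h"

    /* Sketch: conditional locking variant of GetOldestXmin. */
    TransactionId
    GetOldestXmin(bool allDbs, bool ignoreVacuum, bool already_locked)
    {
        TransactionId result;

        if (!already_locked)
            LWLockAcquire(ProcArrayLock, LW_SHARED);

        result = ShmemVariableCache->latestCompletedXid;
        TransactionIdAdvance(result);

        /* ... clamp result against each backend's xid and xmin ... */

        if (!already_locked)
            LWLockRelease(ProcArrayLock);

        /* The contested five instructions: cheap either way. */
        result -= vacuum_defer_cleanup_age;
        if (!TransactionIdIsNormal(result))
            result = FirstNormalTransactionId;

        return result;
    }

With already_locked = true the caller holds ProcArrayLock across the whole call, so the xmin it computes can be published to the logical slot before any other backend can advance the horizon.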
On 2012-12-16 16:44:04 +0000, Simon Riggs wrote:
> On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:
> > Does anybody have an opinion on the attached patches? Especially 0001,
> > which contains the procarray changes?
> >
> > It moves a computation of the sort of:
> >
> >     result -= vacuum_defer_cleanup_age;
> >     if (!TransactionIdIsNormal(result))
> >         result = FirstNormalTransactionId;
> >
> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>
> I don't see why this is hard.
>
> Just make the lock acquisition/release conditional on another parameter.
>
> That way the only thing you'll be moving inside the lock is an if test
> on a constant boolean.

That's not really cheaper. Two branches + an additional parameter passed/pushed vs. one branch, one subtraction, and two assignments is a close call.

As I don't think either really matters in the GetOldestXmin case, I would be happy with that as well. If people prefer an additional parameter + adjusting the few callsites over a separate function, I will go that way.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
>> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Just moving that tidbit inside the lock seems to be the pragmatic
>> > choice. GetOldestXmin is called
>> >
>> > * once per checkpoint
>> > * once per index build
>> > * once in analyze
>> > * twice per vacuum
>> > * once for HS feedback messages
>> >
>> > None of that occurs frequently enough that 5 instructions will make a
>> > difference. I would be happy to go an alternative path, but right now I
>> > don't see any nice one. An "already_locked" parameter to GetOldestXmin
>> > seems to be a cure worse than the disease.
>>
>> I'm not sure that would be so bad, but I guess I question the need to
>> do it this way at all. Most of the time, if you need to advertise
>> your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
>> I guess I'm not seeing why that wouldn't also work here. Am I dumb?
>
> I wondered upthread whether that would be better:
>
> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
>> Another alternative to this would be to get a snapshot with
>> GetSnapshotData(), copy the xmin to the logical slot, then call
>> ProcArrayEndTransaction(). But that doesn't really seem to be nicer
>> to me.
>
> Not sure why I considered it ugly anymore, but it actually has a
> noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData,
> as the latter sets a fairly new xid as xmin, whereas GetOldestXmin
> returns the actual current xmin horizon. That's preferable because it
> allows us to start up more quickly: snapbuild.c can only start building
> a snapshot once it has seen an xl_running_xact with oldestRunningXid >=
> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
> have been removed.

I'm a bit confused. Are you talking about the difference between RecentGlobalXmin and RecentXmin? I think GetSnapshotData() updates both.

Anyway, if there's no nicer way, I think it's probably OK to add a parameter to GetOldestXmin(). It seems like kind of a hack, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

Robert Haas <robertmhaas@gmail.com> schrieb:
> On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
>>> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> > Just moving that tidbit inside the lock seems to be the pragmatic
>>> > choice. GetOldestXmin is called
>>> >
>>> > * once per checkpoint
>>> > * once per index build
>>> > * once in analyze
>>> > * twice per vacuum
>>> > * once for HS feedback messages
>>> >
>>> > None of that occurs frequently enough that 5 instructions will make a
>>> > difference. I would be happy to go an alternative path, but right now I
>>> > don't see any nice one. An "already_locked" parameter to GetOldestXmin
>>> > seems to be a cure worse than the disease.
>>>
>>> I'm not sure that would be so bad, but I guess I question the need to
>>> do it this way at all. Most of the time, if you need to advertise
>>> your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
>>> I guess I'm not seeing why that wouldn't also work here. Am I dumb?
>>
>> I wondered upthread whether that would be better:
>>
>> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
>>> Another alternative to this would be to get a snapshot with
>>> GetSnapshotData(), copy the xmin to the logical slot, then call
>>> ProcArrayEndTransaction(). But that doesn't really seem to be nicer
>>> to me.
>>
>> Not sure why I considered it ugly anymore, but it actually has a
>> noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData,
>> as the latter sets a fairly new xid as xmin, whereas GetOldestXmin
>> returns the actual current xmin horizon. That's preferable because it
>> allows us to start up more quickly: snapbuild.c can only start building
>> a snapshot once it has seen an xl_running_xact with oldestRunningXid >=
>> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
>> have been removed.
>
> I'm a bit confused. Are you talking about the difference between
> RecentGlobalXmin and RecentXmin? I think GetSnapshotData() updates
> both.

The problem is that at the time GetSnapshotData returns, the xmin horizon might have gone upwards and tuples required for decoding might get removed by other backends. That needs to be prevented while holding the procarray lock exclusively.

Does it make more sense now?

Andres

---
Please excuse the brevity and formatting - I am writing this on my mobile phone.
On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de <andres@anarazel.de> wrote:
> The problem is that at the time GetSnapshotData returns, the xmin horizon
> might have gone upwards and tuples required for decoding might get removed
> by other backends. That needs to be prevented while holding the procarray
> lock exclusively.

Well, for the ordinary use of GetSnapshotData(), that doesn't matter, because GetSnapshotData() also updates proc->xmin. If you're trying to store a different value in that field then of course it matters.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
> On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de <andres@anarazel.de> wrote:
> > The problem is that at the time GetSnapshotData returns, the xmin horizon
> > might have gone upwards and tuples required for decoding might get removed
> > by other backends. That needs to be prevented while holding the procarray
> > lock exclusively.
>
> Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
> because GetSnapshotData() also updates proc->xmin. If you're trying
> to store a different value in that field then of course it matters.

Absolutely right. I don't want to say there's anything wrong with it right now. The "problem" for me is that it sets proc->xmin to the newest value it can, while I want/need the oldest valid value...

I will go with adding an already_locked parameter to GetOldestXmin then.

Thanks for the input,

Andres

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 18, 2012 at 7:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
>> On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de <andres@anarazel.de> wrote:
>> > The problem is that at the time GetSnapshotData returns, the xmin horizon
>> > might have gone upwards and tuples required for decoding might get removed
>> > by other backends. That needs to be prevented while holding the procarray
>> > lock exclusively.
>>
>> Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
>> because GetSnapshotData() also updates proc->xmin. If you're trying
>> to store a different value in that field then of course it matters.
>
> Absolutely right. I don't want to say there's anything wrong with it
> right now. The "problem" for me is that it sets proc->xmin to the newest
> value it can, while I want/need the oldest valid value...
>
> I will go with adding an already_locked parameter to GetOldestXmin then.

Or instead of bool already_locked, maybe bool advertise_xmin? Seems like that might be more friendly to the abstraction boundaries.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Noah Misch
[Catching up on old threads.]

On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>> PostgreSQL-R
>> No. There must be some way to logically identify the tuple.
> It can be done as selecting on _all_ attributes and updating/deleting
> just the first matching row
>
> create cursor ...
> select from t ... where t.* = (....)
> fetch one ...
> delete where current of ...
>
> This is on the distant (round 3 or 4) roadmap for this work, just was
> interested if you had found any better way of doing this :)

That only works if every attribute's type has a notion of equality ("xml" does not). The equality operator may have a name other than "=", and an operator named "=" may exist with semantics other than equality ("box" is affected). Code attempting this replication strategy should select an equality operator the way typcache.c does.
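For illustration, resolving equality the way Noah suggests looks roughly like this C sketch; the helper name is hypothetical, but lookup_type_cache() and TYPECACHE_EQ_OPR are the real typcache interface:

    #include "postgres.h"

    #include "utils/builtins.h"
    #include "utils/typcache.h"

    /* Sketch: find the type's real equality operator, or fail cleanly. */
    static Oid
    equality_operator_for_type(Oid typid)
    {
        TypeCacheEntry *typentry = lookup_type_cache(typid, TYPECACHE_EQ_OPR);

        if (!OidIsValid(typentry->eq_opr))
            ereport(ERROR,
                    (errcode(ERRCODE_UNDEFINED_FUNCTION),
                     errmsg("could not identify an equality operator for type %s",
                            format_type_be(typid))));

        return typentry->eq_opr;
    }

This deliberately errors out for types like xml that have no equality notion, instead of silently matching the wrong rows.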
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Hannu Krosing
On 01/13/2013 12:28 AM, Noah Misch wrote:
> [Catching up on old threads.]
>
> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>>> PostgreSQL-R
>>> No. There must be some way to logically identify the tuple.
>> It can be done as selecting on _all_ attributes and updating/deleting
>> just the first matching row
>>
>> create cursor ...
>> select from t ... where t.* = (....)
>> fetch one ...
>> delete where current of ...
>>
>> This is on the distant (round 3 or 4) roadmap for this work, just was
>> interested if you had found any better way of doing this :)
>
> That only works if every attribute's type has a notion of equality ("xml"
> does not). The equality operator may have a name other than "=", and an
> operator named "=" may exist with semantics other than equality ("box" is
> affected). Code attempting this replication strategy should select an
> equality operator the way typcache.c does.

Does this hint that PostgreSQL also needs a sameness operator ("is" or "===" in some languages)?

Or does "IS NOT DISTINCT FROM" already work even for types without a comparison operator?

--------------
Hannu
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Hannu Krosing
On 01/13/2013 10:49 AM, Hannu Krosing wrote:
> On 01/13/2013 12:28 AM, Noah Misch wrote:
>> That only works if every attribute's type has a notion of equality ("xml"
>> does not). The equality operator may have a name other than "=", and an
>> operator named "=" may exist with semantics other than equality ("box" is
>> affected). Code attempting this replication strategy should select an
>> equality operator the way typcache.c does.
> Does this hint that PostgreSQL also needs a sameness operator
> ("is" or "===" in some languages)?
>
> Or does "IS NOT DISTINCT FROM" already work even for types without a
> comparison operator?

Just checked - it does not; it still looks for an "=" operator, so it is just equality-with-nulls.

How do people feel about adding a real sameness operator?

Hannu
Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Tom Lane
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> How do people feel about adding a real sameness operator?

Just begs the question of "what's sameness?"

In many places we consider a datatype's default btree equality operator to define sameness, but not all types provide a btree opclass (in particular, anything that hasn't got a sensible one-dimensional sort order will not). And some do but it doesn't represent anything that anyone would want to consider "sameness" --- IIRC, some of the geometric types provide btree opclasses that sort by area. Even for apparently simple types like float8 there are interesting questions like whether minus zero is the same as plus zero.

The messiness here is not just due to lack of a notation.

            regards, tom lane
Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Markus Wanner
On 01/13/2013 12:30 PM, Hannu Krosing wrote:
> On 01/13/2013 10:49 AM, Hannu Krosing wrote:
>> Does this hint that PostgreSQL also needs a sameness operator
>> ("is" or "===" in some languages)?
>
> How do people feel about adding a real sameness operator?

We'd need to define what "sameness" means. If this goes toward "exact match in binary representation", this gets a thumbs-up from me.

As a first step in that direction, I'd see adjusting send() and recv() functions to use a portable binary format. A "sameness" operator could then be implemented by simply comparing two values' send() outputs.

Regards

Markus Wanner
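A minimal C sketch of what comparing send() images could look like; the helper name and the decision to treat two NULLs as "the same" are assumptions (the thread below debates exactly that point):

    #include "postgres.h"

    #include "fmgr.h"
    #include "utils/lsyscache.h"

    /* Sketch: "sameness" as an exact match of the two values' send() images. */
    static bool
    datum_image_same(Datum a, bool a_isnull, Datum b, bool b_isnull, Oid typid)
    {
        Oid     typsend;
        bool    typisvarlena;
        bytea  *img_a;
        bytea  *img_b;

        /* Assumption: two NULLs are "the same"; a NULL never matches a value. */
        if (a_isnull || b_isnull)
            return a_isnull && b_isnull;

        getTypeBinaryOutputInfo(typid, &typsend, &typisvarlena);

        img_a = OidSendFunctionCall(typsend, a);
        img_b = OidSendFunctionCall(typsend, b);

        return VARSIZE_ANY_EXHDR(img_a) == VARSIZE_ANY_EXHDR(img_b) &&
               memcmp(VARDATA_ANY(img_a), VARDATA_ANY(img_b),
                      VARSIZE_ANY_EXHDR(img_a)) == 0;
    }

A real version would want to cache the send-function lookup per column rather than repeating the catalog lookup for every comparison.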
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Andres Freund
On 2013-01-13 12:44:44 -0500, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > How do people feel about adding a real sameness operator?
>
> Just begs the question of "what's sameness?"
>
> In many places we consider a datatype's default btree equality operator
> to define sameness, but not all types provide a btree opclass (in
> particular, anything that hasn't got a sensible one-dimensional sort
> order will not). And some do but it doesn't represent anything that
> anyone would want to consider "sameness" --- IIRC, some of the geometric
> types provide btree opclasses that sort by area. Even for apparently
> simple types like float8 there are interesting questions like whether
> minus zero is the same as plus zero.
>
> The messiness here is not just due to lack of a notation.

FWIW, *I* don't plan to support that case for now (others might); it just seems to be too messy for far too little benefit.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Hannu Krosing
On 01/13/2013 06:02 PM, Markus Wanner wrote:
> On 01/13/2013 12:30 PM, Hannu Krosing wrote:
>> On 01/13/2013 10:49 AM, Hannu Krosing wrote:
>>> Does this hint that PostgreSQL also needs a sameness operator
>>> ("is" or "===" in some languages)?
>> How do people feel about adding a real sameness operator?
> We'd need to define what "sameness" means. If this goes toward "exact
> match in binary representation", this gets a thumbs-up from me.
>
> As a first step in that direction, I'd see adjusting send() and recv()
> functions to use a portable binary format. A "sameness" operator could
> then be implemented by simply comparing two values' send() outputs.

This seems like a good definition of "sameness" to me - if the binary images are bitwise the same, then the values are the same. And if both are fields of the same type and both are NULL, then these are also "the same".

Defining a cross-platform binary format is also a good direction of movement in implementing this. I'd just start with what send() and recv() on each type produce now using GCC on 64-bit Intel, and move towards adjusting the others to match. For a period anything else would still be allowed, but be "non-standard".

I have no strong opinion on typed NULLs, though I'd like them to also be "the same" for the sake of simplicity. As this would be non-standard anyway, I'd make a row of all NULLs NOT "be the same" as NULL. This would be much easier to explain than losing the "IS NULL"-ness at nesting level 3 ;)

Hannu
Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Dimitri Fontaine
Hannu Krosing <hannu@2ndQuadrant.com> writes:
>> Does this hint that PostgreSQL also needs a sameness operator
>> ("is" or "===" in some languages)?
>
> How do people feel about adding a real sameness operator?

Well, I would prefer it if we can bypass the need for it, then. Do we need the full range of eq, eql, equal and equalp predicates, and would all of them allow overriding, or just some?

http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node74.html

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Hannu Krosing
On 01/13/2013 12:28 AM, Noah Misch wrote:
> [Catching up on old threads.]
>
> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>>> PostgreSQL-R
>>> No. There must be some way to logically identify the tuple.
>> It can be done as selecting on _all_ attributes and updating/deleting
>> just the first matching row
>>
>> create cursor ...
>> select from t ... where t.* = (....)
>> fetch one ...
>> delete where current of ...
>>
>> This is on the distant (round 3 or 4) roadmap for this work, just was
>> interested if you had found any better way of doing this :)
>
> That only works if every attribute's type has a notion of equality ("xml"
> does not). The equality operator may have a name other than "=", and an
> operator named "=" may exist with semantics other than equality ("box" is
> affected). Code attempting this replication strategy should select an
> equality operator the way typcache.c does.

A method for making this work as PostgreSQL works now would be to compare "textual representations" of tuples:

create cursor ...
select from t ... where t::text = '(<image of original row>)'
fetch one ...
delete where current of ...

But of course having an operator for "sameness" without needing to convert to text would be better.

----------------
Hannu
Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Hannu Krosing
On 01/13/2013 08:06 PM, Dimitri Fontaine wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
>>> Does this hint that PostgreSQL also needs a sameness operator
>>> ("is" or "===" in some languages)?
>> How do people feel about adding a real sameness operator?
> Well, I would prefer it if we can bypass the need for it.

What is actually sufficient for the current problem is a sameness which compares the outputs of the type output functions and also considers NULLs to be the same.

The reason for not providing equality for xml was not that two xml documents which compare equal as text could be considered unequal in any sense, but that there are other textual representations of the same xml which could also be considered equal, like different whitespace between tag and attribute.

> Then, do we need the full range of eq, eql, equal and equalp predicates,
> and would all of them allow overriding, or just some?

I consider sameness as basic a thing as IS NULL, so the sameness should not be overridable.

Extending IS NOT DISTINCT FROM to do this comparison instead of the current '=' seems reasonable. That is,

SELECT '<tag/>'::xml IS DISTINCT FROM '<tag />'::xml;

should return TRUE as long as the internal representations of the two differ, even after you add an equality operator to xml which compares some canonical form of xml and thus would make

SELECT '<tag/>'::xml = '<tag />'::xml;

be TRUE.

Regards,
Hannu
Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format
From: Markus Wanner
On 01/13/2013 09:04 PM, Hannu Krosing wrote:
> I'd just start with what send() and recv() on each type produce
> now using GCC on 64-bit Intel, and move towards adjusting the others
> to match. For a period anything else would still be allowed, but
> be "non-standard".

Intel being little endian seems like a bad choice to me, given that send/recv kind of implies network byte ordering. I'd rather not tie this to any particular processor architecture at all (at least not solely on the ground that it's the most common one at the time).

I have no strong opinion on the "sameness" of NULLs and could also imagine it throwing some kind of invalid-operation error, on the grounds that neither is a value and it's unclear which send() method to use at all.

FWIW, trying to determine the length of a sent NULL gives an interesting result that I don't currently understand:

    psql (9.2.2)
    Type "help" for help.

    postgres=# SELECT length(int4send(NULL));
     length
    --------

    (1 row)

    postgres=# SELECT length(float4send(NULL));
     length
    --------

    (1 row)

    postgres=# SELECT length(textsend(NULL));
     length
    --------

    (1 row)

    postgres=# SELECT length(textsend(NULL) || '\000'::bytea);
     length
    --------

    (1 row)

Regards

Markus Wanner