Thread: logical changeset generation v3

logical changeset generation v3

From
andres@anarazel.de (Andres Freund)
Date:
Hi,

In response to this you will soon find the 14 patches that currently
implement $subject. I'll go over each one after showing off for a bit:

Start a postgres instance (with pg_hba.conf allowing replication connections):

$ postgres -D ~/tmp/pgdev-lcr \
    -c wal_level=logical \
    -c max_wal_senders=10 \
    -c max_logical_slots=10 \
    -c wal_keep_segments=100 \
    -c log_line_prefix="[%p %x] "

Start the changelog receiver:
$ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v

Generate changes:
$ psql -h /tmp postgres <<EOF

DROP TABLE IF EXISTS replication_example;

CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

-- plain insert
INSERT INTO replication_example(somedata, text) VALUES (1, 1);

-- plain update
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));

-- plain delete
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));

-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
COMMIT;

-- don't write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
ROLLBACK;

-- add a column
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1);
COMMIT;

-- once more outside
INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1);


-- DDL with table rewrite
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (5, 1);
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (5, 2);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 3);
COMMIT;

EOF

And the results printed by llog:

BEGIN 16556826
COMMIT 16556826
BEGIN 16556827
table "replication_example_id_seq": INSERT: sequence_name[name]:replication_example_id_seq last_value[int8]:1
start_value[int8]:1increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1
log_cnt[int8]:0is_cycled[bool]:f is_called[bool]:f
 
COMMIT 16556827
BEGIN 16556828
table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
COMMIT 16556828
BEGIN 16556829
table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
COMMIT 16556829
BEGIN 16556830
table "replication_example": DELETE (pkey): id[int4]:1
COMMIT 16556830
BEGIN 16556833
table "replication_example": INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1
table "replication_example": INSERT: id[int4]:5 somedata[int4]:3 text[varchar]:1 bar[int4]:1
COMMIT 16556833
BEGIN 16556834
table "replication_example": INSERT: id[int4]:6 somedata[int4]:4 text[varchar]:1 bar[int4]:1
COMMIT 16556834
BEGIN 16556835
table "replication_example": INSERT: id[int4]:7 somedata[int4]:5 text[varchar]:1 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:8 somedata[int4]:5 somenum[varchar]:2 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:9 somedata[int4]:5 somenum[int4]:3 bar[int4]:(null)
COMMIT 16556835

As you can see above, we can decode WAL in the presence of nearly all
forms of DDL. The plugin that produced these changes is supposed to be
added to contrib and is fairly small and uncomplicated.
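To give a rough idea of the shape of such a plugin, here is a minimal
sketch of a textual "change" callback in C. Everything except
appendStringInfo()/appendStringInfoString() and RelationGetRelationName()
is a hypothetical simplification; the real interface lives in patch 11's
output_plugin.h:

/* Hypothetical sketch of an output plugin's change callback; the enum
 * and parameter names are invented for illustration only. */
static void
my_decode_change(void *plugin_private, StringInfo out,
                 Relation relation, MyChangeKind kind)
{
    appendStringInfo(out, "table \"%s\": ",
                     RelationGetRelationName(relation));

    switch (kind)
    {
        case MY_CHANGE_INSERT:
            appendStringInfoString(out, "INSERT: ");
            /* append the new tuple's column names, types and values */
            break;
        case MY_CHANGE_UPDATE:
        case MY_CHANGE_DELETE:
            /* analogous, using the old and/or new tuple version */
            break;
    }
}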

An interesting piece of information might be that in the very
preliminary benchmarking I have done, even the textual decoding
could keep up with a full-tilt pgbench -c16 -j16 -M prepared on my
(somewhat larger) workstation. The WAL space overhead was less than 1%
between two freshly initdb'ed clusters, comparing
wal_level=hot_standby with =logical.
With a custom pgbench script I can saturate the decoding to the effect
that it lags a second or so, but once I write out the data in a binary
format it can keep up again.
The biggest overhead is currently the more slowly increasing
Global/RecentXmin, but that can be greatly improved by logging
xl_running_xacts more often than just at every checkpoint.


A short overview over the patches in this series:

* Add minimal binary heap implementation
Abhijit submitted a nicer version of this, the plan is to rebase on top
of that once people are happy with the interface.
(unchanged)

* Add support for a generic wal reading facility dubbed XLogReader
There's some discussion about what's the best way to implement this in a
separate CF topic.
(unchanged)

* Add simple xlogdump tool
Very nice for debugging, couldn't have developed this without it. Obviously
not a prerequisite for committing this feature but still pretty worthy.
(quite a bit updated, still bad build infrastructure)

* Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
Relatively simple, somewhat contentious due to some uniqueness
issues. Would very much welcome input from somebody with syscache
experience on this. It was previously suggested to write something like
attoptcache.c for this, but to me that seems to be code-duplication. We
can go that route though.
(unchanged)

* Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
Simple. I don't even think it's contentious... Just wasn't needed before.
(unchanged)

* Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
Just a nice to have thing for debugging, not a prerequisite for the
feature.
(unchanged)

* Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
Uncomplicated and I hope uncontentious.
(new)

* Store the number of subtransactions in xl_running_xacts separately from toplevel xids
Increases the size of xl_running_xacts by 4 bytes in the worst case,
decreases it in some others. Improves the efficiency of some HS
operations.
Should be ok?
(new)

* Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
Not sure if people will complain about this? It's rather simple due to
the fact that the HeapTupleSatisfiesVisibility wrapper already took a
HeapTupleHeader as parameter.
(new)

* Allow walsenders to connect to a specific database
This has been requested by others. I think we need to work on the
external interface a bit, should be ok otherwise.
(new)

* Introduce wal decoding via catalog timetravel
This is the meat of the feature. I think this is going in a good
direction, still needs some work, but architectural review can really
start now. (more later)
(heavily changed)

* Add a simple decoding module in contrib named 'test_decoding'
The much requested example contrib module.
(new)

* Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
Debugging tool to receive changes and write them to a file. Needs some
more options and probably shouldn't live inside pg_basebackup's
directory.
(new)

* design document v2.3 and snapshot building design doc v0.2
(unchanged)


There remains quite a bit to be done but I think the state of the patch
has improved quite a bit. The biggest thing now is to get input about
the user-facing parts so we can get some agreement there.

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside.
* more frequent xl_running_xid's so xmin can be upped more frequently

Please comment!

Happy and tired,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



[PATCH 01/14] Add minimal binary heap implementation

From
Andres Freund
Date:
Will be replaced by the "binaryheap.[ch]" from Abhijit once it's been reviewed.
---
 src/backend/lib/Makefile     |   3 +-
 src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/simpleheap.h |  91 +++++++++++++++
 3 files changed, 348 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/simpleheap.c
 create mode 100644 src/include/lib/simpleheap.h


Attachment
[PATCH 04/14] Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)

From
Andres Freund
Date:
This cache is somewhat problematic because formally indexes used by syscaches
need to be unique, and this one is not. This is "just" because of 0/InvalidOids
stored in pg_class.relfilenode for nailed/shared catalog relations.  The
syscache will never be queried for InvalidOid relfilenodes, however, so it seems
to be safe even if it violates the rules somewhat.

It might be nicer to add infrastructure to do this properly, like using a
partial index; it's not clear what the best way to do this is though.

Needs a CATVERSION bump.
---
 src/backend/utils/cache/syscache.c | 11 +++++++++++
 src/include/catalog/indexing.h     |  2 ++
 src/include/catalog/pg_proc.h      |  1 +
 src/include/utils/syscache.h       |  1 +
 4 files changed, 15 insertions(+)
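For illustration, a lookup against the new cache would look roughly like
this (assuming the syscache id is named RELFILENODE, per the patch title):

HeapTuple   tup;

tup = SearchSysCache2(RELFILENODE,
                      ObjectIdGetDatum(reltablespace),
                      ObjectIdGetDatum(relfilenode));
if (HeapTupleIsValid(tup))
{
    Oid     reloid = HeapTupleGetOid(tup);  /* the pg_class row's oid */

    /* ... use reloid ... */
    ReleaseSysCache(tup);
}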


Attachment
[PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From
Andres Freund
Date:
---
 src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
 src/include/catalog/indexing.h      |  4 +--
 src/include/utils/relmapper.h       |  2 ++
 3 files changed, 57 insertions(+), 2 deletions(-)


Attachment

[PATCH 03/14] Add simple xlogdump tool

From
Andres Freund
Date:
---
 src/bin/Makefile            |   2 +-
 src/bin/xlogdump/Makefile   |  25 +++
 src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 494 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/xlogdump/Makefile
 create mode 100644 src/bin/xlogdump/xlogdump.c


Attachment
[PATCH 06/14] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs

From
Andres Freund
Date:
This requires the previously added RELFILENODE syscache and the
RelationMapFilenodeToOid function added in previous commits.
---
 doc/src/sgml/func.sgml         | 23 +++++++++++-
 src/backend/utils/adt/dbsize.c | 79 ++++++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 4 files changed, 104 insertions(+), 1 deletion(-)
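Conceptually the lookup is two-staged. A sketch, not the actual patch
code - it assumes RelationMapFilenodeToOid mirrors the (oid, shared)
signature of RelationMapOidToFilenode, and the fallback helper name is
invented:

/*
 * Mapped relations store 0/InvalidOid in pg_class.relfilenode, so consult
 * the relmapper first and fall back to the RELFILENODE syscache for
 * everything else.
 */
Oid
relation_oid_by_filenode(Oid reltablespace, Oid relfilenode)
{
    Oid     result;

    result = RelationMapFilenodeToOid(relfilenode,
                                      reltablespace == GLOBALTABLESPACE_OID);
    if (OidIsValid(result))
        return result;

    /* hypothetical helper wrapping the SearchSysCache2() lookup */
    return LookupRelationOidByFilenode(reltablespace, relfilenode);
}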


Attachment
[PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

From
Andres Freund
Date:
To avoid complicating logic we store both the toplevel and the subxids in
->xip: first the ->xcnt toplevel ones, and then the ->subxcnt subxids.
Also skip logging any subxids if the snapshot is suboverflowed; they aren't
useful in that case anyway.

This makes some operations cheaper and it allows faster startup for
the future logical decoding feature because that doesn't care about
subtransactions/suboverflow'edness.
---
 src/backend/access/transam/xlog.c   |  2 ++
 src/backend/storage/ipc/procarray.c | 65 ++++++++++++++++++++++++-------------
 src/backend/storage/ipc/standby.c   |  8 +++--
 src/include/storage/standby.h       |  2 ++
 4 files changed, 52 insertions(+), 25 deletions(-)
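Illustrated as a comment, the layout of the xids array then is:

/*
 * xids[0]    ... xids[xcnt - 1]            toplevel xids
 * xids[xcnt] ... xids[xcnt + subxcnt - 1]  subtransaction xids, omitted
 *                                          entirely when the snapshot is
 *                                          suboverflowed
 */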


Attachment

[PATCH 11/14] Introduce wal decoding via catalog timetravel

From
Andres Freund
Date:
This introduces several things:
* 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
* 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
* logging more data into wal to facilitate logical decoding
* wal decoding into a reorderbuffer
* shared library output plugins with 5 callbacks
 * init
 * begin
 * change
 * commit
* walsender infrastructure to stream out changes and to keep the global xmin low enough
 * INIT_LOGICAL_REPLICATION $plugin; waits till a consistent snapshot is built and returns
   * initial LSN
   * replication slot identifier
   * id of a pg_export() style snapshot
 * START_LOGICAL_REPLICATION $id $lsn; streams out changes
 * uses named output plugins for output specification

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning
  transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after
  INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the
  outside.
* more frequent xl_running_xid's so xmin can be upped more frequently
* add STOP_LOGICAL_REPLICATION $id
---
 src/backend/access/heap/heapam.c                |  280 +++++-
 src/backend/access/transam/xlog.c               |    1 +
 src/backend/catalog/index.c                     |   74 ++
 src/backend/replication/Makefile                |    2 +
 src/backend/replication/logical/Makefile        |   19 +
 src/backend/replication/logical/decode.c        |  496 ++++++++++
 src/backend/replication/logical/logicalfuncs.c  |  247 +++++
 src/backend/replication/logical/reorderbuffer.c | 1156 +++++++++++++++++++++++
 src/backend/replication/logical/snapbuild.c     | 1144 ++++++++++++++++++++++
 src/backend/replication/repl_gram.y             |   32 +-
 src/backend/replication/repl_scanner.l          |    2 +
 src/backend/replication/walsender.c             |  566 ++++++++++-
 src/backend/storage/ipc/procarray.c             |   23 +
 src/backend/storage/ipc/standby.c               |    8 +-
 src/backend/utils/cache/inval.c                 |    2 +-
 src/backend/utils/cache/relcache.c              |    3 +-
 src/backend/utils/misc/guc.c                    |   11 +
 src/backend/utils/time/tqual.c                  |  249 +++++
 src/bin/pg_controldata/pg_controldata.c         |    2 +
 src/include/access/heapam_xlog.h                |   23 +
 src/include/access/transam.h                    |    5 +
 src/include/access/xlog.h                       |    3 +-
 src/include/catalog/index.h                     |    4 +
 src/include/nodes/nodes.h                       |    2 +
 src/include/nodes/replnodes.h                   |   22 +
 src/include/replication/decode.h                |   21 +
 src/include/replication/logicalfuncs.h          |   44 +
 src/include/replication/output_plugin.h         |   76 ++
 src/include/replication/reorderbuffer.h         |  284 ++++++
 src/include/replication/snapbuild.h             |  128 +++
 src/include/replication/walsender.h             |    1 +
 src/include/replication/walsender_private.h     |   34 +-
 src/include/storage/itemptr.h                   |    3 +
 src/include/storage/sinval.h                    |    2 +
 src/include/utils/tqual.h                       |   31 +-
 35 files changed, 4966 insertions(+), 34 deletions(-)
 create mode 100644 src/backend/replication/logical/Makefile
 create mode 100644 src/backend/replication/logical/decode.c
 create mode 100644 src/backend/replication/logical/logicalfuncs.c
 create mode 100644 src/backend/replication/logical/reorderbuffer.c
 create mode 100644 src/backend/replication/logical/snapbuild.c
 create mode 100644 src/include/replication/decode.h
 create mode 100644 src/include/replication/logicalfuncs.h
 create mode 100644 src/include/replication/output_plugin.h
 create mode 100644 src/include/replication/reorderbuffer.h
 create mode 100644 src/include/replication/snapbuild.h


Attachment

[PATCH 14/14] design document v2.3 and snapshot building design doc v0.2

From
Andres Freund
Date:
---
 src/backend/replication/logical/DESIGN.txt         | 603 +++++++++++++++++++++
 src/backend/replication/logical/Makefile           |   6 +
 .../replication/logical/README.SNAPBUILD.txt       | 298 ++++++++++
 3 files changed, 907 insertions(+)
 create mode 100644 src/backend/replication/logical/DESIGN.txt
 create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt


Attachment
[PATCH 07/14] Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

From
Andres Freund
Date:
This is useful to be able to represent a CommandId that's invalid. There was no
such value before.

This decreases the possible number of CommandIds by one, which seems
unproblematic. It's also not a problem for pg_upgrade because cmin/cmax are
never looked at outside the context of their own transaction (save for timetravel
access, but that's new anyway).
---
 src/backend/access/transam/xact.c | 4 ++--
 src/include/c.h                   | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)
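A sketch of the corresponding definitions in c.h (FirstCommandId already
exists; the exact spelling of the new define in the patch may differ):

typedef uint32 CommandId;

#define FirstCommandId      ((CommandId) 0)
#define InvalidCommandId    (~(CommandId) 0)    /* new: all-ones is now
                                                 * reserved as invalid */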


Attachment

[PATCH 10/14] Allow walsenders to connect to a specific database

From
Andres Freund
Date:
Currently the decision whether to connect to a database or not is made by
checking whether the passed "dbname" parameter is "replication". Unfortunately
this makes it impossible to connect to a database named replication...

This is useful for future walsender commands which need database interaction.
---
 src/backend/postmaster/postmaster.c                |  7 ++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++--
 src/backend/replication/walsender.c                | 27 ++++++++++++++++++----
 src/backend/utils/init/postinit.c                  |  5 ++++
 src/bin/pg_basebackup/pg_basebackup.c              |  4 ++--
 src/bin/pg_basebackup/pg_receivexlog.c             |  4 ++--
 src/bin/pg_basebackup/receivelog.c                 |  4 ++--
 7 files changed, 41 insertions(+), 14 deletions(-)
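As a purely illustrative example of what this could end up looking like
from libpq - the exact conninfo syntax is an assumption, not what the
patch implements:

#include "libpq-fe.h"

/* Hypothetical: request the replication protocol *and* a database
 * connection, instead of overloading dbname=replication. */
PGconn *conn = PQconnectdb("host=/tmp dbname=postgres replication=database");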


Attachment
[PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader

From
Andres Freund
Date:
For the regular satisfies routines this is needed in preparation for logical
decoding. I changed the non-regular ones for consistency as well.

The naming between htup, tuple and similar is rather confused; I could not find
any consistent naming anywhere.

This is preparatory work for the logical decoding feature, which needs to be
able to get to a valid relfilenode when checking the visibility of a
tuple.
---
 contrib/pgrowlocks/pgrowlocks.c      |  2 +-
 src/backend/access/heap/heapam.c     | 13 ++++++----
 src/backend/access/heap/pruneheap.c  | 16 ++++++++++--
 src/backend/catalog/index.c          |  2 +-
 src/backend/commands/analyze.c       |  3 ++-
 src/backend/commands/cluster.c       |  2 +-
 src/backend/commands/vacuumlazy.c    |  3 ++-
 src/backend/storage/lmgr/predicate.c |  2 +-
 src/backend/utils/time/tqual.c       | 50 +++++++++++++++++++++++++++++-------
 src/include/utils/snapshot.h         |  4 +--
 src/include/utils/tqual.h            | 20 +++++++--------
 11 files changed, 83 insertions(+), 34 deletions(-)
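The shape of the change, sketched on one of the routines (treat the
parameter lists as illustrative):

/* before */
bool HeapTupleSatisfiesMVCC(HeapTupleHeader tuple,
                            Snapshot snapshot, Buffer buffer);

/* after: the full HeapTuple is passed, so the routines can also look at
 * t_self and t_tableOid rather than just the tuple header */
bool HeapTupleSatisfiesMVCC(HeapTuple htup,
                            Snapshot snapshot, Buffer buffer);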


Attachment
[PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes

From
Andres Freund
Date:
---
 src/bin/pg_basebackup/Makefile         |   7 +-
 src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
 src/bin/pg_basebackup/streamutil.c     |   3 +-
 src/bin/pg_basebackup/streamutil.h     |   1 +
 4 files changed, 725 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_basebackup/pg_receivellog.c


Attachment

[PATCH 12/14] Add a simple decoding module in contrib named 'test_decoding'

From
Andres Freund
Date:
---
 contrib/Makefile                      |   1 +
 contrib/test_decoding/Makefile        |  16 +++
 contrib/test_decoding/test_decoding.c | 192 ++++++++++++++++++++++++++++++++++
 3 files changed, 209 insertions(+)
 create mode 100644 contrib/test_decoding/Makefile
 create mode 100644 contrib/test_decoding/test_decoding.c


Attachment

[PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Andres Freund
Date:
Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
 src/backend/access/transam/Makefile     |    2 +-
 src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  264 ++++++++
 3 files changed, 1297 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h


Attachment

Re: logical changeset generation v3 - git repository

From
Andres Freund
Date:
On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
> In response to this you will soon find the 14 patches that currently
> implement $subject.

As it's rather unwieldy to send around that many/big patches all the
time, until the next "major" version I will just update the git tree at:

Web:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3

Git:
git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: lcr - walsender integration

From
Andres Freund
Date:
Hi,

The current logical walsender integration looks like the following:

=# INIT_LOGICAL_REPLICATION 'text';
WARNING:  Initiating logical rep
WARNING:  reached consistent point, stopping!
 replication_id | consistent_point | snapshot_name | plugin
----------------+------------------+---------------+--------
 id-2           | 3/CACBDF98       | 0xDEADBEEF    | text
(1 row)

=# START_LOGICAL_REPLICATION 'id-2' 3/CACBDF98;
...


So the current protocol is:
INIT_LOGICAL_REPLICATION '$plugin';
returns
* slot
* first consistent point
* snapshot id

START_LOGICAL_REPLICATION '$slot' $last_received_lsn;

streams changes, each wrapped in a 'w' message with (start, end) set to
the same value. The content of the data is completely free-format and
only depends on the output plugin.

Feedback is provided from the client via the normal 'r' messages.

I think that's not a bad start, but we can probably improve it a bit:

INIT_LOGICAL_REPLICATION '$slot' '$plugin' ($key = $value, ...);
START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
STOP_LOGICAL_REPLICATION '$slot';

The option to INIT_LOGICAL_REPLICATION would then get passed to the
'pg_decode_init' output plugin function (i.e. a function of that name
would get dlsym()'ed using the pg infrastructure for that).
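For illustration, the init callback could then look roughly like this;
only the dlsym()'ed name pg_decode_init is from the proposal, the
signature is invented:

/* Called once when a decoding session is initialized; the
 * ($key = $value, ...) options from INIT_LOGICAL_REPLICATION would be
 * handed through as a list. */
void
pg_decode_init(void **plugin_private, List *options)
{
    /* parse options, allocate plugin state into *plugin_private */
}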

Does that look good to you? Any suggestions?

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Josh Berkus
Date:
On 11/14/12 4:27 PM, Andres Freund wrote:
> Hi,
> 
> In response to this you will soon find the 14 patches that currently
> implement $subject. I'll go over each one after showing off for a bit:

Lemme be the first to say, "wow".  Impressive work.

Now the debugging starts ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: logical changeset generation v3

From
Michael Paquier
Date:
Looks like cool stuff @-@
I might be interested in looking at that a bit as I think I will hopefully
be able to grab some time in the next couple of weeks.
Are some of those patches already submitted to a CF?

--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Andres Freund
Date:
Hi,

On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:
> Looks like cool stuff @-@
> I might be interested in looking at that a bit as I think I will hopefully
> be able to grab some time in the next couple of weeks.
> Are some of those patches already submitted to a CF?

I added the patchset as one entry to the CF this time, it seems to me they are 
too hard to judge individually to make them really separately reviewable.

I can split it off there, but really all the complicated stuff is in one patch 
anyway...

Greetings,

Andres



Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

From
Simon Riggs
Date:
On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:

> To avoid complicating logic we store both the toplevel and the subxids in
> ->xip: first the ->xcnt toplevel ones, and then the ->subxcnt subxids.

That looks good, not much change. Will apply in next few days. Please
add me as committer and mark ready.

> Also skip logging any subxids if the snapshot is suboverflowed; they aren't
> useful in that case anyway.

> This makes some operations cheaper and it allows faster startup for
> the future logical decoding feature because that doesn't care about
> subtransactions/suboverflow'edness.

...but please don't add extra touches of Andres magic along the way.
Doing that will just slow down patch acceptance and it's not important.
I suggest to keep note of things like that and come back to them
later.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

From
Andres Freund
Date:
On 2012-11-15 09:07:23 -0300, Simon Riggs wrote:
> On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > To avoid complicating logic we store both the toplevel and the subxids in
> > ->xip: first the ->xcnt toplevel ones, and then the ->subxcnt subxids.
>
> That looks good, not much change. Will apply in next few days. Please
> add me as committer and mark ready.

Cool. Will do.

> > Also skip logging any subxids if the snapshot is suboverflowed; they aren't
> > useful in that case anyway.
>
> > This makes some operations cheaper and it allows faster startup for
> > the future logical decoding feature because that doesn't care about
> > subtransactions/suboverflow'edness.
>
> ...but please don't add extra touches of Andres magic along the way.
> Doing that will just slow down patch acceptance and it's not important.
> I suggest to keep note of things like that and come back to them
> later.

Which magic are you talking about?

Only two parts changed in comparison to the previous situation. One is
that the following in ProcArrayApplyRecoveryInfo only applies to
toplevel transactions by virtue of ->xcnt now only containing the
toplevel transaction count:
> +       /*
> +        * Remove stale locks, if any.
> +        *
> +        * Locks are always assigned to the toplevel xid so we don't need to
> +        * care about subxcnt/subxids (and by extension not about
> +        * ->suboverflowed).
> +        */
>         StandbyReleaseOldLocks(running->xcnt, running->xids);

Note that there was no code change, just a change in meaning.

The other part is:
> +       /*
> +        * Spin over procArray collecting all subxids, but only if there hasn't
> +        * been a suboverflow.
> +        */
> +       if (!suboverflowed)

Well, that's something that basically had to be decided either way when
writing the patch...

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
On 15.11.2012 03:17, Andres Freund wrote:
>
> Features:
> - streaming reading/writing
> - filtering
> - reassembly of records
>
> Reusing the ReadRecord infrastructure in situations where the code that wants
> to do so is not tightly integrated into xlog.c is rather hard and would require
> changes to rather integral parts of the recovery code which doesn't seem to be
> a good idea.
>
> Missing:
> - "compressing" the stream when removing uninteresting records
> - writing out correct CRCs
> - separating reader/writer

I'm disappointed to see that there has been no progress on this patch 
since last commitfest. I thought we agreed on the approach I championed 
for here: 
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There 
wasn't much work left to finish that, I believe.

Are you going to continue working on this?

- Heikki



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Andres Freund
Date:
On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> >Features:
> >- streaming reading/writing
> >- filtering
> >- reassembly of records
> >
> >Reusing the ReadRecord infrastructure in situations where the code that wants
> >to do so is not tightly integrated into xlog.c is rather hard and would require
> >changes to rather integral parts of the recovery code which doesn't seem to be
> >a good idea.
> >
> >Missing:
> >- "compressing" the stream when removing uninteresting records
> >- writing out correct CRCs
> >- separating reader/writer
>
> I'm disappointed to see that there has been no progress on this patch since
> last commitfest. I thought we agreed on the approach I championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
> wasn't much work left to finish that, I believe.

While I still think my approach is superior, at this point I have
accepted that I haven't convinced anybody of that. I plan to port over
what I have submitted to Alvaro's version of your patch.

I have actually started that but I simply couldn't finish it in
time. The approach for porting I took didn't work all that well and I
plan to restart doing that after doing some review work.

> Are you going to continue working on this?

"this" being my version of XlogReader? No. The patch above is unchanged
except some very minor rebasing to recent wal changes by Tom. The reason
its included in the series is simply that I haven't gotten rid of it yet
and the subsequent patches needed it. I do plan to continue working on a
rebased xlogdump version if nobody beats me to it (please do beat me!).

Ok?

The cover letter said:

> * Add support for a generic wal reading facility dubbed XLogReader
> There's some discussion about what's the best way to implement this in a
> separate CF topic.
> (unchanged)

I should have folded that into the patch description, sorry.

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> >Features:
> >- streaming reading/writing
> >- filtering
> >- reassembly of records
> >
> >Reusing the ReadRecord infrastructure in situations where the code that wants
> >to do so is not tightly integrated into xlog.c is rather hard and would require
> >changes to rather integral parts of the recovery code which doesn't seem to be
> >a good idea.
> >
> >Missing:
> >- "compressing" the stream when removing uninteresting records
> >- writing out correct CRCs
> >- separating reader/writer
>
> I'm disappointed to see that there has been no progress on this
> patch since last commitfest. I thought we agreed on the approach I
> championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> There wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

I worked a bit more on that patch of yours, but I neglected to submit
it.  Did you have something in particular that you wanted changed in it?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Andres Freund
Date:
On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> > On 15.11.2012 03:17, Andres Freund wrote:
> > >
> > >Features:
> > >- streaming reading/writing
> > >- filtering
> > >- reassembly of records
> > >
> > >Reusing the ReadRecord infrastructure in situations where the code that wants
> > >to do so is not tightly integrated into xlog.c is rather hard and would require
> > >changes to rather integral parts of the recovery code which doesn't seem to be
> > >a good idea.
> > >
> > >Missing:
> > >- "compressing" the stream when removing uninteresting records
> > >- writing out correct CRCs
> > >- separating reader/writer
> >
> > I'm disappointed to see that there has been no progress on this
> > patch since last commitfest. I thought we agreed on the approach I
> > championed for here:
> > http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> > There wasn't much work left to finish that, I believe.
> >
> > Are you going to continue working on this?
>
> I worked a bit more on that patch of yours, but I neglected to submit
> it.  Did you have something in particular that you wanted changed in it?

Could you push your newest version to your git repository or similar?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:

> > I worked a bit more on that patch of yours, but I neglected to submit
> > it.  Did you have something in particular that you wanted changed in it?
>
> Could you push your newest version to your git repository or similar?

Sadly, I cannot, because I had it on my laptop only and its screen died
this morning (well, actually it doesn't boot at all, so I can't use the
external screen either).  I'm trying to get it fixed as soon as possible
but obviously I have no idea when I will be able to get it back.  Most
likely I will have to go out and buy a 2.5" drive enclosure to get the
valuable stuff out of it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
On 15.11.2012 16:50, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 15.11.2012 03:17, Andres Freund wrote:
>>>
>>> Features:
>>> - streaming reading/writing
>>> - filtering
>>> - reassembly of records
>>>
>>> Reusing the ReadRecord infrastructure in situations where the code that wants
>>> to do so is not tightly integrated into xlog.c is rather hard and would require
>>> changes to rather integral parts of the recovery code which doesn't seem to be
>>> a good idea.
>>>
>>> Missing:
>>> - "compressing" the stream when removing uninteresting records
>>> - writing out correct CRCs
>>> - separating reader/writer
>>
>> I'm disappointed to see that there has been no progress on this
>> patch since last commitfest. I thought we agreed on the approach I
>> championed for here:
>> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
>> There wasn't much work left to finish that, I believe.
>>
>> Are you going to continue working on this?
>
> I worked a bit more on that patch of yours, but I neglected to submit
> it.  Did you have something in particular that you wanted changed in it?

Off the top of my head, there were two open items with the patch as I 
submitted it:

1. Need to make sure it's easy to compile outside backend code. So that 
it's suitable for using in an xlogdump contrib module, for example.

2. do something about error reporting. In particular, xlogreader.c 
should not call emode_for_corrupt_record(), but we need to provide for 
that functionality somehow. I think I'd prefer xlogreader.c to not 
ereport() on a corrupt record. Instead, it would return an error string 
to the caller, which could then decide what to do with it. Translating 
the messages needs some thought, though.

Other than those, and cleanup of any obsoleted comments etc. and adding 
docs, I think it was good to go.
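Such an interface could look roughly like the following sketch (function
name and signature invented here, just to make the idea concrete):

/* Returns the record on success; on a corrupt or otherwise unreadable
 * record returns NULL and points *errormsg at a message, leaving the
 * ereport()/printing policy entirely to the caller. */
XLogRecord *
XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr,
               char **errormsg);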

- Heikki



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> On 15.11.2012 16:50, Alvaro Herrera wrote:

> >I worked a bit more on that patch of yours, but I neglected to submit
> >it.  Did you have something in particular that you wanted changed in it?
>
> Off the top of my head, there were a two open items with the patch
> as I submitted it:
>
> 1. Need to make sure it's easy to compile outside backend code. So
> that it's suitable for using in an xlogdump contrib module, for
> example.
>
> 2. do something about error reporting. In particular, xlogreader.c
> should not call emode_for_corrupt_record(), but we need to provide
> for that functionality somehow. I think I'd prefer xlogreader.c to
> not ereport() on a corrupt record. Instead, it would return an error
> string to the caller, which could then decide what to do with it.
> Translating the messages needs some thought, though.
>
> Other than those, and cleanup of any obsoleted comments etc. and
> adding docs, I think it was good to go.

Thanks.  I was toying with the idea that xlogreader.c should return a
status code to the caller, and additionally an error string; not all
error cases are equal.

Most of what I did (other than general cleanup) was moving some xlog.c
global vars into a private_data struct for xlogreader.c to pass around;
one problem I had was deciding what to do with curFileTLI and
LastFileTLI (IIRC), because they are used outside of the reader module
(they were examined after recovery finished).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 03/14] Add simple xlogdump tool

From
Peter Eisentraut
Date:
On 11/14/12 8:17 PM, Andres Freund wrote:
> diff --git a/src/bin/Makefile b/src/bin/Makefile
> index b4dfdba..9992f7a 100644
> --- a/src/bin/Makefile
> +++ b/src/bin/Makefile
> @@ -14,7 +14,7 @@ top_builddir = ../..
>  include $(top_builddir)/src/Makefile.global
>  
>  SUBDIRS = initdb pg_ctl pg_dump \
> -    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
> +    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump

should be pg_xlogdump

>  
>  ifeq ($(PORTNAME), win32)
>  SUBDIRS += pgevent
> diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
> new file mode 100644
> index 0000000..d54640a
> --- /dev/null
> +++ b/src/bin/xlogdump/Makefile
> @@ -0,0 +1,25 @@
> +#-------------------------------------------------------------------------
> +#
> +# Makefile for src/bin/xlogdump
> +#
> +# Copyright (c) 1998-2012, PostgreSQL Global Development Group
> +#
> +# src/bin/pg_resetxlog/Makefile

fix that

> +#
> +#-------------------------------------------------------------------------
> +
> +PGFILEDESC = "xlogdump"
> +PGAPPICON=win32
> +
> +subdir = src/bin/xlogdump
> +top_builddir = ../../..
> +include $(top_builddir)/src/Makefile.global
> +
> +OBJS= xlogdump.o \
> +     $(WIN32RES)
> +
> +all: xlogdump
> +
> +
> +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt | xargs cat | tr -s " " "\012" | grep -v /main.o | sed 's/^/..\/..\/..\//')
> +    $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

This looks pretty evil, and there is no documentation about what it is
supposed to do.

Windows build support needs some thought.


> diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
> new file mode 100644
> index 0000000..0f984e4
> --- /dev/null
> +++ b/src/bin/xlogdump/xlogdump.c
> @@ -0,0 +1,468 @@
> +#include "postgres.h"
> +
> +#include <unistd.h>
> +
> +#include "access/xlogreader.h"
> +#include "access/rmgr.h"
> +#include "miscadmin.h"
> +#include "storage/ipc.h"
> +#include "utils/memutils.h"
> +#include "utils/guc.h"
> +
> +#include "getopt_long.h"
> +
> +/*
> + * needs to be declared because otherwise its defined in main.c which we cannot
> + * link from here.
> + */
> +const char *progname = "xlogdump";

Which may be a reason not to link with main.o.  We generally don't want
to hardcode the program name inside the program.

> +static void
> +usage(void)
> +{
> +    printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
> +           progname);
> +    printf(_("Usage:\n"));
> +    printf(_("  %s [OPTION]...\n"), progname);
> +    printf(_("\nOptions:\n"));
> +    printf(_("  -v, --version          output version information, then exit\n"));
> +    printf(_("  -h, --help             show this help, then exit\n"));
> +    printf(_("  -s, --start            from where recptr onwards to read\n"));
> +    printf(_("  -e, --end              up to which recptr to read\n"));
> +    printf(_("  -t, --timeline         which timeline do we want to read\n"));
> +    printf(_("  -i, --inpath           from where do we want to read? cwd/pg_xlog is the default\n"));
> +    printf(_("  -o, --output           where to write [start, end]\n"));
> +    printf(_("  -f, --file             wal file to parse\n"));
> +}

Options list should be in alphabetic order (or some other less random
order).  Most of these descriptions are not very intelligible (at least
without additional documentation).

> +
> +int main(int argc, char **argv)
> +{
> +    uint32 xlogid;
> +    uint32 xrecoff;
> +    XLogReaderState *xlogreader_state;
> +    XLogDumpPrivateData private;
> +    XLogRecPtr from = InvalidXLogRecPtr;
> +    XLogRecPtr to = InvalidXLogRecPtr;
> +    bool bad_argument = false;
> +
> +    static struct option long_options[] = {
> +        {"help", no_argument, NULL, 'h'},
> +        {"version", no_argument, NULL, 'v'},

Standard letters for help and version are ? and V.

> +        {"start", required_argument, NULL, 's'},
> +        {"end", required_argument, NULL, 'e'},
> +        {"timeline", required_argument, NULL, 't'},
> +        {"inpath", required_argument, NULL, 'i'},
> +        {"outpath", required_argument, NULL, 'o'},
> +        {"file", required_argument, NULL, 'f'},
> +        {NULL, 0, NULL, 0}
> +    };
> +    int            c;
> +    int            option_index;
> +
> +    memset(&private, 0, sizeof(XLogDumpPrivateData));
> +
> +    while ((c = getopt_long(argc, argv, "hvs:e:t:i:o:f:",

This could also be in a less random order.

> +                            long_options, &option_index)) != -1)
> +    {
> +        switch (c)
> +        {
> +            case 'h':
> +                usage();
> +                exit(0);
> +                break;
> +            case 'v':
> +                printf("Version: 0.1\n");
> +                exit(0);
> +                break;

This should be the PostgreSQL version.


also:

no man page

no nls.mk




Re: [PATCH 03/14] Add simple xlogdump tool

From
Andres Freund
Date:
On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
> On 11/14/12 8:17 PM, Andres Freund wrote:
> > diff --git a/src/bin/Makefile b/src/bin/Makefile
> > index b4dfdba..9992f7a 100644
> > --- a/src/bin/Makefile
> > +++ b/src/bin/Makefile
> > @@ -14,7 +14,7 @@ top_builddir = ../..
> >  include $(top_builddir)/src/Makefile.global
> >
> >  SUBDIRS = initdb pg_ctl pg_dump \
> > -    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
> > +    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
>
> should be pg_xlogdump

Good point.

> >
> >  ifeq ($(PORTNAME), win32)
> >  SUBDIRS += pgevent
> > diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
> > new file mode 100644
> > index 0000000..d54640a
> > --- /dev/null
> > +++ b/src/bin/xlogdump/Makefile
> > @@ -0,0 +1,25 @@
> > +#-------------------------------------------------------------------------
> > +#
> > +# Makefile for src/bin/xlogdump
> > +#
> > +# Copyright (c) 1998-2012, PostgreSQL Global Development Group
> > +#
> > +# src/bin/pg_resetxlog/Makefile
>
> fix that

Ditto.

> > +#
> > +#-------------------------------------------------------------------------
> > +
> > +PGFILEDESC = "xlogdump"
> > +PGAPPICON=win32
> > +
> > +subdir = src/bin/xlogdump
> > +top_builddir = ../../..
> > +include $(top_builddir)/src/Makefile.global
> > +
> > +OBJS= xlogdump.o \
> > +     $(WIN32RES)
> > +
> > +all: xlogdump
> > +
> > +
> > +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt | xargs cat | tr -s " " "\012" | grep -v /main.o | sed 's/^/..\/..\/..\//')
> > +    $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>
> This looks pretty evil, and there is no documentation about what it is
> supposed to do.

There has been some talk about this before and this clearly isn't an
acceptable solution. The previously stated idea was to split off the
_desc routines so we don't need to link with the whole backend.

Alvaro started to work on that a bit:
http://archives.postgresql.org/message-id/1346268803-sup-9854%40alvh.no-ip.org

(What the above does is simply collect all backend object files, remove
main.o from that list and list them as dependencies.)

> Windows build support needs some thought.

I don't have the slightest clue how the windows build environment works,
is there still a problem if we only link to a very selected list of
backend object files? Or do we need to link them to some external
location?

> > diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
> > new file mode 100644
> > index 0000000..0f984e4
> > --- /dev/null
> > +++ b/src/bin/xlogdump/xlogdump.c
> > @@ -0,0 +1,468 @@
> > +#include "postgres.h"
> > +
> > +#include <unistd.h>
> > +
> > +#include "access/xlogreader.h"
> > +#include "access/rmgr.h"
> > +#include "miscadmin.h"
> > +#include "storage/ipc.h"
> > +#include "utils/memutils.h"
> > +#include "utils/guc.h"
> > +
> > +#include "getopt_long.h"
> > +
> > +/*
> > + * needs to be declared because otherwise its defined in main.c which we cannot
> > + * link from here.
> > + */
> > +const char *progname = "xlogdump";
>
> Which may be a reason not to link with main.o.

Well, it's precisely that we cannot link main.o which causes the problem,
but yes, really fixing this is definitely the goal; it's just not really
possible yet.

> > +static void
> > +usage(void)
> > +{
> > +    printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
> > +           progname);
> > +    printf(_("Usage:\n"));
> > +    printf(_("  %s [OPTION]...\n"), progname);
> > +    printf(_("\nOptions:\n"));
> > +    printf(_("  -v, --version          output version information, then exit\n"));
> > +    printf(_("  -h, --help             show this help, then exit\n"));
> > +    printf(_("  -s, --start            from where recptr onwards to read\n"));
> > +    printf(_("  -e, --end              up to which recptr to read\n"));
> > +    printf(_("  -t, --timeline         which timeline do we want to read\n"));
> > +    printf(_("  -i, --inpath           from where do we want to read? cwd/pg_xlog is the default\n"));
> > +    printf(_("  -o, --output           where to write [start, end]\n"));
> > +    printf(_("  -f, --file             wal file to parse\n"));
> > +}
>
> Options list should be in alphabetic order (or some other less random
> order).  Most of these descriptions are not very intelligible (at
> least without additional documentation).

True, it's noticeable that this mostly was a development tool. But it
shouldn't stay that way. There have been several bug reports of late
where a bin/pg_xlogdump would have been very helpful...

> This should be the PostgreSQL version.
>
>
> also:
>
> no man page
>
> no nls.mk

Will try to provide an actually submittable version once the
xlogreader situation is finalized and the _desc routines are split out...

Thanks!

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 03/14] Add simple xlogdump tool

From
Jeff Janes
Date:
On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> ---
>  src/bin/Makefile            |   2 +-
>  src/bin/xlogdump/Makefile   |  25 +++
>  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 494 insertions(+), 1 deletion(-)
>  create mode 100644 src/bin/xlogdump/Makefile
>  create mode 100644 src/bin/xlogdump/xlogdump.c

Is this intended to be the successor of
https://github.com/snaga/xlogdump which will then be deprecated?

Thanks,

Jeff



Re: [PATCH 03/14] Add simple xlogdump tool

From
Andres Freund
Date:
On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > ---
> >  src/bin/Makefile            |   2 +-
> >  src/bin/xlogdump/Makefile   |  25 +++
> >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 494 insertions(+), 1 deletion(-)
> >  create mode 100644 src/bin/xlogdump/Makefile
> >  create mode 100644 src/bin/xlogdump/xlogdump.c
>
> Is this intended to be the successor of
> https://github.com/snaga/xlogdump which will then be deprecated?

As-is this is just a development tool which was sorely needed for the
development of this patchset. But yes, I think that once ready
(xlogreader infrastructure, *_desc routines split out) it should
definitely be able to do most of what the above xlogdump can do, and it
should live in bin/. I think mostly some filtering is missing.

That doesn't really "deprecate" the above though.

Does that answer your question?

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:

> I'm disappointed to see that there has been no progress on this
> patch since last commitfest. I thought we agreed on the approach I
> championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> There wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

Here's what I have right now.  It's your patch, plus some tweaks such as
changing the timing for allocating readRecordBuf; I also added a struct
to contain XLogReadPage's private data, instead of using global
variables.  (The main conclusion I get from this, is that it's
relatively easy to split out reading of XLog out of xlog.c; there are
some global variables still remaining, but AFAICS that should be
relatively simple to fix).

There is no consensus on the way to handle error reporting.  Tom
suggests having the hypothetical client-side code redefine ereport()
somehow; as far as I can see that means we would have to reimplement
errstart, errfinish, etc.  That doesn't sound all that nice to me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical changeset generation v3

From
Michael Paquier
Date:
Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature
globally.
Sorry, I recall that you put a link somewhere but I cannot remember its email...

On Thu, Nov 15, 2012 at 6:34 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:
> > Looks like cool stuff @-@
> > I might be interested in looking at that a bit as I think I will hopefully
> > be able to grab some time in the next couple of weeks.
> > Are some of those patches already submitted to a CF?
>
> I added the patchset as one entry to the CF this time, it seems to me they are
> too hard to judge individually to make them really separately reviewable.
>
> I can split it off there, but really all the complicated stuff is in one patch
> anyway...
>
> Greetings,
>
> Andres

--
Michael Paquier
http://michael.otacoo.com



Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From
Michael Paquier
Date:
Hi,

This patch looks OK.

I got 3 comments:
1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to 3455? It does not look necessary.
2) You should perhaps change the header of RelationMapFilenodeToOid so as
not to mention it as the opposite operation of RelationMapOidToFilenode
but as an operation that looks for the OID of a relation based on its
relfilenode. Both functions are opposite but independent.
3) Both functions are doing similar operations. Could it be possible to wrap them in the same central function?

On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> ---
>  src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
>  src/include/catalog/indexing.h      |  4 +--
>  src/include/utils/relmapper.h       |  2 ++
>  3 files changed, 57 insertions(+), 2 deletions(-)

--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Andrea Suisani
Date:
On 16/11/2012 05:34, Michael Paquier wrote:
> Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature
> globally.
> Sorry, I recall that you put a link somewhere but I cannot remember its email...

http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php

Andrea




Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From
Andres Freund
Date:
Hi,

On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> This patch looks OK.
>
> I got 3 comments:
> 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
> 3455? It does not look necessary.

It's a mismerge and should have happened in "Add a new RELFILENODE
syscache to fetch a pg_class entry via (reltablespace, relfilenode)", but
it seems I squashed the wrong two commits.
I had originally used 3171 but that has since been used up for lo_tell64...

> 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> not to mention it as the opposite operation of RelationMapOidToFilenode
> but as an operation that looks for the OID of a relation based on its
> relfilenode. Both functions are opposite but independent.

I described it as the opposite because RelationMapOidToFilenode is the
relmapper's stated goal and RelationMapFilenodeToOid is just some
side business.

> 3) Both functions are doing similar operations. Could it be possible
> to wrap them in the same central function?

I don't really see how without making both quite a bit more
complicated. The amount of if's needed seems to be too large to me.

Thanks,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Markus Wanner
Date:
Andres,

On 11/15/2012 01:27 AM, Andres Freund wrote:
> In response to this you will soon find the 14 patches that currently
> implement $subject.

Congratulations on that piece of work.


I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R. I hope for this comparison to shed some
light on the similarities and differences of the two projects. As the
author of Postgres-R, I'm obviously biased, but I try to be as neutral
as I can.


Let's start with the representation: so far I have considered the Postgres-R
change set format to be an implementation detail, and I don't intend it
to be readable by humans or third-party tools. It's thus binary only and
doesn't offer a textual representation. The approach presented here
seems to target different formats for different audiences, including
binary representations. More general, less specific.


Next, contents: this proposal is more verbose. In the textual
representation shown, it provides (usually redundant) information about
attribute names and types. Postgres-R doesn't ever transmit attribute
name or type information for INSERT, UPDATE or DELETE operations.
Instead, it relies on attribute numbers and pg_attribute being in some
known consistent state.

Let's compare by example:

> table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
> table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
> table "replication_example": DELETE (pkey): id[int4]:1

In Postgres-R, the change sets for these same operations would carry the
following information, in a binary representation:

> table "replication_example": INSERT: VALUES (1, 1, '1')
> table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
> table "replication_example": DELETE: PKEY(1) COID(78)

You may have noticed that there's an additional COID field. This is an
identifier for the transaction that last changed this tuple. Together
with the primary key, it effectively identifies the exact version of a
tuple (during its lifetime, for example before vs after an UPDATE). This
in turn is used by Postgres-R to detect conflicts.

It may be possible to add that to the proposed format as well, for it to
be able to implement a Postgres-R-like algorithm.


To finish off this comparison, let's take a look at how and where the
change sets are generated: in Postgres-R the change set stream is
constructed directly from the heap modification routines, i.e. in
heapam.c's heap_{insert,update,delete}() methods, whereas the patches
proposed here parse the WAL to reconstruct the modifications and add the
required meta information.

To me, going via the WAL first sounded like a step that unnecessarily
complicates matters. I recently talked to Andres and brought that up.
Here's my current view of things:

The Postgres-R approach is independent of WAL and its format, whereas
the approach proposed here clearly is not. Either way, there is a
certain overhead - however minimal it is - which the former adds to the
transaction processing itself, while the latter postpones it to a
separate XLogReader process. If there's any noticeable difference, it
might reduce latency in the case of asynchronous replication, but can only
increase latency in the synchronous case. As far as I understood Andres,
it was easier to collect the additional meta data from within the
separate process.


In summary, I'd say that Postgres-R is an approach specifically
targeting and optimized for multi-master replication between Postgres
nodes, whereas the proposed patches are kept more general.

I hope you found this to be an insightful and fair comparison.

Regards

Markus Wanner



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Andres Freund
Date:
Hi Markus,

On 2012-11-16 14:46:39 +0100, Markus Wanner wrote:
> On 11/15/2012 01:27 AM, Andres Freund wrote:
> > In response to this you will soon find the 14 patches that currently
> > implement $subject.
>
> Congratulations on that piece of work.

Thanks.

> I'd like to provide a comparison of the proposed change set format to
> the one used in Postgres-R.

Uh, sorry to interrupt you right here, but that's not the "proposed
format" ;) That's just an example output plugin that people wished
for. For the use-case we're after, we (as in 2ndq) also want to use binary
data. It's also rather useful for debugging and such.

I generally agree that the presented format is too verbose for actual
replication, but it seems fine enough for showing off ;)

If you look at Patch 12/14 "Add a simple decoding module in contrib
named 'test_decoding'" you can see that adding a different output format
should be pretty straightforward.
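
(As a rough sketch of the callback shape described further down-thread,
begin/change/commit called per transaction in commit order, with every
name below invented for illustration rather than taken from the patchset:)

#include <stdint.h>

/* illustrative only: the real patchset defines its own callback struct */
typedef struct LLogOutputPlugin
{
    /* called once when a decoded transaction starts to be replayed */
    void (*begin_txn)(void *private_data, uint32_t xid);
    /* called for each INSERT/UPDATE/DELETE in the transaction */
    void (*change)(void *private_data, const char *schema,
                   const char *table, char action /* 'I', 'U', 'D' */);
    /* called once at the transaction's commit */
    void (*commit_txn)(void *private_data, uint32_t xid);
} LLogOutputPlugin;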

Which output plugin is used is determined by the initial
INIT_LOGICAL_REPLICATION '$plugin'; command in a replication connection.

> To finish off this comparison, let's take a look at how and where the
> change sets are generated: in Postgres-R the change set stream is
> constructed directly from the heap modification routines, i.e. in
> heapam.c's heap_{insert,update,delete}() methods, whereas the patches
> proposed here parse the WAL to reconstruct the modifications and add the
> required meta information.
>
> To me, going via the WAL first sounded like a step that unnecessarily
> complicates matters. I recently talked to Andres and brought that up.
> Here's my current view of things:
>
> The Postgres-R approach is independent of WAL and its format, whereas
> the approach proposed here clearly is not. Either way, there is a
> certain overhead - however minimal it is - which the former adds to the
> transaction processing itself, while the latter postpones it to a
> separate XLogReader process. If there's any noticeable difference, it
> might reduce latency in the case of asynchronous replication, but can only
> increase latency in the synchronous case. As far as I understood Andres,
> it was easier to collect the additional meta data from within the
> separate process.

There also is the point that if you do the processing inside heap_* you
need to make sure the replication-targeted data is safely received &
fsynced away; in "our" case that's not necessary, as WAL already provides
crash safety, so should the replication connection break you can simply
start from the location last confirmed as being safely sent.

As we want to provide asynchronous replication, that's a rather major
point.

> In summary, I'd say that Postgres-R is an approach specifically
> targeting and optimized for multi-master replication between Postgres
> nodes, whereas the proposed patches are kept more general.

One major aim definitely was to optionally be able to replicate into just
about any target system, so yes, I certainly agree.

> I hope you found this to be an insightful and fair comparison.

Yes, input in general and especially from other replication providers is
certainly interesting and important!

Thanks,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Andres Freund
Date:
Hi,

On 2012-11-16 14:46:39 +0100, Markus Wanner wrote:
> You may have noticed that there's an additional COID field. This is an
> identifier for the transaction that last changed this tuple. Together
> with the primary key, it effectively identifies the exact version of a
> tuple (during its lifetime, for example before vs after an UPDATE). This
> in turn is used by Postgres-R to detect conflicts.

What's the data type of the "COID" in -R?

In the patchset the output plugin has enough data to get the old xid and
the new xid in the case of updates (not in the case of deletes, but
that's a small bug and should be fixable with a single line of code), and
it has enough information to extract the primary key without problems.

I wonder whether we also should track the xid epoch...
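
(To make the epoch idea concrete: paired with the xid it forms a value
that never wraps. A trivial sketch, illustration only, not code from the
patchset:)

#include <stdint.h>

/* widen a 32-bit xid with its epoch so comparisons never have to worry
 * about wraparound */
static inline uint64_t
full_xid(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | xid;
}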

> It may be possible to add that to the proposed format as well, for it to
> be able to implement a Postgres-R-like algorithm.

I don't know the exact Postgres-R algorithm (but I have queued reading some
papers you referred to when we talked), but I guess what we have in mind
is roughly similar - it's just not even remotely part of this patchset ;)

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On 12-11-14 08:17 PM, Andres Freund wrote:
> For the regular satisfies routines this is needed in preparation of logical
> decoding. I changed the non-regular ones for consistency as well.
>
> The naming between htup, tuple and similar is rather confused, I could not find
> any consistent naming anywhere.
>
> This is preparatory work for the logical decoding feature which needs to be
> able to get to a valid relfilenode when checking the visibility of a
> tuple.

I have taken a look at this patch.  The patch does what it says: it
changes a bunch of HeapTupleSatisfiesXXXX routines to take a HeapTuple
structure instead of a HeapTupleHeader.
It also sets the HeapTuple.t_tableOid value before calling these routines.

The patch does not modify these routines to actually do anything useful
with the additional data fields, though it does add some assertions to
make sure t_tableOid is set.

The patch compiles cleanly and the unit tests pass.

This patch does not seem to depend on any of the other patches in this
set and applies cleanly against master.  The patch doesn't actually add
any functionality; unless someone sees a reason for complaining about it
that I don't see, I think it can be committed.

Steve

> ---
>  contrib/pgrowlocks/pgrowlocks.c      |  2 +-
>  src/backend/access/heap/heapam.c     | 13 ++++++----
>  src/backend/access/heap/pruneheap.c  | 16 ++++++++++--
>  src/backend/catalog/index.c          |  2 +-
>  src/backend/commands/analyze.c       |  3 ++-
>  src/backend/commands/cluster.c       |  2 +-
>  src/backend/commands/vacuumlazy.c    |  3 ++-
>  src/backend/storage/lmgr/predicate.c |  2 +-
>  src/backend/utils/time/tqual.c       | 50 +++++++++++++++++++++++++++++-------
>  src/include/utils/snapshot.h         |  4 +--
>  src/include/utils/tqual.h            | 20 +++++++--------
>  11 files changed, 83 insertions(+), 34 deletions(-)

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Markus Wanner
Date:
On 11/16/2012 03:05 PM, Andres Freund wrote:
>> I'd like to provide a comparison of the proposed change set format to
>> the one used in Postgres-R.
> 
> Uh, sorry to interrupt you right here, but thats not the "proposed
> format" ;)

Understood. Sorry, I didn't mean to imply that. It's pretty obvious to
me that this is more of a human readable format and that others,
including binary formats, can be implemented. I apologize for the bad
wording of a "proposed format", which doesn't make that clear.

>> The Postgres-R approach is independent of WAL and its format, whereas
>> the approach proposed here clearly is not. Either way, there is a
>> certain overhead - however minimal it is - which the former adds to the
>> transaction processing itself, while the latter postpones it to a
>> separate XLogReader process. If there's any noticeable difference, it
>> might reduce latency in the case of asynchronous replication, but can only
>> increase latency in the synchronous case. As far as I understood Andres,
>> it was easier to collect the additional meta data from within the
>> separate process.
> 
> There also is the point that if you do the processing inside heap_* you
> need to make sure the replication-targeted data is safely received &
> fsynced away; in "our" case that's not necessary, as WAL already provides
> crash safety, so should the replication connection break you can simply
> start from the location last confirmed as being safely sent.

In the case of Postgres-R, the "safely received" part isn't really
handled at the change set level at all. And regarding the fsync
guarantee: you can well use the WAL to provide that, without basing
change set generation on it. In that regard, Postgres-R is probably the
more general approach: you can run Postgres-R with WAL turned off
entirely - which may well make sense if you take into account the vast
amount of cloud resources available, which don't have a BBWC. Instead of
WAL, you can add more nodes at more different locations. And no, you
don't want your database to ever go down, anyway  :-)

>> In summary, I'd say that Postgres-R is an approach specifically
>> targeting and optimized for multi-master replication between Postgres
>> nodes, whereas the proposed patches are kept more general.
> 
> One major aim definitely was to optionally be able to replicate into just
> about any target system, so yes, I certainly agree.

I'm glad I got that correct ;-)

Regards

Markus Wanner



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Markus Wanner
Date:
On 11/16/2012 03:14 PM, Andres Freund wrote:
> What's the data type of the "COID" in -R?

It's short for CommitOrderId, a 32-bit global transaction identifier
that wraps around, very much like TransactionIds do. (In that sense,
it's global, but unique only for a certain amount of time.)
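
(Comparisons on such a wrapping counter presumably need the same modulo
arithmetic PostgreSQL uses for TransactionIds; a minimal sketch, assuming
plain 32-bit COIDs that are never more than 2^31 apart:)

#include <stdint.h>
#include <stdbool.h>

/* "a precedes b" on a wrapping 32-bit counter, in the style of
 * TransactionIdPrecedes */
static inline bool
coid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}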

> In the patchset the output plugin has enough data to get the old xid and
> the new xid in the case of updates (not in the case of deletes, but
> thats a small bug and should be fixable with a single line of code), and
> it has enough information to extract the primary key without problems.

It's the xmin of the old tuple that Postgres-R needs to get the COID.

Regards

Markus Wanner





On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,

On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> This patch looks OK.
>
> I got 3 comments:
> 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
> 3455? It does not look necessary.

It's a mismerge; it should have happened in "Add a new RELFILENODE
syscache to fetch a pg_class entry via (reltablespace, relfilenode)", but
it seems I squashed the wrong two commits.
I had originally used 3171, but that has since been used up for lo_tell64...

> 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> not mentionning it as the opposite operation of RelationMapOidToFilenode
> but as an operation that looks for the OID of a relation based on its
> relfilenode. Both functions are opposite but independent.

I described it as the opposite because RelationMapOidToFilenode is the
relmapper's stated goal and RelationMapFilenodeToOid is just some
side-business.

> 3) Both functions are doing similar operations. Could it be possible
> to wrap them in the same central function?

I don't really see how without making both quite a bit more
complicated. The number of ifs needed seems too large to me.
OK, thanks for your answers.
As this patch only adds a new function and is not that complicated, I think there is no problem in committing it. The only thing to remove is the diff in indexing.h. Could someone take care of that?
If other people have additional comments on the ability to perform a relfilenode->reloid lookup for cached maps, of course go ahead.
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 11/16/2012 02:46 PM, Markus Wanner wrote:
> Andres,
>
> On 11/15/2012 01:27 AM, Andres Freund wrote:
>> In response to this you will soon find the 14 patches that currently
>> implement $subject.
> Congratulations on that piece of work.
>
>
> I'd like to provide a comparison of the proposed change set format to
> the one used in Postgres-R. I hope for this comparison to shed some
> light on the similarities and differences of the two projects. As the
> author of Postgres-R, I'm obviously biased, but I try to be as neutral
> as I can.
...
> Let's compare by example:
>
>> table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
>> table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
>> table "replication_example": DELETE (pkey): id[int4]:1
> In Postgres-R, the change sets for these same operations would carry the
> following information, in a binary representation:
>
>> table "replication_example": INSERT: VALUES (1, 1, '1')
>> table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
>> table "replication_example": DELETE: PKEY(1) COID(78)
Is it possible to replicate UPDATEs and DELETEs without a primary key in
PostgreSQL-R?


Hannu



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Markus Wanner
Date:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
> Is it possible to replicate UPDATEs and DELETEs without a primary key in
> PostgreSQL-R?

No. There must be some way to logically identify the tuple. Note,
though, that theoretically any (unconditional) unique key would suffice.
In practice, that usually doesn't matter, as you rarely have one or more
unique keys without a primary.

Also note that the underlying index is useful for remote application of
change sets (except perhaps for very small tables).

In some cases, for example for n:m linking tables, you need to add a
uniqueness key that spans all columns (as opposed to a simple index on
one of the columns that's usually required, anyway). I hope for
index-only scans eventually mitigating this issue.

Alternatively, I've been thinking about the ability to add a hidden
column, which can then be used as a PRIMARY KEY without breaking legacy
applications that rely on SELECT * not returning that primary key.

Are there other reasons to want tables without primary keys that I'm
missing?

Regards

Markus Wanner



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 11/17/2012 03:00 PM, Markus Wanner wrote:
> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>> PostgreSQL-R?
> No. There must be some way to logically identify the tuple.
It can be done by selecting on _all_ attributes and updating/deleting
just the first matching row:

DECLARE c CURSOR FOR SELECT * FROM t ... WHERE t.* = (....);
FETCH 1 FROM c;
DELETE FROM t WHERE CURRENT OF c;

This is on the distant (round 3 or 4) roadmap for this work; I was just
interested whether you had found any better way of doing it :)

Hannu

> Note,
> though, that theoretically any (unconditional) unique key would suffice.
> In practice, that usually doesn't matter, as you rarely have one or more
> unique keys without a primary.
>
> Also note that the underlying index is useful for remote application of
> change sets (except perhaps for very small tables).
>
> In some cases, for example for n:m linking tables, you need to add a
> uniqueness key that spans all columns (as opposed to a simple index on
> one of the columns that's usually required, anyway). I hope for
> index-only scans eventually mitigating this issue.
>
> Alternatively, I've been thinking about the ability to add a hidden
> column, which can then be used as a PRIMARY KEY without breaking legacy
> applications that rely on SELECT * not returning that primary key.
>
> Are there other reasons to want tables without primary keys that I'm
> missing?
>
> Regards
>
> Markus Wanner




Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 11/17/2012 03:00 PM, Markus Wanner wrote:
> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>> PostgreSQL-R?
> No. There must be some way to logically identify the tuple. Note,
> though, that theoretically any (unconditional) unique key would suffice.
> In practice, that usually doesn't matter, as you rarely have one or more
> unique keys without a primary.
...
> Are there other reasons to want tables without primary keys that I'm
> missing?
>
Nope. The only place a table without a primary key would be needed is a 
log table, but as these are (supposed to be) INSERT-only this is not a 
problem for them.

Hannu



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Markus Wanner
Date:
Hannu,

On 11/17/2012 03:40 PM, Hannu Krosing wrote:
> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>> PostgreSQL-R?
>> No. There must be some way to logically identify the tuple.
> It can be done by selecting on _all_ attributes and updating/deleting
> just the first matching row:
> 
> DECLARE c CURSOR FOR SELECT * FROM t ... WHERE t.* = (....);
> FETCH 1 FROM c;
> DELETE FROM t WHERE CURRENT OF c;

That doesn't sound like it could possibly work for Postgres-R. At least
not when there can be multiple rows with all the same attributes, i.e.
without a unique key constraint over all columns.

Otherwise, some nodes could detect two concurrent UPDATES as a conflict,
while other nodes select different rows and don't handle it as a conflict.

Regards

Markus Wanner



On 2012-11-17 19:14:06 +0900, Michael Paquier wrote:
> On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > Hi,
> >
> > On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> > > This patch looks OK.
> > >
> > > I got 3 comments:
> > > 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
> > > 3455? It does not look necessary.
> >
> > It's a mismerge; it should have happened in "Add a new RELFILENODE
> > syscache to fetch a pg_class entry via (reltablespace, relfilenode)", but
> > it seems I squashed the wrong two commits.
> > I had originally used 3171, but that has since been used up for lo_tell64...
> >
> > > 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> > > not mentionning it as the opposite operation of RelationMapOidToFilenode
> > > but as an operation that looks for the OID of a relation based on its
> > > relfilenode. Both functions are opposite but independent.
> >
> > I described it as the opposite because RelationMapOidToFilenode is the
> > relmapper's stated goal and RelationMapFilenodeToOid is just some
> > side-business.
> >
> > > 3) Both functions are doing similar operations. Could it be possible
> > > to wrap them in the same central function?
> >
> > I don't really see how without making both quite a bit more
> > complicated. The number of ifs needed seems too large to me.
> >
> OK, thanks for your answers.
> As this patch only adds a new function and is not that complicated, I
> think there is no problem in committing it. The only thing to remove is the
> diff in indexing.h. Could someone take care of that?

I pushed a rebase to the git repository that fixed it...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Michael Paquier
Date:


On Fri, Nov 16, 2012 at 5:16 PM, Andrea Suisani <sickpig@opinioni.net> wrote:
On 16/11/2012 05:34, Michael Paquier wrote:

Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature globally.
Sorry, I recall that you put a link somewhere but I cannot remember which email it was in...

http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php
Thanks Andrea.
I am pretty sure I will be able to provide some feedback by Friday.
--
Michael Paquier
http://michael.otacoo.com
Hi,

I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see patch attached.
Thanks,

On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
---
 src/bin/pg_basebackup/Makefile         |   7 +-
 src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
 src/bin/pg_basebackup/streamutil.c     |   3 +-
 src/bin/pg_basebackup/streamutil.h     |   1 +
 4 files changed, 725 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_basebackup/pg_receivellog.c







--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Michael Paquier
Date:
Hi Andres,

I have been able to fetch your code (thanks Andrea!) and test some of it. For the time being I am spending some time reading the code and understanding the whole set of features you are trying to implement inside core, even if I got some background from what you presented at PGCon and from the hackers ML.
Btw, as a first approach, I tried to run the logical log receiver plugged on a postgres server, and I am not able to make it work.

Well, I am using settings similar to yours.
# Run master
rm -r ~/bin/pgsql/master/
initdb -D ~/bin/pgsql/master/
echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
postgres -D ~/bin/pgsql/master \
          -c wal_level=logical \
          -c max_wal_senders=10 \
          -c max_logical_slots=10 \
          -c wal_keep_segments=100 \
          -c log_line_prefix="[%p %x] "
# Logical log receiver
pg_receivellog -f $HOME/output.txt -d postgres -v

After launching some SQL statements, the logical receiver is stuck just after sending INIT_LOGICAL_REPLICATION; please see the backtrace of the waiting process:
(gdb) bt
#0  0x00007f1bbc13b170 in __poll_nocancel () from /usr/lib/libc.so.6
#1  0x00007f1bbc43072d in pqSocketPoll (sock=3, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1089
#2  0x00007f1bbc43060d in pqSocketCheck (conn=0x1dd0290, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1031
#3  0x00007f1bbc4304dd in pqWaitTimed (forRead=1, forWrite=0, conn=0x1dd0290, finish_time=-1) at fe-misc.c:963
#4  0x00007f1bbc4304af in pqWait (forRead=1, forWrite=0, conn=0x1dd0290) at fe-misc.c:946
#5  0x00007f1bbc42c64c in PQgetResult (conn=0x1dd0290) at fe-exec.c:1709
#6  0x00007f1bbc42cd62 in PQexecFinish (conn=0x1dd0290) at fe-exec.c:1974
#7  0x00007f1bbc42c9c8 in PQexec (conn=0x1dd0290, query=0x406c60 "INIT_LOGICAL_REPLICATION 'test_decoding'") at fe-exec.c:1808
#8  0x0000000000402370 in StreamLog () at pg_receivellog.c:263
#9  0x00000000004030c9 in main (argc=6, argv=0x7fff44edb288) at pg_receivellog.c:694
So I am not able to output any results using pg_receivellog.

Also, I noticed 2 errors in your set of tests.

On Thu, Nov 15, 2012 at 9:27 AM, Andres Freund <andres@anarazel.de> wrote:
-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
In the SET clause, the column name is somedata and not somedate.
 
-- dont write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
Same error here, somedata and not somedate. Not a big deal, but it made the transactions fail.
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Andres Freund
Date:
Hi Michael,


On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> I have been able to fetch your code (thanks Andrea!) and test some of it. For the
> time being I am spending some time reading the code and understanding the
> whole set of features you are trying to implement inside core, even if I
> got some background from what you presented at PGCon and from the hackers
> ML.

Cool.

> Btw, as a first approach, I tried to run the logical log receiver plugged
> on a postgres server, and I am not able to make it work.

> Well, I am using settings similar to yours.
> # Run master
> rm -r ~/bin/pgsql/master/
> initdb -D ~/bin/pgsql/master/
> echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
> postgres -D ~/bin/pgsql/master \
>           -c wal_level=logical \
>           -c max_wal_senders=10 \
>           -c max_logical_slots=10 \
>           -c wal_keep_segments=100 \
>           -c log_line_prefix="[%p %x] "
> # Logical log receiver
> pg_receivellog -f $HOME/output.txt -d postgres -v
>
> After launching some SQL statements, the logical receiver is stuck just after sending
> INIT_LOGICAL_REPLICATION; please see the backtrace of the waiting process:

It's waiting till it sees an initial xl_running_xacts record. The
easiest way to trigger one is to manually issue a checkpoint (e.g.
CHECKPOINT; from psql). Sorry, I should have included that in the
description.
Otherwise you can wait till the next routine checkpoint comes around...

I plan to cause more xl_running_xacts records to be logged in the
future. I think the timing of those currently is non-optimal; you have
the same problem in normal streaming replication as well :(

> > -- wrapped in a transaction
> > BEGIN;
> > INSERT INTO replication_example(somedata, text) VALUES (1, 1);
> > UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT
> > currval('replication_example_id_seq'));
> >
> In the SET clause, the column name is *somedata* and not *somedate*.

Crap. Sorry for that. I wrote the example in the mail client and then
executed it, and I seem to have forgotten to put back some of the fixes...


> I am just looking at this patch and will provide some comments.
> By the way, you forgot the installation part of pg_receivellog, please see
> patch attached.

That actually was somewhat intended: I thought people wouldn't like the
name and I didn't want a binary that's going to be replaced anyway lying
around ;)

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - Source for Slony

From
Steve Singer
Date:
First, you can add me to the list of people saying 'wow', I'm impressed.

The approach I am taking to reviewing this is to try and answer the
following questions:

1) How might a future version of slony be able to use logical
replication as described by your patch and design documents,
and what would that look like?

2) What functionality is missing from the patch set that would stop me
from implementing or prototyping the above?



Connecting slon to the remote postgresql
========================

Today the slony remote listener thread queries a bunch of events from
sl_event for a batch of SYNC events. Then the remote helper thread
queries data from sl_log_1 and sl_log_2.  I see this changing: instead,
the slony remote listener thread would connect to the remote system and
get a logical replication stream.

  1) Would slony connect as a normal client connection and call
     something like 'select pg_slony_process_xlog(...)' to get a bunch
     of logical replication change records to process?
  OR
  2) Would slony connect as a replication connection, similar to how the
     pg_receivelog program does today, and then process the logical
     changeset outputs instead of writing them to a file (as
     pg_receivelog does)?
 

It seems that the second approach is what is encouraged.  I think we
would put a lot of the pg_receivelog functionality into slon and it
would issue a command like INIT_LOGICAL_REPLICATION 'slony' to use the
slony logical replication plugin.  Slon would also have to provide
feedback to the walsender about what it has processed so the origin
database knows what catalog snapshots can be expired.  Based on
eyeballing pg_receivelog.c it seems that about half the code in the
700-line file is related to command line arguments etc, and the other
half is related to looping over the copy out stream, sending feedback
and other things that we would need to duplicate in slon.

pg_receivelog.c has a comment:

/*
 * We have to use postgres.h not postgres_fe.h here, because there's so
 * much backend-only stuff in the XLOG include files we need.  But we
 * need a frontend-ish environment otherwise.  Hence this ugly hack.
 */

This looks like more of a carryover from pg_receivexlog.c.  From what I
can tell we can eliminate the postgres.h include if we also eliminate
the utils/datetime.h and utils/timestamp.h includes and instead add in:

#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0

If there is a better way of getting these defines, someone should speak
up.  I recall that in the past slon actually did include postgres.h and
it caused some issues (I think with MSVC win32 builds).  Since
pg_receivelog.c will be used as a starting point/sample for third
parties to write client programs, it would be better if it didn't
encourage client programs to include postgres.h.


The Slony Output Plugin
=====================

Once we've modified slon to connect as a logical replication client we 
will need to write a slony plugin.

As I understand the plugin API:
* A walsender is processing through WAL records; each time it sees a
COMMIT WAL record it will call my plugin's
.begin
.change (for each change in the transaction)
.commit

* The plugin for a particular stream/replication client will see one
transaction at a time passed to it in commit order.  It won't see
.change(t1) followed by .change(t2), followed by a second .change(t1).
The reorder buffer code hides all that complexity from me (yay).
From a slony point of view I think the output of the plugin will be
rows, suitable to be passed to COPY IN, of the form:

origin_id, table_namespace, table_name, command_type,
cmd_updatencols, command_args

This is basically the Slony 2.2 sl_log format minus a few columns we no
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs, i.e.
[{id=1},{name='steve'},{project='slony'}]

I don't think our output plugin will be much more complicated than the
test_decoding plugin.  I suspect we will want to give it the ability to
filter out non-replicated tables.  We will also have to filter out change
records that didn't originate on the local node and aren't part
of a cascaded subscription.  Remember that in a two node cluster slony
will have connections from A-->B and from B--->A even if user tables
only flow one way. Data that is replicated from A into B will show up in
the WAL stream for B.

Exactly how we do this filtering is an open question,  I think the 
output plugin will at a minimum need to know:

a) What the slony node id is of the node it is running on.  This is easy 
to figure out if the output plugin is able/allowed to query its 
database.  Will this be possible? I would expect to be able to query the 
database as it exists now(at plugin invocation time) not as it existed 
in the past when the WAL was generated.   In addition to the node ID I 
can see us wanting to be able to query other slony tables 
(sl_table,sl_set etc...)

b) What the slony node id is of the node we are streaming to.  It
would be nice if we could pass extra, arbitrary data/parameters to the 
output plugins that could include that, or other things.  At the moment 
the start_logical_replication rule in repl_gram.y doesn't allow for that 
but I don't see why we couldn't make it do so.

I still see some open questions about exactly how we would filter out 
data in this stage.
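
(As a very rough illustration of the kind of filtering meant here; every
name and check below is invented, nothing of this exists in the patchset:)

#include <stdbool.h>

/* sketch: decide inside the change callback whether a decoded change
 * should be forwarded to this subscriber */
static bool
slony_should_emit(int change_origin_id, int local_node_id,
                  bool origin_is_cascaded, const char *table_name,
                  bool (*is_replicated)(const char *))
{
    /* skip tables that aren't in a replicated set */
    if (!is_replicated(table_name))
        return false;
    /* only forward changes that originated here or that we forward as
     * part of a cascaded subscription */
    return change_origin_id == local_node_id || origin_is_cascaded;
}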


<editorial> Everything above deals with the postgresql side of things,
i.e. the patch in question or the plugin API we would have to work with.
Much of what is below deals with slony side changes and might be of
limited interest to some on pgsql-hackers.
</editorial>

Slon Applying Changes
================

The next task we will have is to make slon and the replica instance able
to apply these changes.  In slony 2.2 we do a COPY from sl_log and
apply that stream to a table on the replica with COPY. We then have
triggers on the replica that decode the command_args and apply the
changes as
INSERT/UPDATE/DELETE statements on the user tables.  I see this
continuing to work in this fashion, but there are a few special cases:

1) Changes made to sl_event on the origin will result in records in the 
logical replication stream that change sl_event.  In many cases we won't
just be inserting records into sl_event but we will need to instead do 
the logic in remote_worker.c for processing the different types of
events.  Worst case we could parse the change records we receive from
our version of pg_receivellog and split the sl_event records out into an
sl_event stream and an sl_log stream.  Another approach might be to have
the slony apply trigger build up a list of events that the slon
remote_worker code can then process through.

2) Slony is normally bi-directional even if user data only replicates 
one way.  Confirm (sl_confirm) entries go from a replica back to an 
origin.  In a two node origin->replica scenario for data, the way I see 
this working is that the slon for the origin would connect to the 
replica (like it does today).
It would receive the logical replication records, but since it isn't 
subscribed to any tables it won't receive/process the WAL for 
user-tables but it will still receive/process sl_confirm rows.   It will 
then insert the rows in sl_confirm that it 'replicated' from the remote 
node.


With what I have described so far, Slony would then be receiving a 
stream of events that looks like:

t1 - insert into foo [id=1, name='steve']
t1 - insert into bar [id=1, something='somethingelse']
t1 - commit
t2 - insert into foo [....]
t2 - commit
t3 - insert into sl_event [ev_type=SYNC, ev_origin=1, ev_seqno=12345]
t3 - commit

Even though, from a data-correctness point of view, slony could commit 
the transaction on the replica after it sees the t1 commit, we won't 
want it to do commits other than on a SYNC boundary.  This means that
the replicas will continue to move between consistent SYNC snapshots and 
that we can still track the state/progress of replication by knowing 
what events (SYNC or otherwise) have been confirmed.

This also means that slony should only provide  feedback to the 
walsender on SYNC boundaries after the transaction has committed on the 
receiver. I don't see this as being an issue.

Setting up Subscriptions
===================
At first we have a slon cluster with just 1 node; life is good. When a
second node is created and a path (or pair of paths) is defined between
the nodes I think they will each:
1. Connect to the remote node with a normal libpq connection.
    a. Get the current xlog recptr,
    b. Query any non-sync events of interest from sl_event.
2. Connect to the remote node with a logical replication connection and
   start streaming logical replication changes starting at the recptr we
   retrieved above.

Slon will then receive any future events from the remote sl_event as 
part of the logical replication stream.  It won't receive any user 
tables because it isn't yet subscribed to any.

When a subscription is started,  the SUBSCRIBE_SET and 
ENABLE_SUBSCRIPTION events will go through sl_event and the INSERT INTO 
sl_event will be part of a change record in the replication stream and 
be picked up by the subscribers slon remote_worker.

The remote_worker:copy_set will then need to get a consistent COPY of
the tables in the replication set, such that any changes made to the
tables after the copy is started get included in the replication
stream.  The approach proposed in the DESIGN.TXT file of exporting a
snapshot sounds okay for this.  I *think* slony could get by with
something less fancy as well, but it would be ugly.

1. Make sure that the origin starts including change records for the
tables in the set.
2. Have the slon (copy_set) wait until any transactions on the origin
that started prior to the ENABLE_SUBSCRIPTION are committed.  Slony does
this today as part of the copy_set logic.
3. Get/remember the snapshot visibility information for the COPY's
transaction.
4. When we start to process change records we need to filter out
records for transactions that were already visible to the copy.

Steps 1-3 are similar to how slony works today, but step 4 will be a bit 
awkward/ugly.  This isn't an issue today because we are already using 
the transaction visibility information for selecting from sl_log so it 
works, but above I had proposed stripping the xid from the logical 
change records.
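
(To make step 4 concrete, a sketch of the visibility test it implies,
using a simplified stand-in for a snapshot rather than the backend's
actual Snapshot type, and ignoring xid wraparound for brevity:)

#include <stdbool.h>
#include <stdint.h>

/* simplified snapshot: xids < xmin had completed, xids >= xmax had not
 * started, xip[] lists the xids in progress when it was taken */
typedef struct CopySnapshot
{
    uint32_t xmin;
    uint32_t xmax;
    const uint32_t *xip;
    int xcnt;
} CopySnapshot;

/* was this (committed) transaction already visible to the COPY's
 * snapshot, i.e. already contained in the initial data copy? */
static bool
already_copied(uint32_t xid, const CopySnapshot *snap)
{
    if (xid >= snap->xmax)
        return false;
    for (int i = 0; xid >= snap->xmin && i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return false;
    return true;
}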



Cascading Replication
=================
A-->B--->C

The slon for B will insert records from A into B's tables.  This insert 
will generate WAL records on B.  The slon for C should be able to pull 
the data it needs  (both sl_event entries with ev_origin=A, and user 
table data originating on A) from B's logical replication stream.  I 
don't see any issues here nor do I see a need to 'cache' the data in an 
sl_log type of table on B.


Reshaping Replication
=================

In Slony, replication is reshaped by two types of events, a MOVE SET and a
FAILOVER.

Move Set:
A replication set might be subscribed in a cascaded fashion like
A--->B--->C

When a MOVE SET is issued, a MOVE_SET(1,A,B) event is put into sl_event
on node A, and node A then stops accepting new write transactions on the
tables in set 1.
Node B receives the MOVE_SET command in the proper order, after it has
processed the last SYNC generated on A while A was still accepting write
transactions to those tables.  When node B processes the MOVE_SET event,
node B starts accepting write transactions on the tables.  Node B
will also generate an ACCEPT_SET event. Node C will then receive the
MOVE_SET (ev_origin=A) and the ACCEPT_SET (ev_origin=B) commands (after
all SYNC events from A with data changes to the set) and then knows that
it should start getting data for those tables from B.

I don't see any of this changing with logical replication acting as the 
data source.

FAILOVER:
---------------
A---->B
|    .
v  .
C

Today with slony, if B is a valid failover target then it is a 
forwarding node of the set.  This means that B keeps a record in sl_log 
of any changes originating on A until B knows that node C has received 
those changes.  In the event of a failover, if node C is far behind, it 
can just get the missing data from sl_log on node B (the failover 
target/new origin).

I see a problem with what I have discussed above: B won't explicitly
store the data from A in sl_log, so a cascaded node would depend on B's
WAL stream.
The problem is that at FAILOVER time, B might have processed some
changes from A. Node C might also be processing node B's WAL stream for
events (or data from another set).  Node C will discard/not receive the
data for A's tables since it isn't subscribed to those tables from B.
What happens then if at some later point B and C receive the FAILOVER
event?
What does node C do? It can't get the missing data from node A because
node A has failed, and it can't get it from node B because node C has
already processed the WAL changes from node B that included the data but
it ignored/discarded it.  Maybe node C could reprocess older WAL from
node B? Maybe this forces us to keep an sl_log type structure around?

Is it complete enough to build a prototype?
==========================
I think so; the incomplete areas I see are the ones mentioned in
the patch submission, including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk

I think it would be possible to build a prototype without those even 
though we'd need them before I could build a production system.

Conclusions
=============
I like this design much better than the original design from the spring 
that would have required keeping a catalog proxy on the decoding 
machine.  Based on what I've seen it should be possible to make slony 
use logical replication as a source for events instead of triggers 
populating sl_log.
My thinking is that we want a way for logreceiver programs to pass 
arguments/parameters to the output plugins. Beyond that this looks like 
something slony can use.





Re: logical changeset generation v3 - Source for Slony

From
Andres Freund
Date:
Hi Steve!

On 2012-11-17 22:50:35 -0500, Steve Singer wrote:
> First, you can add me to the list of people saying 'wow', I'm impressed.

Thanks!

> The approach I am taking to reviewing this is to try and answer the
> following questions:
>
> 1) How might a future version of slony be able to use logical replication as
> described by your patch and design documents,
> and what would that look like?
>
> 2) What functionality is missing from the patch set that would stop me from
> implementing or prototyping the above?

Sounds like a good plan to me.

>
> Connecting slon to the remote postgresql
> ========================
>
> Today the slony remote listener thread queries a bunch of events from
> sl_event for a batch of SYNC events. Then the remote helper thread queries
> data from sl_log_1 and sl_log_2.    I see this changing, instead the slony
> remote listener thread would connect to the remote system and get a logical
> replication stream.
>
>   1) Would slony connect as a normal client connection and call something
> like 'select pg_slony_process_xlog(...)' to get a bunch of logical replication
>       change records to process?
>   OR
>   2) Would slony connect as a replication connection similar to how the
> pg_receivelog program does today and then process the logical changeset
>       outputs, instead of writing them to a file (as pg_receivelog does)?

It would need to be the latter. We need the feedback messages it sends
for several purposes:
- increasing the lowered xmin
- implementing optionally synchronous replication at some point
- using 1) would mean having transactions open...

> It seems that the second approach is what is encouraged.  I think we would
> put a lot of the pg_receivelog functionality into slon and it would issue a
> command like INIT_LOGICAL_REPLICATION 'slony' to use the slony logical
> replication plugin.  Slon would also have to provide feedback to the
> walsender about what it has processed so the origin database knows what
> catalog snapshots can be expired.  Based on eyeballing pg_receivelog.c it
> seems that about half the code in the 700-line file is related to command line
> arguments etc, and the other half is related to looping over the copy out
> stream, sending feedback and other things that we would need to duplicate in
> slon.

I think we should provide some glue code to do this, otherwise people
will start replicating all the bugs I hacked into this... More
seriously: I think we should have support code here; no user will want
to learn the intricacies of feedback messages and such. Where would that
live? No idea.

> pg_receivelog.c has a comment:

(it's pg_receivellog btw. ;))

>
> /*
>  * We have to use postgres.h not postgres_fe.h here, because there's so much
>  * backend-only stuff in the XLOG include files we need.  But we need a
>  * frontend-ish environment otherwise.    Hence this ugly hack.
>  */
>
> This looks like more of a carryover from pg_receivexlog.c.  From what I can
> tell we can eliminate the postgres.h include if we also eliminate the
> utils/datetime.h and utils/timestamp.h and instead add in:
>
> #include "postgres_fe.h"
> #define POSTGRES_EPOCH_JDATE 2451545
> #define UNIX_EPOCH_JDATE 2440588
> #define SECS_PER_DAY 86400
> #define USECS_PER_SEC INT64CONST(1000000)
> typedef int64 XLogRecPtr;
> #define InvalidXLogRecPtr 0
>
> If there is a better way of getting these defines, someone should speak up.
> I recall that in the past slon actually did include postgres.h and it caused
> some issues (I think with MSVC win32 builds).  Since pg_receivelog.c will be
> used as a starting point/sample for third parties to write client programs,
> it would be better if it didn't encourage client programs to include
> postgres.h.

I wholeheartedly agree. It should also be cleaned up a fair bit before
others copy it, should we not go for having some client-side library.

Imo the library could very roughly be something like:

state = SetupStreamingLLog(replication-slot, ...);
while ((message = StreamingLLogNextMessage(state)))
{
    write(outfd, message->data, message->length);
    if (received_100_messages)
    {
        fsync(outfd);
        StreamingLLogConfirm(message);
    }
}

Although I guess that's not good enough because StreamingLLogNextMessage
would be blocking, but that shouldn't be too hard to work around.
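
(One way to work around the blocking, sketched with standard libpq
primitives; wait_for_message itself is as hypothetical as the rest:)

#include <sys/select.h>
#include <libpq-fe.h>

/* wait up to timeout_sec for data on the connection underlying the
 * stream; returns 0 on timeout so the caller can send feedback instead */
static int
wait_for_message(PGconn *conn, int timeout_sec)
{
    fd_set readfds;
    struct timeval tv = { timeout_sec, 0 };
    int sock = PQsocket(conn);

    FD_ZERO(&readfds);
    FD_SET(sock, &readfds);
    if (select(sock + 1, &readfds, NULL, NULL, &tv) <= 0)
        return 0;
    return PQconsumeInput(conn);
}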


> The Slony Output Plugin
> =====================
>
> Once we've modified slon to connect as a logical replication client we will
> need to write a slony plugin.
>
> As I understand the plugin API:
> * A walsender is processing through WAL records; each time it sees a COMMIT
> WAL record it will call my plugin's
> .begin
> .change (for each change in the transaction)
> .commit
>
> * The plugin for a particular stream/replication client will see one
> transaction at a time passed to it in commit order.  It won't see
> .change(t1) followed by .change(t2), followed by a second .change(t1).  The
> reorder buffer code hides all that complexity from me (yay).

Correct.

> From a slony point of view I think the output of the plugin will be rows,
> suitable to be passed to COPY IN, of the form:
>
> origin_id, table_namespace, table_name, command_type,
> cmd_updatencols, command_args
>
> This is basically the Slony 2.2 sl_log format minus a few columns we no
> longer need (txid, actionseq).
> command_args is a postgresql text array of column=value pairs, i.e.
> [{id=1},{name='steve'},{project='slony'}]

It seems to me that that makes escaping unnecessarily complicated, but
given you already have all the code... ;)

> I don't think our output plugin will be much more complicated than the
> test_decoding plugin.

Good. That's the idea ;). Are you ok with the interface as it is now or
would you like to change something?

> I suspect we will want to give it the ability to
> filter out non-replicated tables.  We will also have to filter out change
> records that didn't originate on the local node and aren't part of a
> cascaded subscription.  Remember that in a two node cluster slony will have
> connections from A-->B and from B--->A even if user tables only flow one
> way. Data that is replicated from A into B will show up in the WAL stream
> for B.

Yes. We will also need something like that. If you remember the first
prototype we sent to the list, it included the concept of an
'origin_node' in the wal record. I think you actually reviewed that one ;)

That was exactly aimed at something like this...

Since then my thoughts about how the origin_id looks have changed a
bit:
- origin id is internally still represented as an uint32/Oid
  - never visible outside of wal/system catalogs
- externally visible it gets
  - assigned an uuid
  - optionally assigned a user defined name
- user settable (permissions?) origin when executing sql:
  - SET change_origin_uuid = 'uuid';
  - SET change_origin_name = 'user-settable-name';
  - defaults to the local node
- decoding callbacks get passed the origin of a change
  - txn->{origin_uuid, origin_name, origin_internal?}
- the init decoding callback can set up an array of interesting origins,
  so the others don't even get the ReorderBuffer treatment

I have to thank the discussion on -hackers and a march through Prague
with Marko here...

> Exactly how we do this filtering is an open question,  I think the output
> plugin will at a minimum need to know:
>
> a) What the slony node id is of the node it is running on.  This is easy to
> figure out if the output plugin is able/allowed to query its database.  Will
> this be possible? I would expect to be able to query the database as it
> exists now(at plugin invocation time) not as it existed in the past when the
> WAL was generated.   In addition to the node ID I can see us wanting to be
> able to query other slony tables (sl_table,sl_set etc...)

Hm. There is no fundamental reason not to allow normal database access
to the current database but it won't be all that cheap, so doing it
frequently is not a good idea.
The reason it's not cheap is that you basically need to tear down the
postgres internal caches if you switch the timestream in which you are
working.

Would go something like:

TransactionContext = AllocSetCreate(...);
RevertFromDecodingSnapshot();
InvalidateSystemCaches();
StartTransactionCommand();
/* do database work */
CommitTransactionCommand();
/* cleanup memory*/
SetupDecodingSnapshot(snapshot, data);
InvalidateSystemCaches();

Why do you need to be able to query the present? I thought it might be
necessary to allow additional tables to be accessed in a timetraveling
manner, but not this way round.
I guess an initial round of querying during plugin initialization won't
be good enough?

> b) What the slony node id is of the node we are streaming to.  It would be
> nice if we could pass extra, arbitrary data/parameters to the output plugins
> that could include that, or other things.  At the moment the
> start_logical_replication rule in repl_gram.y doesn't allow for that but I
> don't see why we couldn't make it do so.

Yes, I think we want something like that. I even asked for input on that
recently ;):
http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de

Input welcome!


> Even though, from a data-correctness point of view, slony could commit the
> transaction on the replica after it sees the t1 commit, we won't want it to
> do commits other than on a SYNC boundary.  This means that the replicas will
> continue to move between consistent SYNC snapshots and that we can still
> track the state/progress of replication by knowing what events (SYNC or
> otherwise) have been confirmed.

I don't know enough about slony internals, but: why? This will prohibit
you from ever doing (per-transaction) synchronous replication...

> This also means that slony should only provide  feedback to the walsender on
> SYNC boundaries after the transaction has committed on the receiver. I don't
> see this as being an issue.

Yes, that's no problem. You need to give feedback more frequently
(otherwise walsender kicks you off), but you don't have to increase the
confirmed flush location.
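
(In other words, the receiver can keep sending status messages that prove
liveness while only advancing the confirmed position at SYNC boundaries.
A sketch of that policy; send_feedback() is a stand-in here, not a real
API:)

#include <stdint.h>
#include <stdbool.h>

/* stand-in: would write an actual standby-status-update message */
static void
send_feedback(uint64_t write_lsn, uint64_t flush_lsn)
{
    (void) write_lsn;
    (void) flush_lsn;
}

static uint64_t last_synced_lsn = 0;    /* advanced only at SYNC commits */

static void
maybe_send_feedback(uint64_t received_lsn, bool sync_just_committed,
                    uint64_t sync_lsn)
{
    if (sync_just_committed)
        last_synced_lsn = sync_lsn;
    /* report how far we've read, but only confirm up to the last
     * locally committed SYNC boundary */
    send_feedback(received_lsn, last_synced_lsn);
}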

> Setting up Subscriptions
> ===================
> At first we have a slon cluster with just 1 node, life is good. When a
> second node is created and a path(or pair of paths) are defined between the
> nodes I think they will each:
> 1. Connect to the remote node with a normal libpq connection.
>     a. Get the current xlog recptr,
>     b. Query any non-sync events of interest from sl_event.
> 2. Connect to the remote node with a logical replication connection and
> start streaming logical replication changes start at the recptr we retrieved
>     above.

Note that INIT_LOGICAL_REPLICATION can take some time to get to the
initial consistent state (especially if there are long-running
transactions). So you should do the init in 1), query all the events in
the snapshot that returns, and then go over to 2).
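
(Very roughly, in libpq terms; the exact replication-command grammar and
connection options are still in flux in this patchset, so the strings
below are placeholders:)

#include <libpq-fe.h>

static void
subscribe_node(void)
{
    /* 1) replication connection: create the slot first; INIT returns
     *    once a consistent point has been reached */
    PGconn *repl = PQconnectdb("dbname=postgres replication=database");
    PGresult *res = PQexec(repl, "INIT_LOGICAL_REPLICATION 'slony'");
    /* ... remember slot name / start position / snapshot info ... */
    PQclear(res);

    /* 2) normal connection: query sl_event as of that snapshot */
    PGconn *conn = PQconnectdb("dbname=postgres");
    res = PQexec(conn, "SELECT * FROM sl_event WHERE ev_type <> 'SYNC'");
    PQclear(res);
    PQfinish(conn);

    /* 3) then start streaming from the remembered position */
    res = PQexec(repl, "START_LOGICAL_REPLICATION ..."); /* placeholder */
    PQclear(res);
    PQfinish(repl);
}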

> The remote_worker:copy_set will then need to get a consistent COPY of the
> tables in the replication set such that any changes made to the tables after
> the copy is started get included in the replication stream.  The approach
> proposed in the DESIGN.TXT file with exporting a snapshot sounds okay for
> this.    I *think* slony could get by with something less fancy as well but
> it would be ugly.

The snapshot exporting isn't really that much additional work as we
already need to support most of it for keeping state across restarts.

> FAILOVER:
> ---------------
> A---->B
> |    .
> v  .
> C
>
> Today with slony, if B is a valid failover target then it is a forwarding
> node of the set.  This means that B keeps a record in sl_log of any changes
> originating on A until B knows that node C has received those changes.  In
> the event of a failover, if node C is far behind, it can just get the
> missing data from sl_log on node B (the failover target/new origin).
>
> I see a problem with what I have discussed above, B won't explicitly store
> the data from A in sl_log, a cascaded node would depend on B's WAL stream.
> The problem is that at FAILOVER time,  B might have processed some changes
> from A. Node  C might also be processing Node B's WAL stream for events (or
> data from another set).  Node C will discard/not receive the data for A's
> tables since it isn't subscribed to those tables from B.  What happens then
> if at some later point B and C receive the FAILOVER event.
> What does node C do? It can't get the missing data from node A because node
> A has failed, and it can't get it from node B because node C has already
> processed the WAL changes from node B that included the data but it
> ignored/discarded it.  Maybe node C could reprocess older WAL from node B?
> Maybe this forces us to keep an sl_log type structure around?

I fear you've left me behind here; sorry, I can't give you any input.

> Is it complete enough to build a prototype?
> ==========================
> I think so, the incomplete areas I see are the ones that mentioned in the
> patch submission including:
> * Snapshot exporting for the initial COPY
> * Spilling the reorder buffer to disk
>
> I think it would be possible to build a prototype without those even though
> we'd need them before I could build a production system.

> Conclusions
> =============
> I like this design much better than the original design from the spring that
> would have required keeping a catalog proxy on the decoding machine.  Based
> on what I've seen it should be possible to make slony use logical
> replication as a source for events instead of triggers populating sl_log.
> My thinking is that we want a way for logreceiver programs to pass
> arguments/parameters to the output plugins. Beyond that this looks like
> something slony can use.
>
>

Cool!

Don't hesitate to mention anything that you think would make your life
easier, chances are that you're not the only one who could benefit from
it...


Thanks,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Michael Paquier
Date:


On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi Michael,


On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> I have been able to fetch your code (thanks Andrea!) and test some of it. For the
> time being I am spending some time reading the code and understanding the
> whole set of features you are trying to implement inside core, even if I
> got some background from what you presented at PGCon and from the hackers
> ML.

Cool.

> Btw, as a first approach, I tried to run the logical log receiver plugged
> on a postgres server, and I am not able to make it work.

> Well, I am using settings similar to yours.
> # Run master
> rm -r ~/bin/pgsql/master/
> initdb -D ~/bin/pgsql/master/
> echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
> postgres -D ~/bin/pgsql/master \
>           -c wal_level=logical \
>           -c max_wal_senders=10 \
>           -c max_logical_slots=10 \
>           -c wal_keep_segments=100 \
>           -c log_line_prefix="[%p %x] "
> # Logical log receiver
> pg_receivellog -f $HOME/output.txt -d postgres -v
>
> After launching some SQLs, the logical receiver is stuck just after sending
> INIT_LOGICAL_REPLICATION, please see bt of process waiting:

It's waiting till it sees an initial xl_running_xacts record. The
easiest way to trigger one is to manually issue a checkpoint. Sorry, I
should have included that in the description.
Otherwise you can wait till the next routine checkpoint comes around...
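E.g.:

$ psql -h /tmp -c "CHECKPOINT;" postgres

(any superuser connection will do).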

I plan to cause more xl_running_xacts records to be logged in the
future. I think the timing of those currently is non-optimal; you have
the same problem as here in normal streaming replication as well :(

> I am just looking at this patch and will provide some comments.
> By the way, you forgot the installation part of pg_receivellog, please see
> patch attached.

That actually was somewhat intended, I thought people wouldn't like the
name and I didn't want a binary that's going to be replaced anyway lying
around ;)
OK no problem. For sure this is going to happen, I was wondering myself if it could be possible to merge pg_receivexlog and pg_receivellog into a single utility with multiple modes :)

Btw, here are some extra comments based on my progress; hope they will be useful for other people playing around with your patches.
1) It is necessary to install the contrib module test_decoding on the server side or the test case will not work.
2) I get the following logs on the server:
LOG:  forced to assume catalog changes for xid 1370 because it was running to early
WARNING:  ABORT 1370
Actually I saw that there are many warnings like this.
3) Assertion failure while running pgbench; I was just curious to see how it reacted when logical replication was put under a little bit of load.
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c", Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
4) As mentioned by Andres above, logical replication begins only once there is an xl_running_xacts record. I just forced a checkpoint manually.

With all those things done, I have been able to set up the system; for example, these queries:
postgres=# create table ac (a int);
CREATE TABLE
postgres=# insert into ac values (1);
INSERT 0 1
created the expected output:
BEGIN 32135
COMMIT 32135
BEGIN 32136
table "ac": INSERT: a[int4]:1
COMMIT 32136

Now it is time to dig into the code...
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3 - Source for Slony

From
Steve Singer
Date:
On 12-11-18 11:07 AM, Andres Freund wrote:
> Hi Steve!
>
>
> I think we should provide some glue code to do this, otherwise people
> will start replicating all the bugs I hacked into this... More
> seriously: I think we should have support code here, no user will want
> to learn the intracacies of feedback messages and such. Where that would
> live? No idea.

libpglogicalrep.so ?

> I wholeheartedly aggree. It should also be cleaned up a fair bit before
> others copy it should we not go for having some client side library.
>
> Imo the library could very roughly be something like:
>
> state = SetupStreamingLLog(replication-slot, ...);
> while((message = StreamingLLogNextMessage(state))
> {
>       write(outfd, message->data, message->length);
>       if (received_100_messages)
>       {
>            fsync(outfd);
>            StreamingLLogConfirm(message);
>       }
> }
>
> Although I guess thats not good enough because StreamingLLogNextMessage
> would be blocking, but that shouldn't be too hard to work around.
>

How about we pass a timeout value to StreamingLLogNextMessage(...), so
that it returns if no data is available after the timeout, giving the
caller a chance to do something else?

>> This is basically the Slony 2.2 sl_log format minus a few columns we no
>> longer need (txid, actionseq).
>> command_args is a postgresql text array of column=value pairs.  Ie [
>> {id=1},{name='steve'},{project='slony'}]
> It seems to me that that makes escaping unneccesarily complicated, but
> given you already have all the code... ;)

When I look at the actual code/representation we picked it is closer to 
{column1,value1,column2,value2...}



>> I don't t think our output plugin will be much more complicated than the
>> test_decoding plugin.
> Good. Thats the idea ;). Are you ok with the interface as it is now or
> would you like to change something?

I'm going to think about this some more and maybe try to write an 
example plugin before I can say anything with confidence.

>
> Yes. We will also need something like that. If you remember the first
> prototype we sent to the list, it included the concept of an
> 'origin_node' in wal record. I think you actually reviewed that one ;)
>
> That was exactly aimed at something like this...
>
> Since then my thoughts about how the origin_id looks like have changed a
> bit:
> - origin id is internally still represented as an uint32/Oid
>    - never visible outside of wal/system catalogs
> - externally visible it gets
>    - assigned an uuid
>    - optionally assigned a user defined name
> - user settable (permissions?) origin when executing sql:
>    - SET change_origin_uuid = 'uuid';
>    - SET change_origin_name = 'user-settable-name';
>    - defaults to the local node
> - decoding callbacks get passed the origin of a change
>    - txn->{origin_uuid, origin_name, origin_internal?}
> - the init decoding callback can setup an array of interesting origins,
>    so the others don't even get the ReorderBuffer treatment
>
> I have to thank the discussion on -hackers and a march through prague
> with Marko here...
So would the uuid and optional name assignment be done in the output
plugin or somewhere else?
When/how does the uuid get generated, and where do we store it so the
same uuid gets returned when postgres restarts?  Slony today stores all
this type of stuff in user-level tables and user-level functions 
(because it has no other choice).    What is the connection between 
these values and the 'slot-id' in your proposal for the init arguments? 
Does the slot-id need to be the external uuid of the other end or is 
there no direct connection?

Today slony allows us to replicate between two databases in the same 
postgresql cluster (I use this for testing all the time)
Slony also allows for two different 'slony clusters' to be setup in the 
same database (or so I'm told, I don't think I have ever tried this myself).

Plugin functions that let me query the local database and then return
the uuid and origin_name would work in this model.

+1 on being able to mark the 'change origin' in a SET command when the 
replication process is pushing data into the replica.

>> Exactly how we do this filtering is an open question,  I think the output
>> plugin will at a minimum need to know:
>>
>> a) What the slony node id is of the node it is running on.  This is easy to
>> figure out if the output plugin is able/allowed to query its database.  Will
>> this be possible? I would expect to be able to query the database as it
>> exists now(at plugin invocation time) not as it existed in the past when the
>> WAL was generated.   In addition to the node ID I can see us wanting to be
>> able to query other slony tables (sl_table,sl_set etc...)
> Hm. There is no fundamental reason not to allow normal database access
> to the current database but it won't be all that cheap, so doing it
> frequently is not a good idea.
> The reason its not cheap is that you basically need to teardown the
> postgres internal caches if you switch the timestream in which you are
> working.
>
> Would go something like:
>
> TransactionContext = AllocSetCreate(...);
> RevertFromDecodingSnapshot();
> InvalidateSystemCaches();
> StartTransactionCommand();
> /* do database work */
> CommitTransactionCommand();
> /* cleanup memory*/
> SetupDecodingSnapshot(snapshot, data);
> InvalidateSystemCaches();
>
> Why do you need to be able to query the present? I thought it might be
> neccesary to allow additional tables be accessed in a timetraveling
> manner, but not this way round.
> I guess an initial round of querying during plugin initialization won't
> be good enough?

For example my output plugin would want the list of replicated tables
(or the list of tables replicated to a particular replica). This list
can change over time.  As administrators issue commands to add or remove
tables from replication, or otherwise reshape the cluster, the output
plugin will need to know about this.  I MIGHT be able to get away with
having slon disconnect and reconnect on reconfiguration events so only
the init() call would need this data, but I am not sure.

One of the ways slony allows you to shoot your foot off is by changing 
certain configuration things (like dropping a table from a set) while a 
subscription is in progress.   Being able to timetravel the slony 
configuration tables might make this type of foot-gun a lot harder to 
encounter but that might be asking for too much.




>> b) What the slony node id is of the node we are streaming too.   It would be
>> nice if we could pass extra, arbitrary data/parameters to the output plugins
>> that could include that, or other things.  At the moment the
>> start_logical_replication rule in repl_gram.y doesn't allow for that but I
>> don't see why we couldn't make it do so.
> Yes, I think we want something like that. I even asked input on that
> recently ;):
> http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
>
> Input welcome!

How flexible will the datatypes for the arguments be? If I wanted to 
pass in a list of tables (ie an array?) could I?
Above I talked about having the init() or change() methods query the 
local database.  Another option might be to make the slon build up this 
data (by querying the database over a normal psql connection) and just 
pass the data in.   However that might mean passing in a list of a
few thousand table names, which doesn't sound like a good idea.

>
>> Even though, from a data-correctness point of view, slony could commit the
>> transaction on the replica after it sees the t1 commit, we won't want it to
>> do commits other than on a SYNC boundary.  This means that the replicas will
>> continue to move between consistent SYNC snapshots and that we can still
>> track the state/progress of replication by knowing what events (SYNC or
>> otherwise) have been confirmed.
> I don't know enough about slony internals, but: why? This will prohibit
> you from ever doing (per-transaction) synchronous replication...

A lot of this has to do with the stuff I discuss in the section below on
cluster reshaping that you didn't understand.  Slony depends on knowing
what data has, or hasn't, been sent to a replica at a particular event
id.  If 'some' transactions in between two SYNC events have committed
but not others, then slony has no idea what data it needs to get
elsewhere on a FAILOVER type event.  There might be a way to make this
work otherwise but I'm not sure what that is, or how long it would take
to debug out the issues.

> Cool! Don't hesitate to mention anything that you think would make you 
> life easier, chances are that youre not the only one who could benefit 
> from it... Thanks, Andres






Re: logical changeset generation v3

From
Andres Freund
Date:
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > I am just looking at this patch and will provide some comments.
> > > By the way, you forgot the installation part of pg_receivellog, please see
> > > patch attached.
> >
> > That actually was somewhat intended, I thought people wouldn't like the
> > name and I didn't want a binary that's going to be replaced anyway lying
> > around ;)
> >
> OK no problem. For sure this is going to happen, I was wondering myself if
> it could be possible to merge pg_receivexlog and pg_receivellog into a
> single utility with multiple modes :)

Don't really see that, the differences already are significant and imo
are bound to get bigger. Shouldn't live in pg_basebackup/ either..

> Btw, here are some extra comments based on my progress, hope it will be
> useful for other people playing around with your patches.
> 1) Necessary to install the contrib module test_decoding on server side or
> the test case will not work.
> 2) Obtention of the following logs on server:
> LOG:  forced to assume catalog changes for xid 1370 because it was running
> to early
> WARNING:  ABORT 1370
> Actually I saw that there are many warnings like this.

Those aren't unexpected. Perhaps I should not make it a warning then...

A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot from before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream, we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot at the moment we reach
consistency, one that contains exactly the changes we won't stream out.
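Schematically:

  START --(initial xl_running_xacts read)--> FULL_SNAPSHOT
        --(all xacts running at that point finished)--> CONSISTENT

with decoding starting in FULL_SNAPSHOT, and output to the plugin only
for transactions committing after CONSISTENT.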

Makes sense?

> 3) Assertion failure while running pgbench, I was  just curious to see how
> it reacted when logical replication was put under a little bit of load.
> TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
> Line: 877)
> => pgbench -i postgres; pgbench -T 500 -c 8 postgres

Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.

It implies that snapstate->nrrunning has lost touch with reality...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - Source for Slony

From
Andres Freund
Date:
Hi,

On 2012-11-19 19:50:32 -0500, Steve Singer wrote:
> On 12-11-18 11:07 AM, Andres Freund wrote:
> >I think we should provide some glue code to do this, otherwise people
> >will start replicating all the bugs I hacked into this... More
> >seriously: I think we should have support code here, no user will want
> >to learn the intracacies of feedback messages and such. Where that would
> >live? No idea.

> libpglogicalrep.so ?

Yea. We don't really have the infrastructure for that yet
though... Robert and I were just talking about that recently...


> >I wholeheartedly aggree. It should also be cleaned up a fair bit before
> >others copy it should we not go for having some client side library.
> >
> >Imo the library could very roughly be something like:
> >
> >state = SetupStreamingLLog(replication-slot, ...);
> >while((message = StreamingLLogNextMessage(state))
> >{
> >      write(outfd, message->data, message->length);
> >      if (received_100_messages)
> >      {
> >           fsync(outfd);
> >           StreamingLLogConfirm(message);
> >      }
> >}
> >
> >Although I guess thats not good enough because StreamingLLogNextMessage
> >would be blocking, but that shouldn't be too hard to work around.
> >
>
> How about we pass a timeout value to StreamingLLogNextMessage (..) where it
> returns if no data is available after the timeout to give the caller a
> chance to do something else.

Doesn't really integrate into the sort of loop that's often built around
poll(2), select(2) and similar. It probably should return NULL if
there's nothing there yet, and we should have a
StreamingLLogWaitForMessage() or such.
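Something like this, roughly (all the StreamingLLog* names are still the
made-up API from above, and StreamingLLogSocket() is one more invention
of mine, so this is a sketch under those assumptions, not working code):

#include <poll.h>
#include <unistd.h>

/* sketch only: none of the StreamingLLog* types/functions exist yet */
static void
apply_loop(StreamingLLogState *state, int outfd)
{
	int			unconfirmed = 0;

	for (;;)
	{
		/* NULL means "no complete message available yet", not EOF */
		StreamingLLogMessage *message = StreamingLLogNextMessage(state);

		if (message == NULL)
		{
			struct pollfd pfd;

			pfd.fd = StreamingLLogSocket(state);
			pfd.events = POLLIN;
			pfd.revents = 0;

			/* the caller can wait on its own fds in the same poll() here */
			poll(&pfd, 1, 10000 /* ms, also bounds feedback latency */);
			continue;
		}

		write(outfd, message->data, message->length);

		if (++unconfirmed >= 100)
		{
			fsync(outfd);
			StreamingLLogConfirm(message);
			unconfirmed = 0;
		}
	}
}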

> >>This is basically the Slony 2.2 sl_log format minus a few columns we no
> >>longer need (txid, actionseq).
> >>command_args is a postgresql text array of column=value pairs.  Ie [
> >>{id=1},{name='steve'},{project='slony'}]
> >It seems to me that that makes escaping unneccesarily complicated, but
> >given you already have all the code... ;)
>
> When I look at the actual code/representation we picked it is closer to
> {column1,value1,column2,value2...}

Still means you need to escape and later parse the columnN, valueN
values. I would have expected something like (length:data, length:data)+
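Just to illustrate why length prefixes make the escaping problem go away
entirely, a minimal standalone example (mine, not from the patchset):

#include <stdio.h>
#include <string.h>

/* append one field as (length:data); no quoting/escaping needed */
static char *
put_field(char *dst, const char *data)
{
	int			len = (int) strlen(data);

	memcpy(dst, &len, sizeof(len));			/* fixed-width length prefix */
	memcpy(dst + sizeof(len), data, len);	/* raw bytes, braces, commas and all */
	return dst + sizeof(len) + len;
}

int
main(void)
{
	char		buf[128];
	char	   *p = buf;

	p = put_field(p, "name");			/* column name */
	p = put_field(p, "st{eve,}");		/* value that would otherwise need escaping */
	printf("encoded %d bytes\n", (int) (p - buf));
	return 0;
}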

> >>I don't t think our output plugin will be much more complicated than the
> >>test_decoding plugin.
> >Good. Thats the idea ;). Are you ok with the interface as it is now or
> >would you like to change something?
>
> I'm going to think about this some more and maybe try to write an example
> plugin before I can say anything with confidence.

That would be very good.

> >Yes. We will also need something like that. If you remember the first
> >prototype we sent to the list, it included the concept of an
> >'origin_node' in wal record. I think you actually reviewed that one ;)
> >
> >That was exactly aimed at something like this...
> >
> >Since then my thoughts about how the origin_id looks like have changed a
> >bit:
> >- origin id is internally still represented as an uint32/Oid
> >   - never visible outside of wal/system catalogs
> >- externally visible it gets
> >   - assigned an uuid
> >   - optionally assigned a user defined name
> >- user settable (permissions?) origin when executing sql:
> >   - SET change_origin_uuid = 'uuid';
> >   - SET change_origin_name = 'user-settable-name';
> >   - defaults to the local node
> >- decoding callbacks get passed the origin of a change
> >   - txn->{origin_uuid, origin_name, origin_internal?}
> >- the init decoding callback can setup an array of interesting origins,
> >   so the others don't even get the ReorderBuffer treatment
> >
> >I have to thank the discussion on -hackers and a march through prague
> >with Marko here...

> So would the uuid and optional name assignment be done in the output plugin
> or some else?

That would be postgres infrastructure. The output plugin would get
passed at least uuid and name and potentially the internal name as well
(might be useful to build some internal caching of information).

> When/how does the uuid get generated and where do we store it so the same
> uuid gets returned when postgres restarts.  Slony today stores all this type
> of stuff in user-level tables and user-level functions (because it has no
> other choice).

Would need to be its own system catalog.

> What is the connection between these values and the
> 'slot-id' in your proposal for the init arguments? Does the slot-id need to
> be the external uuid of the other end or is there no direct connection?

None really. The "slot-id" is just an identifier for a
replication connection (which should live longer than a single
postmaster run) and carries information about the point up to which
you replicated. We need to manage some local resources based on that.

> Today slony allows us to replicate between two databases in the same
> postgresql cluster (I use this for testing all the time)
> Slony also allows for two different 'slony clusters' to be setup in the same
> database (or so I'm told, I don't think I have ever tried this myself).

Yuck. I haven't thought about this very much. I honestly don't see
support for the first case right now. The second shouldn't be too hard,
we already have the database oid available everywhere we need it.

> plugin functions that let me query the local database and then return the
> uuid and origin_name would work in this model.

Should be possible.

> +1 on being able to mark the 'change origin' in a SET command when the
> replication process is pushing data into the replica.

Good.

>
> >>Exactly how we do this filtering is an open question,  I think the output
> >>plugin will at a minimum need to know:
> >>
> >>a) What the slony node id is of the node it is running on.  This is easy to
> >>figure out if the output plugin is able/allowed to query its database.  Will
> >>this be possible? I would expect to be able to query the database as it
> >>exists now(at plugin invocation time) not as it existed in the past when the
> >>WAL was generated.   In addition to the node ID I can see us wanting to be
> >>able to query other slony tables (sl_table,sl_set etc...)
> >Hm. There is no fundamental reason not to allow normal database access
> >to the current database but it won't be all that cheap, so doing it
> >frequently is not a good idea.
> >The reason its not cheap is that you basically need to teardown the
> >postgres internal caches if you switch the timestream in which you are
> >working.
> >
> >Would go something like:
> >
> >TransactionContext = AllocSetCreate(...);
> >RevertFromDecodingSnapshot();
> >InvalidateSystemCaches();
> >StartTransactionCommand();
> >/* do database work */
> >CommitTransactionCommand();
> >/* cleanup memory*/
> >SetupDecodingSnapshot(snapshot, data);
> >InvalidateSystemCaches();
> >
> >Why do you need to be able to query the present? I thought it might be
> >neccesary to allow additional tables be accessed in a timetraveling
> >manner, but not this way round.
> >I guess an initial round of querying during plugin initialization won't
> >be good enough?
>
> For example my output plugin would want the list of replicated tables (or
> the list of tables replicated to a particular replica). This list can change
> over time.  As administrators issue commands to add or remove tables to
> replication or otherwise reshape the cluster the output plugin will need to
> know about this.  I MIGHT be able to get away with having slon disconnect
> and reconnect on reconfiguration events so only the init() call would need
> this data, but I am not sure.
>
> One of the ways slony allows you to shoot your foot off is by changing
> certain configuration things (like dropping a table from a set) while a
> subscription is in progress.   Being able to timetravel the slony
> configuration tables might make this type of foot-gun a lot harder to
> encounter but that might be asking for too much.

Actually timetravel access to those tables is considerably
easier/faster. I wanted to provide such tables anyway (because you need
them to safely write your own pg_enum-alike types). It means that you
log slightly more (32 + sizeof(XLogRecord), afair) per modified row.

> >>b) What the slony node id is of the node we are streaming too.   It would be
> >>nice if we could pass extra, arbitrary data/parameters to the output plugins
> >>that could include that, or other things.  At the moment the
> >>start_logical_replication rule in repl_gram.y doesn't allow for that but I
> >>don't see why we couldn't make it do so.
> >Yes, I think we want something like that. I even asked input on that
> >recently ;):
> >http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
> >
> >Input welcome!
>
> How flexible will the datatypes for the arguments be? If I wanted to pass in
> a list of tables (ie an array?) could I?

I was thinking of just a textual (key = value, ...) style list, similar
to options to EXPLAIN, COPY et al.
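E.g., hypothetically (invented syntax, only to illustrate the style):

START_LOGICAL_REPLICATION 'slot' 0/DEADBEEF (receiver_node '2', only_sets '1,3');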

> Above I talked about having the init() or change() methods query the local
> database.  Another option might be to make the slon build up this data (by
> querying the database over a normal psql connection) and just passing the
> data in.   However that might mean passing in a list of a few thousand table
> names, which doesn't sound like a good idea.

No, it certainly doesn't.

> >
> >>Even though, from a data-correctness point of view, slony could commit the
> >>transaction on the replica after it sees the t1 commit, we won't want it to
> >>do commits other than on a SYNC boundary.  This means that the replicas will
> >>continue to move between consistent SYNC snapshots and that we can still
> >>track the state/progress of replication by knowing what events (SYNC or
> >>otherwise) have been confirmed.
> >I don't know enough about slony internals, but: why? This will prohibit
> >you from ever doing (per-transaction) synchronous replication...
>
> A lot of this has to do with the stuff I discuss in the section below on
> cluster reshaping that you didn't understand.  Slony depends on knowing what
> data has , or hasn't been sent to a replica at a particular event id.  If
> 'some' transactions in between two SYNC events have committed but not others
> then slony has no idea what data it needs to get elsewhere on a FAILOVER
> type event.  There might be a way to make this work otherwise but I'm not
> sure what that is and how long it will take to debug out the issues.

Ah, it starts to make sense.

The way I solved that issue in the prototype from around PGCon was that
I included the LSN of the original commit record of the remote
transaction in the commit record of the local transaction (with the
origin_id set to the remote side). That made it trivial to restore the
exact state of replication after a crash, even with
synchronous_commit=off, as during replay you could simply ensure the
replication-surely-received-lsn of every remote side was up to date.
Then you can simply do a START_LOGICAL_REPLICATION 'slot'
<just-recovered lsn>; and restart applying (*not*
INIT_LOGICAL_REPLICATION).
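As a made-up example: if the highest origin-lsn replayed in local commit
records for that remote side was 0/1234560, the apply process would just
reconnect with

START_LOGICAL_REPLICATION 'slot' 0/1234560;

and continue applying from there.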


Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Michael Paquier
Date:


On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Those aren't unexpected. Perhaps I should not make it a warning then...

A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot from before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream, we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot at the moment we reach
consistency, one that contains exactly the changes we won't stream out.

Makes sense?

> 3) Assertion failure while running pgbench, I was  just curious to see how
> it reacted when logical replication was put under a little bit of load.
> TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
> Line: 877)
> => pgbench -i postgres; pgbench -T 500 -c 8 postgres

Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.

It implies that snapstate->nrrunning has lost touch with reality...
Yes, I can reproduce it in 10-20 seconds on one of my Linux boxes. I don't have any log output, but here is the backtrace from the core file produced.
 #2  0x0000000000865145 in ExceptionalCondition (conditionName=0xa15100 "!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", errorType=0xa14f3b "FailedAssertion",
    fileName=0xa14ed0 "snapbuild.c", lineNumber=877) at assert.c:54
#3  0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10, xid=0) at snapbuild.c:877
#4  0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80, snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5  0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
#6  0x00000000007080b9 in DecodeRecordIntoReorderBuffer (reader=0x1a08300, state=0x19e4e20, buf=0x1a0a368) at decode.c:84
#7  0x0000000000708cad in replay_finished_record (state=0x1a08300, buf=0x1a0a368) at logicalfuncs.c:54
#8  0x00000000004d8033 in XLogReaderRead (state=0x1a08300) at xlogreader.c:965
#9  0x000000000070f7c3 in XLogSendLogical (caughtup=0x7fffb22c35fb "") at walsender.c:1721
#10 0x000000000070ea05 in WalSndLoop (send_data=0x70f6e2 <XLogSendLogical>) at walsender.c:1184
#11 0x000000000070e0eb in StartLogicalReplication (cmd=0x190eaa8) at walsender.c:726
#12 0x000000000070e3ac in exec_replication_command (cmd_string=0x19a65c8 "START_LOGICAL_REPLICATION 'id-0' 0/7E1855C") at walsender.c:853
#13 0x0000000000753ee0 in PostgresMain (argc=2, argv=0x18f63d8, username=0x18f62a8 "michael") at postgres.c:3974
#14 0x00000000006f13ea in BackendRun (port=0x1912600) at postmaster.c:3668
#15 0x00000000006f0b76 in BackendStartup (port=0x1912600) at postmaster.c:3352
#16 0x00000000006ed900 in ServerLoop () at postmaster.c:1431
#17 0x00000000006ed208 in PostmasterMain (argc=13, argv=0x18f40a0) at postmaster.c:1180
#18 0x0000000000657517 in main (argc=13, argv=0x18f40a0) at main.c:197
I'm keeping this core and the binary btw.
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Michael Paquier
Date:


On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > I am just looking at this patch and will provide some comments.
> > > By the way, you forgot the installation part of pg_receivellog, please see
> > > patch attached.
> >
> > That actually was somewhat intended, I thought people wouldn't like the
> > name and I didn't want a binary that's going to be replaced anyway lying
> > around ;)
> >
> OK no problem. For sure this is going to happen, I was wondering myself if
> it could be possible to merge pg_receivexlog and pg_receivellog into a
> single utility with multiple modes :)

Don't really see that, the differences already are significant and imo
are bound to get bigger. Shouldn't live in pg_basebackup/ either...
I am sure that this will be the subject of many future discussions.
 
> Btw, here are some extra comments based on my progress, hope it will be
> useful for other people playing around with your patches.
> 1) Necessary to install the contrib module test_decoding on server side or
> the test case will not work.
> 2) Obtention of the following logs on server:
> LOG:  forced to assume catalog changes for xid 1370 because it was running
> to early
> WARNING:  ABORT 1370
> Actually I saw that there are many warnings like this.

Those aren't unexpected. Perhaps I should not make it a warning then...
A NOTICE would be more appropriate; a WARNING means that something that may endanger the system has happened, but as far as I understand from your explanation this is not the case.
 
A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot from before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
OK, so this snapshot only needs to include the XIDs of transactions that have modified the catalogs. Did I get that right? This way you are able to fetch the correct relation definition for replication decoding.

Just thinking, but... it looks like a waste to store the XIDs of all the committed transactions; on the other hand there is no way to track the XIDs of transactions that modified a catalog in the current core code. So yes, this approach is better, as refining the transaction XID tracking for snapshot reconstruction is something that could be improved later. Those are only thoughts though...

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Just to be clear, lsn means the log sequence number associated with each xlog record?
 
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream, we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot at the moment we reach
consistency, one that contains exactly the changes we won't stream out.

Makes sense?
OK, got it. Thanks for your explanation.

So, coming back to it: the snapshot we build only needs the XIDs of transactions that modified the catalogs in order to get a consistent view of relation info for decoding.
I really think that refining the XID tracking to minimize the size of the snapshot built for decoding would be key for performance, especially for OLTP-type applications (lots of transactions involved, few of them touching catalogs).
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Andres Freund
Date:
On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > Btw, here are some extra comments based on my progress, hope it will be
> > > useful for other people playing around with your patches.
> > > 1) Necessary to install the contrib module test_decoding on server side
> > or
> > > the test case will not work.
> > > 2) Obtention of the following logs on server:
> > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > running
> > > to early
> > > WARNING:  ABORT 1370
> > > Actually I saw that there are many warnings like this.
> >
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> A NOTICE would be more adapted, a WARNING means that something that may
> endanger the system has happened, but as far as I understand from your
> explanation this is not the case.

I think it should go to DEBUG2 or so once we're a bit more confident
about the code.

> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed transactions
> > which changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> OK, so this snapshot only needs to include the XIDs of transactions that
> have modified the catalogs. Do I get it right? This way you are able to
> fetch the correct relation definition for replication decoding.

Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
so it's not all of them. Normal snapshots carry all in-progress
transaction ids instead of the committed ones, but that would have been
far more in our case (only a minority of txns touch the catalog) and it
has problems with subtransaction tracking.
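(For instance, with purely hypothetical numbers: if recentXmin = 1000 and
newestCatalogModifyingTxn = 1100, and only xids 1005 and 1050 touched the
catalog, the snapshot carries just {1005, 1050} rather than every
transaction in progress somewhere in that window.)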

> Just thinking but... It looks to be a waste to store the transactions XIDs
> of all the committed transactions, but on the other hand there is no way to
> track the XIDs of transactions that modified a catalog in current core
> code. So yes this approach is better as refining the transaction XID
> tracking for snapshot reconstruction is something that could be improved
> later. Those are only thoughts though...

We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.

> The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> >
> Just to be clear, lsn means the log-sequence number associated to each xlog
> record?

Yes. And that number is just the position in the stream.

Greetings,

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Andres Freund
Date:
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed transactions
> > which changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench, I was  just curious to see
> > how
> > > it reacted when logical replication was put under a little bit of load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > "snapbuild.c",
> > > Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> 
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
> outputted anything in the logs, but here is the backtrace of the core file
> produced.

Could you run it with log_min_messages=DEBUG2?

Do you run pgbench after you've reached a consistent state (by issuing a
manual checkpoint)?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Michael Paquier
Date:


On Wed, Nov 21, 2012 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
> outputted anything in the logs, but here is the backtrace of the core file
> produced.

Could you run it with log_min_messages=DEBUG2?
Let me try.
Do you run pgbench after you've reached a consistent state (by issuing a
manual checkpoint)?
Yes. I issue a manual checkpoint to initialize the replication.
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Andres Freund
Date:
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed transactions
> > which changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench, I was  just curious to see
> > how
> > > it reacted when logical replication was put under a little bit of load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > "snapbuild.c",
> > > Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> 
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
> outputted anything in the logs, but here is the backtrace of the core file
> produced.

Ah, I see. Could you try the following diff?

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index df24b33..797a126 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder, Snapstate *snapstate,
 		 */
 		snapstate->transactions_after = buf->origptr;
 
+		snapstate->nrrunning = running->xcnt;
 		snapstate->xmin_running = InvalidTransactionId;
 		snapstate->xmax_running = InvalidTransactionId;
 


Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Michael Paquier
Date:


On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > Btw, here are some extra comments based on my progress, hope it will be
> > > useful for other people playing around with your patches.
> > > 1) Necessary to install the contrib module test_decoding on server side
> > or
> > > the test case will not work.
> > > 2) Obtention of the following logs on server:
> > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > running
> > > to early
> > > WARNING:  ABORT 1370
> > > Actually I saw that there are many warnings like this.
> >
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> A NOTICE would be more adapted, a WARNING means that something that may
> endanger the system has happened, but as far as I understand from your
> explanation this is not the case.

I think it should go DEBUG2 or so once were a bit more confident about
the code.

> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed transactions
> > which changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> OK, so this snapshot only needs to include the XIDs of transactions that
> have modified the catalogs. Do I get it right? This way you are able to
> fetch the correct relation definition for replication decoding.

Yes. We only carry those between (recenXmin, newestCatalogModifyingTxn),
so its not all of them. Normal snapshots carry all in-progress
transactionids instead of the committed ones, but that would have been
far more in our case (only a minority of txn's touch the catalog) and it
has problems with subtransaction tracking.
Hum. I might have missed something, but what is the variable tracking the newest XID that modified catalogs?
I can of course see recentXmin in snapmgr.c, but nothing related to what you describe.
 

> Just thinking but... It looks to be a waste to store the transactions XIDs
> of all the committed transactions, but on the other hand there is no way to
> track the XIDs of transactions that modified a catalog in current core
> code. So yes this approach is better as refining the transaction XID
> tracking for snapshot reconstruction is something that could be improved
> later. Those are only thoughts though...

We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.
How do you track them? I think I need to go deeper into the code before asking more...
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Andres Freund
Date:
On 2012-11-21 16:47:11 +0900, Michael Paquier wrote:
> On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> > > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com
> > >wrote:
> > >
> > > > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > > > Btw, here are some extra comments based on my progress, hope it will
> > be
> > > > > useful for other people playing around with your patches.
> > > > > 1) Necessary to install the contrib module test_decoding on server
> > side
> > > > or
> > > > > the test case will not work.
> > > > > 2) Obtention of the following logs on server:
> > > > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > > > running
> > > > > to early
> > > > > WARNING:  ABORT 1370
> > > > > Actually I saw that there are many warnings like this.
> > > >
> > > > Those aren't unexpected. Perhaps I should not make it a warning then...
> > > >
> > > A NOTICE would be more adapted, a WARNING means that something that may
> > > endanger the system has happened, but as far as I understand from your
> > > explanation this is not the case.
> >
> > I think it should go DEBUG2 or so once were a bit more confident about
> > the code.
> >
> > > > A short explanation:
> > > >
> > > > We can only decode tuples we see in the WAL when we already have a
> > > > timetravel catalog snapshot before that transaction started. To build
> > > > such a snapshot we need to collect information about committed transactions
> > > > which changed the catalog. Unfortunately we can't diagnose whether a txn
> > > > changed the catalog without a snapshot so we just assume all committed
> > > > ones do - it just costs a bit of memory. Thats the background of the
> > > > "forced to assume catalog changes for ..." message.
> > > >
> > > OK, so this snapshot only needs to include the XIDs of transactions that
> > > have modified the catalogs. Do I get it right? This way you are able to
> > > fetch the correct relation definition for replication decoding.
> >
> > Yes. We only carry those between (recenXmin, newestCatalogModifyingTxn),
> > so its not all of them. Normal snapshots carry all in-progress
> > transactionids instead of the committed ones, but that would have been
> > far more in our case (only a minority of txn's touch the catalog) and it
> > has problems with subtransaction tracking.
> >
> Hum. I might have missed something but what is the variable tracking the
> newest XID that modified catalogs.
> I can see of course recentXmin in snapmgr.c but nothing related to what you
> describe.

We determine that ourselves.

SnapBuildCommitTxn(Snapstate *snapstate, ReorderBuffer *reorder,
                   XLogRecPtr lsn, TransactionId xid,
                   int nsubxacts, TransactionId *subxacts)
{
...
	if (forced_timetravel || top_does_timetravel || sub_does_timetravel)
	{
		if (!TransactionIdIsValid(snapstate->xmax) ||
			NormalTransactionIdFollows(xid, snapstate->xmax))
		{
			snapstate->xmax = xid;
			TransactionIdAdvance(snapstate->xmax);
		}

> > > Just thinking but... It looks to be a waste to store the transactions
> > XIDs
> > > of all the committed transactions, but on the other hand there is no way
> > to
> > > track the XIDs of transactions that modified a catalog in current core
> > > code. So yes this approach is better as refining the transaction XID
> > > tracking for snapshot reconstruction is something that could be improved
> > > later. Those are only thoughts though...
> >
> > We actually only track xids of catalog modifying transactions once we
> > hit the CONSISTENT state. Before the initial snapshot we can't detect
> > that.
> >
> How to you track them? I think I need to go deeper in the code before
> asking more...

You mean, how do I detect that they are catalog modifying? By asking the
reorderbuffer (ReorderBufferXidDoesTimetravel(...)). That one knows
because we told it so (ReorderBufferXidSetTimetravel()), and we do that
by looking at the types of records we've seen coming in for that xid
(HEAP_INPLACE and HEAP2_NEW_CID tell us it's doing timetravel).
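I.e., in the decoding code, roughly (paraphrased from memory, so don't
take the exact shape literally):

	case XLOG_HEAP2_NEW_CID:
		/* xid created a new catalog tuple version -> needs timetravel */
		ReorderBufferXidSetTimetravel(reorder, xid);
		break;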

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Michael Paquier
Date:
On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
>
> Ah, I see. Could you try the following diff?
>
> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
> index df24b33..797a126 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder, Snapstate *snapstate,
>  		 */
>  		snapstate->transactions_after = buf->origptr;
>
> +		snapstate->nrrunning = running->xcnt;
>  		snapstate->xmin_running = InvalidTransactionId;
>  		snapstate->xmax_running = InvalidTransactionId;
I am still getting the same assertion failure even with this diff included.
--
Michael Paquier
http://michael.otacoo.com

Re: [PATCH 03/14] Add simple xlogdump tool

From
Jeff Janes
Date:
On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
>> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > ---
>> >  src/bin/Makefile            |   2 +-
>> >  src/bin/xlogdump/Makefile   |  25 +++
>> >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
>> >  3 files changed, 494 insertions(+), 1 deletion(-)
>> >  create mode 100644 src/bin/xlogdump/Makefile
>> >  create mode 100644 src/bin/xlogdump/xlogdump.c
>>
>> Is this intended to be the successor of
>> https://github.com/snaga/xlogdump which will then be deprecated?
>
> As-is this is just a development tool which was sorely needed for the
> development of this patchset. But yes I think that once ready
> (xlogreader infrastructure, *_desc routines split) it should
> definitely be able to do most of what the above xlogdump can do and it
> should live in bin/. I think mostly some filtering is missing.
>
> That doesn't really "deprecate" the above though.
>
> Does that answer your question?

Yes, I think so.  Thanks.

(I've just recently gotten the original xlogdump to work for me in
9.2, and I had been wondering if back-porting yours to 9.2 would have
been an easier way to go.)

Cheers,

Jeff



Re: [PATCH 03/14] Add simple xlogdump tool

From
Andres Freund
Date:
On 2012-11-21 14:57:14 -0800, Jeff Janes wrote:
> On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
> >> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > ---
> >> >  src/bin/Makefile            |   2 +-
> >> >  src/bin/xlogdump/Makefile   |  25 +++
> >> >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
> >> >  3 files changed, 494 insertions(+), 1 deletion(-)
> >> >  create mode 100644 src/bin/xlogdump/Makefile
> >> >  create mode 100644 src/bin/xlogdump/xlogdump.c
> >>
> >> Is this intended to be the successor of
> >> https://github.com/snaga/xlogdump which will then be deprecated?
> >
> > As-is this is just a development tool which was sorely needed for the
> > development of this patchset. But yes I think that once ready
> > (xlogreader infrastructure, *_desc routines split) it should
> > definitely be able to do most of what the above xlogdump can do and it
> > should live in bin/. I think mostly some filtering is missing.
> >
> > That doesn't really "deprecate" the above though.
> >
> > Does that answer your question?
> 
> Yes, I think so.  Thanks.
> 
> (I've just recently gotten the original xlogdump to work for me in
> 9.2, and I had been wondering if back-porting yours to 9.2 would have
> been an easier way to go.)

I don't think you would have much fun doing so - the WAL format changes
between 9.2 and 9.3 make this larger than one might think. I had a
version that worked with the previous format but there have been some
interface changes since then...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Andres Freund
Date:
On 2012-11-21 18:35:34 +0900, Michael Paquier wrote:
> On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> >
> > Ah, I see. Could you try the following diff?
> >
> > diff --git a/src/backend/replication/logical/snapbuild.c
> > b/src/backend/replication/logical/snapbuild.c
> > index df24b33..797a126 100644
> > --- a/src/backend/replication/logical/snapbuild.c
> > +++ b/src/backend/replication/logical/snapbuild.c
> > @@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
> > Snapstate *snapstate,
> >                  */
> >                 snapstate->transactions_after = buf->origptr;
> >
> > +               snapstate->nrrunning = running->xcnt;
> >                 snapstate->xmin_running = InvalidTransactionId;
> >                 snapstate->xmax_running = InvalidTransactionId;
> >
> I am still getting the same assertion failure even with this diff included.

I really don't understand what's going on here then. You've said you made
sure that there is a catalog snapshot. Which means you would need
something like:
WARNING:  connecting to postgres
WARNING:  Initiating logical rep
LOG:  computed new xmin: 16566894
LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG:  found initial snapshot (via running xacts). Done: 1
WARNING:  reached consistent point, stopping!
WARNING:  Starting logical replication
LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG:  found initial snapshot (via running xacts). Done: 1

in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace you've posted:

#3  0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10, xid=0) at snapbuild.c:877
#4  0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80, snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5  0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732

shows pretty clearly that snapstate *can't* be consistent, because line 387ff is:

    else if (snapstate->state < SNAPBUILD_CONSISTENT &&
             SnapBuildTxnIsRunning(snapstate, xid))
        ;
so #3 #4 can't happen at those line numbers with state == CONSISTENT.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Michael Paquier
Date:


On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I really don't understand what's going on here then. You've said you made
sure that there is a catalog snapshot. Which means you would need
something like:
WARNING:  connecting to postgres
WARNING:  Initiating logical rep
LOG:  computed new xmin: 16566894
LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG:  found initial snapshot (via running xacts). Done: 1
WARNING:  reached consistent point, stopping!
WARNING:  Starting logical replication
LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG:  found initial snapshot (via running xacts). Done: 1

in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace you've posted:

#3  0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10,xid=0) at snapbuild.c:877
#4  0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5  0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732

shows pretty clearly that snapstate *can't* be consistent because line 387ff is:
      else if (snapstate->state < SNAPBUILD_CONSISTENT &&
               SnapBuildTxnIsRunning(snapstate, xid))
                ;
so #3 #4 can't happen at those line numbers with state == CONSISTENT.
Still this *impossible* thing happens.
Here is some more information from the logs I get on the server side:

Yes, logical replication is correctly initialized:
[629 0] LOG:  database system was shut down at 2012-11-22 09:02:42 JST
[628 0] LOG:  database system is ready to accept connections
[633 0] LOG:  autovacuum launcher started
[648 0] WARNING:  connecting to postgres
[648 0] WARNING:  Initiating logical rep
[648 0] LOG:  computed new xmin: 684
[648 0] LOG:  start reading from 0/178C1B8, scrolled back to 0/178C000

And I am also getting logs of this type with pg_receivellog:
BEGIN 698
table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8 abalance[int4]:-3641 filler[bpchar]:
table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10 tbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559 delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651 filler[bpchar]:(null)
COMMIT 698

Until the assertion failure:
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c", Line: 878)
I still have the core file and its binary at hand, so I can send them if you want.
I have not been able to read your code yet, but there must be something you are missing.

Thanks,
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From
Andres Freund
Date:
On 2012-11-22 09:13:30 +0900, Michael Paquier wrote:
> On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > I really don't understand what's going on here then. You've said you made
> > sure that there is a catalog snapshot. Which means you would need
> > something like:
> > WARNING:  connecting to postgres
> > WARNING:  Initiating logical rep
> > LOG:  computed new xmin: 16566894
> > LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
> > LOG:  found initial snapshot (via running xacts). Done: 1
> > WARNING:  reached consistent point, stopping!
> > WARNING:  Starting logical replication
> > LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
> > LOG:  found initial snapshot (via running xacts). Done: 1
> >
> > in the log *and* it means that snapbuild->state has to be
> > CONSISTENT. But the backtrace you've posted:
> >
> > #3  0x000000000070c409 in SnapBuildTxnIsRunning
> > (snapstate=0x19e4f10,xid=0) at snapbuild.c:877
> > #4  0x000000000070b8e4 in SnapBuildProcessChange
> > (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368,
> > relfilenode=0x1a0a450) at snapbuild.c:388
> > #5  0x000000000070c088 in SnapBuildDecodeCallback
> > (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
> >
> > shows pretty clearly that snapstate *can't* be consistent because line
> > 387ff is:
> >       else if (snapstate->state < SNAPBUILD_CONSISTENT &&
> >                SnapBuildTxnIsRunning(snapstate, xid))
> >                 ;
> > so #3 #4 can't happen at those line numbers with state == CONSISTENT.
> >
> Still this *impossible* thing happens.
> Here is some more information from the logs I get on the server side:
>
> Yes, logical replication is correctly initialized:
> [629 0] LOG:  database system was shut down at 2012-11-22 09:02:42 JST
> [628 0] LOG:  database system is ready to accept connections
> [633 0] LOG:  autovacuum launcher started
> [648 0] WARNING:  connecting to postgres
> [648 0] WARNING:  Initiating logical rep
> [648 0] LOG:  computed new xmin: 684
> [648 0] LOG:  start reading from 0/178C1B8, scrolled back to 0/178C000

Ok, so you've not yet reached a consistent point.

Which means this shouldn't yet be written out:

> And I am also getting logs of this type with pg_receivellog:
> BEGIN 698
> table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8
> abalance[int4]:-3641 filler[bpchar]:
> table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10
> tbalance[int4]:-3641 filler[bpchar]:(null)
> table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641
> filler[bpchar]:(null)
> table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559
> delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651
> filler[bpchar]:(null)
> COMMIT 698

That could already be a good enough hint; let me check tomorrow.

> I still have the core file and its binary at hand if you want, so can send
> them at will.

If those aren't too big, it's worth a try...

Greetings,

Andres

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Alvaro Herrera
Date:
This is very much the same as the previous patch, except it has been
rebased to the latest master.


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment
On 15 November 2012 12:07, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
>
>> To avoid complicating logic we store both the toplevel xids and the subxids
>> in ->xip: first the ->xcnt toplevel ones, then the ->subxcnt subxids.
>
> That looks good, not much change. Will apply in next few days. Please
> add me as committer and mark ready.

I tried improving this, but couldn't. So I've committed it as is.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From
Steve Singer
Date:
<div class="moz-cite-prefix">On 12-11-14 08:17 PM, Andres Freund wrote:<br /></div><br /> I am getting errors like the
followingwhen I try to use either your test_decoding plugin or my own (which does even less than yours)<br /><br /><br
/>LOG:  database system is ready to accept connections<br /> LOG:  autovacuum launcher started<br /> WARNING: 
connectingto <br /> WARNING:  Initiating logical rep<br /> LOG:  computed new xmin: 773<br /> LOG:  start reading from
0/17F5D58,scrolled back to 0/17F4000<br /> LOG:  got new xmin 773 at 25124280<br /> LOG:  found initial snapshot (via
runningxacts). Done: 1<br /> WARNING:  reached consistent point, stopping!<br /> WARNING:  Starting logical
replication<br/> LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000<br /> LOG:  found initial snapshot (via
runningxacts). Done: 1<br /> FATAL:  cannot read pg_class without having selected a database<br /> TRAP:
FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))",File: "proc.c", Line: 759)<br /><br /> This seems
tobe happening under the calls at<br /> reorderbuffer.c:832    if (!SnapBuildHasCatalogChanges(NULL, xid,
&change->relnode))<br/><br /> The sequence of events I do is:<br /> 1. start pg_receivellog<br /> 2. run a
checkpoint<br/> 3. Attach to the walsender process with gdb<br /> 4. Start a new client connection with psql and do
'INSERTINTO a values (1)' twice.<br /><br /> (skipping step 3 doesn't make a difference)<br /><br /><br /><br /><br />
I<br/><br /><br /><blockquote cite="mid:1352942234-3953-11-git-send-email-andres@2ndquadrant.com" type="cite"><pre
wrap="">
This introduces several things:
* 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
* 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
* logging more data into wal to facilitate logical decoding
* wal decoding into an reorderbuffer
* shared library output plugins with 5 callbacks* init* begin* change* commit
* walsender infrastructur to stream out changes and to keep the global xmin low enough* INIT_LOGICAL_REPLICATION
$plugin;waits till a consistent snapshot is built and returns  * initial LSN  * replication slot identifier  * id of a
pg_export()style snapshot* START_LOGICAL_REPLICATION $id $lsn; streams out changes* uses named output plugins for
outputspecification 

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside.
* more frequent xl_running_xid's so xmin can be upped more frequently
* add STOP_LOGICAL_REPLICATION $id
---src/backend/access/heap/heapam.c                |  280 +++++-src/backend/access/transam/xlog.c               |    1
+src/backend/catalog/index.c                    |   74 ++src/backend/replication/Makefile                |    2
+src/backend/replication/logical/Makefile       |   19 +src/backend/replication/logical/decode.c        |  496
++++++++++src/backend/replication/logical/logicalfuncs.c |  247 +++++src/backend/replication/logical/reorderbuffer.c |
1156+++++++++++++++++++++++src/backend/replication/logical/snapbuild.c     | 1144
++++++++++++++++++++++src/backend/replication/repl_gram.y            |   32 +-src/backend/replication/repl_scanner.l
     |    2 +src/backend/replication/walsender.c             |  566 ++++++++++-src/backend/storage/ipc/procarray.c
      |   23 +src/backend/storage/ipc/standby.c               |    8 +-src/backend/utils/cache/inval.c
|   2 +-src/backend/utils/cache/relcache.c              |    3 +-src/backend/utils/misc/guc.c                    |   11
+src/backend/utils/time/tqual.c                 |  249 +++++src/bin/pg_controldata/pg_controldata.c         |    2
+src/include/access/heapam_xlog.h               |   23 +src/include/access/transam.h                    |    5
+src/include/access/xlog.h                      |    3 +-src/include/catalog/index.h                     |    4
+src/include/nodes/nodes.h                      |    2 +src/include/nodes/replnodes.h                   |   22
+src/include/replication/decode.h               |   21 +src/include/replication/logicalfuncs.h          |   44
+src/include/replication/output_plugin.h        |   76 ++src/include/replication/reorderbuffer.h         |  284
++++++src/include/replication/snapbuild.h            |  128 +++src/include/replication/walsender.h             |    1
+src/include/replication/walsender_private.h    |   34 +-src/include/storage/itemptr.h                   |    3
+src/include/storage/sinval.h                   |    2 +src/include/utils/tqual.h                       |   31 +-35
fileschanged, 4966 insertions(+), 34 deletions(-)create mode 100644 src/backend/replication/logical/Makefilecreate mode
100644src/backend/replication/logical/decode.ccreate mode 100644 src/backend/replication/logical/logicalfuncs.ccreate
mode100644 src/backend/replication/logical/reorderbuffer.ccreate mode 100644
src/backend/replication/logical/snapbuild.ccreatemode 100644 src/include/replication/decode.hcreate mode 100644
src/include/replication/logicalfuncs.hcreatemode 100644 src/include/replication/output_plugin.hcreate mode 100644
src/include/replication/reorderbuffer.hcreatemode 100644 src/include/replication/snapbuild.h 

</pre><br /><fieldset class="mimeAttachmentHeader"></fieldset><br /><pre wrap="">
</pre></blockquote><br />

Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From
Andres Freund
Date:
Hi Steve,

On 2012-12-02 21:52:08 -0500, Steve Singer wrote:
> On 12-11-14 08:17 PM, Andres Freund wrote:
>
> I am getting errors like the following when I try to use either your
> test_decoding plugin or my own (which does even less than yours)
>
>
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> WARNING:  connecting to
> WARNING:  Initiating logical rep
> LOG:  computed new xmin: 773
> LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> LOG:  got new xmin 773 at 25124280
> LOG:  found initial snapshot (via running xacts). Done: 1
> WARNING:  reached consistent point, stopping!
> WARNING:  Starting logical replication
> LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> LOG:  found initial snapshot (via running xacts). Done: 1
> FATAL:  cannot read pg_class without having selected a database
> TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File:
> "proc.c", Line: 759)
>
> This seems to be happening under the calls at
> reorderbuffer.c:832    if (!SnapBuildHasCatalogChanges(NULL, xid,
> &change->relnode))

Two things:
1) Which exact options are you using for pg_receivellog? Not "-d
replication" by any chance?
2) Could you check that you really have a fully clean build? That has
hit me in the past as well.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From
Andres Freund
Date:
On 2012-12-03 13:22:12 +0100, Andres Freund wrote:
> Hi Steve,
>
> On 2012-12-02 21:52:08 -0500, Steve Singer wrote:
> > On 12-11-14 08:17 PM, Andres Freund wrote:
> >
> > I am getting errors like the following when I try to use either your
> > test_decoding plugin or my own (which does even less than yours)
> >
> >
> > LOG:  database system is ready to accept connections
> > LOG:  autovacuum launcher started
> > WARNING:  connecting to
> > WARNING:  Initiating logical rep
> > LOG:  computed new xmin: 773
> > LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> > LOG:  got new xmin 773 at 25124280
> > LOG:  found initial snapshot (via running xacts). Done: 1
> > WARNING:  reached consistent point, stopping!
> > WARNING:  Starting logical replication
> > LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> > LOG:  found initial snapshot (via running xacts). Done: 1
> > FATAL:  cannot read pg_class without having selected a database
> > TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File:
> > "proc.c", Line: 759)
> >
> > This seems to be happening under the calls at
> > reorderbuffer.c:832    if (!SnapBuildHasCatalogChanges(NULL, xid,
> > &change->relnode))
>
> Two things:
> 1) Which exact options are you using for pg_receivellog? Not "-d
> replication" by any chance?

This seems to produce exactly that kind of error message. I added some
error checking around that. Pushed.

Thanks!

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From
Steve Singer
Date:
On 12-12-03 07:42 AM, Andres Freund wrote:
> Two things:
> 1) Which exact options are you using for pg_receivellog? Not "-d
> replication" by any chance?

Yes, that is exactly what I'm doing.  Using a real database name instead
makes this go away.

Thanks

> This seems to produce exactly that kind of error message. I added some
> error checking around that. Pushed.
>
> Thanks!
>
> Andres Freund
>
> --
>   Andres Freund                       http://www.2ndQuadrant.com/
>   PostgreSQL Development, 24x7 Support, Training & Services
>
>




Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From
Andres Freund
Date:
On 2012-12-03 09:35:55 -0500, Steve Singer wrote:
> On 12-12-03 07:42 AM, Andres Freund wrote:
> >Two things:
> >1) Which exact options are you using for pg_receivellog? Not "-d
> >replication" by any chance?
>
> Yes, that is exactly what I'm doing.  Using a real database name instead
> makes this go away.

Was using "replication" an accident or do you think it makes sense in
some way?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From
Steve Singer
Date:
On 12-12-03 09:48 AM, Andres Freund wrote:
> On 2012-12-03 09:35:55 -0500, Steve Singer wrote:
>> On 12-12-03 07:42 AM, Andres Freund wrote:
>> Yes, that is exactly what I'm doing. Using a real database name
>> instead makes this go away.
> Was using "replication" an accident or do you think it makes sense in
> some way?

The 'replication' line in pg_hba.conf made me think that the database
name had to be 'replication' for walsender connections.
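For reference, the "replication" word in pg_hba.conf is a keyword that
matches replication connections, not a database name. A minimal sketch of
a working setup (the auth method and address here are assumptions, adjust
to taste):

    # pg_hba.conf: "replication" is a keyword here, not a database
    local   replication     all                         trust
    host    replication     all     127.0.0.1/32        trust

The -d argument of the client, on the other hand, has to name a real
database to connect to.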

> Greetings,
>
> Andres Freund
>
> --
>   Andres Freund                       http://www.2ndQuadrant.com/
>   PostgreSQL Development, 24x7 Support, Training & Services
>
>




Re: [PATCH 03/14] Add simple xlogdump tool

From
Andres Freund
Date:
Hi,

I tried to address most (all?) of your comments in the version from
http://archives.postgresql.org/message-id/20121204175212.GB12055%40awork2.anarazel.de

On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
> > +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
> > +    $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>
> This looks pretty evil, and there is no documentation about what it is
> supposed to do.
>
> Windows build support needs some thought.

Ok, since Alvaro made it possible, the build now only has rules like:

xlogreader.c: % : $(top_srcdir)/src/backend/access/transam/%
	rm -f $@ && $(LN_S) $< .

clogdesc.c: % : $(top_srcdir)/src/backend/access/rmgrdesc/%
	rm -f $@ && $(LN_S) $< .

and

OBJS = \
	clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
	mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
	standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o assert.o \
	$(WIN32RES) \
	pg_xlogdump.o pqexpbuf_strinfo.o compat.o tables.o xlogreader.o \

pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
	$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)

That should be easier to integrate into the Windows build?

> > +static void
> > +usage(void)
> > +{
> > +    printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
> > +           progname);
> > +    printf(_("Usage:\n"));
> > +    printf(_("  %s [OPTION]...\n"), progname);
> > +    printf(_("\nOptions:\n"));
> > +    printf(_("  -v, --version          output version information, then exit\n"));
> > +    printf(_("  -h, --help             show this help, then exit\n"));
> > +    printf(_("  -s, --start            from where recptr onwards to read\n"));
> > +    printf(_("  -e, --end              up to which recptr to read\n"));
> > +    printf(_("  -t, --timeline         which timeline do we want to read\n"));
> > +    printf(_("  -i, --inpath           from where do we want to read? cwd/pg_xlog is the default\n"));
> > +    printf(_("  -o, --output           where to write [start, end]\n"));
> > +    printf(_("  -f, --file             wal file to parse\n"));
> > +}
>
> Options list should be in alphabetic order (or some other less random
> order).  Most of these descriptions are not very intelligible (at least
> without additional documentation).

I tried to improve the help; it's now:

pg_xlogdump: reads/writes postgres transaction logs for debugging.

Usage: pg_xlogdump [OPTION]...

Options:
  -b, --bkp-details      output detailed information about backup blocks
  -e, --end RECPTR       read wal up to RECPTR
  -f, --file FILE        wal file to parse, cannot be specified together with -p
  -h, --help             show this help, then exit
  -p, --path PATH        from where do we want to read? cwd/pg_xlog is the default
  -s, --start RECPTR     read wal in directory indicated by -p starting at RECPTR
  -t, --timeline TLI     which timeline do we want to read, defaults to 1
  -v, --version          output version information, then exit

I wonder whether it would make sense to split the help into different
sections? It seems likely we will gain some more options...

> no nls.mk

Do I need to do anything for that besides:

# src/bin/pg_xlogdump/nls.mk
CATALOG_NAME     = pg_xlogdump
AVAIL_LANGUAGES  =
GETTEXT_FILES    = pg_xlogdump.c

?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - git repository

From
Andres Freund
Date:
On 2012-11-15 02:26:53 +0100, Andres Freund wrote:
> On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
> > In response to this you will soon find the 14 patches that currently
> > implement $subject.
>
> As it's not very wieldy to send around that many/big patches all the
> time, until the next "major" version I will just update the git tree at:
>
> Web:
>
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3
>
> Git:
> git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3

I pushed a new version which

- is rebased on top of master
- is based on top of the new xlogreader (biggest part)
- is based on top of the new binaryheap.h
- some fixes
- some more comments

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Andres Freund
Date:
On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> >Features:
> >- streaming reading/writing
> >- filtering
> >- reassembly of records
> >
> >Reusing the ReadRecord infrastructure in situations where the code that wants
> >to do so is not tightly integrated into xlog.c is rather hard and would require
> >changes to rather integral parts of the recovery code which doesn't seem to be
> >a good idea.
> >
> >Missing:
> >- "compressing" the stream when removing uninteresting records
> >- writing out correct CRCs
> >- separating reader/writer
>
> I'm disappointed to see that there has been no progress on this patch since
> last commitfest. I thought we agreed on the approach I championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
> wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

The patch (git repo) is now based on top of my version of your xlogreader
version...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3

From
Andres Freund
Date:
Hi,

On 2012-11-19 09:50:30 +0100, Andres Freund wrote:
> On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > After launching some SQLs, the logical receiver is stuck just after sending
> > INIT_LOGICAL_REPLICATION, please see bt of process waiting:
>
> It's waiting till it sees an initial xl_running_xacts record. The
> easiest way to do that is to manually issue a checkpoint. Sorry, I should
> have included that in the description.
> Otherwise you can wait till the next routine checkpoint comes around...
>
> I plan to cause more xl_running_xacts records to be logged in the
> future. I think the timing of those currently is non-optimal, you have
> the same problem as here in normal streaming replication as well :(

This is "fixed" now with the changes I pushed a second ago. Unless some
longrunning transactions are arround we now immediately jump to a ready
state.
This is achieved by
1. regularly logging xl_running_xacts (in background writer)
2. logging xl_runnign_xacts at the beginning of INIT_LOGICAL_REPLICATION

This also has the advantage that the xmin horizon can be increased much
more frequently.
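
(On builds without these changes, the workaround mentioned upthread still
applies - forcing a checkpoint emits the needed record right away:)

    -- make an xl_running_xacts record appear immediately
    CHECKPOINT;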

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
(Offlist)

Just a quick note that I'm working on this patch now. I pushed some 
trivial fixes to my git repository at 
git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3 
branch.

- Heikki



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
On 10.12.2012 22:22, Heikki Linnakangas wrote:
> (Offlist)
>
> Just a quick note that I'm working on this patch now. I pushed some
> trivial fixes to my git repository at
> git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3
> branch.

Oops, wasn't offlist :-). Well, if anyone wants to take a look, feel free.

- Heikki



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
I've been molding this patch for a while now, here's what I have this 
far (also available in my git repository).

The biggest change is in the error reporting. A stand-alone program that 
wants to use xlogreader.c no longer has to provide a full-blown 
replacement for ereport(). The only thing xlogreader.c used
ereport() for was when it encountered an invalid record. And even there we
had the emode_for_corrupt_record hack. I think it's a much better API 
that XLogReadRecord just returns NULL on an invalid record, and an error 
string, and the caller can do what it wants with that. In xlog.c, we'll 
pass the error string to ereport(), with the right emode as determined 
by emode_for_corrupt_record. xlog.c is no longer concerned with 
emode_for_corrupt_record, or error levels in general.
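
In caller terms the new contract looks roughly like this (a sketch based
on the description above; the xlogreader state variable, the retry logic
and the handler name are assumptions):

    char       *errormsg;
    XLogRecord *record;

    record = XLogReadRecord(xlogreader, RecPtr, &errormsg);
    if (record == NULL)
    {
        /*
         * Invalid record: the reader reports why, the caller decides the
         * severity - xlog.c routes this through ereport() with the emode
         * picked by emode_for_corrupt_record(), other callers do whatever
         * fits them.
         */
        if (errormsg)
            handle_invalid_record(errormsg);    /* hypothetical handler */
    }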

We talked about this earlier, and Tom Lane argued that "it's basically 
insane to imagine that you can carve out a non-trivial piece of the 
backend that doesn't contain any elog calls." 
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but 
having done just that, it doesn't seem insane to me. xlogreader.c really 
is a pretty well contained piece of code. All the complicated stuff that 
contains elog calls and pallocs and more is in the callback, which can 
freely use all the normal backend infrastructure.

Now, here's some stuff that still needs to be done:

* A stand-alone program using xlogreader.c has to provide an 
implementation of tliInHistory(). Need to find a better way to do that. 
Perhaps "#ifndef FRONTEND" the tliInHistory checks in xlogreader.

* In xlog.c, some of the variables that used to be statics like 
readFile, readOff etc. are now in the XLogPageReadPrivate struct. But 
there's still plenty of statics left in there - it would certainly not 
work correctly if xlog.c tried to open two xlog files at the same time. 
I think it's just confusing to have some stuff in the 
XLogPageReadPrivate struct, and others as static, so I think we should 
get rid of XLogPageReadPrivate struct altogether and put back the static 
variables. At least it would make the diff smaller, which might help 
with reviewing. xlog.c probably doesn't need to provide a "private" 
struct to xlogreader.c at all, which is okay.

* It's pretty ugly that to use the rm_desc functions, you have to 
provide dummy implementations of a bunch of backend functions, including 
pfree() and timestamptz_to_str(). Should find a better way to do that.

* It's not clear to me how we'd handle translating the strings in 
xlogreader.c, when xlogreader.c is used in a stand-alone program like 
pg_xlogdump. Maybe we can just punt on that...

* How about we move pg_xlogdump to contrib? It doesn't feel like the 
kind of essential tool that deserves to be in src/bin.

- Heikki



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
Forgot attachment..

On 11.12.2012 15:55, Heikki Linnakangas wrote:
> I've been molding this patch for a while now, here's what I have this
> far (also available in my git repository).
>
> The biggest change is in the error reporting. A stand-alone program that
> wants to use xlogreader.c no longer has to provide a full-blown
> replacement for ereport(). The only thing xlogreader.c used
> ereport() for was when it encountered an invalid record. And even there we
> had the emode_for_corrupt_record hack. I think it's a much better API
> that XLogReadRecord just returns NULL on an invalid record, and an error
> string, and the caller can do what it wants with that. In xlog.c, we'll
> pass the error string to ereport(), with the right emode as determined
> by emode_for_corrupt_record. xlog.c is no longer concerned with
> emode_for_corrupt_record, or error levels in general.
>
> We talked about this earlier, and Tom Lane argued that "it's basically
> insane to imagine that you can carve out a non-trivial piece of the
> backend that doesn't contain any elog calls."
> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
> having done just that, it doesn't seem insane to me. xlogreader.c really
> is a pretty well contained piece of code. All the complicated stuff that
> contains elog calls and pallocs and more is in the callback, which can
> freely use all the normal backend infrastructure.
>
> Now, here's some stuff that still needs to be done:
>
> * A stand-alone program using xlogreader.c has to provide an
> implementation of tliInHistory(). Need to find a better way to do that.
> Perhaps "#ifndef FRONTEND" the tliInHistory checks in xlogreader.
>
> * In xlog.c, some of the variables that used to be statics like
> readFile, readOff etc. are now in the XLogPageReadPrivate struct. But
> there's still plenty of statics left in there - it would certainly not
> work correctly if xlog.c tried to open two xlog files at the same time.
> I think it's just confusing to have some stuff in the
> XLogPageReadPrivate struct, and others as static, so I think we should
> get rid of XLogPageReadPrivate struct altogether and put back the static
> variables. At least it would make the diff smaller, which might help
> with reviewing. xlog.c probably doesn't need to provide a "private"
> struct to xlogreader.c at all, which is okay.
>
> * It's pretty ugly that to use the rm_desc functions, you have to
> provide dummy implementations of a bunch of backend functions, including
> pfree() and timestamptz_to_str(). Should find a better way to do that.
>
> * It's not clear to me how we'd handle translating the strings in
> xlogreader.c, when xlogreader.c is used in a stand-alone program like
> pg_xlogdump. Maybe we can just punt on that...
>
> * How about we move pg_xlogdump to contrib? It doesn't feel like the
> kind of essential tool that deserves to be in src/bin.
>
> - Heikki


--
- Heikki

Attachment

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Andres Freund
Date:
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
> I've been molding this patch for a while now, here's what I have this far
> (also available in my git repository).

On a very quick look this looks good. I will try to rebase the decoding stuff
and read a bit around in the course of that...

> The biggest change is in the error reporting. A stand-alone program that
> wants to use xlogreader.c no longer has to provide a full-blown replacement
> for ereport(). The only thing xlogreader.c used ereport() for was when it
> encountered an invalid record. And even there we had the
> emode_for_corrupt_record hack. I think it's a much better API that
> XLogReadRecord just returns NULL on an invalid record, and an error string,
> and the caller can do what it wants with that. In xlog.c, we'll pass the
> error string to ereport(), with the right emode as determined by
> emode_for_corrupt_record. xlog.c is no longer concerned with
> emode_for_corrupt_record, or error levels in general.

> We talked about this earlier, and Tom Lane argued that "it's basically
> insane to imagine that you can carve out a non-trivial piece of the backend
> that doesn't contain any elog calls."
> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
> having done just that, it doesn't seem insane to me. xlogreader.c really is
> a pretty well contained piece of code. All the complicated stuff that
> contains elog calls and pallocs and more is in the callback, which can
> freely use all the normal backend infrastructure.

This is pretty good. I was a bit afraid of making this change - thus my
really ugly emode callback hack - but this is way better.

> Now, here's some stuff that still needs to be done:
>
> * A stand-alone program using xlogreader.c has to provide an implementation
> of tliInHistory(). Need to find a better way to do that. Perhaps "#ifndef
> FRONTEND" the tliInHistory checks in xlogreader.

We could just leave it in xlogreader in the first place. Having an
#ifdef'ed-out version in there seems schizophrenic to me - all the
maintenance overhead, none of the fun...

> * In xlog.c, some of the variables that used to be statics like
> readFile, readOff etc. are now in the XLogPageReadPrivate struct. But
> there's still plenty of statics left in there - it would certainly not
> work correctly if xlog.c tried to open two xlog files at the same
> time. I think it's just confusing to have some stuff in the
> XLogPageReadPrivate struct, and others as static, so I think we should
> get rid of XLogPageReadPrivate struct altogether and put back the
> static variables. At least it would make the diff smaller, which might
> help with reviewing. xlog.c probably doesn't need to provide a
> "private" struct to xlogreader.c at all, which is okay.

Fine with me. I find the grouping that the struct provides somewhat
helpful when reading the code, but that's more than offset by duplicating
some of the variables.
The reasons to have it are fewer compared to when you introduced the
struct - xlogreader does more now, so less needs to be handled outside.

> * It's pretty ugly that to use the rm_desc functions, you have to provide
> dummy implementations of a bunch of backend functions, including pfree() and
> timestamptz_to_str(). Should find a better way to do that.

I think most of the cases requiring those ugly hacks can be fixed to
just use a caller-provided buffer; there's not that much left.
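
For example, the desc routines could append into a buffer the caller owns
instead of returning palloc'd strings - a sketch of that idea (the
StringInfo-based signature is an assumption, not what the patch does
today):

    /* caller owns buf; no palloc/pfree needed inside the desc routine */
    static void
    sample_desc(StringInfo buf, uint8 xl_info, char *rec)
    {
        appendStringInfoString(buf, "sample: describe the record into buf");
    }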

timestamptz_to_str() is probably the most complex case. I just noticed
there's already a second implementation in
ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
a number of cases :(

> * It's not clear to me how we'd handle translating the strings in
> xlogreader.c, when xlogreader.c is used in a stand-alone program like
> pg_xlogdump. Maybe we can just punt on that...

I personally would have no problem with that. It's probably either going
to be used during in-depth debugging or when developing pg. In both
cases English seems to be fine. But I seldom use translated programs,
so maybe I am not the right person to ask.

> * How about we move pg_xlogdump to contrib? It doesn't feel like the kind of
> essential tool that deserves to be in src/bin.

contrib would be fine, but I think src/bin is better. There have been
quite a few bugs by now where it would have been useful to have a
reliable xlogdump in core so it's really installed.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On 2012-12-11 15:44:39 +0100, Andres Freund wrote:
> On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
> > * It's pretty ugly that to use the rm_desc functions, you have to provide
> > dummy implementations of a bunch of backend functions, including pfree() and
> > timestamptz_to_str(). Should find a better way to do that.
>
> I think most of the cases requiring those ugly hacks can be fixed to
> just use a caller-provided buffer, there's not that much left.
>
> timestamptz_to_str() is probably the most complex case. I just noticed
> there's already a second implementation in
> ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
> a number of cases :(

The attached (and pushed) patches change relpathbackend to use a static buffer
instead. That gets rid of the pfree() requirement and looks ok otherwise as
well.

Unfortunately that still leaves us with the need to re-implement
relpathbackend() in xlogdump, but that seems somewhat ok to me.
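
The shape of the change is roughly this (a sketch only - the real function
builds the full tablespace/backend/fork-aware path, which is elided here):

    char *
    relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
    {
        /*
         * Static buffer instead of palloc'd memory: callers need no
         * pfree(), at the price that each call overwrites the last result.
         */
        static char path[MAXPGPATH];

        snprintf(path, sizeof(path), "base/%u/%u", rnode.dbNode, rnode.relNode);
        return path;
    }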

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On 2012-12-11 16:43:12 +0100, Andres Freund wrote:
> On 2012-12-11 15:44:39 +0100, Andres Freund wrote:
> > On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
> > > * It's pretty ugly that to use the rm_desc functions, you have to provide
> > > dummy implementations of a bunch of backend functions, including pfree() and
> > > timestamptz_to_str(). Should find a better way to do that.
> >
> > I think most of the cases requiring those ugly hacks can be fixed to
> > just use a caller-provided buffer, there's not that much left.
> >
> > timestamptz_to_str() is probably the most complex case. I just noticed
> > there's already a second implementation in
> > ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
> > a number of cases :(
>
> The attached (and pushed) patches change relpathbackend to use a static buffer
> instead. That gets rid of the pfree() requirement and looks ok otherwise as
> well.

... really attached.

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Andres Freund
Date:
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
> I've been molding this patch for a while now, here's what I have this far
> (also available in my git repository).
>
> The biggest change is in the error reporting. A stand-alone program that
> wants to use xlogreader.c no longer has to provide a full-blown replacement
> for ereport(). The only thing xlogreader.c used ereport() for was when it
> encountered an invalid record. And even there we had the
> emode_for_corrupt_record hack. I think it's a much better API that
> XLogReadRecord just returns NULL on an invalid record, and an error string,
> and the caller can do what it wants with that. In xlog.c, we'll pass the
> error string to ereport(), with the right emode as determined by
> emode_for_corrupt_record. xlog.c is no longer concerned with
> emode_for_corrupt_record, or error levels in general.
>
> We talked about this earlier, and Tom Lane argued that "it's basically
> insane to imagine that you can carve out a non-trivial piece of the backend
> that doesn't contain any elog calls."
> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
> having done just that, it doesn't seem insane to me. xlogreader.c really is
> a pretty well contained piece of code. All the complicated stuff that
> contains elog calls and pallocs and more is in the callback, which can
> freely use all the normal backend infrastructure.

Now that I have read some of that code, I am currently unsure how the
current implementation of this can cooperate with translation, even when
used from the backend?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Tue, Dec 11, 2012 at 9:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> * How about we move pg_xlogdump to contrib? It doesn't feel like the kind of
>> essential tool that deserves to be in src/bin.
>
> contrib would be fine, but I think src/bin is better. There have been
> quite a few bugs by now where it would have been useful to have a
> reliable xlogdump in core so it's really installed.

I think I'm with Heikki on this one.  Dumping xlog data is useful, but
it's really for developers and troubleshooters, not something we
expect people to do on a regular basis, so contrib seems appropriate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Dimitri Fontaine
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think I'm with Heikki on this one.  Dumping xlog data is useful, but
> it's really for developers and troubleshooters, not something we
> expect people to do on a regular basis, so contrib seems appropriate.

There are two downsides to contrib rather than src/bin. First,
maintenance and user trust are more easily achieved in src/bin.
Second, a lot of users won't install contribs in their production server
and will then miss the tool when they need it most. In some places
getting new software installed on a certified production setup is not
easy.

I would agree to get that piece in contrib if we were to work again on
separating contribs into "production monitoring and diagnosis",
"production ready extra" (hstore) and the rest, basically.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think I'm with Heikki on this one.  Dumping xlog data is useful, but
>> it's really for developers and troubleshooters, not something we
>> expect people to do on a regular basis, so contrib seems appropriate.

> There are two downsides to contrib rather than src/bin. First,
> maintenance and user trust are more easily achieved in src/bin.

User trust, maybe, but the "maintenance" argument seems bogus.
We ship contrib on the same release schedule as core.

> Second, a lot of users won't install contribs in their production server
> and will then miss the tool when they need it most.

TBH, I don't believe that ordinary users will need this tool at all,
ever, and thus I don't want it in src/bin/.  From a packaging standpoint
it will be a lot easier if it's in contrib ... otherwise I'll probably
have to invent some new sub-RPM along the lines of postgresql-extras
so as to avoid bloating the core server package.
        regards, tom lane



Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Dimitri Fontaine
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> User trust, maybe, but the "maintenance" argument seems bogus.
> We ship contrib on the same release schedule as core.

I meant maintenance as in updating the code when it needs to be; I'm not
sure contrib systematically receives the same care as core. I have
no data to back my feeling, though.

> TBH, I don't believe that ordinary users will need this tool at all,
> ever, and thus I don't want it in src/bin/.  From a packaging standpoint
> it will be a lot easier if it's in contrib ... otherwise I'll probably
> have to invent some new sub-RPM along the lines of postgresql-extras
> so as to avoid bloating the core server package.

Oh. I didn't know that the server package would be considered bloated by
anyone, or that this would impact the way we ship our binaries.

What about splitting contrib *officially* then, not just in your RH
packages, and having postgresql-server-extra-diagnosis, -extra-data-types
and -contrib with the things you typically don't want in production?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



On 11 December 2012 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> TBH, I don't believe that ordinary users will need this tool at all,
> ever, and thus I don't want it in src/bin/.  From a packaging standpoint
> it will be a lot easier if it's in contrib ... otherwise I'll probably
> have to invent some new sub-RPM along the lines of postgresql-extras
> so as to avoid bloating the core server package.

I happen to agree that pg_xlogdump belongs in contrib, but I think
that the importance of avoiding "bloat" has been overstated. Maybe
it's slightly useful to make sure that Postgres can get on the Fedora
CD, but that aside, is including pg_xlogdump here, for example, really
likely to make any appreciable difference package-wise?

pg_xlogdump is 141K on my system. I'd hate to see us embrace the exact
opposite tendency, towards including everything but the kitchen sink,
but at the same time that seems like a very insignificant size.
Perhaps people who live in countries with less bandwidth care about
these things more.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Dimitri Fontaine
Date:
Peter Geoghegan <peter@2ndquadrant.com> writes:
> Perhaps people who live in countries with less bandwidth care about
> these things more.

The day they will need it is not the day the bandwidth will magically
increase, is all I'm saying. Better to have it around just in case you
get WAL corruption because of a crappy RAID controller or whatnot.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



On 2012-12-11 22:52:09 +0000, Peter Geoghegan wrote:
> On 11 December 2012 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > TBH, I don't believe that ordinary users will need this tool at all,
> > ever, and thus I don't want it in src/bin/.  From a packaging standpoint
> > it will be a lot easier if it's in contrib ... otherwise I'll probably
> > have to invent some new sub-RPM along the lines of postgresql-extras
> > so as to avoid bloating the core server package.
>
> I happen to agree that pg_xlogdump belongs in contrib

Ok, I think there has been clear support for putting it into contrib; I
can comfortably live with that even though I would prefer otherwise. So
let's concentrate on other things ;)

> pg_xlogdump is 141K on my system. I'd hate to see us embrace the exact
> opposite tendency, towards including everything but the kitchen sink,
> but at the same time that seems like a very insignificant size.
> Perhaps people who live in countries with less bandwidth care about
> these things more.

Optimized and stripped - which is what most distros do - it's 40k
here. Gzipped - as in packages - it's only 20k on its own. So it's even
smaller ;)

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - git repository

From
Peter Geoghegan
Date:
On 9 December 2012 19:14, Andres Freund <andres@2ndquadrant.com> wrote:
> I pushed a new version which
>
> - is rebased on top of master
> - is based on top of the new xlogreader (biggest part)
> - is based on top of the new binaryheap.h
> - some fixes
> - some more comments

I decided to take another look at this, following my earlier reviews
of a substantial subset of earlier versions of this patch, including
my earlier reviews of WAL decoding [1] and focused review of snapshot
building [2] (itself a subset of WAL decoding). I think it was the
right time to consolidate your multiple earlier patches, because some
of the earlier BDR patches were committed (including "Rearrange
storage of data in xl_running_xacts" [3], "Basic binary heap
implementation" [4], "Embedded list interface" [5], and, though it
isn't touched on here and is technically entirely distinct
functionality, "Background worker processes" [6]). Furthermore, now
that we've gotten past some early rounds of reviewing, it makes sense
to build a *perfectly* (rather than just approximately) atomic unit,
as we work towards something that is actually committable.

So what's the footprint of this big, newly rebased feature branch?
Well, though some of these changes are uncommitted stuff from Heikki
(i.e. XLogReader, which you've modified), and some of this is README
documentation, the footprint is very large. I merged master with your
dev branch (last commit of yours,
743f3af081209f784a30270bdf49301e9e242b78, made on Mon 10th Dec 15:35),
and the stats are:
91 files changed, 9736 insertions(+), 956 deletions(-)

Note that there is a relatively large number of files affected in part
because the tqual interface was bashed around a bit for the benefit of
logical decoding - a lot of the changes to each of those 91 files are
completely trivial.

I'm very glad that you followed my earlier recommendation of splitting
your demo logical changeset consumer into a contrib module, in the
spirit of contrib/spi, etc. This module, "test_decoding", represents a
logical entry point, if you will, for the entire patch. As unwieldy as
it may appear to be, the patch is (or at least *should be*) ultimately
reducible to some infrastructural changes to core to facilitate this
example logical change-set consumer.

test_decoding contrib module
============================
contrib/Makefile                      |    1 +
contrib/test_decoding/Makefile        |   16 +
contrib/test_decoding/test_decoding.c |  192 ++++

Once again, because test_decoding is a kind of "entry point", it gives
me a nice point to continually refer back to when talking about this
patch. (Incidentally, maybe test_decoding should be called
pg_decoding?).

The regression tests pass, though this isn't all that surprising,
since frankly the test coverage of this patch appears to be quite low.
I know that you're working with Abhijit on improvements to the
isolation tester to verify the correctness of the patch as it relates
to supporting actual, practical logical replication systems. I would
very much welcome any such test coverage (even if it wasn't in a
committable state), since in effect you're asking me to take a leap of
faith in respect of how well this infrastructure will support such
systems – previously, I obliged you and didn't focus on concurrency
and serializability concerns (it was sufficient to print out values/do
some decoding in a toy function), but it's time to take a closer look
at those now, I feel. test_decoding is a client of the logical
change-set producing infrastructure, and there appears to be broad
agreement that that infrastructure needs to treat such consumers in a
way that is maximally abstract. My question is, just how abstract does
this interface have to be, really? How well are you going to support
the use-case of a real logical replication system?

Now, maybe it's just that I haven't been paying attention (in
particular, to the discussion surrounding [3] – though that commit
doesn't appear to have been justified in terms of commit ordering in
BDR at all), but I would like you to be more demonstrative of certain
things, like:

1. Just what does a logical change-set consumer look like? What things
are always true of one, and never true of one?
2. Please describe in as much detail as possible the concurrency
issues with respect to logical replication systems. Please make
verifiable, testable claims as to how well these issues are considered
here, perhaps with reference to the previous remarks of subject-matter
experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9]
following my earlier review.

I'm not all that impressed with where test_decoding is at right now.
There is still essentially no documentation. I think it's notable that
you don't really touch the ReorderBufferTXN passed by the core system
in the test_decoding plugin.

test_decoding and pg_receivellog
========================

I surmised that the way that the test_decoding module is intended to
be used is as a client of receivellog.c (*not* receivelog.c – that
naming is *confusing*, perhaps call it receivelogiclog.c or something.
Better still, make receivexlog handle the logical case rather than
inventing a new tool). The reason for receivellog.c existing, as you
yourself put it, is:

+ /*
+  * We have to use postgres.h not postgres_fe.h here, because there's so much
+  * backend-only stuff in the XLOG include files we need.  But we need a
+  * frontend-ish environment otherwise.    Hence this ugly hack.
+  */

So receivellog.c is part of a new utility called pg_receivellog, in
much the same way as receivexlog.c is part of the existing
pg_receivexlog utility (see commit
b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're
talking about these changes:
 src/backend/utils/misc/guc.c            |   11 +
 src/bin/pg_basebackup/Makefile          |    7 +-
 src/bin/pg_basebackup/pg_basebackup.c   |    4 +-
 src/bin/pg_basebackup/pg_receivellog.c  |  717 ++++++++++++
 src/bin/pg_basebackup/pg_receivexlog.c  |    4 +-
 src/bin/pg_basebackup/receivelog.c      |    4 +-
 src/bin/pg_basebackup/streamutil.c      |    3 +-
 src/bin/pg_basebackup/streamutil.h      |    1 +

So far, so good. Incidentally, you forgot to do this:
 install: all installdirs
 	$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
 	$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+	$(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

So this creates a new binary executable, pg_receivellog, which is
described as “the pg_receivexlog equivalent for logical changes”. Much
like pg_receivexlog, pg_receivellog issues special new replication
protocol commands for logical replication, which account for your
changes to the replication protocol grammar and lexer (i.e.
walsender):
 src/backend/replication/repl_gram.y                |   32 +-
 src/backend/replication/repl_scanner.l             |    2 +

You say:

+ /* This is is just for demonstration, don't ever use this code for
anything real! */

uh, why not? What is the purpose of a contrib module, if not to serve
as a minimal example?

So, I went to play with pg_receivellog, and I got lots of output like this:

[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING:  Initiating logical rep
pg_receivellog: could not init logical rep: got 0 rows and 0 fields,
expected 1 rows and 4 fields
pg_receivellog: disconnected. Waiting 5 seconds to try again.

Evidently you expected me to see this message:

+     if (!walsnd)
+     {
+         elog(ERROR, "couldn't find free logical slot. free one or increase
max_logical_slots");
+     }

If I did, that might have been okay. I didn't though, presumably
because the “walsnd” variable was wild/uninitialised.

So, I went and set max_logical_slots to something higher than 0, and
restarted. pg_receivellog behaved itself this time.

In one terminal:

[peter@peterlaptop decode]$ tty
/dev/pts/0
[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING:  Initiating logical rep
WARNING:  reached consistent point, stopping!
WARNING:  Starting logical replication

In another:

[peter@peterlaptop decode]$ tty
/dev/pts/1
[peter@peterlaptop decode]$ psql
Expanded display is used automatically.
psql (9.3devel)
Type "help" for help.

postgres=# insert into b values(66,64);
INSERT 0 1
postgres=# \q
[peter@peterlaptop decode]$ cat test.log
BEGIN 1910
table "b": INSERT: i[int4]:66 j[int4]:64
COMMIT 1910

We're subscribed to logical changes, and everything looks about right.
We have a toy demo of a logical change-set subscriber.

I wondered how this had actually worked. Since test_decoding had done
nothing more than expose some functions, without registering any
callback in the conventional way (hooks, etc), how could it have
worked? That brings me to the interface used by plugins like this
test_decoding.

Plugin interface
===========
So test_decoding uses various types of caches and catalogs. I'm mostly
worried about the core BDR interface that it uses, more-so than this
other stuff. I'm talking about:
src/include/replication/output_plugin.h |   76 ++

One minor gripe is that output_plugin.h isn't going to pass muster
with cpluspluscheck (private is a C++ keyword). There are more serious
problems, though. In particular, I'm quite perplexed at some of the
code that “installs” the test_decoding plugin.

The test_decoding module is hard-coded within pg_receivellog thusly
(the SCONST token here could name an arbitrary module):

+     res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");

Furthermore, the names of particular test_decoding routines are hard
coded into core, using libdl/PG_MODULE_MAGIC introspection:

+ XLogReaderState *
+ normal_snapshot_reader(XLogRecPtr startpoint, TransactionId xmin,
+                        char *plugin, XLogRecPtr valid_after)
+ {
+     /* to simplify things we reuse initial_snapshot_reader */
+     XLogReaderState *xlogreader = initial_snapshot_reader(startpoint, xmin);

*** SNIP ***

+
+     /* lookup symbols in the shared libarary */
+
+     /* optional */
+     apply_state->init_cb = (LogicalDecodeInitCB)
+         load_external_function(plugin, "pg_decode_init", false, NULL);
+
+     /* required */
+     apply_state->begin_cb = (LogicalDecodeBeginCB)
+         load_external_function(plugin, "pg_decode_begin_txn", true, NULL);

*** SNIP ***

This seems fairly wrong-headed. Comments above this function say:

+ /*
+  * Build a snapshot reader with callbacks found in the shared library "plugin"
+  * under the symbol names found in output_plugin.h.
+  * It wraps those callbacks so they send out their changes via an logical
+  * walsender.
+  */

So the idea is that the names of all functions with public linkage
within test_decoding (their symbols) have magical significance, and
that the core system resolves those magic symbols dynamically. I'm not
aware of this pattern appearing anywhere else within Postgres.
Furthermore, it seems kind of short sighted. Have we not painted
ourselves into a corner with regard to using multiple plugins at once?
This doesn't seem terribly unreasonable, if for example we wanted to
use test_decoding in production to debug a problem, while running a
proper logical replication system and some other logical change-set
consumer in tandem. Idiomatic use of “hooks” allows multiple plugins
to be called for the same call of the authoritative hook by the core
system, as for example when using auto_explain and pg_stat_statements
at the same time. Why not just use hooks? It isn't obvious that you
shouldn't be able to do this. The signature of the function
pg_decode_change (imposed by the function pointer typedef
LogicalDecodeChangeCB) assumes that everything should go through a
passed StringInfo, but I have a hard time believing that that's a good
idea.
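
For concreteness, the idiomatic hook-chaining pattern I have in mind
(as used by auto_explain and pg_stat_statements) looks roughly like
this - a minimal sketch using ExecutorStart_hook purely as an
illustration, nothing specific to logical decoding:

#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

static ExecutorStart_hook_type prev_ExecutorStart = NULL;

static void
my_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
    /* plugin-specific work would go here */

    /* chain to the previous hook, or to the standard routine */
    if (prev_ExecutorStart)
        prev_ExecutorStart(queryDesc, eflags);
    else
        standard_ExecutorStart(queryDesc, eflags);
}

void
_PG_init(void)
{
    prev_ExecutorStart = ExecutorStart_hook;
    ExecutorStart_hook = my_ExecutorStart;
}

Multiple such modules can be loaded at once precisely because each one
saves, and later calls, whatever hook value it found at load time.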

It's like your plugin functions as a way of filtering reorder buffers.
It's not as if the core system just passes logical change-sets off, as
one might expect. It is actually the case that clients have to connect
to the server in replication mode, and get their change-sets (as
filtered by their plugin) streamed by a walsender over the wire
protocol directly. What of making changeset subscribers generic
abstractions? Again, maybe you don't have to do anything with the
StringInfo, but that is far from clear from the extant code and
documentation.

Snapshot builder
================

We've seen [1] that the snapshot builder is concerned with building
snapshots for the purposes of timetravel. This is needed to see the
contents of the catalog at a point in time when decoding (see design
documents for more).
 src/backend/access/transam/xact.c                     |    4 +-
 src/backend/replication/logical/DESIGN.txt            |  603 ++++++++++
 src/backend/replication/logical/Makefile              |   25 +
 src/backend/replication/logical/README.SNAPBUILD.txt  |  298 +++++
 src/backend/replication/logical/snapbuild.c           | 1181 +++++++++++++++++++
 src/backend/utils/time/tqual.c                        |  299 ++++-
 src/include/access/heapam_xlog.h                      |   23 +
 src/include/c.h                                       |    1 +
 src/include/replication/snapbuild.h                   |  129 +++
 src/include/utils/snapshot.h                          |    4 +-
 src/include/utils/tqual.h                             |   51 +-

I've lumped the tqual/snapshot visibility changes under “Snapshot
builder” too, and anything mostly to do with ComboCids. The
README.SNAPBUILD document (and the code described by it) was
previously the focus of an entire review of its own [2].

I still don't see why you're allocating snapshots within
DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots
would be better allocated alongside the ReadApplyState that is
directly concerned with snapshots, to better encapsulate the snapshot
stuff. Now, you do at least acknowledge this problem this time around:

+     /*
+      * FIXME: The existance of the snapshot builder is pretty obvious to the
+      * outside right now, that doesn't seem to be very good...
+      */

However, the fact is that this function:

+ Snapstate *
+ AllocateSnapshotBuilder(ReorderBuffer *reorder)
+ {

doesn't actually do anything with the ReorderBuffer pointer that it is
passed. So I don't see why you've put off doing this, as if it's
something that would require a non-trivial effort.

One of my major concerns during that review was the need for this “peg
xmin horizon” hack - you presented an example that required the use of
a prepared transaction to artificially peg the global xmin horizon,
and I wasn't happy about that. We were worried about catalog tables
getting vacuumed in a way that prevented us from correctly
interpreting data about types in the face of transactions that mix DML
and DDL.

If the catalog tables were vacuumed, we'd be out of luck - we needed
to do something somewhat analogous to hot_standby_feedback. At the
same time, we need to manage the risk of bloat on the primary due to
non-availability of a standby in some speculative replication system
using this infrastructure. One proposal floated around was to have a
special notion of xmin horizon - a more granular xmin horizon
applicable to only the necessary catalog tables. You didn't pursue
that idea yet, preferring to solve the simpler case. You say of xmin
horizon handling:

+ == xmin Horizon Handling ==
+
+ Reusing MVCC for timetravel access has one obvious major problem:
+ VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
+ obviously, we want autovacuum/manual vacuum to work as before.
+
+ The idea here is to reuse the infrastrcuture built for hot_standby_feedback
+ which allows us to keep the xmin horizon of a walsender backend artificially
+ low. We keep it low enough so we can restart decoding from the last location
+ the client has confirmed to be safely received. The means that we keep it low
+ enough to contain the last checkpoints oldestXid value.
+
+ That also means we need to make that value persist across
restarts/crashes in a
+ very similar manner to twophase.c's. That infrastructure actually also useful
+ to make hot_standby_feedback work properly across primary restarts.

So we jury rig the actual xmin horizon by doing this:

+                              /*
+                               * inrease shared memory state, so vacuum can work
+                               * on tuples we prevent from being purged.
+                               */
+                              IncreaseLogicalXminForSlot(buf->origptr,
+                                                         running->oldestRunningXid);

We switch the WAL Sender proc's xmin while the walsender replies to a
message, while preserving the “real” xmin horizon. Presumably this is
crash safe, since we do this as part of XLOG_RUNNING_XACTS replay (iff
we're doing “logical recovery”; that is, decoding is being performed
as we reach SNAPBUILD_CONSISTENT):
recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata);

I continue to be quite concerned about the failure modes here. I do
not accept that this is no worse than using hot_standby_feedback.
hot_standby_feedback can see a standby bloat up the master because it
has a long-running transaction - it's a process that the standby must
actively engage in. However, what you have here will bloat up the
master passively; standbys have to actively work to *prevent* that
from happening. That's a *fundamental* distinction. Maybe it's
actually reasonable to do that, at least for now, but I think that you
should at least acknowledge the distinction as an important one.

We also use this new tqual.c infrastructure to time-travel during
decoding, with the snapshot built for us by snapshot builder:

+ /*
+  * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+  * obeys.
+  *
+  * Only usable on tuples from catalog tables!
+  *
+  * We don't need to support HEAP_MOVED_(IN|OFF) for now because we
only support
+  * reading catalog pages which couldn't have been created in an older version.
+  *
+  * We don't set any hint bits in here as it seems unlikely to be beneficial as
+  * those should already be set by normal access and it seems to be too
+  * dangerous to do so as the semantics of doing so during timetravel are more
+  * complicated than when dealing "only" with the present.
+  */
+ bool
+ HeapTupleSatisfiesMVCCDuringDecoding(HeapTuple htup, Snapshot snapshot,
+                                      Buffer buffer)

Are you sure that ReorderBuffer.private_data should be a void*? Maybe
we'd be better off if it was a minimal “abstract base class” pointer,
that contained a MemoryContext?
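
Something along these lines is what I have in mind - a hypothetical
sketch only; none of these names exist in the patch:

#include "postgres.h"

/*
 * Hypothetical minimal "base class" for plugin private state; plugins
 * would embed it as the first member of their own state struct.
 */
typedef struct OutputPluginPrivate
{
    MemoryContext context;    /* context all plugin allocations live in */
} OutputPluginPrivate;

typedef struct MyPluginState
{
    OutputPluginPrivate base; /* must be first, so the core system can
                               * cast the pointer to the base type */
    int         ncalls;       /* plugin-specific fields follow */
} MyPluginState;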

This whole area could use a lot more scrutiny. That's all I have for
now, though.

I'm happy to note that the overhead of computing the pegged
Recent(Global)Xmin is one TransactionIdIsValid, one
TransactionIdPrecedes and, potentially, one assignment.

I am also pleased to see that you're invalidating system caches in a
more granular fashion (for transactions that contain both DDL and DML,
where we cannot rely on the usual Hot Standby where sinval messages
are applied for commit records). That is a subject worthy of another
e-mail, though.

Decoding (“glue code”)
======================

We've seen [1] that decoding is concerned with decoding WAL records
from an xlogreader.h callback into an reorderbuffer.

Decoding means breaking up individual XLogRecord structs, reading them
through an XLogReaderState, and storing them in a reorder buffer
(reorderbuffer.c does this, and stores them as ReorderBufferChange
records), while building a snapshot (which is needed in advance of
adding tuples from records). It can be thought of as the small piece
of glue between reorderbuffer and snapbuild that is called by
XLogReader (DecodeRecordIntoReorderBuffer() is the only public
function, which will be called by the WAL sender – previously, this
was called by plugins directly).

An example of what belongs in decode.c is the way it ignores physical
XLogRecords, because they are not of interest.
 src/backend/replication/logical/decode.c           |  494 ++++++++
 src/backend/replication/logical/logicalfuncs.c     |  224 ++++
 src/backend/utils/adt/dbsize.c                     |   79 ++
 src/include/catalog/indexing.h                     |    2 +
 src/include/catalog/pg_proc.h                      |    2 +
 src/include/replication/decode.h                   |   21 +
 src/include/replication/logicalfuncs.h             |   45 +
 src/include/storage/itemptr.h                      |    3 +
 src/include/utils/builtins.h                       |    1 +

The pg_proc accessible utility function pg_relation_by_filenode() -
which you've documented - doesn't appear to be used at present (it's
just a way of exposing the core infrastructure, as described under
“Miscellaneous thoughts”). A new index is created on pg_class
(reltablespace oid_ops, relfilenode oid_ops).

We've seen that we need a whole new infrastructure for resolving
relfilenodes to relation OIDs, because only relfilenodes are available
from the WAL stream, and in general the mapping isn't stable, as for
example when we need to do a table rewrite. We have a new syscache for
this.

We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table
relfilenode and combocids. I'm still not clear on how you're managing
corner case with relfilenode/table oid mapping that Robert spoke of
previously [17]. Could you talk about that?

Reorder buffer
==============

Last time around [1], this was known as ApplyCache. It's still
concerned with the management of logical replay cache - it reassembles
transactions from a stream of interspersed changes. This is what a
design doc previously talked about under “4.5 - TX reassembly” [14].
 src/backend/replication/logical/reorderbuffer.c    | 1185 ++++++++++++++++++++
 src/include/replication/reorderbuffer.h            |  284 +++++

Last time around, I described spooling to disk, like a tuplestore, as
a probable prerequisite to commit - I raise that now because I thought
that this was the place where you'd most likely want to do that.
Concerns about the crash-safety of buffered change-sets were raised
too.

You say this in a README:

+ * crash safety, restartability & spilling to disk
+ * consistency with the commit status of transactions
+ * only a minimal amount of synchronous work should be done inside individual
+ transactions
+
+ In our opinion those problems are restricting progress/wider distribution of
+ these class of solutions. It is our aim though that existing solutions in this
+ space - most prominently slony and londiste - can benefit from the work we are
+ doing & planning to do by incorporating at least parts of the changeset
+ generation infrastructure.

So, have I understood correctly - are you proposing that we simply
outsource this to something else? I'm not sure how I feel about that,
but I'd like clarity on this matter.

reorderbuffer.h should have way, way more comments for each of the
structs. I want to see detailed comments, like those you see for the
structs in parsenodes.h - you shouldn't have to jump to some design
document to see how each struct fits within the overall design of
reorder buffering.
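
To illustrate the style I mean (the struct and field names here are
invented for illustration, not taken from the patch):

#include "access/xlogdefs.h"    /* for XLogRecPtr */

/*
 * A single change within a transaction.  In the parsenodes.h style,
 * every field gets a comment explaining its role.
 */
typedef struct ExampleChange
{
    XLogRecPtr  lsn;      /* WAL position at which the change was logged */
    int         action;   /* insert, update or delete */
    /* ... tuple data, link to the containing transaction, etc ... */
} ExampleChange;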

XLog stuff (in particular, the new XLogReader)
=================================

Andres rebased on top of Heikki's XLogReader patch for the purposes of
BDR, and privately identified this whole area to me as a particular
concern for this review. The version that I'm reviewing here is not
the version that Andres described last week, v3.0 [10], but a slight
revision thereof, v3.1 [11]. See the commit message in Andres' feature
branch for full details [12].
 doc/src/sgml/ref/pg_xlogdump.sgml        |   76 ++
 src/backend/access/transam/Makefile      |    2 +-
 src/backend/access/transam/xlog.c        | 1084 ++++--------------
 src/backend/access/transam/xlogfuncs.c   |    1 +
 src/backend/access/transam/xlogreader.c  |  962 ++++++++++++++++
 src/bin/Makefile                         |    2 +-
 src/bin/pg_xlogdump/Makefile             |   87 ++
 src/bin/pg_xlogdump/compat.c             |  173 +++
 src/bin/pg_xlogdump/nls.mk               |    4 +
 src/bin/pg_xlogdump/pg_xlogdump.c        |  462 ++++++++
 src/bin/pg_xlogdump/pqexpbuf_strinfo.c   |   76 ++
 src/bin/pg_xlogdump/tables.c             |   78 ++
 src/include/access/heapam_xlog.h         |   23 +
 src/include/access/transam.h             |    5 +
 src/include/access/xlog.h                |    3 +-
 src/include/access/xlog_fn.h             |   35 +
 src/include/access/xlog_internal.h       |   23 -
 src/include/access/xlogdefs.h            |    1 +
 src/include/access/xlogreader.h          |  159 +++

There was some controversy over the approach to implementing a
“generic xlog reader” [13]. This revision of Andres' work presumably
resolves that controversy, since it heavily incorporates Heikki's own
work. Heikki has described the design of his original XLogReader patch
[18].

pg_xlogdump is a hacker-orientated utility that has been around in
various forms for quite some time (i.e. at least since the 8.3 days),
concerned with reading and writing Postgres transaction logs for
debugging purposes. It has long been obvious that it would be useful
to maintain along with Postgres (there has been a tendency for
xlogdump to fall behind, and only stable releases are supported), but
the XLogReader-related refactoring makes adding an official xlogdump
tool quite compelling (we're talking about 462 lines of wrapper code
for pg_xlogdump.c, against several thousand lines of code for the
version in common use [15] that has hard-coded per-version knowledge
of catalog oids and things like that). I think that some of the
refactoring that Simon did to xlog.c last year [16] makes things
easier here, and kind of anticipates this.

Again, with pg_xlogdump you forgot to do this:
 pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
 	$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)

+ install: all installdirs
+     $(INSTALL_PROGRAM) pg_xlogdump$(X) '$(DESTDIR)$(bindir)'/pg_xlogdump$(X)
+
+ installdirs:
+     $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+ uninstall:
+     rm -f $(addprefix '$(DESTDIR)$(bindir)'/, pg_xlogdump$(X))

pg_xlogdump could be considered a useful way of testing the XLogReader
and decoding functionality, independent of the test_decoding plugin.
It is something that I'll probably use to debug this patch over the
next few weeks. Example usage:

[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
xlog record: rmgr: Heap2      , record_len:     34, tot_len:     66,
tx:       1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc:
new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295,
combo: 4294967295
xlog record: rmgr: Heap       , record_len:    175, tot_len:    207,
tx:       1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc:
insert: rel 1663/12933/12671; tid 7/44
xlog record: rmgr: Btree      , record_len:     34, tot_len:     66,
tx:       1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc:
insert: rel 1663/12933/12673; tid 1/355

In another thread, Robert and Heikki remarked that pg_xlogdump ought
to be in contrib, and not in src/bin. As you know, I am inclined to
agree.

[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
fatal_error: requested WAL segment 012345670000000000000009 has
already been removed

This error message seems a bit presumptuous to me; as it happens there
never was such a WAL segment. Saying that there was introduces the
possibility of operator error.

This appears to be superfluous:

*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 18,23 ****
--- 18,24 ----
  #include "access/htup_details.h"
  #include "access/xlog.h"
+ #include "access/xlog_fn.h"
  #include "access/xlog_internal.h"

The real heavyweight here is xlogreader.c, at 962 lines. The module
refactors xlog.c, moving ReadRecord and some supporting functions to
xlogreader.c. Those supporting functions now operate on *generic*
XLogReaderState rather than various global variables. The idea here is
that the client of the API calls ReadRecord repeatedly to get each
record.

There is a callback of type XLogPageReadCB, which is used by the
client to obtain a given page in the WAL stream. The XLogReader
facility is responsible for decoding the WAL into records, but the
client is responsible for supplying the physical bytes via the
callback within XLogReader state. There is an error-handling callback
too, added by Andres. Andres added a new function,
XLogFindNextRecord(), which is used for checking whether RecPtr is a
valid XLog address for reading, and for finding the first valid address
after some address when dumping records, for debugging purposes.
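
In other words, client code ends up shaped something like this - a
sketch only; the signatures here are assumed from the description above
and may not match the patch exactly:

#include "postgres.h"
#include "access/xlogreader.h"

/* the client supplies the physical bytes for each requested page */
static int
my_page_read(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
             XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI)
{
    /* a real implementation would copy the requested page into readBuf
     * and return the number of bytes read; -1 signals failure */
    return -1;
}

static void
consume_records(XLogRecPtr startpoint, void *private_data)
{
    XLogReaderState *reader;
    XLogRecord *record;
    char       *errormsg;

    reader = XLogReaderAllocate(&my_page_read, private_data);

    /* the first call positions us; later calls just continue */
    record = XLogReadRecord(reader, startpoint, &errormsg);
    while (record != NULL)
    {
        /* decode/consume the record here */
        record = XLogReadRecord(reader, InvalidXLogRecPtr, &errormsg);
    }
}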

Why did you move the page validation handling into XLogReader?

Support was added for reading pages which are only partially valid,
which seems reasonable. The callback that acts as a replacement for
emode_for_corrupt_record might be a bit questionable.

I'd like to have more to say on this. I'll leave that for another day.

I note that there are many mallocs in this module (see note below
under “Miscellaneous thoughts”).

heapam and other executor stuff
===============================

One aspect of this patch that I feel certainly warrants another of
these subsections is the changes to heapam.c and related executor
changes. These are essentially changes to functions called by
nodeModifyTable.c frequently, including functions like
heap_hot_search_buffer, heap_insert, heap_multi_insert and
heap_delete. We now have to do extra logical logging, and we need
primary key values to be looked up.

Files changed include:
 src/backend/access/heap/heapam.c         |  284 ++++-
 src/backend/access/heap/pruneheap.c      |   16 +-
 src/backend/catalog/index.c              |   76 +-
 src/backend/access/rmgrdesc/heapdesc.c   |    9 +
 src/include/access/heapam_xlog.h         |   23 +
 src/include/catalog/index.h              |    4 +

What of this? (I'm using the dellstore sample database, as always):

postgres=# \d+ orders
                         Table "public.orders"
*** SNIP ***
Indexes:
    "orders_pkey" PRIMARY KEY, btree (orderid)
    "ix_order_custid" btree (customerid)
*** SNIP ***

postgres=# delete from orders where orderid = 77;
WARNING:  Could not find primary key for table with oid 16406
CONTEXT:  SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
WARNING:  Could not find primary key for table with oid 16406
CONTEXT:  SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
WARNING:  Could not find primary key for table with oid 16406
CONTEXT:  SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
DELETE 1

I don't have time to figure out what this issue is right now.

Hot Standby, Replication and libpq stuff
========================================

Not forgetting existing replication infrastructure and libpq stuff
affected by this patch. Files under this category that have been
modified are:
 src/backend/access/rmgrdesc/xlogdesc.c                      |    1 +
 src/backend/postmaster/bgwriter.c                           |   35 +
 src/backend/postmaster/postmaster.c                         |    7 +-
 src/backend/replication/libpqwalreceiver/libpqwalreceiver.c |    4 +-
 src/backend/replication/Makefile                            |    2 +
 src/backend/replication/walsender.c                         |  732 +++++++++++-
 src/backend/storage/ipc/procarray.c                         |   23 +
 src/backend/storage/ipc/standby.c                           |    8 +
 src/backend/utils/init/postinit.c                           |    5 +
 src/bin/pg_controldata/pg_controldata.c                     |    2 +
 src/include/nodes/nodes.h                                   |    2 +
 src/include/nodes/replnodes.h                               |   22 +
 src/include/replication/walsender.h                         |    1 +
 src/include/replication/walsender_private.h                 |   43 +-
 src/interfaces/libpq/exports.txt                            |    1 +
 src/interfaces/libpq/pqexpbuffer.c                          |   40 +
 src/interfaces/libpq/pqexpbuffer.h                          |    5 +

I take particular interest in bgwriter.c here. You're doing this:

+          * Log a new xl_running_xacts every now and then so replication can get
+          * into a consistent state faster and clean up resources more
+          * frequently. The costs of this are relatively low, so doing it 4
+          * times a minute seems fine.

What about the power consumption of the bgwriter? I think that the way
you try to interact with the existing loop logic is ill-considered. Just
why is the bgwriter the compelling auxiliary process in which to do
this extra work?

Quite a lot of code has been added to walsender. This is mostly down
to some new functions, responsible for initialising logical
replication:

! typedef void (*WalSndSendData)(bool *);
! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
  static void InitWalSenderSlot(void);
  static void WalSndKill(int code, Datum arg);
! static void XLogSendPhysical(bool *caughtup);
! static void XLogSendLogical(bool *caughtup);
  static void IdentifySystem(void);
  static void StartReplication(StartReplicationCmd *cmd);
+ static void CheckLogicalReplicationRequirements(void);
+ static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
+ static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+ static void ComputeLogicalXmin(void);

This is mostly infrastructure for initialising and starting logical replication.

Initialisation means finding a free “logical slot” from shared memory,
then looping until the new magic xmin horizon for logical walsenders
(stored in their “slot”) is that of the weakest link (think local
global xmin).

+      * FIXME: think about solving the race conditions in a nicer way.
+      */
+ recompute_xmin:
+     walsnd->xmin = GetOldestXmin(true, true);
+     ComputeLogicalXmin();
+     if (walsnd->xmin != GetOldestXmin(true, true))
+         goto recompute_xmin;

Apart from the race conditions that I'm not confident are addressed
here, I think that the above could easily get stuck indefinitely in
the event of contention.

Initialisation occurs due to a “INIT_LOGICAL_REPLICATION” replication
command. Initialisation also means that decoding state is allocated (a
snapshot reader is initialised), and we report back success or failure
to the client that's using the streaming replication protocol (i.e. in
our toy example, pg_receivellog).

Starting logical replication means we load the previously initialised
slot, and find a snapshot reader plugin (using the “magic symbols”
pattern described above, under “Plugin interface”).

Why do we have to “find” a logical slot twice (both during
initialisation and starting)?

Since I've already described the “peg xmin horizon” stuff under
“Snapshot builder”, I won't belabour the point. I think that I have
more to say about this, but not today.

Minor point: This is a terrible name for the variable in question:

+     LogicalWalSnd *walsnd;

Miscellaneous thoughts
======================

You're still using C stdlib functions like malloc, free, calloc quite
a bit. My concern is that this points to a lack of thought about the
memory management strategy; why are you still not using memory
contexts in some places? If it's so difficult to anticipate what
clients of, say, XLogReaderAllocate() want for the lifetime of their
memory, then likely as not those clients should be doing their own
memory allocation, and passing the allocated buffer directly. If it is
obvious that the memory ought to persist indefinitely (and I think
it's your contention that it is in the case of XLogReaderAllocate()),
I'd just allocate it in the top memory context. Now, I am aware that
there are a trivial number of backend mallocs that you can point to as
precedent here, but I'm still not satisfied with your explanation for
using malloc(). At the very least, you ought to be handling the case
where malloc returns NULL, and you're not doing so consistently.
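
For instance - a sketch of the two alternatives I'd consider
acceptable, using XLogReaderState purely as an example:

#include "postgres.h"
#include "access/xlogreader.h"
#include "utils/memutils.h"

static XLogReaderState *
alloc_reader_state(void)
{
    /* allocate in a long-lived memory context when the lifetime really
     * is "indefinitely" */
    return (XLogReaderState *)
        MemoryContextAllocZero(TopMemoryContext, sizeof(XLogReaderState));
}

static XLogReaderState *
alloc_reader_state_malloc(void)
{
    /* ...or, if malloc() really is warranted, at least check the result */
    XLogReaderState *state = malloc(sizeof(XLogReaderState));

    if (state == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory")));
    return state;
}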

Memory contexts are very handy for debugging. As you know, I wrote a
little Python script with GDB bindings, that walks the tree of memory
contexts and prints out statistics about them using the
aset.c/AllocSetStats() infrastructure. It isn't difficult to imagine
that something like that could be quite useful with this patch - I'd
like to be able to easily determine how many snapshot builders have
been allocated from within a given backend, for example (though I see
you refcount that anyway for reasons that are not immediately apparent
- just debugging?).

Minor gripes:

* There is no need to use a *.txt extension for README files; we don't
currently use those anywhere else.

* If you only credit the PGDG and not the Berkeley guys (as you
should, for the most part), there is no need to phrase the notice
“Portions Copyright...”. You should just say “Copyright...”.

* You're still calling function pointer typedefs things like
LogicalDecodeInitCB. As I've already pointed out, you should prefer
the existing conventions (call it something like
LogicalDecodeInit_hook_type).

Under this section are all modifications to files that are not
separately described under some dedicated section header. I'll quickly
pass remark on them.

System caches were knocked around a bit:
// LocalExecuteInvalidationMessage now exposed:
src/backend/utils/cache/inval.c                    |    2 +-
// relcache.c has stray whitespace:
src/backend/utils/cache/relcache.c                 |    1 -
// New RelationMapFilenodeToOid() function:
src/backend/utils/cache/relmapper.c                |   53 +
// New RELFILENODE syscache added:
src/backend/utils/cache/syscache.c                 |   11 +
// Headers:
src/include/storage/sinval.h                       |    2 +
src/include/utils/relmapper.h                      |    2 +
src/include/utils/syscache.h                       |    1 +

These are only of tangential interest to snapshot building, and so are
not described separately. Essentially, just “add new syscache”
boilerplate. There's also a little documentation, covering only the
pg_relation_by_filenode() utility function (this exposes
RelationMapFilenodeToOid()/RELFILENODE syscache):
doc/src/sgml/func.sgml                             |   23 +-
doc/src/sgml/ref/allfiles.sgml                     |    1 +
doc/src/sgml/reference.sgml                        |    1 +

The following files were only changed due to the change in the tqual.c
interfaces of HeapTupleSatisfies*().
contrib/pgrowlocks/pgrowlocks.c                    |    2 +-
src/backend/commands/analyze.c                     |    3 +-
src/backend/commands/cluster.c                     |    2 +-
src/backend/commands/vacuumlazy.c                  |    3 +-
src/backend/storage/lmgr/predicate.c               |    2 +-

That's all the feedback that I have for now. I'd have liked to have
gone into more detail in many cases, but I cannot only do so much. I
always like to start off rounds of review with “this is the current
state of play as I see it” type e-mails. There will be more to follow,
now that I have that out of the way.

References
==========

[1] Earlier WAL decoding review:
http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com

[2] Earlier snapshot building doc review:
http://archives.postgresql.org/message-id/CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com

[3] "Rearrange storage of data in xl_running_xacts" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5c11725867ac3cb06db065f7940143114280649c

[4] "Basic binary heap implementation" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7a2fe9bd0371b819aacc97a007ec1d955237d207

[5] "Embedded list interface" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a66ee69add6e129c7674a59f8c3ba010ed4c9386

[6] "Background worker processes" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da07a1e856511dca59cbb1357616e26baa64428e

[7] Chris Browne on Slony and ordering conflicts:
http://archives.postgresql.org/message-id/CAFNqd5VY9aKZtPSEyzOTMsGAhfFHKaGNCgY0D0wZvqjC0Dtt1g@mail.gmail.com

[8] Steve Singer on Slony and transaction isolation level:
http://archives.postgresql.org/message-id/BLU0-SMTP6402AA6F3A1F850EDFA1B2DC8D0@phx.gbl

[9] Kevin Grittner on commit ordering:
http://archives.postgresql.org/message-id/20121022141701.224550@gmx.com

[10] v3.0 of the XLogReader (Andres' revision):
http://archives.postgresql.org/message-id/20121204175212.GB12055@awork2.anarazel.de

[11] v3.1 of the XLogReader (Andres' slight tweak of [10]):
http://archives.postgresql.org/message-id/20121209190532.GD4694@awork2.anarazel.de

[12] Andres' XLogReader commit:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=3ea7ec5eea2cf890c14075b559e77a25a4130efc

[13] Heikki objects to XLogReader approach, proposes alternative:
http://archives.postgresql.org/message-id/5056D3E1.3060108@vmware.com

[14] “WAL decoding, attempt #2” design documents:
http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com

[15] xlogdump satellite project: https://github.com/snaga/xlogdump

[16] Numerous refactoring commits. Main split was commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9aceb6ab3c202a5bf00d5f00436bb6ad285fc0bf

[17] Robert on relfilenodes:
http://archives.postgresql.org/message-id/CA+TgmoZXkCo5FAbU=3JHuXXF0Op2SLhGJcVuFM3tkmcBnmhBMQ@mail.gmail.com

[18] Heikki on his XLogReader:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php



Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From
Heikki Linnakangas
Date:
On 11.12.2012 21:11, Andres Freund wrote:
> Now that I have read some of that code, I am currently unsure how the
> current implementation of this can cooperate with translation, even when
> used from the backend?

Hmm, there was a gettext() call missing from report_invalid_record. 
That's where the translation needs to happen. Fixed now.

- Heikki



Heikki Linnakangas wrote:
> On 11.12.2012 21:11, Andres Freund wrote:
> >Now that I have read some of that code, I am currently unsure how the
> >current implementation of this can cooperate with translation, even when
> >used from the backend?
>
> Hmm, there was a gettext() call missing from report_invalid_record.
> That's where the translation needs to happen. Fixed now.

You need to call gettext_noop() in the string literals as well, unless
you've added the function and argument number to the gettext trigger
list in nls.mk.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - git repository

From
Andres Freund
Date:
Hi Peter!

Thanks for the review, you raise many noteworthy points. This is going
to be a long mail...

On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
> I'm very glad that you followed my earlier recommendation of splitting
> your demo logical changeset consumer into a contrib module, in the
> spirit of contrib/spi, etc. This module, "test_decoding", represents a
> logical entry point, if you will, for the entire patch. As unwieldy as
> it may appear to be, the patch is (or at least *should be*) ultimately
> reducible to some infrastructural changes to core to facilitate this
> example logical change-set consumer.

To be fair, that point was first brought up by Robert and Kevin. But
yes, it's now included, which is totally sensible.

> Once again, because test_decoding is a kind of "entry point", it gives
> me a nice point to continually refer back to when talking about this
> patch. (Incidentally, maybe test_decoding should be called
> pg_decoding?).

I am not particularly happy with the current name, I just named it akin
to test_parser/. I don't really like pg_decoding though; ISTM the pg_
prefix doesn't serve a point there, since it's not a binary or such which
will lie around in some general namespace.

Other suggestions?

> The regression tests pass, though this isn't all that surprising,
> since frankly the test coverage of this patch appears to be quite low.

Yes, that certainly needs to be addressed.

> I obliged you and didn't focus on concurrency
> and serializability concerns (it was sufficient to print out values/do
> some decoding in a toy function), but it's time to take a closer look
> at those now, I feel.

Agreed.

> test_decoding is a client of the logical
> change-set producing infrastructure, and there appears to be broad
> agreement that that infrastructure needs to treat such consumers in a
> way that is maximally abstract. My question is, just how abstract does
> this interface have to be, really? How well are you going to support
> the use-case of a real logical replication system?

> Now, maybe it's just that I haven't being paying attention (in
> particular, to the discussion surrounding [3] – though that commit
> doesn't appear to have been justified in terms of commit ordering in
> BDR at all), but I would like you to be more demonstrative of certain
> things, like:

That commit was basically just about being able to discern which xids
are toplevel and which are subtransaction xids. snapbuild.c only needs
to wait for toplevel xids now and doesn't care about subtransaction
xids, which made the code significantly simpler.

> 1. Just what does a logical change-set consumer look like? What things
> are always true of one, and never true of one?

> 2. Please describe in as much detail as possible the concurrency
> issues with respect to logical replication systems. Please make
> verifiable, testable claims as to how well these issues are considered
> here, perhaps with reference to the previous remarks of subject-matter
> experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9]
> following my earlier review.

Not sure what you want to hear here to be honest.

Let me try anyway:

Transactions (and the contained changes) are guaranteed to be replayed
in commit-order, where the order is defined by the LSN/position in the
xlog stream of the commit record [1]. That's the same ordering that Hot
Standby uses.
The code achieves that order by reading the xlog records sequentially
in-order and replaying the begin/changes/commit "events" every time it
reads a commit record, and never at a different time [1].

Several people in the thread you referenced seemed to agree that
commit-ordering is a sensible choice.

[1]: Note that there are potential visibility differences between the
order in which transactions are marked as visible in WAL, in the clog
and in memory (procarray), since that's not done while holding a lock
over the whole period. That's an existing property with HS.

> I'm not all that impressed with where test_decoding is at right now.
> There is still essentially no documentation.

I will add comments.

> I think it's notable that you don't really touch the ReorderBufferTXN
> passed by the core system in the test_decoding plugin.

Don't think that's saying very much, except that 1) we don't pass on
enough information about transactions yet (e.g. commit timestamp) and 2)
the output plugin is simple.

>
> test_decoding and pg_receivellog
> ========================
>
> I surmised that the way that the test_decoding module is intended to
> be used is as a client of receivellog.c (*not* receivelog.c – that
> naming is *confusing*, perhaps call it receivelogiclog.c or something.

I am happy to name it any way people want. Once decided I think it
should move out of bin/pg_basebackup and the code (which is mostly
copied from streamutil.c and pg_receivexlog) should be cleaned up
considerably.

> Better still, make receivexlog handle the logical case rather than
> inventing a new tool). The reason for receivellog.c existing, as you
> yourself put it, is:

I don't think they really can be merged, the differences are notable
already and are going to get bigger.

> + /*
> +  * We have to use postgres.h not postgres_fe.h here, because there's so much
> +  * backend-only stuff in the XLOG include files we need.  But we need a
> +  * frontend-ish environment otherwise.    Hence this ugly hack.
> +  */
>
> So receivellog.c is part of a new utility called pg_receivellog, in
> much the same way as receivexlog.c is part of the existing
> pg_receivexlog utility (see commit
> b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're
> talking about these changes:

receivelog.c is old code, I didn't change anything nontrivial there. The
one change is gone now since Heikki committed the
xlog_internal.h/xlog_fn.h change.

>  src/backend/utils/misc/guc.c            |   11 +
>  src/bin/pg_basebackup/Makefile          |    7 +-
>  src/bin/pg_basebackup/pg_basebackup.c   |    4 +-
>  src/bin/pg_basebackup/pg_receivellog.c  |  717 ++++++++++++
>  src/bin/pg_basebackup/pg_receivexlog.c  |    4 +-
>  src/bin/pg_basebackup/receivelog.c      |    4 +-
>  src/bin/pg_basebackup/streamutil.c      |    3 +-
>  src/bin/pg_basebackup/streamutil.h      |    1 +
>
> So far, so good. Incidentally, you forgot to do this:
>
>   install: all installdirs
>       $(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
>       $(INSTALL_PROGRAM) pg_receivexlog$(X)
> '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
> +     $(INSTALL_PROGRAM) pg_receivellog$(X)
> '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

I actually didn't forget to do this, but I didn't want to install
binaries that probably won't survive under the current name. That seems
to have been a bad idea since Michael and you noticed it as missing ;)

> So this creates a new binary executable, pg_receivellog, which is
> described as “the pg_receivexlog equivalent for logical changes”. Much
> like pg_receivexlog, pg_receivellog issues special new replication
> protocol commands for logical replication, which account for your
> changes to the replication protocol grammar and lexer (i.e.
> walsender):
>
>  src/backend/replication/repl_gram.y                |   32 +-
>  src/backend/replication/repl_scanner.l             |    2 +
>
> You say:
>
> + /* This is is just for demonstration, don't ever use this code for
> anything real! */
>
> uh, why not? What is the purpose of a contrib module, if not to serve
> as a minimal example?

Stupid copy & paste error from the old example code. The code should
probably grow a call to some escape functionality and more comments to
serve as a good example, but otherwise it's ok.

> Evidently you expected me to see this message:
>
> +     if (!walsnd)
> +     {
> +         elog(ERROR, "couldn't find free logical slot. free one or increase
> max_logical_slots");
> +     }
>
> If I did, that might have been okay. I didn't though, presumably
> because the “walsnd” variable was wild/uninitialised.

The problem was earlier, CheckLogicalReplicationRequirements() should
have checked for a reasonable max_logical_slots value but only checked
for wal_level. Fix pushed.

> So, I went and set max_logical_slots to something higher than 0, and
> restarted. pg_receivellog behaved itself this time.
>
> In one terminal:
>
> [peter@peterlaptop decode]$ tty
> /dev/pts/0
> [peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
> WARNING:  Initiating logical rep
> WARNING:  reached consistent point, stopping!
> WARNING:  Starting logical replication

Those currently are WARNINGs to make them easier to see, they obviously
need to be demoted at some point.

> One minor gripe is that output_plugin.h isn't going to pass muster
> with cpluspluscheck (private is a C++ keyword).

Fix pushed.

> Plugin interface
> ===========
> So test_decoding uses various type of caches and catalogs. I'm mostly
> worried about the core BDR interface that it uses, more-so than this
> other stuff. I'm talking about:

I have asked for input on the interface in a short email
http://archives.postgresql.org/message-id/20121115014250.GA5844%40awork2.anarazel.de
but haven't gotten any responses so far.

I am happy to change the interface; I just did the first thing that
made sense to me.

Steve Singer - who I believe played a bit with writing his own output
plugin - seemed to be ok with it.

> The test_decoding module is hard-coded within pg_receivellog thusly
> (the SCONST token here could name an arbitrary module):
>
> +     res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");

pg_receivellog will/should grow a --output-plugin parameter at some
point.

> +     /* optional */
> +     apply_state->init_cb = (LogicalDecodeInitCB)
> +         load_external_function(plugin, "pg_decode_init", false, NULL);

> So the idea is that the names of all functions with public linkage
> within test_decoding (their symbols) have magical significance, and
> that the core system resolve those magic symbols dynamically.

> I'm not aware of this pattern appearing anywhere else within Postgres.

There's _PG_init/fini...

> Furthermore, it seems kind of short sighted. Have we not painted
> ourselves into a corner with regard to using multiple plugins at once?
> This doesn't seem terribly unreasonable, if for example we wanted to
> use test_decoding in production to debug a problem, while running a
> proper logical replication system and some other logical change-set
> consumer in tandem.

How does the scheme prevent you from doing that? Simply open up another
replication connection and specify a different output plugin there?
Not sure how two output plugins in one process would make sense?

> Idiomatic use of “hooks” allows multiple plugins
> to be called for the same call of the authoritative hook by the core
> system, as for example when using auto_explain and pg_stat_statements
> at the same time. Why not just use hooks? It isn't obvious that you
> shouldn't be able to do this.

I considered using hooks but it seemed not to be a good fit. Let me
describe my thought process:

1) we want different output formats to be available in the same server &
   database
2) the wished-for plugin should be specified via the replication
   connection
3) thus shared_preload_libraries and such aren't really helpful
4) we need to load the plugin ourselves
5) We could simply load it and let the object's _PG_init() call
   something like OutputPluginInitialize(begin_callback, change_callback,
   commit_callback), but then we would need to handle the case where that
   wasn't called and such (see the sketch below)
6) Going the OutputPluginInitialize route didn't seem to offer any
   benefits, thus the hardcoded symbol names
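
For reference, the registration scheme rejected in 5) would have looked
roughly like this - a hypothetical sketch; OutputPluginInitialize and
the my_* callbacks don't exist anywhere:

/* hypothetical: a module registers its callbacks itself at load time */
void
_PG_init(void)
{
    OutputPluginInitialize(my_begin_callback,
                           my_change_callback,
                           my_commit_callback);
}

The core would then have to detect and complain about modules that were
loaded but never called OutputPluginInitialize, which is the case 5)
refers to.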


> The signature of the function
> pg_decode_change (imposed by the function pointer typedef
> LogicalDecodeChangeCB) assumes that everything should go through a
> passed StringInfo, but I have a hard time believing that that's a good
> idea.

I don't particularly like passing a StringInfo either, but what would
you rather pass? Note that StringInfos are what's currently used in
normal fe/be communication.

Doing the sending out directly in the output plugin seems to be a bad
idea because:
1) we need to handle receiving replies from the receiving side, like
   keepalives and such; also we need to terminate the connection if no
   reply has come inside wal_sender_timeout.
2) the output plugins imo shouldn't know they are sending out to a
   walsender; we might want to allow sending from inside a function, to
   disk or anything at some point.

Does the reasoning make sense to you?

> It's like your plugin functions as a way of filtering reorder buffers.
> It's not as if the core system just passes logical change-sets off, as
> one might expect. It is actually the case that clients have to connect
> to the server in replication mode, and get their change-sets (as
> filtered by their plugin) streamed by a walsender over the wire
> protocol directly. What of making changeset subscribers generic
> abstractions?

Sorry, I cannot follow you here. What kind of architecture are you
envisioning?

> Snapshot builder
> ================
>
> I still don't see why you're allocating snapshots within
> DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots
> would be better allocated alongside the ReadApplyState that is
> directly concerned with snapshots, to better encapsulate the snapshot
> stuff. Now, you do at least acknowledge this problem this time around:
>
> +     /*
> +      * FIXME: The existance of the snapshot builder is pretty obvious to the
> +      * outside right now, that doesn't seem to be very good...
> +      */

I think that comment was there in the last round as well ;)

> However, the fact is that this function:
>
> + Snapstate *
> + AllocateSnapshotBuilder(ReorderBuffer *reorder)
> + {
>
> doesn't actually do anything with the ReorderBuffer pointer that it is
> passed. So I don't see why you've put off doing this, as if it's
> something that would require a non-trivial effort.

Well, there simply are a lot of things that each need a little bit of
effort. In total that's still a nontrivial amount.

And I wasn't sure how much of all that needed to change due to changes
in the actual snapshot building and the xlogreader swap. Turned out not
too many...


[ The xmin handling deserves its own mail, I'll respond to that
separately]


> I am also pleased to see that you're invalidating system caches in a
> more granular fashion (for transactions that contain both DDL and DML,
> where we cannot rely on the usual Hot Standby where sinval messages
> are applied for commit records). That is a subject worthy of another
> e-mail, though.

There still are two issues worth improving with this though:
1) clear the whole cache when entering/leaving timetravel
2) don't replay normal "present day" inval messages while in timetravel
   * That may actually be able to cause errors when trying to reload the
     relcache...

1) seems pretty uncontroversial to me since it should happen really
infrequently and it seems to be semantically correct. I have to think
some more about 2); there are some interesting things with relmap
updates due to CLUSTER et al. on nailed tables...

> Decoding (“glue code”)
> ======================
>
> We've seen [1] that decoding is concerned with decoding WAL records
> from an xlogreader.h callback into an reorderbuffer.
>
> Decoding means breaking up individual XLogRecord structs, reading them
> through an XlogReaderState, and storing them in an Re-Order buffer
> (reorderbuffer.c does this, and stores them as ReorderBufferChange
> records), while building a snapshot (which is needed in advance of
> adding tuples from records). It can be thought of as the small piece
> of glue between reorderbuffer and snapbuild that is called by
> XLogReader (DecodeRecordIntoReorderBuffer() is the only public
> function, which will be called by the WAL sender – previously, this
> was called by plugins directly).
>
> An example of what belongs in decode.c is the way it ignores physical
> XLogRecords, because they are not of interest.
>
>  src/backend/replication/logical/decode.c           |  494 ++++++++
>  src/backend/replication/logical/logicalfuncs.c     |  224 ++++
>  src/backend/utils/adt/dbsize.c                     |   79 ++
>  src/include/catalog/indexing.h                     |    2 +
>  src/include/catalog/pg_proc.h                      |    2 +
>  src/include/replication/decode.h                   |   21 +
>  src/include/replication/logicalfuncs.h             |   45 +
>  src/include/storage/itemptr.h                      |    3 +
>  src/include/utils/builtins.h                       |    1 +
>
> The pg_proc accessible utility function pg_relation_by_filenode() -
> which you've documented - doesn't appear to be used at present (it's
> just a way of exposing the core infrastructure, as described under
> “Miscellaneous thoughts”)

It's not required for anything (and I don't think it ever will be). It
was handy during development of this though, and I could have used it
earlier during DBAish work.
Hm. We could use it to add a regression test for the new syscache
though...

> . A new index is created on pg_class
> (reltablespace oid_ops, relfilenode oid_ops).
>
> We've seen that we need a whole new infrastructure for resolving
> relfilenodes to relation OIDs, because only relfilenodes are available
> from the WAL stream, and in general the mapping isn't stable, as for
> example when we need to do a table rewrite. We have a new syscache for
> this.

> We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table
> relfilenode and combocids. I'm still not clear on how you're managing
> corner case with relfilenode/table oid mapping that Robert spoke of
> previously [17]. Could you talk about that?

Sure. (Found another potential bug due to this already). Robert talks
about two dangers:

1) relfilenode => oid is not unique
2) the relfilenode => oid mapping changes over time

1) is solved by only looking up relfilenodes by (reltablespace,
relfilenode) (which is why the syscache is over those two columns, thanks
to Robert's observations). We can recognize shared relations via spcNode ==
GLOBALTABLESPACE_OID and we can recognize nailed tables by the fact that
they cannot be looked up in pg_class (there's an InvalidOid stored in
pg_class for them).
Shared and nailed tables are then looked up via the new
RelationMapFilenodeToOid function.
As the decoding is now per-database (we don't have the other catalogs),
we skip processing tuples when dbNode != MyDatabaseId.

So I think 1) is handled by that?

2) is solved by the fact that the syscache now works properly
time-relativized as well. That is, if you look up the (reltablespace,
relfilenode) => oid mapping in the syscache, you get the correct result
for the current moment in time (what's the correct term for "current" when
it's only current from the POV of timetravelling?). Due to the proper
cache invalidation handling, old mappings are purged correctly as well.
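
To make that concrete, a sketch of the lookup scheme just described. The
syscache identifier and the function name are assumptions for
illustration, not necessarily what the patch uses; a real version would
also have to map the database's default tablespace to the 0 that
pg_class.reltablespace stores for it:

#include "postgres.h"
#include "access/htup_details.h"    /* HeapTupleGetOid; access/htup.h in older trees */
#include "catalog/pg_tablespace.h"
#include "utils/relmapper.h"
#include "utils/syscache.h"

static Oid
LookupRelationOidByFilenode(Oid spcNode, Oid relNode)
{
    HeapTuple   tuple;
    Oid         result;

    /* shared relations are only known to the relation mapper */
    if (spcNode == GLOBALTABLESPACE_OID)
        return RelationMapFilenodeToOid(relNode, true);

    tuple = SearchSysCache2(RELFILENODE,
                            ObjectIdGetDatum(spcNode),
                            ObjectIdGetDatum(relNode));
    if (HeapTupleIsValid(tuple))
    {
        result = HeapTupleGetOid(tuple);
        ReleaseSysCache(tuple);
        return result;
    }

    /* nailed tables store InvalidOid in pg_class, so ask the mapper */
    return RelationMapFilenodeToOid(relNode, false);
}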

> Reorder buffer
> ==============
>
> Last time around [1], this was known as ApplyCache. It's still
> concerned with the management of logical replay cache - it reassembles
> transactions from a stream of interspersed changes. This is what a
> design doc previously talked about under “4.5 - TX reassembly” [14].

Happier with the new name?

>  src/backend/replication/logical/reorderbuffer.c    | 1185 ++++++++++++++++++++
>  src/include/replication/reorderbuffer.h            |  284 +++++
>
> Last time around, I described spooling to disk, like a tuplestore, as
> a probable prerequisite to commit - I raise that now because I thought
> that this was the place where you'd most likely want to do that.
> Concerns about the crash-safety of buffered change-sets were raised
> too.

Yes, this certainly is a prerequisite to commit.

> You say this in a README:
>
> + * crash safety, restartability & spilling to disk
> + * consistency with the commit status of transactions
> + * only a minimal amount of synchronous work should be done inside individual
> + transactions
> +
> + In our opinion those problems are restricting progress/wider distribution of
> + these class of solutions. It is our aim though that existing solutions in this
> + space - most prominently slony and londiste - can benefit from the work we are
> + doing & planning to do by incorporating at least parts of the changeset
> + generation infrastructure.
>
> So, have I understood correctly - are you proposing that we simply
> outsource this to something else? I'm not sure how I feel about that,
> but I'd like clarity on this matter.

No, this needs to be implemented in the reorderbuffer. That's the next
task I will work on after committing the actual snapshot export.

> reorderbuffer.h should have way, way more comments for each of the
> structs. I want to see detailed comments, like those you see for the
> structs in parsenodes.h - you shouldn't have to jump to some design
> document to see how each struct fits within the overall design of
> reorder buffering.

Will go over it. I am wondering whether it makes sense to split most of
the structs in the header into a private and a public part...

> XLog stuff (in particular, the new XLogReader)
> =================================
>
> There was some controversy over the approach to implementing a
> “generic xlog reader”[13]. This revision of Andres' work presumably
> resolves that controversy, since it heavily incorporates Heikki's own
> work. Heikki has described the design of his original XLogReader patch
> [18].

I hope it's resolved. I won't believe it until a version is committed
;)

> pg_xlogdump could be considered a useful way of testing the XLogReader
> and decoding functionality, independent of the test_decoding plugin.
> It is something that I'll probably use to debug this patch over the
> next few weeks. Example usage:
>
> [peter@peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
> xlog record: rmgr: Heap2      , record_len:     34, tot_len:     66,
> tx:       1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc:
> new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295,
> combo: 4294967295
> xlog record: rmgr: Heap       , record_len:    175, tot_len:    207,
> tx:       1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc:
> insert: rel 1663/12933/12671; tid 7/44
> xlog record: rmgr: Btree      , record_len:     34, tot_len:     66,
> tx:       1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc:
> insert: rel 1663/12933/12673; tid 1/355

> In another thread, Robert and Heikki remarked that pg_xlogdump ought
> to be in contrib, and not in src/bin. As you know, I am inclined to
> agree.

Moved to contrib in Heikki's worktree by now.

> [peter@peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
> fatal_error: requested WAL segment 012345670000000000000009 has
> already been removed

> This error message seems a bit presumptuous to me; as it happens there
> never was such a WAL segment. Saying that there was introduces the
> possibility of operator error.

FWIW it's the "historical" error message for that ;). I am happy to
change it to something else.

> The real heavyweight here is xlogreader.c, at 962 lines. The module
> refactors xlog.c, moving ReadRecord and some supporting functions to
> xlogreader.c. Those supporting functions now operate on *generic*
> XLogReaderState rather than various global variables. The idea here is
> that the client of the API calls ReadRecord repeatedly to get each
> record.

> There is a callback of type XLogPageReadCB, which is used by the
> client to obtain a given page in the WAL stream. The XLogReader
> facility is responsible for decoding the WAL into records, but the
> client is responsible for supplying the physical bytes via the
> callback within XLogReader state. There is an error-handling callback
> too, added by Andres.

Gone again; solved much better by Heikki.

> Andres added a new function,
> XLogFindNextRecord(), which is used for checking wether RecPtr is a
> valid XLog address for reading and to find the first valid address
> after some address when dumping records, for debugging purposes.

And that's a much-needed safety feature for logical decoding. We
need to make sure LSNs specified by the user don't point into the middle
of a record. That would make some ugly things possible.
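
A hypothetical usage sketch of that check (function name and error
wording are made up; the XLogFindNextRecord() signature is as described
above):

#include "postgres.h"
#include "access/xlogreader.h"

static void
CheckDecodingStartPoint(XLogReaderState *xlogreader, XLogRecPtr startptr)
{
    XLogRecPtr  first = XLogFindNextRecord(xlogreader, startptr);

    if (XLogRecPtrIsInvalid(first))
        ereport(ERROR,
                (errmsg("could not find a valid record after %X/%X",
                        (uint32) (startptr >> 32), (uint32) startptr)));

    /* refuse positions pointing into the middle of a record */
    if (first != startptr)
        ereport(ERROR,
                (errmsg("%X/%X is not a valid start point, closest record starts at %X/%X",
                        (uint32) (startptr >> 32), (uint32) startptr,
                        (uint32) (first >> 32), (uint32) first)));
}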

> Why did you move the page validation handling into XLogReader?

Because it's needed from xlogdump and WAL decoding as
well. Reimplementing it there doesn't seem like a good idea, and neither
does skimping on checks.

Any arguments against?


> heapam and other executor stuff
> ===============================
>
> One aspect of this patch that I feel certainly warrants another of
> these subsections is the changes to heapam.c and related executor
> changes. These are essentially changes to functions called by
> nodeModifyTable.c frequently, including functions like
> heap_hot_search_buffer, heap_insert, heap_multi_insert and
> heap_delete. We now have to do extra logical logging, and we need
> primary key values to be looked up.

The amount of extra logging should be relatively small though - some
preliminary tests seem to confirm that for me. But it certainly needs
some more validation.

I think it would be sensible to add the primary/candidate key as a
relcache/RelationData attribute. Do others agree?
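
For illustration, a sketch of how such a lookup could work with existing
infrastructure before any relcache caching is added; the function name is
made up:

#include "postgres.h"
#include "access/htup_details.h"    /* GETSTRUCT; access/htup.h in older trees */
#include "catalog/pg_index.h"
#include "nodes/pg_list.h"
#include "utils/rel.h"
#include "utils/relcache.h"
#include "utils/syscache.h"

static Oid
RelationGetPrimaryKeySketch(Relation rel)
{
    List       *indexlist = RelationGetIndexList(rel);
    ListCell   *lc;
    Oid         result = InvalidOid;

    foreach(lc, indexlist)
    {
        Oid         indexoid = lfirst_oid(lc);
        HeapTuple   tuple;
        Form_pg_index indexform;

        tuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
        if (!HeapTupleIsValid(tuple))
            continue;
        indexform = (Form_pg_index) GETSTRUCT(tuple);
        if (indexform->indisprimary && indexform->indisvalid)
            result = indexoid;
        ReleaseSysCache(tuple);

        if (OidIsValid(result))
            break;
    }
    list_free(indexlist);
    return result;
}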

heap_hot_search_buffer was changed in the course of the *Satisfies
changes; that's not related to this part.

> Files changed include:
>
>  src/backend/access/heap/heapam.c         |  284 ++++-
>  src/backend/access/heap/pruneheap.c      |   16 +-
>  src/backend/catalog/index.c              |   76 +-
>  src/backend/access/rmgrdesc/heapdesc.c   |    9 +
>  src/include/access/heapam_xlog.h         |   23 +
>  src/include/catalog/index.h              |    4 +


> What of this? (I'm using the dellstore sample database, as always):
>
> WARNING:  Could not find primary key for table with oid 16406
> CONTEXT:  SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
> $1 OPERATOR(pg_catalog.=) "orderid""
> DELETE 1
>
> I don't have time to figure out what this issue is right now.

It's just a development debugging message that should go away in the near
future. There's no primary key on orderlines, so we currently cannot
safely replicate DELETEs. It's recognizable from the record that that's
the case, so we should be able to handle that "safely" during decoding,
that is, we can print a warning there.

> Hot Standby, Replication and libpq stuff
> ========================================
>
> I take particular interest in bgwriter.c here. You're doing this:
>
> +          * Log a new xl_running_xacts every now and then so replication can get
> +          * into a consistent state faster and clean up resources more
> +          * frequently. The costs of this are relatively low, so doing it 4
> +          * times a minute seems fine.
>
> What about the power consumption of the bgwriter?

I think we are not doing any additional wakeups due to this; the
complete sleeping logic is unaffected. The maximum sleep duration
currently is "BgWriterDelay * HIBERNATE_FACTOR", which is lower than the
interval at which we log new snapshots. So I don't think this should
make a measurable difference?
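
For reference, a minimal sketch of the timing logic described in that
comment, as it could sit in the existing bgwriter loop. The wrapper
function, the interval constant and the timestamp variable are made up;
LogStandbySnapshot() is the existing facility for logging
xl_running_xacts (assuming its no-argument form):

#include "postgres.h"
#include "storage/standby.h"
#include "utils/timestamp.h"

#define LOG_SNAPSHOT_INTERVAL_MS 15000  /* "4 times a minute" */

static TimestampTz last_snapshot_ts = 0;

/* piggybacks on wakeups that happen anyway; never causes one itself */
static void
MaybeLogRunningXacts(void)
{
    TimestampTz now = GetCurrentTimestamp();

    if (TimestampDifferenceExceeds(last_snapshot_ts, now,
                                   LOG_SNAPSHOT_INTERVAL_MS))
    {
        LogStandbySnapshot();   /* emits an xl_running_xacts record */
        last_snapshot_ts = now;
    }
}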

> I think that the way try to interact with the existing loop logic is
> ill-considered. Just why is the bgwriter the compelling auxiliary
> process in which to do this extra work?

Which process would be a good idea otherwise? Bgwriter seemed best
suited to me, but I am certainly open to reconsideration. It really was
a process of elimination, and I don't really see a downside.

Moving that code somewhere else should be no problem, so I am open to
suggestions.

> Quite a lot of code has been added to walsender. This is mostly down
> to some new functions, responsible for initialising logical
> replication:
>
> ! typedef void (*WalSndSendData)(bool *);
> ! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
>   static void InitWalSenderSlot(void);
>   static void WalSndKill(int code, Datum arg);
> ! static void XLogSendPhysical(bool *caughtup);
> ! static void XLogSendLogical(bool *caughtup);
>   static void IdentifySystem(void);
>   static void StartReplication(StartReplicationCmd *cmd);
> + static void CheckLogicalReplicationRequirements(void);
> + static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
> + static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
> + static void ComputeLogicalXmin(void);
>
> This is mostly infrastructure for initialising and starting logical replication.
>
> Initialisation means finding a free “logical slot” from shared memory,
> then looping until the new magic xmin horizon for logical walsenders
> (stored in their “slot”) is that of the weakest link (think local
> global xmin).
>
> +      * FIXME: think about solving the race conditions in a nicer way.
> +      */
> + recompute_xmin:
> +     walsnd->xmin = GetOldestXmin(true, true);
> +     ComputeLogicalXmin();
> +     if (walsnd->xmin != GetOldestXmin(true, true))
> +         goto recompute_xmin;
>
> Apart from the race conditions that I'm not confident are addressed
> here, I think that the above could easily get stuck indefinitely in
> the event of contention.

I don't like that part the slightest bit, but I don't think it's actually
in danger of looping forever. In fact I think it's so broken it won't
ever loop ;). (ComputeLogicalXmin() will set the current global minimum,
which will then be returned by GetOldestXmin().)

I would like to solve this properly without copying GetOldestXmin once
more (so we can compute and set the logical xmin while holding
ProcArrayLock), but I am not yet sure what a nice way to do that would
look like.

I guess a GetOldestXminNoLock that gets called while we already hold the
procarray lock? Yuck.

If we have it we should also use it for hot standby feedback.

> Initialisation occurs due to a “INIT_LOGICAL_REPLICATION” replication
> command. Initialisation also means that decoding state is allocated (a
> snapshot reader is initialised), and we report back success or failure
> to the client that's using the streaming replication protocol (i.e. in
> our toy example, pg_receivellog).
>
> Starting logical replication means we load the previously initialised
> slot, and find a snapshot reader plugin (using the “magic symbols”
> pattern described above, under “Plugin interface”).
>
> Why do we have to “find” a logical slot twice (both during
> initialisation and starting)?

Because they can happen in totally different walsenders, even after a
restart. Finding a consistent point to start decoding from can take some
time (basically you need to wait for any old snapshots to finish); you
don't want to do that every time you disconnect, as you would lose
updates in between.

So what you do is run INIT_LOGICAL_REPLICATION *once* when you set up a
new replica. And then you only do START_LOGICAL_REPLICATION 'slot-id'
'position'; afterwards.

Obviously that needs some work since we're not yet persisting enough
between restarts... As I said above, that's what I am working on next.


> Minor point: This is a terrible name for the variable in question:
>
> +     LogicalWalSnd *walsnd;

Why? As long as the struct is called "LogicalWalSnd" it seems
accurate enough.

I think LogicalWalSnds should emancipate themselves and be separate from
walsender; when that has happened it's obviously a bad name.

> Miscellaneous thoughts
> ======================
>
> You're still using C stdlib functions like malloc, free, calloc quite
> a bit. My concern is that this points to a lack of thought about the
> memory management strategy; why are you still not using memory
> contexts in some places? If it's so difficult to anticipate what
> clients of, say, XLogReaderAllocate() want for the lifetime of their
> memory, then likely as not those clients should be doing their own
> memory allocation, and passing the allocated buffer directly. If it is
> obvious that the memory ought to persist indefinitely (and I think
> it's your contention that it is in the case of XLogReaderAllocate()),
> I'd just allocate it in the top memory context. Now, I am aware that
> there are a trivial number of backend mallocs that you can point to as
> precedent here, but I'm still not satisfied with your explanation for
> using malloc(). At the very least, you ought to be handling the case
> where malloc returns NULL, and you're not doing so consistently.

There are different categories here. XLogReader *has* to use malloc
instead of the pg infrastructure since it needs to be usable by xlogdump,
which doesn't have the pg memory infrastructure.
I would like reorderbuffer.c to stay usable outside the backend as well
(primarily for a printing tool of the spooled changes).

In contrast, the use case for snapbuild.c outside the backend is pretty
slim, so it should probably grow its own memory context and use that.
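
For what it's worth, one common way to keep such code dual-mode is a tiny
allocation shim like the sketch below (the macro names are made up); the
xlogreader code instead calls malloc() directly and must then check for
NULL itself:

#ifdef FRONTEND
#include <stdlib.h>
#define xlr_alloc(sz)   malloc(sz)      /* caller must check for NULL */
#define xlr_free(ptr)   free(ptr)
#else
#include "postgres.h"
#define xlr_alloc(sz)   palloc(sz)      /* errors out on OOM */
#define xlr_free(ptr)   pfree(ptr)
#endif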

> Minor gripes:
>
> * There is no need to use a *.txt extension for README files; we don't
> currently use those anywhere else.

It makes it easier for me to have a generic rule to transcode them into
a different format; that's why they have it...

> * If you only credit the PGDG and not the Berkeley guys (as you
> should, for the most part), there is no need to phrase the notice
> “Portions Copyright...”. You should just say “Copyright...”.

ok.

> * You're still calling function pointer typedefs things like
> LogicalDecodeInitCB. As I've already pointed out, you should prefer
> the existing conventions (call it something like
> LogicalDecodeInit_hook_type).

I still think CB is much better than _hook_type, because it's not a
hook in this case, it's a callback. A hook intercepts normal operation;
that's not what happens here. Note that we already use *CallBack in
various places. If you prefer the longer form, ok, I can do that.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Andres Freund
Date:
On 2012-12-13 18:29:00 +0100, Andres Freund wrote:
> On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
> > Initialisation means finding a free “logical slot” from shared memory,
> > then looping until the new magic xmin horizon for logical walsenders
> > (stored in their “slot”) is that of the weakest link (think local
> > global xmin).
> >
> > +      * FIXME: think about solving the race conditions in a nicer way.
> > +      */
> > + recompute_xmin:
> > +     walsnd->xmin = GetOldestXmin(true, true);
> > +     ComputeLogicalXmin();
> > +     if (walsnd->xmin != GetOldestXmin(true, true))
> > +         goto recompute_xmin;
> >
> > Apart from the race conditions that I'm not confident are addressed
> > here, I think that the above could easily get stuck indefinitely in
> > the event of contention.
>
> I don't like that part the slightest bit but I don't think its actually
> in danger of looping forever. In fact I think its so broken it won't
> ever loop ;). (ComputeLogicalXmin() will set the current global minimum
> which will then be returned by GetOldestXmin()).
>
> I would like to solve this properly without copying GetOldestXmin once
> more (so we can compute and set the logical xmin while holding
> ProcArrayLock), but I am not yet sure how a nice way to do that would
> look like.
>
> I guess GetOldestXminNoLock? That gets called while we already hold the
> procarray lock? Yuck.

Does anybody have an opinion on the attached patches? Especially 0001,
which contains the procarray changes?

It moves a computation of the sort of:

result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
   result = FirstNormalTransactionId;

inside ProcArrayLock. But I can't really imagine that to be relevant...


Another alternative to this would be to get a snapshot with
GetSnapshotData(), copy the xmin to the logical slot, then call
ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
me.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: logical decoding - GetOldestXmin

From
Robert Haas
Date:
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It moves a computation of the sort of:
>
> result -= vacuum_defer_cleanup_age;
> if (!TransactionIdIsNormal(result))
>    result = FirstNormalTransactionId;
>
> inside ProcArrayLock. But I can't really imagine that to be relevant...

I can.  Go look at some of the 9.2 optimizations around
GetSnapshotData().  Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization.  For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding - GetOldestXmin

From
Andres Freund
Date:
On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > It moves a computation of the sort of:
> >
> > result -= vacuum_defer_cleanup_age;
> > if (!TransactionIdIsNormal(result))
> >    result = FirstNormalTransactionId;
> >
> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>
> I can.  Go look at some of the 9.2 optimizations around
> GetSnapshotData().  Those made a BIG difference under heavy
> concurrency and they were definitely micro-optimization.  For example,
> the introduction of NormalTransactionIdPrecedes() was shockingly
> effective.

But GetOldestXmin() should be called less frequently than
GetSnapshotData() by several orders of magnitudes. I don't really see
it being used in any really hot code paths?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Simon Riggs
Date:
On 13 December 2012 22:37, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
>> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > It moves a computation of the sort of:
>> >
>> > result -= vacuum_defer_cleanup_age;
>> > if (!TransactionIdIsNormal(result))
>> >    result = FirstNormalTransactionId;
>> >
>> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>>
>> I can.  Go look at some of the 9.2 optimizations around
>> GetSnapshotData().  Those made a BIG difference under heavy
>> concurrency and they were definitely micro-optimization.  For example,
>> the introduction of NormalTransactionIdPrecedes() was shockingly
>> effective.
>
> But GetOldestXmin() should be called less frequently than
> GetSnapshotData() by several orders of magnitudes. I don't really see
> it being used in any really hot code paths?

Maybe, but that calculation doesn't *need* to be inside the lock, that
is just a consequence of the current coding.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Michael Paquier
Date:


On Fri, Dec 14, 2012 at 2:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It moves a computation of the sort of:
>
> result -= vacuum_defer_cleanup_age;
> if (!TransactionIdIsNormal(result))
>    result = FirstNormalTransactionId;
>
> inside ProcArrayLock. But I can't really imagine that to be relevant...

I can.  Go look at some of the 9.2 optimizations around
GetSnapshotData().  Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization.  For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.

The two commits coming to my mind are:
- ed0b409 (Separate PGPROC into PGPROC and PGXACT)
- 0d76b60 (introduction of NormalTransactionIdPrecedes)
Those ones really improved concurrency performance.
--
Michael Paquier
http://michael.otacoo.com

Re: logical decoding - GetOldestXmin

From
Andres Freund
Date:
On 2012-12-13 23:35:00 +0000, Simon Riggs wrote:
> On 13 December 2012 22:37, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
> >> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > It moves a computation of the sort of:
> >> >
> >> > result -= vacuum_defer_cleanup_age;
> >> > if (!TransactionIdIsNormal(result))
> >> >    result = FirstNormalTransactionId;
> >> >
> >> > inside ProcArrayLock. But I can't really imagine that to be relevant...
> >>
> >> I can.  Go look at some of the 9.2 optimizations around
> >> GetSnapshotData().  Those made a BIG difference under heavy
> >> concurrency and they were definitely micro-optimization.  For example,
> >> the introduction of NormalTransactionIdPrecedes() was shockingly
> >> effective.
> >
> > But GetOldestXmin() should be called less frequently than
> > GetSnapshotData() by several orders of magnitudes. I don't really see
> > it being used in any really hot code paths?
>
> Maybe, but that calculation doesn't *need* to be inside the lock, that
> is just a consequence of the current coding.

I am open to suggestions for how to do that in a way where we a) can hold
the lock already (to safely nail the global xmin to the current value)
and b) don't duplicate all the code.

Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called

* once per checkpoint
* once per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages

Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. An "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Robert Haas
Date:
On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Just moving that tidbit inside the lock seems to be the pragmatic
> choice. GetOldestXmin is called
>
> * once per checkpoint
> * one per index build
> * once in analyze
> * twice per vacuum
> * once for HS feedback messages
>
> Nothing of that occurs frequently enough that 5 instructions will make a
> difference. I would be happy to go an alternative path, but right now I
> don't see any nice one. A "already_locked" parameter to GetOldestXmin
> seems to be a cure worse than the disease.

I'm not sure that would be so bad, but I guess I question the need to
do it this way at all.  Most of the time, if you need to advertise
your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
I guess I'm not seeing why that wouldn't also work here.  Am I dumb?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding - GetOldestXmin

From
Andres Freund
Date:
On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Just moving that tidbit inside the lock seems to be the pragmatic
> > choice. GetOldestXmin is called
> >
> > * once per checkpoint
> > * one per index build
> > * once in analyze
> > * twice per vacuum
> > * once for HS feedback messages
> >
> > Nothing of that occurs frequently enough that 5 instructions will make a
> > difference. I would be happy to go an alternative path, but right now I
> > don't see any nice one. A "already_locked" parameter to GetOldestXmin
> > seems to be a cure worse than the disease.
>
> I'm not sure that would be so bad, but I guess I question the need to
> do it this way at all.  Most of the time, if you need to advertise
> your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
> I guess I'm not seeing why that wouldn't also work here.  Am I dumb?

I wondered upthread whether that would be better:

On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
> Another alternative to this would be to get a snapshot with
> GetSnapshotData(), copy the xmin to the logical slot, then call
> ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
> me.

Not sure why I considered it ugly anymore, but it actually has a
noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData,
as the latter sets a fairly new xid as xmin whereas GetOldestXmin returns
the actual current xmin horizon. That's preferable because it allows us
to start up more quickly: snapbuild.c can only start building a snapshot
once it has seen an xl_running_xacts with oldestRunningXid >=
own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
have been removed.

This also made me notice that my changes to GetSnapshotData were quite
pessimal... I set the xmin of the new snapshot to the "logical xmin"
instead of applying that only to globalxmin/RecentGlobalXmin.

Andres

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Andres Freund
Date:
On 2012-12-15 01:19:26 +0100, Andres Freund wrote:
> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
> > On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > Just moving that tidbit inside the lock seems to be the pragmatic
> > > choice. GetOldestXmin is called
> > >
> > > * once per checkpoint
> > > * one per index build
> > > * once in analyze
> > > * twice per vacuum
> > > * once for HS feedback messages
> > >
> > > Nothing of that occurs frequently enough that 5 instructions will make a
> > > difference. I would be happy to go an alternative path, but right now I
> > > don't see any nice one. A "already_locked" parameter to GetOldestXmin
> > > seems to be a cure worse than the disease.
> >
> > I'm not sure that would be so bad, but I guess I question the need to
> > do it this way at all.  Most of the time, if you need to advertise
> > your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
> > I guess I'm not seeing why that wouldn't also work here.  Am I dumb?
>
> I wondered upthread whether that would be better:
>
> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
> > Another alternative to this would be to get a snapshot with
> > GetSnapshotData(), copy the xmin to the logical slot, then call
> > ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
> > me.
>
> Not sure why I considered it ugly anymore, but it actually has a
> noticeable disadvantage. GetOldestXmin is nicer is than GetSnapshotData
> as the latter set a fairly new xid as xmin whereas GetOldestXmin returns
> the actual current xmin horizon. Thats preferrable because it allows us
> to start up more quickly. snapbuild.c can only start building a snapshot
> once it has seen a xl_running_xact with oldestRunningXid >=
> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
> have been removed.

Hm. One way that could work with fewer changes is to exploit the fact
that a) it seems to be possible to acquire a shared lwlock twice in the
same backend and b) both GetOldestXmin & GetSnapshotData acquire only a
shared lwlock.

Are we willing to guarantee that recursive acquisition of shared lwlocks
continues to work?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Simon Riggs
Date:
On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:

> Does anybody have an opinion on the attached patches? Especially 0001,
> which contains the procarray changes?
>
> It moves a computation of the sort of:
>
> result -= vacuum_defer_cleanup_age;
> if (!TransactionIdIsNormal(result))
>    result = FirstNormalTransactionId;
>
> inside ProcArrayLock. But I can't really imagine that to be relevant...

I don't see why this is hard.

Just make the lock acquisition/release conditional on another parameter.

That way the only thing you'll be moving inside the lock is an if test
on a constant boolean.
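
In code form, a minimal sketch of that shape (the proc array scan is
elided and the signature is an assumption):

#include "postgres.h"
#include "access/transam.h"
#include "storage/lwlock.h"
#include "storage/standby.h"

TransactionId
GetOldestXminSketch(bool already_locked)
{
    TransactionId result;

    if (!already_locked)
        LWLockAcquire(ProcArrayLock, LW_SHARED);

    result = ShmemVariableCache->latestCompletedXid;
    TransactionIdAdvance(result);

    /* ... scan the proc array, lowering result to the oldest xmin ... */

    if (!already_locked)
        LWLockRelease(ProcArrayLock);

    /*
     * For ordinary callers this stays outside the lock; a caller passing
     * already_locked = true runs it under its own lock, which is exactly
     * what the logical-decoding case wants.
     */
    result -= vacuum_defer_cleanup_age;
    if (!TransactionIdIsNormal(result))
        result = FirstNormalTransactionId;

    return result;
}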

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Andres Freund
Date:
On 2012-12-16 16:44:04 +0000, Simon Riggs wrote:
> On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > Does anybody have an opinion on the attached patches? Especially 0001,
> > which contains the procarray changes?
> >
> > It moves a computation of the sort of:
> >
> > result -= vacuum_defer_cleanup_age;
> > if (!TransactionIdIsNormal(result))
> >    result = FirstNormalTransactionId;
> >
> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>
> I don't see why this is hard.
>
> Just make the lock acquisition/release conditional on another parameter.
>
> That way the only thing you'll be moving inside the lock is an if test
> on a constant boolean.

That's not really cheaper. Two branches + an additional parameter
passed/pushed vs. one branch, one subtraction, two assignments is a
close call.
As I don't think either really matters in the GetOldestXmin case, I
would be happy with that as well. If people prefer an additional
parameter + adjusting the few callsites vs. a separate function, I will
go that way.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Robert Haas
Date:
On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
>> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Just moving that tidbit inside the lock seems to be the pragmatic
>> > choice. GetOldestXmin is called
>> >
>> > * once per checkpoint
>> > * one per index build
>> > * once in analyze
>> > * twice per vacuum
>> > * once for HS feedback messages
>> >
>> > Nothing of that occurs frequently enough that 5 instructions will make a
>> > difference. I would be happy to go an alternative path, but right now I
>> > don't see any nice one. A "already_locked" parameter to GetOldestXmin
>> > seems to be a cure worse than the disease.
>>
>> I'm not sure that would be so bad, but I guess I question the need to
>> do it this way at all.  Most of the time, if you need to advertise
>> your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
>> I guess I'm not seeing why that wouldn't also work here.  Am I dumb?
>
> I wondered upthread whether that would be better:
>
> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
>> Another alternative to this would be to get a snapshot with
>> GetSnapshotData(), copy the xmin to the logical slot, then call
>> ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
>> me.
>
> Not sure why I considered it ugly anymore, but it actually has a
> noticeable disadvantage. GetOldestXmin is nicer is than GetSnapshotData
> as the latter set a fairly new xid as xmin whereas GetOldestXmin returns
> the actual current xmin horizon. Thats preferrable because it allows us
> to start up more quickly. snapbuild.c can only start building a snapshot
> once it has seen a xl_running_xact with oldestRunningXid >=
> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
> have been removed.

I'm a bit confused.  Are you talking about the difference between
RecentGlobalXmin and RecentXmin?  I think GetSnapshotData() updates
both.

Anyway, if there's no nicer way, I think it's probably OK to add a
parameter to GetOldestXmin().  It seems like kind of a hack, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding - GetOldestXmin

From
"anarazel@anarazel.de"
Date:
Hi,

Robert Haas <robertmhaas@gmail.com> schrieb:

>On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com>
>wrote:
>> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
>>> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund
><andres@2ndquadrant.com> wrote:
>>> > Just moving that tidbit inside the lock seems to be the pragmatic
>>> > choice. GetOldestXmin is called
>>> >
>>> > * once per checkpoint
>>> > * one per index build
>>> > * once in analyze
>>> > * twice per vacuum
>>> > * once for HS feedback messages
>>> >
>>> > Nothing of that occurs frequently enough that 5 instructions will
>make a
>>> > difference. I would be happy to go an alternative path, but right
>now I
>>> > don't see any nice one. A "already_locked" parameter to
>GetOldestXmin
>>> > seems to be a cure worse than the disease.
>>>
>>> I'm not sure that would be so bad, but I guess I question the need
>to
>>> do it this way at all.  Most of the time, if you need to advertise
>>> your global xmin, you use GetSnapshotData(), not GetOldestXmin(),
>and
>>> I guess I'm not seeing why that wouldn't also work here.  Am I dumb?
>>
>> I wondered upthread whether that would be better:
>>
>> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
>>> Another alternative to this would be to get a snapshot with
>>> GetSnapshotData(), copy the xmin to the logical slot, then call
>>> ProcArrayEndTransaction(). But that doesn't really seem to be nicer
>to
>>> me.
>>
>> Not sure why I considered it ugly anymore, but it actually has a
>> noticeable disadvantage. GetOldestXmin is nicer is than
>GetSnapshotData
>> as the latter set a fairly new xid as xmin whereas GetOldestXmin
>returns
>> the actual current xmin horizon. Thats preferrable because it allows
>us
>> to start up more quickly. snapbuild.c can only start building a
>snapshot
>> once it has seen a xl_running_xact with oldestRunningXid >=
>> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
>> have been removed.
>
>I'm a bit confused.  Are you talking about the difference between
>RecentGlobalXmin and RecentXmin?  I think GetSnapshotData() updates
>both.

The problem is that at the time GetSnapshotData returns, the xmin horizon
might have gone upwards and tuples required for decoding might get removed
by other backends. That needs to be prevented while holding the procarray
lock exclusively.

Does it make more sense now?

Andres

---
Please excuse the brevity and formatting - I am writing this on my mobile phone.



Re: logical decoding - GetOldestXmin

From
Robert Haas
Date:
On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
<andres@anarazel.de> wrote:
> The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples required
> for decoding might get removed by other backends. That needs to be prevented while holding the procarray lock
> exclusively.

Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
because GetSnapshotData() also updates proc->xmin.  If you're trying
to store a different value in that field then of course it matters.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding - GetOldestXmin

From
Andres Freund
Date:
On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
> On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
> <andres@anarazel.de> wrote:
> > The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples
> > required for decoding might get removed by other backends. That needs to be prevented while holding the procarray
> > lock exclusively.
>
> Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
> because GetSnapshotData() also updates proc->xmin.  If you're trying
> to store a different value in that field then of course it matters.

Absolutely right. I don't want to say there's anything wrong with it
right now. The "problem" for me is that it sets proc->xmin to the newest
value it can while I want/need the oldest valid value...

I will go with adding an already_locked parameter to GetOldestXmin then.

Thanks for the input,

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - GetOldestXmin

From
Robert Haas
Date:
On Tue, Dec 18, 2012 at 7:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
>> On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
>> <andres@anarazel.de> wrote:
>> > The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples
>> > required for decoding might get removed by other backends. That needs to be prevented while holding the procarray
>> > lock exclusively.
>>
>> Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
>> because GetSnapshotData() also updates proc->xmin.  If you're trying
>> to store a different value in that field then of course it matters.
>
> Absolutely right. I don't want to say there's anything wrong with it
> right now. The "problem" for me is that it sets proc->xmin to the newest
> value it can while I want/need the oldest valid value...
>
> I will go with adding a already_locked parameter to GetOldestXmin then.

Or instead of bool already_locked, maybe bool advertise_xmin?   Seems
like that might be more friendly to the abstraction boundaries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Noah Misch
Date:
[Catching up on old threads.]

On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>> PostgreSQL-R
>> No. There must be some way to logically identify the tuple.
> It can be done as selecting on _all_ attributes and updating/deleting  
> just the first matching row
>
> create cursor ...
> select from t ... where t.* = (....)
> fetch one ...
> delete where current of ...
>
> This is on distant (round 3 or 4) roadmap for this work, just was  
> interested
> if you had found any better way of doing this :)

That only works if every attribute's type has a notion of equality ("xml" does
not).  The equality operator may have a name other than "=", and an operator
named "=" may exist with semantics other than equality ("box" is affected).
Code attempting this replication strategy should select an equality operator
the way typcache.c does so.
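
A sketch of that selection using existing typcache facilities (the
wrapper name is made up):

#include "postgres.h"
#include "utils/typcache.h"

static Oid
equality_operator_for_type(Oid typid)
{
    TypeCacheEntry *typentry;

    /* typcache derives this from the type's default opclasses */
    typentry = lookup_type_cache(typid, TYPECACHE_EQ_OPR);
    if (!OidIsValid(typentry->eq_opr))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("could not identify an equality operator for type %u",
                        typid)));
    return typentry->eq_opr;
}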



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 01/13/2013 12:28 AM, Noah Misch wrote:
> [Catching up on old threads.]
>
> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>>> PostgreSQL-R
>>> No. There must be some way to logically identify the tuple.
>> It can be done as selecting on _all_ attributes and updating/deleting
>> just the first matching row
>>
>> create cursor ...
>> select from t ... where t.* = (....)
>> fetch one ...
>> delete where current of ...
>>
>> This is on distant (round 3 or 4) roadmap for this work, just was
>> interested
>> if you had found any better way of doing this :)
> That only works if every attribute's type has a notion of equality ("xml" does
> not).  The equality operator may have a name other than "=", and an operator
> named "=" may exist with semantics other than equality ("box" is affected).
> Code attempting this replication strategy should select an equality operator
> the way typcache.c does so.
Does this hint that PostgreSQL also needs a sameness operator
("is" or "===" in some languages)?

Or does "IS NOT DISTINCT FROM" already work even for types without
a comparison operator?

--------------
Hannu





Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 01/13/2013 10:49 AM, Hannu Krosing wrote:
> On 01/13/2013 12:28 AM, Noah Misch wrote:
>> [Catching up on old threads.]
>>
>> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>>> Is it possible to replicate UPDATEs and DELETEs without a primary 
>>>>> key in
>>>>> PostgreSQL-R
>>>> No. There must be some way to logically identify the tuple.
>>> It can be done as selecting on _all_ attributes and updating/deleting
>>> just the first matching row
>>>
>>> create cursor ...
>>> select from t ... where t.* = (....)
>>> fetch one ...
>>> delete where current of ...
>>>
>>> This is on distant (round 3 or 4) roadmap for this work, just was
>>> interested
>>> if you had found any better way of doing this :)
>> That only works if every attribute's type has a notion of equality 
>> ("xml" does
>> not).  The equality operator may have a name other than "=", and an 
>> operator
>> named "=" may exist with semantics other than equality ("box" is 
>> affected).
>> Code attempting this replication strategy should select an equality 
>> operator
>> the way typcache.c does so.
> Does this hint that postgreSQL also needs an sameness operator
> ( "is" or "===" in same languages).
>
> Or does "IS NOT DISTINCT FROM" already work even for types without
> comparison operator ?
Just checked - it does not; it still looks for an "=" operator, so it is
just equality-with-NULLs.

How do people feel about adding a real sameness operator?

Hannu


>
> --------------
> Hannu
>
>




Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Tom Lane
Date:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> How do people feel about adding a real sameness operator ?

Just begs the question of "what's sameness?"

In many places we consider a datatype's default btree equality operator
to define sameness, but not all types provide a btree opclass (in
particular, anything that hasn't got a sensible one-dimensional sort
order will not).  And some do but it doesn't represent anything that
anyone would want to consider "sameness" --- IIRC, some of the geometric
types provide btree opclasses that sort by area.  Even for apparently
simple types like float8 there are interesting questions like whether
minus zero is the same as plus zero.

The messiness here is not just due to lack of a notation.

        regards, tom lane



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Markus Wanner
Date:
On 01/13/2013 12:30 PM, Hannu Krosing wrote:
> On 01/13/2013 10:49 AM, Hannu Krosing wrote:
>> Does this hint that postgreSQL also needs an sameness operator
>> ( "is" or "===" in same languages).
> 
> How do people feel about adding a real sameness operator ?

We'd need to define what "sameness" means. If this goes toward "exact
match in binary representation", this gets a thumbs-up from me.

As a first step in that direction, I'd see adjusting send() and recv()
functions to use a portable binary format. A "sameness" operator could
then be implemented by simply comparing two values' send() outputs.
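
A sketch of such an operator's guts under that definition (the function
name is made up; note it inherits send()'s current lack of portability,
which is the point above):

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

static bool
datums_binary_same(Oid typid, Datum a, Datum b)
{
    Oid         typsend;
    bool        typisvarlena;
    bytea      *abin;
    bytea      *bbin;

    /* look up the type's send function and run both values through it */
    getTypeBinaryOutputInfo(typid, &typsend, &typisvarlena);
    abin = OidSendFunctionCall(typsend, a);
    bbin = OidSendFunctionCall(typsend, b);

    /* bitwise-identical send() output is our definition of "same" */
    return VARSIZE(abin) == VARSIZE(bbin) &&
        memcmp(VARDATA(abin), VARDATA(bbin),
               VARSIZE(abin) - VARHDRSZ) == 0;
}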

Regards

Markus Wanner



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Andres Freund
Date:
On 2013-01-13 12:44:44 -0500, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > How do people feel about adding a real sameness operator ?
> 
> Just begs the question of "what's sameness?"
> 
> In many places we consider a datatype's default btree equality operator
> to define sameness, but not all types provide a btree opclass (in
> particular, anything that hasn't got a sensible one-dimensional sort
> order will not).  And some do but it doesn't represent anything that
> anyone would want to consider "sameness" --- IIRC, some of the geometric
> types provide btree opclasses that sort by area.  Even for apparently
> simple types like float8 there are interesting questions like whether
> minus zero is the same as plus zero.
> 
> The messiness here is not just due to lack of a notation.

FWIW *I* (but others might) don't plan to support that case for now, it
just seems to be too messy for far too little benefit.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 01/13/2013 06:02 PM, Markus Wanner wrote:
> On 01/13/2013 12:30 PM, Hannu Krosing wrote:
>> On 01/13/2013 10:49 AM, Hannu Krosing wrote:
>>> Does this hint that postgreSQL also needs an sameness operator
>>> ( "is" or "===" in same languages).
>> How do people feel about adding a real sameness operator ?
> We'd need to define what "sameness" means. If this goes toward "exact
> match in binary representation", this gets a thumbs-up from me.
>
> As a first step in that direction, I'd see adjusting send() and recv()
> functions to use a portable binary format. A "sameness" operator could
> then be implemented by simply comparing two value's send() outputs.
This seems like a good definition of "sameness" to me - if the binary
images are bitwise the same, then the values are the same. And if
both are fields of the same type and NULLs, then these are also "same".

And defining a cross-platform binary format is also a good direction
of movement in implementing this.

I'd just start with what send() and recv() on each type produce
now using GCC on 64bit Intel and move towards adjusting others
to match. For a period anything else would still be allowed, but
be "non-standard".

I have no strong opinion on typed NULLs, though I'd like them
to also be "the same" for the sake of simplicity.

As this would be non-standard anyway, I'd make a row of all NULLs NOT
"be the same" as NULL.

This would be much easier to explain than losing the "IS NULL"-ness at
nesting level 3 ;)

Hannu

>
> Regards
>
> Markus Wanner
>
>




Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Dimitri Fontaine
Date:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
>> Does this hint that postgreSQL also needs an sameness operator
>> ( "is" or "===" in same languages).
>
> How do people feel about adding a real sameness operator ?

Well, I would prefer it if we can bypass the need for it.

Then: do we need the full range of eq, eql, equal and equalp predicates,
and would all of them allow overriding, or just some?
 http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node74.html

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 01/13/2013 12:28 AM, Noah Misch wrote:
> [Catching up on old threads.]
>
> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>>> PostgreSQL-R
>>> No. There must be some way to logically identify the tuple.
>> It can be done as selecting on _all_ attributes and updating/deleting
>> just the first matching row
>>
>> create cursor ...
>> select from t ... where t.* = (....)
>> fetch one ...
>> delete where current of ...
>>
>> This is on distant (round 3 or 4) roadmap for this work, just was
>> interested
>> if you had found any better way of doing this :)
> That only works if every attribute's type has a notion of equality ("xml" does
> not).  The equality operator may have a name other than "=", and an operator
> named "=" may exist with semantics other than equality ("box" is affected).
> Code attempting this replication strategy should select an equality operator
> the way typcache.c does so.
A method for making this work as PostgreSQL works now would be to
compare "textual representations" of tuples:

BEGIN;
DECLARE c CURSOR FOR
    SELECT * FROM t WHERE t::text = '(<image of original row>)';
FETCH 1 FROM c;
DELETE FROM t WHERE CURRENT OF c;
COMMIT;

But of course having an operator for "sameness" without needing to
convert to text would be better.

----------------
Hannu





Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Hannu Krosing
Date:
On 01/13/2013 08:06 PM, Dimitri Fontaine wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
>>> Does this hint that postgreSQL also needs an sameness operator
>>> ( "is" or "===" in same languages).
>> How do people feel about adding a real sameness operator ?
> Well. I would prefer it if we can bypass the need for it.
What is actually sufficient for the current problem is sameness
which compares the outputs of type output functions and also
considers NULLs to be the same.

The reason for not providing equality for xml was not that two xml
files which compare equal as text could be considered unequal in
any sense, but that there are other textual representations
of the same xml which could also be considered equal, like different
whitespace between a tag and an attribute.
> Then Do we need the full range of eq, eql, equal and equalp predicates,
> and would all of them allow overriding or just some?
I consider sameness as basic a thing as IS NULL, so the sameness
should not be overridable. Extending IS NOT DISTINCT FROM to
do this comparison instead of the current '=' seems reasonable.

That is

SELECT '<tag/>'::xml IS DISTINCT FROM '<tag />'::xml

should return TRUE as long as the internal representations of the
two differ, even after you add an equality operator to xml
which compares some canonical form of xml and thus would make

SELECT '<tag/>'::xml = '<tag />'::xml ;

be TRUE.


Regards,
Hannu

>
>    http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node74.html
>
> Regards,




Re: logical changeset generation v3 - comparison to Postgres-R change set format

From
Markus Wanner
Date:
On 01/13/2013 09:04 PM, Hannu Krosing wrote:
> I'd just start with what send() and recv() on each type produces
> now using GCC on 64bit Intel and move towards adjusting others
> to match. For a period anything else would still be allowed, but
> be "non-standard"

Intel being little endian seems like a bad choice to me, given that
send/recv kind of implies network byte ordering. I'd rather not tie this
to any particular processor architecture at all (at least not solely on
the ground that it's the most common one at the time).

I have no strong opinion on "sameness" of NULLs and could also imagine
having that throw some kind of invalid-operation error, on the grounds
that neither is a value and it's unclear what send() method to use at all.

FWIW, trying to determine the length of a sent NULL gives an interesting
result that I don't currently understand.

> psql (9.2.2)
> Type "help" for help.
> 
> postgres=# SELECT length(int4send(NULL));
>  length 
> --------
>        
> (1 row)
> 
> postgres=# SELECT length(float4send(NULL));
>  length 
> --------
>        
> (1 row)
> 
> postgres=# SELECT length(textsend(NULL));
>  length 
> --------
>        
> (1 row)
>
> postgres=# SELECT length(textsend(NULL) || '\000'::bytea);
>  length 
> --------
>        
> (1 row)


Regards

Markus Wanner