Thread: logical changeset generation v3

logical changeset generation v3

From

andres@anarazel.de (Andres Freund)

Date:

15 November 2012, 00:27:57

Hi,

In response to this you will soon find the 14 patches that currently
implement $subject. I'll go over each one after showing off for a bit:

Start postgres:

Start postgres instance (with pg_hba.conf allowing replication cons):

$ postgres -D ~/tmp/pgdev-lcr \         -c wal_level=logical \         -c max_wal_senders=10 \         -c
max_logical_slots=10\         -c wal_keep_segments=100 \         -c log_line_prefix="[%p %x] "
 

Start the changelog receiver:
$ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v

Generate changes:
$ psql -h /tmp postgres <<EOF

DROP TABLE IF EXISTS replication_example;

CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

-- plain insert
INSERT INTO replication_example(somedata, text) VALUES (1, 1);

-- plain update
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));

-- plain delete
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));

-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
COMMIT;

-- dont write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
ROLLBACK;

-- add a column
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1);
COMMIT;

-- once more outside
INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1);


-- DDL with table rewrite
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (5, 1);
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (5, 2);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 3);
COMMIT;

EOF

And the results printed by llog:

BEGIN 16556826
COMMIT 16556826
BEGIN 16556827
table "replication_example_id_seq": INSERT: sequence_name[name]:replication_example_id_seq last_value[int8]:1
start_value[int8]:1increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1
log_cnt[int8]:0is_cycled[bool]:f is_called[bool]:f
 
COMMIT 16556827
BEGIN 16556828
table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
COMMIT 16556828
BEGIN 16556829
table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
COMMIT 16556829
BEGIN 16556830
table "replication_example": DELETE (pkey): id[int4]:1
COMMIT 16556830
BEGIN 16556833
table "replication_example": INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1
table "replication_example": INSERT: id[int4]:5 somedata[int4]:3 text[varchar]:1 bar[int4]:1
COMMIT 16556833
BEGIN 16556834
table "replication_example": INSERT: id[int4]:6 somedata[int4]:4 text[varchar]:1 bar[int4]:1
COMMIT 16556834
BEGIN 16556835
table "replication_example": INSERT: id[int4]:7 somedata[int4]:5 text[varchar]:1 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:8 somedata[int4]:5 somenum[varchar]:2 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:9 somedata[int4]:5 somenum[int4]:3 bar[int4]:(null)
COMMIT 16556835

As you can see above we can decode WAL in the presence of nearly all
forms of DDL. The plugin that outputted these changes is supposed to be
added to contrib and is fairly small and uncomplicated.

An interesting piece of information might be that in the very
preliminary benchmarking I have done on this even the textual decoding
could keep up with a full tilt pgbench -c16 -j16 -M prepared on my
(somewhat larger) workstation. The wal space overhead was less than 1%
between two freshly initdb'ed clusters, comparing
wal_level=hot_standby with =logical.
With a custom pgbench script I can saturate the decoding to the effect
that it lags a second or so, but once I write out the data in a binary
format it can keep up again.
The biggest overhead is currently the more slowly increasing
Global/RecentXmin, but that can be greatly improved by logging
xl_running_xact's more than just every checkpoint.


A short overview over the patches in this series:

* Add minimal binary heap implementation
Abhijit submitted a nicer version of this, the plan is to rebase ontop
of that once people are happy with the interface.
(unchanged)

* Add support for a generic wal reading facility dubbed XLogReader
There's some discussion about whats the best way to implement this in a
separate CF topic.
(unchanged)

* Add simple xlogdump tool
Very nice for debugging, couldn't have developed this without. Obviously
not a prerequisite for comitting this feature but still pretty worthy.
(quite a bit updated, still bad build infrastructure)

* Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
Relatively simple, somewhat contentious due to some uniqueness
issues. Would very much welcome input from somebody with syscache
experience on this. It was previously suggested to write something like
attoptcache.c for this, but to me that seems to be code-duplication. We
can go that route though.
(unchanged)

* Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
Simple. I don't even think its contentious... Just wasn't needed before.
(unchanged)

* Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs
Just a nice to have thing for debugging, not a prerequisite for the
feature.
(unchanged)

* Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
Uncomplicated and I hope uncontentious.
(new)

*Store the number of subtransactions in xl_running_xacts separately fromtoplevel xids
Increases the size of xl_running_xacts by 4bytes in the worst case,
decreases it in some others. Improves the efficiency of some HS
operations.
Should be ok?
(new)

* Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
Not sure if people will complain about this? Its rather simple due to
the fact that the HeapTupleSatisfiesVisibility wrapper already took a
HeapTupleHeader as parameter.
(new)

* Allow walsender's to connect to a specific database
This has been requested by others. I think we need to work on the
external interface a bit, should be ok otherwise.
(new)

* Introduce wal decoding via catalog timetravel
This is the meat of the feature. I think this is going in a good
direction, still needs some work, but architectural review can really
start now. (more later)
(heavily changed)

* Add a simple decoding module in contrib named 'test_decoding'
The much requested example contrib module.
(new)

* Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
Debugging tool to receive changes and write them to a file. Needs some
more options and probably shouldn't live inside pg_basebackup's
directory.
(new)

* design document v2.3 and snapshot building design doc v0.2
(unchanged)


There remains quite a bit to be done but I think the state of the patch
has improved quite a bit. The biggest thing now is to get input about
the user facing parts so we can get some aggreement there.

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside.
* more frequent xl_running_xid's so xmin can be upped more frequently

Please comment!

Happy and tired,

Andres

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

[PATCH 01/14] Add minimal binary heap implementation

From

Andres Freund

Date:

15 November 2012, 01:17:28

Will be replaces by the "binaryheap.[ch]" from Abhijit once its been reviewed.
---
 src/backend/lib/Makefile     |   3 +-
 src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/simpleheap.h |  91 +++++++++++++++
 3 files changed, 348 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/simpleheap.c
 create mode 100644 src/include/lib/simpleheap.h

Attachment

0001-Add-minimal-binary-heap-implementation.patch

[PATCH 04/14] Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)

From

Andres Freund

Date:

15 November 2012, 01:17:28

This cache is somewhat problematic because formally indexes used by syscaches
needs to be unique, this one is not. This is "just" because of 0/InvalidOids
stored in pg_class.relfilenode for nailed/shared catalog relations.  The
syscache will never be queried for InvalidOid relfilenodes however so it seems
to be safe even if it violates the rules somewhat.

It might be nicer to add infrastructure to do this properly, like using a
partial index, its not clear what the best way to do this is though.

Needs a CATVERSION bump.
---
 src/backend/utils/cache/syscache.c | 11 +++++++++++
 src/include/catalog/indexing.h     |  2 ++
 src/include/catalog/pg_proc.h      |  1 +
 src/include/utils/syscache.h       |  1 +
 4 files changed, 15 insertions(+)

Attachment

0004-Add-a-new-RELFILENODE-syscache-to-fetch-a-pg_class-e.patch

[PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From

Andres Freund

Date:

15 November 2012, 01:17:34

---
 src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
 src/include/catalog/indexing.h      |  4 +--
 src/include/utils/relmapper.h       |  2 ++
 3 files changed, 57 insertions(+), 2 deletions(-)

Attachment

0005-Add-a-new-relmapper.c-function-RelationMapFilenodeTo.patch

[PATCH 03/14] Add simple xlogdump tool

From

Andres Freund

Date:

15 November 2012, 01:17:39

---
 src/bin/Makefile            |   2 +-
 src/bin/xlogdump/Makefile   |  25 +++
 src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 494 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/xlogdump/Makefile
 create mode 100644 src/bin/xlogdump/xlogdump.c

Attachment

0003-Add-simple-xlogdump-tool.patch

[PATCH 06/14] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

From

Andres Freund

Date:

15 November 2012, 01:17:48

This requires the previously added RELFILENODE syscache and the added
RelationMapFilenodeToOid function added in previous commits.
---
 doc/src/sgml/func.sgml         | 23 +++++++++++-
 src/backend/utils/adt/dbsize.c | 79 ++++++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 4 files changed, 104 insertions(+), 1 deletion(-)

Attachment

0006-Add-a-new-function-pg_relation_by_filenode-to-lookup.patch

[PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

From

Andres Freund

Date:

15 November 2012, 01:20:40

To avoid complicating logic we store both, the toplevel and the subxids, in
->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.
Also skip logging any subxids if the snapshot is suboverflowed, they aren't
useful in that case anyway.

This allows to make some operations cheaper and it allows faster startup for
the future logical decoding feature because that doesn't care about
subtransactions/suboverflow'edness.
---
 src/backend/access/transam/xlog.c   |  2 ++
 src/backend/storage/ipc/procarray.c | 65 ++++++++++++++++++++++++-------------
 src/backend/storage/ipc/standby.c   |  8 +++--
 src/include/storage/standby.h       |  2 ++
 4 files changed, 52 insertions(+), 25 deletions(-)

Attachment

0008-Store-the-number-of-subtransactions-in-xl_running_xa.patch

[PATCH 11/14] Introduce wal decoding via catalog timetravel

From

Andres Freund

Date:

15 November 2012, 01:20:52

This introduces several things:
* 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
* 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
* logging more data into wal to facilitate logical decoding
* wal decoding into an reorderbuffer
* shared library output plugins with 5 callbacks
 * init
 * begin
 * change
 * commit
* walsender infrastructur to stream out changes and to keep the global xmin low enough
 * INIT_LOGICAL_REPLICATION $plugin; waits till a consistent snapshot is built and returns
   * initial LSN
   * replication slot identifier
   * id of a pg_export() style snapshot
 * START_LOGICAL_REPLICATION $id $lsn; streams out changes
 * uses named output plugins for output specification

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning
  transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after
  INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the
  outside.
* more frequent xl_running_xid's so xmin can be upped more frequently
* add STOP_LOGICAL_REPLICATION $id
---
 src/backend/access/heap/heapam.c                |  280 +++++-
 src/backend/access/transam/xlog.c               |    1 +
 src/backend/catalog/index.c                     |   74 ++
 src/backend/replication/Makefile                |    2 +
 src/backend/replication/logical/Makefile        |   19 +
 src/backend/replication/logical/decode.c        |  496 ++++++++++
 src/backend/replication/logical/logicalfuncs.c  |  247 +++++
 src/backend/replication/logical/reorderbuffer.c | 1156 +++++++++++++++++++++++
 src/backend/replication/logical/snapbuild.c     | 1144 ++++++++++++++++++++++
 src/backend/replication/repl_gram.y             |   32 +-
 src/backend/replication/repl_scanner.l          |    2 +
 src/backend/replication/walsender.c             |  566 ++++++++++-
 src/backend/storage/ipc/procarray.c             |   23 +
 src/backend/storage/ipc/standby.c               |    8 +-
 src/backend/utils/cache/inval.c                 |    2 +-
 src/backend/utils/cache/relcache.c              |    3 +-
 src/backend/utils/misc/guc.c                    |   11 +
 src/backend/utils/time/tqual.c                  |  249 +++++
 src/bin/pg_controldata/pg_controldata.c         |    2 +
 src/include/access/heapam_xlog.h                |   23 +
 src/include/access/transam.h                    |    5 +
 src/include/access/xlog.h                       |    3 +-
 src/include/catalog/index.h                     |    4 +
 src/include/nodes/nodes.h                       |    2 +
 src/include/nodes/replnodes.h                   |   22 +
 src/include/replication/decode.h                |   21 +
 src/include/replication/logicalfuncs.h          |   44 +
 src/include/replication/output_plugin.h         |   76 ++
 src/include/replication/reorderbuffer.h         |  284 ++++++
 src/include/replication/snapbuild.h             |  128 +++
 src/include/replication/walsender.h             |    1 +
 src/include/replication/walsender_private.h     |   34 +-
 src/include/storage/itemptr.h                   |    3 +
 src/include/storage/sinval.h                    |    2 +
 src/include/utils/tqual.h                       |   31 +-
 35 files changed, 4966 insertions(+), 34 deletions(-)
 create mode 100644 src/backend/replication/logical/Makefile
 create mode 100644 src/backend/replication/logical/decode.c
 create mode 100644 src/backend/replication/logical/logicalfuncs.c
 create mode 100644 src/backend/replication/logical/reorderbuffer.c
 create mode 100644 src/backend/replication/logical/snapbuild.c
 create mode 100644 src/include/replication/decode.h
 create mode 100644 src/include/replication/logicalfuncs.h
 create mode 100644 src/include/replication/output_plugin.h
 create mode 100644 src/include/replication/reorderbuffer.h
 create mode 100644 src/include/replication/snapbuild.h

Attachment

0011-Introduce-wal-decoding-via-catalog-timetravel.patch

[PATCH 14/14] design document v2.3 and snapshot building design doc v0.2

From

Andres Freund

Date:

15 November 2012, 01:21:11

---
 src/backend/replication/logical/DESIGN.txt         | 603 +++++++++++++++++++++
 src/backend/replication/logical/Makefile           |   6 +
 .../replication/logical/README.SNAPBUILD.txt       | 298 ++++++++++
 3 files changed, 907 insertions(+)
 create mode 100644 src/backend/replication/logical/DESIGN.txt
 create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt

Attachment

0014-design-document-v2.3-and-snapshot-building-design-do.patch

[PATCH 07/14] Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement

From

Andres Freund

Date:

15 November 2012, 01:23:53

This is useful to be able to represent a CommandId thats invalid. There was no
such value before.

This decreases the possible number of subtransactions by one which seems
unproblematic. Its also not a problem for pg_upgrade because cmin/cmax are
never looked at outside the context of their own transaction (spare timetravel
access, but thats new anyway).
---
 src/backend/access/transam/xact.c | 4 ++--
 src/include/c.h                   | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

Attachment

0007-Introduce-InvalidCommandId-and-declare-that-to-be-th.patch

[PATCH 10/14] Allow walsender's to connect to a specific database

From

Andres Freund

Date:

15 November 2012, 01:25:01

Currently the decision whether to connect to a database or not is made by
checking whether the passed "dbname" parameter is "replication". Unfortunately
this makes it impossible to connect a to a database named replication...

This is useful for future walsender commands which need database interaction.
---
 src/backend/postmaster/postmaster.c                |  7 ++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++--
 src/backend/replication/walsender.c                | 27 ++++++++++++++++++----
 src/backend/utils/init/postinit.c                  |  5 ++++
 src/bin/pg_basebackup/pg_basebackup.c              |  4 ++--
 src/bin/pg_basebackup/pg_receivexlog.c             |  4 ++--
 src/bin/pg_basebackup/receivelog.c                 |  4 ++--
 7 files changed, 41 insertions(+), 14 deletions(-)

Attachment

0010-Allow-walsender-s-to-connect-to-a-specific-database.patch

[PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader

From

Andres Freund

Date:

15 November 2012, 01:25:58

For the regular satisfies routines this is needed in prepareation of logical
decoding. I changed the non-regular ones for consistency as well.

The naming between htup, tuple and similar is rather confused, I could not find
any consistent naming anywhere.

This is preparatory work for the logical decoding feature which needs to be
able to get to a valid relfilenode from when checking the visibility of a
tuple.
---
 contrib/pgrowlocks/pgrowlocks.c      |  2 +-
 src/backend/access/heap/heapam.c     | 13 ++++++----
 src/backend/access/heap/pruneheap.c  | 16 ++++++++++--
 src/backend/catalog/index.c          |  2 +-
 src/backend/commands/analyze.c       |  3 ++-
 src/backend/commands/cluster.c       |  2 +-
 src/backend/commands/vacuumlazy.c    |  3 ++-
 src/backend/storage/lmgr/predicate.c |  2 +-
 src/backend/utils/time/tqual.c       | 50 +++++++++++++++++++++++++++++-------
 src/include/utils/snapshot.h         |  4 +--
 src/include/utils/tqual.h            | 20 +++++++--------
 11 files changed, 83 insertions(+), 34 deletions(-)

Attachment

0009-Adjust-all-Satisfies-routines-to-take-a-HeapTuple-in.patch

[PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes

From

Andres Freund

Date:

15 November 2012, 01:26:18

---
 src/bin/pg_basebackup/Makefile         |   7 +-
 src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
 src/bin/pg_basebackup/streamutil.c     |   3 +-
 src/bin/pg_basebackup/streamutil.h     |   1 +
 4 files changed, 725 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_basebackup/pg_receivellog.c

Attachment

0013-Introduce-pg_receivellog-the-pg_receivexlog-equivale.patch

[PATCH 12/14] Add a simple decoding module in contrib named 'test_decoding'

From

Andres Freund

Date:

15 November 2012, 01:27:08

---
 contrib/Makefile                      |   1 +
 contrib/test_decoding/Makefile        |  16 +++
 contrib/test_decoding/test_decoding.c | 192 ++++++++++++++++++++++++++++++++++
 3 files changed, 209 insertions(+)
 create mode 100644 contrib/test_decoding/Makefile
 create mode 100644 contrib/test_decoding/test_decoding.c

Attachment

0012-Add-a-simple-decoding-module-in-contrib-named-test_d.patch

[PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Andres Freund

Date:

15 November 2012, 01:29:58

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
 src/backend/access/transam/Makefile     |    2 +-
 src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  264 ++++++++
 3 files changed, 1297 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h

Attachment

0002-Add-support-for-a-generic-wal-reading-facility-dubbe.patch

Re: logical changeset generation v3 - git repository

From

Andres Freund

Date:

15 November 2012, 01:31:47

On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
> In response to this you will soon find the 14 patches that currently
> implement $subject.

As its not very wieldly to send around that many/big patches all the
time, until the next "major" version I will just update the git tree at:

Web:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3

Git:
git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: lcr - walsender integration

From

Andres Freund

Date:

15 November 2012, 01:42:54

Hi,

The current logical walsender integration looks like the following:

=# INIT_LOGICAL_REPLICATION 'text';
WARNING:  Initiating logical rep
WARNING:  reached consistent point, stopping!replication_id | consistent_point | snapshot_name | plugin
----------------+------------------+---------------+--------id-2           | 3/CACBDF98       | 0xDEADBEEF    | text
(1 row)

=# START_LOGICAL_REPLICATION 'id-2' 3/CACBDF98;
...


So the current protocol is:
INIT_LOGICAL_REPLICATION '$plugin';
returns
* slot
* first consistent point
* snapshot id

START_LOGICAL_REPLICATION '$slot' $last_received_lsn;

streams changes, each wrapped in a 'w' message with (start, end) set to
the same value. The content of the data is completely free-format and
only depends on the output plugin.

Feedback is provided from the client via the normal 'r' messages.

I think thats not a bad start, but we probably can improve it a bit:

INIT_LOGICAL_REPLICATION '$slot' '$plugin' ($value = $key, ...);
START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
STOP_LOGICAL_REPLICATION '$slot';

The option to INIT_LOGICAL_REPLICATION would then get passed to the
'pg_decode_init' output plugin function (i.e. a function of that name
would get dlsym()'ed using the pg infrastructure for that).

Does that look good to you? Any suggestions?

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Josh Berkus

Date:

15 November 2012, 02:08:57

On 11/14/12 4:27 PM, Andres Freund wrote:
> Hi,
> 
> In response to this you will soon find the 14 patches that currently
> implement $subject. I'll go over each one after showing off for a bit:

Lemme be the first to say, "wow".  Impressive work.

Now the debugging starts ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: logical changeset generation v3

From

Michael Paquier

Date:

15 November 2012, 04:08:30

Looks like cool stuff @-@<br />I might be interested in looking at that a bit as I think I will hopefully be hopefully
beable to grab some time in the next couple of weeks.<br />Are some of those patches already submitted to a CF?<br />
--<br />Michael Paquier<br /><a href="http://michael.otacoo.com" target="_blank">http://michael.otacoo.com</a><br />

Re: logical changeset generation v3

From

Andres Freund

Date:

15 November 2012, 09:34:52

Hi,

On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:
> Looks like cool stuff @-@
> I might be interested in looking at that a bit as I think I will hopefully
> be hopefully be able to grab some time in the next couple of weeks.
> Are some of those patches already submitted to a CF?

I added the patchset as one entry to the CF this time, it seems to me they are 
too hard to judge individually to make them really separately reviewable.

I can split it off there, but really all the complicated stuff is in one patch 
anyway...

Greetings,

Andres

Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

From

Simon Riggs

Date:

15 November 2012, 14:06:08

On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:

> To avoid complicating logic we store both, the toplevel and the subxids, in
> ->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.

That looks good, not much change. Will apply in next few days. Please
add me as committer and mark ready.

> Also skip logging any subxids if the snapshot is suboverflowed, they aren't
> useful in that case anyway.

> This allows to make some operations cheaper and it allows faster startup for
> the future logical decoding feature because that doesn't care about
> subtransactions/suboverflow'edness.

...but please don't add extra touches of Andres magic along the way.
Doing that will just slow down patch acceptance and its not important.
I suggest to keep note of things like that and come back to them
later.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

From

Andres Freund

Date:

15 November 2012, 14:06:21

On 2012-11-15 09:07:23 -0300, Simon Riggs wrote:
> On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > To avoid complicating logic we store both, the toplevel and the subxids, in
> > ->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.
>
> That looks good, not much change. Will apply in next few days. Please
> add me as committer and mark ready.

Cool. Will do.

> > Also skip logging any subxids if the snapshot is suboverflowed, they aren't
> > useful in that case anyway.
>
> > This allows to make some operations cheaper and it allows faster startup for
> > the future logical decoding feature because that doesn't care about
> > subtransactions/suboverflow'edness.
>
> ...but please don't add extra touches of Andres magic along the way.
> Doing that will just slow down patch acceptance and its not important.
> I suggest to keep note of things like that and come back to them
> later.

Which magic are you talking about?

Only two parts changed in comparison to the previous situation. One is
that the following in ProcArrayApplyRecoveryInfo only applies to
toplevel transactions by virtue of ->xcnt now only containing the
toplevel transaction count:
> +       /*
> +        * Remove stale locks, if any.
> +        *
> +        * Locks are always assigned to the toplevel xid so we don't
> need to care
> +        * about subxcnt/subxids (and by extension not about
> ->suboverflowed).
> +        */
>         StandbyReleaseOldLocks(running->xcnt, running->xids);

Note that there was no code change, just a change in meaning.

The other part is:
> +       /*
> +        * Spin over procArray collecting all subxids, but only if there hasn't
> +        * been a suboverflow.
> +        */
> +       if (!suboverflowed)

Well, thats something that basically had to be decided either way when
writing the patch...

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Heikki Linnakangas

Date:

15 November 2012, 14:23:04

On 15.11.2012 03:17, Andres Freund wrote:
>
> Features:
> - streaming reading/writing
> - filtering
> - reassembly of records
>
> Reusing the ReadRecord infrastructure in situations where the code that wants
> to do so is not tightly integrated into xlog.c is rather hard and would require
> changes to rather integral parts of the recovery code which doesn't seem to be
> a good idea.
>
> Missing:
> - "compressing" the stream when removing uninteresting records
> - writing out correct CRCs
> - separating reader/writer

I'm disappointed to see that there has been no progress on this patch 
since last commitfest. I thought we agreed on the approach I championed 
for here: 
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There 
wasn't much work left to finish that, I believe.

Are you going to continue working on this?

- Heikki

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Andres Freund

Date:

15 November 2012, 14:48:40

On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> >Features:
> >- streaming reading/writing
> >- filtering
> >- reassembly of records
> >
> >Reusing the ReadRecord infrastructure in situations where the code that wants
> >to do so is not tightly integrated into xlog.c is rather hard and would require
> >changes to rather integral parts of the recovery code which doesn't seem to be
> >a good idea.
> >
> >Missing:
> >- "compressing" the stream when removing uninteresting records
> >- writing out correct CRCs
> >- separating reader/writer
>
> I'm disappointed to see that there has been no progress on this patch since
> last commitfest. I thought we agreed on the approach I championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
> wasn't much work left to finish that, I believe.

While I still think my approach is superior at this point I have
accepted that I haven't convinced anybody of that. I plan to port over
what I have submitted to Alvaro's version of your patch.

I have actually started that but I simply couldn't finish it in
time. The approach for porting I took didn't work all that well and I
plan to restart doing that after doing some review work.

> Are you going to continue working on this?

"this" being my version of XlogReader? No. The patch above is unchanged
except some very minor rebasing to recent wal changes by Tom. The reason
its included in the series is simply that I haven't gotten rid of it yet
and the subsequent patches needed it. I do plan to continue working on a
rebased xlogdump version if nobody beats me to it (please do beat me!).

Ok?

The cover letter said:

> * Add support for a generic wal reading facility dubbed XLogReader
> There's some discussion about whats the best way to implement this in a
> separate CF topic.
> (unchanged)

I should have folded that in into the patch description, sorry.

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Alvaro Herrera

Date:

15 November 2012, 14:51:04

Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> >Features:
> >- streaming reading/writing
> >- filtering
> >- reassembly of records
> >
> >Reusing the ReadRecord infrastructure in situations where the code that wants
> >to do so is not tightly integrated into xlog.c is rather hard and would require
> >changes to rather integral parts of the recovery code which doesn't seem to be
> >a good idea.
> >
> >Missing:
> >- "compressing" the stream when removing uninteresting records
> >- writing out correct CRCs
> >- separating reader/writer
>
> I'm disappointed to see that there has been no progress on this
> patch since last commitfest. I thought we agreed on the approach I
> championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> There wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

I worked a bit more on that patch of yours, but I neglected to submit
it.  Did you have something in particular that you wanted changed in it?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Andres Freund

Date:

15 November 2012, 15:12:27

On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> > On 15.11.2012 03:17, Andres Freund wrote:
> > >
> > >Features:
> > >- streaming reading/writing
> > >- filtering
> > >- reassembly of records
> > >
> > >Reusing the ReadRecord infrastructure in situations where the code that wants
> > >to do so is not tightly integrated into xlog.c is rather hard and would require
> > >changes to rather integral parts of the recovery code which doesn't seem to be
> > >a good idea.
> > >
> > >Missing:
> > >- "compressing" the stream when removing uninteresting records
> > >- writing out correct CRCs
> > >- separating reader/writer
> >
> > I'm disappointed to see that there has been no progress on this
> > patch since last commitfest. I thought we agreed on the approach I
> > championed for here:
> > http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> > There wasn't much work left to finish that, I believe.
> >
> > Are you going to continue working on this?
>
> I worked a bit more on that patch of yours, but I neglected to submit
> it.  Did you have something in particular that you wanted changed in it?

Could you push your newest version to your git repository or similar?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Alvaro Herrera

Date:

15 November 2012, 15:19:19

Andres Freund wrote:
> On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:

> > I worked a bit more on that patch of yours, but I neglected to submit
> > it.  Did you have something in particular that you wanted changed in it?
>
> Could you push your newest version to your git repository or similar?

Sadly, I cannot, because I had it on my laptop only and its screen died
this morning (well, actually it doesn't boot at all, so I can't use the
external screen either).  I'm trying to get it fixed as soon as possible
but obviously I have no idea when I will be able to get it back.  Most
likely I will have to go out and buy a 2.5" drive enclosure to get the
valuable stuff out of it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Heikki Linnakangas

Date:

15 November 2012, 15:55:45

On 15.11.2012 16:50, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 15.11.2012 03:17, Andres Freund wrote:
>>>
>>> Features:
>>> - streaming reading/writing
>>> - filtering
>>> - reassembly of records
>>>
>>> Reusing the ReadRecord infrastructure in situations where the code that wants
>>> to do so is not tightly integrated into xlog.c is rather hard and would require
>>> changes to rather integral parts of the recovery code which doesn't seem to be
>>> a good idea.
>>>
>>> Missing:
>>> - "compressing" the stream when removing uninteresting records
>>> - writing out correct CRCs
>>> - separating reader/writer
>>
>> I'm disappointed to see that there has been no progress on this
>> patch since last commitfest. I thought we agreed on the approach I
>> championed for here:
>> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
>> There wasn't much work left to finish that, I believe.
>>
>> Are you going to continue working on this?
>
> I worked a bit more on that patch of yours, but I neglected to submit
> it.  Did you have something in particular that you wanted changed in it?

Off the top of my head, there were a two open items with the patch as I 
submitted it:

1. Need to make sure it's easy to compile outside backend code. So that 
it's suitable for using in an xlogdump contrib module, for example.

2. do something about error reporting. In particular, xlogreader.c 
should not call emode_for_corrupt_record(), but we need to provide for 
that functionlity somehow. I think I'd prefer xlogreader.c to not 
ereport() on a corrupt record. Instead, it would return an error string 
to the caller, which could then decide what to do with it. Translating 
the messages needs some thought, though.

Other than those, and cleanup of any obsoleted comments etc. and adding 
docs, I think it was good to go.

- Heikki

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Alvaro Herrera

Date:

15 November 2012, 16:23:41

Heikki Linnakangas wrote:
> On 15.11.2012 16:50, Alvaro Herrera wrote:

> >I worked a bit more on that patch of yours, but I neglected to submit
> >it.  Did you have something in particular that you wanted changed in it?
>
> Off the top of my head, there were a two open items with the patch
> as I submitted it:
>
> 1. Need to make sure it's easy to compile outside backend code. So
> that it's suitable for using in an xlogdump contrib module, for
> example.
>
> 2. do something about error reporting. In particular, xlogreader.c
> should not call emode_for_corrupt_record(), but we need to provide
> for that functionlity somehow. I think I'd prefer xlogreader.c to
> not ereport() on a corrupt record. Instead, it would return an error
> string to the caller, which could then decide what to do with it.
> Translating the messages needs some thought, though.
>
> Other than those, and cleanup of any obsoleted comments etc. and
> adding docs, I think it was good to go.

Thanks.  I was toying with the idea that xlogreader.c should return a
status code to the caller, and additionally an error string; not all
error cases are equal.

Most of what I did (other than general cleanup) was moving some xlog.c
global vars into a private_data struct for xlogreader.c to pass around;
one problem I had was deciding what to do with curFileTLI and
LastFileTLI (IIRC), because they are used outside of the reader module
(they were examined after recovery finished).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH 03/14] Add simple xlogdump tool

From

Peter Eisentraut

Date:

15 November 2012, 16:31:54

On 11/14/12 8:17 PM, Andres Freund wrote:
> diff --git a/src/bin/Makefile b/src/bin/Makefile
> index b4dfdba..9992f7a 100644
> --- a/src/bin/Makefile
> +++ b/src/bin/Makefile
> @@ -14,7 +14,7 @@ top_builddir = ../..
>  include $(top_builddir)/src/Makefile.global
>  
>  SUBDIRS = initdb pg_ctl pg_dump \
> -    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
> +    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump

should be pg_xlogdump

>  
>  ifeq ($(PORTNAME), win32)
>  SUBDIRS += pgevent
> diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
> new file mode 100644
> index 0000000..d54640a
> --- /dev/null
> +++ b/src/bin/xlogdump/Makefile
> @@ -0,0 +1,25 @@
> +#-------------------------------------------------------------------------
> +#
> +# Makefile for src/bin/xlogdump
> +#
> +# Copyright (c) 1998-2012, PostgreSQL Global Development Group
> +#
> +# src/bin/pg_resetxlog/Makefile

fix that

> +#
> +#-------------------------------------------------------------------------
> +
> +PGFILEDESC = "xlogdump"
> +PGAPPICON=win32
> +
> +subdir = src/bin/xlogdump
> +top_builddir = ../../..
> +include $(top_builddir)/src/Makefile.global
> +
> +OBJS= xlogdump.o \
> +     $(WIN32RES)
> +
> +all: xlogdump
> +
> +
> +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v
/main.o|sed's/^/..\/..\/..\//')
 
> +    $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

This looks pretty evil, and there is no documentation about what it is
supposed to do.

Windows build support needs some thought.


> diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
> new file mode 100644
> index 0000000..0f984e4
> --- /dev/null
> +++ b/src/bin/xlogdump/xlogdump.c
> @@ -0,0 +1,468 @@
> +#include "postgres.h"
> +
> +#include <unistd.h>
> +
> +#include "access/xlogreader.h"
> +#include "access/rmgr.h"
> +#include "miscadmin.h"
> +#include "storage/ipc.h"
> +#include "utils/memutils.h"
> +#include "utils/guc.h"
> +
> +#include "getopt_long.h"
> +
> +/*
> + * needs to be declared because otherwise its defined in main.c which we cannot
> + * link from here.
> + */
> +const char *progname = "xlogdump";

Which may be a reason not to link with main.o.  We generally don't want
to hardcode the program name inside the program.

> +static void
> +usage(void)
> +{
> +    printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
> +           progname);
> +    printf(_("Usage:\n"));
> +    printf(_("  %s [OPTION]...\n"), progname);
> +    printf(_("\nOptions:\n"));
> +    printf(_("  -v, --version          output version information, then exit\n"));
> +    printf(_("  -h, --help             show this help, then exit\n"));
> +    printf(_("  -s, --start            from where recptr onwards to read\n"));
> +    printf(_("  -e, --end              up to which recptr to read\n"));
> +    printf(_("  -t, --timeline         which timeline do we want to read\n"));
> +    printf(_("  -i, --inpath           from where do we want to read? cwd/pg_xlog is the default\n"));
> +    printf(_("  -o, --output           where to write [start, end]\n"));
> +    printf(_("  -f, --file             wal file to parse\n"));
> +}

Options list should be in alphabetic order (or some other less random
order).  Most of these descriptions are not very intelligible (at least
without additional documentation).

> +
> +int main(int argc, char **argv)
> +{
> +    uint32 xlogid;
> +    uint32 xrecoff;
> +    XLogReaderState *xlogreader_state;
> +    XLogDumpPrivateData private;
> +    XLogRecPtr from = InvalidXLogRecPtr;
> +    XLogRecPtr to = InvalidXLogRecPtr;
> +    bool bad_argument = false;
> +
> +    static struct option long_options[] = {
> +        {"help", no_argument, NULL, 'h'},
> +        {"version", no_argument, NULL, 'v'},

Standard letters for help and version are ? and V.

> +        {"start", required_argument, NULL, 's'},
> +        {"end", required_argument, NULL, 'e'},
> +        {"timeline", required_argument, NULL, 't'},
> +        {"inpath", required_argument, NULL, 'i'},
> +        {"outpath", required_argument, NULL, 'o'},
> +        {"file", required_argument, NULL, 'f'},
> +        {NULL, 0, NULL, 0}
> +    };
> +    int            c;
> +    int            option_index;
> +
> +    memset(&private, 0, sizeof(XLogDumpPrivateData));
> +
> +    while ((c = getopt_long(argc, argv, "hvs:e:t:i:o:f:",

This could also be in a less random order.

> +                            long_options, &option_index)) != -1)
> +    {
> +        switch (c)
> +        {
> +            case 'h':
> +                usage();
> +                exit(0);
> +                break;
> +            case 'v':
> +                printf("Version: 0.1\n");
> +                exit(0);
> +                break;

This should be the PostgreSQL version.


also:

no man page

no nls.mk

Re: [PATCH 03/14] Add simple xlogdump tool

From

Andres Freund

Date:

15 November 2012, 16:45:21

On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
> On 11/14/12 8:17 PM, Andres Freund wrote:
> > diff --git a/src/bin/Makefile b/src/bin/Makefile
> > index b4dfdba..9992f7a 100644
> > --- a/src/bin/Makefile
> > +++ b/src/bin/Makefile
> > @@ -14,7 +14,7 @@ top_builddir = ../..
> >  include $(top_builddir)/src/Makefile.global
> >
> >  SUBDIRS = initdb pg_ctl pg_dump \
> > -    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
> > +    psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
>
> should be pg_xlogdump

Good Point.

> >
> >  ifeq ($(PORTNAME), win32)
> >  SUBDIRS += pgevent
> > diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
> > new file mode 100644
> > index 0000000..d54640a
> > --- /dev/null
> > +++ b/src/bin/xlogdump/Makefile
> > @@ -0,0 +1,25 @@
> > +#-------------------------------------------------------------------------
> > +#
> > +# Makefile for src/bin/xlogdump
> > +#
> > +# Copyright (c) 1998-2012, PostgreSQL Global Development Group
> > +#
> > +# src/bin/pg_resetxlog/Makefile
>
> fix that

Dito.

> > +#
> > +#-------------------------------------------------------------------------
> > +
> > +PGFILEDESC = "xlogdump"
> > +PGAPPICON=win32
> > +
> > +subdir = src/bin/xlogdump
> > +top_builddir = ../../..
> > +include $(top_builddir)/src/Makefile.global
> > +
> > +OBJS= xlogdump.o \
> > +     $(WIN32RES)
> > +
> > +all: xlogdump
> > +
> > +
> > +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v
/main.o|sed's/^/..\/..\/..\//')
 
> > +    $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>
> This looks pretty evil, and there is no documentation about what it is
> supposed to do.

There has been some talk about this before and this clearly isn't an
acceptable solution. The previously stated idea was to split of the
_desc routines so we don't need to link with the whole backend.

Alvaro stared to work on that a bit:
http://archives.postgresql.org/message-id/1346268803-sup-9854%40alvh.no-ip.org

(What the above does is simply collect all backend object files, remove
main.o from that list an dlist them as dependencies.)

> Windows build support needs some thought.

I don't have the slightest clue how the windows build environment works,
is there still a problem if we only link to a very selected list of
backend object files? Or do we need to link them to some external
location?

> > diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
> > new file mode 100644
> > index 0000000..0f984e4
> > --- /dev/null
> > +++ b/src/bin/xlogdump/xlogdump.c
> > @@ -0,0 +1,468 @@
> > +#include "postgres.h"
> > +
> > +#include <unistd.h>
> > +
> > +#include "access/xlogreader.h"
> > +#include "access/rmgr.h"
> > +#include "miscadmin.h"
> > +#include "storage/ipc.h"
> > +#include "utils/memutils.h"
> > +#include "utils/guc.h"
> > +
> > +#include "getopt_long.h"
> > +
> > +/*
> > + * needs to be declared because otherwise its defined in main.c which we cannot
> > + * link from here.
> > + */
> > +const char *progname = "xlogdump";
>
> Which may be a reason not to link with main.o.

Well, we're not linking to main.o which causes the problem, but yes,
really fixing this is definitely the goal, but not really possible yet.

> > +static void
> > +usage(void)
> > +{
> > +    printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
> > +           progname);
> > +    printf(_("Usage:\n"));
> > +    printf(_("  %s [OPTION]...\n"), progname);
> > +    printf(_("\nOptions:\n"));
> > +    printf(_("  -v, --version          output version information, then exit\n"));
> > +    printf(_("  -h, --help             show this help, then exit\n"));
> > +    printf(_("  -s, --start            from where recptr onwards to read\n"));
> > +    printf(_("  -e, --end              up to which recptr to read\n"));
> > +    printf(_("  -t, --timeline         which timeline do we want to read\n"));
> > +    printf(_("  -i, --inpath           from where do we want to read? cwd/pg_xlog is the default\n"));
> > +    printf(_("  -o, --output           where to write [start, end]\n"));
> > +    printf(_("  -f, --file             wal file to parse\n"));
> > +}
>
> Options list should be in alphabetic order (or some other less random
> order).  Most of these descriptions are not very intelligible (at
> least without additional documentation).

True, its noticeable that this mostly was a development tool. But it
shouldn't stay that way. There have been several bugreports of late
where a bin/pg_xlogdump would have been very helpful...

> This should be the PostgreSQL version.
>
>
> also:
>
> no man page
>
> no nls.mk

Will try to provide some actually submittable version once the
xlogreader situation is finalized and the _desc routines are splitted...

Thanks!

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 03/14] Add simple xlogdump tool

From

Jeff Janes

Date:

15 November 2012, 17:06:27

On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> ---
>  src/bin/Makefile            |   2 +-
>  src/bin/xlogdump/Makefile   |  25 +++
>  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 494 insertions(+), 1 deletion(-)
>  create mode 100644 src/bin/xlogdump/Makefile
>  create mode 100644 src/bin/xlogdump/xlogdump.c

Is this intended to be the successor of
https://github.com/snaga/xlogdump which will then be deprecated?

Thanks,

Jeff

Re: [PATCH 03/14] Add simple xlogdump tool

From

Andres Freund

Date:

15 November 2012, 17:13:14

On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > ---
> >  src/bin/Makefile            |   2 +-
> >  src/bin/xlogdump/Makefile   |  25 +++
> >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 494 insertions(+), 1 deletion(-)
> >  create mode 100644 src/bin/xlogdump/Makefile
> >  create mode 100644 src/bin/xlogdump/xlogdump.c
>
> Is this intended to be the successor of
> https://github.com/snaga/xlogdump which will then be deprecated?

As-is this is just a development tool which was sorely needed for the
development of this patchset. But yes I think that once ready
(xlogreader infrastructure, *_desc routines splitted) it should
definitely be able to do most of what the above xlogdump can do and it
should live in bin/. I think mostly some filtering is missing.

That doesn't really "deprecate" the above though.

Does that answer your question?

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Alvaro Herrera

Date:

15 November 2012, 21:12:48

Heikki Linnakangas wrote:

> I'm disappointed to see that there has been no progress on this
> patch since last commitfest. I thought we agreed on the approach I
> championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
> There wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

Here's what I have right now.  It's your patch, plus some tweaks such as
changing the timing for allocating readRecordBuf; I also added a struct
to contain XLogReadPage's private data, instead of using global
variables.  (The main conclusion I get from this, is that it's
relatively easy to split out reading of XLog out of xlog.c; there are
some global variables still remaining, but AFAICS that should be
relatively simple to fix).

There is no consensus on the way to handle error reporting.  Tom
suggests having the hypothetical client-side code redefine ereport()
somehow; as far as I can see that means we would have to reimplement
errstart, errfinish, etc.  That doesn't sound all that nice to me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

xlogreader-heikki-2.patch

Re: logical changeset generation v3

From

Michael Paquier

Date:

16 November 2012, 04:34:53

Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature
globally.<br/>Sorry I recall that you put a link somewhere but I cannot remember its email...<br /><br /><div
class="gmail_quote">On Thu, Nov 15, 2012 at 6:34 PM, Andres Freund <span dir="ltr"><<a
href="mailto:andres@anarazel.de"target="_blank">andres@anarazel.de</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Hi,<br /><div class="im"><br
/>On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:<br /> > Looks like cool stuff @-@<br /> > I
mightbe interested in looking at that a bit as I think I will hopefully<br /> > be hopefully be able to grab some
timein the next couple of weeks.<br /> > Are some of those patches already submitted to a CF?<br /><br /></div>I
addedthe patchset as one entry to the CF this time, it seems to me they are<br /> too hard to judge individually to
makethem really separately reviewable.<br /><br /> I can split it off there, but really all the complicated stuff is in
onepatch<br /> anyway...<br /><br /> Greetings,<br /><br /> Andres<br /></blockquote></div><br /><br clear="all" /><br
/>--<br />Michael Paquier<br /><a href="http://michael.otacoo.com" target="_blank">http://michael.otacoo.com</a><br />

Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From

Michael Paquier

Date:

16 November 2012, 04:44:48

Hi,<br /><br />This patch looks OK.<br /><br />I got 3 comments:<br />1) Why changing the OID of
pg_class_tblspc_relfilenode_indexfrom 3171 to 3455? It does not look necessary.<br />2) You should perhaps change the
headerof RelationMapFilenodeToOid so as not mentionning it as the opposite operation of RelationMapOidToFilenode but as
anoperation that looks for the OID of a relation based on its relfilenode. Both functions are opposite but
independent.<br/> 3) Both functions are doing similar operations. Could it be possible to wrap them in the same central
function?<br/><br /><div class="gmail_quote">On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <span dir="ltr"><<a
href="mailto:andres@2ndquadrant.com"target="_blank">andres@2ndquadrant.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">---<br />
 src/backend/utils/cache/relmapper.c| 53 +++++++++++++++++++++++++++++++++++++<br />  src/include/catalog/indexing.h  
  |  4 +--<br />  src/include/utils/relmapper.h       |  2 ++<br />  3 files changed, 57 insertions(+), 2
deletions(-)<br/><br /><br /><br /> --<br /> Sent via pgsql-hackers mailing list (<a
href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your
subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div><br /><br clear="all"
/><br/>-- <br />Michael Paquier<br /><a href="http://michael.otacoo.com"
target="_blank">http://michael.otacoo.com</a><br/>

Re: logical changeset generation v3

From

Andrea Suisani

Date:

16 November 2012, 08:16:40

Il 16/11/2012 05:34, Michael Paquier ha scritto:
> Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature
globally.
> Sorry I recall that you put a link somewhere but I cannot remember its email...

http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php

Andrea

Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From

Andres Freund

Date:

16 November 2012, 10:58:10

Hi,

On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> This patch looks OK.
>
> I got 3 comments:
> 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
> 3455? It does not look necessary.

Its a mismerge and should have happened in "Add a new RELFILENODE
syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but
it seems I squashed the wrong two commits.
I had originally used 3171 but that since got used up for lo_tell64...

> 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> not mentionning it as the opposite operation of RelationMapOidToFilenode
> but as an operation that looks for the OID of a relation based on its
> relfilenode. Both functions are opposite but independent.

I described it as the opposite because RelationMapOidToFilenode is the
relmappers stated goal and RelationMapFilenodeToOid is just some
side-business.

> 3) Both functions are doing similar operations. Could it be possible
> to wrap them in the same central function?

I don't really see how without making both quite a bit more
complicated. The amount of if's needed seems to be too large to me.

Thanks,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Markus Wanner

Date:

16 November 2012, 13:46:47

Andres,

On 11/15/2012 01:27 AM, Andres Freund wrote:
> In response to this you will soon find the 14 patches that currently
> implement $subject.

Congratulations on that piece of work.

I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R. I hope for this comparison to shed some
light on the similarities and differences of the two projects. As the
author of Postgres-R, I'm obviously biased, but I try to be as neutral
as I can.

Let's start with the representation: I so far considered the Postgres-R
change set format to be an implementation detail and I don't intend it
to be readable by humans or third party tools. It's thus binary only and
doesn't offer a textual representation. The approach presented here
seems to target different formats for different audiences, including
binary representations. More general, less specific.

Next, contents: this proposal is more verbose. In the textual
representation shown, it provides (usually redundant) information about
attribute names and types. Postgres-R doesn't ever transmit attribute
name or type information for INSERT, UPDATE or DELETE operations.
Instead, it relies on attribute numbers and pg_attributes being at some
known consistent state.

Let's compare by example:

> table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
> table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
> table "replication_example": DELETE (pkey): id[int4]:1

In Postgres-R, the change sets for these same operations would carry the
following information, in a binary representation:

> table "replication_example": INSERT: VALUES (1, 1, '1')
> table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
> table "replication_example": DELETE: PKEY(1) COID(78)

You may have noticed that there's an additional COID field. This is an
identifier for the transaction that last changed this tuple. Together
with the primary key, it effectively identifies the exact version of a
tuple (during its lifetime, for example before vs after an UPDATE). This
in turn is used by Postgres-R to detect conflicts.

It may be possible to add that to the proposed format as well, for it to
be able to implement a Postgres-R-like algorithm.

To finish off this comparison, let's take a look at how and where the
change sets are generated: in Postgres-R the change set stream is
constructed directly from the heap modification routines, i.e. in
heapam.c's heap_{insert,update,delete}() methods. Where as the patches
proposed here parse the WAL to reconstruct the modifications and add the
required meta information.

To me, going via the WAL first sounded like a step that unnecessarily
complicates matters. I recently talked to Andres and brought that up.
Here's my current view of things:

The Postgres-R approach is independent of WAL and its format, where as
the approach proposed here clearly is not. Either way, there is a
certain overhead - however minimal it is - which the former adds to the
transaction processing itself, while the later postpones it to a
separate XLogReader process. If there's any noticeable difference, it
might reduce latency in case of asynchronous replication, but can only
increase latency in the synchronous case. As far as I understood Andres,
it was easier to collect the additional meta data from within the
separate process.

In summary, I'd say that Postgres-R is an approach specifically
targeting and optimized for multi-master replication between Postgres
nodes, where as the proposed patches are kept more general.

I hope you found this to be an insightful and fair comparison.

Regards

Markus Wanner

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Andres Freund

Date:

16 November 2012, 14:05:36

Hi Markus,

On 2012-11-16 14:46:39 +0100, Markus Wanner wrote:
> On 11/15/2012 01:27 AM, Andres Freund wrote:
> > In response to this you will soon find the 14 patches that currently
> > implement $subject.
>
> Congratulations on that piece of work.

Thanks.

> I'd like to provide a comparison of the proposed change set format to
> the one used in Postgres-R.

Uh, sorry to interrupt you right here, but thats not the "proposed
format" ;) Thats just an example output plugin that people wished
for. For the use-case were after we (as in 2ndq) also want to use binary
data.  Its also rather useful for debugging and such.

I generally aggree that the presented format is too verbose for actual
replication, but it seems fine enough for showing off ;)

If you look at Patch 12/14 "Add a simple decoding module in contrib
named 'test_decoding'" you can see that adding a different output format
should be pretty straight forward.

Which output plugin is used is determined by the initial
INIT_LOGICAL_REPLICATION '$plugin'; command in a replication connection.

> To finish off this comparison, let's take a look at how and where the
> change sets are generated: in Postgres-R the change set stream is
> constructed directly from the heap modification routines, i.e. in
> heapam.c's heap_{insert,update,delete}() methods. Where as the patches
> proposed here parse the WAL to reconstruct the modifications and add the
> required meta information.
>
> To me, going via the WAL first sounded like a step that unnecessarily
> complicates matters. I recently talked to Andres and brought that up.
> Here's my current view of things:
>
> The Postgres-R approach is independent of WAL and its format, where as
> the approach proposed here clearly is not. Either way, there is a
> certain overhead - however minimal it is - which the former adds to the
> transaction processing itself, while the later postpones it to a
> separate XLogReader process. If there's any noticeable difference, it
> might reduce latency in case of asynchronous replication, but can only
> increase latency in the synchronous case. As far as I understood Andres,
> it was easier to collect the additional meta data from within the
> separate process.

There also is the point that if you do the processing inside heap_* you
need to make sure the replication targeted data is safely received &
fsynced away, in "our" case thats not necessary as WAL already provides
crash safety, so should the replication connection break you can simply
start from the location last confirmed as being safely sent.

As we want to provide asynchronous replication thats a rather major
point.

> In summary, I'd say that Postgres-R is an approach specifically
> targeting and optimized for multi-master replication between Postgres
> nodes, where as the proposed patches are kept more general.

One major aim definitely was optionally be able to replicate into just
about any target system, so yes, I certainly agree.

> I hope you found this to be an insightful and fair comparison.

Yes, input in general and especially from other replication providers is
certainly interesting and important!

Thanks,

Andres

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Andres Freund

Date:

16 November 2012, 14:14:36

Hi,

On 2012-11-16 14:46:39 +0100, Markus Wanner wrote:
> You may have noticed that there's an additional COID field. This is an
> identifier for the transaction that last changed this tuple. Together
> with the primary key, it effectively identifies the exact version of a
> tuple (during its lifetime, for example before vs after an UPDATE). This
> in turn is used by Postgres-R to detect conflicts.

Whats the data type of the "COID" in -R?

In the patchset the output plugin has enough data to get the old xid and
the new xid in the case of updates (not in the case of deletes, but
thats a small bug and should be fixable with a single line of code), and
it has enough information to extract the primary key without problems.

I wonder whether we also should track the xid epoch...

> It may be possible to add that to the proposed format as well, for it to
> be able to implement a Postgres-R-like algorithm.

I don't know the exact Postgres-R algorithm (but I queued reading some
papers you referred to when we talked), but I guess what we have in mind
is roughly similar - its just not even remotely part of this patchset ;)
Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 09/14] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader

From

Steve Singer

Date:

17 November 2012, 04:27:51

<div class="moz-cite-prefix">On 12-11-14 08:17 PM, Andres Freund wrote:<br /></div><blockquote
cite="mid:1352942234-3953-9-git-send-email-andres@2ndquadrant.com"type="cite"><pre wrap=""> 
For the regular satisfies routines this is needed in prepareation of logical
decoding. I changed the non-regular ones for consistency as well.

The naming between htup, tuple and similar is rather confused, I could not find
any consistent naming anywhere.

This is preparatory work for the logical decoding feature which needs to be
able to get to a valid relfilenode from when checking the visibility of a
tuple.

</pre></blockquote><br /> I have taken a look at this patch.   The patch does what it says,  it changes a bunch of <br
/>HeapTupleSatisfiesXXXX routines to take a HeapTuple structure instead of a HeapTupleHeader.<br /> It also sets the
HeapTuple.t_tableOidvalue before calling these routines.<br /><br /> The patch does not modify these routines to
actuallydo anything useful with the additional data fields though it does add some assertions to make sure t_tableOid
isset.<br /><br /> The patch compiles cleanly and the unit tests pass.<br /><br /> This patch  does not seem to depend
onany of the other patches in this set and applies cleanly against master.  The patch doesn't actually add any
functionality,unless someone sees a reason for complaining about this that I don't see, then I think it can be
committed.<br/><br /> Steve<br /><br /><blockquote cite="mid:1352942234-3953-9-git-send-email-andres@2ndquadrant.com"
type="cite"><prewrap=""> 
---contrib/pgrowlocks/pgrowlocks.c      |  2 +-src/backend/access/heap/heapam.c     | 13
++++++----src/backend/access/heap/pruneheap.c | 16 ++++++++++--src/backend/catalog/index.c          |  2
+-src/backend/commands/analyze.c      |  3 ++-src/backend/commands/cluster.c       |  2
+-src/backend/commands/vacuumlazy.c   |  3 ++-src/backend/storage/lmgr/predicate.c |  2
+-src/backend/utils/time/tqual.c      | 50 +++++++++++++++++++++++++++++-------src/include/utils/snapshot.h         |
4+--src/include/utils/tqual.h            | 20 +++++++--------11 files changed, 83 insertions(+), 34 deletions(-) 

</pre><br /><fieldset class="mimeAttachmentHeader"></fieldset><br /><pre wrap="">
</pre></blockquote><br />

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Markus Wanner

Date:

17 November 2012, 09:05:59

On 11/16/2012 03:05 PM, Andres Freund wrote:
>> I'd like to provide a comparison of the proposed change set format to
>> the one used in Postgres-R.
> 
> Uh, sorry to interrupt you right here, but thats not the "proposed
> format" ;)

Understood. Sorry, I didn't mean to imply that. It's pretty obvious to
me that this is more of a human readable format and that others,
including binary formats, can be implemented. I apologize for the bad
wording of a "proposed format", which doesn't make that clear.

>> The Postgres-R approach is independent of WAL and its format, where as
>> the approach proposed here clearly is not. Either way, there is a
>> certain overhead - however minimal it is - which the former adds to the
>> transaction processing itself, while the later postpones it to a
>> separate XLogReader process. If there's any noticeable difference, it
>> might reduce latency in case of asynchronous replication, but can only
>> increase latency in the synchronous case. As far as I understood Andres,
>> it was easier to collect the additional meta data from within the
>> separate process.
> 
> There also is the point that if you do the processing inside heap_* you
> need to make sure the replication targeted data is safely received &
> fsynced away, in "our" case thats not necessary as WAL already provides
> crash safety, so should the replication connection break you can simply
> start from the location last confirmed as being safely sent.

In the case of Postgres-R, the "safely received" part isn't really
handled at the change set level at all. And regarding the fsync
guarantee: you can well use the WAL to provide that, without basing
change set generation on in. In that regard, Postgres-R is probably the
more general approach: you can run Postgres-R with WAL turned off
entirely - which may well make sense if you take into account the vast
amount of cloud resources available, which don't have a BBWC. Instead of
WAL, you can add more nodes at more different locations. And no, you
don't want your database to ever go down, anyway  :-)

>> In summary, I'd say that Postgres-R is an approach specifically
>> targeting and optimized for multi-master replication between Postgres
>> nodes, where as the proposed patches are kept more general.
> 
> One major aim definitely was optionally be able to replicate into just
> about any target system, so yes, I certainly agree.

I'm glad I got that correct ;-)

Regards

Markus Wanner

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Markus Wanner

Date:

17 November 2012, 09:22:45

On 11/16/2012 03:14 PM, Andres Freund wrote:
> Whats the data type of the "COID" in -R?

It's short for CommitOrderId, a 32bit global transaction identifier,
being wrapped-around, very much like TransactionIds are. (In that sense,
it's global, but unique only for a certain amount of time).

> In the patchset the output plugin has enough data to get the old xid and
> the new xid in the case of updates (not in the case of deletes, but
> thats a small bug and should be fixable with a single line of code), and
> it has enough information to extract the primary key without problems.

It's the xmin of the old tuple that Postgres-R needs to get the COID.

Regards

Markus Wanner

Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From

Michael Paquier

Date:

17 November 2012, 10:14:11

On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Hi,

On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> This patch looks OK.
>
> I got 3 comments:
> 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
> 3455? It does not look necessary.

Its a mismerge and should have happened in "Add a new RELFILENODE
syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but
it seems I squashed the wrong two commits.
I had originally used 3171 but that since got used up for lo_tell64...

> 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> not mentionning it as the opposite operation of RelationMapOidToFilenode
> but as an operation that looks for the OID of a relation based on its
> relfilenode. Both functions are opposite but independent.

I described it as the opposite because RelationMapOidToFilenode is the
relmappers stated goal and RelationMapFilenodeToOid is just some
side-business.

> 3) Both functions are doing similar operations. Could it be possible
> to wrap them in the same central function?

I don't really see how without making both quite a bit more
complicated. The amount of if's needed seems to be too large to me.

OK thanks for your answers.
As this patch only adds a new function and is not that much complicated, I think there is no problem in committing it. The only thing to remove is the diff in indexing.h. Could someone take care of that?
If other people have additional comments on the ability to perform a relfileoid->reloid operation for cached maps, of course go ahead.

--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

17 November 2012, 13:30:36

On 11/16/2012 02:46 PM, Markus Wanner wrote:
> Andres,
>
> On 11/15/2012 01:27 AM, Andres Freund wrote:
>> In response to this you will soon find the 14 patches that currently
>> implement $subject.
> Congratulations on that piece of work.
>
>
> I'd like to provide a comparison of the proposed change set format to
> the one used in Postgres-R. I hope for this comparison to shed some
> light on the similarities and differences of the two projects. As the
> author of Postgres-R, I'm obviously biased, but I try to be as neutral
> as I can.
...
> Let's compare by example:
>
>> table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
>> table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
>> table "replication_example": DELETE (pkey): id[int4]:1
> In Postgres-R, the change sets for these same operations would carry the
> following information, in a binary representation:
>
>> table "replication_example": INSERT: VALUES (1, 1, '1')
>> table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
>> table "replication_example": DELETE: PKEY(1) COID(78)
Is it possible to replicate UPDATEs and DELETEs without a primary key in 
PostgreSQL-R


Hannu

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Markus Wanner

Date:

17 November 2012, 14:00:15

On 11/17/2012 02:30 PM, Hannu Krosing wrote:
> Is it possible to replicate UPDATEs and DELETEs without a primary key in
> PostgreSQL-R

No. There must be some way to logically identify the tuple. Note,
though, that theoretically any (unconditional) unique key would suffice.
In practice, that usually doesn't matter, as you rarely have one or more
unique keys without a primary.

Also note that the underlying index is useful for remote application of
change sets (except perhaps for very small tables).

In some cases, for example for n:m linking tables, you need to add a
uniqueness key that spans all columns (as opposed to a simple index on
one of the columns that's usually required, anyway). I hope for
index-only scans eventually mitigating this issue.

Alternatively, I've been thinking about the ability to add a hidden
column, which can then be used as a PRIMARY KEY without breaking legacy
applications that rely on SELECT * not returning that primary key.

Are there other reasons to want tables without primary keys that I'm
missing?

Regards

Markus Wanner

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

17 November 2012, 14:40:54

On 11/17/2012 03:00 PM, Markus Wanner wrote:
> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>> PostgreSQL-R
> No. There must be some way to logically identify the tuple.
It can be done as selecting on _all_ attributes and updating/deleting 
just the first matching row

create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...

This is on distant (round 3 or 4) roadmap for this work, just was 
interested
if you had found any better way of doing this :)

Hannu

> Note,
> though, that theoretically any (unconditional) unique key would suffice.
> In practice, that usually doesn't matter, as you rarely have one or more
> unique keys without a primary.
>
> Also note that the underlying index is useful for remote application of
> change sets (except perhaps for very small tables).
>
> In some cases, for example for n:m linking tables, you need to add a
> uniqueness key that spans all columns (as opposed to a simple index on
> one of the columns that's usually required, anyway). I hope for
> index-only scans eventually mitigating this issue.
>
> Alternatively, I've been thinking about the ability to add a hidden
> column, which can then be used as a PRIMARY KEY without breaking legacy
> applications that rely on SELECT * not returning that primary key.
>
> Are there other reasons to want tables without primary keys that I'm
> missing?
>
> Regards
>
> Markus Wanner

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

17 November 2012, 14:57:51

On 11/17/2012 03:00 PM, Markus Wanner wrote:
> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>> PostgreSQL-R
> No. There must be some way to logically identify the tuple. Note,
> though, that theoretically any (unconditional) unique key would suffice.
> In practice, that usually doesn't matter, as you rarely have one or more
> unique keys without a primary.
...
> Are there other reasons to want tables without primary keys that I'm
> missing?
>
Nope. The only place a table without a primary key would be needed is a 
log table, but as these are (supposed to be) INSERT-only this is not a 
problem for them.

Hannu

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Markus Wanner

Date:

17 November 2012, 17:41:45

Hannu,

On 11/17/2012 03:40 PM, Hannu Krosing wrote:
> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>> PostgreSQL-R
>> No. There must be some way to logically identify the tuple.
> It can be done as selecting on _all_ attributes and updating/deleting
> just the first matching row
> 
> create cursor ...
> select from t ... where t.* = (....)
> fetch one ...
> delete where current of ...

That doesn't sound like it could possibly work for Postgres-R. At least
not when there can be multiple rows with all the same attributes, i.e.
without a unique key constraint over all columns.

Otherwise, some nodes could detect two concurrent UPDATES as a conflict,
while other nodes select different rows and don't handle it as a conflict.

Regards

Markus Wanner

Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

From

Andres Freund

Date:

18 November 2012, 16:18:39

On 2012-11-17 19:14:06 +0900, Michael Paquier wrote:
> On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > Hi,
> >
> > On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
> > > This patch looks OK.
> > >
> > > I got 3 comments:
> > > 1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
> > > 3455? It does not look necessary.
> >
> > Its a mismerge and should have happened in "Add a new RELFILENODE
> > syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but
> > it seems I squashed the wrong two commits.
> > I had originally used 3171 but that since got used up for lo_tell64...
> >
> > > 2) You should perhaps change the header of RelationMapFilenodeToOid so as
> > > not mentionning it as the opposite operation of RelationMapOidToFilenode
> > > but as an operation that looks for the OID of a relation based on its
> > > relfilenode. Both functions are opposite but independent.
> >
> > I described it as the opposite because RelationMapOidToFilenode is the
> > relmappers stated goal and RelationMapFilenodeToOid is just some
> > side-business.
> >
> > > 3) Both functions are doing similar operations. Could it be possible
> > > to wrap them in the same central function?
> >
> > I don't really see how without making both quite a bit more
> > complicated. The amount of if's needed seems to be too large to me.
> >
> OK thanks for your answers.
> As this patch only adds a new function and is not that much complicated, I
> think there is no problem in committing it. The only thing to remove is the
> diff in indexing.h. Could someone take care of that?

I pushed a rebase to the git repository that fixed it...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Michael Paquier

Date:

19 November 2012, 00:31:17

On Fri, Nov 16, 2012 at 5:16 PM, Andrea Suisani <sickpig@opinioni.net> wrote:

Il 16/11/2012 05:34, Michael Paquier ha scritto:

Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature globally.
Sorry I recall that you put a link somewhere but I cannot remember its email...

http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php

Thanks Andrea.
I am pretty sure I will be able to provide some feedback by Friday.

--
Michael Paquier
http://michael.otacoo.com

Re: [PATCH 13/14] Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes

From

Michael Paquier

Date:

19 November 2012, 06:46:36

Hi,

I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see patch attached.
Thanks,

On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:

---
src/bin/pg_basebackup/Makefile | 7 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
src/bin/pg_basebackup/streamutil.c | 3 +-
src/bin/pg_basebackup/streamutil.h | 1 +
4 files changed, 725 insertions(+), 3 deletions(-)
create mode 100644 src/bin/pg_basebackup/pg_receivellog.c

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Michael Paquier
http://michael.otacoo.com

Attachment

20121119_pg_receivellog_install.patch

Re: logical changeset generation v3

From

Michael Paquier

Date:

19 November 2012, 07:29:01

Hi Andres,

I have been able to fetch your code (thanks Andrea!) and some it. For the time being I am spending some time reading the code and understanding the whole set of features you are trying to implement inside core, even if I got some background from what you presented at PGCon and from the hackers ML.
Btw, as a first approach, I tried to run the logical log receiver plugged on a postgres server, and I am not able to make it work.

Well, I am using settings similar to yours.
# Run master
rm -r ~/bin/pgsql/master/
initdb -D ~/bin/pgsql/master/
echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
postgres -D ~/bin/pgsql/master \
          -c wal_level=logical \
          -c max_wal_senders=10 \
          -c max_logical_slots=10 \
          -c wal_keep_segments=100 \
          -c log_line_prefix="[%p %x] "
# Logical log receiver
pg_receivellog -f $HOME/output.txt -d postgres -v

After launching some SQLs, the logical receiver is stuck just after sending INIT_LOGICAL_REPLICATION, please see bt of process waiting:
(gdb) bt
#0 0x00007f1bbc13b170 in __poll_nocancel () from /usr/lib/libc.so.6
#1 0x00007f1bbc43072d in pqSocketPoll (sock=3, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1089
#2 0x00007f1bbc43060d in pqSocketCheck (conn=0x1dd0290, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1031
#3 0x00007f1bbc4304dd in pqWaitTimed (forRead=1, forWrite=0, conn=0x1dd0290, finish_time=-1) at fe-misc.c:963
#4 0x00007f1bbc4304af in pqWait (forRead=1, forWrite=0, conn=0x1dd0290) at fe-misc.c:946
#5 0x00007f1bbc42c64c in PQgetResult (conn=0x1dd0290) at fe-exec.c:1709
#6 0x00007f1bbc42cd62 in PQexecFinish (conn=0x1dd0290) at fe-exec.c:1974
#7 0x00007f1bbc42c9c8 in PQexec (conn=0x1dd0290, query=0x406c60 "INIT_LOGICAL_REPLICATION 'test_decoding'") at fe-exec.c:1808
#8 0x0000000000402370 in StreamLog () at pg_receivellog.c:263
#9 0x00000000004030c9 in main (argc=6, argv=0x7fff44edb288) at pg_receivellog.c:694
So I am not able to output any results using pg_receivellog.

Also, I noticed 2 errors in your set of tests.

On Thu, Nov 15, 2012 at 9:27 AM, Andres Freund <andres@anarazel.de> wrote:

-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));

In SET clause, the column name is somedata and not somedate

-- dont write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));

Same error here, somedata and not somedate. Not a big deal, it made the transactions failing though.

--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From

Andres Freund

Date:

19 November 2012, 08:50:36

Hi Michael,


On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> I have been able to fetch your code (thanks Andrea!) and some it. For the
> time being I am spending some time reading the code and understanding the
> whole set of features you are trying to implement inside core, even if I
> got some background from what you presented at PGCon and from the hackers
> ML.

Cool.

> Btw, as a first approach, I tried to run the logical log receiver plugged
> on a postgres server, and I am not able to make it work.

> Well, I am using settings similar to yours.
> # Run master
> rm -r ~/bin/pgsql/master/
> initdb -D ~/bin/pgsql/master/
> echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
> postgres -D ~/bin/pgsql/master \
>           -c wal_level=logical \
>           -c max_wal_senders=10 \
>           -c max_logical_slots=10 \
>           -c wal_keep_segments=100 \
>           -c log_line_prefix="[%p %x] "
> # Logical log receiver
> pg_receivellog -f $HOME/output.txt -d postgres -v
>
> After launching some SQLs, the logical receiver is stuck just after sending
> INIT_LOGICAL_REPLICATION, please see bt of process waiting:

Its waiting till it sees initial an initial xl_running_xacts record. The
easiest way to do that is manually issue a checkpoint. Sorry, should
have included that in the description.
Otherwise you can wait till the next routine checkpoint comes arround...

I plan to cause more xl_running_xacts record to be logged in the
future. I think the timing of those currently is non-optimal, you have
the same problem as here in normal streaming replication as well :(

> > -- wrapped in a transaction
> > BEGIN;
> > INSERT INTO replication_example(somedata, text) VALUES (1, 1);
> > UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT
> > currval('replication_example_id_seq'));
> >
> In SET clause, the column name is *somedata* and not *somedate*

Crap. Sorry for that. I wrote the example in the mailclient and then
executed it and I seem to have forgot to put back some of the fixes...


> I am just looking at this patch and will provide some comments.
> By the way, you forgot the installation part of pg_receivellog, please see
> patch attached.

That actually was somewhat intended, I thought people wouldn't like the
name and I didn't want a binary thats going to be replaced anyway lying
arround ;)

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3 - Source for Slony

From

Steve Singer

Date:

19 November 2012, 12:05:28

First, you can add me to the list of people saying 'wow', I'm impressed.

The approach I am taking to reviewing this to try and answer the 
following question

1) How might a future version of slony be able to use logical 
replication as described by your patch and design documents
and what would that look like.

2) What functionality is missing from the patch set that would stop me 
from implementing or prototyping the above.



Connecting slon to the remote postgresql
========================

Today the slony remote listener thread queries a bunch of events from 
sl_event for a batch of SYNC events. Then the remote helper thread 
queries data from sl_log_1 and sl_log_2.    I see this changing, instead 
the slony remote listener thread would connect to the remote system and 
get a logical replication stream.
  1) Would slony connect as a normal client connection and call 
something like 'select pg_slony_process_xlog(...)' to get bunch of 
logical replication      change records to process.  OR  2) Would slony connect as a replication connection similar to
howthe 
 
pg_receivelog program does today and then process the logical changeset      outputs.  Instead of writing it to a file
(aspg_receivelog does)
 

It seems that the second approach is what is encouraged.  I think we 
would put a lot of the pg_receivelog functionality into slon and it 
would issue a command like 'INIT_LOGICAL_REPLICATION 'slony') to use the 
slony logical replication plugin.  Slon would also have to provide 
feedback to the walsender about what it has processed so the origin 
database knows what catalog snapshots can be expired.  Based on 
eyeballing pg_receivelog.c it seems that about half the code in the 700 
file is related to command line arguments etc, and the other half is 
related to looping over the copy out stream, sending feedback and other 
things that we would need to duplicate in slon.

pg_receivelog.c has  comment:

/* * We have to use postgres.h not postgres_fe.h here, because there's so 
much * backend-only stuff in the XLOG include files we need.  But we need a * frontend-ish environment otherwise.
Hencethis ugly hack. */
 

This looks like more of a carryover from pg_receivexlog.c.  From what I 
can tell we can eliminate the postgres.h include if we also eliminate 
the utils/datetime.h and utils/timestamp.h and instead add in:

#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0

If there is a better way of getting these defines someone should speak 
up.   I recall that in the past slon actually did include postgres.h and 
it caused some issues (I think with MSVC win32 builds).  Since 
pg_receivelog.c will be used as a starting point/sample for third 
parties to write client programs it would be better if it didn't 
encourage client programs to include postgres.h


The Slony Output Plugin
=====================

Once we've modified slon to connect as a logical replication client we 
will need to write a slony plugin.

As I understand the plugin API:
* A walsender is processing through WAL records, each time it sees a 
COMMIT WAL record it will call my plugins
.begin
.change (for each change in the transaction)
.commit

* The plugin for a particular stream/replication client will see one 
transaction at a time passed to it in commit order.  It won't see 
.change(t1) followed by .change (t2), followed by a second .change(t1).  
The reorder buffer code hides me from all that complexity (yah)
From a slony point of view I think the output of the plugin will be 
rows, suitable to be passed to COPY IN of the form:

origin_id, table_namespace,table_name,command_type, 
cmd_updatencols,command_args

This is basically the Slony 2.2 sl_log format minus a few columns we no 
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs.  Ie [ 
{id=1},{name='steve'},{project='slony'}]

I don't t think our output plugin will be much more complicated than the 
test_decoding plugin.  I suspect we will want to give it the ability to 
filter out non-replicated tables.   We will also have to filter out 
change records that didn't originate on the local-node that aren't part 
of a cascaded subscription.  Remember that in a two node cluster  slony 
will have connections from A-->B  and from B--->A even if user tables 
only flow one way. Data that is replicated from A into B will show up in 
the WAL stream for B.

Exactly how we do this filtering is an open question,  I think the 
output plugin will at a minimum need to know:

a) What the slony node id is of the node it is running on.  This is easy 
to figure out if the output plugin is able/allowed to query its 
database.  Will this be possible? I would expect to be able to query the 
database as it exists now(at plugin invocation time) not as it existed 
in the past when the WAL was generated.   In addition to the node ID I 
can see us wanting to be able to query other slony tables 
(sl_table,sl_set etc...)

b) What the slony node id is of the node we are streaming too.   It 
would be nice if we could pass extra, arbitrary data/parameters to the 
output plugins that could include that, or other things.  At the moment 
the start_logical_replication rule in repl_gram.y doesn't allow for that 
but I don't see why we couldn't make it do so.

I still see some open questions about exactly how we would filter out 
data in this stage.


<editorial> Everything above deals with the postgresql side of things, 
ie the patch in question or the plugin API we would  have to work with.
Much of what is below deals with slony side change and might of limited 
interest to some on pgsql-hackers
</editorial>

Slon Applying Changes
================

The next task we will have is to make slon and the replica instance be 
able to apply these changes.  In slony 2.2 we do a COPY from sl_log and 
apply that stream to a table on the replica with COPY. We then have 
triggers on the replica that decode the command_args and apply the 
changes as
INSERT/UPDATE/DELETE statements on the user tables.   I see this 
continuing to work in this fashion, but there are a few special cases:

1) Changes made to sl_event on the origin will result in records in the 
logical replication stream that change sl_event.  In many cases we won't
just be inserting records into sl_event but we will need to instead do 
the logic in remote_worker.c for processing the different types of
events.  Worst case we could parse the change records we receive from 
our version pg_receivellog and split the sl_event records out into a 
sl_event stream and a sl_log stream.  Another approach might be to have 
the slony apply trigger build up a list of events that the slon 
remote_worker code can than process through.

2) Slony is normally bi-directional even if user data only replicates 
one way.  Confirm (sl_confirm) entries go from a replica back to an 
origin.  In a two node origin->replica scenario for data, the way I see 
this working is that the slon for the origin would connect to the 
replica (like it does today).
It would receive the logical replication records, but since it isn't 
subscribed to any tables it won't receive/process the WAL for 
user-tables but it will still receive/process sl_confirm rows.   It will 
then insert the rows in sl_confirm that it 'replicated' from the remote 
node.


With what I have described so far, Slony would then be receiving a 
stream of events that look like

t1-insert into foo , [id=1, name='steve']
t1-insert into bar  [id=1, something='somethingelse']
t1-commit
t2- insert into foo [....]
t2-commit
t3- insert into sl_event [ev_type=SYNC, ev_origin=1,ev_seqno=12345]
t3-commit

Even though, from a data-correctness point of view, slony could commit 
the transaction on the replica after it sees the t1 commit, we won't 
want it to do commits other than on a SYNC boundary.  This means that
the replicas will continue to move between consistent SYNC snapshots and 
that we can still track the state/progress of replication by knowing 
what events (SYNC or otherwise) have been confirmed.

This also means that slony should only provide  feedback to the 
walsender on SYNC boundaries after the transaction has committed on the 
receiver. I don't see this as being an issue.

Setting up Subscriptions
===================
At first we have a slon cluster with just 1 node, life is good. When a 
second node is created and a path(or pair of paths) are defined between 
the nodes I think they will each:
1. Connect to the remote node with a normal libpq connection.    a. Get the current xlog recptr,    b. Query any
non-syncevents of interest from sl_event.
 
2. Connect to the remote node with a logical replication connection and 
start streaming logical replication changes start at the recptr we retrieved    above.

Slon will then receive any future events from the remote sl_event as 
part of the logical replication stream.  It won't receive any user 
tables because it isn't yet subscribed to any.

When a subscription is started,  the SUBSCRIBE_SET and 
ENABLE_SUBSCRIPTION events will go through sl_event and the INSERT INTO 
sl_event will be part of a change record in the replication stream and 
be picked up by the subscribers slon remote_worker.

The remote_worker:copy_set will then need to get a consistent COPY of 
the tables in the replication set such that any changes made to the 
tables after the copy is started get included in the replication 
stream.  The approach proposed in the DESIGN.TXT file with exporting a 
snapshot sounds okay for this.    I *think* slony could get by with 
something less fancy as well but it would be ugly.

1. Make sure that the origin starts including change records for the 
tables in the set
2. have the slon(copy_set) wait until any transactions on the origin, 
that started prior to the ENABLE_SUBSCRIPTION, are committed.     Slony does this today as part of the copy_set logic.
3.  Get/remember the snapshot visibility information for the COPY's 
transaction
4.  When we start to process change records we need to filter out 
records for transactions that were already visible by the copy.

Steps 1-3 are similar to how slony works today, but step 4 will be a bit 
awkward/ugly.  This isn't an issue today because we are already using 
the transaction visibility information for selecting from sl_log so it 
works, but above I had proposed stripping the xid from the logical 
change records.



Cascading Replication
=================
A-->B--->C

The slon for B will insert records from A into B's tables.  This insert 
will generate WAL records on B.  The slon for C should be able to pull 
the data it needs  (both sl_event entries with ev_origin=A, and user 
table data originating on A) from B's logical replication stream.  I 
don't see any issues here nor do I see a need to 'cache' the data in an 
sl_log type of table on B.


Reshaping Replication
=================

In Slony replication is reshaped by two types events, a MOVE SET and a 
FAILOVER.

Move Set:
A replication set might be subscribed in a cascaded fashion like
A--->B--->C

When a MOVE SET is issued node A will stop accepting new write 
transactions for tables in the set.  A MOVE_SET(1,A,B) event is then put 
into sl_event on node A. Node A will then stop accepting new 
transactions on the tables in set 1.
Node B receives the MOVE_SET command in the proper order, after it has 
processed the last SYNC generated on A when A was still accepting write 
transactions to those tables.  When Node B processes the MOVE_SET event 
then node B starts accepting write transactions on the tables.  Node B 
will also generates an ACCEPT_SET event. Node C will then receive the 
MOVE SET (ev_origin=A) and the ACCEPT_SET(ev_origin=B) command (after 
all SYNC events from A with data changes to the set) and then knows that 
it should start data on those tables from B.

I don't see any of this changing with logical replication acting as the 
data source.

FAILOVER:
---------------
A---->B
|    .
v  .
C

Today with slony, if B is a valid failover target then it is a 
forwarding node of the set.  This means that B keeps a record in sl_log 
of any changes originating on A until B knows that node C has received 
those changes.  In the event of a failover, if node C is far behind, it 
can just get the missing data from sl_log on node B (the failover 
target/new origin).

I see a problem with what I have discussed above, B won't explicitly 
store the data from A in sl_log, a cascaded node would depend on B's WAL 
stream.
The problem is that at FAILOVER time,  B might have processed some 
changes from A. Node  C might also be processing Node B's WAL stream for 
events (or data from another set).  Node C will discard/not receive the 
data for A's tables since it isn't subscribed to those tables from B.  
What happens then if at some later point B and C receive the FAILOVER event.
What does node C do? It can't get the missing data from node A because 
node A has failed, and it can't get it from node B because node C has 
already processed the WAL changes from node B that included the data but 
it ignored/discarded it.  Maybe node C could reprocess older WAL from 
node B? Maybe this forces us to keep an sl_log type structure around?

Is it complete enough to build a prototype?
==========================
I think so, the incomplete areas I see are the ones that mentioned in 
the patch submission including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk

I think it would be possible to build a prototype without those even 
though we'd need them before I could build a production system.

Conclusions
=============
I like this design much better than the original design from the spring 
that would have required keeping a catalog proxy on the decoding 
machine.  Based on what I've seen it should be possible to make slony 
use logical replication as a source for events instead of triggers 
populating sl_log.
My thinking is that we want a way for logreceiver programs to pass 
arguments/parameters to the output plugins. Beyond that this looks like 
something slony can use.

Re: logical changeset generation v3 - Source for Slony

From

Andres Freund

Date:

19 November 2012, 12:05:42

Hi Steve!

On 2012-11-17 22:50:35 -0500, Steve Singer wrote:
> First, you can add me to the list of people saying 'wow', I'm impressed.

Thanks!

> The approach I am taking to reviewing this to try and answer the following
> question
>
> 1) How might a future version of slony be able to use logical replication as
> described by your patch and design documents
> and what would that look like.
>
> 2) What functionality is missing from the patch set that would stop me from
> implementing or prototyping the above.

Sounds like a good plan to me.

>
> Connecting slon to the remote postgresql
> ========================
>
> Today the slony remote listener thread queries a bunch of events from
> sl_event for a batch of SYNC events. Then the remote helper thread queries
> data from sl_log_1 and sl_log_2.    I see this changing, instead the slony
> remote listener thread would connect to the remote system and get a logical
> replication stream.
>
>   1) Would slony connect as a normal client connection and call something
> like 'select pg_slony_process_xlog(...)' to get bunch of logical replication
>       change records to process.
>   OR
>   2) Would slony connect as a replication connection similar to how the
> pg_receivelog program does today and then process the logical changeset
>       outputs.  Instead of writing it to a file (as pg_receivelog does)

It would need to be the latter. We need the feedback messages it sends
for several purposes:
- increasing the lowered xmin
- implementing optionally synchronous replication at some point
- using 1) would mean having transactions open...

> It seems that the second approach is what is encouraged.  I think we would
> put a lot of the pg_receivelog functionality into slon and it would issue a
> command like 'INIT_LOGICAL_REPLICATION 'slony') to use the slony logical
> replication plugin.  Slon would also have to provide feedback to the
> walsender about what it has processed so the origin database knows what
> catalog snapshots can be expired.  Based on eyeballing pg_receivelog.c it
> seems that about half the code in the 700 file is related to command line
> arguments etc, and the other half is related to looping over the copy out
> stream, sending feedback and other things that we would need to duplicate in
> slon.

I think we should provide some glue code to do this, otherwise people
will start replicating all the bugs I hacked into this... More
seriously: I think we should have support code here, no user will want
to learn the intracacies of feedback messages and such. Where that would
live? No idea.

> pg_receivelog.c has  comment:

(its pg_receivellog btw. ;))

>
> /*
>  * We have to use postgres.h not postgres_fe.h here, because there's so much
>  * backend-only stuff in the XLOG include files we need.  But we need a
>  * frontend-ish environment otherwise.    Hence this ugly hack.
>  */
>
> This looks like more of a carryover from pg_receivexlog.c.  From what I can
> tell we can eliminate the postgres.h include if we also eliminate the
> utils/datetime.h and utils/timestamp.h and instead add in:
>
> #include "postgres_fe.h"
> #define POSTGRES_EPOCH_JDATE 2451545
> #define UNIX_EPOCH_JDATE 2440588
> #define SECS_PER_DAY 86400
> #define USECS_PER_SEC INT64CONST(1000000)
> typedef int64 XLogRecPtr;
> #define InvalidXLogRecPtr 0
>
> If there is a better way of getting these defines someone should speak up.
> I recall that in the past slon actually did include postgres.h and it caused
> some issues (I think with MSVC win32 builds).  Since pg_receivelog.c will be
> used as a starting point/sample for third parties to write client programs
> it would be better if it didn't encourage client programs to include
> postgres.h

I wholeheartedly aggree. It should also be cleaned up a fair bit before
others copy it should we not go for having some client side library.

Imo the library could very roughly be something like:

state = SetupStreamingLLog(replication-slot, ...);
while((message = StreamingLLogNextMessage(state))
{    write(outfd, message->data, message->length);    if (received_100_messages)    {         fsync(outfd);
StreamingLLogConfirm(message);   }
 
}

Although I guess thats not good enough because StreamingLLogNextMessage
would be blocking, but that shouldn't be too hard to work around.


> The Slony Output Plugin
> =====================
>
> Once we've modified slon to connect as a logical replication client we will
> need to write a slony plugin.
>
> As I understand the plugin API:
> * A walsender is processing through WAL records, each time it sees a COMMIT
> WAL record it will call my plugins
> .begin
> .change (for each change in the transaction)
> .commit
>
> * The plugin for a particular stream/replication client will see one
> transaction at a time passed to it in commit order.  It won't see
> .change(t1) followed by .change (t2), followed by a second .change(t1).  The
> reorder buffer code hides me from all that complexity (yah)

Correct.

> From a slony point of view I think the output of the plugin will be rows,
> suitable to be passed to COPY IN of the form:
>
> origin_id, table_namespace,table_name,command_type,
> cmd_updatencols,command_args
>
> This is basically the Slony 2.2 sl_log format minus a few columns we no
> longer need (txid, actionseq).
> command_args is a postgresql text array of column=value pairs.  Ie [
> {id=1},{name='steve'},{project='slony'}]

It seems to me that that makes escaping unneccesarily complicated, but
given you already have all the code... ;)

> I don't t think our output plugin will be much more complicated than the
> test_decoding plugin.

Good. Thats the idea ;). Are you ok with the interface as it is now or
would you like to change something?

> I suspect we will want to give it the ability to
> filter out non-replicated tables.   We will also have to filter out change
> records that didn't originate on the local-node that aren't part of a
> cascaded subscription.  Remember that in a two node cluster  slony will have
> connections from A-->B  and from B--->A even if user tables only flow one
> way. Data that is replicated from A into B will show up in the WAL stream
> for B.

Yes. We will also need something like that. If you remember the first
prototype we sent to the list, it included the concept of an
'origin_node' in wal record. I think you actually reviewed that one ;)

That was exactly aimed at something like this...

Since then my thoughts about how the origin_id looks like have changed a
bit:
- origin id is internally still represented as an uint32/Oid - never visible outside of wal/system catalogs
- externally visible it gets - assigned an uuid - optionally assigned a user defined name
- user settable (permissions?) origin when executing sql: - SET change_origin_uuid = 'uuid'; - SET change_origin_name =
'user-settable-name';- defaults to the local node
 
- decoding callbacks get passed the origin of a change - txn->{origin_uuid, origin_name, origin_internal?}
- the init decoding callback can setup an array of interesting origins, so the others don't even get the ReorderBuffer
treatment

I have to thank the discussion on -hackers and a march through prague
with Marko here...

> Exactly how we do this filtering is an open question,  I think the output
> plugin will at a minimum need to know:
>
> a) What the slony node id is of the node it is running on.  This is easy to
> figure out if the output plugin is able/allowed to query its database.  Will
> this be possible? I would expect to be able to query the database as it
> exists now(at plugin invocation time) not as it existed in the past when the
> WAL was generated.   In addition to the node ID I can see us wanting to be
> able to query other slony tables (sl_table,sl_set etc...)

Hm. There is no fundamental reason not to allow normal database access
to the current database but it won't be all that cheap, so doing it
frequently is not a good idea.
The reason its not cheap is that you basically need to teardown the
postgres internal caches if you switch the timestream in which you are
working.

Would go something like:

TransactionContext = AllocSetCreate(...);
RevertFromDecodingSnapshot();
InvalidateSystemCaches();
StartTransactionCommand();
/* do database work */
CommitTransactionCommand();
/* cleanup memory*/
SetupDecodingSnapshot(snapshot, data);
InvalidateSystemCaches();

Why do you need to be able to query the present? I thought it might be
neccesary to allow additional tables be accessed in a timetraveling
manner, but not this way round.
I guess an initial round of querying during plugin initialization won't
be good enough?

> b) What the slony node id is of the node we are streaming too.   It would be
> nice if we could pass extra, arbitrary data/parameters to the output plugins
> that could include that, or other things.  At the moment the
> start_logical_replication rule in repl_gram.y doesn't allow for that but I
> don't see why we couldn't make it do so.

Yes, I think we want something like that. I even asked input on that
recently ;):
http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de

Input welcome!


> Even though, from a data-correctness point of view, slony could commit the
> transaction on the replica after it sees the t1 commit, we won't want it to
> do commits other than on a SYNC boundary.  This means that the replicas will
> continue to move between consistent SYNC snapshots and that we can still
> track the state/progress of replication by knowing what events (SYNC or
> otherwise) have been confirmed.

I don't know enough about slony internals, but: why? This will prohibit
you from ever doing (per-transaction) synchronous replication...

> This also means that slony should only provide  feedback to the walsender on
> SYNC boundaries after the transaction has committed on the receiver. I don't
> see this as being an issue.

Yes, thats no problem. You need to give feedback more frequently
(otherwise walsender kicks you off), but you don't have to increase the
confirmed flush location.

> Setting up Subscriptions
> ===================
> At first we have a slon cluster with just 1 node, life is good. When a
> second node is created and a path(or pair of paths) are defined between the
> nodes I think they will each:
> 1. Connect to the remote node with a normal libpq connection.
>     a. Get the current xlog recptr,
>     b. Query any non-sync events of interest from sl_event.
> 2. Connect to the remote node with a logical replication connection and
> start streaming logical replication changes start at the recptr we retrieved
>     above.

Note that INIT_LOGICAL_REPLICATION can take some time to get to the
initial consistent state (especially if there are longrunning
transactions). So you should do the init in 1), query all the events in
the snapshot that returns and then go over to 2).

> The remote_worker:copy_set will then need to get a consistent COPY of the
> tables in the replication set such that any changes made to the tables after
> the copy is started get included in the replication stream.  The approach
> proposed in the DESIGN.TXT file with exporting a snapshot sounds okay for
> this.    I *think* slony could get by with something less fancy as well but
> it would be ugly.

The snapshot exporting isn't really that much additional work as we
already need to support most of it for keeping state across restarts.

> FAILOVER:
> ---------------
> A---->B
> |    .
> v  .
> C
>
> Today with slony, if B is a valid failover target then it is a forwarding
> node of the set.  This means that B keeps a record in sl_log of any changes
> originating on A until B knows that node C has received those changes.  In
> the event of a failover, if node C is far behind, it can just get the
> missing data from sl_log on node B (the failover target/new origin).
>
> I see a problem with what I have discussed above, B won't explicitly store
> the data from A in sl_log, a cascaded node would depend on B's WAL stream.
> The problem is that at FAILOVER time,  B might have processed some changes
> from A. Node  C might also be processing Node B's WAL stream for events (or
> data from another set).  Node C will discard/not receive the data for A's
> tables since it isn't subscribed to those tables from B.  What happens then
> if at some later point B and C receive the FAILOVER event.
> What does node C do? It can't get the missing data from node A because node
> A has failed, and it can't get it from node B because node C has already
> processed the WAL changes from node B that included the data but it
> ignored/discarded it.  Maybe node C could reprocess older WAL from node B?
> Maybe this forces us to keep an sl_log type structure around?

I fear youve left me behind here, sorry, can't give you any input.

> Is it complete enough to build a prototype?
> ==========================
> I think so, the incomplete areas I see are the ones that mentioned in the
> patch submission including:
> * Snapshot exporting for the initial COPY
> * Spilling the reorder buffer to disk
>
> I think it would be possible to build a prototype without those even though
> we'd need them before I could build a production system.

> Conclusions
> =============
> I like this design much better than the original design from the spring that
> would have required keeping a catalog proxy on the decoding machine.  Based
> on what I've seen it should be possible to make slony use logical
> replication as a source for events instead of triggers populating sl_log.
> My thinking is that we want a way for logreceiver programs to pass
> arguments/parameters to the output plugins. Beyond that this looks like
> something slony can use.
>
>

Cool!

Don't hesitate to mention anything that you think would make you life
easier, chances are that youre not the only one who could benefit from
it...


Thanks,

Andres

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Michael Paquier

Date:

20 November 2012, 00:30:44

On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Hi Michael,

On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> I have been able to fetch your code (thanks Andrea!) and some it. For the
> time being I am spending some time reading the code and understanding the
> whole set of features you are trying to implement inside core, even if I
> got some background from what you presented at PGCon and from the hackers
> ML.

Cool.

> Btw, as a first approach, I tried to run the logical log receiver plugged
> on a postgres server, and I am not able to make it work.

> Well, I am using settings similar to yours.
> # Run master
> rm -r ~/bin/pgsql/master/
> initdb -D ~/bin/pgsql/master/
> echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
> postgres -D ~/bin/pgsql/master \
> -c wal_level=logical \
> -c max_wal_senders=10 \
> -c max_logical_slots=10 \
> -c wal_keep_segments=100 \
> -c log_line_prefix="[%p %x] "
> # Logical log receiver
> pg_receivellog -f $HOME/output.txt -d postgres -v
>
> After launching some SQLs, the logical receiver is stuck just after sending
> INIT_LOGICAL_REPLICATION, please see bt of process waiting:

Its waiting till it sees initial an initial xl_running_xacts record. The
easiest way to do that is manually issue a checkpoint. Sorry, should
have included that in the description.
Otherwise you can wait till the next routine checkpoint comes arround...

I plan to cause more xl_running_xacts record to be logged in the
future. I think the timing of those currently is non-optimal, you have
the same problem as here in normal streaming replication as well :(

> I am just looking at this patch and will provide some comments.
> By the way, you forgot the installation part of pg_receivellog, please see
> patch attached.

That actually was somewhat intended, I thought people wouldn't like the
name and I didn't want a binary that's going to be replaced anyway lying
around ;)

OK no problem. For sure this is going to happen, I was wondering myself if it could be possible to merge pg_receivexlog and pg_receivellog into a single utility with multiple modes :)

Btw, here are some extra comments based on my progress, hope it will be useful for other people playing around with your patches.
1) Necessary to install the contrib module test_decoding on server side or the test case will not work.
2) Obtention of the following logs on server:
LOG: forced to assume catalog changes for xid 1370 because it was running to early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
3) Assertion failure while running pgbench, I was just curious to see how it reacted when logical replication was put under a little bit of load.
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c", Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
4) Mentionned by Andres above, but logical replication begins only there is a xl_running_xacts record. I just enforced a checkpoint manually.

With all those things done, I have been able to set up the system, for example those queries:
postgres=# create table ac (a int);
CREATE TABLE
postgres=# insert into ac values (1);
INSERT 0 1
created the expected output:
BEGIN 32135
COMMIT 32135
BEGIN 32136
table "ac": INSERT: a[int4]:1
COMMIT 32136

Now it is time to dig into the code...
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3 - Source for Slony

From

Steve Singer

Date:

20 November 2012, 00:50:44

On 12-11-18 11:07 AM, Andres Freund wrote:
> Hi Steve!
>
>
> I think we should provide some glue code to do this, otherwise people
> will start replicating all the bugs I hacked into this... More
> seriously: I think we should have support code here, no user will want
> to learn the intracacies of feedback messages and such. Where that would
> live? No idea.

libpglogicalrep.so ?

> I wholeheartedly aggree. It should also be cleaned up a fair bit before
> others copy it should we not go for having some client side library.
>
> Imo the library could very roughly be something like:
>
> state = SetupStreamingLLog(replication-slot, ...);
> while((message = StreamingLLogNextMessage(state))
> {
>       write(outfd, message->data, message->length);
>       if (received_100_messages)
>       {
>            fsync(outfd);
>            StreamingLLogConfirm(message);
>       }
> }
>
> Although I guess thats not good enough because StreamingLLogNextMessage
> would be blocking, but that shouldn't be too hard to work around.
>

How about we pass a timeout value to StreamingLLogNextMessage (..) where 
it returns if no data is available after the timeout to give the caller 
a chance to do something else.

>> This is basically the Slony 2.2 sl_log format minus a few columns we no
>> longer need (txid, actionseq).
>> command_args is a postgresql text array of column=value pairs.  Ie [
>> {id=1},{name='steve'},{project='slony'}]
> It seems to me that that makes escaping unneccesarily complicated, but
> given you already have all the code... ;)

When I look at the actual code/representation we picked it is closer to 
{column1,value1,column2,value2...}



>> I don't t think our output plugin will be much more complicated than the
>> test_decoding plugin.
> Good. Thats the idea ;). Are you ok with the interface as it is now or
> would you like to change something?

I'm going to think about this some more and maybe try to write an 
example plugin before I can say anything with confidence.

>
> Yes. We will also need something like that. If you remember the first
> prototype we sent to the list, it included the concept of an
> 'origin_node' in wal record. I think you actually reviewed that one ;)
>
> That was exactly aimed at something like this...
>
> Since then my thoughts about how the origin_id looks like have changed a
> bit:
> - origin id is internally still represented as an uint32/Oid
>    - never visible outside of wal/system catalogs
> - externally visible it gets
>    - assigned an uuid
>    - optionally assigned a user defined name
> - user settable (permissions?) origin when executing sql:
>    - SET change_origin_uuid = 'uuid';
>    - SET change_origin_name = 'user-settable-name';
>    - defaults to the local node
> - decoding callbacks get passed the origin of a change
>    - txn->{origin_uuid, origin_name, origin_internal?}
> - the init decoding callback can setup an array of interesting origins,
>    so the others don't even get the ReorderBuffer treatment
>
> I have to thank the discussion on -hackers and a march through prague
> with Marko here...
So would the uuid and optional name assignment be done in the output 
plugin or some else?
When/how does the uuid get generated and where do we store it so the 
same uuid gets returned when postgres restarts.  Slony today stores all 
this type of stuff in user-level tables and user-level functions 
(because it has no other choice).    What is the connection between 
these values and the 'slot-id' in your proposal for the init arguments? 
Does the slot-id need to be the external uuid of the other end or is 
there no direct connection?

Today slony allows us to replicate between two databases in the same 
postgresql cluster (I use this for testing all the time)
Slony also allows for two different 'slony clusters' to be setup in the 
same database (or so I'm told, I don't think I have ever tried this myself).

plugin functions that let me query the local database and then return 
the  uuid and origin_name would work in this model.

+1 on being able to mark the 'change origin' in a SET command when the 
replication process is pushing data into the replica.

>> Exactly how we do this filtering is an open question,  I think the output
>> plugin will at a minimum need to know:
>>
>> a) What the slony node id is of the node it is running on.  This is easy to
>> figure out if the output plugin is able/allowed to query its database.  Will
>> this be possible? I would expect to be able to query the database as it
>> exists now(at plugin invocation time) not as it existed in the past when the
>> WAL was generated.   In addition to the node ID I can see us wanting to be
>> able to query other slony tables (sl_table,sl_set etc...)
> Hm. There is no fundamental reason not to allow normal database access
> to the current database but it won't be all that cheap, so doing it
> frequently is not a good idea.
> The reason its not cheap is that you basically need to teardown the
> postgres internal caches if you switch the timestream in which you are
> working.
>
> Would go something like:
>
> TransactionContext = AllocSetCreate(...);
> RevertFromDecodingSnapshot();
> InvalidateSystemCaches();
> StartTransactionCommand();
> /* do database work */
> CommitTransactionCommand();
> /* cleanup memory*/
> SetupDecodingSnapshot(snapshot, data);
> InvalidateSystemCaches();
>
> Why do you need to be able to query the present? I thought it might be
> neccesary to allow additional tables be accessed in a timetraveling
> manner, but not this way round.
> I guess an initial round of querying during plugin initialization won't
> be good enough?

For example my output plugin would want the list of replicated tables 
(or the list of tables replicated to a particular replica). This list 
can change over time.  As administrators issue commands to add or remove 
tables to replication or otherwise reshape the cluster the output plugin 
will need to know about this.  I MIGHT be able to get away with having 
slon disconnect and reconnect on reconfiguration events so only the 
init() call would need this data, but I am not sure.

One of the ways slony allows you to shoot your foot off is by changing 
certain configuration things (like dropping a table from a set) while a 
subscription is in progress.   Being able to timetravel the slony 
configuration tables might make this type of foot-gun a lot harder to 
encounter but that might be asking for too much.




>> b) What the slony node id is of the node we are streaming too.   It would be
>> nice if we could pass extra, arbitrary data/parameters to the output plugins
>> that could include that, or other things.  At the moment the
>> start_logical_replication rule in repl_gram.y doesn't allow for that but I
>> don't see why we couldn't make it do so.
> Yes, I think we want something like that. I even asked input on that
> recently ;):
> http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
>
> Input welcome!

How flexible will the datatypes for the arguments be? If I wanted to 
pass in a list of tables (ie an array?) could I?
Above I talked about having the init() or change() methods query the 
local database.  Another option might be to make the slon build up this 
data (by querying the database over a normal psql connection) and just 
passing the data in.   However that might mean passing in a list of a 
few thousand table names, which doesn't sound like a good idea.

>
>> Even though, from a data-correctness point of view, slony could commit the
>> transaction on the replica after it sees the t1 commit, we won't want it to
>> do commits other than on a SYNC boundary.  This means that the replicas will
>> continue to move between consistent SYNC snapshots and that we can still
>> track the state/progress of replication by knowing what events (SYNC or
>> otherwise) have been confirmed.
> I don't know enough about slony internals, but: why? This will prohibit
> you from ever doing (per-transaction) synchronous replication...

A lot of this has to do with the stuff I discuss in the section below on 
cluster reshaping that you didn't understand.  Slony depends on knowing 
what data has , or hasn't been sent to a replica at a particular event 
id.  If 'some' transactions in between two SYNC events have committed 
but not others then slony has no idea what data it needs to get 
elsewhere on a FAILOVER type event.  There might be a way to make this 
work otherwise but I'm not sure what that is and how long it will take 
to debug out the issues.

> Cool! Don't hesitate to mention anything that you think would make you 
> life easier, chances are that youre not the only one who could benefit 
> from it... Thanks, Andres

Re: logical changeset generation v3

From

Andres Freund

Date:

20 November 2012, 11:22:15

On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > I am just looking at this patch and will provide some comments.
> > > By the way, you forgot the installation part of pg_receivellog, please see
> > > patch attached.
> >
> > That actually was somewhat intended, I thought people wouldn't like the
> > name and I didn't want a binary that's going to be replaced anyway lying
> > around ;)
> >
> OK no problem. For sure this is going to happen, I was wondering myself if
> it could be possible to merge pg_receivexlog and pg_receivellog into a
> single utility with multiple modes :)

Don't really see that, the differences already are significant and imo
are bound to get bigger. Shouldn't live in pg_basebackup/ either..

> Btw, here are some extra comments based on my progress, hope it will be
> useful for other people playing around with your patches.
> 1) Necessary to install the contrib module test_decoding on server side or
> the test case will not work.
> 2) Obtention of the following logs on server:
> LOG:  forced to assume catalog changes for xid 1370 because it was running
> to early
> WARNING:  ABORT 1370
> Actually I saw that there are many warnings like this.

Those aren't unexpected. Perhaps I should not make it a warning then...

A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed which
changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot so we just assume all committed
ones do - it just costs a bit of memory. Thats the background of the
"forced to assume catalog changes for ..." message.

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.

Makes sense?

> 3) Assertion failure while running pgbench, I was  just curious to see how
> it reacted when logical replication was put under a little bit of load.
> TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
> Line: 877)
> => pgbench -i postgres; pgbench -T 500 -c 8 postgres

Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.

It implies that snapstate->nrrunning has lost touch with reality...

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3 - Source for Slony

From

Andres Freund

Date:

20 November 2012, 11:44:37

Hi,

On 2012-11-19 19:50:32 -0500, Steve Singer wrote:
> On 12-11-18 11:07 AM, Andres Freund wrote:
> >I think we should provide some glue code to do this, otherwise people
> >will start replicating all the bugs I hacked into this... More
> >seriously: I think we should have support code here, no user will want
> >to learn the intracacies of feedback messages and such. Where that would
> >live? No idea.

> libpglogicalrep.so ?

Yea. We don't really have the infrastructure for that yet
though... Robert and me were just talking about that recently...


> >I wholeheartedly aggree. It should also be cleaned up a fair bit before
> >others copy it should we not go for having some client side library.
> >
> >Imo the library could very roughly be something like:
> >
> >state = SetupStreamingLLog(replication-slot, ...);
> >while((message = StreamingLLogNextMessage(state))
> >{
> >      write(outfd, message->data, message->length);
> >      if (received_100_messages)
> >      {
> >           fsync(outfd);
> >           StreamingLLogConfirm(message);
> >      }
> >}
> >
> >Although I guess thats not good enough because StreamingLLogNextMessage
> >would be blocking, but that shouldn't be too hard to work around.
> >
>
> How about we pass a timeout value to StreamingLLogNextMessage (..) where it
> returns if no data is available after the timeout to give the caller a
> chance to do something else.

Doesn't really integrate into the sort of loop thats often built around
poll(2), select(2) and similar. It probably should return NULL if
there's nothing there yet and we should have a
StreamingLLogWaitForMessage() or such.

> >>This is basically the Slony 2.2 sl_log format minus a few columns we no
> >>longer need (txid, actionseq).
> >>command_args is a postgresql text array of column=value pairs.  Ie [
> >>{id=1},{name='steve'},{project='slony'}]
> >It seems to me that that makes escaping unneccesarily complicated, but
> >given you already have all the code... ;)
>
> When I look at the actual code/representation we picked it is closer to
> {column1,value1,column2,value2...}

Still means you need to escape and later pasrse columnN, valueN
values. I would have expected something like (length:data, length:data)+

> >>I don't t think our output plugin will be much more complicated than the
> >>test_decoding plugin.
> >Good. Thats the idea ;). Are you ok with the interface as it is now or
> >would you like to change something?
>
> I'm going to think about this some more and maybe try to write an example
> plugin before I can say anything with confidence.

That would be very good.

> >Yes. We will also need something like that. If you remember the first
> >prototype we sent to the list, it included the concept of an
> >'origin_node' in wal record. I think you actually reviewed that one ;)
> >
> >That was exactly aimed at something like this...
> >
> >Since then my thoughts about how the origin_id looks like have changed a
> >bit:
> >- origin id is internally still represented as an uint32/Oid
> >   - never visible outside of wal/system catalogs
> >- externally visible it gets
> >   - assigned an uuid
> >   - optionally assigned a user defined name
> >- user settable (permissions?) origin when executing sql:
> >   - SET change_origin_uuid = 'uuid';
> >   - SET change_origin_name = 'user-settable-name';
> >   - defaults to the local node
> >- decoding callbacks get passed the origin of a change
> >   - txn->{origin_uuid, origin_name, origin_internal?}
> >- the init decoding callback can setup an array of interesting origins,
> >   so the others don't even get the ReorderBuffer treatment
> >
> >I have to thank the discussion on -hackers and a march through prague
> >with Marko here...

> So would the uuid and optional name assignment be done in the output plugin
> or some else?

That would be postgres infrastructure. The output plugin would get
passed at least uuid and name and potentially the internal name as well
(might be useful to build some internal caching of information).

> When/how does the uuid get generated and where do we store it so the same
> uuid gets returned when postgres restarts.  Slony today stores all this type
> of stuff in user-level tables and user-level functions (because it has no
> other choice).

Would need to be its own system catalog.

> What is the connection between these values and the
> 'slot-id' in your proposal for the init arguments? Does the slot-id need to
> be the external uuid of the other end or is there no direct connection?

None really. The "slot-id" really is only an identifier for a
replication connection (which should live longer than a single
postmaster run) which contains information about the point up to which
you replicated. We need to manage some local resources based on that.

> Today slony allows us to replicate between two databases in the same
> postgresql cluster (I use this for testing all the time)
> Slony also allows for two different 'slony clusters' to be setup in the same
> database (or so I'm told, I don't think I have ever tried this myself).

Yuck. I haven't thought about this very much. I honestly don't see
support for the first case right now. The second shouldn't be too hard,
we already have the database oid available everywhere we need it.

> plugin functions that let me query the local database and then return the
> uuid and origin_name would work in this model.

Should be possible.

> +1 on being able to mark the 'change origin' in a SET command when the
> replication process is pushing data into the replica.

Good.

>
> >>Exactly how we do this filtering is an open question,  I think the output
> >>plugin will at a minimum need to know:
> >>
> >>a) What the slony node id is of the node it is running on.  This is easy to
> >>figure out if the output plugin is able/allowed to query its database.  Will
> >>this be possible? I would expect to be able to query the database as it
> >>exists now(at plugin invocation time) not as it existed in the past when the
> >>WAL was generated.   In addition to the node ID I can see us wanting to be
> >>able to query other slony tables (sl_table,sl_set etc...)
> >Hm. There is no fundamental reason not to allow normal database access
> >to the current database but it won't be all that cheap, so doing it
> >frequently is not a good idea.
> >The reason its not cheap is that you basically need to teardown the
> >postgres internal caches if you switch the timestream in which you are
> >working.
> >
> >Would go something like:
> >
> >TransactionContext = AllocSetCreate(...);
> >RevertFromDecodingSnapshot();
> >InvalidateSystemCaches();
> >StartTransactionCommand();
> >/* do database work */
> >CommitTransactionCommand();
> >/* cleanup memory*/
> >SetupDecodingSnapshot(snapshot, data);
> >InvalidateSystemCaches();
> >
> >Why do you need to be able to query the present? I thought it might be
> >neccesary to allow additional tables be accessed in a timetraveling
> >manner, but not this way round.
> >I guess an initial round of querying during plugin initialization won't
> >be good enough?
>
> For example my output plugin would want the list of replicated tables (or
> the list of tables replicated to a particular replica). This list can change
> over time.  As administrators issue commands to add or remove tables to
> replication or otherwise reshape the cluster the output plugin will need to
> know about this.  I MIGHT be able to get away with having slon disconnect
> and reconnect on reconfiguration events so only the init() call would need
> this data, but I am not sure.
>
> One of the ways slony allows you to shoot your foot off is by changing
> certain configuration things (like dropping a table from a set) while a
> subscription is in progress.   Being able to timetravel the slony
> configuration tables might make this type of foot-gun a lot harder to
> encounter but that might be asking for too much.

Actually timetravel access to those tables is considerably
easier/faster. I wanted to provide such tables anyway (because you need
them to safely write your own pg_enum alike types). It means that you
log slightly (32 + sizeof(XLogRecord) afair) more per modified row.

> >>b) What the slony node id is of the node we are streaming too.   It would be
> >>nice if we could pass extra, arbitrary data/parameters to the output plugins
> >>that could include that, or other things.  At the moment the
> >>start_logical_replication rule in repl_gram.y doesn't allow for that but I
> >>don't see why we couldn't make it do so.
> >Yes, I think we want something like that. I even asked input on that
> >recently ;):
> >http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
> >
> >Input welcome!
>
> How flexible will the datatypes for the arguments be? If I wanted to pass in
> a list of tables (ie an array?) could I?

I was thinking of just a textual (key = value, ...) style list, similar
to options to EXPLAIN, COPY et al.

> Above I talked about having the init() or change() methods query the local
> database.  Another option might be to make the slon build up this data (by
> querying the database over a normal psql connection) and just passing the
> data in.   However that might mean passing in a list of a few thousand table
> names, which doesn't sound like a good idea.

No, it certainly doesn't.

> >
> >>Even though, from a data-correctness point of view, slony could commit the
> >>transaction on the replica after it sees the t1 commit, we won't want it to
> >>do commits other than on a SYNC boundary.  This means that the replicas will
> >>continue to move between consistent SYNC snapshots and that we can still
> >>track the state/progress of replication by knowing what events (SYNC or
> >>otherwise) have been confirmed.
> >I don't know enough about slony internals, but: why? This will prohibit
> >you from ever doing (per-transaction) synchronous replication...
>
> A lot of this has to do with the stuff I discuss in the section below on
> cluster reshaping that you didn't understand.  Slony depends on knowing what
> data has , or hasn't been sent to a replica at a particular event id.  If
> 'some' transactions in between two SYNC events have committed but not others
> then slony has no idea what data it needs to get elsewhere on a FAILOVER
> type event.  There might be a way to make this work otherwise but I'm not
> sure what that is and how long it will take to debug out the issues.

Ah, it starts to make sense.

The way I solved that issue in the prototype from arround pgcon was that
I included the LSN from the original commit record in the remote
transaction into the commit record of the local transaction (with the
origin_id set to the remote side). That allowed to trivially restore the
exact state of replication after a crash even with
synchronous_commit=off as during replay you could simply ensure the
replication-surely-received-lsn of every remote side was up to date.
Then you can simply do a START_LOGICAL_REPLICATION 'slot'
just-recovered/lsn; and restart applying (*not*
INIT_LOGICAL_REPLICATION).


Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Michael Paquier

Date:

21 November 2012, 05:57:13

On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Those aren't unexpected. Perhaps I should not make it a warning then...

A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed which
changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot so we just assume all committed
ones do - it just costs a bit of memory. Thats the background of the
"forced to assume catalog changes for ..." message.

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.

Makes sense?

> 3) Assertion failure while running pgbench, I was just curious to see how
> it reacted when logical replication was put under a little bit of load.
> TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
> Line: 877)
> => pgbench -i postgres; pgbench -T 500 -c 8 postgres

Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.

It implies that snapstate->nrrunning has lost touch with reality...

Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't outputted anything in the logs, but here is the backtrace of the core file produced.
#2 0x0000000000865145 in ExceptionalCondition (conditionName=0xa15100 "!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", errorType=0xa14f3b "FailedAssertion",
fileName=0xa14ed0 "snapbuild.c", lineNumber=877) at assert.c:54
#3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10, xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80, snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
#6 0x00000000007080b9 in DecodeRecordIntoReorderBuffer (reader=0x1a08300, state=0x19e4e20, buf=0x1a0a368) at decode.c:84
#7 0x0000000000708cad in replay_finished_record (state=0x1a08300, buf=0x1a0a368) at logicalfuncs.c:54
#8 0x00000000004d8033 in XLogReaderRead (state=0x1a08300) at xlogreader.c:965
#9 0x000000000070f7c3 in XLogSendLogical (caughtup=0x7fffb22c35fb "") at walsender.c:1721
#10 0x000000000070ea05 in WalSndLoop (send_data=0x70f6e2 <XLogSendLogical>) at walsender.c:1184
#11 0x000000000070e0eb in StartLogicalReplication (cmd=0x190eaa8) at walsender.c:726
#12 0x000000000070e3ac in exec_replication_command (cmd_string=0x19a65c8 "START_LOGICAL_REPLICATION 'id-0' 0/7E1855C") at walsender.c:853
#13 0x0000000000753ee0 in PostgresMain (argc=2, argv=0x18f63d8, username=0x18f62a8 "michael") at postgres.c:3974
#14 0x00000000006f13ea in BackendRun (port=0x1912600) at postmaster.c:3668
#15 0x00000000006f0b76 in BackendStartup (port=0x1912600) at postmaster.c:3352
#16 0x00000000006ed900 in ServerLoop () at postmaster.c:1431
#17 0x00000000006ed208 in PostmasterMain (argc=13, argv=0x18f40a0) at postmaster.c:1180
#18 0x0000000000657517 in main (argc=13, argv=0x18f40a0) at main.c:197
I'm keeping this core and the binary btw.

--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From

Michael Paquier

Date:

21 November 2012, 06:28:36

On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > I am just looking at this patch and will provide some comments.
> > > By the way, you forgot the installation part of pg_receivellog, please see
> > > patch attached.
> >
> > That actually was somewhat intended, I thought people wouldn't like the
> > name and I didn't want a binary that's going to be replaced anyway lying
> > around ;)
> >
> OK no problem. For sure this is going to happen, I was wondering myself if
> it could be possible to merge pg_receivexlog and pg_receivellog into a
> single utility with multiple modes :)

Don't really see that, the differences already are significant and imo
are bound to get bigger. Shouldn't live in pg_basebackup/ either...

I am sure that this will be the object of many future discussions.

> Btw, here are some extra comments based on my progress, hope it will be
> useful for other people playing around with your patches.
> 1) Necessary to install the contrib module test_decoding on server side or
> the test case will not work.
> 2) Obtention of the following logs on server:
> LOG: forced to assume catalog changes for xid 1370 because it was running
> to early
> WARNING: ABORT 1370
> Actually I saw that there are many warnings like this.

Those aren't unexpected. Perhaps I should not make it a warning then...

A NOTICE would be more adapted, a WARNING means that something that may endanger the system has happened, but as far as I understand from your explanation this is not the case.

A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed which
changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot so we just assume all committed
ones do - it just costs a bit of memory. Thats the background of the
"forced to assume catalog changes for ..." message.

OK, so this snapshot only needs to include the XIDs of transactions that have modified the catalogs. Do I get it right? This way you are able to fetch the correct relation definition for replication decoding.

Just thinking but... It looks to be a waste to store the transactions XIDs of all the committed transactions, but on the other hand there is no way to track the XIDs of transactions that modified a catalog in current core code. So yes this approach is better as refining the transaction XID tracking for snapshot reconstruction is something that could be improved later. Those are only thoughts though...

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.

Just to be clear, lsn means the log-sequence number associated to each xlog record?

Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.

Makes sense?

OK got it thanks for your explanation.

So, once again coming to it, we need in the snapshot built only the XIDs of transactions that modified the catalogs to get a consistent view of relation info for decoding.
Really, I think that refining the XID tracking to minimize the size of the snapshot built for decoding would be really a key for performance improvement especially for OLTP-type applications (lots of transactions involved, few of them involving catalogs).
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From

Andres Freund

Date:

21 November 2012, 07:30:07

On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > Btw, here are some extra comments based on my progress, hope it will be
> > > useful for other people playing around with your patches.
> > > 1) Necessary to install the contrib module test_decoding on server side
> > or
> > > the test case will not work.
> > > 2) Obtention of the following logs on server:
> > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > running
> > > to early
> > > WARNING:  ABORT 1370
> > > Actually I saw that there are many warnings like this.
> >
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> A NOTICE would be more adapted, a WARNING means that something that may
> endanger the system has happened, but as far as I understand from your
> explanation this is not the case.

I think it should go DEBUG2 or so once were a bit more confident about
the code.

> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed which
> > changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> OK, so this snapshot only needs to include the XIDs of transactions that
> have modified the catalogs. Do I get it right? This way you are able to
> fetch the correct relation definition for replication decoding.

Yes. We only carry those between (recenXmin, newestCatalogModifyingTxn),
so its not all of them. Normal snapshots carry all in-progress
transactionids instead of the committed ones, but that would have been
far more in our case (only a minority of txn's touch the catalog) and it
has problems with subtransaction tracking.

> Just thinking but... It looks to be a waste to store the transactions XIDs
> of all the committed transactions, but on the other hand there is no way to
> track the XIDs of transactions that modified a catalog in current core
> code. So yes this approach is better as refining the transaction XID
> tracking for snapshot reconstruction is something that could be improved
> later. Those are only thoughts though...

We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.

> The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> >
> Just to be clear, lsn means the log-sequence number associated to each xlog
> record?

Yes. And that number is just the position in the stream.

Greetings,

Andres

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Andres Freund

Date:

21 November 2012, 07:31:19

On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed which
> > changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench, I was  just curious to see
> > how
> > > it reacted when logical replication was put under a little bit of load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > "snapbuild.c",
> > > Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> 
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
> outputted anything in the logs, but here is the backtrace of the core file
> produced.

Could you run it with log_level=DEBUG2?

Do you run pgbench after youve reached a consistent state (by issuing a
manual checkpoint)?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Michael Paquier

Date:

21 November 2012, 07:34:46

On Wed, Nov 21, 2012 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
> outputted anything in the logs, but here is the backtrace of the core file
> produced.

Could you run it with log_level=DEBUG2?

Let me try.

Do you run pgbench after youve reached a consistent state (by issuing a
manual checkpoint)?

Yes. I issue a manual checkpoint to initialize the replication.

--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From

Andres Freund

Date:

21 November 2012, 07:34:48

On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed which
> > changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench, I was  just curious to see
> > how
> > > it reacted when logical replication was put under a little bit of load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > "snapbuild.c",
> > > Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> 
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
> outputted anything in the logs, but here is the backtrace of the core file
> produced.

Ah, I see. Could you try the following diff?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index df24b33..797a126 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
Snapstate *snapstate,                */               snapstate->transactions_after = buf->origptr;
+               snapstate->nrrunning = running->xcnt;               snapstate->xmin_running = InvalidTransactionId;
         snapstate->xmax_running = InvalidTransactionId;
 


Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Michael Paquier

Date:

21 November 2012, 07:47:18

On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > Btw, here are some extra comments based on my progress, hope it will be
> > > useful for other people playing around with your patches.
> > > 1) Necessary to install the contrib module test_decoding on server side
> > or
> > > the test case will not work.
> > > 2) Obtention of the following logs on server:
> > > LOG: forced to assume catalog changes for xid 1370 because it was
> > running
> > > to early
> > > WARNING: ABORT 1370
> > > Actually I saw that there are many warnings like this.
> >
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> A NOTICE would be more adapted, a WARNING means that something that may
> endanger the system has happened, but as far as I understand from your
> explanation this is not the case.

I think it should go DEBUG2 or so once were a bit more confident about
the code.

> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed which
> > changed the catalog. Unfortunately we can't diagnose whether a txn
> > changed the catalog without a snapshot so we just assume all committed
> > ones do - it just costs a bit of memory. Thats the background of the
> > "forced to assume catalog changes for ..." message.
> >
> OK, so this snapshot only needs to include the XIDs of transactions that
> have modified the catalogs. Do I get it right? This way you are able to
> fetch the correct relation definition for replication decoding.

Yes. We only carry those between (recenXmin, newestCatalogModifyingTxn),
so its not all of them. Normal snapshots carry all in-progress
transactionids instead of the committed ones, but that would have been
far more in our case (only a minority of txn's touch the catalog) and it
has problems with subtransaction tracking.

Hum. I might have missed something but what is the variable tracking the newest XID that modified catalogs.
I can see of course recentXmin in snapmgr.c but nothing related to what you describe.

> Just thinking but... It looks to be a waste to store the transactions XIDs
> of all the committed transactions, but on the other hand there is no way to
> track the XIDs of transactions that modified a catalog in current core
> code. So yes this approach is better as refining the transaction XID
> tracking for snapshot reconstruction is something that could be improved
> later. Those are only thoughts though...

We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.

How to you track them? I think I need to go deeper in the code before asking more...
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From

Andres Freund

Date:

21 November 2012, 07:58:59

On 2012-11-21 16:47:11 +0900, Michael Paquier wrote:
> On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> > > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com
> > >wrote:
> > >
> > > > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > > > Btw, here are some extra comments based on my progress, hope it will
> > be
> > > > > useful for other people playing around with your patches.
> > > > > 1) Necessary to install the contrib module test_decoding on server
> > side
> > > > or
> > > > > the test case will not work.
> > > > > 2) Obtention of the following logs on server:
> > > > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > > > running
> > > > > to early
> > > > > WARNING:  ABORT 1370
> > > > > Actually I saw that there are many warnings like this.
> > > >
> > > > Those aren't unexpected. Perhaps I should not make it a warning then...
> > > >
> > > A NOTICE would be more adapted, a WARNING means that something that may
> > > endanger the system has happened, but as far as I understand from your
> > > explanation this is not the case.
> >
> > I think it should go DEBUG2 or so once were a bit more confident about
> > the code.
> >
> > > > A short explanation:
> > > >
> > > > We can only decode tuples we see in the WAL when we already have a
> > > > timetravel catalog snapshot before that transaction started. To build
> > > > such a snapshot we need to collect information about committed which
> > > > changed the catalog. Unfortunately we can't diagnose whether a txn
> > > > changed the catalog without a snapshot so we just assume all committed
> > > > ones do - it just costs a bit of memory. Thats the background of the
> > > > "forced to assume catalog changes for ..." message.
> > > >
> > > OK, so this snapshot only needs to include the XIDs of transactions that
> > > have modified the catalogs. Do I get it right? This way you are able to
> > > fetch the correct relation definition for replication decoding.
> >
> > Yes. We only carry those between (recenXmin, newestCatalogModifyingTxn),
> > so its not all of them. Normal snapshots carry all in-progress
> > transactionids instead of the committed ones, but that would have been
> > far more in our case (only a minority of txn's touch the catalog) and it
> > has problems with subtransaction tracking.
> >
> Hum. I might have missed something but what is the variable tracking the
> newest XID that modified catalogs.
> I can see of course recentXmin in snapmgr.c but nothing related to what you
> describe.

We determine that ourselves.

SnapBuildCommitTxn(Snapstate *snapstate, ReorderBuffer *reorder,               XLogRecPtr lsn, TransactionId xid,
       int nsubxacts, TransactionId *subxacts)
 
{
...if (forced_timetravel || top_does_timetravel || sub_does_timetravel){    if (!TransactionIdIsValid(snapstate->xmax)
||       NormalTransactionIdFollows(xid, snapstate->xmax))    {        snapstate->xmax = xid;
TransactionIdAdvance(snapstate->xmax);   }
 

> > > Just thinking but... It looks to be a waste to store the transactions
> > XIDs
> > > of all the committed transactions, but on the other hand there is no way
> > to
> > > track the XIDs of transactions that modified a catalog in current core
> > > code. So yes this approach is better as refining the transaction XID
> > > tracking for snapshot reconstruction is something that could be improved
> > > later. Those are only thoughts though...
> >
> > We actually only track xids of catalog modifying transactions once we
> > hit the CONSISTENT state. Before the initial snapshot we can't detect
> > that.
> >
> How to you track them? I think I need to go deeper in the code before
> asking more...

You mean, how do I detect they are catalog modifying? By asking the
reorderbuffer (ReorderBufferXidDoesTimetravel(...)). That one knows
because we told him so (ReorderBufferXidSetTimetravel()) and we do that
by looking at the type of xid records we've seen incoming (HEAP_INPLACE,
HEAP2_NEW_CID tell us its doing timetravel).

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Michael Paquier

Date:

21 November 2012, 09:35:36

<br /><br /><div class="gmail_quote">On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <span dir="ltr"><<a
href="mailto:andres@2ndquadrant.com"target="_blank">andres@2ndquadrant.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="h5">On 2012-11-21
14:57:08+0900, Michael Paquier wrote:<br /><br /></div>Ah, I see. Could you try the following diff?<br /><br /> diff
--gita/src/backend/replication/logical/snapbuild.c<br /> b/src/backend/replication/logical/snapbuild.c<br /> index
df24b33..797a126100644<br /> --- a/src/backend/replication/logical/snapbuild.c<br /> +++
b/src/backend/replication/logical/snapbuild.c<br/> @@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer
*reorder,<br/> Snapstate *snapstate,<br />                  */<br />                 snapstate->transactions_after =
buf->origptr;<br/><br /> +               snapstate->nrrunning = running->xcnt;<br />                
snapstate->xmin_running= InvalidTransactionId;<br />                 snapstate->xmax_running =
InvalidTransactionId;<br/></blockquote></div>I am still getting the same assertion failure even with this diff
included.<br/>-- <br />Michael Paquier<br /><a href="http://michael.otacoo.com"
target="_blank">http://michael.otacoo.com</a><br/>

Re: [PATCH 03/14] Add simple xlogdump tool

From

Jeff Janes

Date:

21 November 2012, 22:57:18

On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
>> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > ---
>> >  src/bin/Makefile            |   2 +-
>> >  src/bin/xlogdump/Makefile   |  25 +++
>> >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
>> >  3 files changed, 494 insertions(+), 1 deletion(-)
>> >  create mode 100644 src/bin/xlogdump/Makefile
>> >  create mode 100644 src/bin/xlogdump/xlogdump.c
>>
>> Is this intended to be the successor of
>> https://github.com/snaga/xlogdump which will then be deprecated?
>
> As-is this is just a development tool which was sorely needed for the
> development of this patchset. But yes I think that once ready
> (xlogreader infrastructure, *_desc routines splitted) it should
> definitely be able to do most of what the above xlogdump can do and it
> should live in bin/. I think mostly some filtering is missing.
>
> That doesn't really "deprecate" the above though.
>
> Does that answer your question?

Yes, I think so.  Thanks.

(I've just recently gotten the original xlogdump to work for me in
9.2, and I had been wonder if back-porting yours to 9.2 would have
been an easier way to go.)

Cheers,

Jeff

Re: [PATCH 03/14] Add simple xlogdump tool

From

Andres Freund

Date:

21 November 2012, 23:10:35

On 2012-11-21 14:57:14 -0800, Jeff Janes wrote:
> On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
> >> On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > ---
> >> >  src/bin/Makefile            |   2 +-
> >> >  src/bin/xlogdump/Makefile   |  25 +++
> >> >  src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
> >> >  3 files changed, 494 insertions(+), 1 deletion(-)
> >> >  create mode 100644 src/bin/xlogdump/Makefile
> >> >  create mode 100644 src/bin/xlogdump/xlogdump.c
> >>
> >> Is this intended to be the successor of
> >> https://github.com/snaga/xlogdump which will then be deprecated?
> >
> > As-is this is just a development tool which was sorely needed for the
> > development of this patchset. But yes I think that once ready
> > (xlogreader infrastructure, *_desc routines splitted) it should
> > definitely be able to do most of what the above xlogdump can do and it
> > should live in bin/. I think mostly some filtering is missing.
> >
> > That doesn't really "deprecate" the above though.
> >
> > Does that answer your question?
> 
> Yes, I think so.  Thanks.
> 
> (I've just recently gotten the original xlogdump to work for me in
> 9.2, and I had been wonder if back-porting yours to 9.2 would have
> been an easier way to go.)

I don't think you would have much fun doing so - the WAL format changes
between 9.2 and 9.3 make this larger than one might think. I had a
version that worked with the previous format but there have been some
interface changes since then...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Andres Freund

Date:

21 November 2012, 23:25:59

On 2012-11-21 18:35:34 +0900, Michael Paquier wrote:
> On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> >
> > Ah, I see. Could you try the following diff?
> >
> > diff --git a/src/backend/replication/logical/snapbuild.c
> > b/src/backend/replication/logical/snapbuild.c
> > index df24b33..797a126 100644
> > --- a/src/backend/replication/logical/snapbuild.c
> > +++ b/src/backend/replication/logical/snapbuild.c
> > @@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
> > Snapstate *snapstate,
> >                  */
> >                 snapstate->transactions_after = buf->origptr;
> >
> > +               snapstate->nrrunning = running->xcnt;
> >                 snapstate->xmin_running = InvalidTransactionId;
> >                 snapstate->xmax_running = InvalidTransactionId;
> >
> I am still getting the same assertion failure even with this diff included.

I really don't understand whats going on here then. Youve said you made
sure that there is a catalog snapshot. Which means you would need
something like:
WARNING:  connecting to postgres
WARNING:  Initiating logical rep
LOG:  computed new xmin: 16566894
LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG:  found initial snapshot (via running xacts). Done: 1
WARNING:  reached consistent point, stopping!
WARNING:  Starting logical replication
LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG:  found initial snapshot (via running xacts). Done: 1

in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace youve posted:

#3  0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10,xid=0) at snapbuild.c:877
#4  0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368,
relfilenode=0x1a0a450)at snapbuild.c:388

#5  0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at
snapbuild.c:732

shows pretty clearly that snapstate *can't* be consistent because line 387ff is:     else if (snapstate->state <
SNAPBUILD_CONSISTENT&&              SnapBuildTxnIsRunning(snapstate, xid))    ;

so #3 #4 can't happen at those line numbers with state == CONSISTENT.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Michael Paquier

Date:

22 November 2012, 00:13:34

On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I really don't understand whats going on here then. Youve said you made
sure that there is a catalog snapshot. Which means you would need
something like:
WARNING: connecting to postgres
WARNING: Initiating logical rep
LOG: computed new xmin: 16566894
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1

in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace youve posted:

#3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10,xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732

shows pretty clearly that snapstate *can't* be consistent because line 387ff is:
else if (snapstate->state < SNAPBUILD_CONSISTENT &&
SnapBuildTxnIsRunning(snapstate, xid))
;
so #3 #4 can't happen at those line numbers with state == CONSISTENT.

Still this *impossible* thing happens.
Here are some more information on the logs I get on server side:

Yes I have the logical replication correctly initialized:
[629 0] LOG: database system was shut down at 2012-11-22 09:02:42 JST
[628 0] LOG: database system is ready to accept connections
[633 0] LOG: autovacuum launcher started
[648 0] WARNING: connecting to postgres
[648 0] WARNING: Initiating logical rep
[648 0] LOG: computed new xmin: 684
[648 0] LOG: start reading from 0/178C1B8, scrolled back to 0/178C000

And I am also getting logs of this type with pg_receivellog:
BEGIN 698
table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8 abalance[int4]:-3641 filler[bpchar]:
table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10 tbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559 delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651 filler[bpchar]:(null)
COMMIT 698

Until the assertion failure:
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c", Line: 878)

I still have the core file and its binary at hand if you want, so can send them at will.
I have not been able to read your code yet, but there should be something you are missing.

Thanks,
--
Michael Paquier
http://michael.otacoo.com

Re: logical changeset generation v3

From

Andres Freund

Date:

22 November 2012, 00:48:55

On 2012-11-22 09:13:30 +0900, Michael Paquier wrote:
> On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > I really don't understand whats going on here then. Youve said you made
> > sure that there is a catalog snapshot. Which means you would need
> > something like:
> > WARNING:  connecting to postgres
> > WARNING:  Initiating logical rep
> > LOG:  computed new xmin: 16566894
> > LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
> > LOG:  found initial snapshot (via running xacts). Done: 1
> > WARNING:  reached consistent point, stopping!
> > WARNING:  Starting logical replication
> > LOG:  start reading from 3/E62457C0, scrolled back to 3/E6244000
> > LOG:  found initial snapshot (via running xacts). Done: 1
> >
> > in the log *and* it means that snapbuild->state has to be
> > CONSISTENT. But the backtrace youve posted:
> >
> > #3  0x000000000070c409 in SnapBuildTxnIsRunning
> > (snapstate=0x19e4f10,xid=0) at snapbuild.c:877
> > #4  0x000000000070b8e4 in SnapBuildProcessChange
> > (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368,
> > relfilenode=0x1a0a450) at snapbuild.c:388
> > #5  0x000000000070c088 in SnapBuildDecodeCallback
> > (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
> >
> > shows pretty clearly that snapstate *can't* be consistent because line
> > 387ff is:
> >       else if (snapstate->state < SNAPBUILD_CONSISTENT &&
> >                SnapBuildTxnIsRunning(snapstate, xid))
> >                 ;
> > so #3 #4 can't happen at those line numbers with state == CONSISTENT.
> >
> Still this *impossible* thing happens.
> Here are some more information on the logs I get on server side:
>
> Yes I have the logical replication correctly initialized:
> [629 0] LOG:  database system was shut down at 2012-11-22 09:02:42 JST
> [628 0] LOG:  database system is ready to accept connections
> [633 0] LOG:  autovacuum launcher started
> [648 0] WARNING:  connecting to postgres
> [648 0] WARNING:  Initiating logical rep
> [648 0] LOG:  computed new xmin: 684
> [648 0] LOG:  start reading from 0/178C1B8, scrolled back to 0/178C000

Ok, so youve not yet reached a consistent point.

Which means this shouldn't yet be written out:

> And I am also getting logs of this type with pg_receivellog:
> BEGIN 698
> table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8
> abalance[int4]:-3641 filler[bpchar]:
> table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10
> tbalance[int4]:-3641 filler[bpchar]:(null)
> table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641
> filler[bpchar]:(null)
> table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559
> delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651
> filler[bpchar]:(null)
> COMMIT 698

that could already be good enough of a hint, let me check tomorrow.

> I still have the core file and its binary at hand if you want, so can send
> them at will.

If those aren't too big, its worth a try...

Greetings,

Andres

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Alvaro Herrera

Date:

26 November 2012, 19:13:53

This is very much the same as the previous patch, except it has been
rebased to the latest master.


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

xlogreader-heikki-3.patch

Re: [PATCH 08/14] Store the number of subtransactions in xl_running_xacts separately from toplevel xids

From

Simon Riggs

Date:

02 December 2012, 19:55:16

On 15 November 2012 12:07, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
>
>> To avoid complicating logic we store both, the toplevel and the subxids, in
>> ->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.
>
> That looks good, not much change. Will apply in next few days. Please
> add me as committer and mark ready.

I tried improving this, but couldn't. So I've committed it as is.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From

Steve Singer

Date:

03 December 2012, 02:52:20

<div class="moz-cite-prefix">On 12-11-14 08:17 PM, Andres Freund wrote:<br /></div><br /> I am getting errors like the
followingwhen I try to use either your test_decoding plugin or my own (which does even less than yours)<br /><br /><br
/>LOG:  database system is ready to accept connections<br /> LOG:  autovacuum launcher started<br /> WARNING: 
connectingto <br /> WARNING:  Initiating logical rep<br /> LOG:  computed new xmin: 773<br /> LOG:  start reading from
0/17F5D58,scrolled back to 0/17F4000<br /> LOG:  got new xmin 773 at 25124280<br /> LOG:  found initial snapshot (via
runningxacts). Done: 1<br /> WARNING:  reached consistent point, stopping!<br /> WARNING:  Starting logical
replication<br/> LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000<br /> LOG:  found initial snapshot (via
runningxacts). Done: 1<br /> FATAL:  cannot read pg_class without having selected a database<br /> TRAP:
FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))",File: "proc.c", Line: 759)<br /><br /> This seems
tobe happening under the calls at<br /> reorderbuffer.c:832    if (!SnapBuildHasCatalogChanges(NULL, xid,
&change->relnode))<br/><br /> The sequence of events I do is:<br /> 1. start pg_receivellog<br /> 2. run a
checkpoint<br/> 3. Attach to the walsender process with gdb<br /> 4. Start a new client connection with psql and do
'INSERTINTO a values (1)' twice.<br /><br /> (skipping step 3 doesn't make a difference)<br /><br /><br /><br /><br />
I<br/><br /><br /><blockquote cite="mid:1352942234-3953-11-git-send-email-andres@2ndquadrant.com" type="cite"><pre
wrap="">
This introduces several things:
* 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
* 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
* logging more data into wal to facilitate logical decoding
* wal decoding into an reorderbuffer
* shared library output plugins with 5 callbacks* init* begin* change* commit
* walsender infrastructur to stream out changes and to keep the global xmin low enough* INIT_LOGICAL_REPLICATION
$plugin;waits till a consistent snapshot is built and returns  * initial LSN  * replication slot identifier  * id of a
pg_export()style snapshot* START_LOGICAL_REPLICATION $id $lsn; streams out changes* uses named output plugins for
outputspecification 

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the outside.
* more frequent xl_running_xid's so xmin can be upped more frequently
* add STOP_LOGICAL_REPLICATION $id
---src/backend/access/heap/heapam.c                |  280 +++++-src/backend/access/transam/xlog.c               |    1
+src/backend/catalog/index.c                    |   74 ++src/backend/replication/Makefile                |    2
+src/backend/replication/logical/Makefile       |   19 +src/backend/replication/logical/decode.c        |  496
++++++++++src/backend/replication/logical/logicalfuncs.c |  247 +++++src/backend/replication/logical/reorderbuffer.c |
1156+++++++++++++++++++++++src/backend/replication/logical/snapbuild.c     | 1144
++++++++++++++++++++++src/backend/replication/repl_gram.y            |   32 +-src/backend/replication/repl_scanner.l
     |    2 +src/backend/replication/walsender.c             |  566 ++++++++++-src/backend/storage/ipc/procarray.c
      |   23 +src/backend/storage/ipc/standby.c               |    8 +-src/backend/utils/cache/inval.c
|   2 +-src/backend/utils/cache/relcache.c              |    3 +-src/backend/utils/misc/guc.c                    |   11
+src/backend/utils/time/tqual.c                 |  249 +++++src/bin/pg_controldata/pg_controldata.c         |    2
+src/include/access/heapam_xlog.h               |   23 +src/include/access/transam.h                    |    5
+src/include/access/xlog.h                      |    3 +-src/include/catalog/index.h                     |    4
+src/include/nodes/nodes.h                      |    2 +src/include/nodes/replnodes.h                   |   22
+src/include/replication/decode.h               |   21 +src/include/replication/logicalfuncs.h          |   44
+src/include/replication/output_plugin.h        |   76 ++src/include/replication/reorderbuffer.h         |  284
++++++src/include/replication/snapbuild.h            |  128 +++src/include/replication/walsender.h             |    1
+src/include/replication/walsender_private.h    |   34 +-src/include/storage/itemptr.h                   |    3
+src/include/storage/sinval.h                   |    2 +src/include/utils/tqual.h                       |   31 +-35
fileschanged, 4966 insertions(+), 34 deletions(-)create mode 100644 src/backend/replication/logical/Makefilecreate mode
100644src/backend/replication/logical/decode.ccreate mode 100644 src/backend/replication/logical/logicalfuncs.ccreate
mode100644 src/backend/replication/logical/reorderbuffer.ccreate mode 100644
src/backend/replication/logical/snapbuild.ccreatemode 100644 src/include/replication/decode.hcreate mode 100644
src/include/replication/logicalfuncs.hcreatemode 100644 src/include/replication/output_plugin.hcreate mode 100644
src/include/replication/reorderbuffer.hcreatemode 100644 src/include/replication/snapbuild.h 

</pre><br /><fieldset class="mimeAttachmentHeader"></fieldset><br /><pre wrap="">
</pre></blockquote><br />

Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From

Andres Freund

Date:

03 December 2012, 12:22:16

Hi Steve,

On 2012-12-02 21:52:08 -0500, Steve Singer wrote:
> On 12-11-14 08:17 PM, Andres Freund wrote:
>
> I am getting errors like the following when I try to use either your
> test_decoding plugin or my own (which does even less than yours)
>
>
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> WARNING:  connecting to
> WARNING:  Initiating logical rep
> LOG:  computed new xmin: 773
> LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> LOG:  got new xmin 773 at 25124280
> LOG:  found initial snapshot (via running xacts). Done: 1
> WARNING:  reached consistent point, stopping!
> WARNING:  Starting logical replication
> LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> LOG:  found initial snapshot (via running xacts). Done: 1
> FATAL:  cannot read pg_class without having selected a database
> TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File:
> "proc.c", Line: 759)
>
> This seems to be happening under the calls at
> reorderbuffer.c:832    if (!SnapBuildHasCatalogChanges(NULL, xid,
> &change->relnode))

Two things:
1) Which exact options are you using for pg_receivellog? Not "-d
replication" by any chance?
2) Could you check that you really have a fully clean build? That has
hit me in the past as well.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From

Andres Freund

Date:

03 December 2012, 12:42:08

On 2012-12-03 13:22:12 +0100, Andres Freund wrote:
> Hi Steve,
>
> On 2012-12-02 21:52:08 -0500, Steve Singer wrote:
> > On 12-11-14 08:17 PM, Andres Freund wrote:
> >
> > I am getting errors like the following when I try to use either your
> > test_decoding plugin or my own (which does even less than yours)
> >
> >
> > LOG:  database system is ready to accept connections
> > LOG:  autovacuum launcher started
> > WARNING:  connecting to
> > WARNING:  Initiating logical rep
> > LOG:  computed new xmin: 773
> > LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> > LOG:  got new xmin 773 at 25124280
> > LOG:  found initial snapshot (via running xacts). Done: 1
> > WARNING:  reached consistent point, stopping!
> > WARNING:  Starting logical replication
> > LOG:  start reading from 0/17F5D58, scrolled back to 0/17F4000
> > LOG:  found initial snapshot (via running xacts). Done: 1
> > FATAL:  cannot read pg_class without having selected a database
> > TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File:
> > "proc.c", Line: 759)
> >
> > This seems to be happening under the calls at
> > reorderbuffer.c:832    if (!SnapBuildHasCatalogChanges(NULL, xid,
> > &change->relnode))
>
> Two things:
> 1) Which exact options are you using for pg_receivellog? Not "-d
> replication" by any chance?

This seems to produce exactly that kind off error messages. I added some
error checking around that. Pushed.

Thanks!

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From

Steve Singer

Date:

03 December 2012, 14:36:06

On 12-12-03 07:42 AM, Andres Freund wrote:
> Two things:
> 1) Which exact options are you using for pg_receivellog? Not "-d
> replication" by any chance?

Yes that is exactly what I'md doing.  Using a real database name instead 
makes this go away.

Thanks

> This seems to produce exactly that kind off error messages. I added some
> error checking around that. Pushed.
>
> Thanks!
>
> Andres Freund
>
> --
>   Andres Freund                       http://www.2ndQuadrant.com/
>   PostgreSQL Development, 24x7 Support, Training & Services
>
>

Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From

Andres Freund

Date:

03 December 2012, 14:48:16

On 2012-12-03 09:35:55 -0500, Steve Singer wrote:
> On 12-12-03 07:42 AM, Andres Freund wrote:
> >Two things:
> >1) Which exact options are you using for pg_receivellog? Not "-d
> >replication" by any chance?
>
> Yes that is exactly what I'md doing.  Using a real database name instead
> makes this go away.

Was using "replication" an accident or do you think it makes sense in
some way?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel

From

Steve Singer

Date:

03 December 2012, 14:56:26

On 12-12-03 09:48 AM, Andres Freund wrote:
> On 2012-12-03 09:35:55 -0500, Steve Singer wrote:
>> On 12-12-03 07:42 AM, Andres Freund wrote:
>> Yes that is exactly what I'md doing. Using a real database name 
>> instead makes this go away. 
> Was using "replication" an accident or do you think it makes sense in
> some way?

The 'replication' line in pg_hba.conf made me think that the database 
name had to be replication for walsender connections.

> Greetings,
>
> Andres Freund
>
> --
>   Andres Freund                       http://www.2ndQuadrant.com/
>   PostgreSQL Development, 24x7 Support, Training & Services
>
>

Re: [PATCH 03/14] Add simple xlogdump tool

From

Andres Freund

Date:

06 December 2012, 16:13:58

Hi,

I tried to address most (all?) your comments in the version from
http://archives.postgresql.org/message-id/20121204175212.GB12055%40awork2.anarazel.de
.

On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
> > +xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v
/main.o|sed's/^/..\/..\/..\//')

> > +    $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>
> This looks pretty evil, and there is no documentation about what it is
> supposed to do.
>
> Windows build support needs some thought.

Ok, since Alvaro made it possible the build now only has rules like:

xlogreader.c: % : $(top_srcdir)/src/backend/access/transam/%             rm -f $@ && $(LN_S) $< .

clogdesc.c: % : $(top_srcdir)/src/backend/access/rmgrdesc/%           rm -f $@ && $(LN_S) $< .
and
OBJS = \    clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \    mxactdesc.o nbtdesc.o relmapdesc.o
seqdesc.osmgrdesc.o spgdesc.o \    standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o assert.o \    $(WIN32RES) \
pg_xlogdump.opqexpbuf_strinfo.o compat.o tables.o xlogreader.o \

pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport            $(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX)
$(LIBS)$(libpq_pgport) -o $@$(X)

Thats easier easier to integrate into the windows build?

> > +static void
> > +usage(void)
> > +{
> > +    printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
> > +           progname);
> > +    printf(_("Usage:\n"));
> > +    printf(_("  %s [OPTION]...\n"), progname);
> > +    printf(_("\nOptions:\n"));
> > +    printf(_("  -v, --version          output version information, then exit\n"));
> > +    printf(_("  -h, --help             show this help, then exit\n"));
> > +    printf(_("  -s, --start            from where recptr onwards to read\n"));
> > +    printf(_("  -e, --end              up to which recptr to read\n"));
> > +    printf(_("  -t, --timeline         which timeline do we want to read\n"));
> > +    printf(_("  -i, --inpath           from where do we want to read? cwd/pg_xlog is the default\n"));
> > +    printf(_("  -o, --output           where to write [start, end]\n"));
> > +    printf(_("  -f, --file             wal file to parse\n"));
> > +}
>
> Options list should be in alphabetic order (or some other less random
> order).  Most of these descriptions are not very intelligible (at least
> without additional documentation).

I tried to improve the help, its now:

pg_xlogdump: reads/writes postgres transaction logs for debugging.

Usage: pg_xlogdump [OPTION]...

Options: -b, --bkp-details      output detailed information about backup blocks -e, --end RECPTR       read wal up to
RECPTR-f, --file FILE        wal file to parse, cannot be specified together with -p -h, --help             show this
help,then exit -p, --path PATH        from where do we want to read? cwd/pg_xlog is the default -s, --start RECPTR
readwal in directory indicated by -p starting at RECPTR -t, --timeline TLI     which timeline do we want to read,
defaultsto 1 -v, --version          output version information, then exit

I wonder whether it would make sense to split the help into different
sections? It seems likely we will gain some more options...

> no nls.mk

Do I need to do anything for that besides:

# src/bin/pg_xlogdump/nls.mk
CATALOG_NAME     = pg_xlogdump
AVAIL_LANGUAGES  =
GETTEXT_FILES    = pg_xlogdump.c

?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3 - git repository

From

Andres Freund

Date:

09 December 2012, 19:14:56

On 2012-11-15 02:26:53 +0100, Andres Freund wrote:
> On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
> > In response to this you will soon find the 14 patches that currently
> > implement $subject.
>
> As its not very wieldly to send around that many/big patches all the
> time, until the next "major" version I will just update the git tree at:
>
> Web:
>
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf3
>
> Git:
> git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3

I pushed a new version which

- is rebased ontop of master
- is based ontop of the new xlogreader (biggest part)
- is base ontop of the new binaryheap.h
- some fixes
- some more comments

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Andres Freund

Date:

09 December 2012, 19:15:42

On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
> On 15.11.2012 03:17, Andres Freund wrote:
> >
> >Features:
> >- streaming reading/writing
> >- filtering
> >- reassembly of records
> >
> >Reusing the ReadRecord infrastructure in situations where the code that wants
> >to do so is not tightly integrated into xlog.c is rather hard and would require
> >changes to rather integral parts of the recovery code which doesn't seem to be
> >a good idea.
> >
> >Missing:
> >- "compressing" the stream when removing uninteresting records
> >- writing out correct CRCs
> >- separating reader/writer
>
> I'm disappointed to see that there has been no progress on this patch since
> last commitfest. I thought we agreed on the approach I championed for here:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
> wasn't much work left to finish that, I believe.
>
> Are you going to continue working on this?

Patch (git repo) is now based ontop of my vesion of your xlogreader
version...

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical changeset generation v3

From

Andres Freund

Date:

10 December 2012, 14:43:35

Hi,

On 2012-11-19 09:50:30 +0100, Andres Freund wrote:
> On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > After launching some SQLs, the logical receiver is stuck just after sending
> > INIT_LOGICAL_REPLICATION, please see bt of process waiting:
>
> Its waiting till it sees initial an initial xl_running_xacts record. The
> easiest way to do that is manually issue a checkpoint. Sorry, should
> have included that in the description.
> Otherwise you can wait till the next routine checkpoint comes arround...
>
> I plan to cause more xl_running_xacts record to be logged in the
> future. I think the timing of those currently is non-optimal, you have
> the same problem as here in normal streaming replication as well :(

This is "fixed" now with the changes I pushed a second ago. Unless some
longrunning transactions are arround we now immediately jump to a ready
state.
This is achieved by
1. regularly logging xl_running_xacts (in background writer)
2. logging xl_runnign_xacts at the beginning of INIT_LOGICAL_REPLICATION

This also has the advantage that the xmin horizon can be increased much
more frequently.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Heikki Linnakangas

Date:

10 December 2012, 20:22:47

(Offlist)

Just a quick note that I'm working on this patch now. I pushed some 
trivial fixes to my git repository at 
git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3 
branch.

- Heikki

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Heikki Linnakangas

Date:

10 December 2012, 20:23:07

On 10.12.2012 22:22, Heikki Linnakangas wrote:
> (Offlist)
>
> Just a quick note that I'm working on this patch now. I pushed some
> trivial fixes to my git repository at
> git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3
> branch.

Oops, wasn't offlist :-). Well, if anyone wants to take a look, feel free.

- Heikki

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Heikki Linnakangas

Date:

11 December 2012, 13:55:36

I've been molding this patch for a while now, here's what I have this 
far (also available in my git repository).

The biggest change is in the error reporting. A stand-alone program that 
wants to use xlogreader.c no longer has to provide a full-blown 
replacement for ereport(). The only thing that xlogreader.c used 
ereport() was when it encounters an invalid record. And even there we 
had the emode_for_corrupt_record hack. I think it's a much better API 
that XLogReadRecord just returns NULL on an invalid record, and an error 
string, and the caller can do what it wants with that. In xlog.c, we'll 
pass the error string to ereport(), with the right emode as determined 
by emode_for_corrupt_record. xlog.c is no longer concerned with 
emode_for_corrupt_record, or error levels in general.

We talked about this earlier, and Tom Lane argued that "it's basically 
insane to imagine that you can carve out a non-trivial piece of the 
backend that doesn't contain any elog calls." 
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but 
having done just that, it doesn't seem insane to me. xlogreader.c really 
is a pretty well contained piece of code. All the complicated stuff that 
contains elog calls and pallocs and more is in the callback, which can 
freely use all the normal backend infrastructure.

Now, here's some stuff that still need to be done:

* A stand-alone program using xlogreader.c has to provide an 
implementation of tliInHistory(). Need to find a better way to do that. 
Perhaps "#ifndef FRONTEND" the tliInHistory checks in xlogreader.

* In xlog.c, some of the variables that used to be statics like 
readFile, readOff etc. are now in the XLogPageReadPrivate struct. But 
there's still plenty of statics left in there - it would certainly not 
work correctly if xlog.c tried to open two xlog files at the same time. 
I think it's just confusing to have some stuff in the 
XLogPageReadPrivate struct, and others as static, so I think we should 
get rid of XLogPageReadPrivate struct altogether and put back the static 
variables. At least it would make the diff smaller, which might help 
with reviewing. xlog.c probably doesn't need to provide a "private" 
struct to xlogreader.c at all, which is okay.

* It's pretty ugly that to use the rm_desc functions, you have to 
provide dummy implementations of a bunch of backend functions, including 
pfree() and timestamptz_to_str(). Should find a better way to do that.

* It's not clear to me how we'd handle translating the strings in 
xlogreader.c, when xlogreader.c is used in a stand-alone program like 
pg_xlogdump. Maybe we can just punt on that...

* How about we move pg_xlogdump to contrib? It doesn't feel like the 
kind of essential tool that deserves to be in src/bin.

- Heikki

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Heikki Linnakangas

Date:

11 December 2012, 13:56:49

Forgot attachment..

On 11.12.2012 15:55, Heikki Linnakangas wrote:
> I've been molding this patch for a while now, here's what I have this
> far (also available in my git repository).
>
> The biggest change is in the error reporting. A stand-alone program that
> wants to use xlogreader.c no longer has to provide a full-blown
> replacement for ereport(). The only thing that xlogreader.c used
> ereport() was when it encounters an invalid record. And even there we
> had the emode_for_corrupt_record hack. I think it's a much better API
> that XLogReadRecord just returns NULL on an invalid record, and an error
> string, and the caller can do what it wants with that. In xlog.c, we'll
> pass the error string to ereport(), with the right emode as determined
> by emode_for_corrupt_record. xlog.c is no longer concerned with
> emode_for_corrupt_record, or error levels in general.
>
> We talked about this earlier, and Tom Lane argued that "it's basically
> insane to imagine that you can carve out a non-trivial piece of the
> backend that doesn't contain any elog calls."
> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
> having done just that, it doesn't seem insane to me. xlogreader.c really
> is a pretty well contained piece of code. All the complicated stuff that
> contains elog calls and pallocs and more is in the callback, which can
> freely use all the normal backend infrastructure.
>
> Now, here's some stuff that still need to be done:
>
> * A stand-alone program using xlogreader.c has to provide an
> implementation of tliInHistory(). Need to find a better way to do that.
> Perhaps "#ifndef FRONTEND" the tliInHistory checks in xlogreader.
>
> * In xlog.c, some of the variables that used to be statics like
> readFile, readOff etc. are now in the XLogPageReadPrivate struct. But
> there's still plenty of statics left in there - it would certainly not
> work correctly if xlog.c tried to open two xlog files at the same time.
> I think it's just confusing to have some stuff in the
> XLogPageReadPrivate struct, and others as static, so I think we should
> get rid of XLogPageReadPrivate struct altogether and put back the static
> variables. At least it would make the diff smaller, which might help
> with reviewing. xlog.c probably doesn't need to provide a "private"
> struct to xlogreader.c at all, which is okay.
>
> * It's pretty ugly that to use the rm_desc functions, you have to
> provide dummy implementations of a bunch of backend functions, including
> pfree() and timestamptz_to_str(). Should find a better way to do that.
>
> * It's not clear to me how we'd handle translating the strings in
> xlogreader.c, when xlogreader.c is used in a stand-alone program like
> pg_xlogdump. Maybe we can just punt on that...
>
> * How about we move pg_xlogdump to contrib? It doesn't feel like the
> kind of essential tool that deserves to be in src/bin.
>
> - Heikki


--
- Heikki

Attachment

xlogreader-heikki-20121211.patch

Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Andres Freund

Date:

11 December 2012, 14:44:48

On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
> I've been molding this patch for a while now, here's what I have this far
> (also available in my git repository).

On a very quick this looks good. I will try to rebase the decoding stuff
and read a bit around in the course of that...

> The biggest change is in the error reporting. A stand-alone program that
> wants to use xlogreader.c no longer has to provide a full-blown replacement
> for ereport(). The only thing that xlogreader.c used ereport() was when it
> encounters an invalid record. And even there we had the
> emode_for_corrupt_record hack. I think it's a much better API that
> XLogReadRecord just returns NULL on an invalid record, and an error string,
> and the caller can do what it wants with that. In xlog.c, we'll pass the
> error string to ereport(), with the right emode as determined by
> emode_for_corrupt_record. xlog.c is no longer concerned with
> emode_for_corrupt_record, or error levels in general.

> We talked about this earlier, and Tom Lane argued that "it's basically
> insane to imagine that you can carve out a non-trivial piece of the backend
> that doesn't contain any elog calls."
> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
> having done just that, it doesn't seem insane to me. xlogreader.c really is
> a pretty well contained piece of code. All the complicated stuff that
> contains elog calls and pallocs and more is in the callback, which can
> freely use all the normal backend infrastructure.

This is pretty good. I was a bit afraid of making this change - thus my
really ugly emode callback hack - but this is way better.

> Now, here's some stuff that still need to be done:
>
> * A stand-alone program using xlogreader.c has to provide an implementation
> of tliInHistory(). Need to find a better way to do that. Perhaps "#ifndef
> FRONTEND" the tliInHistory checks in xlogreader.

We could just leave it in xlogreader in the first place. Having an
#ifdef'ed out version in there seems to be schizophrenic to me, all the
maintenance overhead, none of the fun...

> * In xlog.c, some of the variables that used to be statics like
> readFile, readOff etc. are now in the XLogPageReadPrivate struct. But
> there's still plenty of statics left in there - it would certainly not
> work correctly if xlog.c tried to open two xlog files at the same
> time. I think it's just confusing to have some stuff in the
> XLogPageReadPrivate struct, and others as static, so I think we should
> get rid of XLogPageReadPrivate struct altogether and put back the
> static variables. At least it would make the diff smaller, which might
> help with reviewing. xlog.c probably doesn't need to provide a
> "private" struct to xlogreader.c at all, which is okay.

Fine with me. I find the grouping that the struct provides somewhat
helpful when reading the code, but its more than offset by duplicating
some of the variables.
The reasons to have it are fewer compared when you'd introduced the
struct - xlogreader does more now, so less needs to be handled outside.

> * It's pretty ugly that to use the rm_desc functions, you have to provide
> dummy implementations of a bunch of backend functions, including pfree() and
> timestamptz_to_str(). Should find a better way to do that.

I think most of the cases requiring those ugly hacks can be fixed to
just use a caller-provided buffer, there's not that much left.

timestamptz_to_str() is probably the most complex case. I just noticed
there's already a second implementation in
ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
a number of cases :(

> * It's not clear to me how we'd handle translating the strings in
> xlogreader.c, when xlogreader.c is used in a stand-alone program like
> pg_xlogdump. Maybe we can just punt on that...

I personally would have no problem with that. It's probably either going
to be used during in-depth-debugging or when developing pg. In both
cases english seems to be fine. But I seldomly use translated programs
so maybe I am not the right person to ask.

> * How about we move pg_xlogdump to contrib? It doesn't feel like the kind of
> essential tool that deserves to be in src/bin.

contrib would be fine, but I think src/bin is better. There have been
quite some bugs by now where it would have been useful to have a
reliable xlogdump in core so its really installed.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Andres Freund

Date:

11 December 2012, 15:43:16

On 2012-12-11 15:44:39 +0100, Andres Freund wrote:
> On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
> > * It's pretty ugly that to use the rm_desc functions, you have to provide
> > dummy implementations of a bunch of backend functions, including pfree() and
> > timestamptz_to_str(). Should find a better way to do that.
>
> I think most of the cases requiring those ugly hacks can be fixed to
> just use a caller-provided buffer, there's not that much left.
>
> timestamptz_to_str() is probably the most complex case. I just noticed
> there's already a second implementation in
> ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
> a number of cases :(

The attached (and pushed) patches change relpathbackend to use a static buffer
instead. That gets rid of the pfree() requirement and looks ok otherwise as
well.

Unfortunately that still leaves us with the need to re-implement
relpathbackend() in xlogdump, but that seems somwhat ok to me.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader

From

Andres Freund

Date:

11 December 2012, 15:44:46

On 2012-12-11 16:43:12 +0100, Andres Freund wrote:
> On 2012-12-11 15:44:39 +0100, Andres Freund wrote:
> > On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
> > > * It's pretty ugly that to use the rm_desc functions, you have to provide
> > > dummy implementations of a bunch of backend functions, including pfree() and
> > > timestamptz_to_str(). Should find a better way to do that.
> >
> > I think most of the cases requiring those ugly hacks can be fixed to
> > just use a caller-provided buffer, there's not that much left.
> >
> > timestamptz_to_str() is probably the most complex case. I just noticed
> > there's already a second implementation in
> > ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
> > a number of cases :(
>
> The attached (and pushed) patches change relpathbackend to use a static buffer
> instead. That gets rid of the pfree() requirement and looks ok otherwise as
> well.

... really attached.

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

On 2012-12-13 18:29:00 +0100, Andres Freund wrote:
> On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
> > Initialisation means finding a free “logical slot” from shared memory,
> > then looping until the new magic xmin horizon for logical walsenders
> > (stored in their “slot”) is that of the weakest link (think local
> > global xmin).
> >
> > +      * FIXME: think about solving the race conditions in a nicer way.
> > +      */
> > + recompute_xmin:
> > +     walsnd->xmin = GetOldestXmin(true, true);
> > +     ComputeLogicalXmin();
> > +     if (walsnd->xmin != GetOldestXmin(true, true))
> > +         goto recompute_xmin;
> >
> > Apart from the race conditions that I'm not confident are addressed
> > here, I think that the above could easily get stuck indefinitely in
> > the event of contention.
>
> I don't like that part the slightest bit but I don't think its actually
> in danger of looping forever. In fact I think its so broken it won't
> ever loop ;). (ComputeLogicalXmin() will set the current global minimum
> which will then be returned by GetOldestXmin()).
>
> I would like to solve this properly without copying GetOldestXmin once
> more (so we can compute and set the logical xmin while holding
> ProcArrayLock), but I am not yet sure how a nice way to do that would
> look like.
>
> I guess GetOldestXminNoLock? That gets called while we already hold the
> procarray lock? Yuck.

Does anybody have an opinion on the attached patches? Especially 0001,
which contains the procarray changes?

It moves a computation of the sort of:

result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
   result = FirstNormalTransactionId;

inside ProcArrayLock. But I can't really imagine that to be relevant...


Another alternative to this would be to get a snapshot with
GetSnapshotData(), copy the xmin to the logical slot, then call
ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
me.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

On Fri, Dec 14, 2012 at 2:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It moves a computation of the sort of:
>
> result -= vacuum_defer_cleanup_age;
> if (!TransactionIdIsNormal(result))
> result = FirstNormalTransactionId;
>
> inside ProcArrayLock. But I can't really imagine that to be relevant...

I can. Go look at some of the 9.2 optimizations around
GetSnapshotData(). Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization. For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.

The two commits coming to my mind are:
- ed0b409 (Separate PGPROC into PGPROC and PGXACT)
- 0d76b60 (introduction of NormalTransactionIdPrecedes)
Those ones really improved concurrency performance.

--
Michael Paquier
http://michael.otacoo.com

Re: logical decoding - GetOldestXmin

From

Andres Freund

Date:

14 December 2012, 11:46:27

On 2012-12-13 23:35:00 +0000, Simon Riggs wrote:
> On 13 December 2012 22:37, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
> >> On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > It moves a computation of the sort of:
> >> >
> >> > result -= vacuum_defer_cleanup_age;
> >> > if (!TransactionIdIsNormal(result))
> >> >    result = FirstNormalTransactionId;
> >> >
> >> > inside ProcArrayLock. But I can't really imagine that to be relevant...
> >>
> >> I can.  Go look at some of the 9.2 optimizations around
> >> GetSnapshotData().  Those made a BIG difference under heavy
> >> concurrency and they were definitely micro-optimization.  For example,
> >> the introduction of NormalTransactionIdPrecedes() was shockingly
> >> effective.
> >
> > But GetOldestXmin() should be called less frequently than
> > GetSnapshotData() by several orders of magnitudes. I don't really see
> > it being used in any really hot code paths?
>
> Maybe, but that calculation doesn't *need* to be inside the lock, that
> is just a consequence of the current coding.

I am open to suggestion how to do that in a way we a) can hold the lock
already (to safely nail the global xmin to the current value) b) without
duplicating all the code.

Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called

* once per checkpoint
* one per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages

Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. A "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical decoding - GetOldestXmin

From

Robert Haas

Date:

14 December 2012, 19:01:40

On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Just moving that tidbit inside the lock seems to be the pragmatic
> choice. GetOldestXmin is called
>
> * once per checkpoint
> * one per index build
> * once in analyze
> * twice per vacuum
> * once for HS feedback messages
>
> Nothing of that occurs frequently enough that 5 instructions will make a
> difference. I would be happy to go an alternative path, but right now I
> don't see any nice one. A "already_locked" parameter to GetOldestXmin
> seems to be a cure worse than the disease.

I'm not sure that would be so bad, but I guess I question the need to
do it this way at all.  Most of the time, if you need to advertise
your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
I guess I'm not seeing why that wouldn't also work here.  Am I dumb?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: logical decoding - GetOldestXmin

From

Andres Freund

Date:

15 December 2012, 00:19:31

On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Just moving that tidbit inside the lock seems to be the pragmatic
> > choice. GetOldestXmin is called
> >
> > * once per checkpoint
> > * one per index build
> > * once in analyze
> > * twice per vacuum
> > * once for HS feedback messages
> >
> > Nothing of that occurs frequently enough that 5 instructions will make a
> > difference. I would be happy to go an alternative path, but right now I
> > don't see any nice one. A "already_locked" parameter to GetOldestXmin
> > seems to be a cure worse than the disease.
>
> I'm not sure that would be so bad, but I guess I question the need to
> do it this way at all.  Most of the time, if you need to advertise
> your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
> I guess I'm not seeing why that wouldn't also work here.  Am I dumb?

I wondered upthread whether that would be better:

On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
> Another alternative to this would be to get a snapshot with
> GetSnapshotData(), copy the xmin to the logical slot, then call
> ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
> me.

Not sure why I considered it ugly anymore, but it actually has a
noticeable disadvantage. GetOldestXmin is nicer is than GetSnapshotData
as the latter set a fairly new xid as xmin whereas GetOldestXmin returns
the actual current xmin horizon. Thats preferrable because it allows us
to start up more quickly. snapbuild.c can only start building a snapshot
once it has seen a xl_running_xact with oldestRunningXid >=
own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
have been removed.

This also made me notice that my changes to GetSnapshotData were quite
pessimal... I do set the xmin of the new snapshot to the "logical xmin"
instead of doing it only to globalxmin/RecentGlobalXmin.

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical decoding - GetOldestXmin

From

Andres Freund

Date:

16 December 2012, 14:55:48

On 2012-12-15 01:19:26 +0100, Andres Freund wrote:
> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
> > On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > Just moving that tidbit inside the lock seems to be the pragmatic
> > > choice. GetOldestXmin is called
> > >
> > > * once per checkpoint
> > > * one per index build
> > > * once in analyze
> > > * twice per vacuum
> > > * once for HS feedback messages
> > >
> > > Nothing of that occurs frequently enough that 5 instructions will make a
> > > difference. I would be happy to go an alternative path, but right now I
> > > don't see any nice one. A "already_locked" parameter to GetOldestXmin
> > > seems to be a cure worse than the disease.
> >
> > I'm not sure that would be so bad, but I guess I question the need to
> > do it this way at all.  Most of the time, if you need to advertise
> > your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
> > I guess I'm not seeing why that wouldn't also work here.  Am I dumb?
>
> I wondered upthread whether that would be better:
>
> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
> > Another alternative to this would be to get a snapshot with
> > GetSnapshotData(), copy the xmin to the logical slot, then call
> > ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
> > me.
>
> Not sure why I considered it ugly anymore, but it actually has a
> noticeable disadvantage. GetOldestXmin is nicer is than GetSnapshotData
> as the latter set a fairly new xid as xmin whereas GetOldestXmin returns
> the actual current xmin horizon. Thats preferrable because it allows us
> to start up more quickly. snapbuild.c can only start building a snapshot
> once it has seen a xl_running_xact with oldestRunningXid >=
> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
> have been removed.

Hm. One way that could work with fewer changes is to exploit the fact
that a) it seems to be possible to acquire a shared lwlock twice in the
same backend and b) both GetOldestXmin & GetSnapshotData acquire only a
shared lwlock.

Are we willing to guarantee that recursive acquiration of shared lwlocks
continues to work?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical decoding - GetOldestXmin

From

Simon Riggs

Date:

16 December 2012, 16:44:30

On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:

> Does anybody have an opinion on the attached patches? Especially 0001,
> which contains the procarray changes?
>
> It moves a computation of the sort of:
>
> result -= vacuum_defer_cleanup_age;
> if (!TransactionIdIsNormal(result))
>    result = FirstNormalTransactionId;
>
> inside ProcArrayLock. But I can't really imagine that to be relevant...

I don't see why this is hard.

Just make the lock acquisition/release conditional on another parameter.

That way the only thing you'll be moving inside the lock is an if test
on a constant boolean.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: logical decoding - GetOldestXmin

From

Andres Freund

Date:

16 December 2012, 17:02:46

On 2012-12-16 16:44:04 +0000, Simon Riggs wrote:
> On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > Does anybody have an opinion on the attached patches? Especially 0001,
> > which contains the procarray changes?
> >
> > It moves a computation of the sort of:
> >
> > result -= vacuum_defer_cleanup_age;
> > if (!TransactionIdIsNormal(result))
> >    result = FirstNormalTransactionId;
> >
> > inside ProcArrayLock. But I can't really imagine that to be relevant...
>
> I don't see why this is hard.
>
> Just make the lock acquisition/release conditional on another parameter.
>
> That way the only thing you'll be moving inside the lock is an if test
> on a constant boolean.

Thats not really cheaper. Two branches + additional parameter
passed/pushed vs one branch, one subtransaction, two assignments is a
close call.
As I don't think either really matters in the GetOldestXmin case, I
would be happy with that as well. If people prefer an additional
parameter + adjusting the few callsite vs. a separate function I will go
that way.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical decoding - GetOldestXmin

From

Robert Haas

Date:

18 December 2012, 20:21:08

On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
>> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Just moving that tidbit inside the lock seems to be the pragmatic
>> > choice. GetOldestXmin is called
>> >
>> > * once per checkpoint
>> > * one per index build
>> > * once in analyze
>> > * twice per vacuum
>> > * once for HS feedback messages
>> >
>> > Nothing of that occurs frequently enough that 5 instructions will make a
>> > difference. I would be happy to go an alternative path, but right now I
>> > don't see any nice one. A "already_locked" parameter to GetOldestXmin
>> > seems to be a cure worse than the disease.
>>
>> I'm not sure that would be so bad, but I guess I question the need to
>> do it this way at all.  Most of the time, if you need to advertise
>> your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
>> I guess I'm not seeing why that wouldn't also work here.  Am I dumb?
>
> I wondered upthread whether that would be better:
>
> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
>> Another alternative to this would be to get a snapshot with
>> GetSnapshotData(), copy the xmin to the logical slot, then call
>> ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
>> me.
>
> Not sure why I considered it ugly anymore, but it actually has a
> noticeable disadvantage. GetOldestXmin is nicer is than GetSnapshotData
> as the latter set a fairly new xid as xmin whereas GetOldestXmin returns
> the actual current xmin horizon. Thats preferrable because it allows us
> to start up more quickly. snapbuild.c can only start building a snapshot
> once it has seen a xl_running_xact with oldestRunningXid >=
> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
> have been removed.

I'm a bit confused.  Are you talking about the difference between
RecentGlobalXmin and RecentXmin?  I think GetSnapshotData() updates
both.

Anyway, if there's no nicer way, I think it's probably OK to add a
parameter to GetOldestXmin().  It seems like kind of a hack, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: logical decoding - GetOldestXmin

From

"anarazel@anarazel.de"

Date:

18 December 2012, 22:25:21

Hi,

Robert Haas <robertmhaas@gmail.com> schrieb:

>On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com>
>wrote:
>> On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
>>> On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund
><andres@2ndquadrant.com> wrote:
>>> > Just moving that tidbit inside the lock seems to be the pragmatic
>>> > choice. GetOldestXmin is called
>>> >
>>> > * once per checkpoint
>>> > * one per index build
>>> > * once in analyze
>>> > * twice per vacuum
>>> > * once for HS feedback messages
>>> >
>>> > Nothing of that occurs frequently enough that 5 instructions will
>make a
>>> > difference. I would be happy to go an alternative path, but right
>now I
>>> > don't see any nice one. A "already_locked" parameter to
>GetOldestXmin
>>> > seems to be a cure worse than the disease.
>>>
>>> I'm not sure that would be so bad, but I guess I question the need
>to
>>> do it this way at all.  Most of the time, if you need to advertise
>>> your global xmin, you use GetSnapshotData(), not GetOldestXmin(),
>and
>>> I guess I'm not seeing why that wouldn't also work here.  Am I dumb?
>>
>> I wondered upthread whether that would be better:
>>
>> On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
>>> Another alternative to this would be to get a snapshot with
>>> GetSnapshotData(), copy the xmin to the logical slot, then call
>>> ProcArrayEndTransaction(). But that doesn't really seem to be nicer
>to
>>> me.
>>
>> Not sure why I considered it ugly anymore, but it actually has a
>> noticeable disadvantage. GetOldestXmin is nicer is than
>GetSnapshotData
>> as the latter set a fairly new xid as xmin whereas GetOldestXmin
>returns
>> the actual current xmin horizon. Thats preferrable because it allows
>us
>> to start up more quickly. snapbuild.c can only start building a
>snapshot
>> once it has seen a xl_running_xact with oldestRunningXid >=
>> own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
>> have been removed.
>
>I'm a bit confused.  Are you talking about the difference between
>RecentGlobalXmin and RecentXmin?  I think GetSnapshotData() updates
>both.

The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples required
fordecoding might get removed by other backends. That needs to be prevented while holding the  procarray lock
exclusively.

Does it make more sense now?

Andres

---
Please excuse the brevity and formatting - I am writing this on my mobile phone.

Re: logical decoding - GetOldestXmin

From

Robert Haas

Date:

19 December 2012, 00:56:24

On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
<andres@anarazel.de> wrote:
> The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples required
fordecoding might get removed by other backends. That needs to be prevented while holding the  procarray lock
exclusively.

Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
because GetSnapshotData() also updates proc->xmin.  If you're trying
to store a different value in that field then of course it matters.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: logical decoding - GetOldestXmin

From

Andres Freund

Date:

19 December 2012, 00:59:15

On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
> On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
> <andres@anarazel.de> wrote:
> > The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples
requiredfor decoding might get removed by other backends. That needs to be prevented while holding the  procarray lock
exclusively.
>
> Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
> because GetSnapshotData() also updates proc->xmin.  If you're trying
> to store a different value in that field then of course it matters.

Absolutely right. I don't want to say there's anything wrong with it
right now. The "problem" for me is that it sets proc->xmin to the newest
value it can while I want/need the oldest valid value...

I will go with adding a already_locked parameter to GetOldestXmin then.

Thanks for the input,

Andres
--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: logical decoding - GetOldestXmin

From

Robert Haas

Date:

19 December 2012, 02:04:05

On Tue, Dec 18, 2012 at 7:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
>> On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
>> <andres@anarazel.de> wrote:
>> > The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples
requiredfor decoding might get removed by other backends. That needs to be prevented while holding the  procarray lock
exclusively.
>>
>> Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
>> because GetSnapshotData() also updates proc->xmin.  If you're trying
>> to store a different value in that field then of course it matters.
>
> Absolutely right. I don't want to say there's anything wrong with it
> right now. The "problem" for me is that it sets proc->xmin to the newest
> value it can while I want/need the oldest valid value...
>
> I will go with adding a already_locked parameter to GetOldestXmin then.

Or instead of bool already_locked, maybe bool advertise_xmin?   Seems
like that might be more friendly to the abstraction boundaries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Noah Misch

Date:

13 January 2013, 00:29:01

[Catching up on old threads.]

On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>> PostgreSQL-R
>> No. There must be some way to logically identify the tuple.
> It can be done as selecting on _all_ attributes and updating/deleting  
> just the first matching row
>
> create cursor ...
> select from t ... where t.* = (....)
> fetch one ...
> delete where current of ...
>
> This is on distant (round 3 or 4) roadmap for this work, just was  
> interested
> if you had found any better way of doing this :)

That only works if every attribute's type has a notion of equality ("xml" does
not).  The equality operator may have a name other than "=", and an operator
named "=" may exist with semantics other than equality ("box" is affected).
Code attempting this replication strategy should select an equality operator
the way typcache.c does so.

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

13 January 2013, 10:49:39

On 01/13/2013 12:28 AM, Noah Misch wrote:
> [Catching up on old threads.]
>
> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>>> PostgreSQL-R
>>> No. There must be some way to logically identify the tuple.
>> It can be done as selecting on _all_ attributes and updating/deleting
>> just the first matching row
>>
>> create cursor ...
>> select from t ... where t.* = (....)
>> fetch one ...
>> delete where current of ...
>>
>> This is on distant (round 3 or 4) roadmap for this work, just was
>> interested
>> if you had found any better way of doing this :)
> That only works if every attribute's type has a notion of equality ("xml" does
> not).  The equality operator may have a name other than "=", and an operator
> named "=" may exist with semantics other than equality ("box" is affected).
> Code attempting this replication strategy should select an equality operator
> the way typcache.c does so.
Does this hint that postgreSQL also needs an sameness operator
( "is" or "===" in same languages).

Or does "IS NOT DISTINCT FROM" already work even for types without
comparison operator ?

--------------
Hannu

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

13 January 2013, 11:32:38

On 01/13/2013 10:49 AM, Hannu Krosing wrote:
> On 01/13/2013 12:28 AM, Noah Misch wrote:
>> [Catching up on old threads.]
>>
>> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>>> Is it possible to replicate UPDATEs and DELETEs without a primary 
>>>>> key in
>>>>> PostgreSQL-R
>>>> No. There must be some way to logically identify the tuple.
>>> It can be done as selecting on _all_ attributes and updating/deleting
>>> just the first matching row
>>>
>>> create cursor ...
>>> select from t ... where t.* = (....)
>>> fetch one ...
>>> delete where current of ...
>>>
>>> This is on distant (round 3 or 4) roadmap for this work, just was
>>> interested
>>> if you had found any better way of doing this :)
>> That only works if every attribute's type has a notion of equality 
>> ("xml" does
>> not).  The equality operator may have a name other than "=", and an 
>> operator
>> named "=" may exist with semantics other than equality ("box" is 
>> affected).
>> Code attempting this replication strategy should select an equality 
>> operator
>> the way typcache.c does so.
> Does this hint that postgreSQL also needs an sameness operator
> ( "is" or "===" in same languages).
>
> Or does "IS NOT DISTINCT FROM" already work even for types without
> comparison operator ?
Just checked - it does not, it still looks for "=" operator so it is 
just equality-with-nulls

How do people feel about adding a real sameness operator ?

Hannu


>
> --------------
> Hannu
>
>

Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Tom Lane

Date:

13 January 2013, 17:44:57

Hannu Krosing <hannu@2ndQuadrant.com> writes:
> How do people feel about adding a real sameness operator ?

Just begs the question of "what's sameness?"

In many places we consider a datatype's default btree equality operator
to define sameness, but not all types provide a btree opclass (in
particular, anything that hasn't got a sensible one-dimensional sort
order will not).  And some do but it doesn't represent anything that
anyone would want to consider "sameness" --- IIRC, some of the geometric
types provide btree opclasses that sort by area.  Even for apparently
simple types like float8 there are interesting questions like whether
minus zero is the same as plus zero.

The messiness here is not just due to lack of a notation.
        regards, tom lane

Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Markus Wanner

Date:

13 January 2013, 18:03:01

On 01/13/2013 12:30 PM, Hannu Krosing wrote:
> On 01/13/2013 10:49 AM, Hannu Krosing wrote:
>> Does this hint that postgreSQL also needs an sameness operator
>> ( "is" or "===" in same languages).
> 
> How do people feel about adding a real sameness operator ?

We'd need to define what "sameness" means. If this goes toward "exact
match in binary representation", this gets a thumbs-up from me.

As a first step in that direction, I'd see adjusting send() and recv()
functions to use a portable binary format. A "sameness" operator could
then be implemented by simply comparing two value's send() outputs.

Regards

Markus Wanner

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Andres Freund

Date:

13 January 2013, 18:03:05

On 2013-01-13 12:44:44 -0500, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > How do people feel about adding a real sameness operator ?
> 
> Just begs the question of "what's sameness?"
> 
> In many places we consider a datatype's default btree equality operator
> to define sameness, but not all types provide a btree opclass (in
> particular, anything that hasn't got a sensible one-dimensional sort
> order will not).  And some do but it doesn't represent anything that
> anyone would want to consider "sameness" --- IIRC, some of the geometric
> types provide btree opclasses that sort by area.  Even for apparently
> simple types like float8 there are interesting questions like whether
> minus zero is the same as plus zero.
> 
> The messiness here is not just due to lack of a notation.

FWIW *I* (but others might) don't plan to support that case for now, it
just seems to be too messy for far too little benefit.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

13 January 2013, 20:04:45

On 01/13/2013 06:02 PM, Markus Wanner wrote:
> On 01/13/2013 12:30 PM, Hannu Krosing wrote:
>> On 01/13/2013 10:49 AM, Hannu Krosing wrote:
>>> Does this hint that postgreSQL also needs an sameness operator
>>> ( "is" or "===" in same languages).
>> How do people feel about adding a real sameness operator ?
> We'd need to define what "sameness" means. If this goes toward "exact
> match in binary representation", this gets a thumbs-up from me.
>
> As a first step in that direction, I'd see adjusting send() and recv()
> functions to use a portable binary format. A "sameness" operator could
> then be implemented by simply comparing two value's send() outputs.
This seems like a good definition of "sameness" to me - if binary
images are bitwise same, then the values are the same. And if
both are fields of the same type and NULLs then these are also "same"

And defining a cross-platform binary format also a good direction
of movement in implementing this.

I'd just start with what send() and recv() on each type produces
now using GCC on 64bit Intel and move towards adjusting others
to match. For a period anything else would still be allowed, but
be "non-standard"

I have no strong opinion on typed NULLs, though I'd like them
to also be "the same" for a sake of simplicity.

As this would be non-standard anyway, I'd make a row of all nulls NOT 
"be the same" as NULL

This would be much easier to explain than losing the "IS NULL"-ness at 
nesting level 3 ;)

Hannu

>
> Regards
>
> Markus Wanner
>
>

Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Dimitri Fontaine

Date:

13 January 2013, 20:06:38

Hannu Krosing <hannu@2ndQuadrant.com> writes:
>> Does this hint that postgreSQL also needs an sameness operator
>> ( "is" or "===" in same languages).
>
> How do people feel about adding a real sameness operator ?

Well. I would prefer it if we can bypass the need for it.

Then Do we need the full range of eq, eql, equal and equalp predicates,
and would all of them allow overriding or just some?
 http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node74.html

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

13 January 2013, 20:47:13

On 01/13/2013 12:28 AM, Noah Misch wrote:
> [Catching up on old threads.]
>
> On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
>> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>>> PostgreSQL-R
>>> No. There must be some way to logically identify the tuple.
>> It can be done as selecting on _all_ attributes and updating/deleting
>> just the first matching row
>>
>> create cursor ...
>> select from t ... where t.* = (....)
>> fetch one ...
>> delete where current of ...
>>
>> This is on distant (round 3 or 4) roadmap for this work, just was
>> interested
>> if you had found any better way of doing this :)
> That only works if every attribute's type has a notion of equality ("xml" does
> not).  The equality operator may have a name other than "=", and an operator
> named "=" may exist with semantics other than equality ("box" is affected).
> Code attempting this replication strategy should select an equality operator
> the way typcache.c does so.
A method for making this work as PostgreSQL works now would be to
compare "textual representations" of tuples

create cursor ...
select from t ... where t::text = '(<image of original row>)'
fetch one ...
delete where current of ...

But of course having an operator for "sameness" without needing to convert to text
would be better

----------------
Hannu

Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Hannu Krosing

Date:

14 January 2013, 08:51:21

On 01/13/2013 08:06 PM, Dimitri Fontaine wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
>>> Does this hint that postgreSQL also needs an sameness operator
>>> ( "is" or "===" in same languages).
>> How do people feel about adding a real sameness operator ?
> Well. I would prefer it if we can bypass the need for it.
What is actually sufficient for current problem is sameness
which compares outputs of type output functions and also
considers NULLs to be the same.

The reason for not providing equality for xml was not that two xml
files which compare equal as text could be considered unequal in
any sense but that there are some other textual representations
of the same xml which could also be considered to be equal, like different whitespace between tag and attribute
> Then Do we need the full range of eq, eql, equal and equalp predicates,
> and would all of them allow overriding or just some?
I consider sameness as basic thing as IS NULL, so the sameness
should not be overridable. Extending IS NOT DISTINCT FROM to
do this comparison instead of current '=' seems reasonable.

That is

SELECT '<tag/>'::xml IS DISTINCT FROM '<tag />'::xml

should return TRUE as long as the internal representation of the
two differ and even after you add equality operator to xml
which compares some canonic form of xml and thus would make

SELECT '<tag/>'::xml = '<tag />'::xml ;

be TRUE.


Regards,
Hannu

>
>    http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node74.html
>
> Regards,

Re: Re: logical changeset generation v3 - comparison to Postgres-R change set format

From

Markus Wanner

Date:

14 January 2013, 09:06:30

On 01/13/2013 09:04 PM, Hannu Krosing wrote:
> I'd just start with what send() and recv() on each type produces
> now using GCC on 64bit Intel and move towards adjusting others
> to match. For a period anything else would still be allowed, but
> be "non-standard"

Intel being little endian seems like a bad choice to me, given that
send/recv kind of implies network byte ordering. I'd rather not tie this
to any particular processor architecture at all (at least not solely on
the ground that it's the most common one at the time).

I have no strong opinion on "sameness" of NULLs and could also imagine
that to throw some kind of invalid operation error. Based on the ground
that neither is a value and it's unclear what send() method to use at all.

FWIW, trying to determine the length of a sent NULL gives an interesting
result that I don't currently understand.

> psql (9.2.2)
> Type "help" for help.
> 
> postgres=# SELECT length(int4send(NULL));
>  length 
> --------
>        
> (1 row)
> 
> postgres=# SELECT length(float4send(NULL));
>  length 
> --------
>        
> (1 row)
> 
> postgres=# SELECT length(textsend(NULL));
>  length 
> --------
>        
> (1 row)
>
> postgres=# SELECT length(textsend(NULL) || '\000'::bytea);
>  length 
> --------
>        
> (1 row)

Regards

Markus Wanner