Thread: [RFC][PATCH] wal decoding, attempt #2
Hi,

It took me far longer than I planned, it's not finished, but time is running out. I would like some feedback that I am not going astray at this point... *I* think the general approach is sound and a good way forward that provides the basic infrastructure for many (all?) of the scenarios we talked about before.

Anyway, here is my next attempt at $TOPIC. Let's start with a quick demo (via psql):

/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';

DROP TABLE IF EXISTS replication_example;
SELECT pg_current_xlog_insert_location();
CHECKPOINT;

CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;

ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);

BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);
commit;

ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);

BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;

ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);

ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);

SELECT pg_current_xlog_insert_location();

---- Somewhat later ----

SELECT decode_xlog('0/1893D78', '0/18BE398');
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]: (null)
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING: COMMIT
 decode_xlog
-------------
 t
(1 row)

As you can see, the patchset can decode several changes made to a table even though we used DDL on it. Not everything is handled yet, but it's a prototype after all ;)

The way this works is: a new component called SnapshotBuilder analyzes the xlog and builds a special kind of Snapshot. This works in a somewhat similar way to the KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls an 'apply_change' callback for a single change (INSERT|UPDATE|DELETE) it locally overrides the normal SnapshotNow semantics used for catalog access with one of the previously built snapshots. Those should behave just the same as a normal SnapshotNow would have behaved when the tuple change was written to the xlog.

This patch doesn't provide anything that uses the new infrastructure for anything real, but I think that's good. Let's get this into something committable and then add new things using it!

Small overview over the individual patches that will come as separate mails:

old, Alvaro is doing this properly right now, separate thread:
[01] Add embedded list interface (header only)

A new piece of infrastructure (for k-way mergesort), pretty much untested, good idea in general I think, not very interesting:
[02] Add minimal binary heap implementation

Boring, old:
[03] Add support for a generic wal reading facility dubbed XLogReader

Boring, old, borked:
[04] add simple xlogdump tool

Slightly changed to use (tablespace, relfilenode), possibly similar problems to earlier, not interesting at this point:
[05] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)

Unchanged:
[06] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical

I didn't implement proper cache handling, so I need to use the big hammer...:
[07] Make InvalidateSystemCaches public

The major piece:
[08] Introduce wal decoding via catalog timetravel

[08] has loads of deficiencies. To cite the commit:

The snapshot building has the most critical infrastructure but misses several important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The decode_xlog() function is *purely* a debugging tool that I do not want to keep in the long run. I introduced it so we can concentrate on the topic at hand without involving even more moving parts (see the next paragraph)...

Some parts of this I would like to only discuss later, in separate threads, to avoid cluttering this one more than necessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal

I have some work on top of this that handles ComboCids and CommandIds correctly (and thus mixed ddl/dml transactions), but it's simply not finished enough. I am pretty sure by now that it works even with those additional complexities.

So, I am unfortunately too tired to write more than this... It will have to suffice. I plan to release a newer version with more documentation soon.

Comments about the approach or even the general direction of the implementation? Questions?

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
This is basically untested.
---
 src/backend/lib/Makefile     |   2 +-
 src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/simpleheap.h |  91 +++++++++++++++
 3 files changed, 347 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/simpleheap.c
 create mode 100644 src/include/lib/simpleheap.h
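Since the simpleheap interface itself isn't shown here, a minimal illustrative sketch of the general idea - an array-based binary min-heap with a caller-supplied comparator, which is the structure a k-way merge needs - might look roughly like this. All names below are made up for illustration and are not the proposed simpleheap.h API:

#include <stdlib.h>

/* illustrative only -- not the proposed simpleheap.h API */
typedef struct SimpleHeap
{
    void      **nodes;                          /* array of entries */
    int         size;                           /* entries in use */
    int         space;                          /* allocated slots */
    int       (*compare) (const void *a, const void *b);
} SimpleHeap;

/* restore the heap property after placing an element at position i */
static void
heap_sift_up(SimpleHeap *heap, int i)
{
    while (i > 0)
    {
        int     parent = (i - 1) / 2;
        void   *tmp;

        if (heap->compare(heap->nodes[i], heap->nodes[parent]) >= 0)
            break;
        tmp = heap->nodes[i];
        heap->nodes[i] = heap->nodes[parent];
        heap->nodes[parent] = tmp;
        i = parent;
    }
}

/* add one element; the caller guarantees size < space */
static void
heap_add(SimpleHeap *heap, void *entry)
{
    heap->nodes[heap->size] = entry;
    heap_sift_up(heap, heap->size);
    heap->size++;
}

A k-way merge would presumably keep one entry per input stream in such a heap and repeatedly pop the smallest, which seems to be the use case hinted at in the patch overview.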
Adds a single and a double linked list which can easily be embedded into other data structures and can be used without any additional allocations.

Problematic: It requires USE_INLINE to be used. It could be remade to fall back to externally defined functions if that is not available, but that hardly seems sensible in this day and age. Besides, the speed hit would be noticeable and it's only used in new code, which could be disabled on machines - given they still exist - without proper support for inline functions.
---
 src/include/utils/ilist.h | 253 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 253 insertions(+)
 create mode 100644 src/include/utils/ilist.h
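To illustrate what "embedded" means here: the list links live inside the entry struct itself, so list membership costs no extra allocation. A rough sketch of that idea follows; the names and exact API of the proposed ilist.h may well differ:

/* illustrative sketch of an embedded (intrusive) doubly linked list */
typedef struct dlist_node
{
    struct dlist_node *prev;
    struct dlist_node *next;
} dlist_node;

typedef struct MyEntry
{
    int         value;
    dlist_node  node;           /* list membership lives inside the entry */
} MyEntry;

/* initialize an empty circular list headed by 'head' */
static inline void
dlist_init(dlist_node *head)
{
    head->prev = head->next = head;
}

/* insert 'node' right after 'head' -- no allocation involved */
static inline void
dlist_push_after(dlist_node *head, dlist_node *node)
{
    node->prev = head;
    node->next = head->next;
    head->next->prev = node;
    head->next = node;
}

An entry is linked simply by passing &entry->node; unlinking likewise only touches the embedded pointers, which is why no per-node allocation is ever needed.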
---
 src/bin/Makefile            |   2 +-
 src/bin/xlogdump/Makefile   |  25 ++++
 src/bin/xlogdump/xlogdump.c | 334 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 360 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/xlogdump/Makefile
 create mode 100644 src/bin/xlogdump/xlogdump.c
Pieces of this are in commit: make relfilenode lookup (tablespace, relfilenode).
---
 src/backend/utils/cache/inval.c | 2 +-
 src/include/utils/inval.h       | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)
[PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants to do so is not tightly integrated into xlog.c is rather hard and would require changes to rather integral parts of the recovery code, which doesn't seem to be a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
 src/backend/access/transam/Makefile     |    2 +-
 src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  264 ++++
 3 files changed, 1297 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h
[PATCH 6/8] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical
From: Andres Freund
This adds a new wal_level value 'logical'.

Missing cases:
- heap_multi_insert
- primary key changes for updates
- no primary key
- LOG_NEWPAGE
---
 src/backend/access/heap/heapam.c        | 135 +++++++++++++++++++++++++++++---
 src/backend/access/transam/xlog.c       |   1 +
 src/backend/catalog/index.c             |  74 +++++++++++++++++
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/include/access/xlog.h               |   3 +-
 src/include/catalog/index.h             |   4 +
 6 files changed, 207 insertions(+), 12 deletions(-)
[PATCH 5/8] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)
From: Andres Freund
This patch is problematic because formally indexes used by syscaches need to be unique; this one is not, though, because of 0/InvalidOid relfilenode entries for nailed/shared catalog entries. Those values cannot be sensibly queried from the catalog anyway (the relmapper infrastructure needs to be used).

It might be nicer to add infrastructure to do this properly, I just don't have a clue what the best way for this would be.
---
 src/backend/utils/cache/syscache.c | 11 +++++++++++
 src/include/catalog/indexing.h     |  2 ++
 src/include/catalog/pg_proc.h      |  1 +
 src/include/utils/syscache.h       |  1 +
 4 files changed, 15 insertions(+)
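For context, a lookup through such a syscache would presumably look something like the sketch below. The cache identifier name and the omission of the relmapper special case for mapped relations (which store 0 in relfilenode) are assumptions on my part, not code from the patch:

#include "postgres.h"
#include "utils/syscache.h"

/* hypothetical sketch: resolve (reltablespace, relfilenode) to a pg_class OID
 * via the proposed syscache; mapped relations would need the relmapper instead */
static Oid
relation_oid_by_filenode(Oid reltablespace, Oid relfilenode)
{
    /* RELFILENODE is the proposed cache identifier; the name is assumed here */
    return GetSysCacheOid2(RELFILENODE,
                           ObjectIdGetDatum(reltablespace),
                           ObjectIdGetDatum(relfilenode));
}

GetSysCacheOid2() returns InvalidOid when no matching tuple is found, which is presumably how "no such relation" would be reported to callers.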
This introduces several things:
* applycache module which reassembles transactions from a stream of interspersed changes
* snapbuilder which builds catalog snapshots so that tuples from wal can be understood
* wal decoding into an applycache
* decode_xlog(lsn, lsn) debugging function

The applycache provides 3 major callbacks:
* apply_begin
* apply_change
* apply_commit

It is missing several parts:
- spill-to-disk
- resource usage controls
- command id handling
- passing of the correct mvcc snapshot (already has it, just doesn't pass)

The snapshot building has the most critical infrastructure but misses several important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The xlog decoding also misses several parts:
- HEAP_NEWPAGE support
- HEAP2_MULTI_INSERT support
- handling of table rewrites
---
 src/backend/replication/Makefile               |    2 +
 src/backend/replication/logical/Makefile       |   19 +
 src/backend/replication/logical/applycache.c   |  574 +++++++++++++
 src/backend/replication/logical/decode.c       |  366 +++++++++
 src/backend/replication/logical/logicalfuncs.c |  237 ++++++
 src/backend/replication/logical/snapbuild.c    | 1045 ++++++++++++++++++++++++
 src/backend/utils/time/tqual.c                 |  161 ++++
 src/include/access/transam.h                   |    5 +
 src/include/catalog/pg_proc.h                  |    3 +
 src/include/replication/applycache.h           |  239 ++++++
 src/include/replication/decode.h               |   26 +
 src/include/replication/snapbuild.h            |  119 +++
 src/include/utils/tqual.h                      |   21 +-
 13 files changed, 2816 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/replication/logical/Makefile
 create mode 100644 src/backend/replication/logical/applycache.c
 create mode 100644 src/backend/replication/logical/decode.c
 create mode 100644 src/backend/replication/logical/logicalfuncs.c
 create mode 100644 src/backend/replication/logical/snapbuild.c
 create mode 100644 src/include/replication/applycache.h
 create mode 100644 src/include/replication/decode.h
 create mode 100644 src/include/replication/snapbuild.h
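To give an idea of how those three callbacks are meant to be consumed, here is a minimal sketch of a consumer that merely logs what it sees, modelled on the decode_xlog() registration code quoted later in this thread. The callback signatures and struct member names follow the ApplyCache typedefs from the patch, but the logging bodies are made up for illustration:

/* minimal illustrative consumer of the three applycache callbacks */
static void
demo_begin_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
    elog(WARNING, "BEGIN");
}

static void
demo_change(ApplyCache *cache, ApplyCacheTXN *txn,
            ApplyCacheTXN *subtxn, ApplyCacheChange *change)
{
    /* a real consumer would decode the tuple here, relying on the catalog
     * snapshot that the snapbuilder has installed for this change */
    elog(WARNING, "change in relfilenode %u", change->relnode.relNode);
}

static void
demo_commit_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
    elog(WARNING, "COMMIT");
}

/* wiring it up, inside whatever function sets up decoding,
 * just as decode_xlog() does with its own callbacks */
apply_cache = ApplyCacheAllocate();
apply_cache->begin = demo_begin_txn;
apply_cache->apply_change = demo_change;
apply_cache->commit = demo_commit_txn;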
Hi,

A last note: A git tree of this is at
git://git.postgresql.org/git/users/andresfreund/postgres.git
branch xlog-decoding-rebasing-cf2

Checkout with:
git clone --branch xlog-decoding-rebasing-cf2 git://git.postgresql.org/git/users/andresfreund/postgres.git

Webview:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2

That branch will be regularly rebased to a new master, fixes/new features, and a pgindent run over the new files...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, September 15, 2012 03:14:32 AM Andres Freund wrote:
> That branch will be regularly rebased to a new master, fixes/new features,
> and a pgindent run over the new files...

I fixed up the formatting of the new stuff (xlogreader, ilist are submitted separately, no point in doing anything there). Pushed to the repo mentioned upthread.

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Now that I proposed a new syscache upthread it's easily possible to provide pg_relation_by_filenode, which I wished for multiple times in the past when looking at filesystem activity and wondering which table does what.

You can sort of get the same result via

SELECT oid
FROM (
    SELECT oid, pg_relation_filenode(oid::regclass) filenode
    FROM pg_class
    WHERE relkind != 'v'
) map
WHERE map.filenode = ...;

but that's neither efficient nor obvious.

So, two patches to do this:

Did others need this in the past? I can live with the 2nd patch living in a private extension somewhere. The first one would also be useful for some error/debug messages during decoding...

Greetings,

Andres
[PATCH 2/2] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
From: Andres Freund
This requires the previously added RELFILENODE syscache.
---
 doc/src/sgml/func.sgml         | 23 ++++++++++++-
 src/backend/utils/adt/dbsize.c | 78 ++++++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 4 files changed, 103 insertions(+), 1 deletion(-)
[PATCH 1/2] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
From: Andres Freund
---
 src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
 src/include/utils/relmapper.h       |  2 ++
 2 files changed, 55 insertions(+)
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> This requires the previously added RELFILENODE syscache.

[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one, and I doubt it would work given that the contents of pg_class.relfilenode aren't unique (the zero entries are the problem).

regards, tom lane
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
From: Andres Freund
On Monday, September 17, 2012 12:35:32 AM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > This requires the previously added RELFILENODE syscache.
>
> [ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
> and I doubt it would work given that the contents of
> pg_class.relfilenode aren't unique (the zero entries are the problem).

Well, one patch upthread ;). It mentions the problem of it not being unique due to relfilenode in (reltablespace, relfilenode) being 0 for shared/nailed catalogs.

I am not really sure yet how big a problem it is for the caching infrastructure that values that shouldn't ever get queried (because the relfilenode is actually different) are duplicated. Reading code about all that atm.

Robert suggested writing a specialized cache akin to what's done in attoptcache.c or such. I haven't formed an opinion on what's the way forward on that topic.

But anyway, I don't see how the wal decoding stuff can progress without some variant of that mapping, so I sure hope I/we can build something. Changing that aspect of the patch should be trivial...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 15.09.2012 03:39, Andres Freund wrote:
>
> Features:
> - streaming reading/writing
> - filtering
> - reassembly of records
>
> Reusing the ReadRecord infrastructure in situations where the code that wants
> to do so is not tightly integrated into xlog.c is rather hard and would require
> changes to rather integral parts of the recovery code which doesn't seem to be
> a good idea.

My previous objections to this approach still apply:

1. I don't want to maintain a second copy of the code to read xlog.

2. We should focus on reading WAL, I don't see the point of mixing WAL writing into this.

3. I don't like the callback-style API.

I came up with the attached. I moved ReadRecord and some supporting functions from xlog.c to xlogreader.c, and made it operate on XLogReaderState instead of global variables. As discussed before, I didn't like the callback-style API, I think the consumer of the API should rather just call ReadRecord repeatedly to get each record. So that's what I did.

There is still one callback, XLogPageRead(), to obtain a given page in WAL. The XLogReader facility is responsible for decoding the WAL into records, but the user of the facility is responsible for supplying the physical bytes, via the callback. So the usage is like this:

/*
 * Callback to read the page starting at 'RecPtr' into *readBuf. It's
 * up to you to do this any way you like. Typically you'd read from a
 * file. The WAL recovery implementation of this in xlog.c is more
 * complicated. It checks the archive, waits for streaming replication
 * etc.
 */
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char *readBuf,
               void *private_data)
{
    ...
}

state = XLogReaderAllocate(&MyXLogPageRead);

while ((record = XLogReadRecord(state, ...)))
{
    /* do something with the record */
}

XLogReaderFree(state);

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
Hi Heikki,

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> On 15.09.2012 03:39, Andres Freund wrote:
> > Features:
> > - streaming reading/writing
> > - filtering
> > - reassembly of records
> >
> > Reusing the ReadRecord infrastructure in situations where the code that
> > wants to do so is not tightly integrated into xlog.c is rather hard and
> > would require changes to rather integral parts of the recovery code
> > which doesn't seem to be a good idea.
>
> My previous objections to this approach still apply. 1. I don't want to
> maintain a second copy of the code to read xlog.

Yes. I agree. And I am willing to provide an implementation of this should my xlogreader variant get a bit more buy-in.

> 2. We should focus on reading WAL, I don't see the point of mixing WAL
> writing into this.

If you write something that filters/analyzes and then forwards WAL and you want to do that without a big overhead (i.e. completely reassembling everything, and then deassembling it again for writeout) it's hard to do that without integrating both sides.

Also, I want to read records incrementally/partially just as data comes in, which again is hard to combine with writing out the data again.

> 3. I don't like the callback-style API.

I tried to accommodate that by providing:

extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);

which does exactly that.

> I came up with the attached. I moved ReadRecord and some supporting
> functions from xlog.c to xlogreader.c, and made it operate on
> XLogReaderState instead of global variables. As discussed before,
> I didn't like the callback-style API, I think the consumer of the API
> should rather just call ReadRecord repeatedly to get each record. So
> that's what I did.

The problem with that kind of API is that, at least as far as I can see, it never can operate on incomplete/partial input. You need to buffer larger amounts of xlog somewhere and you need to be aware of record boundaries. Both are things I dislike in a more generic user than xlog.c.

> There is still one callback, XLogPageRead(), to obtain a given page in
> WAL. The XLogReader facility is responsible for decoding the WAL into
> records, but the user of the facility is responsible for supplying the
> physical bytes, via the callback.

Makes sense.

> So the usage is like this:
>
> /*
>  * Callback to read the page starting at 'RecPtr' into *readBuf. It's
>  * up to you to do this any way you like. Typically you'd read from a
>  * file. The WAL recovery implementation of this in xlog.c is more
>  * complicated. It checks the archive, waits for streaming replication
>  * etc.
>  */
> static bool
> MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char
> *readBuf, void *private_data)
> {
>     ...
> }
>
> state = XLogReaderAllocate(&MyXLogPageRead);
>
> while ((record = XLogReadRecord(state, ...)))
> {
>     /* do something with the record */
> }

If you don't want the capability to forward/filter the data and read partial data without regard for record constraints/buffering, your patch seems to be quite a good start. It misses xlogreader.h though...

Do my aims make any sense to you?

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 11:12, Andres Freund wrote:
> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
>> On 15.09.2012 03:39, Andres Freund wrote:
>> 2. We should focus on reading WAL, I don't see the point of mixing WAL
>> writing into this.
> If you write something that filters/analyzes and then forwards WAL and you
> want to do that without a big overhead (i.e. completely reassembling
> everything, and then deassembling it again for writeout) it's hard to do
> that without integrating both sides.

It seems really complicated to filter/analyze WAL records without reassembling them, anyway. The user of the facility is in charge of reading the physical data, so you can still access the raw data, for forwarding purposes, in addition to the reassembled records.

Or what exactly do you mean by "completely deassembling"? I read that to mean dealing with page boundaries, ie. if a record is split across pages, copy parts into a contiguous temporary buffer.

> Also, I want to read records incrementally/partially just as data comes in,
> which again is hard to combine with writing out the data again.

You mean, you want to start reading the first half of a record, before the 2nd half is available? That seems complicated. I'd suggest keeping it simple for now, and optimize later if necessary. Note that before you have the whole WAL record, you cannot CRC check it, so you don't know if it's in fact a valid WAL record.

>> I came up with the attached. I moved ReadRecord and some supporting
>> functions from xlog.c to xlogreader.c, and made it operate on
>> XLogReaderState instead of global variables. As discussed before,
>> I didn't like the callback-style API, I think the consumer of the API
>> should rather just call ReadRecord repeatedly to get each record. So
>> that's what I did.
> The problem with that kind of API is that, at least as far as I can see,
> it never can operate on incomplete/partial input. You need to buffer
> larger amounts of xlog somewhere and you need to be aware of record
> boundaries. Both are things I dislike in a more generic user than xlog.c.

I don't understand that argument. A typical large WAL record is split across 1-2 pages, maybe 3-4 at most, for an index page split record. That doesn't feel like much to me. In extreme cases, a WAL record can be much larger (e.g a commit record of a transaction with a huge number of subtransactions), but that should be rare in practice.

The user of the facility doesn't need to be aware of record boundaries, that's the responsibility of the facility. I thought that's exactly the point of generalizing this thing, to make it unnecessary for the code that uses it to be aware of such things.

> If you don't want the capability to forward/filter the data and read
> partial data without regard for record constraints/buffering, your patch
> seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> On 17.09.2012 11:12, Andres Freund wrote:
> > On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> >> On 15.09.2012 03:39, Andres Freund wrote:
> >> 2. We should focus on reading WAL, I don't see the point of mixing WAL
> >> writing into this.
> > If you write something that filters/analyzes and then forwards WAL and
> > you want to do that without a big overhead (i.e. completely reassembling
> > everything, and then deassembling it again for writeout) it's hard to do
> > that without integrating both sides.
>
> It seems really complicated to filter/analyze WAL records without
> reassembling them, anyway. The user of the facility is in charge of
> reading the physical data, so you can still access the raw data, for
> forwarding purposes, in addition to the reassembled records.

It works ;)

> Or what exactly do you mean by "completely deassembling"? I read that to
> mean dealing with page boundaries, ie. if a record is split across
> pages, copy parts into a contiguous temporary buffer.

Well, if you want to fully split reading and writing of records - which is a nice goal! - you basically need the full logic of XLogInsert again to split them apart again to write them. Alternatively you need to store record boundaries somewhere and copy that way, but in the end if you filter you need to correct CRCs...

> > Also, I want to read records incrementally/partially just as data comes
> > in, which again is hard to combine with writing out the data again.
>
> You mean, you want to start reading the first half of a record, before
> the 2nd half is available? That seems complicated.

Well, I just can say again: It works ;). Makes it easy to follow something like XLogwrtResult without taking care about record boundaries.

> I'd suggest keeping it simple for now, and optimize later if necessary.

Well, yes. The API should be able to comfortably support those cases though, which I don't think is necessarily the case in a simple, one call API as proposed.

> Note that before you have the whole WAL record, you cannot CRC check it, so
> you don't know if it's in fact a valid WAL record.

Sure. But you can start the CRC computation without any problems and finish it when the last part of the data comes in.

> >> I came up with the attached. I moved ReadRecord and some supporting
> >> functions from xlog.c to xlogreader.c, and made it operate on
> >> XLogReaderState instead of global variables. As discussed before,
> >> I didn't like the callback-style API, I think the consumer of the API
> >> should rather just call ReadRecord repeatedly to get each record. So
> >> that's what I did.
> >
> > The problem with that kind of API is that, at least as far as I can see,
> > it never can operate on incomplete/partial input. You need to buffer
> > larger amounts of xlog somewhere and you need to be aware of record
> > boundaries. Both are things I dislike in a more generic user than xlog.c.
>
> I don't understand that argument. A typical large WAL record is split
> across 1-2 pages, maybe 3-4 at most, for an index page split record.
> That doesn't feel like much to me. In extreme cases, a WAL record can be
> much larger (e.g a commit record of a transaction with a huge number of
> subtransactions), but that should be rare in practice.

Well, imagine something like the walsender that essentially follows the flush position, ideally without regard for record boundaries. It is nice to be able to send/analyze/filter as soon as possible without waiting till a page is full. And it sure would be nice to be able to read the data on the other side directly from the network, decompress it again, and only then store it to disk.

> The user of the facility doesn't need to be aware of record boundaries,
> that's the responsibility of the facility. I thought that's exactly the
> point of generalizing this thing, to make it unnecessary for the code
> that uses it to be aware of such things.

With the proposed API it seems pretty much a requirement to wait inside the callback. That's not really nice if your process has other things to wait for as well. In my proposal you can simply do something like:

XLogReaderRead(state);

DoSomeOtherWork();

if (CheckForMessagesFromWalreceiver())
    ProcessMessages();
else if (state->needs_input)
    UseLatchOrSelectOnInputSocket();
else if (state->needs_output)
    UseSelectOnOutputSocket();

but you can also do something like waiting on a Latch but *also* on other fds.

> > If you don't want the capability to forward/filter the data and read
> > partial data without regard for record constraints/buffering, your patch
> > seems to be quite a good start. It misses xlogreader.h though...
>
> Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> > On 17.09.2012 11:12, Andres Freund wrote:
> > > On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> > > If you don't want the capability to forward/filter the data and read
> > > partial data without regard for record constraints/buffering your patch
> > > seems to be quite a good start. It misses xlogreader.h though...
> >
> > Ah sorry, patch with xlogreader.h attached.
>
> Will look at it in a second.

It seems we would need one additional callback for both approaches like:

->error(severity, format, ...)

For both to avoid having to draw in elog.c.

Otherwise it looks sensible, although it has a more minimal approach (which might or might not be a good thing). The one thing I definitely like is that nearly all of it is tried and true code...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
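For what it's worth, a standalone tool could satisfy such an error callback with very little code. The following is only a sketch of the idea; the callback signature, the severity constant, and how it would be registered with the reader are all assumptions, since the thread only mentions "->error(severity, format, ...)" in passing:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define XLR_FATAL 1             /* hypothetical severity value */

/* hypothetical error callback for a standalone tool such as xlogdump */
static void
xlogdump_error(int severity, const char *fmt, ...)
{
    va_list     args;

    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    fputc('\n', stderr);

    if (severity >= XLR_FATAL)
        exit(1);                /* no elog.c available, so just bail out */
}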
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 12:07, Andres Freund wrote:
> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
>> The user of the facility doesn't need to be aware of record boundaries,
>> that's the responsibility of the facility. I thought that's exactly the
>> point of generalizing this thing, to make it unnecessary for the code
>> that uses it to be aware of such things.
> With the proposed API it seems pretty much a requirement to wait inside the
> callback.

Or you can return false from the XLogPageRead() callback if the requested page is not available. That will cause ReadRecord() to return NULL, and you can retry when more WAL is available.

- Heikki
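In other words, the non-blocking pattern under this proposal would look roughly like the sketch below. The XLogReadRecord() arguments and the WaitForMoreWAL()/ConsumeRecord() helpers are stand-ins made up for illustration; only the return-NULL-and-retry behaviour comes from the description above:

/* illustrative polling loop around the proposed caller-driven API */
for (;;)
{
    XLogRecord *record;

    /* the start-position argument is assumed here for illustration */
    record = XLogReadRecord(state, startpoint);

    if (record == NULL)
    {
        /*
         * Our XLogPageRead() callback returned false because the page
         * wasn't available yet: wait (latch, select(), ...) and retry.
         */
        WaitForMoreWAL();
        continue;
    }

    ConsumeRecord(record);      /* the consumer's actual work */
}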
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 13:01, Andres Freund wrote:
> On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
>> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
>>> On 17.09.2012 11:12, Andres Freund wrote:
>>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
>>>> If you don't want the capability to forward/filter the data and read
>>>> partial data without regard for record constraints/buffering your patch
>>>> seems to be quite a good start. It misses xlogreader.h though...
>>>
>>> Ah sorry, patch with xlogreader.h attached.
>>
>> Will look at it in a second.
> It seems we would need one additional callback for both approaches like:
>
> ->error(severity, format, ...)
>
> For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from ReadRecord. The caller could then do whatever it pleases with it, like ereport() it to the log or PANIC. I think I'd like that better.

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
> On 17.09.2012 13:01, Andres Freund wrote:
> > On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
> >> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> >>> On 17.09.2012 11:12, Andres Freund wrote:
> >>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> >>>> If you don't want the capability to forward/filter the data and read
> >>>> partial data without regard for record constraints/buffering your
> >>>> patch seems to be quite a good start. It misses xlogreader.h
> >>>> though...
> >>>
> >>> Ah sorry, patch with xlogreader.h attached.
> >>
> >> Will look at it in a second.
> >
> > It seems we would need one additional callback for both approaches like:
> >
> > ->error(severity, format, ...)
> >
> > For both to avoid having to draw in elog.c.
>
> Yeah. Another approach would be to return the error string from
> ReadRecord. The caller could then do whatever it pleases with it, like
> ereport() it to the log or PANIC. I think I'd like that better.

That seems a bit more complex from a memory management perspective as you probably would have to sprintf() into some buffer. We cannot rely on a backend environment with memory contexts around et al...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 14:42, Andres Freund wrote:
> On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
>> On 17.09.2012 13:01, Andres Freund wrote:
>>> On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
>>>> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
>>>>> On 17.09.2012 11:12, Andres Freund wrote:
>>>>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
>>>>>> If you don't want the capability to forward/filter the data and read
>>>>>> partial data without regard for record constraints/buffering your
>>>>>> patch seems to be quite a good start. It misses xlogreader.h
>>>>>> though...
>>>>>
>>>>> Ah sorry, patch with xlogreader.h attached.
>>>>
>>>> Will look at it in a second.
>>>
>>> It seems we would need one additional callback for both approaches like:
>>>
>>> ->error(severity, format, ...)
>>>
>>> For both to avoid having to draw in elog.c.
>>
>> Yeah. Another approach would be to return the error string from
>> ReadRecord. The caller could then do whatever it pleases with it, like
>> ereport() it to the log or PANIC. I think I'd like that better.
> That seems a bit more complex from a memory management perspective as you
> probably would have to sprintf() into some buffer. We cannot rely on a
> backend environment with memory contexts around et al...

Hmm. I was thinking that making this work in a non-backend context would be too hard, so I didn't give that much thought, but I guess there aren't many dependencies on backend functions after all. palloc/pfree are straightforward to replace with malloc/free. That's what we could easily do with the error messages too, just malloc a suitably sized buffer.

How does a non-backend program get access to xlogreader.c? Copy xlogreader.c from the source tree at build time and link into the program? Or should we turn it into a shared library?

- Heikki
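The palloc/malloc substitution mentioned here can be done with a couple of thin wrappers; the macro names below are purely illustrative, and FRONTEND is used only as an example of how the two build environments might be told apart:

/* illustrative only -- how xlogreader.c allocations could be abstracted */
#ifdef FRONTEND
#include <stdlib.h>
#define xlr_alloc(size)   malloc(size)
#define xlr_free(ptr)     free(ptr)
#else
#include "utils/palloc.h"
#define xlr_alloc(size)   palloc(size)
#define xlr_free(ptr)     pfree(ptr)
#endif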
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 01:50:33 PM Heikki Linnakangas wrote:
> On 17.09.2012 14:42, Andres Freund wrote:
> > On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
> >> On 17.09.2012 13:01, Andres Freund wrote:
> >>> On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
> >>>> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> >>>>> On 17.09.2012 11:12, Andres Freund wrote:
> >>>>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> >>>>>> If you don't want the capability to forward/filter the data and read
> >>>>>> partial data without regard for record constraints/buffering your
> >>>>>> patch seems to be quite a good start. It misses xlogreader.h
> >>>>>> though...
> >>>>>
> >>>>> Ah sorry, patch with xlogreader.h attached.
> >>>>
> >>>> Will look at it in a second.
> >>>
> >>> It seems we would need one additional callback for both approaches
> >>> like:
> >>>
> >>> ->error(severity, format, ...)
> >>>
> >>> For both to avoid having to draw in elog.c.
> >>
> >> Yeah. Another approach would be to return the error string from
> >> ReadRecord. The caller could then do whatever it pleases with it, like
> >> ereport() it to the log or PANIC. I think I'd like that better.
> >
> > That seems a bit more complex from a memory management perspective as you
> > probably would have to sprintf() into some buffer. We cannot rely on a
> > backend environment with memory contexts around et al...
>
> Hmm. I was thinking that making this work in a non-backend context would
> be too hard, so I didn't give that much thought, but I guess there aren't
> many dependencies on backend functions after all. palloc/pfree are
> straightforward to replace with malloc/free.

Hm. I thought that it was pretty much a design requirement that this is usable outside of the backend environment?

> That's what we could easily do with the error messages too, just malloc a
> suitably sized buffer.

Not very comfortable though... Especially if you need to return an error from the read_page callback...

> How does a non-backend program get access to xlogreader.c? Copy
> xlogreader.c from the source tree at build time and link into the
> program? Or should we turn it into a shared library?

Not really sure. I thought about just putting it in pgport or such, but that seemed ugly as well.

The bin/xlogdump hack, which I find really helpful, at first simply had a dependency on ../../backend/access/transam/xlogreader.o which worked fine. Till it needed more because of *_desc routines... But Alvaro started to work on this, although I don't know when he will be able to finish it.

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 12:52:32 PM Heikki Linnakangas wrote:
> On 17.09.2012 12:07, Andres Freund wrote:
> > On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> >> The user of the facility doesn't need to be aware of record boundaries,
> >> that's the responsibility of the facility. I thought that's exactly the
> >> point of generalizing this thing, to make it unnecessary for the code
> >> that uses it to be aware of such things.
> >
> > With the proposed API it seems pretty much a requirement to wait inside
> > the callback.
>
> Or you can return false from the XLogPageRead() callback if the
> requested page is not available. That will cause ReadRecord() to return
> NULL, and you can retry when more WAL is available.

That requires building quite a bit of knowledge on the outside:
* you need to transport the information that you need more input via some external variable/->private_data
* you need to transport at which RecPtr you needed more data
* you need to signal that you're not dealing with an invalid record after returning, given both conditions return NULL
* you need to buffer all incoming data somewhere if it comes from the network or similar, because at the next call XLogReadRecord will restart reading from the beginning

Sorry if I sound sceptical! If I had your patch in my hands half a year ago I would have been very happy, but after building the more generic version that can do all of the above (including a compatible XLogReaderReadOne(state)) it's a bit hard to do that. Not sure if it's just the feeling of possibly having wasted the time...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 17.09.2012 13:01, Andres Freund wrote:
>> It seems we would need one additional callback for both approaches like:
>> ->error(severity, format, ...)
>> For both to avoid having to draw in elog.c.

> Yeah. Another approach would be to return the error string from
> ReadRecord. The caller could then do whatever it pleases with it, like
> ereport() it to the log or PANIC. I think I'd like that better.

I think it's basically insane to imagine that you can carve out a non-trivial piece of the backend that doesn't contain any elog calls. There's too much low-level infrastructure, such as palloc, that could call it. Even if you managed to make it safe at the instant the feature is committed, the odds it would stay safe over time are negligible.

Furthermore, returning enough state for useful error messages back out of multiple layers of function call is going to be notationally messy, and will end up requiring complicated infrastructure barely simpler than elog anyway.

It'd be a lot better for the wal-dumping program to supply a cut-down version of elog than to try to promise that all errors will be returned back from ReadRecord.

regards, tom lane
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 17:08, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 17.09.2012 13:01, Andres Freund wrote:
>>> It seems we would need one additional callback for both approaches like:
>>> ->error(severity, format, ...)
>>> For both to avoid having to draw in elog.c.
>
>> Yeah. Another approach would be to return the error string from
>> ReadRecord. The caller could then do whatever it pleases with it, like
>> ereport() it to the log or PANIC. I think I'd like that better.
>
> I think it's basically insane to imagine that you can carve out a
> non-trivial piece of the backend that doesn't contain any elog calls.
> There's too much low-level infrastructure, such as palloc, that could
> call it. Even if you managed to make it safe at the instant the feature
> is committed, the odds it would stay safe over time are negligible.

I wasn't thinking that we'd completely eliminate all elog() calls from ReadRecord and everything it calls, but only the "expected" ones that mean we've reached the end of valid WAL. The ones that use emode_for_corrupt_record(). Any unexpected errors like running out of file descriptors would still use ereport() like usual.

That said, Andres' suggestion of making this facility completely independent of any backend functions, making it usable in external programs, doesn't actually seem that hard. ReadRecord() itself is fairly small, as are the subroutines that validate the records. XLogReadPage(), which goes out to fetch the right xlog page from archive or whatever, is way more complicated. But that would live in the callback, so it would be free to use all the normal backend facilities. However, it means that external programs would need to supply their own (hopefully much simpler) version of XLogReadPage(); I'm not sure how that goes with Andres' plans on using xlogreader.

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 04:08:01 PM Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > On 17.09.2012 13:01, Andres Freund wrote:
> >> It seems we would need one additional callback for both approaches like:
> >> ->error(severity, format, ...)
> >> For both to avoid having to draw in elog.c.
> >
> > Yeah. Another approach would be to return the error string from
> > ReadRecord. The caller could then do whatever it pleases with it, like
> > ereport() it to the log or PANIC. I think I'd like that better.
>
> I think it's basically insane to imagine that you can carve out a
> non-trivial piece of the backend that doesn't contain any elog calls.
> There's too much low-level infrastructure, such as palloc, that could
> call it. Even if you managed to make it safe at the instant the feature
> is committed, the odds it would stay safe over time are negligible.

If you start relying on palloc all hope is gone anyway. I "only" want a standalone XLogReader because that's just too damn annoying/hard to duplicate in standalone code. There are several very useful utilities out there that are incomplete and/or unreliable for that reason. And loads of others that haven't been written because of that.

That is one of the reasons - beside finding the respective xlog.c code very hard to read/modify/extend - why I wrote a completely standalone xlogreader. One other factor was just learning how the hell all that works ;)

I still think the interface that something as plain as the proposed XLogReadRecord() provides is too restrictive for many use-cases. I agree that a wrapper with exactly such an interface for xlog.c is useful, though.

> Furthermore, returning enough state for useful error messages back out
> of multiple layers of function call is going to be notationally messy,
> and will end up requiring complicated infrastructure barely simpler than
> elog anyway.

Hm. You mean because of file/function/location?

> It'd be a lot better for the wal-dumping program to supply a cut-down
> version of elog than to try to promise that all errors will be returned
> back from ReadRecord.

Well, I suggested a ->error() callback for that reason, that seems relatively easy to wrap.

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 04:18:28 PM Heikki Linnakangas wrote:
> On 17.09.2012 17:08, Tom Lane wrote:
> > Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> >> On 17.09.2012 13:01, Andres Freund wrote:
> >>> It seems we would need one additional callback for both approaches
> >>> like: ->error(severity, format, ...)
> >>> For both to avoid having to draw in elog.c.
> >>
> >> Yeah. Another approach would be to return the error string from
> >> ReadRecord. The caller could then do whatever it pleases with it, like
> >> ereport() it to the log or PANIC. I think I'd like that better.
> >
> > I think it's basically insane to imagine that you can carve out a
> > non-trivial piece of the backend that doesn't contain any elog calls.
> > There's too much low-level infrastructure, such as palloc, that could
> > call it. Even if you managed to make it safe at the instant the feature
> > is committed, the odds it would stay safe over time are negligible.
>
> I wasn't thinking that we'd completely eliminate all elog() calls from
> ReadRecord and everything it calls, but only the "expected" ones that
> mean we've reached the end of valid WAL. The ones that use
> emode_for_corrupt_record(). Any unexpected errors like running out of
> file descriptors would still use ereport() like usual.
>
> That said, Andres' suggestion of making this facility completely
> independent of any backend functions, making it usable in external
> programs, doesn't actually seem that hard. ReadRecord() itself is fairly
> small, as are the subroutines that validate the records. XLogReadPage(),
> which goes out to fetch the right xlog page from archive or whatever, is
> way more complicated. But that would live in the callback, so it would
> be free to use all the normal backend facilities. However, it means that
> external programs would need to supply their own (hopefully much
> simpler) version of XLogReadPage(); I'm not sure how that goes with
> Andres' plans on using xlogreader.

XLogRead() from walsender.c is pretty easy to translate to backend-independent code, so I don't think that's a problem. I don't see how the backend's version is useful outside of the startup process anyway.

We could provide a default backend-independent variant that hits files in xlogreader.c, it's not much code, to avoid others copying it multiple times. I used a variant of that in the places that read from disk without any problems. Obviously not in the places that read from network, but that's shelved due to the different decoding approach atm anyway.

Regards,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
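To make the "default variant that hits files" idea concrete, a simplified page-read callback in the shape Heikki posted earlier might look like the sketch below. It is only a sketch: it assumes private_data carries a file descriptor for the WAL segment containing the requested page, hard-codes the default 16MB segment size, and omits segment switching and error reporting entirely; none of this is code from either patch.

#include <unistd.h>

/* illustrative file-based page-read callback, matching the prototype quoted
 * earlier in this thread */
static bool
FileXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
                 char *readBuf, void *private_data)
{
    int         fd = *((int *) private_data);
    /* offset of the requested page inside its (assumed 16MB) segment */
    off_t       offset = (off_t) (RecPtr % (16 * 1024 * 1024));

    if (lseek(fd, offset, SEEK_SET) < 0)
        return false;
    if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
        return false;           /* page not (fully) available yet */
    return true;
}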
Hi all,

Attached are the .txt and .pdf (both are imo readable and contain the same content) with design documentation about the proposed feature.

Christian Kruse, Marko Tiikkaja and Hannu Krosing read the document and told me about my most egregious mistakes. Thanks!

I would appreciate some feedback!

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From: Andres Freund
This time I really attached both...

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From: "md@rpzdesign.com"
Andres, nice job on the writeup.

I think one aspect you are missing is that there must be some way for the multi-masters to re-stabilize their data sets and quantify any data loss. You cannot do this without some replication intelligence in each row of each table, so that no matter how disastrous the hardware/internet failure in the cloud, the system can HEAL itself and keep going, no human beings involved.

I am laying down a standard design pattern of columns for each row:

MKEY - Primary key guaranteed unique across ALL nodes in the CLOUD with NODE information IN THE KEY. (A876543 vs B876543 or whatever)(network link UP or DOWN)
CSTP - create time stamp based on unix time stamp
USTP - last update time stamp based on unix time stamp
UNODE - Node that updated this record

Many applications already need the above information, might as well standardize it so external replication logic processing can self heal.

Postgresql tables have optional 32 bit int OIDs; you may want to consider having a replication version of the ROID, replication object ID, and then externalize the primary key generation into a loadable UDF.

Of course, ALL the nodes must be in contact with each other, not allowing significant drift on their clocks while operating. (NTP is a starter)

I just do not know of any other way to add self healing without the above information, regardless of whether you hold up transactions for synchronous or let them pass thru asynch. Regardless of whether you are getting your replication data from the WAL stream or thru the client libraries.

Also, your replication model does not really discuss busted-link replication operations; where is the intelligence for that in the operation diagram?

Every time you package up replication into the core, someone has to tear into that pile to add some extra functionality, so definitely think about providing sensible hooks for that extra bit of customization to override the base function.

Cheers,

marco

On 9/22/2012 11:00 AM, Andres Freund wrote:
> This time I really attached both...
On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:
> (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different aspects of the bi-directional replication feature planned for PostgreSQL 9.3. For those that are unfamiliar with the BDR project, a useful summary is available in an e-mail sent to this list by Simon Riggs back in April [1]. I should also point out that Andres has made available a design document that discusses aspects of this patch in particular, in another thread [2]. That document, "High level design for logical replication in Postgres", emphasizes source data generation in particular: generalising how PostgreSQL's WAL stream is generated to represent the changes it describes logically, without pointers to physical heap tuples and so forth (data generation as a concept is described in detail in an earlier design document [3]).

This particular patch can be thought of as a response to the earlier discussion [4] surrounding how to solve the problem of keeping system catalogs consistent during WAL replay on followers: "catalog time travel" is now used, rather than maintaining a synchronized catalog at the decoding end.

Andres' git tree ("xlog-decoding-rebasing-cf2" branch) [5] provides additional useful comments in commit messages (he rebases things such that each commit represents a distinct piece of functionality/patch for review). This patch is not strictly speaking an atomic unit. It is necessary to apply all 8 patches in order to get the code to compile. However, it is approximately an atomic unit, one that represents a subdivision of the entire BDR patch that it is manageable and logical to write a discrete review for. This is idiomatic use of git-format-patch, but it is unusual enough within our community for me to feel the need to note these facts.

I briefly discussed this patch with Andres off-list. His feeling is that the review process ought to focus on the design of WAL decoding, including how it fits within the larger set of replication features proposed for 9.3. There are a number of known omissions in this patch. Andres has listed some of these above, and edge-cases and so on are noted next to XXX and FIXME comments in the patch itself. I am inclined to agree with Andres' view that we should attempt to solidify community support for this prototype patch's design, or some variant, before fixing the edge-cases and working towards committable code. I will try my best to proceed on that basis.

What follows is an initial overview of the patch (or at least my understanding of the patch, which you may need to correct), and some points of concern.

> * applycache module which reassembles transactions from a stream of
> interspersed changes

This is what the design doc [2] refers to as "4.5. TX reassembly". This functionality is concentrated in applycache.c. As [2] notes, the reassembly component "was previously coined ApplyCache because it was proposed to run on replication consumers just before applying changes. This is not the case anymore. It is still called that way in the source of the patch recently submitted."

The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID.

I meant to avoid talking about anything other than the bigger picture for now, but I must ask: Why the frequent use of malloc(), particularly within applycache.c?
The obvious answer is that it's rough code and that that will change, but that still doesn't comport with my idea about how rough Postgres code should look, so I have to wonder if there's a better reason. applycache.c has an acute paucity of comments, which makes it really hard to review well. [2] doesn't have all that much to say about it either. I'm going to not comment much on this here, except to say that I think that the file should be renamed to reassembly.c or something like that, to reflect its well-specified purpose, and not how it might be used. Any cache really belongs in src/backend/utils/cache/ anyway. Applycache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit. By the way, I see that you're doing this here: + /* most callers don't need snapshot.h */ + typedef struct SnapshotData *Snapshot; Tom, Alvaro and I had a discussion about whether or not this was an acceptable way to reduce build dependencies back in July [8] – I lost that one. You're relying on a Gnu extension/C11 feature here (that is, typedef redefinition). If you find it intensely irritating that you cannot do this while following the standard to the letter, you are not alone. > * snapbuilder which builds catalog snapshots so that tuples from wal can be understood > This component analyses xlog and builds a special kind of Snapshot. This has been compared to the KnownAssignedXids machinery for Hot Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is meant by this). Since decoding only has to occur within a single backend, I guess it's sufficient that it's all within local memory (in contrast to the KnownAssignedXids array, which is in shared memory). The design document [2] really just explains the problem (which is the need for catalog metadata at a point in time to make sense of heap tuples), without describing the solution that this patch offers with any degree of detail. Rather, [2] says "How we build snapshots is somewhat intricate and complicated and seems to be out of scope for this document", which is unsatisfactory. I look forward to reading the promised document that describes this mechanism in more detail. It's really hard to guess what you might have meant to do, and why you might have done it, much less verifying the codes correctness. This functionality is concentrated in snapbuild.c. A comment in decode.c notes: + * Its possible that the separation between decode.c and snapbuild.c is a + * bit too strict, in the end they just about have the same struct. I prefer the current separation. I think it's reasonable that decode.c is sort of minimal glue code. [2] talks about this under "4.6. Snapshot building". > * wal decoding into an applycache > This functionality is concentrated in decode.c (not applycache.c – decoding just call those functions). Decoding means breaking up individual XLogRecord structs, and storing them in an applycache (applycache does this, and stores them as ApplyCacheChange records), while building a snapshot (which is needed in advance of adding tuples from records). It can be thought of as the small piece of glue between applycache and snapbuild that is called by XLogReader (DecodeRecordIntoApplyCache() is the only public function, which will be called by many xlogreader_state.finished_record-hooked plugin functions in practice, including this example one). An example of what belongs in decode.c is the way it ignores physical XLogRecords, because they are not of interest. 
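(Returning briefly to the typedef redefinition noted above, here is a small sketch of the strictly conforming alternative. It assumes nothing beyond the typedef that snapshot.h already provides; the prototype at the end is hypothetical and only there to show the usage.)

/*
 * snapshot.h already provides:
 *     typedef struct SnapshotData *Snapshot;
 * Repeating that typedef in a second header, to avoid the include, is a
 * typedef redefinition -- accepted by GCC and by C11, but not by C89/C99.
 * A conforming alternative is a plain forward declaration plus a
 * spelled-out pointer type (the function below is hypothetical):
 */
struct SnapshotData;                    /* forward declaration, no typedef */

extern void SomeDecodingHelper(struct SnapshotData *snapshot);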
By the way, why not just use struct assignment here?: + memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode)); > * decode_xlog(lsn, lsn) debugging function > You consider this to be a throw-away function that won't ever be committed. However, I strongly feel that you should move it into /contrib, so that it can serve as a sort of reference implementation for authors of decoder client code, in the same spirit as numerous existing contrib modules (think contrib/spi). I think that such a module could even be useful to people that were just a bit intellectually curious about how WAL works, which is something I'd like to encourage. Besides, simply having this code in a module will more explicitly demarcate client code (just look at logicalfuncs.c – it is technically client code, but that's too subtle right now). I don't like this code in decode_xlog(): + apply_state = (ReaderApplyState*)xlogreader_state->private_data; Why is it necessary to cast here? In other words, why is private_data a void pointer at all? Are we really well-served by presuming absolutely nothing about XlogReader's state? Wouldn't an “abstract base class” pointer be more appropriate a type for private_data? I don't think it's virtuous to remove type-safety any more than is strictly necessary. Note that I'm not asserting that you shouldn't do this – I'm merely asking the question. When developing a user-facing API, it is particularly crucial to make interfaces easy to use correctly and hard to use incorrectly. > The applycache provides 3 major callbacks: > * apply_begin > * apply_change > * apply_commit These are callbacks intended to be used by third-party modules, perhaps including a full multi-master replication implementation (though this patch isn't directly concerned with that), or even a speculative future version of a logical replication system like Slony. [2] refers to these under "4.7. Output Plugin". These are the typedef for the hook types involved (incidentally, don't like the name of these types – I think you should lose the CB): +/* XXX: were currently passing the originating subtxn. Not sure thats necessary */ +typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change); +typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn); +typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn); So we register these callbacks like this in the patch: + /* + * allocate an ApplyCache that will apply data using lowlevel calls + * without type conversion et al. This requires binary compatibility + * between both systems. + * XXX: This would be the place too hook different apply methods, like + * producing sql and applying it. + */ + apply_cache = ApplyCacheAllocate(); + apply_cache->begin = decode_begin_txn; + apply_cache->apply_change = decode_change; + apply_cache->commit = decode_commit_txn; The decode_xlog(lsn, lsn) debugging function that Andres has played with [6] (that this patch makes available, for now) is where this code comes from. Whenever ApplyCache calls an "apply_change" callback for a single change (that is, a INSERT|UPDATE|DELETE) it locally overrides the normal SnapshotNow semantics used for catalog access with a previously built snapshot. Behaviour should now be consistent with a normal snapshotNow acquired when the tuple change was originally written to WAL. Having commented specifically on modules that Andres highlighted, I'd like to highlight one myself: tqual.c . 
This module has had significant new functionality added, so it would be an omission not to remark on it in this opening e-mail, having mentioned all other modules with significant new pieces of functionality. The file has had new utility functions added that pertain to snapshot visibility during decoding - "time travel". I draw attention to this. This code is located within the new function HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is done for "dirty snapshots" (dirty snapshots are used with ri_triggers.c, for example, when even uncommitted tuples should be visible). Both of these functions are generally accessed through function pointers. Anyway, here's the code:

+ /*
+  * FIXME: The not yet existing decoding infrastructure will need to force
+  * the xmin to stay lower than what they are currently decoding.
+  */
+ bool fixme_xmin_horizon = false;

I'm sort of wondering what this is going to look like in the finished patch. This FIXME is rather hard for me to take at face value. It seems to me that the need to coordinate decoding with the xmin horizon itself represents a not insignificant engineering challenge. So, with all due respect, it would be nice if I wasn't asked to make that leap of faith. The xmin horizon prepared transaction hack needs to go.

Within tqual.h, shouldn't you have something like this, but for time travel snapshots during decoding?:

#define InitDirtySnapshot(snapshotdata) \
    ((snapshotdata).satisfies = HeapTupleSatisfiesDirty)

Alternatively, adding a variable to these two might be appropriate:

static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC};
static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC};

In any case, assigning this hook in snapbuild.c looks like a modularity violation to me. See also my observations on initialising ReaderApplyState below.

My general feeling is that the code is very under-commented, and in need of a polish, though I'm sure that you are perfectly well aware of that.

The basic way all of these components that I have described separately fit together is (if others want to follow this, refer to decode_xlog()):

1. Start with some client code "output plugin" (for now, a throw-away debugging function "decode_xlog()").

2. Client allocates an XLogReaderState. (This module is a black box to me, though it's well encapsulated so that shouldn't matter much. Heikki is reviewing this [7]. Like I said, this isn't quite an atomic unit I'm reviewing.)

3. Plugin registers various callbacks (within logicalfuncs.c). These callbacks, while appearing in this patch, are mostly NO-OPS, and are somewhat specific to XLogReader's concerns. I mostly defer to Heikki here.

4. Plugin allocates an "ApplyCache". Plugin assigns some more callbacks to "ApplyCache". This time, they're the aforementioned 3 apply cache functions.

5. Plugin assigns this new ApplyCache to a variable within the private state of the XLogReader (this private state is a subset of its main state, and is opaque to XLogReader).

6. Finally, plugin calls XLogReader(main_state).

7. At some point during its magic, XLogReader calls the hook registered in step 3, finished_record. This is all it does directly with the plugin, which it makes minimal assumptions about.

8. finished_record (which is logically a part of the "plugin") knows what type the opaque private_data actually is. It casts it to an apply_state, and calls the decoder (supplying the apply_state as an argument to DecodeRecordIntoApplyCache()).
9. During the first call (within the first record within a call to decode_xlog()), we allocate a snapshot reader.

10. Builds snapshot callback. This scribbles on our snapshot state, which essentially encapsulates a snapshot. The state (and snapshot) changes continually, once per call.

11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an XLogRecord. Decodes based on record type. Let's assume it's an XLOG_HEAP_INSERT.

12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We store the tuple metadata in our ApplyCache (some ilists, somewhere, each corresponding to an XID). We don't store the relation oid, because we don't know it yet (only the relfilenode is known from WAL).

13. We're back in XLogReader(). It calls the only callback of interest to us covered in step 3 (and not of interest to XLogReader()/Heikki) – decode_change(). It does this through the apply_cache.apply_change hook. This happens because we encounter another record, this time a commit record (in the same codepath as discussed in step 12).

14. In decode_change(), the actual function that raises the interesting WARNINGs within Andres' earlier example [6], showing actual integer/varchar Datum values for tuples previously inserted. Resolve the table oid based on relfilenode (albeit unsatisfactorily). Using a StringInfo, tupledescs, syscache and typcache, build the WARNING string.

So my general opinion of how all this fits together is that it isn't quite right. Problems include:

* Why does the ReaderApplyState get magically initialised in two stages? apply_cache is initialised in decode_xlog (or whatever plugin). Snapstate is allocated within DecodeRecordIntoApplyCache() (with a dependency on apply_cache). Shouldn't this all be happening within a single function? As you yourself have pointed out, not everyone needs to know about these snapshots.

* Maybe I've missed something, but I think you need a more satisfactory example plugin. What happens in step 14 is plainly unacceptable. You haven't adequately communicated to me how this is going to be used in logical replication. Maybe I just haven't got that far yet. I'm not impressed by the InvalidateSystemCaches() calls here and elsewhere.

* Please break out the client code as a contrib module. That separation would increase the readability of the patch.

That's all I have for now...

[1] http://archives.postgresql.org/message-id/CA+U5nMLk-Wt806zab7SJ2x5X4pqC3WE-hFctONakTqSAgbqTYQ@mail.gmail.com
[2] http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
[3] http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com
[4] http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com
[5] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2
[6] http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com
[7] http://archives.postgresql.org/message-id/5056DFAB.3050707@vmware.com
[8] http://archives.postgresql.org/message-id/CAEYLb_Uvbi9mns-uJWUW4QtHqnC27SEyyNmj1HKFY=5X5wwdgg@mail.gmail.com
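To make steps 1-8 above concrete, here is a compressed sketch of how the pieces are wired together. It reuses the names the patch and this review mention (ApplyCacheAllocate, decode_begin_txn, decode_change, decode_commit_txn, ReaderApplyState, private_data, finished_record, DecodeRecordIntoApplyCache); the reader allocation call, the field names and the exact argument lists are assumptions for illustration, not the patch's real API.

/*
 * Illustrative sketch only -- signatures and field names are assumed.
 */
static void
example_finished_record(XLogReaderState *state, XLogRecordBuffer *buf)
{
    /* step 8: only the plugin knows what private_data really is */
    ReaderApplyState *apply_state = (ReaderApplyState *) state->private_data;

    DecodeRecordIntoApplyCache(apply_state, buf);
}

static void
example_decode_range(XLogRecPtr startpoint, XLogRecPtr endpoint)
{
    XLogReaderState  *reader = XLogReaderAllocate();          /* allocator name assumed */
    ReaderApplyState *apply_state = palloc0(sizeof(ReaderApplyState));
    ApplyCache       *apply_cache = ApplyCacheAllocate();

    /* step 4: register the three output-plugin callbacks */
    apply_cache->begin        = decode_begin_txn;
    apply_cache->apply_change = decode_change;
    apply_cache->commit       = decode_commit_txn;

    /* steps 3 and 5: hook the reader and hand it our opaque state */
    reader->finished_record  = example_finished_record;
    apply_state->apply_cache = apply_cache;                   /* field name assumed */
    reader->private_data     = apply_state;

    /* step 6: let the reader run over the requested range */
    XLogReader(reader, startpoint, endpoint);                 /* argument list assumed */
}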
On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote: > On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote: > > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch) > > This patch is the 8th of 8 in a patch series that covers different > aspects of the bi-directional replication feature planned for > PostgreSQL 9.3. For those that are unfamiliar with the BDR projection, > a useful summary is available in an e-mail sent to this list by Simon > Riggs back in April [1]. I should also point out that Andres has made Does this design allow replication/communcation between clusters running different major versions of Postgres? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> Does this design allow replication/communcation between clusters running > different major versions of Postgres? Just catching up on your email, hmmm? Yes, that's part of the design 2Q presented. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Bruce Momjian <bruce@momjian.us> schrieb:

>On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:
>> On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:
>> > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)
>>
>> This patch is the 8th of 8 in a patch series that covers different
>> aspects of the bi-directional replication feature planned for
>> PostgreSQL 9.3. For those that are unfamiliar with the BDR projection,
>> a useful summary is available in an e-mail sent to this list by Simon
>> Riggs back in April [1]. I should also point out that Andres has made
>
>Does this design allow replication/communcation between clusters running
>different major versions of Postgres?

This patchset contains only the decoding/changeset generation part of logical replication. It provides (as in the debugging example) the capabilities to generate a correct textual format and thus can be used to build a solution with support for cross version/arch replication, as long as the text format of the used types is compatible.

Does that answer the question?

Andres

---
Please excuse the brevity and formatting - I am writing this on my mobile phone.
On Thu, Oct 11, 2012 at 01:34:58AM +0200, anarazel@anarazel.de wrote:
>
> Bruce Momjian <bruce@momjian.us> schrieb:
>
> >On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:
> >> On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)
> >>
> >> This patch is the 8th of 8 in a patch series that covers different
> >> aspects of the bi-directional replication feature planned for
> >> PostgreSQL 9.3. For those that are unfamiliar with the BDR projection,
> >> a useful summary is available in an e-mail sent to this list by Simon
> >> Riggs back in April [1]. I should also point out that Andres has made
> >
> >Does this design allow replication/communcation between clusters running
> >different major versions of Postgres?
>
> This patchset contains only the decoding/changeset generation part of logical replication. It provides (as in the debugging example) the capabilities to generate a correct textual format and thus can be used to build a solution with support for cross version/arch replication, as long as the text format of the used types is compatible.
>
> Does that answer the question?

Yes. This was posted so long ago I couldn't remember if that was part of the design. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > The purpose of ApplyCache/transaction reassembly is to reassemble > interlaced records, and organise them by XID, so that the consumer > client code sees only streams (well, lists) of records split by XID. I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. With streaming replication as it exists today, a transaction that modifies a huge amount of data (such as a bulk load) can be applied on the standby as it happens. The rows thus inserted will become visible only if and when the transaction commits on the master and the commit record is replayed on the standby. This has a number of important advantages, perhaps most importantly that the lag between commit and data visibility remains short. With the proposed system, we can't start applying the changes until the transaction has committed and the commit record has been replayed, so a big transaction is going to have a lot of apply latency. Now, I am not 100% opposed to a design that surrenders this property in exchange for other important benefits, but I think it would be worth thinking about whether there is any way that we can design this that either avoids giving that property up at all, or gives it up for the time being but allows us to potentially get back to it in a later version. Reassembling complete transactions is surely cool and some clients will want that, but being able to apply replicated transactions *without* reassembling them in their entirety is even cooler, and some clients will want that, too. If we're going to stick with a design that reassembles transactions, I think there are a number of issues that deserve careful thought. First, memory usage. I don't think it's acceptable for the decoding process to assume that it can allocate enough backend-private memory to store all of the in-flight changes (either as WAL or in some more decoded form). We have assiduously avoided such assumptions thus far; you can write a terabyte of data in one transaction with just a gigabyte of shared buffers if you so desire (and if you're patient). Here's you making the same point in different words: > Applycache is presumably where you're going to want to spill > transaction streams to disk, eventually. That seems like a > prerequisite to commit. Second, crash recovery. I think whatever we put in place here has to be able to survive a crash on any node. Decoding must be able to restart successfully after a system crash, and it has to be able to apply exactly the set of transactions that were committed but not applied prior to the crash. Maybe an appropriate mechanism for this already exists or has been discussed, but I haven't seen it go by; sorry if I have missed the boat. > You consider this to be a throw-away function that won't ever be > committed. However, I strongly feel that you should move it into > /contrib, so that it can serve as a sort of reference implementation > for authors of decoder client code, in the same spirit as numerous > existing contrib modules (think contrib/spi). Without prejudice to the rest of this review which looks quite well-considered, I'd like to add a particular +1 to this point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
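As a purely illustrative sketch of the kind of spilling being asked for here (self-contained C with invented names; nothing below comes from the patch): once a transaction's reassembled changes exceed some memory budget, further changes are serialized to a temporary file instead of being kept in backend-private memory.

#include <stdio.h>

/* Invented, minimal stand-in for a reassembled transaction. */
typedef struct SpillingTxn
{
    size_t  mem_bytes;      /* bytes of changes currently kept in memory */
    FILE   *spill_file;     /* temp file used once the budget is exceeded */
} SpillingTxn;

#define TXN_MEM_BUDGET (64 * 1024)      /* hypothetical per-transaction budget */

static void
txn_add_change(SpillingTxn *txn, const void *change, size_t len)
{
    if (txn->mem_bytes + len <= TXN_MEM_BUDGET)
    {
        /* cheap case: keep the change in an in-memory list (omitted here) */
        txn->mem_bytes += len;
        return;
    }

    /* over budget: serialize this and all further changes to a temp file */
    if (txn->spill_file == NULL)
        txn->spill_file = tmpfile();

    fwrite(&len, sizeof(len), 1, txn->spill_file);   /* length prefix */
    fwrite(change, 1, len, txn->spill_file);         /* change payload */
}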
On Wed, Oct 10, 2012 at 09:10:48PM -0400, Robert Haas wrote: > > You consider this to be a throw-away function that won't ever be > > committed. However, I strongly feel that you should move it into > > /contrib, so that it can serve as a sort of reference implementation > > for authors of decoder client code, in the same spirit as numerous > > existing contrib modules (think contrib/spi). > > Without prejudice to the rest of this review which looks quite > well-considered, I'd like to add a particular +1 to this point. The review was _HUGE_! :-O -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> The purpose of ApplyCache/transaction reassembly is to reassemble >> interlaced records, and organise them by XID, so that the consumer >> client code sees only streams (well, lists) of records split by XID. > I think I've mentioned it before, but in the interest of not being > seen to critique the bikeshed only after it's been painted: this > design gives up something very important that exists in our current > built-in replication solution, namely pipelining. Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? regards, tom lane
On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think I've mentioned it before, but in the interest of not being >> seen to critique the bikeshed only after it's been painted: this >> design gives up something very important that exists in our current >> built-in replication solution, namely pipelining. > > Isn't there an even more serious problem, namely that this assumes > *all* transactions are serializable? What happens when they aren't? > Or even just that the effective commit order is not XID order? Firstly, I haven't read the code but I'm confident it doesn't make the elementary error of assuming commit order == xid order. I assume it's applying the reassembled transactions in commit order. I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that commited earlier but it doesn't matter when trying to reproduce the effects using constants. The data written by this transaction is either written or not when the commit happens and it's all written or not at that time. Even in non-serializable mode updates take row locks and nobody can see the data or modify it until the transaction commits. I have to say I was curious about Robert's point as well when I read Peter's review. Especially because this is exactly how other logical replication systems I've seen work too and I've always wondered about it in those systems. Both MySQL and Oracle reassemble transactions and don't write anything until they have the whole transaction reassembled. To me this always struck me as a bizarre and obviously bad thing to do though. It seems to me it would be better to create sessions (or autonomous transactions) for each transaction seen in the stream and issue the DML as it shows up, committing and cleaning each up when the commit or abort (or shutdown or startup) record comes along. I imagine the reason lies with dealing with locking and ensuring that you get the correct results without deadlocks when multiple transactions try to update the same record. But it seems to me that the original locks the source database took should protect you against any problems. As long as you can suspend a transaction when it takes a lock that blocks and keep processing WAL for other transactions (or an abort for that transaction if that happened due to a deadlock or user interruption) you should be fine. -- greg
Hi, First of: Awesome review. On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: > What follows is an initial overview of the patch (or at least my > understanding of the patch, which you may need to correct), and some > points of concern. > > > * applycache module which reassembles transactions from a stream of > > interspersed changes > > This is what the design doc [2] refers to as "4.5. TX reassembly". > > This functionality is concentrated in applycache.c. As [2] notes, the > reassembly component "was previously coined ApplyCache because it was > proposed to run on replication consumers just before applying changes. > This is not the case anymore. It is still called that way in the > source of the patch recently submitted." > > The purpose of ApplyCache/transaction reassembly is to reassemble > interlaced records, and organise them by XID, so that the consumer > client code sees only streams (well, lists) of records split by XID. > I meant to avoid talking about anything other than the bigger picture > for now, but I must ask: Why the frequent use of malloc(), > particularly within applycache.c? The obvious answer is that it's > rough code and that that will change, but that still doesn't comport > with my idea about how rough Postgres code should look, so I have to > wonder if there's a better reason. Several reasons, not sure how good: - part of the code (was) supposed to be runnable on a target system without a full postgres backend arround - All of the allocations would basically have to be in TopMemoryContext or something equally longlived as we don't have a transaction context or anything like that - the first revision showed that allocating memory was the primary bottleneck so I added a small allocation cache to the critical pieces which solved that. After that there seems no point in using mctx''s for that kind of memory anymore. > applycache.c has an acute paucity of comments, which makes it really > hard to review well. Working on that. I don't think its internals are really all that interesting atm. > I'm going to not comment much on this here, except to say that > I think that the file should be renamed to reassembly.c or something > like that, to reflect its well-specified purpose, and not how it might > be used. Agreed. > Applycache is presumably where you're going to want to spill > transaction streams to disk, eventually. That seems like a > prerequisite to commit. Yes. > By the way, I see that you're doing this here: > > + /* most callers don't need snapshot.h */ > + typedef struct SnapshotData *Snapshot; > > Tom, Alvaro and I had a discussion about whether or not this was an > acceptable way to reduce build dependencies back in July [8] – I lost > that one. You're relying on a Gnu extension/C11 feature here (that is, > typedef redefinition). If you find it intensely irritating that you > cannot do this while following the standard to the letter, you are not > alone. Yuck :(. So I will just use struct SnapshotData directly instead of a typedef... > > * snapbuilder which builds catalog snapshots so that tuples from wal can > > be understood > > This component analyses xlog and builds a special kind of Snapshot. > This has been compared to the KnownAssignedXids machinery for Hot > Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is > meant by this). Since decoding only has to occur within a single > backend, I guess it's sufficient that it's all within local memory (in > contrast to the KnownAssignedXids array, which is in shared memory). 
I haven't found a convincing argument to share that information itself. If we want to parallelise this I think sharing the snapshots should be sufficient. Given the snapshot only covers system catalogs the changerate should be fine. > The design document [2] really just explains the problem (which is the > need for catalog metadata at a point in time to make sense of heap > tuples), without describing the solution that this patch offers with > any degree of detail. Rather, [2] says "How we build snapshots is > somewhat intricate and complicated and seems to be out of scope for > this document", which is unsatisfactory. I look forward to reading the > promised document that describes this mechanism in more detail. It's > really hard to guess what you might have meant to do, and why you > might have done it, much less verifying the codes correctness. Will concentrate on finishing that document. > This functionality is concentrated in snapbuild.c. A comment in decode.c > notes: > > + * Its possible that the separation between decode.c and snapbuild.c is > a + * bit too strict, in the end they just about have the same struct. > > I prefer the current separation. I think it's reasonable that decode.c > is sort of minimal glue code. Good. > Decoding means breaking up individual XLogRecord structs, and storing > them in an applycache (applycache does this, and stores them as > ApplyCacheChange records), while building a snapshot (which is needed > in advance of adding tuples from records). It can be thought of as the > small piece of glue between applycache and snapbuild that is called by > XLogReader (DecodeRecordIntoApplyCache() is the only public function, > which will be called by many xlogreader_state.finished_record-hooked > plugin functions in practice, including this example one). An example > of what belongs in decode.c is the way it ignores physical > XLogRecords, because they are not of interest. > > By the way, why not just use struct assignment here?: > > + memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode)); At the time I initially wrote the code that seemed to be the project standard. I think this only recently changed. > > * decode_xlog(lsn, lsn) debugging function > > You consider this to be a throw-away function that won't ever be > committed. Only in its current hacked-together form. > However, I strongly feel that you should move it into > /contrib, so that it can serve as a sort of reference implementation > for authors of decoder client code, in the same spirit as numerous > existing contrib modules (think contrib/spi). I think that such a > module could even be useful to people that were just a bit > intellectually curious about how WAL works, which is something I'd > like to encourage. Besides, simply having this code in a module will > more explicitly demarcate client code (just look at logicalfuncs.c – > it is technically client code, but that's too subtle right now). I definitely think we need something like this, but it might look a bit different. For reasons you found later (xmin horizon) I don't think we can run it in the context of a normal backend for one. > I don't like this code in decode_xlog(): > > + apply_state = (ReaderApplyState*)xlogreader_state->private_data; > > Why is it necessary to cast here? In other words, why is private_data > a void pointer at all? The reason is that for a long time I hoped to keep ApplyCache's generic enough to be usable outside of this exact scenario. 
I am not sure if thats a worthwile goal anymore, especially as there are some layering violations now. > > The applycache provides 3 major callbacks: > > * apply_begin > > * apply_change > > * apply_commit > > These are callbacks intended to be used by third-party modules, > perhaps including a full multi-master replication implementation > (though this patch isn't directly concerned with that), or even a > speculative future version of a logical replication system like Slony. > [2] refers to these under "4.7. Output Plugin". These are the typedef > for the hook types involved (incidentally, don't like the name of > these types – I think you should lose the CB): Not particularly attached to the CB, so I can loose it. > +/* XXX: were currently passing the originating subtxn. Not sure thats > necessary */ > +typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, > ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change); > +typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn); > +typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn); > > So we register these callbacks like this in the patch: > > + /* > + * allocate an ApplyCache that will apply data using lowlevel calls > + * without type conversion et al. This requires binary compatibility > + * between both systems. > + * XXX: This would be the place too hook different apply methods, like > + * producing sql and applying it. > + */ > + apply_cache = ApplyCacheAllocate(); > + apply_cache->begin = decode_begin_txn; > + apply_cache->apply_change = decode_change; > + apply_cache->commit = decode_commit_txn; > > The decode_xlog(lsn, lsn) debugging function that Andres has played > with [6] (that this patch makes available, for now) is where this code > comes from. > > Whenever ApplyCache calls an "apply_change" callback for a single > change (that is, a INSERT|UPDATE|DELETE) it locally overrides the > normal SnapshotNow semantics used for catalog access with a previously > built snapshot. Behaviour should now be consistent with a normal > snapshotNow acquired when the tuple change was originally written to > WAL. I additionally want pass a mvcc-ish Snapshot (should just need a a change in function signatures) so the output plugins can query the catalog manually. > Having commented specifically on modules that Andres highlighted, I'd > like to highlight one myself: tqual.c . This module has had > significant new functionalty added, so it would be an omission to not > pass remark on it in this opening e-mail, having mentioned all other > modules with significant new pieces of functionality. The file has had > new utility functions added, that pertain to snapshot visibility > during decoding - "time travel". For me that mentally was part of catalog timetravel, thats why I didn't highlight it separately. But yes, this needs to be discussed. > I draw attention to this. This code is located within the new function > HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is > done for "dirty snapshots" (dirty snapshots are used with > ri_triggers.c, for example, when even uncommitted tuple should be > visible). Not sure where you see the similarity with dirty snapshots? > Both of these functions are generally accessed through > function pointers. Anyway, here's the code: > > + /* > + * FIXME: The not yet existing decoding infrastructure will need to > force + * the xmin to stay lower than what they are currently decoding. 
> + */ > + bool fixme_xmin_horizon = false; > > I'm sort of wondering what this is going to look like in the finished > patch. This FIXME is rather hard for me to take at face value. It > seems to me that the need to coordinate decoding with xmin horizon > itself represents a not insignificant engineering challenge. So, with > all due respect, it would be nice if I wasn't asked to make that leap > of faith. The xmin horizon prepared transaction hack needs to go. The idea is to do the decoding inside (or in something very similar) like walsenders. Those already have the capability to keep the xmin horizon nailed to some point for the hot_standby_feedback feature. There is a need for something similar like what prepared statements do for restartability after a restart though. > Within tqual.h, shouldn't you have something like this, but for time > travel snapshots during decoding?: > > #define InitDirtySnapshot(snapshotdata) \ > ((snapshotdata).satisfies = HeapTupleSatisfiesDirty) Hm. Ok. > Alternatively, adding a variable to these two might be appropriate: > > static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC}; > static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC}; Not sure where youre going from this? > My general feeling is that the code is very under-commented, and in > need of a polish, though I'm sure that you are perfectly well aware of > that. Totally aggreed. > The basic way all of these components that I have described > separately fit together is: (if others want to follow this, refer to > decode_xlog()) > > 1. Start with some client code “output plugin” (for now, a throw-away > debugging function “decode_xlog()”) The general idea is to integrate this into the walsender framework in a command very roughly looking like: START_LOGICAL_REPLICATION $plugin LSN > \ / > 2. Client allocates an XLogReaderState. (This module is a black box to > me, though it's well encapsulated so that shouldn't matter much. > Heikki is reviewing this [7]. Like I said, this isn't quite an atomic > unit I'm reviewing.) > > \ / > 3. Plugin registers various callbacks (within logicalfuncs.c). These > callbacks, while appearing in this patch, are mostly NO-OPS, and are > somewhat specific to XLogReader's concerns. I mostly defer to Heikki > here. > > \ / > 4. Plugin allocates an “ApplyCache”. Plugin assigns some more > callbacks to “ApplyCache”. This time, they're the aforementioned 3 > apply cache functions. > > \ / > 5. Plugin assigns this new ApplyCache to variable within private state > of the XLogReader (this private state is a subset of its main state, > and is opaque to XLogReader). > > \ / > 6. Finally, plugin calls XLogReader(main_state). > > \ / > 7. At some point during its magic, > XLogReader calls the hook registered in step 3, finished_record. > This is all it does directly > with the plugin, which it makes minimal assumptions about. > > \ / > 8. finished_record (which is > logically a part of the “plugin”) knows what type the opaque > private_data > actually is. It casts it > to an apply_state, and calls the decoder (supplying the apply_state as > an argument to > DecodeRecordIntoApplyCache()). > > \ / > 9. During the first call > (within the first record within a call to decode_xlog()), we allocate > a snapshot reader. > > \ / > 10. Builds snapshot callback. > This scribbles on our snapshot state, which essentially encapsulates a > snapshot. > The state (and > snapshot) changes continually, once per call. 
It changes only if there were catalog changes in the transaction and/or we haven't yet build an initial snapshot. > \ / > 11. Looks at XLogRecordBuffer > (an XLogReader struct). Looks at an XLogRecord. Decodes based on > record type. > Let's assume it's an > XLOG_HEAP_INSERT. > > \ / > 12. DecodeInsert() called. > This in turn calls DecodeXLogTuple(). We store the tuple metadata in > our > ApplyCache. (some > ilists, somewhere, each corresponding to an XID). We don't store the > relation oid, because we don't > know it yet (only > relfilenode is known from WAL). > / > / > \ / > 13. We're back in XLogReader(). It > calls the only callback of interest to > us covered in step 3 (and > not of interest to XlogReader()/Heikki) – decode_change(). It does > this through the > apply_cache.apply_change > hook. This happens because we encounter another record, this time a > commit record (in the same > codepath as discussed in step 12). > > \ / > 14. In decode_change(), the actual > function that raises the interesting WARNINGs within Andres' > earlier example [6], showing > actual integer/varchar Datum value for tuples previously inserted. > Resolve table oid based on > relfilenode (albeit unsatisfactorily). > Using a StringInfo, > tupledescs, syscache and typcache, build WARNING string. > > So my general opinion of how all this fits together is that it isn't > quite right. Problems include: I think its more understandable if you think that normally the initialization code shouldn't be repeated in the plugins but the plugins are just called from a walsender. > * Why does the ReaderApplyState get magically initialised in two > stages? apply_cache is initialised in decode_xlog (or whatever > plugin). Snapstate is allocated within DecodeRecordIntoApplyCache() > (with a dependency on apply_cache). Shouldn't this all be happening > within a single function? As you yourself have point out, not everyone > needs to know about these snapshots. The reason is/was that I didn't want the outside know anything about the Snapshot building, that seems to be only relevant to decode.c (and somewhat applycache.c). I can look at it though. > * Maybe I've missed something, but I think you need a more > satisfactory example plugin. What happens in step 14 is plainly > unacceptable. You haven't adequately communicated to me how this is > going to be used in logical replication. Maybe I just haven't got that > far yet. Well, what would you like to happen instead? Does it make more sense with the concept that the output plugin will eventually run inside walsender(-ish) and stream changes via COPY? I just couldn't finish a POC for that in time. > I'm not impressed by the InvalidateSystemCaches() calls here > and elsewhere. Yes, those definitely got to go and we have nearly all the infrastructure for it from Hot Standby. It was noted as a deficiency in the overview mail & the commit message though ;). I have that locally. The rough idea here is to reuse xl_xact_commit->nmsgs to handle the cache behaviour. There are some more complications but let me first write the timetravel document. > * Please break-out the client code as a contrib module. That > separation would increase the readability of the patch. Well, as I said above the debugging function in its current form is not going to walk, as soon we have aggreed on how to integrate this into walsender I am definitely going to provide a example/debugging plugin which I definitely intend to submit. Thanks! 
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
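For reference, the tqual.h addition Peter suggested and Andres accepted in the exchange above would amount to something like this minimal sketch; the macro name is invented, while HeapTupleSatisfiesMVCCDuringDecoding is the function the patch adds.

/* Sketch only; mirrors InitDirtySnapshot() in tqual.h. */
#define InitDecodingSnapshot(snapshotdata) \
    ((snapshotdata).satisfies = HeapTupleSatisfiesMVCCDuringDecoding)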
On Thu, Oct 11, 2012 at 03:16:39AM +0100, Greg Stark wrote: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think I've mentioned it before, but in the interest of not being > >> seen to critique the bikeshed only after it's been painted: this > >> design gives up something very important that exists in our current > >> built-in replication solution, namely pipelining. > > > > Isn't there an even more serious problem, namely that this assumes > > *all* transactions are serializable? What happens when they aren't? > > Or even just that the effective commit order is not XID order? > > Firstly, I haven't read the code but I'm confident it doesn't make the > elementary error of assuming commit order == xid order. I assume it's > applying the reassembled transactions in commit order. > > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. The data written by > this transaction is either written or not when the commit happens and > it's all written or not at that time. Even in non-serializable mode > updates take row locks and nobody can see the data or modify it until > the transaction commits. How does Slony write its changes without causing serialization replay conflicts? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Stark <stark@mit.edu> writes: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Isn't there an even more serious problem, namely that this assumes >> *all* transactions are serializable? What happens when they aren't? >> Or even just that the effective commit order is not XID order? > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. I would believe that argument if the "apply" operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in "source" terms. Even something as simple as a few column renames could break it, let alone anything as esoteric as changing the meaning of datatype literals. regards, tom lane
On Thursday, October 11, 2012 03:10:48 AM Robert Haas wrote: > On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > > The purpose of ApplyCache/transaction reassembly is to reassemble > > interlaced records, and organise them by XID, so that the consumer > > client code sees only streams (well, lists) of records split by XID. > > I think I've mentioned it before, but in the interest of not being > seen to critique the bikeshed only after it's been painted: this > design gives up something very important that exists in our current > built-in replication solution, namely pipelining. With streaming > replication as it exists today, a transaction that modifies a huge > amount of data (such as a bulk load) can be applied on the standby as > it happens. The rows thus inserted will become visible only if and > when the transaction commits on the master and the commit record is > replayed on the standby. This has a number of important advantages, > perhaps most importantly that the lag between commit and data > visibility remains short. With the proposed system, we can't start > applying the changes until the transaction has committed and the > commit record has been replayed, so a big transaction is going to have > a lot of apply latency. I don't think there is a fundamental problem here, just an incremental ones. The major problems are: * transactions with DDL & DML currently need to be reassembled, it might be possible to resolve this though, haven't thought about it too much * subtransaction are only assigned to toplevel transactions at commit time * you need a variable amount of backends/parallel transactions open at the target system to apply all the transactions concurrently. You can't smash them together because one of them might rollback. All of those seem solveable to me, so I am not too worried about addition of a streaming mode somewhere down the line. I don't want to focus on it right now though. Ok? > Here's you making the same point in different words: > > Applycache is presumably where you're going to want to spill > > transaction streams to disk, eventually. That seems like a > > prerequisite to commit. > > Second, crash recovery. I think whatever we put in place here has to > be able to survive a crash on any node. Decoding must be able to > restart successfully after a system crash, and it has to be able to > apply exactly the set of transactions that were committed but not > applied prior to the crash. Maybe an appropriate mechanism for this > already exists or has been discussed, but I haven't seen it go by; > sorry if I have missed the boat. I have discussed it privately & roughly prototyped, but not publically. There are two pieces to this: 1) restartable after a crash/disconnection/shutdown 2) pick of exactly where it stopped Those are somewhat different because 1) is relevant on the source side and be solved there. 2) depends on the target system because it needs to ensure that it safely received the changes up to some point. The idea for 1) is to serialize the applycache whenever we reach a checkpoint and have that as a starting point for every confirmed flush location of 2). Obviously 2) will need cooperation by the receiving side. > > You consider this to be a throw-away function that won't ever be > > committed. 
However, I strongly feel that you should move it into > > /contrib, so that it can serve as a sort of reference implementation > > for authors of decoder client code, in the same spirit as numerous > > existing contrib modules (think contrib/spi). > > Without prejudice to the rest of this review which looks quite > well-considered, I'd like to add a particular +1 to this point. So were in violent agreement here ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
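A hypothetical illustration (names invented, nothing from the patch) of the two pieces of restart bookkeeping described above: the point up to which the applycache was last serialized on the source side, and the flush position the receiving side has confirmed.

typedef struct DecodingRestartState
{
    XLogRecPtr  serialized_at;      /* checkpoint LSN at which the applycache was last spilled (point 1) */
    XLogRecPtr  confirmed_flush;    /* newest LSN the receiver has acknowledged as safely applied (point 2) */
} DecodingRestartState;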
On Thursday, October 11, 2012 04:16:39 AM Greg Stark wrote: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think I've mentioned it before, but in the interest of not being > >> seen to critique the bikeshed only after it's been painted: this > >> design gives up something very important that exists in our current > >> built-in replication solution, namely pipelining. > > > > Isn't there an even more serious problem, namely that this assumes > > *all* transactions are serializable? What happens when they aren't? > > Or even just that the effective commit order is not XID order? > > Firstly, I haven't read the code but I'm confident it doesn't make the > elementary error of assuming commit order == xid order. I assume it's > applying the reassembled transactions in commit order. Yes, its commit order. Imo commit order is more like assuming all transactions are done in read committed and not above than assuming serializable? Or am I missing something? > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. The data written by > this transaction is either written or not when the commit happens and > it's all written or not at that time. Even in non-serializable mode > updates take row locks and nobody can see the data or modify it until > the transaction commits. Yes. There will be problems if you want to make serializable work across nodes, but that seems like something fiendishly complex anyway. I don't plan to work on it in the forseeable future. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Isn't there an even more serious problem, namely that this assumes
> >> *all* transactions are serializable? What happens when they aren't?
> >> Or even just that the effective commit order is not XID order?
> >
> > I don't think it assumes the transactions are serializable because
> > it's only concerned with writes, not reads. So the transaction it's
> > replaying may or may not have been able to view the data written by
> > other transactions that commited earlier but it doesn't matter when
> > trying to reproduce the effects using constants.
>
> I would believe that argument if the "apply" operations were at a
> similar logical level to our current WAL records, namely drop these bits
> into that spot. Unfortunately, they're not. I think this argument
> falls to the ground entirely as soon as you think about DDL being
> applied by transactions A,B,C and then needing to express what
> concurrent transactions X,Y,Z did in "source" terms. Even something as
> simple as a few column renames could break it,

Not sure what you're getting at here? Are you talking about the problems at the source side or the target side?

During decoding such problems should be handled already. As we reconstruct a Snapshot that lets catalog access look like it did back when the tuple was thrown into the WAL, we have the exact column names, data types and everything. The locking used when making the original changes prevents the data types and column names from being changed mid-transaction.

If you're talking about the receiving/apply side: Sure, if you do careless DDL and you don't replicate DDL (not included here, separate project), you're going to have problems. I don't think there is much we can do about that.

> let alone anything as esoteric as changing the meaning of datatype literals.

Hm. You mean like changing the input format of a datatype? Yes, sure, that will cause havoc.

Greetings,

Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thursday, October 11, 2012 04:49:20 AM Andres Freund wrote: > On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote: > > Greg Stark <stark@mit.edu> writes: > > > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> Isn't there an even more serious problem, namely that this assumes > > >> all transactions are serializable? What happens when they aren't? > > >> Or even just that the effective commit order is not XID order? > > > > > > > > > > > > I don't think it assumes the transactions are serializable because > > > it's only concerned with writes, not reads. So the transaction it's > > > replaying may or may not have been able to view the data written by > > > other transactions that commited earlier but it doesn't matter when > > > trying to reproduce the effects using constants. > > > > > > > > I would believe that argument if the "apply" operations were at a > > similar logical level to our current WAL records, namely drop these bits > > into that spot. Unfortunately, they're not. I think this argument > > falls to the ground entirely as soon as you think about DDL being > > applied by transactions A,B,C and then needing to express what > > concurrent transactions X,Y,Z did in "source" terms. Even something as > > simple as a few column renames could break it, > > Not sure what youre getting at here? Are you talking about the problems at > the source side or the target side? > > During decoding such problems should be handled already Btw, the introductionary email upthread shows a trivial example. As submitted the code cannot handle intermingled DDL/DML transactions, but I fixed that now. There are some problems with CLUSTER/VACUUM FULL on system catalogs, but thats going to be separate thread... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From: Heikki Linnakangas
On 22.09.2012 20:00, Andres Freund wrote: > [[basic-schema]] > .Architecture Schema > ["ditaa"] > ------------------------------------------------------------------------------ > Traditional Stuff > > +---------+---------+---------+---------+----+ > | Backend | Backend | Backend | Autovac | ...| > +----+----+---+-----+----+----+----+----+-+--+ > | | | | | > +------+ | +--------+ | | > +-+ | | | +----------------+ | > | | | | | | > | v v v v | > | +------------+ | > | | WAL writer |<------------------+ > | +------------+ > | | | | | | > v v v v v v +-------------------+ > +--------+ +---------+ +->| Startup/Recovery | > |{s} | |{s} | | +-------------------+ > |Catalog | | WAL |---+->| SR/Hot Standby | > | | | | | +-------------------+ > +--------+ +---------+ +->| Point in Time | > ^ | +-------------------+ > ---|----------|-------------------------------- > | New Stuff > +---+ | > | v Running separately > | +----------------+ +=-------------------------+ > | | Walsender | | | | > | | v | | +-------------------+ | > | +-------------+ | | +->| Logical Rep. | | > | | WAL | | | | +-------------------+ | > +-| decoding | | | +->| Multimaster | | > | +------+------/ | | | +-------------------+ | > | | | | | +->| Slony | | > | | v | | | +-------------------+ | > | +-------------+ | | +->| Auditing | | > | | TX | | | | +-------------------+ | > +-| reassembly | | | +->| Mysql/... | | > | +-------------/ | | | +-------------------+ | > | | | | | +->| Custom Solutions | | > | | v | | | +-------------------+ | > | +-------------+ | | +->| Debugging | | > | | Output | | | | +-------------------+ | > +-| Plugin |--|--|-+->| Data Recovery | | > +-------------/ | | +-------------------+ | > | | | | > +----------------+ +--------------------------| > ------------------------------------------------------------------------------ This diagram triggers a pet-peeve of mine: What do all the boxes and lines mean? An architecture diagram should always include a key. I find that when I am drawing a diagram myself, adding the key clarifies my own thinking too. This looks like a data-flow diagram, where the arrows indicate the data flows between components, and the boxes seem to represent processes. But in that case, I think the arrows pointing from the plugins in walsender to Catalog are backwards. The catalog information flows from the Catalog to walsender, walsender does not write to the catalogs. Zooming out to look at the big picture, I think the elephant in the room with this whole effort is how it fares against trigger-based replication. You list a number of disadvantages that trigger-based solutions have, compared to the proposed logical replication. Let's take a closer look at them: > * essentially duplicates the amount of writes (or even more!) True. > * synchronous replication hard or impossible to implement I don't see any explanation it could be implemented in the proposed logical replication either. > * noticeable CPU overhead > * trigger functions > * text conversion of data Well, I'm pretty sure we could find some micro-optimizations for these if we put in the effort. And the proposed code isn't exactly free, either. > * complex parts implemented in several solutions Not sure what this means, but the proposed code is quite complex too. > * not in core IMHO that's a good thing, and I'd hope this new logical replication to live outside core as well, as much as possible. But whether or not something is in core is just a political decision, not a reason to implement something new. 
If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says. Sorry to be a naysayer, but I'm pretty scared of all the new code and complexity these patches bring into core.

PS. I'd love to see a basic Slony plugin for this, for example, to see how much extra code on top of the posted patches you need to write in a plugin like that to make it functional. I'm worried that it's a lot..

- Heikki
On 10/11/2012 04:31 AM, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order?
>> I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier, but it doesn't matter when trying to reproduce the effects using constants.
> I would believe that argument if the "apply" operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in "source" terms. Even something as simple as a few column renames could break it, let alone anything as esoteric as changing the meaning of datatype literals.

This is the whole reason for moving the reassembly to the source side and having the possibility to use old snapshots to get the catalog information. Also, the locks that protect you from the effects of field name changes by concurrent DDL transactions also protect the logical reassembly, if it is done in commit order.

> regards, tom lane
On 10/11/2012 03:10 AM, Robert Haas wrote: > On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> The purpose of ApplyCache/transaction reassembly is to reassemble >> interlaced records, and organise them by XID, so that the consumer >> client code sees only streams (well, lists) of records split by XID. > I think I've mentioned it before, but in the interest of not being > seen to critique the bikeshed only after it's been painted: this > design gives up something very important that exists in our current > built-in replication solution, namely pipelining. The lack of pipelining (and the following complexity of applycache and spilling to disk) is something we have discussed with Andres and to my understanding it is not a final design decision but just stepping stones in how this quite large development is structured. The pipelining (or parallel apply as I described it) requires either a large number of apply backends and code to manage them or autonomous transactions. It could (arguably !) be easier to implement autonomous transactions instead of apply cache, but Andres had valid reasons to start with apply cache and move to parallel apply later . As I understand it the parallel apply is definitely one of the things that will be coming and after that the performance characteristics (fast AND smooth) will be very similar to current physical WAL streaming. > With streaming > replication as it exists today, a transaction that modifies a huge > amount of data (such as a bulk load) can be applied on the standby as > it happens. The rows thus inserted will become visible only if and > when the transaction commits on the master and the commit record is > replayed on the standby. This has a number of important advantages, > perhaps most importantly that the lag between commit and data > visibility remains short. With the proposed system, we can't start > applying the changes until the transaction has committed and the > commit record has been replayed, so a big transaction is going to have > a lot of apply latency. > > Now, I am not 100% opposed to a design that surrenders this property > in exchange for other important benefits, but I think it would be > worth thinking about whether there is any way that we can design this > that either avoids giving that property up at all, or gives it up for > the time being but allows us to potentially get back to it in a later > version. Reassembling complete transactions is surely cool and some > clients will want that, but being able to apply replicated > transactions *without* reassembling them in their entirety is even > cooler, and some clients will want that, too. > > If we're going to stick with a design that reassembles transactions, I > think there are a number of issues that deserve careful thought. > First, memory usage. I don't think it's acceptable for the decoding > process to assume that it can allocate enough backend-private memory > to store all of the in-flight changes (either as WAL or in some more > decoded form). We have assiduously avoided such assumptions thus far; > you can write a terabyte of data in one transaction with just a > gigabyte of shared buffers if you so desire (and if you're patient). > Here's you making the same point in different words: > >> Applycache is presumably where you're going to want to spill >> transaction streams to disk, eventually. That seems like a >> prerequisite to commit. > Second, crash recovery. 
I think whatever we put in place here has to > be able to survive a crash on any node. Decoding must be able to > restart successfully after a system crash, and it has to be able to > apply exactly the set of transactions that were committed but not > applied prior to the crash. Maybe an appropriate mechanism for this > already exists or has been discussed, but I haven't seen it go by; > sorry if I have missed the boat. > >> You consider this to be a throw-away function that won't ever be >> committed. However, I strongly feel that you should move it into >> /contrib, so that it can serve as a sort of reference implementation >> for authors of decoder client code, in the same spirit as numerous >> existing contrib modules (think contrib/spi). > Without prejudice to the rest of this review which looks quite > well-considered, I'd like to add a particular +1 to this point. >
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
> On 22.09.2012 20:00, Andres Freund wrote:
> > [[basic-schema]]
> > .Architecture Schema
> > ["ditaa"]
> > [quoted ASCII architecture diagram omitted; see the previous message]
>
> This diagram triggers a pet-peeve of mine: What do all the boxes and lines mean? An architecture diagram should always include a key. I find that when I am drawing a diagram myself, adding the key clarifies my own thinking too.

Hm. Ok.

> This looks like a data-flow diagram, where the arrows indicate the data flows between components, and the boxes seem to represent processes. But in that case, I think the arrows pointing from the plugins in walsender to Catalog are backwards. The catalog information flows from the Catalog to walsender, walsender does not write to the catalogs.

The reason I drew it that way is that it actively needs to go back to the catalog and query it, which is somewhat different from the rest, which could basically be seen as a unidirectional pipeline.

> Zooming out to look at the big picture, I think the elephant in the room with this whole effort is how it fares against trigger-based replication. You list a number of disadvantages that trigger-based solutions have, compared to the proposed logical replication. Let's take a closer look at them:
>
> > * essentially duplicates the amount of writes (or even more!)
>
> True.

By now I think it's essentially unfixable.

> > * synchronous replication hard or impossible to implement
>
> I don't see any explanation of how it could be implemented in the proposed logical replication either.
It's basically the same as it is for synchronous streaming replication. At the place where SyncRepWaitForLSN() is done, you instead/also wait for the decoding to reach that LSN (it's in the WAL, so everything is decodable) and for the other side to have confirmed reception of those changes. I think this should be doable with only minor code modifications. The existing support for all that is basically the reason we want to reuse the walsender framework. (Will start a thread about that soon.)

> > * noticeable CPU overhead
> >   * trigger functions
> >   * text conversion of data
>
> Well, I'm pretty sure we could find some micro-optimizations for these if we put in the effort.

Any improvements there are a good idea independent of this proposal, but I don't see how we can fundamentally improve on the status quo.

> And the proposed code isn't exactly free, either.

If you don't have frequent DDL it's really not all that expensive. In the version without DDL support I didn't manage to saturate the ApplyCache with either parallel COPY in individual transactions (repeated 100MB files) or with pgbench. Also, it's basically doing work that the trigger/queue based solutions have to do as well, just that they do it via far less optimized SQL statements. DDL support doesn't really change much, as the overhead for transactions without DDL and without concurrently running DDL should be fairly minor (the submitted version is *not* finalized there; it builds a new snapshot instead of copying/referencing the old one).

> > * complex parts implemented in several solutions
>
> Not sure what this means, but the proposed code is quite complex too.

It is, agreed. What I mean is that significantly complex logic is buried in the encoding, queuing and decoding/ordering logic of every trigger-based replication solution. That's not a good thing.

> > * not in core
>
> IMHO that's a good thing, and I'd hope this new logical replication to live outside core as well, as much as possible.

I don't agree there, but I would like to keep that a separate discussion. For now I/we only want to submit the changes that technically need in-core support to work sensibly (this, background workers, some walsender integration). The goal of working nearly completely without special in-core support held the existing solutions back quite a bit imo.

> But whether or not something is in core is just a political decision, not a reason to implement something new.

Isn't it both? There are things you simply cannot do unless you're inside core. Politically, I think the external status of all those logical replication projects grew to be an adoption barrier. I don't even want to think about how many bad home-grown logical replication solutions I have seen out there that implement everything from the get-go.

> If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says.

You're welcome to make suggestions, but everything I could think of that didn't fall short of reality ended up basically duplicating the amount of writes & fsyncs, even if not going through the WAL. You need to be crash safe/restartable (=> writes, fsyncs) and you need to reduce the writes (in memory, => !writes).
There is only one authoritative point where you can rely on a commit to have been successful, and that's when the commit record has been written to the WAL. You can't send out the data to be committed before that's written, because that could result in spuriously committed transactions on the remote side, and you can't easily do it afterwards, because you can crash after the commit.

> Sorry to be a naysayer, but I'm pretty scared of all the new code and complexity these patches bring into core.

Understandable. I tried to keep the introduction of complexity into existing code paths relatively minor, and I think I mostly succeeded there, but it still needs to be maintained.

> PS. I'd love to see a basic Slony plugin for this, for example, to see how much extra code on top of the posted patches you need to write in a plugin like that to make it functional. I'm worried that it's a lot..

I think before it's possible to do something like that a few more design decisions need to be made. Mostly the walsender(ish) integration needs to be done. After that I can imagine writing a demo plugin that outputs changes in a Slony-compatible format, but I would like to see some Slony/Londiste person cooperating on receiving/applying those. What complications are you imagining?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10/10/12 7:26 PM, Bruce Momjian wrote:
> How does Slony write its changes without causing serialization replay conflicts?

Since nobody from the Slony team answered this:

a) Slony replicates *rows*, not *statements*
b) Slony uses serializable mode internally for row replication
c) it's possible (though difficult) for creative usage to get Slony into a deadlock situation

FWIW, I have always assumed that it is impossible (even theoretically) to have statement-based replication without some constraints on the statements you can run, or some replication failures. I think we should expect 9.3's logical replication out of the gate to have some issues and impose constraints on users, and improve with time but never be perfect.

The design Andres and Simon have advanced already eliminates a lot of the common failure cases (now(), random(), nextval()) suffered by pgPool and similar tools. But remember, this feature doesn't have to be *perfect*, it just has to be *better* than the alternatives.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
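To make the now()/random()/nextval() point concrete, here is a minimal sketch of why replaying the original statement diverges while replaying the decoded row does not (the events table and all literal values are invented purely for illustration):

    -- Hypothetical table used only for this example.
    CREATE TABLE events (id serial PRIMARY KEY, created timestamptz, token float8);

    -- Statement-based replay: each node re-evaluates the volatile functions,
    -- so origin and replica end up with different created/token values.
    INSERT INTO events (created, token) VALUES (now(), random());

    -- Row-based (decoded) replay: the replica applies the concrete values the
    -- origin actually wrote, i.e. something equivalent to:
    INSERT INTO events (id, created, token)
    VALUES (1, '2012-10-11 21:15:00+02', 0.42);

The second INSERT is the kind of thing a row-level change stream hands to the subscriber, which is why this class of failure goes away.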
On Wed, Oct 10, 2012 at 10:26 PM, Bruce Momjian <bruce@momjian.us> wrote: > How does Slony write its changes without causing serialization replay > conflicts? It uses a sequence to break any ordering conflicts at the time that data is inserted into its log tables. If there are two transactions, A and B, that were "fighting" over a tuple on the origin, then either: a) A went first, B went second, and the ordering in the log makes that order clear; b) A succeeds, then B fails, so there's no conflict; c) A is doing its thing, and B is blocked behind it for a while, then A fails, and B gets to go through, and there's no conflict. Switch A and B as needed. The sequence that is used establishes what is termed a "compatible ordering." There are multiple possible compatible orderings; ours happen to interleave transactions together, with the sequence guaranteeing absence of conflict. If we could get commit orderings, then a different but still "compatible ordering" would be to have each transaction establish its own internal sequence, and apply things in order based on (commit_tx_order, sequence_within). -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
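As a rough sketch of the two "compatible orderings" described above (the table and column names follow the sl_log_* layout discussed later in this thread; commit_tx_order and sequence_within are hypothetical, as in Christopher's description):

    -- Today: interleaved rows from concurrent transactions, made conflict-free
    -- by ordering on the sequence assigned when each row was logged.
    SELECT * FROM sl_log_1
     WHERE log_origin = 1
     ORDER BY log_actionseq;

    -- With commit-order information available, an alternative compatible
    -- ordering would group each transaction's changes together, i.e. roughly:
    --   ORDER BY commit_tx_order, sequence_within;

Either way the subscriber applies changes in an order that agrees with the row locks taken on the origin.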
On 11 October 2012 03:16, Greg Stark <stark@mit.edu> wrote: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think I've mentioned it before, but in the interest of not being >>> seen to critique the bikeshed only after it's been painted: this >>> design gives up something very important that exists in our current >>> built-in replication solution, namely pipelining. >> >> Isn't there an even more serious problem, namely that this assumes >> *all* transactions are serializable? What happens when they aren't? >> Or even just that the effective commit order is not XID order? > > Firstly, I haven't read the code but I'm confident it doesn't make the > elementary error of assuming commit order == xid order. I assume it's > applying the reassembled transactions in commit order. > > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. The data written by > this transaction is either written or not when the commit happens and > it's all written or not at that time. Even in non-serializable mode > updates take row locks and nobody can see the data or modify it until > the transaction commits. This uses Commit Serializability, which is valid, as you say. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12-10-11 06:27 PM, Josh Berkus wrote:
> On 10/10/12 7:26 PM, Bruce Momjian wrote:
>> How does Slony write its changes without causing serialization replay conflicts?
> Since nobody from the Slony team answered this:
>
> a) Slony replicates *rows*, not *statements*

True, but the proposed logical replication would also replicate rows, not the original statements. I don't consider this proposal to be an example of 'statement' replication like pgpool is. If the original SQL was 'update foo set x=x+1 where id > 10', there will be a WAL record to decode for each row modified in the table. In a million row table I'd expect the replica will have to apply a million records (whether they be binary heap tuples or SQL statements).

> b) Slony uses serializable mode internally for row replication

Actually, recent versions of Slony apply transactions against the replica in read committed mode. Older versions used serializable mode, but with the SSI changes in 9.1 we found Slony tended to have serialization conflicts with itself on the Slony internal tables, resulting in a lot of aborted transactions.

When Slony applies changes on a replica table it does so in a single transaction. Slony finds a set of transactions that committed on the master in between two SYNC events. It then applies all of the rows changed by any of those transactions as part of a single transaction on the replica (a sketch of this follows below). Chris's post explains this in more detail. Conflicts with user transactions on the replica are possible.

> c) it's possible (though difficult) for creative usage to get Slony into a deadlock situation
>
> FWIW, I have always assumed that it is impossible (even theoretically) to have statement-based replication without some constraints on the statements you can run, or some replication failures. I think we should expect 9.3's logical replication out of the gate to have some issues and impose constraints on users, and improve with time but never be perfect.
>
> The design Andres and Simon have advanced already eliminates a lot of the common failure cases (now(), random(), nextval()) suffered by pgPool and similar tools. But remember, this feature doesn't have to be *perfect*, it just has to be *better* than the alternatives.
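A minimal sketch of that apply pattern, purely illustrative: the txid list is invented and the column names are assumed from the sl_log_* description later in this thread, not taken from Slony's actual apply queries:

    BEGIN;
    -- Fetch every logged row change from the transactions that committed on the
    -- origin between the two SYNC events, in a compatible order, and apply them.
    SELECT log_tablenspname, log_tablerelname, log_cmdtype, log_cmdargs
      FROM sl_log_1
     WHERE log_origin = 1
       AND log_txid IN (1001, 1002, 1005)   -- txids between the SYNCs (hypothetical)
     ORDER BY log_actionseq;
    -- ... each returned row is turned into an INSERT/UPDATE/DELETE on the replica ...
    COMMIT;

One replica transaction thus covers a whole group of origin transactions, which is also why conflicts with concurrent user transactions on the replica can occur.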
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Robert Haas
Date:
On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > IMHO that's a good thing, and I'd hope this new logical replication to live > outside core as well, as much as possible. But whether or not something is > in core is just a political decision, not a reason to implement something > new. > > If the only meaningful advantage is reducing the amount of WAL written, I > can't help thinking that we should just try to address that in the existing > solutions, even if it seems "easy to solve at a first glance, but a solution > not using a normal transactional table for its log/queue has to solve a lot > of problems", as the document says. Sorry to be a naysayer, but I'm pretty > scared of all the new code and complexity these patches bring into core. I think what we're really missing at the moment is a decent way of decoding WAL. There are a decent number of customers who, when presented with replication system, start by asking whether it's trigger-based or WAL-based. When you answer that it's trigger-based, their interest goes... way down. If you tell them the triggers are written in anything but C, you lose a bunch more points. Sure, some people's concerns are overblown, but it's hard to escape the conclusion that a WAL-based solution can be a lot more efficient than a trigger-based solution, and EnterpriseDB has gotten comments from a number of people who upgraded to 9.0 or 9.1 to the effect that SR was way faster than Slony. I do not personally believe that a WAL decoding solution adequate to drive logical replication can live outside of core, at least not unless core exposes a whole lot more interface than we do now, and probably not even then. Even if it could, I don't see the case for making every replication solution reinvent that wheel. It's a big wheel to be reinventing, and everyone needs pretty much the same thing. That having been said, I have to agree that the people working on this project seem to be wearing rose-colored glasses when it comes to the difficulty of implementing a full-fledged solution in core. I'm right on board with everything up to the point where we start kicking out a stream of decoded changes to the user... and that's about it. To pick on Slony for the moment, as the project that has been around for the longest and has probably the largest user base (outside of built-in SR, perhaps), they've got a project that they have been developing for years and years and years. What have they been doing all that time? Maybe they are just stupid, but Chris and Jan and Steve don't strike me that way, so I think the real answer is that they are solving problems that we haven't even started to think about yet, especially around control logic: how do you turn it on? how do you turn it off? how do you handle node failures? how do you handle it when a node gets behind? We are not going to invent good solutions to all of those problems between now and January, or even between now and next January. > PS. I'd love to see a basic Slony plugin for this, for example, to see how > much extra code on top of the posted patches you need to write in a plugin > like that to make it functional. I'm worried that it's a lot.. I agree. I would go so far as to say that if Slony can't integrate with this work and use it in place of their existing change-capture facility, that's sufficient grounds for unconditional rejection. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote:
> On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> > IMHO that's a good thing, and I'd hope this new logical replication to live outside core as well, as much as possible. But whether or not something is in core is just a political decision, not a reason to implement something new.
> >
> > If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says. Sorry to be a naysayer, but I'm pretty scared of all the new code and complexity these patches bring into core.
>
> I do not personally believe that a WAL decoding solution adequate to drive logical replication can live outside of core, at least not unless core exposes a whole lot more interface than we do now, and probably not even then. Even if it could, I don't see the case for making every replication solution reinvent that wheel. It's a big wheel to be reinventing, and everyone needs pretty much the same thing.

Unsurprisingly, I agree.

> That having been said, I have to agree that the people working on this project seem to be wearing rose-colored glasses when it comes to the difficulty of implementing a full-fledged solution in core.

That very well might be true. Sometimes rose-colored glasses can be quite productive in getting something started...

Note that at this point we only want WAL decoding, background workers and related things to get integrated...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Bruce Momjian
Date:
On Mon, Oct 15, 2012 at 09:57:19AM +0200, Andres Freund wrote: > On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote: > > On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas > > > > <hlinnakangas@vmware.com> wrote: > > > IMHO that's a good thing, and I'd hope this new logical replication to > > > live outside core as well, as much as possible. But whether or not > > > something is in core is just a political decision, not a reason to > > > implement something new. > > > > > > If the only meaningful advantage is reducing the amount of WAL written, I > > > can't help thinking that we should just try to address that in the > > > existing solutions, even if it seems "easy to solve at a first glance, > > > but a solution not using a normal transactional table for its log/queue > > > has to solve a lot of problems", as the document says. Sorry to be a > > > naysayer, but I'm pretty scared of all the new code and complexity these > > > patches bring into core. > > > I do not personally believe that a WAL decoding solution adequate to > > drive logical replication can live outside of core, at least not > > unless core exposes a whole lot more interface than we do now, and > > probably not even then. Even if it could, I don't see the case for > > making every replication solution reinvent that wheel. It's a big > > wheel to be reinventing, and everyone needs pretty much the same > > thing. > Unsurprisingly I aggree. > > > That having been said, I have to agree that the people working on this > > project seem to be wearing rose-colored glasses when it comes to the > > difficulty of implementing a full-fledged solution in core. > That very well might be true. Sometimes rose-colored glasses can be quite > productive in getting something started... > > Note at this point were only want wal decoding, background workers and related > things to get integrated... Well, TODO does have: Move pgfoundry's xlogdump to /contrib and have it rely more closely onthe WAL backend code I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 08:19:54 PM Bruce Momjian wrote: > On Mon, Oct 15, 2012 at 09:57:19AM +0200, Andres Freund wrote: > > On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote: > > > On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas > > > > > > <hlinnakangas@vmware.com> wrote: > > > > IMHO that's a good thing, and I'd hope this new logical replication > > > > to live outside core as well, as much as possible. But whether or > > > > not something is in core is just a political decision, not a reason > > > > to implement something new. > > > > > > > > If the only meaningful advantage is reducing the amount of WAL > > > > written, I can't help thinking that we should just try to address > > > > that in the existing solutions, even if it seems "easy to solve at a > > > > first glance, but a solution not using a normal transactional table > > > > for its log/queue has to solve a lot of problems", as the document > > > > says. Sorry to be a naysayer, but I'm pretty scared of all the new > > > > code and complexity these patches bring into core. > > > > > > I do not personally believe that a WAL decoding solution adequate to > > > drive logical replication can live outside of core, at least not > > > unless core exposes a whole lot more interface than we do now, and > > > probably not even then. Even if it could, I don't see the case for > > > making every replication solution reinvent that wheel. It's a big > > > wheel to be reinventing, and everyone needs pretty much the same > > > thing. > > > > Unsurprisingly I aggree. > > > > > That having been said, I have to agree that the people working on this > > > project seem to be wearing rose-colored glasses when it comes to the > > > difficulty of implementing a full-fledged solution in core. > > > > That very well might be true. Sometimes rose-colored glasses can be quite > > productive in getting something started... > > > > Note at this point were only want wal decoding, background workers and > > related things to get integrated... > > Well, TODO does have: > > Move pgfoundry's xlogdump to /contrib and have it rely more closely on > the WAL backend code Uhm. How does that relate to my statement? The xlogreader code I submitted does contain a very small POC xlogdump with almost no code duplication. It needs some work to be really useful though. > I think Robert is right that if Slony can't use the API, it is unlikely > any other replication system could use it. I aggree and I don't think I have ever said something contrary. I just don't want to be the only one working on slony integration. I am ready to do a good part of that, but somebody with slony experience needs to help, especially on consuming those changes. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/11/2012 01:42 PM, Andres Freund wrote:
> On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
> ...
>> If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says.
>
> You're welcome to make suggestions, but everything I could think of that didn't fall short of reality ended up basically duplicating the amount of writes & fsyncs, even if not going through the WAL.
>
> You need to be crash safe/restartable (=> writes, fsyncs) and you need to reduce the writes (in memory, => !writes). There is only one authoritative point where you can rely on a commit to have been successful, and that's when the commit record has been written to the WAL. You can't send out the data to be committed before that's written because that could result in spuriously committed transactions on the remote side, and you can't easily do it afterwards because you can crash after the commit.

Just curious here, but do you know how this part is solved in current sync WAL replication - you can get "spurious" commits on the slave side if the master dies while waiting for confirmation.

> What complications are you imagining?
>
> Greetings, Andres

Hannu
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 08:38:07 PM Hannu Krosing wrote: > On 10/11/2012 01:42 PM, Andres Freund wrote: > > On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote: > > ... > > If the only meaningful advantage is reducing the amount of WAL written, > > I can't help thinking that we should just try to address that in the > > existing solutions, even if it seems "easy to solve at a first glance, > > but a solution not using a normal transactional table for its log/queue > > has to solve a lot of problems", as the document says. > > Youre welcome to make suggestions, but everything I could think of that > > didn't fall short of reality ended up basically duplicating the amount > > of writes & fsyncs, even if not going through the WAL. > > > > You need to be crash safe/restartable (=> writes, fsyncs) and you need to > > reduce the writes (in memory, => !writes). There is only one > > authoritative point where you can rely on a commit to have been > > successfull and thats when the commit record has been written to the > > WAL. You can't send out the data to be committed before thats written > > because that could result in spuriously committed transactions on the > > remote side and you can't easily do it afterwards because you can crash > > after the commit. > > Just curious here, but do you know how is this part solved in current sync > wal replication - you can get "spurious" commits on slave side id master > dies while waiting for confirmation. Synchronous replication is only synchronous in respect to the COMMIT reply sent to the user. First the commit is written to WAL locally, so it persists across a crash (c.f. RecordTransactionCommit). Only then we wait for the standby (SyncRepWaitForLSN). After that finished the shared memory on the primary gets updated (c.f. ProcArrayEndTransaction in CommitTransaction) and soon after that the user gets the response to the COMMIT back. I am not really sure what you were asking for, does the above explanation answer this? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/15/2012 04:54 AM, Robert Haas wrote: > On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> IMHO that's a good thing, and I'd hope this new logical replication to live >> outside core as well, as much as possible. But whether or not something is >> in core is just a political decision, not a reason to implement something >> new. >> >> If the only meaningful advantage is reducing the amount of WAL written, I >> can't help thinking that we should just try to address that in the existing >> solutions, even if it seems "easy to solve at a first glance, but a solution >> not using a normal transactional table for its log/queue has to solve a lot >> of problems", as the document says. Sorry to be a naysayer, but I'm pretty >> scared of all the new code and complexity these patches bring into core. > I think what we're really missing at the moment is a decent way of > decoding WAL. There are a decent number of customers who, when > presented with replication system, start by asking whether it's > trigger-based or WAL-based. When you answer that it's trigger-based, > their interest goes... way down. If you tell them the triggers are > written in anything but C, you lose a bunch more points. Sure, some > people's concerns are overblown, but it's hard to escape the > conclusion that a WAL-based solution can be a lot more efficient than > a trigger-based solution, and EnterpriseDB has gotten comments from a > number of people who upgraded to 9.0 or 9.1 to the effect that SR was > way faster than Slony. > > I do not personally believe that a WAL decoding solution adequate to > drive logical replication can live outside of core, at least not > unless core exposes a whole lot more interface than we do now, and > probably not even then. Even if it could, I don't see the case for > making every replication solution reinvent that wheel. It's a big > wheel to be reinventing, and everyone needs pretty much the same > thing. > > That having been said, I have to agree that the people working on this > project seem to be wearing rose-colored glasses when it comes to the > difficulty of implementing a full-fledged solution in core. I'm right > on board with everything up to the point where we start kicking out a > stream of decoded changes to the user... and that's about it. To pick > on Slony for the moment, as the project that has been around for the > longest and has probably the largest user base (outside of built-in > SR, perhaps), they've got a project that they have been developing for > years and years and years. What have they been doing all that time? > Maybe they are just stupid, but Chris and Jan and Steve don't strike > me that way, so I think the real answer is that they are solving > problems that we haven't even started to think about yet, especially > around control logic: how do you turn it on? how do you turn it off? > how do you handle node failures? how do you handle it when a node > gets behind? We are not going to invent good solutions to all of > those problems between now and January, or even between now and next > January. I understand the the current minimal target is to get on par to current WAL streaming in terms of setup ease and performance with additional benefit of having read-write subscribers with at least conflict detection and logging. Hoping that we have something by january that solves all possible replication scenarios out of the box is unrealistic. >> PS. 
I'd love to see a basic Slony plugin for this, for example, to see how >> much extra code on top of the posted patches you need to write in a plugin >> like that to make it functional. I'm worried that it's a lot.. > I agree. I would go so far as to say that if Slony can't integrate > with this work and use it in place of their existing change-capture > facility, that's sufficient grounds for unconditional rejection. >
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/15/2012 08:44 PM, Andres Freund wrote:
> On Monday, October 15, 2012 08:38:07 PM Hannu Krosing wrote:
>> On 10/11/2012 01:42 PM, Andres Freund wrote:
>>> On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
>>> ...
>>>> If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says.
>>>
>>> You're welcome to make suggestions, but everything I could think of that didn't fall short of reality ended up basically duplicating the amount of writes & fsyncs, even if not going through the WAL.
>>>
>>> You need to be crash safe/restartable (=> writes, fsyncs) and you need to reduce the writes (in memory, => !writes). There is only one authoritative point where you can rely on a commit to have been successful, and that's when the commit record has been written to the WAL. You can't send out the data to be committed before that's written because that could result in spuriously committed transactions on the remote side, and you can't easily do it afterwards because you can crash after the commit.
>>
>> Just curious here, but do you know how this part is solved in current sync WAL replication - you can get "spurious" commits on the slave side if the master dies while waiting for confirmation.
>
> Synchronous replication is only synchronous in respect to the COMMIT reply sent to the user. First the commit is written to WAL locally, so it persists across a crash (c.f. RecordTransactionCommit). Only then we wait for the standby (SyncRepWaitForLSN). After that has finished, the shared memory on the primary gets updated (c.f. ProcArrayEndTransaction in CommitTransaction) and soon after that the user gets the response to the COMMIT back.
>
> I am not really sure what you were asking for, does the above explanation answer this?

I think I mostly got it: if the master crashes before the commit confirmation comes back, then the commit _will_ still have happened after restart. To the client it looks like it did not commit, but it is no different in this respect than any other crash-before-confirmation, and thus the client cannot rely on the commit not having happened and has to check it.

> Greetings,
>
> Andres
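A small sketch of the check-after-reconnect pattern that last sentence implies (the commit_tokens table and token value are hypothetical; this is a generic client-side technique, not something from the patch):

    BEGIN;
    -- ... the application's actual work ...
    INSERT INTO commit_tokens (token) VALUES ('job-4711');  -- bookkeeping row written in the same transaction
    COMMIT;

    -- After a crash and reconnect, the client checks whether the commit made it:
    SELECT 1 FROM commit_tokens WHERE token = 'job-4711';

If the row is there, the commit survived the crash even though the client never received the confirmation.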
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Bruce Momjian
Date:
On Mon, Oct 15, 2012 at 08:26:08PM +0200, Andres Freund wrote: > > > > I do not personally believe that a WAL decoding solution adequate to > > > > drive logical replication can live outside of core, at least not > > > > unless core exposes a whole lot more interface than we do now, and > > > > probably not even then. Even if it could, I don't see the case for > > > > making every replication solution reinvent that wheel. It's a big > > > > wheel to be reinventing, and everyone needs pretty much the same > > > > thing. > > > > > > Unsurprisingly I aggree. > > > > > > > That having been said, I have to agree that the people working on this > > > > project seem to be wearing rose-colored glasses when it comes to the > > > > difficulty of implementing a full-fledged solution in core. > > > > > > That very well might be true. Sometimes rose-colored glasses can be quite > > > productive in getting something started... > > > > > > Note at this point were only want wal decoding, background workers and > > > related things to get integrated... > > > > Well, TODO does have: > > > > Move pgfoundry's xlogdump to /contrib and have it rely more closely on > > the WAL backend code > > Uhm. How does that relate to my statement? > > The xlogreader code I submitted does contain a very small POC xlogdump with > almost no code duplication. It needs some work to be really useful though. I just meant that dumping xlog contents is something we want to improve. > > I think Robert is right that if Slony can't use the API, it is unlikely > > any other replication system could use it. > > I aggree and I don't think I have ever said something contrary. I just don't > want to be the only one working on slony integration. I am ready to do a good > part of that, but somebody with slony experience needs to help, especially on > consuming those changes. Agreed. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/15/2012 04:54 AM, Robert Haas wrote: >> PS. I'd love to see a basic Slony plugin for this, for example, to see how >> >much extra code on top of the posted patches you need to write in a plugin >> >like that to make it functional. I'm worried that it's a lot.. > I agree. I would go so far as to say that if Slony can't integrate > with this work and use it in place of their existing change-capture > facility, that's sufficient grounds for unconditional rejection. The fact that current work starts with "apply cache" instead of streaming makes the semantics very close to how londiste and slony do this. Therefore I don't think there will be any problems with "can't" though it may be that there will be nobody actually doing it, at least not before January. ------------ Hannu
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Peter Geoghegan
Date:
On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:
> I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and someone has to go first - why should the Slony guys invest in adopting this technology if it is going to necessitate using a forked Postgres with an uncertain future? That would be (with respect to the Slony guys) a commercial risk that is fairly heavily concentrated with Afilias.

So, if you're going to attach as a condition to its acceptance that the Slony guys be able to use it immediately (because "can integrate" really means "will integrate", right?), you're attaching it to a rather arbitrary condition that has nothing much to do with the technical merit of the patches proposed. The fact of the matter is that Slony was originally designed with a somewhat different set of constraints to those that exist today, so I don't doubt that this is something that they're going to need to integrate over time, probably in a separate release branch, to get the upsides of in-core logical replication, along with the great flexibility that Slony currently offers (and that Afilias undoubtedly depend upon today).

Another way of putting this is that Postgres should go first, because we will get huge benefits even if only one of the trigger-based logical replication systems adopts the technology. Though I hope and expect that the Slony guys will be able to work with what we're doing, surely a logical replication system with all the benefits implied by being logical, but with only some subset of Slony's functionality, is still going to be of great benefit.

My view is that the only reasonable approach is to build something solid, well-integrated and generic, in core. I'd certainly like to hear what the Slony guys have to say here, though.

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Robert Haas
Date:
On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> I think Robert is right that if Slony can't use the API, it is unlikely >> any other replication system could use it. > > I don't accept that. Clearly there is a circular dependency, and > someone has to go first - why should the Slony guys invest in adopting > this technology if it is going to necessitate using a forked Postgres > with an uncertain future? Clearly, core needs to go first. However, before we commit, I would like to hear the Slony guys say something like this: We read the documentation that is part of this patch and if the feature behaves as advertised, we believe we will be able to use it in place of the change-capture mechanism that we have now, and that it will be at least as good as what we have now if not a whole lot better. If they say something like "I'm not sure we have the right design for this" or "this wouldn't be sufficient to replace this portion of what we have now because it lacks critical feature X", I would be very concerned about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote:
> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:
> > I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it.
>
> I don't accept that. Clearly there is a circular dependency, and someone has to go first - why should the Slony guys invest in adopting this technology if it is going to necessitate using a forked Postgres with an uncertain future?

Well. I don't think (hope) anybody proposed making something release-worthy for Slony, but rather a POC patch that proves the API is generic enough to be used by them. If I (or somebody else familiar with this) work together with somebody familiar with Slony internals, I think such a POC shouldn't be too hard to do.

I think some more input from that side is a good idea. I plan to send out an email to possibly interested parties in about two weeks...

Regards,

Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >>> I think Robert is right that if Slony can't use the API, it is unlikely >>> any other replication system could use it. >> I don't accept that. Clearly there is a circular dependency, and >> someone has to go first - why should the Slony guys invest in adopting >> this technology if it is going to necessitate using a forked Postgres >> with an uncertain future? > Clearly, core needs to go first. However, before we commit, I would > like to hear the Slony guys say something like this: We read the > documentation that is part of this patch and if the feature behaves as > advertised, we believe we will be able to use it in place of the > change-capture mechanism that we have now, and that it will be at > least as good as what we have now if not a whole lot better. > If they say something like "I'm not sure we have the right design for > this" or "this wouldn't be sufficient to replace this portion of what > we have now because it lacks critical feature X", I would be very > concerned about that. The other point here is that core code without any implemented use-cases is unlikely to be worth a tinker's damn. Regardless of what time-frame the Slony guys are able to work on, I think we need to see working code (of at least prototype quality) before we believe that we've got it right. Or if not code from them, code from some other replication project. A possibly-useful comparison is to the FDW APIs we've been slowly implementing over the past couple releases. Even though we *did* have some use-cases written right off the bat, we got it wrong and had to change it in 9.2, and I wouldn't bet against having to change it again in 9.3 (even without considering the need for extensions for non-SELECT operations). And, despite our very clear warnings that all that stuff was in flux, people have been griping because the APIs changed. So if we ship core hooks for logical replication in advance of proof that they're actually usable by at least one (preferably more than one) replication project, I confidently predict that they'll be wrong and will need revision and the potential users will complain about the API instability. regards, tom lane
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Christopher Browne
Date:
On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> I think Robert is right that if Slony can't use the API, it is unlikely >> any other replication system could use it. > > I don't accept that. Clearly there is a circular dependency, and > someone has to go first - why should the Slony guys invest in adopting > this technology if it is going to necessitate using a forked Postgres > with an uncertain future? That would be (with respect to the Slony > guys) a commercial risk that is fairly heavily concentrated with > Afilias. Yep, there's something a bit too circular there. I'd also not be keen on reimplementing the "Slony integration" over and over if it turns out that the API churns for a while before stabilizing. That shouldn't be misread as "I expect horrible amounts of churn", just that *any* churn comes at a cost. And if anything unfortunate happens, that can easily multiply into a multiplicity of painfulness(es?). -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 10:03:40 PM Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > >> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: > >>> I think Robert is right that if Slony can't use the API, it is unlikely > >>> any other replication system could use it. > >> > >> I don't accept that. Clearly there is a circular dependency, and > >> someone has to go first - why should the Slony guys invest in adopting > >> this technology if it is going to necessitate using a forked Postgres > >> with an uncertain future? > > > > Clearly, core needs to go first. However, before we commit, I would > > like to hear the Slony guys say something like this: We read the > > documentation that is part of this patch and if the feature behaves as > > advertised, we believe we will be able to use it in place of the > > change-capture mechanism that we have now, and that it will be at > > least as good as what we have now if not a whole lot better. > > > > If they say something like "I'm not sure we have the right design for > > this" or "this wouldn't be sufficient to replace this portion of what > > we have now because it lacks critical feature X", I would be very > > concerned about that. > > The other point here is that core code without any implemented use-cases > is unlikely to be worth a tinker's damn. Regardless of what time-frame > the Slony guys are able to work on, I think we need to see working code > (of at least prototype quality) before we believe that we've got it > right. Or if not code from them, code from some other replication > project. FWIW we (as in 2ndq), unsurprisingly, have a user of this which is in development atm. > A possibly-useful comparison is to the FDW APIs we've been slowly > implementing over the past couple releases. Even though we *did* have > some use-cases written right off the bat, we got it wrong and had to > change it in 9.2, and I wouldn't bet against having to change it again > in 9.3 (even without considering the need for extensions for non-SELECT > operations). And, despite our very clear warnings that all that stuff > was in flux, people have been griping because the APIs changed. On the other hand, I don't think we would have FDWs today at all if it wouldn't have been done that way. So I really cannot see that as an indication of not working incrementally. Obviously thats not an argument for not trying to get the API correct right off the bat. I seriously hope the userlevel API continues to be simpler than what FDWs need. Regards, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Simon Riggs
Date:
On 15 October 2012 21:03, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >>> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >>>> I think Robert is right that if Slony can't use the API, it is unlikely >>>> any other replication system could use it. > >>> I don't accept that. Clearly there is a circular dependency, and >>> someone has to go first - why should the Slony guys invest in adopting >>> this technology if it is going to necessitate using a forked Postgres >>> with an uncertain future? > >> Clearly, core needs to go first. However, before we commit, I would >> like to hear the Slony guys say something like this: We read the >> documentation that is part of this patch and if the feature behaves as >> advertised, we believe we will be able to use it in place of the >> change-capture mechanism that we have now, and that it will be at >> least as good as what we have now if not a whole lot better. > >> If they say something like "I'm not sure we have the right design for >> this" or "this wouldn't be sufficient to replace this portion of what >> we have now because it lacks critical feature X", I would be very >> concerned about that. > > The other point here is that core code without any implemented use-cases > is unlikely to be worth a tinker's damn. Regardless of what time-frame > the Slony guys are able to work on, I think we need to see working code > (of at least prototype quality) before we believe that we've got it > right. Or if not code from them, code from some other replication > project. > > A possibly-useful comparison is to the FDW APIs we've been slowly > implementing over the past couple releases. Even though we *did* have > some use-cases written right off the bat, we got it wrong and had to > change it in 9.2, and I wouldn't bet against having to change it again > in 9.3 (even without considering the need for extensions for non-SELECT > operations). And, despite our very clear warnings that all that stuff > was in flux, people have been griping because the APIs changed. > > So if we ship core hooks for logical replication in advance of proof > that they're actually usable by at least one (preferably more than one) > replication project, I confidently predict that they'll be wrong and > will need revision and the potential users will complain about the > API instability. The prototype we showed at PgCon illustrated a working system, so we're on the right track. We've split that in two now, specifically to allow other projects to use what is being built. The exact API of that split is for discussion and has been massively redesigned on community advice for the sole purpose of including other approaches. We can't guarantee that external open source or commercial vendors will use the API. But we can say that in-core use cases exist for multiple approaches. We shouldn't put the decision on that in the hands of others. Jan spoke at length at PgCon, for all to hear, that what we are building is a much better way than the trigger logging approach Slony uses. I don't take that as carte blanche for approval of everything being done, but its going in the right direction with an open heart, which is about as good as it gets. There will be a working system again soon, once we have re-built things around the new API. The longer it takes to get a stable API the longer we take to rebuild things around it again. 
The goal of the project is to release everything open source, PGDG copyrighted and TPL licenced and to submit to core. We are signed up to that in various ways, not least of all our word given publicly. Please give this your backing, so an open source outcome can be possible. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote: > On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: > >> I think Robert is right that if Slony can't use the API, it is unlikely > >> any other replication system could use it. > > > > I don't accept that. Clearly there is a circular dependency, and > > someone has to go first - why should the Slony guys invest in adopting > > this technology if it is going to necessitate using a forked Postgres > > with an uncertain future? That would be (with respect to the Slony > > guys) a commercial risk that is fairly heavily concentrated with > > Afilias. > > Yep, there's something a bit too circular there. > > I'd also not be keen on reimplementing the "Slony integration" over > and over if it turns out that the API churns for a while before > stabilizing. That shouldn't be misread as "I expect horrible amounts > of churn", just that *any* churn comes at a cost. And if anything > unfortunate happens, that can easily multiply into a multiplicity of > painfulness(es?). Well, as a crosscheck, could you list your requirements? Do you need anything more than outputting data in a format compatible to whats stored in sl_log_*? You wouldn't have sl_actionseq, everything else should be there (Well, you would need to do lookups to get the tableid, but thats not really much of a problem). The results would be ordered in complete transactions, in commit order. I guess the other tables would stay as they are as they contain the "added value" of slony? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Christopher Browne
Date:
On Mon, Oct 15, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote: >> On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> > wrote: >> > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> >> I think Robert is right that if Slony can't use the API, it is unlikely >> >> any other replication system could use it. >> > >> > I don't accept that. Clearly there is a circular dependency, and >> > someone has to go first - why should the Slony guys invest in adopting >> > this technology if it is going to necessitate using a forked Postgres >> > with an uncertain future? That would be (with respect to the Slony >> > guys) a commercial risk that is fairly heavily concentrated with >> > Afilias. >> >> Yep, there's something a bit too circular there. >> >> I'd also not be keen on reimplementing the "Slony integration" over >> and over if it turns out that the API churns for a while before >> stabilizing. That shouldn't be misread as "I expect horrible amounts >> of churn", just that *any* churn comes at a cost. And if anything >> unfortunate happens, that can easily multiply into a multiplicity of >> painfulness(es?). > > Well, as a crosscheck, could you list your requirements? > > Do you need anything more than outputting data in a format compatible to whats > stored in sl_log_*? You wouldn't have sl_actionseq, everything else should be > there (Well, you would need to do lookups to get the tableid, but thats not > really much of a problem). The results would be ordered in complete > transactions, in commit order. Hmm. We need to have log data that's in a compatible ordering. We use sl_actionseq, and can mix data from multiple transactions together; if what you're providing is, instead, in order based on transaction commit order followed by some sequencing within each transaction, then that should be acceptable. The stylized query on sl_log_* looks like... select log_origin, log_txid, log_tableid, log_actionseq, log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols, log_cmdargs from %s.sl_log_%d where log_origin = %d How about I "quibble" about each of these: a) log_origin - this indicates the node from which the data originates. Presumably, this is implicit in a "chunk" of data that is coming in. b) log_txid - indicating the transaction ID. I presume you've got this available. It's less important with the WAL-based scheme in that we'd probably not be using it as a basis for querying as is the case today with Slony. c) log_tableid - indicating the ID of the table. Are you capturing an OID equivalent to this? Or what? d) log_actionseq - indicating relative sequences of updates. You don't have this, but if you're capturing commit ordering, we don't need it. e) log_tablenspname, log_tablerelname - some small amount of magic needful to get this. Or perhaps you are already capturing it? f) log_cmdtype - I/U/D/T - indicating the action (insert/update/delete/truncate). Hopefully you have something like this? g) log_cmdupdncols - for UPDATE action, the number of updated columns.Probably not mandatory; this was a new 2.1 thing... h) log_cmdargs - the actual data needed to do the I/U/D. The form of this matters a fair bit. Before Slony 2.1, this was a portion of a SQL statement, omitting the operation (provided in log_cmdtype) and the table name (in log_tablerelname et al). 
In Slony 2.1, this changes to be a text[] array that essentially consists of pairs of [column name, column value] values. I see one place, very notable in Slony 2.2, that would also be more complicated, which is the handling of DDL. In 2.1 and earlier, we handled DDL as "events", essentially out of band. This isn't actually correct; it could mix very badly if you had replication activity mixing with DDL requests. (More detail than you want is in a bug on this... <http://www.slony.info/bugzilla/show_bug.cgi?id=137>). In Slony 2.2, we added a third "log table" where DDL gets captured. sl_log_script has much the same schema as sl_log_{1,2}; it needs to get "mixed in" in compatible order. What I imagine would pointedly complicate life is if a single transaction contained both DDL and "regular replicable activity." Slony 2.2 mixes this in using XID + log_actionseq; how this would play out with your log capture mechanism isn't completely clear to me. That's the place where I'd expect the very messiest interaction. > I guess the other tables would stay as they are as they contain the "added > value" of slony? A fair bit of Slony is about the "event infrastructure," and some of that ceases to be as needful. The configuration bits probably continue to remain interesting. The parts that seem notably mysterious to me at the moment are: a) How do we pull result sets (e.g. - sl_log_* data)? b) How is the command data represented? c) If we have a need to mix together your 'raw logs' and other material (e.g. - our sl_log_script that captures DDL-like changes to be mixed back in), how easy|impossible is this? -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Tuesday, October 16, 2012 12:13:14 AM Christopher Browne wrote: > On Mon, Oct 15, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote: > >> On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> > > > > wrote: > >> > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: > >> >> I think Robert is right that if Slony can't use the API, it is > >> >> unlikely any other replication system could use it. > >> > > >> > I don't accept that. Clearly there is a circular dependency, and > >> > someone has to go first - why should the Slony guys invest in adopting > >> > this technology if it is going to necessitate using a forked Postgres > >> > with an uncertain future? That would be (with respect to the Slony > >> > guys) a commercial risk that is fairly heavily concentrated with > >> > Afilias. > >> > >> Yep, there's something a bit too circular there. > >> > >> I'd also not be keen on reimplementing the "Slony integration" over > >> and over if it turns out that the API churns for a while before > >> stabilizing. That shouldn't be misread as "I expect horrible amounts > >> of churn", just that *any* churn comes at a cost. And if anything > >> unfortunate happens, that can easily multiply into a multiplicity of > >> painfulness(es?). > > > > Well, as a crosscheck, could you list your requirements? > > > > Do you need anything more than outputting data in a format compatible to > > whats stored in sl_log_*? You wouldn't have sl_actionseq, everything > > else should be there (Well, you would need to do lookups to get the > > tableid, but thats not really much of a problem). The results would be > > ordered in complete transactions, in commit order. > > Hmm. We need to have log data that's in a compatible ordering. > > We use sl_actionseq, and can mix data from multiple transactions > together; if what you're providing is, instead, in order based on > transaction commit order followed by some sequencing within each > transaction, then that should be acceptable. Inside the transaction its sequenced by the order the XLogInsert calls were made which is the order the client sent the commands. That sounds like it should be compatible. > The stylized query on sl_log_* looks like... > > select log_origin, log_txid, log_tableid, > log_actionseq, log_tablenspname, > log_tablerelname, log_cmdtype, > log_cmdupdncols, log_cmdargs > from %s.sl_log_%d > where log_origin = %d > > How about I "quibble" about each of these: > > a) log_origin - this indicates the node from which the data > originates. Presumably, this is implicit in a "chunk" of data that is > coming in. I think we can just reuse whatever method you use in slony to get the current node's id to get it in the output plugin. > b) log_txid - indicating the transaction ID. I presume you've got > this available. It's less important with the WAL-based scheme in that > we'd probably not be using it as a basis for querying as is the case > today with Slony. Its directly available. The plugin will have to call txid_out, but thats obviously no problem. > c) log_tableid - indicating the ID of the table. Are you capturing an > OID equivalent to this? Or what? You get the TupleDesc of the table. > d) log_actionseq - indicating relative sequences of updates. You > don't have this, but if you're capturing commit ordering, we don't > need it. Good. > e) log_tablenspname, log_tablerelname - some small amount of magic > needful to get this. 
Or perhaps you are already capturing it? The relevant backend functions are available, so it's no problem (RelationGetNamespace(change->new_tuple->table)). > f) log_cmdtype - I/U/D/T - indicating the action > (insert/update/delete/truncate). Hopefully you have something like > this? Yes: enum ApplyCacheChangeType { APPLY_CACHE_CHANGE_INSERT, APPLY_CACHE_CHANGE_UPDATE, APPLY_CACHE_CHANGE_DELETE, .. } > g) log_cmdupdncols - for UPDATE action, the number of updated columns. > Probably not mandatory; this was a new 2.1 thing... Hm. NO. We don't have that. But then, you can just use normal C code there, so it shouldn't be too hard to compute if needed. > h) log_cmdargs - the actual data needed to do the I/U/D. The form of > this matters a fair bit. Before Slony 2.1, this was a portion of a > SQL statement, omitting the operation (provided in log_cmdtype) and > the table name (in log_tablerelname et al). In Slony 2.1, this > changes to be a text[] array that essentially consists of pairs of > [column name, column value] values. The existing C code to generate this should be copy&pasteable into this with a relatively small amount of changes. > I see one place, very notable in Slony 2.2, that would also be more > complicated, which is the handling of DDL. > > In 2.1 and earlier, we handled DDL as "events", essentially out of > band. This isn't actually correct; it could mix very badly if you had > replication activity mixing with DDL requests. (More detail than you > want is in a bug on this... > <http://www.slony.info/bugzilla/show_bug.cgi?id=137>). > > In Slony 2.2, we added a third "log table" where DDL gets captured. > sl_log_script has much the same schema as sl_log_{1,2}; it needs to > get "mixed in" in compatible order. What I imagine would pointedly > complicate life is if a single transaction contained both DDL and > "regular replicable activity." Slony 2.2 mixes this in using XID + > log_actionseq; how this would play out with your log capture mechanism > isn't completely clear to me. That's the place where I'd expect the > very messiest interaction. Hm. I expect some complications here as well. But then, unless you do something special, changes to those tables (e.g. sl_log_script) will be streamed out as well, so you could just insert events into their respective tables on the receiving side and deal with them there. > > I guess the other tables would stay as they are as they contain the > > "added value" of slony? > > A fair bit of Slony is about the "event infrastructure," and some of > that ceases to be as needful. The configuration bits probably > continue to remain interesting. Quite a bit of the event infrastructure seems to deal with configuration changes and such, all that's probably going to stay... > The parts that seem notably mysterious to me at the moment are: > > a) How do we pull result sets (e.g. - sl_log_* data)? The details of this are in a bit of flux as of now but I hope we will nail this down soon. You open a replication connection to the primary 'replication=1 dbname=...' and issue START_LOGICAL_REPLICATION slony $slot_id 0/DEADBEEF With 0/DEADBEEF being the location you've stopped getting changes the last time. That will start streaming out changes via the COPY protocol. The contents of what's streamed out are entirely up to you. The first time you start replication you need to do: INIT_LOGICAL_REPLICATION which will return a $slot_id, a SET TRANSACTION SNAPSHOT compatible snapshot and the initial starting LSN.
The 'slony' in START_LOGICAL_REPLICATION above denotes which output plugin is to be used. > b) How is the command data represented? Command data is the old/new tuple? That's up to the output plugin. You get a HeapTuple with the old/new tuple, and compatible TupleDesc's. You could simply stream out your current format for example. > c) If we have a need to mix together your 'raw logs' and other > material (e.g. - our sl_log_script that captures DDL-like changes to > be mixed back in), how easy|impossible is this? As described above, in general that seems easy enough. Just insert data into e.g. sl_log_script and when you receive the changes on the other side decide in which table to redirect those. Where I see a bit of a problem is the handling of replication sets, configuration and similar. Currently there is a dichotomy between 'catalog tables' and 'data tables'. The former are not replicated but can be queried in an output plugin (that's the timetravel part). The latter are replicated but cannot be queried. All system catalog tables are in the 'catalog' category by their nature, but I have played with a system column that allows other tables to be treated as catalog tables as well. If you would want to filter data on the source - which probably makes sense? - you currently would need to have such an additional catalog table which is not replicated but can be queried by the output plugin. But I guess the contents of that table would also need to be replicated... I wonder if we should replicate changes to such user-defined catalog tables as well; that should be relatively easy, and if it's not wanted the output plugin can easily filter that (if (class_form->relusercat)). Does that clear things up? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
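To make the tuple-handling side of this concrete, here is a minimal sketch (in the spirit of, but not taken from, the patch's output plugin interface) of how a plugin handed a HeapTuple plus its TupleDesc could emit Slony-style [column name, column value] pairs. The function name and the assumption that it is invoked from an apply_change-style callback are made up for illustration only.

/*
 * Illustrative sketch only -- not the patch's actual API.  Renders the new
 * tuple of an INSERT as "colname=value" pairs, roughly what sl_log_*'s
 * log_cmdargs carries in Slony 2.1+.
 */
#include "postgres.h"
#include "access/htup_details.h"	/* heap_deform_tuple(); access/htup.h on older trees */
#include "access/tupdesc.h"
#include "fmgr.h"					/* OidOutputFunctionCall() */
#include "lib/stringinfo.h"
#include "utils/lsyscache.h"		/* getTypeOutputInfo() */

static void
format_new_tuple(StringInfo out, TupleDesc desc, HeapTuple newtup)
{
	Datum	   *values = palloc(desc->natts * sizeof(Datum));
	bool	   *nulls = palloc(desc->natts * sizeof(bool));
	bool		first = true;
	int			i;

	heap_deform_tuple(newtup, desc, values, nulls);

	for (i = 0; i < desc->natts; i++)
	{
		/* on 9.2-era sources this would be desc->attrs[i] rather than TupleDescAttr() */
		Form_pg_attribute attr = TupleDescAttr(desc, i);
		Oid			outfunc;
		bool		isvarlena;

		if (attr->attisdropped)
			continue;

		getTypeOutputInfo(attr->atttypid, &outfunc, &isvarlena);

		if (!first)
			appendStringInfoString(out, ", ");
		first = false;

		/* one [column name, column value] pair, NULLs rendered explicitly */
		appendStringInfo(out, "%s=%s",
						 NameStr(attr->attname),
						 nulls[i] ? "NULL" : OidOutputFunctionCall(outfunc, values[i]));
	}
}

An UPDATE would do the same thing twice, once for the changed columns and once for the old tuple's key columns, which is where Slony's log_cmdupdncols split comes back into play.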
> The design Andres and Simon have advanced already eliminates a lot of > the common failure cases (now(), random(), nextval()) suffered by pgPool > and similar tools. But remember, this feature doesn't have to be Well, pgpool-II already solved the now() case by using a query rewriting technique. The technique could be applied to random() as well but I'm not convinced it is worth the trouble. nextval() would be a little harder because pgpool needs assistance from PostgreSQL core. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Steve Singer
Date:
On 12-10-15 04:51 PM, Andres Freund wrote: > > Well, as a crosscheck, could you list your requirements? > > Do you need anything more than outputting data in a format compatible to whats > stored in sl_log_*? You wouldn't have sl_actionseq, everything else should be > there (Well, you would need to do lookups to get the tableid, but thats not > really much of a problem). The results would be ordered in complete > transactions, in commit order. > > I guess the other tables would stay as they are as they contain the "added > value" of slony? > > Greetings, I actually had spent some time a few weeks ago looking over the documents and code. I never did get around to writing a review as elegant as Peter's. I have not seen any red flags that make me think that what you're proposing wouldn't be suitable for slony, but sometimes you don't see details until you start implementing something. My initial approach to modifying slony to work with this might be something like: * Leave sl_event as is for non-SYNC events; slon would still generate SYNC events in sl_event * We would modify the remote_worker thread in slon so that instead of selecting from sl_event it would get the next 'committed' transaction from your apply cache. For each ApplyChange record we would check to see if it is an insert into sl_event; if so we would trigger our existing event processing logic based on the contents of the ev_type column. * If the change involves an insert/update/delete/truncate to a replicated table we would translate that change into SQL and apply it on the replica; we would not commit changes on the replica until we encounter a SYNC being added to sl_event for the current origin. * SQL will be applied in a slightly different order than slony does today. Today if two concurrent transactions are inserting into the same replicated table and they commit one after the other there is a good chance that the apply order on the replica will also be intermixed (assuming both commits were in between two SYNC events). My thinking is that we would just replay them one after the other on the replica in commit order. (Slony doesn't use commit order because we don't have it, not because we don't like it.) This would mean we do away with tracking the action id. * If a node is configured as a 'forwarder' node it would store the processed output of each ApplyChange record in a table on the replica. If a slon is pulling data from a non-origin (i.e. if remoteWorkerThread_1 is pulling data from node 2) then it would need to query this table instead of calling the functions that process the ApplyCache contents. * To subscribe a node we would generate a SYNC event on the provider and do the copy_set. We would keep track of that SYNC event. The remote worker would then ignore any data that comes before that SYNC event when it starts pulling data from the apply cache. * DDL events in 2.2+ go into sl_ddl_script (or something like that); when we see INSERT commands to that table we would know to then apply the DDL on the node. * We would need to continue to populate sl_confirm because knowing what SYNC events have already been processed by a node is pretty important in a MOVE SET or FAILOVER. It is possible that we might need to still track the xip lists of each SYNC for MOVE SET/FAILOVER but I'm not sure why/why not. This is all easier said than implemented. Steve > Andres
Hi All, On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: > The design document [2] really just explains the problem (which is the > need for catalog metadata at a point in time to make sense of heap > tuples), without describing the solution that this patch offers with > any degree of detail. Rather, [2] says "How we build snapshots is > somewhat intricate and complicated and seems to be out of scope for > this document", which is unsatisfactory. I look forward to reading the > promised document that describes this mechanism in more detail. Here's the first version of the promised document. I hope it answers most of the questions. Input welcome! Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10/15/2012 4:43 PM, Simon Riggs wrote: > Jan spoke at length at PgCon, for all to hear, that what we are > building is a much better way than the trigger logging approach Slony > uses. I don't take that as carte blanche for approval of everything > being done, but its going in the right direction with an open heart, > which is about as good as it gets. The mechanism you are building for capturing changes is certainly a lot better than what Bucardo, Londiste and Slony are doing today. That much is true. The flip side of the coin however is that all of today's logical replication systems are designed Postgres version agnostic to a degree. This means that the transition time from the existing, trigger based approach to the new WAL based mechanism will see both technologies in parallel, which is no small thing to support. And that transition time may last for a good while. We still have people installing Slony 1.2 because 2.0 (3 years old by now) requires Postgres 8.3 minimum. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 10/15/2012 3:25 PM, Andres Freund wrote: > On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote: >> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> > I think Robert is right that if Slony can't use the API, it is unlikely >> > any other replication system could use it. >> >> I don't accept that. Clearly there is a circular dependency, and >> someone has to go first - why should the Slony guys invest in adopting >> this technology if it is going to necessitate using a forked Postgres >> with an uncertain future? > > Well. I don't think (hope) anybody proposed making something release worthy for > slony but rather a POC patch that proofs the API is generic enough to be used > by them. If I (or somebody else familiar with this) work together with somebody > familiar with with slony internals I think such a POC shouldn't be too hard to > do. > I think some more input from that side is a good idea. I plan to send out an > email to possibly interested parties in about two weeks... What Slony essentially sends to the receiver node is a COPY stream in the format Christopher described. That stream is directly copied into the receiving node's sl_log_N table and picked up there by an apply trigger BEFORE INSERT that performs the corresponding INSERT/UPDATE/DELETE operation via prepared plans to the user tables. For a POC I think it is sufficient to demonstrate that this copy stream can be generated out of the WAL decoding. Note that Slony today does not touch columns in an UPDATE that have not changed in the original UPDATE on the origin. Sending toasted column values that haven't changed would be a substantial change to the storage efficiency on the replica. The consequence of this is that the number of columns that need to be in the UPDATE's SET clause varies. The log_cmdupdncols is there to separate the new column/value pairs from the column/key pairs of the updated row. The old row "key" in Slony is based on a unique index (preferably a PK, but any unique key will do). This makes cmdupdncols simply the number of column/value pairs minus the number of key columns. So it isn't too hard to figure out. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
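Spelled out as (hypothetical) code, the arithmetic described above is just the following; the function and parameter names are made up for illustration, with npairs being however many column/value pairs the plugin has emitted for the UPDATE.

/*
 * Sketch only, not Slony source: log_cmdargs pairs for an UPDATE are laid
 * out as [changed columns ...][key columns ...], so log_cmdupdncols is the
 * size of the first group, i.e. total pairs minus the key columns.
 */
static int
compute_cmdupdncols(int npairs, int nkeycols)
{
	return npairs - nkeycols;
}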
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Peter Geoghegan
Date:
On 16 October 2012 15:26, Jan Wieck <JanWieck@yahoo.com> wrote: > This means that the transition time from the existing, trigger based > approach to the new WAL based mechanism will see both technologies in > parallel, which is no small thing to support. So, you're talking about a shim between the two in order to usefully support inter-version replication, or are you just thinking about making a clean break in compatibility for Postgres versions prior to 9.3 in a new release branch? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: >> The design document [2] really just explains the problem (which is the >> need for catalog metadata at a point in time to make sense of heap >> tuples), without describing the solution that this patch offers with >> any degree of detail. Rather, [2] says "How we build snapshots is >> somewhat intricate and complicated and seems to be out of scope for >> this document", which is unsatisfactory. I look forward to reading the >> promised document that describes this mechanism in more detail. > > Here's the first version of the promised document. I hope it answers most of > the questions. > > Input welcome! I haven't grokked all of this in its entirety, but I'm kind of uncomfortable with the relfilenode -> OID mapping stuff. I'm wondering if we should, when logical replication is enabled, find a way to cram the table OID into the XLOG record. It seems like that would simplify things. If we don't choose to do that, it's worth noting that you actually need 16 bytes of data to generate a unique identifier for a relation, as in database OID + tablespace OID + relfilenode# + backend ID. Backend ID might be ignorable because WAL-based logical replication is going to ignore temporary relations anyway, but you definitely need the other two. There's nothing, for example, to keep you from having two relations with the same value in pg_class.relfilenode in the same database but in different tablespaces. It's unlikely to happen, because for new relations we set OID = relfilenode, but a subsequent rewrite can bring it about if the stars align just right. (Such situations are, of course, a breeding ground for bugs, which might make you question whether our current scheme for assigning relfilenodes has much of anything to recommend it.) Another thing to think about is that, like catalog snapshots, relfilenode mappings have to be time-relativized; that is, you need to know what the mapping was at the proper point in the WAL sequence, not what it is now. In practice, the risk here seems to be minimal, because it takes a while to churn through 4 billion OIDs. However, I suspect it pays to think about this fairly carefully because if we do ever run into a situation where the OID counter wraps during a time period comparable to the replication lag, the bugs will be extremely difficult to debug. Anyhow, adding the table OID to the WAL header would chew up a few more bytes of WAL space, but it seems like it might be worth it to avoid having to think very hard about all of these issues. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
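For reference, the fields Robert enumerates here are the ones core already groups into RelFileNode plus a backend ID, roughly as defined in src/include/storage/relfilenode.h (the comments below are mine, not the header's), which is where the 16 bytes come from:

typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace OID */
	Oid			dbNode;			/* database OID */
	Oid			relNode;		/* relation: pg_class.relfilenode */
} RelFileNode;

typedef struct RelFileNodeBackend
{
	RelFileNode node;
	BackendId	backend;		/* InvalidBackendId except for temp relations */
} RelFileNodeBackend;

/* 3 OIDs + 1 BackendId = 16 bytes to identify a relation's storage uniquely */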
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Christopher Browne
Date:
On Thu, Oct 18, 2012 at 9:49 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 16 October 2012 15:26, Jan Wieck <JanWieck@yahoo.com> wrote: >> This means that the transition time from the existing, trigger based >> approach to the new WAL based mechanism will see both technologies in >> parallel, which is no small thing to support. > > So, you're talking about a shim between the two in order to usefully > support inter-version replication, or are you just thinking about > making a clean break in compatibility for Postgres versions prior to > 9.3 in a new release branch? It's early to assume either. In Slony 2.0, we accepted that we were breaking compatibility with versions of Postgres before 8.3; we accepted that because there were considerable 'manageability' benefits (e.g. - system catalogues no longer hacked up, so pg_dump works against all nodes, and some dramatically reduced locking). But that had the attendant cost that we have had to continue fixing bugs on 1.2, to a degree, even until now, because people on Postgres versions earlier than 8.3 have no way to use version 2.0. Those merits and demerits apply pretty clearly to this. It would be somewhat attractive for a "version 2.3" (or, more likely, to indicate the break from earlier versions, "3.0" to make the clean break to the new-in-PG-9.3 facilities. It is attractive in that we could: a) Safely remove the trigger-based log capture apparatus (or, at least, I'm assuming so), and b) Consciously upgrade to take advantage of all the latest cool stuff found in Postgres 9.3. (I haven't got any particular features in mind; perhaps we add RANGE comparators for xid to 9.3, and make extensive use of xid_range types? That would be something that couldn't reasonably get hacked to work in anything before 9.2...) c) Drop out any special cases having to do with support of versions 8.3, 8.4, 9.0, 9.1, and 9.2. But, of course, we'd be leaving everyone running 8.3 thru 9.2 behind, if we did so, and would corresponding shackle ourselves to need to support the 2.x branches for still longer. And this would mean that this Slony "3.0" would expressly NOT support one of our intended use cases, namely to support upgrading from elder versions of Postgres. A "shim" adds complexity, but retains the "upgrade across versions" use case, and reduces the need to keep supporting elder versions of Slony. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote: > On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: > >> The design document [2] really just explains the problem (which is the > >> need for catalog metadata at a point in time to make sense of heap > >> tuples), without describing the solution that this patch offers with > >> any degree of detail. Rather, [2] says "How we build snapshots is > >> somewhat intricate and complicated and seems to be out of scope for > >> this document", which is unsatisfactory. I look forward to reading the > >> promised document that describes this mechanism in more detail. > > > > Here's the first version of the promised document. I hope it answers most > > of the questions. > > > > Input welcome! > > I haven't grokked all of this in its entirety, but I'm kind of > uncomfortable with the relfilenode -> OID mapping stuff. I'm > wondering if we should, when logical replication is enabled, find a > way to cram the table OID into the XLOG record. It seems like that > would simplify things. > > If we don't choose to do that, it's worth noting that you actually > need 16 bytes of data to generate a unique identifier for a relation, > as in database OID + tablespace OID + relfilenode# + backend ID. > Backend ID might be ignorable because WAL-based logical replication is > going to ignore temporary relations anyway, but you definitely need > the other two. ... Hm. I should take look at the way temporary tables are represented. As you say I is not going to matter for WAL decoding, but still... > Another thing to think about is that, like catalog snapshots, > relfilenode mappings have to be time-relativized; that is, you need to > know what the mapping was at the proper point in the WAL sequence, not > what it is now. In practice, the risk here seems to be minimal, > because it takes a while to churn through 4 billion OIDs. However, I > suspect it pays to think about this fairly carefully because if we do > ever run into a situation where the OID counter wraps during a time > period comparable to the replication lag, the bugs will be extremely > difficult to debug. I think with a rollbacks + restarts we might even be able to see the same relfilenode earlier. > Anyhow, adding the table OID to the WAL header would chew up a few > more bytes of WAL space, but it seems like it might be worth it to > avoid having to think very hard about all of these issues. I don't think its necessary to change wal logging here. The relfilenode mapping is now looked up using the timetravel snapshot we've built using (spcNode, relNode) as the key, so the time-relativized lookup is "builtin". If we screw that up way much more is broken anyway. Two problems are left: 1) (reltablespace, relfilenode) is not unique in pg_class because InvalidOid is stored for relfilenode if its a shared or nailed table. That not a problem for the lookup because weve already checked the relmapper before that, so we never look those up anyway. But it violates documented requirements of syscache.c. Even after some looking I haven't found any problem that that could cause. 2) We need to decide whether a HEAP[1-2]_* record did catalog changes when building/updating snapshots. Unfortunately we also need to do this *before* we built the first snapshot. For now treating all tables as catalog modifying before we built the snapshot seems to work fine. 
I think encoding the oid in the xlog header wouldn't help all that much here, because I am pretty sure we want to have the set of "catalog tables" to be extensible at some point... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
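As a sketch of the lookup order being described here (the RELFILENODE syscache comes from patch 05 and is not in core, and the function name below is made up; any catalog scan behind the syscache is assumed to run under the decoding backend's timetravel snapshot):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_tablespace.h"	/* GLOBALTABLESPACE_OID, DEFAULTTABLESPACE_OID */
#include "storage/relfilenode.h"
#include "utils/relmapper.h"		/* RelationMapFilenodeToOid() */
#include "utils/syscache.h"

/*
 * Illustrative only.  Shared and nailed relations have relfilenode = 0 in
 * pg_class, so the relmapper is consulted first; everything else goes
 * through the (reltablespace, relfilenode) syscache added by the patch.
 */
static Oid
relfilenode_to_reloid(RelFileNode *rfn)
{
	Oid			reloid;
	Oid			spc;
	HeapTuple	tuple;

	reloid = RelationMapFilenodeToOid(rfn->relNode,
									  rfn->spcNode == GLOBALTABLESPACE_OID);
	if (OidIsValid(reloid))
		return reloid;

	/* pg_class stores reltablespace = 0 for the default tablespace */
	spc = (rfn->spcNode == DEFAULTTABLESPACE_OID) ? InvalidOid : rfn->spcNode;

	tuple = SearchSysCache2(RELFILENODE,		/* syscache id from the patch, not core */
							ObjectIdGetDatum(spc),
							ObjectIdGetDatum(rfn->relNode));
	if (!HeapTupleIsValid(tuple))
		return InvalidOid;

	reloid = HeapTupleGetOid(tuple);	/* pg_class rows still carried OIDs in this era */
	ReleaseSysCache(tuple);
	return reloid;
}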
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Peter Geoghegan
Date:
On 18 October 2012 16:18, Christopher Browne <cbbrowne@gmail.com> wrote: > A "shim" adds complexity, but retains the "upgrade across versions" > use case, and reduces the need to keep supporting elder versions of > Slony. Right. Upgrading across major versions is likely to continue to remain a very important use-case for Slony. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From
Alvaro Herrera
Date:
This patch doesn't seem to be going anywhere, sadly. Since we're a bit late in the commitfest and this patch hasn't seen any activity for a long time, I'll mark it as returned-with-feedback. I hope one or both versions are resubmitted (with additional fixes?) for the next commitfest, and that the discussion continues to determine which of the two approaches is the best. Thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 18, 2012 at 11:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote: >> On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> > wrote: >> > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: >> >> The design document [2] really just explains the problem (which is the >> >> need for catalog metadata at a point in time to make sense of heap >> >> tuples), without describing the solution that this patch offers with >> >> any degree of detail. Rather, [2] says "How we build snapshots is >> >> somewhat intricate and complicated and seems to be out of scope for >> >> this document", which is unsatisfactory. I look forward to reading the >> >> promised document that describes this mechanism in more detail. >> > >> > Here's the first version of the promised document. I hope it answers most >> > of the questions. >> > >> > Input welcome! >> >> I haven't grokked all of this in its entirety, but I'm kind of >> uncomfortable with the relfilenode -> OID mapping stuff. I'm >> wondering if we should, when logical replication is enabled, find a >> way to cram the table OID into the XLOG record. It seems like that >> would simplify things. >> >> If we don't choose to do that, it's worth noting that you actually >> need 16 bytes of data to generate a unique identifier for a relation, >> as in database OID + tablespace OID + relfilenode# + backend ID. >> Backend ID might be ignorable because WAL-based logical replication is >> going to ignore temporary relations anyway, but you definitely need >> the other two. ... > > Hm. I should take look at the way temporary tables are represented. As you say > I is not going to matter for WAL decoding, but still... > >> Another thing to think about is that, like catalog snapshots, >> relfilenode mappings have to be time-relativized; that is, you need to >> know what the mapping was at the proper point in the WAL sequence, not >> what it is now. In practice, the risk here seems to be minimal, >> because it takes a while to churn through 4 billion OIDs. However, I >> suspect it pays to think about this fairly carefully because if we do >> ever run into a situation where the OID counter wraps during a time >> period comparable to the replication lag, the bugs will be extremely >> difficult to debug. > > I think with a rollbacks + restarts we might even be able to see the same > relfilenode earlier. > >> Anyhow, adding the table OID to the WAL header would chew up a few >> more bytes of WAL space, but it seems like it might be worth it to >> avoid having to think very hard about all of these issues. > > I don't think its necessary to change wal logging here. The relfilenode mapping > is now looked up using the timetravel snapshot we've built using (spcNode, > relNode) as the key, so the time-relativized lookup is "builtin". If we screw > that up way much more is broken anyway. > > Two problems are left: > > 1) (reltablespace, relfilenode) is not unique in pg_class because InvalidOid is > stored for relfilenode if its a shared or nailed table. That not a problem for > the lookup because weve already checked the relmapper before that, so we never > look those up anyway. But it violates documented requirements of syscache.c. > Even after some looking I haven't found any problem that that could cause. > > 2) We need to decide whether a HEAP[1-2]_* record did catalog changes when > building/updating snapshots. 
Unfortunately we also need to do this *before* we > built the first snapshot. For now treating all tables as catalog modifying > before we built the snapshot seems to work fine. > I think encoding the oid in the xlog header wouln't help all that much here, > because I am pretty sure we want to have the set of "catalog tables" to be > extensible at some point... I don't like catalog-only snapshots at all. I think that's just a recipe for subtle or not-so-subtle breakage down the road... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Friday, October 19, 2012 06:38:30 PM Robert Haas wrote: > On Thu, Oct 18, 2012 at 11:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > 2) We need to decide whether a HEAP[1-2]_* record did catalog changes > > when building/updating snapshots. Unfortunately we also need to do this > > *before* we built the first snapshot. For now treating all tables as > > catalog modifying before we built the snapshot seems to work fine. > > I think encoding the oid in the xlog header wouldn't help all that much > > here, because I am pretty sure we want to have the set of "catalog > > tables" to be extensible at some point... > > I don't like catalog-only snapshots at all. I think that's just a > recipe for subtle or not-so-subtle breakage down the road... I really don't see this changing for now. At some point we need to draw the line, otherwise we will never ever get anywhere. Naively building a snapshot that can access normal tables is just too expensive because it changes all the time. Sure, obvious breakage will be there if you have a datatype that accesses other user tables during decoding (as we talked about previously). But subtle breakage should be easily catchable by simply not allowing user relations to be opened. If an extension needs this it will have to mark those tables as catalog tables for now. I find this to be a reasonable restriction. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote: > Here's the first version of the promised document. I hope it answers most of > the questions. This makes for interesting reading. So, I've taken a closer look at the snapshot building code in light of this information. What follows are my thoughts on that aspect of the patch (in particular, the merits of snapshot time-travel as a method of solving the general problem of making sense of WAL during what I'll loosely call "logical recovery", performed by what one design document [1] calls an "output plugin"), which I previously skirted over somewhat. You can think of this review as a critical exposition of what happens during certain steps of my high-level schematic of "how this all fits together", which covers this entire patchset (this comes under my remit as the reviewer of one of the most important and complex patches in that patchset, "0008-Introduce-wal-decoding-via-catalog-timetravel.patch"), as described in my original, high-level review [2]. To recap on what those steps are:

***SNIP*** (from within plugin, currently about to decode a record for the first time)

9. During the first call (within the first record within a call to decode_xlog()), we allocate a snapshot reader.

10. Builds snapshot callback. This scribbles on our snapshot state, which essentially encapsulates a snapshot. The state (and snapshot) changes continually, once per call.

11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an XLogRecord. Decodes based on record type. Let's assume it's an XLOG_HEAP_INSERT.

12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We store the tuple metadata in our ApplyCache (some ilists, somewhere, each corresponding to an XID). We don't store the relation oid, because we don't know it yet (only relfilenode is known from WAL).

13. We're back in XLogReader(). It calls the only callback of interest to us covered in step 3 (and not of interest to XlogReader()/Heikki) – decode_change(). It does this through the apply_cache.apply_change hook. This happens because we encounter another record, this time a commit record (in the same codepath as discussed in step 12).

14. In decode_change(), the actual function that raises the interesting WARNINGs within Andres' earlier example [3], showing actual integer/varchar Datum values for tuples previously inserted. Resolve table oid based on relfilenode (albeit unsatisfactorily). Using a StringInfo, tupledescs, syscache and typcache, build WARNING string.

(No further steps. Aside: If I've done a good job of writing my initial review [2], I should be able to continually refer back to this as I drill down on other steps in later reviews.) It's fairly obvious why steps 9 and 10 are interesting to us here. Step 11 (the first call of SnapBuildCallback() - this is a bit of a misnomer, since the function isn't ever called through a pointer) is where the visibility-wise decision to decode a particular XLogRecord/XLogRecordBuffer occurs. Here is how the patch describes our reasons for needing this call (curiously, this comment appears not above SnapBuildCallback() itself, but above the decode.c call of said function, which may hint at a lapse in separation of concerns): + /*--------- + * Call the snapshot builder.
It needs to be called before we analyze + * tuples for two reasons: + * + * * Only in the snapshot building logic we know whether we have enough + * information to decode a particular tuple + * + * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be + * added to the ApplyCache before we add tuples using them + *--------- + */ Step 12 is a step that I'm not going into for this review. That's all decoding related. Step 13 is not really worthy of separate consideration here, because it just covers what happens when we call DecodeCommit() within decode.c , where text representations of tuples are ultimately printed in simple elog WARNINGs, as in Andres' original example [3], due to the apply_cache.apply_change hook being called. Step 14 *is* kind-of relevant, because this is one place where we resolve relation OID based on relfilenode, which is part of snapbuild.c, since it has a LookupTableByRelFileNode() call (the only other such call is within snapbuild.c itself, as part of checking if a particular relation + xid have had catalogue changes, which can be a concern due to DDL, which is the basic problem that snapbuild.c seeks to solve). Assuming that it truly is necessary to have a LookupTableByRelFileNode() function, I don't think your plugin (which will soon actually be a contrib module, right?) has any business calling it. Rather, this should all be part of some higher-level abstraction, probably within applycache, that spoonfeeds your example contrib module changesets without it having to care about system cache invalidation and what-not. As I've already noted, snapbuild.c (plus one or two other isolated places) have rather heavy-handed and out of place InvalidateSystemCaches() calls like these: + HeapTuple + LookupTableByRelFileNode(RelFileNode *relfilenode) + { + Oid spc; + + InvalidateSystemCaches(); However, since you've privately told me that your next revision will address the need to execute sinval messages granularly, rather than using this heavy-handed kludge, I won't once again stress the need to do better. If I've understood correctly, you've indicated that this can be done by processing the shared invalidation messages in each xl_xact_commit (i.e. each XLOG_XACT_COMMIT entry) during decoding (which I guess means that decoding's reponsibility's have been widened a bit, but that's still the place to do it - within the decoding "dispatcher"/glue code). Apparently we can expect this revision in the next couple of days. Thankfully, I *think* that this implies that plugins don't need to directly concern themselves with cache invalidation. By the way, shouldn't this use InvalidOid?: + /* + * relations in the default tablespace are stored with a reltablespace = 0 + * for some reason. + */ + spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ? + 0 : relfilenode->spcNode; The objectionable thing about having your “plugin” handle cache invalidation, apart from the fact that, as we all agree, the way you're doing it is just not acceptable, is that your plugin is doing it directly *at all*. You need to analyse the requirements of the universe of possible logical changeset subscriber plugins, and whittle them down to a lowest common denominator interface that likely generalises cache invalidation, and ideally doesn't require plugin authors to even know what a relfilenode or syscache is – some might say this shouldn't be a priority, but I take the view that it should. 
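For what it's worth, the granular alternative being asked for above would amount to something like the following sketch in the decoding glue code. It assumes the decoder has already located the SharedInvalidationMessage array appended to the xl_xact_commit record it is processing, and that LocalExecuteInvalidationMessage() in inval.c is (or is made) available for this purpose; neither the function name below nor its caller exists in the patch as posted.

#include "postgres.h"
#include "storage/sinval.h"		/* SharedInvalidationMessage */
#include "utils/inval.h"		/* LocalExecuteInvalidationMessage(), assumed exported */

/*
 * Sketch only: replay a commit record's sinval messages locally instead of
 * calling InvalidateSystemCaches() wholesale.
 */
static void
execute_commit_invalidations(SharedInvalidationMessage *msgs, int nmsgs)
{
	int			i;

	for (i = 0; i < nmsgs; i++)
		LocalExecuteInvalidationMessage(&msgs[i]);
}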
Exposing innards like this to these plugins is wrong headed, and to do so will likely paint us into a corner. Now, I grant that catcache + relcache can only be considered private innards in a relative sense with Postgres, and indeed a few contrib modules do use syscache directly for simple stuff, like hstore, which needs syscache + typcache for generating text representations in a way that is not completely unlike what you have here. Still, my feeling is that all the heavy lifting should be encapsulated elsewhere, in core. I think that you could easily justify adding another module/translation unit that is tasked with massaging this data in a form amenable to serving the needs of client code/plugins. If you don't get things quite right, plugin authors can still do it all for themselves just as well. I previously complained about having to take a leap of faith in respect of xmin horizon handling [2]. This new document also tries to allay some of those concerns. It says: > == xmin Horizon Handling == > > Reusing MVCC for timetravel access has one obvious major problem: > VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also > obviously, we want autovacuum/manual vacuum to work as before. > > The idea here is to reuse the infrastrcuture built for hot_standby_feedback > which allows us to keep the xmin horizon of a walsender backend artificially > low. We keep it low enough so we can restart decoding from the last location > the client has confirmed to be safely received. The means that we keep it low > enough to contain the last checkpoints oldestXid value. These ideas are still underdeveloped. For one thing, it seems kind of weak to me that we're obliged to have vacuum grind to a halt across the cluster because some speculative plugin has its proc's xmin pegged to some value due to a logical standby still needing it for reading catalogues only. Think of the failure modes – what happens when the standby participating in a plugin-based logical replication system croaks for indeterminate reasons? Doing this with the WAL sender as part of hot_standby_feedback is clearly much less hazardous, because there *isn't* a WAL sender when streaming replication isn't active in respect of some corresponding standby, and hot_standby_feeback need only support deferring vacuum for the streaming replication standby whose active snapshot's xmin is furthest in the past. If a standby is taken out of commission for an hour, it can probably catch up without difficulty (difficulty like needing a base-backup), and it has no repercussions for the master or anyone else. However, I believe that logical change records would be rendered meaningless in the same scenario, because VACUUM, having not seen a “pegged” xmin horizon due to the standby's unavailability, goes ahead and vacuums past a point still needed to keep the standby in a consistent state. Maybe you can invent a new type of xmin horizon that applies only to catalogues so this isn't so bad, and I see you've suggested as much in follow-up mail to Robert, but it might be unsatisfactory to have that impact the PGAXT performance optimisation introduced in commit ed0b409d, or obfuscate that code. You could have the xmin be a special xmin, say though PGAXT.vacuumFlags indicating this, but wouldn't that necessarily preclude the backend from having a non-special notion of its xmin? How does that bode for this actually becoming maximally generic infrastructure? 
You do have some ideas about how to re-sync a speculative in-core logical replication system standby, and indeed you talk about this in the design document posted a few weeks back [1] (4.8. Setup of replication nodes). This process is indeed similar to a base backup, and we'd hope to avoid having to do it for similar reasons - it would be undesirable to have to do it much more often in practice due to these types of concerns. You go on to say: > That also means we need to make that value persist across restarts/crashes in a > very similar manner to twophase.c's. That infrastructure actually also useful > to make hot_standby_feedback work properly across primary restarts. So here you anticipating my criticism above about needing to peg the xmin horizon. That's fine, but I still don't know yet is how you propose to make this work in a reasonable way, free from all the usual caveats about leaving prepared transactions open for an indefinitely long time (what happens when we need to XID wraparound?, how does the need to hold a transaction open interfere with a given plugin backend's ability to execute regular queries?, etc). Furthermore, I don't know how it's going to be independently useful to make hot_standby_feedback work across primary restarts, because the primary cannot then generate VACUUM cleanup WAL records that the standby might replay, resulting in a hard conflict. Maybe there's something incredibly obvious that I'm missing, but doesn't that problem almost take care of itself? Granted, those cleanup records aren't the only reason for a hard conflict, but they've got to be by far the most important, and are documented as such. Either the cleanup record already exists and you're usually already out of luck anyway, or it doesn't and you're not. Are you thinking about race conditions? You talk about the relfilenode/oid mapping problem some: > == Catalog/User Table Detection == > > To detect whether a record/transaction does catalog modifications - which we > need to do for memory/performance reasons - we need to resolve the > RelFileNode's in xlog records back to the original tables. Unfortunately > RelFileNode's only contain the tables relfilenode, not their table oid. We only > can do catalog access once we reached FULL_SNAPSHOT, before that we can use > some heuristics but otherwise we have to assume that every record changes the > catalog. What exactly are the implications of having to assume catalogue changes? I see that right now, you're just skipping actual decoding once you've taken care of snapshot building chores within SnapBuildCallback(): + if (snapstate->state == SNAPBUILD_START) + return SNAPBUILD_SKIP; > The heuristics we can use are: > * relfilenode->spcNode == GLOBALTABLESPACE_OID > * relfilenode->relNode <= FirstNormalObjectId > * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid > > Those detect some catalog tables but not all (think VACUUM FULL), but if they > detect one they are correct. If the heuristics aren't going to be completely reliable, why is that acceptable? You've said it "seems to work fine", but I don't quite follow. I see this within SnapBuildCallback() (the function whose name I said was a misnomer). > After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us > a table might not be a catalog table. For that we use the new RELFILENODE > syscache with (spcNode, relNode). I share some of Robert's concerns here. The fact that relfilenode mappings have to be time-relativised may tip the scales against this approach. 
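Spelled out, the three quoted heuristics amount to something like this sketch (the function name is mine). Note the asymmetry: a true result is trustworthy, but a false result does not prove the relation is not a catalog, since a non-mapped catalog that has been rewritten by VACUUM FULL gets a new, high relfilenode and is missed here - which is exactly why everything has to be treated as catalog-modifying until a full snapshot exists.

#include "postgres.h"
#include "access/transam.h"			/* FirstNormalObjectId */
#include "catalog/pg_tablespace.h"	/* GLOBALTABLESPACE_OID */
#include "storage/relfilenode.h"
#include "utils/relmapper.h"		/* RelationMapFilenodeToOid() */

static bool
heuristic_is_catalog_relation(RelFileNode *rfn)
{
	/* anything in the global tablespace is a shared catalog */
	if (rfn->spcNode == GLOBALTABLESPACE_OID)
		return true;

	/* relfilenodes below FirstNormalObjectId were assigned at initdb time */
	if (rfn->relNode <= FirstNormalObjectId)
		return true;

	/* relations managed by the relmapper (pg_class itself, etc.) */
	if (RelationMapFilenodeToOid(rfn->relNode, false) != InvalidOid)
		return true;

	return false;
}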
As Robert has said, it may be that simply adding the table OID to the WAL header is the way forward. It's not as if we can't optimise that later. One compromise might be to only do that when we haven't yet reached FULL_SNAPSHOT On the subject of making decoding restartable, you say: > == Restartable Decoding == > > As we want to generate a consistent stream of changes we need to have the > ability to start from a previously decoded location without going to the whole > multi-phase setup because that would make it very hard to calculate up to where > we need to keep information. Indeed, it would. > To make that easier everytime a decoding process finds an online checkpoint > record it exlusively takes a global lwlock and checks whether visibility > information has been already been written out for that checkpoint and does so > if not. We only need to do that once as visibility information is the same > between all decoding backends. Where and how would that visibility information be represented? So, typically you'd expect it to say “no catalogue changes for entire checkpoint“ much of the time? The patch we've seen (0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't address the question of transactions that contain DDL: + /* FIXME: For now skip transactions with catalog changes entirely */ + if (ent && ent->does_timetravel) + does_timetravel = true; + else + does_timetravel = SnapBuildHasCatalogChanges(snapstate, xid, relfilenode); but you do address this question (or a closely related question, which, I gather is the crux of the issue: How to mix DDL and DML in transactions?) in the new doc (README.SNAPBUILD.txt [6]): > == mixed DDL/DML transaction handling == > > When a transactions uses DDL and DML in the same transaction things get a bit > more complicated because we need to handle CommandIds and ComboCids as we need > to use the correct version of the catalog when decoding the individual tuples. Right, so it becomes necessary to think about time-travelling not just to a particular transaction, but to a particular point in a particular transaction – the exact point at which the catalogue showed a structure consistent with sanely interpreting logical WAL records created during that window after the last DDL command (if any), but before the next (if any). This intelligence is only actually needed when decoding tuples created in that actual transaction, because only those tuples have their format change throughout a single transaction. > CommandId handling itself is relatively simple, we can figure out the current > CommandId relatively easily by looking at the currently used one in > changes. The problematic part is that those CommandId frequently will not be > actual cmin or cmax values but ComboCids. Those are used to minimize space in > the heap. During normal operation cmin/cmax values are only used within the > backend emitting those rows and only during one toplevel transaction, so > instead of storing cmin/cmax only a reference to an in-memory value is stored > that contains both. Whenever we see a new CommandId we call > ApplyCacheAddNewCommandId. Right. So in general, transaction A doesn't have to concern itself with the order that other transactions had tuples become visible or invisible (cmin and cmax); transaction A need only concern itself with whether they're visible or invisible based on if relevant transactions (xids) committed, its own xid, plus each tuple's xmin and xmax. 
It is this property of cmin/cmax that enabled the combocid optimisation in 8.3, which introduces an array in *backend local* memory, to map a single HeapTupleHeader field (where previously there were 2 – cmin and cmax) to an entry in that array, under the theory that it's unusual for a tuple to be created and then deleted in the same transaction. Most of the time, that one HeapTupleHeader field wouldn't have a mapping to the local array – rather, it would simply have a cmin or a cmax. That's how we save heap space.

> To resolve this problem during heap_* whenever we generate a new combocid > (detected via an new parameter to HeapTupleHeaderAdjustCmax) in a catalog table > we log the new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During > decoding this ComboCid is added to the applycache > (ApplyCacheAddNewComboCid). They are only guaranteed to be visible within a > single transaction, so we cannot simply setup all of them globally.

This seems more or less reasonable. The fact that the combocid optimisation uses a local memory array isn't actually an important property of combocids as a performance optimisation – it's just that there was no need for the actual cmin and cmax values to be visible to other backends, until now. Comments in combocids.c go on at length about how infrequently actual combocids are actually used in practice.

For ease of implementation, but also because real combocids are expected to be needed infrequently, I suggest that rather than logging the mapping, you log the values directly in a record (i.e. the full cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You could easily exhaust the combocid space otherwise, and besides, you cannot do anything with the mapping from outside of the backend that originated the combocid for that transaction (you don't have the local array, or the local hashtable used for combocids).

> Before calling the output plugin ComboCids are temporarily setup and torn > down afterwards.

How? You're using a combocid-like array + hashtable local to the plugin backend? Anyway, for now, this is unimplemented, which is perhaps the biggest concern about it:

+	/* check if its one of our txids, toplevel is also in there */
+	else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+	{
+		CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+		/* no support for that yet */
+		if (tuple->t_infomask & HEAP_COMBOCID){
+			elog(WARNING, "combocids not yet supported");
+			return false;
+		}
+		if (cmin >= snapshot->curcid)
+			return false;	/* inserted after scan started */
+	}

Above, you aren't taking this into account (code from HeapTupleHeaderGetCmax()):

	/* We do not store cmax when locking a tuple */
	Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Sure, you're only interested in cmin, so this doesn't look like it applies (isn't this just a sanity check?), but actually, based on this it seems to me that the current sane representation of cmin needs to be obtained in a more concurrency aware fashion - having the backend local data-structures that originate the tuple isn't even good enough. The paucity of other callers to HeapTupleHeaderGetRawCommandId() seems to support this. Apart from contrib/pageinspect, there is only this one other direct caller, within heap_getsysattr():

	/*
	 * cmin and cmax are now both aliases for the same field, which
	 * can in fact also be a combo command id.  XXX perhaps we should
	 * return the "real" cmin or cmax if possible, that is if we are
	 * inside the originating transaction?
	 */
	result = CommandIdGetDatum(HeapTupleHeaderGetRawCommandId(tup->t_data));

So these few direct call-sites that inspect CommandId outside of their originating backend are all non-critical observers of the CommandId, that are roughly speaking allowed to be wrong. Callers to the higher level abstractions (HeapTupleHeaderGetCmin()/HeapTupleHeaderGetCmax()) know that the CommandId must be a cmin or cmax respectively, because they have as their contract that TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)) and TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tup)) respectively. I guess this can all happen when you write that XLOG_HEAP2_NEW_COMBOCID record within the originating transaction (that is, when your xmin is that of the tuple in the originating transaction, you are indeed dealing with a cmin), but these are hazards that you need to consider in addition to the more obvious ComboCid hazard. I have an idea that the HeapTupleHeaderGetRawCommandId(tuple) call in your code could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0.

I look forward to seeing your revision that addresses the issue with sinval messages.

[1] http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
[2] http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com
[3] http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com
[4] http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com
[5] http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com
[6] http://archives.postgresql.org/message-id/201210161330.37967.andres@2ndquadrant.com

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
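The three heuristics quoted above are small enough to read as code. A minimal sketch, in which only the three checks are taken from the thread; the helper name and surrounding function are assumptions, and a false result only means "cannot tell", not "user table":

#include "postgres.h"
#include "access/transam.h"
#include "catalog/pg_tablespace.h"
#include "storage/relfilenode.h"
#include "utils/relmapper.h"

/*
 * Return true only if the relfilenode certainly belongs to a catalog table.
 * Returning false does not prove the opposite - e.g. a catalog rewritten by
 * VACUUM FULL gets a new, high relfilenode and slips through.
 */
static bool
MaybeCatalogRelFileNode(const RelFileNode *rnode)
{
	/* everything in the global tablespace is a shared catalog */
	if (rnode->spcNode == GLOBALTABLESPACE_OID)
		return true;

	/* relfilenodes below FirstNormalObjectId were assigned during initdb */
	if (rnode->relNode <= FirstNormalObjectId)
		return true;

	/* relations tracked by the relation mapper are catalogs as well */
	if (RelationMapFilenodeToOid(rnode->relNode, false) != InvalidOid)
		return true;

	return false;
}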
Hi, On Friday, October 19, 2012 10:53:06 PM Peter Geoghegan wrote: > On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote: > > Here's the first version of the promised document. I hope it answers most > > of the questions. > > This makes for interesting reading. Thanks. > Step 14 *is* kind-of relevant, because this is one place where we > resolve relation OID based on relfilenode, which is part of > snapbuild.c, since it has a LookupTableByRelFileNode() call (the only > other such call is within snapbuild.c itself, as part of checking if a > particular relation + xid have had catalogue changes, which can be a > concern due to DDL, which is the basic problem that snapbuild.c seeks > to solve). Assuming that it truly is necessary to have a > LookupTableByRelFileNode() function, I don't think your plugin (which > will soon actually be a contrib module, right?) has any business > calling it. That shouldn't be needed anymore, that was just because I didn't finish some loose ends quickly enough. The callback now gets passed the TupleDesc. > Thankfully, I *think* that this implies that > plugins don't need to directly concern themselves with cache > invalidation. Correct. They don't need to know anything about it, its all handled transparently now. > By the way, shouldn't this use InvalidOid?: Yes. Fixed. > allay some of those concerns. It says: > > == xmin Horizon Handling == > > > > Reusing MVCC for timetravel access has one obvious major problem: > > VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also > > obviously, we want autovacuum/manual vacuum to work as before. > > > > The idea here is to reuse the infrastrcuture built for > > hot_standby_feedback which allows us to keep the xmin horizon of a > > walsender backend artificially low. We keep it low enough so we can > > restart decoding from the last location the client has confirmed to be > > safely received. The means that we keep it low enough to contain the > > last checkpoints oldestXid value. > > These ideas are still underdeveloped. For one thing, it seems kind of > weak to me that we're obliged to have vacuum grind to a halt across > the cluster because some speculative plugin has its proc's xmin pegged > to some value due to a logical standby still needing it for reading > catalogues only. Think of the failure modes – what happens when the > standby participating in a plugin-based logical replication system > croaks for indeterminate reasons? Doing this with the WAL sender as > part of hot_standby_feedback is clearly much less hazardous, because > there *isn't* a WAL sender when streaming replication isn't active in > respect of some corresponding standby, and hot_standby_feeback need > only support deferring vacuum for the streaming replication standby > whose active snapshot's xmin is furthest in the past. > If a standby is taken out of commission for an hour, it can probably catch > up without difficulty (difficulty like needing a base-backup), and it has no > repercussions for the master or anyone else. Only if you have setup an archive_command with sufficient retention. Otherwise it all depends on the wal_keep_segments value. > However, I believe that logical change records would be rendered meaningless > in the same scenario, because VACUUM, having not seen a “pegged” xmin > horizon due to the standby's unavailability, goes ahead and vacuums past a > point still needed to keep the standby in a consistent state. Well, thats why I want to persist them similar to twophase.c. 
> Maybe you can invent a new type of xmin horizon that applies only to > catalogues so this isn't so bad

Yes, I thought about this. It seems like a very sensible optimization, but I really, really, would like to get the simpler version finished.

> and I see you've suggested as much in follow-up mail to Robert > but it might be unsatisfactory to have that impact the PGAXT performance > optimisation introduced in commit ed0b409d, or obfuscate that code.

Hm. I don't foresee any need to pessimize that. But I guess the burden of proof lies on me writing up a patch for this.

> You could have the xmin be a special xmin, say though PGAXT.vacuumFlags indicating this, but wouldn't that necessarily preclude the backend from > having a non-special notion of its xmin? How does that bode for this actually becoming maximally generic infrastructure?

Uhm. Why should the decoding backend *ever* need its own xmin? It will *never* be allowed to do writes itself or similar?

> ... > You go on to say: > > That also means we need to make that value persist across > > restarts/crashes in a very similar manner to twophase.c's. That > > infrastructure actually also useful to make hot_standby_feedback work > > properly across primary restarts. > > So here you anticipating my criticism above about needing to peg the > xmin horizon. That's fine, but I still don't know yet is how you > propose to make this work in a reasonable way, free from all the usual > caveats about leaving prepared transactions open for an indefinitely > long time (what happens when we need to XID wraparound?, how does the > need to hold a transaction open interfere with a given plugin > backend's ability to execute regular queries?, etc).

Well, its not like its introducing a problem that wasn't there before. I really don't see this as something all that problematic. If people protest we can add some tiny bit of code to autovacuum that throws everything away thats older than some autovacuum_vacuum_freeze_max_age or such. We could even do the same to prepared transactions. But in the end the answer is that you need to monitor *any* replication system carefully.

> Furthermore, I don't know how it's going to be independently useful to make > hot_standby_feedback work across primary restarts, because the primary > cannot then generate VACUUM cleanup WAL records that the standby might > replay, resulting in a hard conflict. Maybe there's something > incredibly obvious that I'm missing, but doesn't that problem almost > take care of itself?

Autovacuum immediately starts after a restart. Once it has started some workers, a reconnecting standby cannot lower the xmin again, so you have a high likelihood of conflicts. I have seen that problem in the field (ironically "fixed" by creating a prepared xact before restarting ...).

> You talk about the relfilenode/oid mapping problem some: > > == Catalog/User Table Detection == > > > > To detect whether a record/transaction does catalog modifications - which > > we need to do for memory/performance reasons - we need to resolve the > > RelFileNode's in xlog records back to the original tables. Unfortunately > > RelFileNode's only contain the tables relfilenode, not their table oid. > > We only can do catalog access once we reached FULL_SNAPSHOT, before that > > we can use some heuristics but otherwise we have to assume that every > > record changes the catalog. > > What exactly are the implications of having to assume catalogue > changes?

Higher cpu/storage overhead, nothing else.
> I see that right now, you're just skipping actual decoding > once you've taken care of snapshot building chores within > SnapBuildCallback(): > > + if (snapstate->state == SNAPBUILD_START) > + return SNAPBUILD_SKIP; When were in SNAPBUILD_START state we don't have *any* knowledge about the system yet, so we can't do anything with collected records anyway (very likely the record we just read was part of an already started transaction). Once were in SNAPBUILD_FULL_SNAPSHOT state we can start collecting changes if the respective transaction started *after* we have reached SNAPBUILD_FULL_SNAPSHOT: > > The heuristics we can use are: > > * relfilenode->spcNode == GLOBALTABLESPACE_OID > > * relfilenode->relNode <= FirstNormalObjectId > > * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid > > > > Those detect some catalog tables but not all (think VACUUM FULL), but if > > they detect one they are correct. > > If the heuristics aren't going to be completely reliable, why is that > acceptable? You've said it "seems to work fine", but I don't quite > follow. Should have left that out, its a small internal optimization... If the above heuristic detect that a relfilenode relates to a catalog table its guaranteed to be correct. It cannot detect all catalog changes though, so you can only use it to skip doing work for catalog tables, not for skipping work if !catalog. > > After reaching FULL_SNAPSHOT we can do catalog access if our heuristics > > tell us a table might not be a catalog table. For that we use the new > > RELFILENODE syscache with (spcNode, relNode). > > I share some of Robert's concerns here. The fact that relfilenode > mappings have to be time-relativised may tip the scales against this > approach. Whats the problem with the time-relativized access? > As Robert has said, it may be that simply adding the table > OID to the WAL header is the way forward. It's not as if we can't > optimise that later. One compromise might be to only do that when we > haven't yet reached FULL_SNAPSHOT When writing the WAL we don't have any knowledge about what state some potential decoding process could be in, so its an all or nothing thing. I don't have a problem with writing the table oid into the records somewhere, I just think its not required. One reason for storing it in there independent from this patchset/feature is debugging. I wished for that in the past. We need to build the snapshot to access catalog anyway, so its not like doing the relfilenode lookup time-relativzed incurs any additional costs. Also, we need to do the table-oid lookup time-relatived as well, because table oids can be reused. > > To make that easier everytime a decoding process finds an online > > checkpoint record it exlusively takes a global lwlock and checks whether > > visibility information has been already been written out for that > > checkpoint and does so if not. We only need to do that once as > > visibility information is the same between all decoding backends. > > Where and how would that visibility information be represented? Some extra pg_* directory, like pg_decode/$LSN_OF_CHECKPOINT. > So, typically you'd expect it to say “no catalogue changes for entire > checkpoint“ much of the time? No, not really. It will probably look similar to the files ExportSnapshot currently produces. Even if no catalog changes happened we still need to keep knowledge about committed transactions and such. Btw, I doubt all that many checkpoint<->checkpoint windows will have absolutely no catalog changes. 
At least some pg_class.relfrozenxid, pg_class.reltuples changes are to be expected. > The patch we've seen > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't > address the question of transactions that contain DDL: > ... > but you do address this question (or a closely related question, > which, I gather is the crux of the issue: How to mix DDL and DML in > > transactions?) in the new doc (README.SNAPBUILD.txt [6]): > > == mixed DDL/DML transaction handling == > > > > When a transactions uses DDL and DML in the same transaction things get a > > bit more complicated because we need to handle CommandIds and ComboCids > > as we need to use the correct version of the catalog when decoding the > > individual tuples. > > Right, so it becomes necessary to think about time-travelling not just > to a particular transaction, but to a particular point in a particular > transaction – the exact point at which the catalogue showed a > structure consistent with sanely interpreting logical WAL records > created during that window after the last DDL command (if any), but > before the next (if any). This intelligence is only actually needed > when decoding tuples created in that actual transaction, because only > those tuples have their format change throughout a single transaction. Exactly. > > CommandId handling itself is relatively simple, we can figure out the > > current CommandId relatively easily by looking at the currently used one > > in changes. The problematic part is that those CommandId frequently will > > not be actual cmin or cmax values but ComboCids. Those are used to > > minimize space in the heap. During normal operation cmin/cmax values are > > only used within the backend emitting those rows and only during one > > toplevel transaction, so instead of storing cmin/cmax only a reference > > to an in-memory value is stored that contains both. Whenever we see a > > new CommandId we call > > ApplyCacheAddNewCommandId. > > Right. So in general, transaction A doesn't have to concern itself > with the order that other transactions had tuples become visible or > invisible (cmin and cmax); transaction A need only concern itself with > whether they're visible or invisible based on if relevant transactions > (xids) committed, its own xid, plus each tuple's xmin and xmax. It is > this property of cmin/cmax that enabled the combocid optimisation in > 8.3, which introduces an array in *backend local* memory, to map a > single HeapTupleHeader field (where previously there were 2 – cmin and > cmax) to an entry in that array, under the theory that it's unusual > for a tuple to be created and then deleted in the same transaction. > Most of the time, that one HeapTupleHeader field wouldn't have a > mapping to the local array – rather, it would simply have a cmin or a > cmax. That's how we save heap space. Yes. The whole handling here is nearly completely analogous to the normal handling of CommandIds. > > To resolve this problem during heap_* whenever we generate a new combocid > > (detected via an new parameter to HeapTupleHeaderAdjustCmax) in a catalog > > table we log the new XLOG_HEAP2_NEW_COMBOCID record containing the > > mapping. During decoding this ComboCid is added to the applycache > > (ApplyCacheAddNewComboCid). They are only guaranteed to be visible within > > a single transaction, so we cannot simply setup all of them globally. > > This seems more or less reasonable. 
The fact that the combocid > optimisation uses a local memory array isn't actually an important > property of combocids as a performance optimisation It is an important property here because ComboCids from different toplevel transactions conflict with each other which means we have to deal with them on a per toplevel-xid basis. > For ease of implementation, but also because real combocids are > expected to be needed infrequently, I suggest that rather than logging > the mapping, you log the values directly in a record (i.e. The full > cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You > could easily exhaust the combocid space otherwise, and besides, you > cannot do anything with the mapping from outside of the backend that > originated the combocid for that transaction (you don't have the local > array, or the local hashtable used for combocids). I can't really follow here. Obviously we need to generate the XLOG_HEAP2_NEW_COMBOCID locally in the transaction/backend that generated the change? > > Before calling the output plugin ComboCids are temporarily setup and torn > > down afterwards. > > How? You're using a combocid-like array + hashtable local to the > plugin backend? I added extern void PutComboCommandId(CommandId combocid, CommandId cmin, CommandId cmax); which in combination with the existing extern void AtEOXact_ComboCid(void); is enough. > Anyway, for now, this is unimplemented, which is perhaps the biggest > concern about it: > > + /* check if its one of our txids, toplevel is also in there */ > + else if (TransactionIdInArray(xmin, snapshot->subxip, > snapshot->subxcnt)) + { > + CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple); > + /* no support for that yet */ > + if (tuple->t_infomask & HEAP_COMBOCID){ > + elog(WARNING, "combocids not yet supported"); > + return false; > + } > + if (cmin >= snapshot->curcid) > + return false; /* inserted after scan started */ > + } > > Above, you aren't taking this into account (code from > HeapTupleHeaderGetCmax()): > > /* We do not store cmax when locking a tuple */ > Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED))); > > Sure, you're only interested in cmin, so this doesn't look like it > applies (isn't this just a sanity check?), but actually, based on this > it seems to me that the current sane representation of cmin needs to > be obtained in a more concurrency aware fashion - having the backend > local data-structures that originate the tuple isn't even good enough. You completely lost me here and in the following paragraphs. The infomask is available for everyone, and we only read/write cmin|cmax|comboid when were inside the transaction or when we have already logged a HEAP2_NEW_COMBOCID and thus have the necessary information? Which concurrency concerns are you referring to? > I have an idea that the HeapTupleHeaderGetRawCommandId(tuple) call in your > code could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0. Ah? The other .satisfies routines do HeapTupleHeaderGetCmin(tuple) which returns exactly that if !(tup->t_infomask & HEAP_COMBOCID). But anyway, with the new combocid handling the code uses the usual HeapTupleHeaderGetCmin/Cmax calls, so it looks even more like the normal routines. Thanks for the extensive review! I am pretty sure this a lot to take in ;). Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
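As a hedged illustration of that setup/teardown around an output-plugin call: only PutComboCommandId() and AtEOXact_ComboCid() are taken from the mail above; the applycache structure fields and the apply_change callback shown here are assumptions, not the patch's actual API.

/*
 * Sketch: install the ComboCid mappings logged (via XLOG_HEAP2_NEW_COMBOCID)
 * for one toplevel transaction, call the output plugin, then discard them.
 */
static void
decode_change_with_combocids(ApplyCacheTXN *txn, ApplyCacheChange *change)
{
	int		i;

	/* set up every mapping collected for this transaction */
	for (i = 0; i < txn->ncombocids; i++)
		PutComboCommandId(txn->combocids[i].combocid,
						  txn->combocids[i].cmin,
						  txn->combocids[i].cmax);

	/* the plugin's catalog timetravel now sees correct cmin/cmax values */
	apply_change(txn, change);

	/* throw the per-transaction mappings away again */
	AtEOXact_ComboCid();
}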
> Comments about the approach or even the general direction of the > implementation? Questions? This patch series has gotten a serious amount of discussion and useful feedback; even some parts of it have been committed. I imagine lots more feedback, discussion and spawning of new ideas will take place in Prague. I am marking it as Returned with Feedback for now. Updated, rebased, modified versions are expected for the next commitfest. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
After some fooling around to provide the discussed backend functionality to xlogdump (StringInfo atop PQExpBuffer and elog_start/elog_finish), the following items still need work:

1. rmgr tables

We're linking rmgr.c so that we can obtain the appropriate rm_desc function pointer for each rmgr. However the table also includes the rm_redo, startup, etc function pointers, which the linker wants resolved at xlogdump link time. The idea I have to handle this is to use a macro similar to PG_KEYWORD: at compile time we define it differently on xlogdump than on backend, so that the symbols we don't want are hidden.

2. ereport() functionality

Currently the xlogreader.c I'm using (the latest version posted by Andres) has both elog() calls and ereport(). I have provided trivial elog_start and elog_finish implementations, which covers the first. I am not really sure about implementing the whole errstart/errfinish stack, because that'd be pretty duplicative, though I haven't tried. The other alternative suggested elsewhere is to avoid elog/ereport entirely in xlogreader.c and instead pass a function pointer for error reportage. The backend would normally use ereport(), but xlogdump could do something simple with fprintf. I think that would end up being cleaner overall.

3. timestamptz_to_str

xact_desc uses this, which involves a couple of messy backend files (because there's palloc in them, among other problems). Alternatively we could tweak xact_desc to use EncodeDateTime (probably through some simple wrapper); given the constraints imposed on the values, that might be simpler, and we can provide a simple implementation of EncodeDateTime or of its hypothetical wrapper in xlogdump.

4. relpathbackend and pfree of its return value

This is messy. Maybe we should use a caller-supplied buffer instead of palloc to solve this.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
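Item 2's function-pointer alternative could look roughly like the following; XLogReaderState is the struct from the patch under discussion, but the callback type, the field and the frontend implementation here are merely illustrative assumptions:

#include <stdarg.h>
#include <stdio.h>

typedef void (*XLogReaderErrorCB) (const char *fmt,...);

typedef struct XLogReaderState
{
	XLogReaderErrorCB report_error;	/* backend: wraps ereport(); xlogdump: fprintf */
	/* ... remaining reader state ... */
} XLogReaderState;

/* something simple that xlogdump could plug in */
static void
xlogdump_report_error(const char *fmt,...)
{
	va_list		args;

	va_start(args, fmt);
	vfprintf(stderr, fmt, args);
	va_end(args);
	fputc('\n', stderr);
}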
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Heikki Linnakangas escribió: > Hmm. I was thinking that making this work in a non-backend context > would be too hard, so I didn't give that much thought, but I guess > there isn't many dependencies to backend functions after all. > palloc/pfree are straightforward to replace with malloc/free. That's > what we could easily do with the error messages too, just malloc a > suitably sized buffer. > > How does a non-backend program get access to xlogreader.c? Copy > xlogreader.c from the source tree at build time and link into the > program? Or should we turn it into a shared library? It links the object file into the src/bin/ subdir or whatever. I don't think a shared library is warranted at this point. One further (relatively minor) problem with what you propose here is that the xlogreader.c code is using emode_for_corrupt_record(), which not only lives in xlog.c but also depends on readSource. I guess we still want the message-supressing abilities of that function, so some more thinking is required in this area. I think you may have converted some malloc() calls from Andres' patch into palloc() -- because you have some palloc() calls which are later checked for NULL results, which obviously doesn't make sense. At the same time, if we're going to use malloc() instead of palloc(), we need to check for NULL return value in XLogReaderAllocate() callers. This seems easy to fix at first glance, but what is the correct response if it fails during StartupXLOG()? Should we just elog(FATAL) and hope it never happens in practice? Andres commented elsewhere about reading xlog records, processing them as they came in, and do a running CRC while we're still reading it. I think this is a mistake; we shouldn't do anything with a record until the CRC has been verified. Otherwise we risk reading arbitrarily corrupt data. Overall, I think I like Heikki's minimal patch better than Andres' original proposal, but I'll keep looking at both. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
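Purely as a sketch of the caller-side question raised here: in StartupXLOG() the check could end up looking like the following, where the XLogReaderAllocate() signature and callback name are guesses rather than the patch's settled API, and whether FATAL is the right reaction is exactly what is being asked.

	/* private_data for the read callback omitted; names are assumptions */
	xlogreader = XLogReaderAllocate(&XLogPageRead, NULL);
	if (xlogreader == NULL)
		ereport(FATAL,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("out of memory"),
				 errdetail("Failed while allocating an XLog reading processor.")));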
Re: Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I think you may have converted some malloc() calls from Andres' patch > into palloc() -- because you have some palloc() calls which are later > checked for NULL results, which obviously doesn't make sense. At the > same time, if we're going to use malloc() instead of palloc(), we need > to check for NULL return value in XLogReaderAllocate() callers. This > seems easy to fix at first glance, but what is the correct response if > it fails during StartupXLOG()? Should we just elog(FATAL) and hope it > never happens in practice? Um, surely we can still let those functions use palloc? It should just be #define'd as pg_malloc() (ie something with an error exit) in non-backend contexts. regards, tom lane
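In other words, a minimal sketch of Tom's suggestion, assuming the frontend build has error-exiting pg_malloc()/pg_free() wrappers available:

#ifdef FRONTEND
#define palloc(sz)	pg_malloc(sz)
#define pfree(ptr)	pg_free(ptr)
#endif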
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote: > Heikki Linnakangas escribió: > > Hmm. I was thinking that making this work in a non-backend context > > would be too hard, so I didn't give that much thought, but I guess > > there isn't many dependencies to backend functions after all. > > palloc/pfree are straightforward to replace with malloc/free. That's > > what we could easily do with the error messages too, just malloc a > > suitably sized buffer. > > > > How does a non-backend program get access to xlogreader.c? Copy > > xlogreader.c from the source tree at build time and link into the > > program? Or should we turn it into a shared library? > Andres commented elsewhere about reading xlog records, processing them > as they came in, and do a running CRC while we're still reading it. I > think this is a mistake; we shouldn't do anything with a record until > the CRC has been verified. Otherwise we risk reading arbitrarily > corrupt data. Uhm. xlog.c does just the same. It reads the header and if it looks valid it uses its length information to read the full record and only computes the CRC at the end. > Overall, I think I like Heikki's minimal patch better than Andres' > original proposal, but I'll keep looking at both. I'll defer to you two on that, no point in fighting that many experienced people ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Andres Freund escribió: > On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote: > > Heikki Linnakangas escribió: > > Andres commented elsewhere about reading xlog records, processing them > > as they came in, and do a running CRC while we're still reading it. I > > think this is a mistake; we shouldn't do anything with a record until > > the CRC has been verified. Otherwise we risk reading arbitrarily > > corrupt data. > > Uhm. xlog.c does just the same. It reads the header and if it looks valid it > uses its length information to read the full record and only computes the CRC > at the end. Uh. Correct. Am I the only one who finds this rather bizarre? Maybe this was okay when xlog data would only come from WAL files stored in the data directory at recovery, but if we're now receiving these from a remote sender over the network I wonder if we should be protecting against malicious senders. (This is not related to this patch anyway.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: > Andres Freund escribió: > > On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote: > > > Heikki Linnakangas escribió: > > > > > > Andres commented elsewhere about reading xlog records, processing them > > > as they came in, and do a running CRC while we're still reading it. I > > > think this is a mistake; we shouldn't do anything with a record until > > > the CRC has been verified. Otherwise we risk reading arbitrarily > > > corrupt data. > > > > Uhm. xlog.c does just the same. It reads the header and if it looks valid > > it uses its length information to read the full record and only computes > > the CRC at the end. > > Uh. Correct. > > Am I the only one who finds this rather bizarre? Maybe this was okay > when xlog data would only come from WAL files stored in the data > directory at recovery, but if we're now receiving these from a remote > sender over the network I wonder if we should be protecting against > malicious senders. (This is not related to this patch anyway.) How should this work otherwise? The CRC is over the whole data so we obviously need to read the whole data to compute the CRC? Would you prefer protecting the header with a separate CRC? You can't use a CRC against malicous users anyway, its not cryptographically secure in any meaning of the word, its trivial to generate different content resulting in the same CRC. The biggest user of the CRC checking code we have is making sure were not reading beyond the end of the WAL... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes: > On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: >> Am I the only one who finds this rather bizarre? Maybe this was okay >> when xlog data would only come from WAL files stored in the data >> directory at recovery, but if we're now receiving these from a remote >> sender over the network I wonder if we should be protecting against >> malicious senders. (This is not related to this patch anyway.) > You can't use a CRC against malicous users anyway, its not cryptographically > secure in any meaning of the word, More to the point, if a bad guy has got control of your WAL stream, crashing the startup process with a ridiculous length word is several orders of magnitude less than the worst thing he can find to do to you. So the above argument seems completely nonsensical to me. Anybody who's worried about that type of scenario is better advised to be setting up SSL to ensure that the stream is coming from the server they think it's coming from. regards, tom lane
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Tom Lane escribió: > Andres Freund <andres@2ndquadrant.com> writes: > > On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: > >> Am I the only one who finds this rather bizarre? Maybe this was okay > >> when xlog data would only come from WAL files stored in the data > >> directory at recovery, but if we're now receiving these from a remote > >> sender over the network I wonder if we should be protecting against > >> malicious senders. (This is not related to this patch anyway.) > > > You can't use a CRC against malicous users anyway, its not cryptographically > > secure in any meaning of the word, > > More to the point, if a bad guy has got control of your WAL stream, > crashing the startup process with a ridiculous length word is several > orders of magnitude less than the worst thing he can find to do to you. > So the above argument seems completely nonsensical to me. Well, I wasn't talking just about the record length, but about the record in general. The length just came up because it's what I noticed. And yeah, I was thinking in one sum for the header and another one for the data. If we're using CRC to detect end of WAL, what sense does it make to have to read the whole record if we can detect the end by just looking at the header? (I obviously see that we need to checksum the data as well; and having a second CRC field would bloat the record.) > Anybody who's worried about that type of scenario is better advised to > be setting up SSL to ensure that the stream is coming from the server > they think it's coming from. Okay. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
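To make the bloat argument concrete, a purely hypothetical record header with the checksum split in two might look like this; nobody is proposing this layout, and it costs an extra pg_crc32 in every single record:

typedef struct XLogRecordTwoCrc
{
	pg_crc32	xl_hdr_crc;		/* CRC over the header fields below */
	pg_crc32	xl_data_crc;	/* CRC over rmgr data and backup blocks */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	TransactionId xl_xid;		/* xact id */
	uint32		xl_tot_len;		/* total len of entire record */
	uint32		xl_len;			/* total len of rmgr data */
	uint8		xl_info;		/* flag bits */
	RmgrId		xl_rmid;		/* resource manager for this record */
} XLogRecordTwoCrc;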
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > And yeah, I was thinking in one sum for the header and another one for > the data. I don't think it's worth the space. > If we're using CRC to detect end of WAL, what sense does it > make to have to read the whole record if we can detect the end by just > looking at the header? Er, what? The typical case where CRC shows us it's end of WAL is that the overall CRC doesn't match, ie, torn record. Record header corruption as such is just about nonexistent AFAIK. regards, tom lane
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Tuesday, October 30, 2012 04:24:21 PM Alvaro Herrera wrote: > Tom Lane escribió: > > Andres Freund <andres@2ndquadrant.com> writes: > > > On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: > > >> Am I the only one who finds this rather bizarre? Maybe this was okay > > >> when xlog data would only come from WAL files stored in the data > > >> directory at recovery, but if we're now receiving these from a remote > > >> sender over the network I wonder if we should be protecting against > > >> malicious senders. (This is not related to this patch anyway.) > > > > > > You can't use a CRC against malicous users anyway, its not > > > cryptographically secure in any meaning of the word, > > > > More to the point, if a bad guy has got control of your WAL stream, > > crashing the startup process with a ridiculous length word is several > > orders of magnitude less than the worst thing he can find to do to you. > > So the above argument seems completely nonsensical to me. > > Well, I wasn't talking just about the record length, but about the > record in general. The length just came up because it's what I noticed. > > And yeah, I was thinking in one sum for the header and another one for > the data. If we're using CRC to detect end of WAL, what sense does it > make to have to read the whole record if we can detect the end by just > looking at the header? (I obviously see that we need to checksum the > data as well; and having a second CRC field would bloat the record.) Well, the header is written first. In the header we can detect somewhat accurately that were beyond the current end-of-wal by looking at ->xl_prev and doing some validity checks, but thats not applicable for the data. A valid looking header doesn't mean that the whole, potentially multi-megabyte, record data is valid, we could have crashed while writing the data. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
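The header-only validity checking referred to here (in the spirit of what xlog.c already does before trusting xl_tot_len) might be sketched as follows; the function name and the exact set of checks are simplified assumptions:

/*
 * Decide whether a just-read record header looks sane enough to trust its
 * length; the CRC over the full record data can only be verified later.
 */
static bool
header_looks_valid(XLogRecord *record, XLogRecPtr expected_prev)
{
	if (record->xl_tot_len < SizeOfXLogRecord)
		return false;			/* implausibly short */

	if (record->xl_rmid > RM_MAX_ID)
		return false;			/* unknown resource manager */

	if (record->xl_prev != expected_prev)
		return false;			/* back-link does not match the previous record */

	return true;
}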
I mentioned the remaining issues in a previous email (see message-id 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables xlogdump to #include xlog_internal.h by way of removing that file's inclusion of fmgr.h, which is problematic. I don't think this should be too contentious.

The other interesting question remaining is what to do about the rm_desc function in rmgr.c. We're split between these two ideas:

1. Have this in rmgr.c:

#ifdef FRONTEND
#define RMGR_REDO_FUNC(func) NULL
#else
#define RMGR_REDO_FUNC(func) func
#endif /* FRONTEND */

and then use RMGR_REDO_FUNC() in the table.

2. Have this in rmgr.c:

#ifndef RMGR_REDO_FUNC
#define RMGR_REDO_FUNC(func) func
#endif

And then have the xlogdump Makefile use -D to define a suitable RMGR_REDO_FUNC.

Opinions please?

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote: > I mentioned the remaining issues in a previous email (see message-id > 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables > xlogdump to #include xlog_internal.h by way of removing that file's > inclusion of fmgr.h, which is problematic. I don't think this should be > too contentious. I have tried to go through Xlogdump patch provided in mail chain of message-id provided. It seems there is no appropriate file/function header which makes it little difficult to understand the purpose. This is just a suggestion and not related to your this mail. > The other interesting question remaining is what to do about the rm_desc > function in rmgr.c. We're split between these two ideas: > > 1. Have this in rmgr.c: > > #ifdef FRONTEND > #define RMGR_REDO_FUNC(func) NULL > #else > #define RMGR_REDO_FUNC(func) func > #endif /* FRONTEND */ > > and then use RMGR_REDO_FUNC() in the table. > > > 2. Have this in rmgr.c: > > #ifndef RMGR_REDO_FUNC > #define RMGR_REDO_FUNC(func) func > #endif > > And then have the xlogdump Makefile use -D to define a suitable > RMGR_REDO_FUNC. > What I understood is that as there is only a need of rm_desc function in xlogdump, so we want to separate it out such that it can be easily used. In Approach-1, define for some of functions (redo, startup, cleanup,..) as NULL for frontends so that corresponding functions for frontends become hidden. In Approach-2, frontend (in this case Xlogdump) need to define value for that particular Macro by using -D in makefile. If my above thinking is right, then I think Approach-2 might be better as in that Frontend will have more flexibility if it wants to use some other functionality of rmgr. With Regards, Amit Kapila.
On 2012-11-28 18:58:45 +0530, Amit Kapila wrote: > On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote: > > I mentioned the remaining issues in a previous email (see message-id > > 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables > > xlogdump to #include xlog_internal.h by way of removing that file's > > inclusion of fmgr.h, which is problematic. I don't think this should be > > too contentious. > > I have tried to go through Xlogdump patch provided in mail chain of > message-id provided. > It seems there is no appropriate file/function header which makes it little > difficult to understand the purpose. > This is just a suggestion and not related to your this mail. An updated version of xlogdump with some initial documentation, sensible building, and some more is available at http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlogreader_v3 > > The other interesting question remaining is what to do about the rm_desc > > function in rmgr.c. We're split between these two ideas: > > > > 1. Have this in rmgr.c: > > > > #ifdef FRONTEND > > #define RMGR_REDO_FUNC(func) NULL > > #else > > #define RMGR_REDO_FUNC(func) func > > #endif /* FRONTEND */ > > > > and then use RMGR_REDO_FUNC() in the table. > > > > > > 2. Have this in rmgr.c: > > > > #ifndef RMGR_REDO_FUNC > > #define RMGR_REDO_FUNC(func) func > > #endif > > > > And then have the xlogdump Makefile use -D to define a suitable > > RMGR_REDO_FUNC. > > > > What I understood is that as there is only a need of rm_desc function in > xlogdump, so we want to separate it out such that > it can be easily used. > In Approach-1, define for some of functions (redo, startup, cleanup,..) as > NULL for frontends so that corresponding functions for frontends become > hidden. > In Approach-2, frontend (in this case Xlogdump) need to define value for > that particular Macro by using -D in makefile. > > If my above thinking is right, then I think Approach-2 might be better as in > that Frontend will have more flexibility if it wants > to use some other functionality of rmgr. I personally favor approach-1 because I cannot see any potential other use. You basically need to have the full backend code available just to successfully link the other functions. Running is even more complex, and there's no real point in doing that standalone anyway, so, what for? Its not like thats something that annot be changed should an actual usecase emerge. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, November 28, 2012 7:07 PM Andres Freund wrote: > On 2012-11-28 18:58:45 +0530, Amit Kapila wrote: > > On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote: > > > I mentioned the remaining issues in a previous email (see message-id > > > 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that > enables > > > xlogdump to #include xlog_internal.h by way of removing that file's > > > inclusion of fmgr.h, which is problematic. I don't think this > should be > > > too contentious. > > > > I have tried to go through Xlogdump patch provided in mail chain of > > message-id provided. > > It seems there is no appropriate file/function header which makes it > little > > difficult to understand the purpose. > > This is just a suggestion and not related to your this mail. > > An updated version of xlogdump with some initial documentation, sensible > building, and some more is available at > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=sh > ortlog;h=refs/heads/xlogreader_v3 Oops.. looked at wrong version. > > > The other interesting question remaining is what to do about the > rm_desc > > > function in rmgr.c. We're split between these two ideas: > > > > > > 1. Have this in rmgr.c: > > > > > > #ifdef FRONTEND > > > #define RMGR_REDO_FUNC(func) NULL > > > #else > > > #define RMGR_REDO_FUNC(func) func > > > #endif /* FRONTEND */ > > > > > > and then use RMGR_REDO_FUNC() in the table. > > > > > > > > > 2. Have this in rmgr.c: > > > > > > #ifndef RMGR_REDO_FUNC > > > #define RMGR_REDO_FUNC(func) func > > > #endif > > > > > > And then have the xlogdump Makefile use -D to define a suitable > > > RMGR_REDO_FUNC. > > > > > > > What I understood is that as there is only a need of rm_desc function > in > > xlogdump, so we want to separate it out such that > > it can be easily used. > > In Approach-1, define for some of functions (redo, startup, > cleanup,..) as > > NULL for frontends so that corresponding functions for frontends > become > > hidden. > > In Approach-2, frontend (in this case Xlogdump) need to define value > for > > that particular Macro by using -D in makefile. > > > > If my above thinking is right, then I think Approach-2 might be better > as in > > that Frontend will have more flexibility if it wants > > to use some other functionality of rmgr. > > I personally favor approach-1 because I cannot see any potential other > use. You basically need to have the full backend code available just to > successfully link the other functions. Running is even more complex, and > there's no real point in doing that standalone anyway, so, what for? Such functionality might be used if somebody wants to write independent test for storage engine, but not sure if such a thing (Approach-2) can be helpful. As I can see that Approach-1 has advantage, there will be no dependency in makefiles for exposing rm_desc functionality. And for Approach-2, it is unnecessary for makefile to define value if there is actually no other relevant use case for it. Can you think of any other pros-cons of both approaches, or do you think we already have sufficient reasoning to conclude on Approach-1. With Regards, Amit Kapila.
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > The other interesting question remaining is what to do about the rm_desc > function in rmgr.c. We're split between these two ideas: Why try to link rmgr.c into frontend versions at all? Just make a new table file that only includes the desc function pointers. Yeah, then there would be two table files to maintain, but it's not clear to me that it's uglier than these proposals ... regards, tom lane
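Tom's alternative would amount to a second, desc-only table along these lines; the struct name and the idea of a separate frontend-only file are assumptions, and the real list would have one entry per rmgr:

/* kept in sync with rmgr.c by hand; only rm_desc is needed by xlogdump */
typedef struct RmgrDescData
{
	const char *rm_name;
	void		(*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
} RmgrDescData;

static const RmgrDescData RmgrDescTable[] = {
	{"XLOG", xlog_desc},
	{"Transaction", xact_desc},
	{"Storage", smgr_desc},
	{"CLOG", clog_desc},
	{"Heap", heap_desc},
	{"Btree", btree_desc},
	/* ... and so on, in the same order as the backend's RmgrTable ... */
};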
On 2012-11-29 15:03:48 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > The other interesting question remaining is what to do about the rm_desc > > function in rmgr.c. We're split between these two ideas: > > Why try to link rmgr.c into frontend versions at all? Just make > a new table file that only includes the desc function pointers. > Yeah, then there would be two table files to maintain, but it's > not clear to me that it's uglier than these proposals ... Seems more likely to get out of sync. But then, rmgr.c isn't changing that heavily... I still prefer the -DFRONTEND solutions, but once more, its only a slight preference. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 29, 2012 at 3:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> The other interesting question remaining is what to do about the rm_desc >> function in rmgr.c. We're split between these two ideas: > > Why try to link rmgr.c into frontend versions at all? Just make > a new table file that only includes the desc function pointers. > Yeah, then there would be two table files to maintain, but it's > not clear to me that it's uglier than these proposals ... +1. It's not likely to get updated very often, we can add comments to remind people to keep them all in sync, and if you manage to screw it up without noticing then you are adding recovery code that you have not tested in any way whatsoever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company