Thread: [RFC][PATCH] wal decoding, attempt #2
Hi,

It took me far longer than I planned, it's not finished, but time is running out. I would like some feedback that I am not going astray at this point... *I* think the general approach is sound and a good way forward that provides the basic infrastructure for many (all?) of the scenarios we talked about before.

Anyway, here is my next attempt at $TOPIC. Let's start with a quick demo (via psql):

/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';

DROP TABLE IF EXISTS replication_example;
SELECT pg_current_xlog_insert_location();
CHECKPOINT;

CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;

ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);

BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);
commit;

ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);

BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;

ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);

ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);

SELECT pg_current_xlog_insert_location();

---- Somewhat later ----

SELECT decode_xlog('0/1893D78', '0/18BE398');
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]: (null)
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING: COMMIT
 decode_xlog
-------------
 t
(1 row)

As you can see, the patchset can decode several changes made to a table even though we used DDL on it. Not everything is handled yet, but it's a prototype after all ;)

The way this works is: a new component called SnapshotBuilder analyzes the xlog and builds a special kind of Snapshot. This works in a somewhat similar way to the KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls an 'apply_change' callback for a single change (INSERT|UPDATE|DELETE) it locally overrides the normal SnapshotNow semantics used for catalog access with one of the previously built snapshots. Those should behave just the same as a normal SnapshotNow would have behaved when the tuple change was written to the xlog.

This patch doesn't provide anything that uses the new infrastructure for anything real, but I think that's good. Let's get this into something committable and then add new things using it!

Small overview over the individual patches that will come as separate mails:

old, Alvaro is doing this properly right now, separate thread:
[01] Add embedded list interface (header only)

A new piece of infrastructure (for k-way mergesort), pretty much untested, good idea in general I think, not very interesting:
[02] Add minimal binary heap implementation

Boring, old:
[03] Add support for a generic wal reading facility dubbed XLogReader

Boring, old, borked:
[04] add simple xlogdump tool

Slightly changed to use (tablespace, relfilenode), possibly similar problems to earlier, not interesting at this point:
[05] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)

Unchanged:
[06] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical

I didn't implement proper cache handling, so I need to use the big hammer...:
[07] Make InvalidateSystemCaches public

The major piece:
[08] Introduce wal decoding via catalog timetravel

[08] has loads of deficiencies. To cite the commit:

The snapshot building has the most critical infrastructure but misses several important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The decode_xlog() function is *purely* a debugging tool that I do not want to keep in the long run. I introduced it so we can concentrate on the topic at hand without involving even more moving parts (see the next paragraph)...

Some parts of this I would like to only discuss later, in separate threads, to avoid cluttering this one more than necessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal

I have some work on top of this that handles ComboCids and CommandIds correctly (and thus mixed ddl/dml transactions), but it's simply not finished enough. I am pretty sure by now that it works even with those additional complexities.

So, I am unfortunately too tired to write more than this... It will have to suffice. I plan to release a newer version with more documentation soon.

Comments about the approach or even the general direction of the implementation? Questions?

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
This is basically untested.
---
 src/backend/lib/Makefile     |   2 +-
 src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/simpleheap.h |  91 +++++++++++++++
 3 files changed, 347 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/simpleheap.c
 create mode 100644 src/include/lib/simpleheap.h
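Since the simpleheap interface itself isn't shown here, a minimal illustrative sketch of the general idea - an array-based binary min-heap with a caller-supplied comparator, which is the structure a k-way merge needs - might look roughly like this. All names below are made up for illustration and are not the proposed simpleheap.h API:

#include <stdlib.h>

/* illustrative only -- not the proposed simpleheap.h API */
typedef struct SimpleHeap
{
    void      **nodes;                          /* array of entries */
    int         size;                           /* entries in use */
    int         space;                          /* allocated slots */
    int       (*compare) (const void *a, const void *b);
} SimpleHeap;

/* restore the heap property after placing an element at position i */
static void
heap_sift_up(SimpleHeap *heap, int i)
{
    while (i > 0)
    {
        int     parent = (i - 1) / 2;
        void   *tmp;

        if (heap->compare(heap->nodes[i], heap->nodes[parent]) >= 0)
            break;
        tmp = heap->nodes[i];
        heap->nodes[i] = heap->nodes[parent];
        heap->nodes[parent] = tmp;
        i = parent;
    }
}

/* add one element; the caller guarantees size < space */
static void
heap_add(SimpleHeap *heap, void *entry)
{
    heap->nodes[heap->size] = entry;
    heap_sift_up(heap, heap->size);
    heap->size++;
}

A k-way merge would presumably keep one entry per input stream in such a heap and repeatedly pop the smallest, which seems to be the use case hinted at in the patch overview.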
Adds a single and a double linked list which can easily be embedded into other data structures and can be used without any additional allocations.

Problematic: It requires USE_INLINE to be used. It could be remade to fall back to externally defined functions if that is not available, but that hardly seems sensible in this day and age. Besides, the speed hit would be noticeable and it's only used in new code, which could be disabled on machines - given they still exist - without proper support for inline functions.
---
 src/include/utils/ilist.h | 253 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 253 insertions(+)
 create mode 100644 src/include/utils/ilist.h
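To illustrate what "embedded" means here: the list links live inside the entry struct itself, so list membership costs no extra allocation. A rough sketch of that idea follows; the names and exact API of the proposed ilist.h may well differ:

/* illustrative sketch of an embedded (intrusive) doubly linked list */
typedef struct dlist_node
{
    struct dlist_node *prev;
    struct dlist_node *next;
} dlist_node;

typedef struct MyEntry
{
    int         value;
    dlist_node  node;           /* list membership lives inside the entry */
} MyEntry;

/* initialize an empty circular list headed by 'head' */
static inline void
dlist_init(dlist_node *head)
{
    head->prev = head->next = head;
}

/* insert 'node' right after 'head' -- no allocation involved */
static inline void
dlist_push_after(dlist_node *head, dlist_node *node)
{
    node->prev = head;
    node->next = head->next;
    head->next->prev = node;
    head->next = node;
}

An entry is linked simply by passing &entry->node; unlinking likewise only touches the embedded pointers, which is why no per-node allocation is ever needed.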
---
 src/bin/Makefile            |   2 +-
 src/bin/xlogdump/Makefile   |  25 ++++
 src/bin/xlogdump/xlogdump.c | 334 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 360 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/xlogdump/Makefile
 create mode 100644 src/bin/xlogdump/xlogdump.c
Pieces of this are in commit: make relfilenode lookup (tablespace, relfilenode).
---
 src/backend/utils/cache/inval.c | 2 +-
 src/include/utils/inval.h       | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)
[PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants to do so is not tightly integrated into xlog.c is rather hard and would require changes to rather integral parts of the recovery code, which doesn't seem to be a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
 src/backend/access/transam/Makefile     |    2 +-
 src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  264 ++++
 3 files changed, 1297 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h
[PATCH 6/8] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical
From: Andres Freund
This adds a new wal_level value 'logical'.

Missing cases:
- heap_multi_insert
- primary key changes for updates
- no primary key
- LOG_NEWPAGE
---
 src/backend/access/heap/heapam.c        | 135 +++++++++++++++++++++++++++++---
 src/backend/access/transam/xlog.c       |   1 +
 src/backend/catalog/index.c             |  74 +++++++++++++++++
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/include/access/xlog.h               |   3 +-
 src/include/catalog/index.h             |   4 +
 6 files changed, 207 insertions(+), 12 deletions(-)
[PATCH 5/8] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)
From: Andres Freund
This patch is problematic because formally indexes used by syscaches need to be unique; this one is not, though, because of 0/InvalidOid relfilenode entries for nailed/shared catalog entries. Those values cannot be sensibly queried from the catalog anyway (the relmapper infrastructure needs to be used).

It might be nicer to add infrastructure to do this properly, I just don't have a clue what the best way for this would be.
---
 src/backend/utils/cache/syscache.c | 11 +++++++++++
 src/include/catalog/indexing.h     |  2 ++
 src/include/catalog/pg_proc.h      |  1 +
 src/include/utils/syscache.h       |  1 +
 4 files changed, 15 insertions(+)
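For context, a lookup through such a syscache would presumably look something like the sketch below. The cache identifier name and the omission of the relmapper special case for mapped relations (which store 0 in relfilenode) are assumptions on my part, not code from the patch:

#include "postgres.h"
#include "utils/syscache.h"

/* hypothetical sketch: resolve (reltablespace, relfilenode) to a pg_class OID
 * via the proposed syscache; mapped relations would need the relmapper instead */
static Oid
relation_oid_by_filenode(Oid reltablespace, Oid relfilenode)
{
    /* RELFILENODE is the proposed cache identifier; the name is assumed here */
    return GetSysCacheOid2(RELFILENODE,
                           ObjectIdGetDatum(reltablespace),
                           ObjectIdGetDatum(relfilenode));
}

GetSysCacheOid2() returns InvalidOid when no matching tuple is found, which is presumably how "no such relation" would be reported to callers.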
This introduces several things:
* applycache module which reassembles transactions from a stream of interspersed changes
* snapbuilder which builds catalog snapshots so that tuples from wal can be understood
* wal decoding into an applycache
* decode_xlog(lsn, lsn) debugging function

The applycache provides 3 major callbacks:
* apply_begin
* apply_change
* apply_commit

It is missing several parts:
- spill-to-disk
- resource usage controls
- command id handling
- passing of the correct mvcc snapshot (already has it, just doesn't pass)

The snapshot building has the most critical infrastructure but misses several important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The xlog decoding also misses several parts:
- HEAP_NEWPAGE support
- HEAP2_MULTI_INSERT support
- handling of table rewrites
---
 src/backend/replication/Makefile               |    2 +
 src/backend/replication/logical/Makefile       |   19 +
 src/backend/replication/logical/applycache.c   |  574 +++++++++++++
 src/backend/replication/logical/decode.c       |  366 +++++++++
 src/backend/replication/logical/logicalfuncs.c |  237 ++++++
 src/backend/replication/logical/snapbuild.c    | 1045 ++++++++++++++++++++++++
 src/backend/utils/time/tqual.c                 |  161 ++++
 src/include/access/transam.h                   |    5 +
 src/include/catalog/pg_proc.h                  |    3 +
 src/include/replication/applycache.h           |  239 ++++++
 src/include/replication/decode.h               |   26 +
 src/include/replication/snapbuild.h            |  119 +++
 src/include/utils/tqual.h                      |   21 +-
 13 files changed, 2816 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/replication/logical/Makefile
 create mode 100644 src/backend/replication/logical/applycache.c
 create mode 100644 src/backend/replication/logical/decode.c
 create mode 100644 src/backend/replication/logical/logicalfuncs.c
 create mode 100644 src/backend/replication/logical/snapbuild.c
 create mode 100644 src/include/replication/applycache.h
 create mode 100644 src/include/replication/decode.h
 create mode 100644 src/include/replication/snapbuild.h
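To give an idea of how those three callbacks are meant to be consumed, here is a minimal sketch of a consumer that merely logs what it sees, modelled on the decode_xlog() registration code quoted later in this thread. The callback signatures and struct member names follow the ApplyCache typedefs from the patch, but the logging bodies are made up for illustration:

/* minimal illustrative consumer of the three applycache callbacks */
static void
demo_begin_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
    elog(WARNING, "BEGIN");
}

static void
demo_change(ApplyCache *cache, ApplyCacheTXN *txn,
            ApplyCacheTXN *subtxn, ApplyCacheChange *change)
{
    /* a real consumer would decode the tuple here, relying on the catalog
     * snapshot that the snapbuilder has installed for this change */
    elog(WARNING, "change in relfilenode %u", change->relnode.relNode);
}

static void
demo_commit_txn(ApplyCache *cache, ApplyCacheTXN *txn)
{
    elog(WARNING, "COMMIT");
}

/* wiring it up, inside whatever function sets up decoding,
 * just as decode_xlog() does with its own callbacks */
apply_cache = ApplyCacheAllocate();
apply_cache->begin = demo_begin_txn;
apply_cache->apply_change = demo_change;
apply_cache->commit = demo_commit_txn;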
Hi,

A last note: A git tree of this is at
git://git.postgresql.org/git/users/andresfreund/postgres.git
branch xlog-decoding-rebasing-cf2

Checkout with:
git clone --branch xlog-decoding-rebasing-cf2 git://git.postgresql.org/git/users/andresfreund/postgres.git

Webview:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2

That branch will be regularly rebased to a new master, fixes/new features, and a pgindent run over the new files...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, September 15, 2012 03:14:32 AM Andres Freund wrote:
> That branch will be regularly rebased to a new master, fixes/new features,
> and a pgindent run over the new files...

I fixed up the formatting of the new stuff (xlogreader, ilist are submitted separately, no point in doing anything there). Pushed to the repo mentioned upthread.

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Now that I proposed a new syscache upthread it's easily possible to provide pg_relation_by_filenode, which I wished for multiple times in the past when looking at filesystem activity and wondering which table does what.

You can sort of get the same result via

SELECT oid
FROM (
    SELECT oid, pg_relation_filenode(oid::regclass) filenode
    FROM pg_class
    WHERE relkind != 'v'
) map
WHERE map.filenode = ...;

but that's neither efficient nor obvious.

So, two patches to do this:

Did others need this in the past? I can live with the 2nd patch living in a private extension somewhere. The first one would also be useful for some error/debug messages during decoding...

Greetings,

Andres
[PATCH 2/2] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
From: Andres Freund
This requires the previously added RELFILENODE syscache.
---
 doc/src/sgml/func.sgml         | 23 ++++++++++++-
 src/backend/utils/adt/dbsize.c | 78 ++++++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 4 files changed, 103 insertions(+), 1 deletion(-)
[PATCH 1/2] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
From: Andres Freund
---
 src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
 src/include/utils/relmapper.h       |  2 ++
 2 files changed, 55 insertions(+)
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> This requires the previously added RELFILENODE syscache.

[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one, and I doubt it would work given that the contents of pg_class.relfilenode aren't unique (the zero entries are the problem).

regards, tom lane
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to look up a relation given the tablespace and the filenode OIDs
From: Andres Freund
On Monday, September 17, 2012 12:35:32 AM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > This requires the previously added RELFILENODE syscache.
>
> [ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
> and I doubt it would work given that the contents of
> pg_class.relfilenode aren't unique (the zero entries are the problem).

Well, one patch upthread ;). It mentions the problem of it not being unique due to relfilenode in (reltablespace, relfilenode) being 0 for shared/nailed catalogs.

I am not really sure yet how big a problem it is for the caching infrastructure that values that shouldn't ever get queried (because the relfilenode is actually different) are duplicated. Reading code about all that atm.

Robert suggested writing a specialized cache akin to what's done in attoptcache.c or such. I haven't formed an opinion on what's the way forward on that topic.

But anyway, I don't see how the wal decoding stuff can progress without some variant of that mapping, so I sure hope I/we can build something. Changing that aspect of the patch should be trivial...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 15.09.2012 03:39, Andres Freund wrote:
>
> Features:
> - streaming reading/writing
> - filtering
> - reassembly of records
>
> Reusing the ReadRecord infrastructure in situations where the code that wants
> to do so is not tightly integrated into xlog.c is rather hard and would require
> changes to rather integral parts of the recovery code which doesn't seem to be
> a good idea.

My previous objections to this approach still apply:

1. I don't want to maintain a second copy of the code to read xlog.

2. We should focus on reading WAL, I don't see the point of mixing WAL writing into this.

3. I don't like the callback-style API.

I came up with the attached. I moved ReadRecord and some supporting functions from xlog.c to xlogreader.c, and made it operate on XLogReaderState instead of global variables. As discussed before, I didn't like the callback-style API, I think the consumer of the API should rather just call ReadRecord repeatedly to get each record. So that's what I did.

There is still one callback, XLogPageRead(), to obtain a given page in WAL. The XLogReader facility is responsible for decoding the WAL into records, but the user of the facility is responsible for supplying the physical bytes, via the callback. So the usage is like this:

/*
 * Callback to read the page starting at 'RecPtr' into *readBuf. It's
 * up to you to do this any way you like. Typically you'd read from a
 * file. The WAL recovery implementation of this in xlog.c is more
 * complicated. It checks the archive, waits for streaming replication
 * etc.
 */
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char *readBuf,
               void *private_data)
{
    ...
}

state = XLogReaderAllocate(&MyXLogPageRead);

while ((record = XLogReadRecord(state, ...)))
{
    /* do something with the record */
}

XLogReaderFree(state);

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
Hi Heikki,

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> On 15.09.2012 03:39, Andres Freund wrote:
> > Features:
> > - streaming reading/writing
> > - filtering
> > - reassembly of records
> >
> > Reusing the ReadRecord infrastructure in situations where the code that
> > wants to do so is not tightly integrated into xlog.c is rather hard and
> > would require changes to rather integral parts of the recovery code
> > which doesn't seem to be a good idea.
>
> My previous objections to this approach still apply. 1. I don't want to
> maintain a second copy of the code to read xlog.

Yes. I agree. And I am willing to provide an implementation of this should my xlogreader variant get a bit more buy-in.

> 2. We should focus on reading WAL, I don't see the point of mixing WAL
> writing into this.

If you write something that filters/analyzes and then forwards WAL and you want to do that without a big overhead (i.e. completely reassembling everything, and then deassembling it again for writeout) it's hard to do that without integrating both sides.

Also, I want to read records incrementally/partially just as data comes in, which again is hard to combine with writing out the data again.

> 3. I don't like the callback-style API.

I tried to accommodate that by providing:

extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);

which does exactly that.

> I came up with the attached. I moved ReadRecord and some supporting
> functions from xlog.c to xlogreader.c, and made it operate on
> XLogReaderState instead of global variables. As discussed before,
> I didn't like the callback-style API, I think the consumer of the API
> should rather just call ReadRecord repeatedly to get each record. So
> that's what I did.

The problem with that kind of API is that, at least as far as I can see, it never can operate on incomplete/partial input. You need to buffer larger amounts of xlog somewhere and you need to be aware of record boundaries. Both are things I dislike in a more generic user than xlog.c.

> There is still one callback, XLogPageRead(), to obtain a given page in
> WAL. The XLogReader facility is responsible for decoding the WAL into
> records, but the user of the facility is responsible for supplying the
> physical bytes, via the callback.

Makes sense.

> So the usage is like this:
>
> /*
>  * Callback to read the page starting at 'RecPtr' into *readBuf. It's
>  * up to you to do this any way you like. Typically you'd read from a
>  * file. The WAL recovery implementation of this in xlog.c is more
>  * complicated. It checks the archive, waits for streaming replication
>  * etc.
>  */
> static bool
> MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char
> *readBuf, void *private_data)
> {
>     ...
> }
>
> state = XLogReaderAllocate(&MyXLogPageRead);
>
> while ((record = XLogReadRecord(state, ...)))
> {
>     /* do something with the record */
> }

If you don't want the capability to forward/filter the data and read partial data without regard for record constraints/buffering, your patch seems to be quite a good start. It misses xlogreader.h though...

Do my aims make any sense to you?

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 11:12, Andres Freund wrote:
> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
>> On 15.09.2012 03:39, Andres Freund wrote:
>> 2. We should focus on reading WAL, I don't see the point of mixing WAL
>> writing into this.
> If you write something that filters/analyzes and then forwards WAL and you
> want to do that without a big overhead (i.e. completely reassembling
> everything, and then deassembling it again for writeout) it's hard to do
> that without integrating both sides.

It seems really complicated to filter/analyze WAL records without reassembling them, anyway. The user of the facility is in charge of reading the physical data, so you can still access the raw data, for forwarding purposes, in addition to the reassembled records.

Or what exactly do you mean by "completely deassembling"? I read that to mean dealing with page boundaries, ie. if a record is split across pages, copy parts into a contiguous temporary buffer.

> Also, I want to read records incrementally/partially just as data comes in,
> which again is hard to combine with writing out the data again.

You mean, you want to start reading the first half of a record, before the 2nd half is available? That seems complicated. I'd suggest keeping it simple for now, and optimize later if necessary. Note that before you have the whole WAL record, you cannot CRC check it, so you don't know if it's in fact a valid WAL record.

>> I came up with the attached. I moved ReadRecord and some supporting
>> functions from xlog.c to xlogreader.c, and made it operate on
>> XLogReaderState instead of global variables. As discussed before,
>> I didn't like the callback-style API, I think the consumer of the API
>> should rather just call ReadRecord repeatedly to get each record. So
>> that's what I did.
> The problem with that kind of API is that, at least as far as I can see,
> it never can operate on incomplete/partial input. You need to buffer
> larger amounts of xlog somewhere and you need to be aware of record
> boundaries. Both are things I dislike in a more generic user than xlog.c.

I don't understand that argument. A typical large WAL record is split across 1-2 pages, maybe 3-4 at most, for an index page split record. That doesn't feel like much to me. In extreme cases, a WAL record can be much larger (e.g a commit record of a transaction with a huge number of subtransactions), but that should be rare in practice.

The user of the facility doesn't need to be aware of record boundaries, that's the responsibility of the facility. I thought that's exactly the point of generalizing this thing, to make it unnecessary for the code that uses it to be aware of such things.

> If you don't want the capability to forward/filter the data and read
> partial data without regard for record constraints/buffering, your patch
> seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> On 17.09.2012 11:12, Andres Freund wrote:
> > On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> >> On 15.09.2012 03:39, Andres Freund wrote:
> >> 2. We should focus on reading WAL, I don't see the point of mixing WAL
> >> writing into this.
> > If you write something that filters/analyzes and then forwards WAL and
> > you want to do that without a big overhead (i.e. completely reassembling
> > everything, and then deassembling it again for writeout) it's hard to do
> > that without integrating both sides.
>
> It seems really complicated to filter/analyze WAL records without
> reassembling them, anyway. The user of the facility is in charge of
> reading the physical data, so you can still access the raw data, for
> forwarding purposes, in addition to the reassembled records.

It works ;)

> Or what exactly do you mean by "completely deassembling"? I read that to
> mean dealing with page boundaries, ie. if a record is split across
> pages, copy parts into a contiguous temporary buffer.

Well, if you want to fully split reading and writing of records - which is a nice goal! - you basically need the full logic of XLogInsert again to split them apart again to write them. Alternatively you need to store record boundaries somewhere and copy that way, but in the end if you filter you need to correct CRCs...

> > Also, I want to read records incrementally/partially just as data comes
> > in, which again is hard to combine with writing out the data again.
>
> You mean, you want to start reading the first half of a record, before
> the 2nd half is available? That seems complicated.

Well, I just can say again: It works ;). Makes it easy to follow something like XLogwrtResult without taking care about record boundaries.

> I'd suggest keeping it simple for now, and optimize later if necessary.

Well, yes. The API should be able to comfortably support those cases though, which I don't think is necessarily the case in a simple, one call API as proposed.

> Note that before you have the whole WAL record, you cannot CRC check it, so
> you don't know if it's in fact a valid WAL record.

Sure. But you can start the CRC computation without any problems and finish it when the last part of the data comes in.

> >> I came up with the attached. I moved ReadRecord and some supporting
> >> functions from xlog.c to xlogreader.c, and made it operate on
> >> XLogReaderState instead of global variables. As discussed before,
> >> I didn't like the callback-style API, I think the consumer of the API
> >> should rather just call ReadRecord repeatedly to get each record. So
> >> that's what I did.
> >
> > The problem with that kind of API is that, at least as far as I can see,
> > it never can operate on incomplete/partial input. You need to buffer
> > larger amounts of xlog somewhere and you need to be aware of record
> > boundaries. Both are things I dislike in a more generic user than xlog.c.
>
> I don't understand that argument. A typical large WAL record is split
> across 1-2 pages, maybe 3-4 at most, for an index page split record.
> That doesn't feel like much to me. In extreme cases, a WAL record can be
> much larger (e.g a commit record of a transaction with a huge number of
> subtransactions), but that should be rare in practice.

Well, imagine something like the walsender that essentially follows the flush position, ideally without regard for record boundaries. It is nice to be able to send/analyze/filter as soon as possible without waiting till a page is full. And it sure would be nice to be able to read the data on the other side directly from the network, decompress it again, and only then store it to disk.

> The user of the facility doesn't need to be aware of record boundaries,
> that's the responsibility of the facility. I thought that's exactly the
> point of generalizing this thing, to make it unnecessary for the code
> that uses it to be aware of such things.

With the proposed API it seems pretty much a requirement to wait inside the callback. That's not really nice if your process has other things to wait for as well. In my proposal you can simply do something like:

XLogReaderRead(state);

DoSomeOtherWork();

if (CheckForMessagesFromWalreceiver())
    ProcessMessages();
else if (state->needs_input)
    UseLatchOrSelectOnInputSocket();
else if (state->needs_output)
    UseSelectOnOutputSocket();

but you can also do something like waiting on a Latch but *also* on other fds.

> > If you don't want the capability to forward/filter the data and read
> > partial data without regard for record constraints/buffering, your patch
> > seems to be quite a good start. It misses xlogreader.h though...
>
> Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> > On 17.09.2012 11:12, Andres Freund wrote:
> > > On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> > > If you don't want the capability to forward/filter the data and read
> > > partial data without regard for record constraints/buffering your patch
> > > seems to be quite a good start. It misses xlogreader.h though...
> >
> > Ah sorry, patch with xlogreader.h attached.
>
> Will look at it in a second.

It seems we would need one additional callback for both approaches like:

->error(severity, format, ...)

For both to avoid having to draw in elog.c.

Otherwise it looks sensible, although it has a more minimal approach (which might or might not be a good thing). The one thing I definitely like is that nearly all of it is tried and true code...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
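For what it's worth, a standalone tool could satisfy such an error callback with very little code. The following is only a sketch of the idea; the callback signature, the severity constant, and how it would be registered with the reader are all assumptions, since the thread only mentions "->error(severity, format, ...)" in passing:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define XLR_FATAL 1             /* hypothetical severity value */

/* hypothetical error callback for a standalone tool such as xlogdump */
static void
xlogdump_error(int severity, const char *fmt, ...)
{
    va_list     args;

    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    fputc('\n', stderr);

    if (severity >= XLR_FATAL)
        exit(1);                /* no elog.c available, so just bail out */
}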
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 12:07, Andres Freund wrote:
> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
>> The user of the facility doesn't need to be aware of record boundaries,
>> that's the responsibility of the facility. I thought that's exactly the
>> point of generalizing this thing, to make it unnecessary for the code
>> that uses it to be aware of such things.
> With the proposed API it seems pretty much a requirement to wait inside the
> callback.

Or you can return false from the XLogPageRead() callback if the requested page is not available. That will cause ReadRecord() to return NULL, and you can retry when more WAL is available.

- Heikki
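In other words, the non-blocking pattern under this proposal would look roughly like the sketch below. The XLogReadRecord() arguments and the WaitForMoreWAL()/ConsumeRecord() helpers are stand-ins made up for illustration; only the return-NULL-and-retry behaviour comes from the description above:

/* illustrative polling loop around the proposed caller-driven API */
for (;;)
{
    XLogRecord *record;

    /* the start-position argument is assumed here for illustration */
    record = XLogReadRecord(state, startpoint);

    if (record == NULL)
    {
        /*
         * Our XLogPageRead() callback returned false because the page
         * wasn't available yet: wait (latch, select(), ...) and retry.
         */
        WaitForMoreWAL();
        continue;
    }

    ConsumeRecord(record);      /* the consumer's actual work */
}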
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 13:01, Andres Freund wrote:
> On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
>> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
>>> On 17.09.2012 11:12, Andres Freund wrote:
>>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
>>>> If you don't want the capability to forward/filter the data and read
>>>> partial data without regard for record constraints/buffering your patch
>>>> seems to be quite a good start. It misses xlogreader.h though...
>>>
>>> Ah sorry, patch with xlogreader.h attached.
>>
>> Will look at it in a second.
> It seems we would need one additional callback for both approaches like:
>
> ->error(severity, format, ...)
>
> For both to avoid having to draw in elog.c.

Yeah. Another approach would be to return the error string from ReadRecord. The caller could then do whatever it pleases with it, like ereport() it to the log or PANIC. I think I'd like that better.

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
> On 17.09.2012 13:01, Andres Freund wrote:
> > On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
> >> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> >>> On 17.09.2012 11:12, Andres Freund wrote:
> >>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> >>>> If you don't want the capability to forward/filter the data and read
> >>>> partial data without regard for record constraints/buffering your
> >>>> patch seems to be quite a good start. It misses xlogreader.h
> >>>> though...
> >>>
> >>> Ah sorry, patch with xlogreader.h attached.
> >>
> >> Will look at it in a second.
> >
> > It seems we would need one additional callback for both approaches like:
> >
> > ->error(severity, format, ...)
> >
> > For both to avoid having to draw in elog.c.
>
> Yeah. Another approach would be to return the error string from
> ReadRecord. The caller could then do whatever it pleases with it, like
> ereport() it to the log or PANIC. I think I'd like that better.

That seems a bit more complex from a memory management perspective as you probably would have to sprintf() into some buffer. We cannot rely on a backend environment with memory contexts around et al...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 14:42, Andres Freund wrote:
> On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
>> On 17.09.2012 13:01, Andres Freund wrote:
>>> On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
>>>> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
>>>>> On 17.09.2012 11:12, Andres Freund wrote:
>>>>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
>>>>>> If you don't want the capability to forward/filter the data and read
>>>>>> partial data without regard for record constraints/buffering your
>>>>>> patch seems to be quite a good start. It misses xlogreader.h
>>>>>> though...
>>>>>
>>>>> Ah sorry, patch with xlogreader.h attached.
>>>>
>>>> Will look at it in a second.
>>>
>>> It seems we would need one additional callback for both approaches like:
>>>
>>> ->error(severity, format, ...)
>>>
>>> For both to avoid having to draw in elog.c.
>>
>> Yeah. Another approach would be to return the error string from
>> ReadRecord. The caller could then do whatever it pleases with it, like
>> ereport() it to the log or PANIC. I think I'd like that better.
> That seems a bit more complex from a memory management perspective as you
> probably would have to sprintf() into some buffer. We cannot rely on a
> backend environment with memory contexts around et al...

Hmm. I was thinking that making this work in a non-backend context would be too hard, so I didn't give that much thought, but I guess there aren't many dependencies on backend functions after all. palloc/pfree are straightforward to replace with malloc/free. That's what we could easily do with the error messages too, just malloc a suitably sized buffer.

How does a non-backend program get access to xlogreader.c? Copy xlogreader.c from the source tree at build time and link into the program? Or should we turn it into a shared library?

- Heikki
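The palloc/malloc substitution mentioned here can be done with a couple of thin wrappers; the macro names below are purely illustrative, and FRONTEND is used only as an example of how the two build environments might be told apart:

/* illustrative only -- how xlogreader.c allocations could be abstracted */
#ifdef FRONTEND
#include <stdlib.h>
#define xlr_alloc(size)   malloc(size)
#define xlr_free(ptr)     free(ptr)
#else
#include "utils/palloc.h"
#define xlr_alloc(size)   palloc(size)
#define xlr_free(ptr)     pfree(ptr)
#endif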
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 01:50:33 PM Heikki Linnakangas wrote:
> On 17.09.2012 14:42, Andres Freund wrote:
> > On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
> >> On 17.09.2012 13:01, Andres Freund wrote:
> >>> On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
> >>>> On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> >>>>> On 17.09.2012 11:12, Andres Freund wrote:
> >>>>>> On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> >>>>>> If you don't want the capability to forward/filter the data and read
> >>>>>> partial data without regard for record constraints/buffering your
> >>>>>> patch seems to be quite a good start. It misses xlogreader.h
> >>>>>> though...
> >>>>>
> >>>>> Ah sorry, patch with xlogreader.h attached.
> >>>>
> >>>> Will look at it in a second.
> >>>
> >>> It seems we would need one additional callback for both approaches
> >>> like:
> >>>
> >>> ->error(severity, format, ...)
> >>>
> >>> For both to avoid having to draw in elog.c.
> >>
> >> Yeah. Another approach would be to return the error string from
> >> ReadRecord. The caller could then do whatever it pleases with it, like
> >> ereport() it to the log or PANIC. I think I'd like that better.
> >
> > That seems a bit more complex from a memory management perspective as you
> > probably would have to sprintf() into some buffer. We cannot rely on a
> > backend environment with memory contexts around et al...
>
> Hmm. I was thinking that making this work in a non-backend context would
> be too hard, so I didn't give that much thought, but I guess there aren't
> many dependencies on backend functions after all. palloc/pfree are
> straightforward to replace with malloc/free.

Hm. I thought that it was pretty much a design requirement that this is usable outside of the backend environment?

> That's what we could easily do with the error messages too, just malloc a
> suitably sized buffer.

Not very comfortable though... Especially if you need to return an error from the read_page callback...

> How does a non-backend program get access to xlogreader.c? Copy
> xlogreader.c from the source tree at build time and link into the
> program? Or should we turn it into a shared library?

Not really sure. I thought about just putting it in pgport or such, but that seemed ugly as well.

The bin/xlogdump hack, which I find really helpful, at first simply had a dependency on ../../backend/access/transam/xlogreader.o which worked fine. Till it needed more because of *_desc routines... But Alvaro started to work on this, although I don't know when he will be able to finish it.

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 12:52:32 PM Heikki Linnakangas wrote:
> On 17.09.2012 12:07, Andres Freund wrote:
> > On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> >> The user of the facility doesn't need to be aware of record boundaries,
> >> that's the responsibility of the facility. I thought that's exactly the
> >> point of generalizing this thing, to make it unnecessary for the code
> >> that uses it to be aware of such things.
> >
> > With the proposed API it seems pretty much a requirement to wait inside
> > the callback.
>
> Or you can return false from the XLogPageRead() callback if the
> requested page is not available. That will cause ReadRecord() to return
> NULL, and you can retry when more WAL is available.

That requires building quite a bit of knowledge on the outside:
* you need to transport the information that you need more input via some external variable/->private_data
* you need to transport at which RecPtr you needed more data
* you need to signal that you're not dealing with an invalid record after returning, given both conditions return NULL
* you need to buffer all incoming data somewhere if it comes from the network or similar, because at the next call XLogReadRecord will restart reading from the beginning

Sorry if I sound sceptical! If I had your patch in my hands half a year ago I would have been very happy, but after building the more generic version that can do all of the above (including a compatible XLogReaderReadOne(state)) it's a bit hard to do that. Not sure if it's just the feeling of possibly having wasted the time...

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 17.09.2012 13:01, Andres Freund wrote:
>> It seems we would need one additional callback for both approaches like:
>> ->error(severity, format, ...)
>> For both to avoid having to draw in elog.c.

> Yeah. Another approach would be to return the error string from
> ReadRecord. The caller could then do whatever it pleases with it, like
> ereport() it to the log or PANIC. I think I'd like that better.

I think it's basically insane to imagine that you can carve out a non-trivial piece of the backend that doesn't contain any elog calls. There's too much low-level infrastructure, such as palloc, that could call it. Even if you managed to make it safe at the instant the feature is committed, the odds it would stay safe over time are negligible.

Furthermore, returning enough state for useful error messages back out of multiple layers of function call is going to be notationally messy, and will end up requiring complicated infrastructure barely simpler than elog anyway.

It'd be a lot better for the wal-dumping program to supply a cut-down version of elog than to try to promise that all errors will be returned back from ReadRecord.

regards, tom lane
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Heikki Linnakangas
On 17.09.2012 17:08, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 17.09.2012 13:01, Andres Freund wrote:
>>> It seems we would need one additional callback for both approaches like:
>>> ->error(severity, format, ...)
>>> For both to avoid having to draw in elog.c.
>
>> Yeah. Another approach would be to return the error string from
>> ReadRecord. The caller could then do whatever it pleases with it, like
>> ereport() it to the log or PANIC. I think I'd like that better.
>
> I think it's basically insane to imagine that you can carve out a
> non-trivial piece of the backend that doesn't contain any elog calls.
> There's too much low-level infrastructure, such as palloc, that could
> call it. Even if you managed to make it safe at the instant the feature
> is committed, the odds it would stay safe over time are negligible.

I wasn't thinking that we'd completely eliminate all elog() calls from ReadRecord and everything it calls, but only the "expected" ones that mean we've reached the end of valid WAL. The ones that use emode_for_corrupt_record(). Any unexpected errors like running out of file descriptors would still use ereport() like usual.

That said, Andres' suggestion of making this facility completely independent of any backend functions, making it usable in external programs, doesn't actually seem that hard. ReadRecord() itself is fairly small, as are the subroutines that validate the records. XLogReadPage(), which goes out to fetch the right xlog page from archive or whatever, is way more complicated. But that would live in the callback, so it would be free to use all the normal backend facilities. However, it means that external programs would need to supply their own (hopefully much simpler) version of XLogReadPage(); I'm not sure how that goes with Andres' plans on using xlogreader.

- Heikki
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 04:08:01 PM Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > On 17.09.2012 13:01, Andres Freund wrote:
> >> It seems we would need one additional callback for both approaches like:
> >> ->error(severity, format, ...)
> >> For both to avoid having to draw in elog.c.
> >
> > Yeah. Another approach would be to return the error string from
> > ReadRecord. The caller could then do whatever it pleases with it, like
> > ereport() it to the log or PANIC. I think I'd like that better.
>
> I think it's basically insane to imagine that you can carve out a
> non-trivial piece of the backend that doesn't contain any elog calls.
> There's too much low-level infrastructure, such as palloc, that could
> call it. Even if you managed to make it safe at the instant the feature
> is committed, the odds it would stay safe over time are negligible.

If you start relying on palloc all hope is gone anyway. I "only" want a standalone XLogReader because that's just too damn annoying/hard to duplicate in standalone code. There are several very useful utilities out there that are incomplete and/or unreliable for that reason. And loads of others that haven't been written because of that.

That is one of the reasons - beside finding the respective xlog.c code very hard to read/modify/extend - why I wrote a completely standalone xlogreader. One other factor was just learning how the hell all that works ;)

I still think the interface that something as plain as the proposed XLogReadRecord() provides is too restrictive for many use-cases. I agree that a wrapper with exactly such an interface for xlog.c is useful, though.

> Furthermore, returning enough state for useful error messages back out
> of multiple layers of function call is going to be notationally messy,
> and will end up requiring complicated infrastructure barely simpler than
> elog anyway.

Hm. You mean because of file/function/location?

> It'd be a lot better for the wal-dumping program to supply a cut-down
> version of elog than to try to promise that all errors will be returned
> back from ReadRecord.

Well, I suggested a ->error() callback for that reason, that seems relatively easy to wrap.

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, September 17, 2012 04:18:28 PM Heikki Linnakangas wrote:
> On 17.09.2012 17:08, Tom Lane wrote:
> > Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> >> On 17.09.2012 13:01, Andres Freund wrote:
> >>> It seems we would need one additional callback for both approaches
> >>> like: ->error(severity, format, ...)
> >>> For both to avoid having to draw in elog.c.
> >>
> >> Yeah. Another approach would be to return the error string from
> >> ReadRecord. The caller could then do whatever it pleases with it, like
> >> ereport() it to the log or PANIC. I think I'd like that better.
> >
> > I think it's basically insane to imagine that you can carve out a
> > non-trivial piece of the backend that doesn't contain any elog calls.
> > There's too much low-level infrastructure, such as palloc, that could
> > call it. Even if you managed to make it safe at the instant the feature
> > is committed, the odds it would stay safe over time are negligible.
>
> I wasn't thinking that we'd completely eliminate all elog() calls from
> ReadRecord and everything it calls, but only the "expected" ones that
> mean we've reached the end of valid WAL. The ones that use
> emode_for_corrupt_record(). Any unexpected errors like running out of
> file descriptors would still use ereport() like usual.
>
> That said, Andres' suggestion of making this facility completely
> independent of any backend functions, making it usable in external
> programs, doesn't actually seem that hard. ReadRecord() itself is fairly
> small, as are the subroutines that validate the records. XLogReadPage(),
> which goes out to fetch the right xlog page from archive or whatever, is
> way more complicated. But that would live in the callback, so it would
> be free to use all the normal backend facilities. However, it means that
> external programs would need to supply their own (hopefully much
> simpler) version of XLogReadPage(); I'm not sure how that goes with
> Andres' plans on using xlogreader.

XLogRead() from walsender.c is pretty easy to translate to backend-independent code, so I don't think that's a problem. I don't see how the backend's version is useful outside of the startup process anyway.

We could provide a default backend-independent variant that hits files in xlogreader.c, it's not much code, to avoid others copying it multiple times. I used a variant of that in the places that read from disk without any problems. Obviously not in the places that read from network, but that's shelved due to the different decoding approach atm anyway.

Regards,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
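To make the "default variant that hits files" idea concrete, a simplified page-read callback in the shape Heikki posted earlier might look like the sketch below. It is only a sketch: it assumes private_data carries a file descriptor for the WAL segment containing the requested page, hard-codes the default 16MB segment size, and omits segment switching and error reporting entirely; none of this is code from either patch.

#include <unistd.h>

/* illustrative file-based page-read callback, matching the prototype quoted
 * earlier in this thread */
static bool
FileXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
                 char *readBuf, void *private_data)
{
    int         fd = *((int *) private_data);
    /* offset of the requested page inside its (assumed 16MB) segment */
    off_t       offset = (off_t) (RecPtr % (16 * 1024 * 1024));

    if (lseek(fd, offset, SEEK_SET) < 0)
        return false;
    if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
        return false;           /* page not (fully) available yet */
    return true;
}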
Hi all,

Attached are the .txt and .pdf (both are imo readable and contain the same content) with design documentation about the proposed feature.

Christian Kruse, Marko Tiikkaja and Hannu Krosing read the document and told me about my most egregious mistakes. Thanks!

I would appreciate some feedback!

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From: Andres Freund
This time I really attached both...

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From: "md@rpzdesign.com"
Andres, nice job on the writeup.

I think one aspect you are missing is that there must be some way for the multi-masters to re-stabilize their data sets and quantify any data loss. You cannot do this without some replication intelligence in each row of each table, so that no matter how disastrous the hardware/internet failure in the cloud, the system can HEAL itself and keep going, no human beings involved.

I am laying down a standard design pattern of columns for each row:

MKEY - Primary key guaranteed unique across ALL nodes in the CLOUD with NODE information IN THE KEY. (A876543 vs B876543 or whatever)(network link UP or DOWN)
CSTP - create time stamp based on unix time stamp
USTP - last update time stamp based on unix time stamp
UNODE - Node that updated this record

Many applications already need the above information, might as well standardize it so external replication logic processing can self heal.

Postgresql tables have optional 32 bit int OIDs; you may want to consider having a replication version of the ROID, replication object ID, and then externalize the primary key generation into a loadable UDF.

Of course, ALL the nodes must be in contact with each other, not allowing significant drift on their clocks while operating. (NTP is a starter)

I just do not know of any other way to add self healing without the above information, regardless of whether you hold up transactions for synchronous or let them pass thru asynch. Regardless of whether you are getting your replication data from the WAL stream or thru the client libraries.

Also, your replication model does not really discuss busted-link replication operations; where is the intelligence for that in the operation diagram?

Every time you package up replication into the core, someone has to tear into that pile to add some extra functionality, so definitely think about providing sensible hooks for that extra bit of customization to override the base function.

Cheers,

marco

On 9/22/2012 11:00 AM, Andres Freund wrote:
> This time I really attached both...
On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:
> (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different aspects of the bi-directional replication feature planned for PostgreSQL 9.3. For those that are unfamiliar with the BDR project, a useful summary is available in an e-mail sent to this list by Simon Riggs back in April [1]. I should also point out that Andres has made available a design document that discusses aspects of this patch in particular, in another thread [2]. That document, "High level design for logical replication in Postgres", emphasizes source data generation in particular: generalising how PostgreSQL's WAL stream is generated to represent the changes it describes logically, without pointers to physical heap tuples and so forth (data generation as a concept is described in detail in an earlier design document [3]).

This particular patch can be thought of as a response to the earlier discussion [4] surrounding how to solve the problem of keeping system catalogs consistent during WAL replay on followers: "catalog time travel" is now used, rather than maintaining a synchronized catalog at the decoding end.

Andres' git tree ("xlog-decoding-rebasing-cf2" branch) [5] provides additional useful comments in commit messages (he rebases things such that each commit represents a distinct piece of functionality/patch for review). This patch is not strictly speaking an atomic unit. It is necessary to apply all 8 patches in order to get the code to compile. However, it is approximately an atomic unit, one that represents a subdivision of the entire BDR patch that it is manageable and logical to write a discrete review for. This is idiomatic use of git-format-patch, but it is unusual enough within our community for me to feel the need to note these facts.

I briefly discussed this patch with Andres off-list. His feeling is that the review process ought to focus on the design of WAL decoding, including how it fits within the larger set of replication features proposed for 9.3. There are a number of known omissions in this patch. Andres has listed some of these above, and edge-cases and so on are noted next to XXX and FIXME comments in the patch itself. I am inclined to agree with Andres' view that we should attempt to solidify community support for this prototype patch's design, or some variant, before fixing the edge-cases and working towards committable code. I will try my best to proceed on that basis.

What follows is an initial overview of the patch (or at least my understanding of the patch, which you may need to correct), and some points of concern.

> * applycache module which reassembles transactions from a stream of
> interspersed changes

This is what the design doc [2] refers to as "4.5. TX reassembly". This functionality is concentrated in applycache.c. As [2] notes, the reassembly component "was previously coined ApplyCache because it was proposed to run on replication consumers just before applying changes. This is not the case anymore. It is still called that way in the source of the patch recently submitted."

The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID.

I meant to avoid talking about anything other than the bigger picture for now, but I must ask: Why the frequent use of malloc(), particularly within applycache.c?
The obvious answer is that it's rough code and that that will change, but that still doesn't comport with my idea about how rough Postgres code should look, so I have to wonder if there's a better reason. applycache.c has an acute paucity of comments, which makes it really hard to review well. [2] doesn't have all that much to say about it either. I'm going to not comment much on this here, except to say that I think that the file should be renamed to reassembly.c or something like that, to reflect its well-specified purpose, and not how it might be used. Any cache really belongs in src/backend/utils/cache/ anyway. Applycache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit. By the way, I see that you're doing this here: + /* most callers don't need snapshot.h */ + typedef struct SnapshotData *Snapshot; Tom, Alvaro and I had a discussion about whether or not this was an acceptable way to reduce build dependencies back in July [8] – I lost that one. You're relying on a Gnu extension/C11 feature here (that is, typedef redefinition). If you find it intensely irritating that you cannot do this while following the standard to the letter, you are not alone. > * snapbuilder which builds catalog snapshots so that tuples from wal can be understood > This component analyses xlog and builds a special kind of Snapshot. This has been compared to the KnownAssignedXids machinery for Hot Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is meant by this). Since decoding only has to occur within a single backend, I guess it's sufficient that it's all within local memory (in contrast to the KnownAssignedXids array, which is in shared memory). The design document [2] really just explains the problem (which is the need for catalog metadata at a point in time to make sense of heap tuples), without describing the solution that this patch offers with any degree of detail. Rather, [2] says "How we build snapshots is somewhat intricate and complicated and seems to be out of scope for this document", which is unsatisfactory. I look forward to reading the promised document that describes this mechanism in more detail. It's really hard to guess what you might have meant to do, and why you might have done it, much less verifying the codes correctness. This functionality is concentrated in snapbuild.c. A comment in decode.c notes: + * Its possible that the separation between decode.c and snapbuild.c is a + * bit too strict, in the end they just about have the same struct. I prefer the current separation. I think it's reasonable that decode.c is sort of minimal glue code. [2] talks about this under "4.6. Snapshot building". > * wal decoding into an applycache > This functionality is concentrated in decode.c (not applycache.c – decoding just call those functions). Decoding means breaking up individual XLogRecord structs, and storing them in an applycache (applycache does this, and stores them as ApplyCacheChange records), while building a snapshot (which is needed in advance of adding tuples from records). It can be thought of as the small piece of glue between applycache and snapbuild that is called by XLogReader (DecodeRecordIntoApplyCache() is the only public function, which will be called by many xlogreader_state.finished_record-hooked plugin functions in practice, including this example one). An example of what belongs in decode.c is the way it ignores physical XLogRecords, because they are not of interest. 
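(Returning briefly to the typedef redefinition noted above, here is a small sketch of the strictly conforming alternative. It assumes nothing beyond the typedef that snapshot.h already provides; the prototype at the end is hypothetical and only there to show the usage.)

/*
 * snapshot.h already provides:
 *     typedef struct SnapshotData *Snapshot;
 * Repeating that typedef in a second header, to avoid the include, is a
 * typedef redefinition -- accepted by GCC and by C11, but not by C89/C99.
 * A conforming alternative is a plain forward declaration plus a
 * spelled-out pointer type (the function below is hypothetical):
 */
struct SnapshotData;                    /* forward declaration, no typedef */

extern void SomeDecodingHelper(struct SnapshotData *snapshot);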
By the way, why not just use struct assignment here?: + memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode)); > * decode_xlog(lsn, lsn) debugging function > You consider this to be a throw-away function that won't ever be committed. However, I strongly feel that you should move it into /contrib, so that it can serve as a sort of reference implementation for authors of decoder client code, in the same spirit as numerous existing contrib modules (think contrib/spi). I think that such a module could even be useful to people that were just a bit intellectually curious about how WAL works, which is something I'd like to encourage. Besides, simply having this code in a module will more explicitly demarcate client code (just look at logicalfuncs.c – it is technically client code, but that's too subtle right now). I don't like this code in decode_xlog(): + apply_state = (ReaderApplyState*)xlogreader_state->private_data; Why is it necessary to cast here? In other words, why is private_data a void pointer at all? Are we really well-served by presuming absolutely nothing about XlogReader's state? Wouldn't an “abstract base class” pointer be more appropriate a type for private_data? I don't think it's virtuous to remove type-safety any more than is strictly necessary. Note that I'm not asserting that you shouldn't do this – I'm merely asking the question. When developing a user-facing API, it is particularly crucial to make interfaces easy to use correctly and hard to use incorrectly. > The applycache provides 3 major callbacks: > * apply_begin > * apply_change > * apply_commit These are callbacks intended to be used by third-party modules, perhaps including a full multi-master replication implementation (though this patch isn't directly concerned with that), or even a speculative future version of a logical replication system like Slony. [2] refers to these under "4.7. Output Plugin". These are the typedef for the hook types involved (incidentally, don't like the name of these types – I think you should lose the CB): +/* XXX: were currently passing the originating subtxn. Not sure thats necessary */ +typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change); +typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn); +typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn); So we register these callbacks like this in the patch: + /* + * allocate an ApplyCache that will apply data using lowlevel calls + * without type conversion et al. This requires binary compatibility + * between both systems. + * XXX: This would be the place too hook different apply methods, like + * producing sql and applying it. + */ + apply_cache = ApplyCacheAllocate(); + apply_cache->begin = decode_begin_txn; + apply_cache->apply_change = decode_change; + apply_cache->commit = decode_commit_txn; The decode_xlog(lsn, lsn) debugging function that Andres has played with [6] (that this patch makes available, for now) is where this code comes from. Whenever ApplyCache calls an "apply_change" callback for a single change (that is, a INSERT|UPDATE|DELETE) it locally overrides the normal SnapshotNow semantics used for catalog access with a previously built snapshot. Behaviour should now be consistent with a normal snapshotNow acquired when the tuple change was originally written to WAL. Having commented specifically on modules that Andres highlighted, I'd like to highlight one myself: tqual.c . 
This module has had significant new functionality added, so it would be an omission not to remark on it in this opening e-mail, having mentioned all other modules with significant new pieces of functionality. The file has had new utility functions added that pertain to snapshot visibility during decoding - "time travel". I draw attention to this. This code is located within the new function HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is done for "dirty snapshots" (dirty snapshots are used with ri_triggers.c, for example, when even uncommitted tuples should be visible). Both of these functions are generally accessed through function pointers. Anyway, here's the code:

+ /*
+  * FIXME: The not yet existing decoding infrastructure will need to force
+  * the xmin to stay lower than what they are currently decoding.
+  */
+ bool fixme_xmin_horizon = false;

I'm sort of wondering what this is going to look like in the finished patch. This FIXME is rather hard for me to take at face value. It seems to me that the need to coordinate decoding with the xmin horizon itself represents a not insignificant engineering challenge. So, with all due respect, it would be nice if I wasn't asked to make that leap of faith. The xmin horizon prepared transaction hack needs to go.

Within tqual.h, shouldn't you have something like this, but for time travel snapshots during decoding?:

#define InitDirtySnapshot(snapshotdata) \
    ((snapshotdata).satisfies = HeapTupleSatisfiesDirty)

Alternatively, adding a variable to these two might be appropriate:

static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC};
static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC};

In any case, assigning this hook in snapbuild.c looks like a modularity violation to me. See also my observations on initialising ReaderApplyState below.

My general feeling is that the code is very under-commented, and in need of a polish, though I'm sure that you are perfectly well aware of that.

The basic way all of these components that I have described separately fit together is (if others want to follow this, refer to decode_xlog()):

1. Start with some client code "output plugin" (for now, a throw-away debugging function "decode_xlog()").

2. Client allocates an XLogReaderState. (This module is a black box to me, though it's well encapsulated so that shouldn't matter much. Heikki is reviewing this [7]. Like I said, this isn't quite an atomic unit I'm reviewing.)

3. Plugin registers various callbacks (within logicalfuncs.c). These callbacks, while appearing in this patch, are mostly NO-OPS, and are somewhat specific to XLogReader's concerns. I mostly defer to Heikki here.

4. Plugin allocates an "ApplyCache". Plugin assigns some more callbacks to "ApplyCache". This time, they're the aforementioned 3 apply cache functions.

5. Plugin assigns this new ApplyCache to a variable within the private state of the XLogReader (this private state is a subset of its main state, and is opaque to XLogReader).

6. Finally, plugin calls XLogReader(main_state).

7. At some point during its magic, XLogReader calls the hook registered in step 3, finished_record. This is all it does directly with the plugin, which it makes minimal assumptions about.

8. finished_record (which is logically a part of the "plugin") knows what type the opaque private_data actually is. It casts it to an apply_state, and calls the decoder (supplying the apply_state as an argument to DecodeRecordIntoApplyCache()).
9. During the first call (within the first record within a call to decode_xlog()), we allocate a snapshot reader.

10. Builds snapshot callback. This scribbles on our snapshot state, which essentially encapsulates a snapshot. The state (and snapshot) changes continually, once per call.

11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an XLogRecord. Decodes based on record type. Let's assume it's an XLOG_HEAP_INSERT.

12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We store the tuple metadata in our ApplyCache (some ilists, somewhere, each corresponding to an XID). We don't store the relation oid, because we don't know it yet (only the relfilenode is known from WAL).

13. We're back in XLogReader(). It calls the only callback of interest to us covered in step 3 (and not of interest to XLogReader()/Heikki) – decode_change(). It does this through the apply_cache.apply_change hook. This happens because we encounter another record, this time a commit record (in the same codepath as discussed in step 12).

14. In decode_change(), the actual function that raises the interesting WARNINGs within Andres' earlier example [6], showing actual integer/varchar Datum values for tuples previously inserted. Resolve the table oid based on relfilenode (albeit unsatisfactorily). Using a StringInfo, tupledescs, syscache and typcache, build the WARNING string.

So my general opinion of how all this fits together is that it isn't quite right. Problems include:

* Why does the ReaderApplyState get magically initialised in two stages? apply_cache is initialised in decode_xlog (or whatever plugin). Snapstate is allocated within DecodeRecordIntoApplyCache() (with a dependency on apply_cache). Shouldn't this all be happening within a single function? As you yourself have pointed out, not everyone needs to know about these snapshots.

* Maybe I've missed something, but I think you need a more satisfactory example plugin. What happens in step 14 is plainly unacceptable. You haven't adequately communicated to me how this is going to be used in logical replication. Maybe I just haven't got that far yet. I'm not impressed by the InvalidateSystemCaches() calls here and elsewhere.

* Please break out the client code as a contrib module. That separation would increase the readability of the patch.

That's all I have for now...

[1] http://archives.postgresql.org/message-id/CA+U5nMLk-Wt806zab7SJ2x5X4pqC3WE-hFctONakTqSAgbqTYQ@mail.gmail.com
[2] http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
[3] http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com
[4] http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com
[5] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2
[6] http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com
[7] http://archives.postgresql.org/message-id/5056DFAB.3050707@vmware.com
[8] http://archives.postgresql.org/message-id/CAEYLb_Uvbi9mns-uJWUW4QtHqnC27SEyyNmj1HKFY=5X5wwdgg@mail.gmail.com
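To make steps 1-8 above concrete, here is a compressed sketch of how the pieces are wired together. It reuses the names the patch and this review mention (ApplyCacheAllocate, decode_begin_txn, decode_change, decode_commit_txn, ReaderApplyState, private_data, finished_record, DecodeRecordIntoApplyCache); the reader allocation call, the field names and the exact argument lists are assumptions for illustration, not the patch's real API.

/*
 * Illustrative sketch only -- signatures and field names are assumed.
 */
static void
example_finished_record(XLogReaderState *state, XLogRecordBuffer *buf)
{
    /* step 8: only the plugin knows what private_data really is */
    ReaderApplyState *apply_state = (ReaderApplyState *) state->private_data;

    DecodeRecordIntoApplyCache(apply_state, buf);
}

static void
example_decode_range(XLogRecPtr startpoint, XLogRecPtr endpoint)
{
    XLogReaderState  *reader = XLogReaderAllocate();          /* allocator name assumed */
    ReaderApplyState *apply_state = palloc0(sizeof(ReaderApplyState));
    ApplyCache       *apply_cache = ApplyCacheAllocate();

    /* step 4: register the three output-plugin callbacks */
    apply_cache->begin        = decode_begin_txn;
    apply_cache->apply_change = decode_change;
    apply_cache->commit       = decode_commit_txn;

    /* steps 3 and 5: hook the reader and hand it our opaque state */
    reader->finished_record  = example_finished_record;
    apply_state->apply_cache = apply_cache;                   /* field name assumed */
    reader->private_data     = apply_state;

    /* step 6: let the reader run over the requested range */
    XLogReader(reader, startpoint, endpoint);                 /* argument list assumed */
}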
On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote: > On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote: > > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch) > > This patch is the 8th of 8 in a patch series that covers different > aspects of the bi-directional replication feature planned for > PostgreSQL 9.3. For those that are unfamiliar with the BDR projection, > a useful summary is available in an e-mail sent to this list by Simon > Riggs back in April [1]. I should also point out that Andres has made Does this design allow replication/communcation between clusters running different major versions of Postgres? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> Does this design allow replication/communcation between clusters running > different major versions of Postgres? Just catching up on your email, hmmm? Yes, that's part of the design 2Q presented. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Bruce Momjian <bruce@momjian.us> schrieb:

>On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:
>> On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:
>> > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)
>>
>> This patch is the 8th of 8 in a patch series that covers different
>> aspects of the bi-directional replication feature planned for
>> PostgreSQL 9.3. For those that are unfamiliar with the BDR projection,
>> a useful summary is available in an e-mail sent to this list by Simon
>> Riggs back in April [1]. I should also point out that Andres has made
>
>Does this design allow replication/communcation between clusters running
>different major versions of Postgres?

This patchset contains only the decoding/changeset generation part of logical replication. It provides (as in the debugging example) the capabilities to generate a correct textual format and thus can be used to build a solution with support for cross version/arch replication, as long as the text format of the used types is compatible.

Does that answer the question?

Andres

---
Please excuse the brevity and formatting - I am writing this on my mobile phone.
On Thu, Oct 11, 2012 at 01:34:58AM +0200, anarazel@anarazel.de wrote:
>
> Bruce Momjian <bruce@momjian.us> schrieb:
>
> >On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:
> >> On 15 September 2012 01:39, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)
> >>
> >> This patch is the 8th of 8 in a patch series that covers different
> >> aspects of the bi-directional replication feature planned for
> >> PostgreSQL 9.3. For those that are unfamiliar with the BDR projection,
> >> a useful summary is available in an e-mail sent to this list by Simon
> >> Riggs back in April [1]. I should also point out that Andres has made
> >
> >Does this design allow replication/communcation between clusters running
> >different major versions of Postgres?
>
> This patchset contains only the decoding/changeset generation part of logical replication. It provides (as in the debugging example) the capabilities to generate a correct textual format and thus can be used to build a solution with support for cross version/arch replication, as long as the text format of the used types is compatible.
>
> Does that answer the question?

Yes. This was posted so long ago I couldn't remember if that was part of the design. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > The purpose of ApplyCache/transaction reassembly is to reassemble > interlaced records, and organise them by XID, so that the consumer > client code sees only streams (well, lists) of records split by XID. I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. With streaming replication as it exists today, a transaction that modifies a huge amount of data (such as a bulk load) can be applied on the standby as it happens. The rows thus inserted will become visible only if and when the transaction commits on the master and the commit record is replayed on the standby. This has a number of important advantages, perhaps most importantly that the lag between commit and data visibility remains short. With the proposed system, we can't start applying the changes until the transaction has committed and the commit record has been replayed, so a big transaction is going to have a lot of apply latency. Now, I am not 100% opposed to a design that surrenders this property in exchange for other important benefits, but I think it would be worth thinking about whether there is any way that we can design this that either avoids giving that property up at all, or gives it up for the time being but allows us to potentially get back to it in a later version. Reassembling complete transactions is surely cool and some clients will want that, but being able to apply replicated transactions *without* reassembling them in their entirety is even cooler, and some clients will want that, too. If we're going to stick with a design that reassembles transactions, I think there are a number of issues that deserve careful thought. First, memory usage. I don't think it's acceptable for the decoding process to assume that it can allocate enough backend-private memory to store all of the in-flight changes (either as WAL or in some more decoded form). We have assiduously avoided such assumptions thus far; you can write a terabyte of data in one transaction with just a gigabyte of shared buffers if you so desire (and if you're patient). Here's you making the same point in different words: > Applycache is presumably where you're going to want to spill > transaction streams to disk, eventually. That seems like a > prerequisite to commit. Second, crash recovery. I think whatever we put in place here has to be able to survive a crash on any node. Decoding must be able to restart successfully after a system crash, and it has to be able to apply exactly the set of transactions that were committed but not applied prior to the crash. Maybe an appropriate mechanism for this already exists or has been discussed, but I haven't seen it go by; sorry if I have missed the boat. > You consider this to be a throw-away function that won't ever be > committed. However, I strongly feel that you should move it into > /contrib, so that it can serve as a sort of reference implementation > for authors of decoder client code, in the same spirit as numerous > existing contrib modules (think contrib/spi). Without prejudice to the rest of this review which looks quite well-considered, I'd like to add a particular +1 to this point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
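As a purely illustrative sketch of the kind of spilling being asked for here (self-contained C with invented names; nothing below comes from the patch): once a transaction's reassembled changes exceed some memory budget, further changes are serialized to a temporary file instead of being kept in backend-private memory.

#include <stdio.h>

/* Invented, minimal stand-in for a reassembled transaction. */
typedef struct SpillingTxn
{
    size_t  mem_bytes;      /* bytes of changes currently kept in memory */
    FILE   *spill_file;     /* temp file used once the budget is exceeded */
} SpillingTxn;

#define TXN_MEM_BUDGET (64 * 1024)      /* hypothetical per-transaction budget */

static void
txn_add_change(SpillingTxn *txn, const void *change, size_t len)
{
    if (txn->mem_bytes + len <= TXN_MEM_BUDGET)
    {
        /* cheap case: keep the change in an in-memory list (omitted here) */
        txn->mem_bytes += len;
        return;
    }

    /* over budget: serialize this and all further changes to a temp file */
    if (txn->spill_file == NULL)
        txn->spill_file = tmpfile();

    fwrite(&len, sizeof(len), 1, txn->spill_file);   /* length prefix */
    fwrite(change, 1, len, txn->spill_file);         /* change payload */
}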
On Wed, Oct 10, 2012 at 09:10:48PM -0400, Robert Haas wrote: > > You consider this to be a throw-away function that won't ever be > > committed. However, I strongly feel that you should move it into > > /contrib, so that it can serve as a sort of reference implementation > > for authors of decoder client code, in the same spirit as numerous > > existing contrib modules (think contrib/spi). > > Without prejudice to the rest of this review which looks quite > well-considered, I'd like to add a particular +1 to this point. The review was _HUGE_! :-O -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> The purpose of ApplyCache/transaction reassembly is to reassemble >> interlaced records, and organise them by XID, so that the consumer >> client code sees only streams (well, lists) of records split by XID. > I think I've mentioned it before, but in the interest of not being > seen to critique the bikeshed only after it's been painted: this > design gives up something very important that exists in our current > built-in replication solution, namely pipelining. Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? regards, tom lane
On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think I've mentioned it before, but in the interest of not being >> seen to critique the bikeshed only after it's been painted: this >> design gives up something very important that exists in our current >> built-in replication solution, namely pipelining. > > Isn't there an even more serious problem, namely that this assumes > *all* transactions are serializable? What happens when they aren't? > Or even just that the effective commit order is not XID order? Firstly, I haven't read the code but I'm confident it doesn't make the elementary error of assuming commit order == xid order. I assume it's applying the reassembled transactions in commit order. I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that commited earlier but it doesn't matter when trying to reproduce the effects using constants. The data written by this transaction is either written or not when the commit happens and it's all written or not at that time. Even in non-serializable mode updates take row locks and nobody can see the data or modify it until the transaction commits. I have to say I was curious about Robert's point as well when I read Peter's review. Especially because this is exactly how other logical replication systems I've seen work too and I've always wondered about it in those systems. Both MySQL and Oracle reassemble transactions and don't write anything until they have the whole transaction reassembled. To me this always struck me as a bizarre and obviously bad thing to do though. It seems to me it would be better to create sessions (or autonomous transactions) for each transaction seen in the stream and issue the DML as it shows up, committing and cleaning each up when the commit or abort (or shutdown or startup) record comes along. I imagine the reason lies with dealing with locking and ensuring that you get the correct results without deadlocks when multiple transactions try to update the same record. But it seems to me that the original locks the source database took should protect you against any problems. As long as you can suspend a transaction when it takes a lock that blocks and keep processing WAL for other transactions (or an abort for that transaction if that happened due to a deadlock or user interruption) you should be fine. -- greg
Hi, First of: Awesome review. On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: > What follows is an initial overview of the patch (or at least my > understanding of the patch, which you may need to correct), and some > points of concern. > > > * applycache module which reassembles transactions from a stream of > > interspersed changes > > This is what the design doc [2] refers to as "4.5. TX reassembly". > > This functionality is concentrated in applycache.c. As [2] notes, the > reassembly component "was previously coined ApplyCache because it was > proposed to run on replication consumers just before applying changes. > This is not the case anymore. It is still called that way in the > source of the patch recently submitted." > > The purpose of ApplyCache/transaction reassembly is to reassemble > interlaced records, and organise them by XID, so that the consumer > client code sees only streams (well, lists) of records split by XID. > I meant to avoid talking about anything other than the bigger picture > for now, but I must ask: Why the frequent use of malloc(), > particularly within applycache.c? The obvious answer is that it's > rough code and that that will change, but that still doesn't comport > with my idea about how rough Postgres code should look, so I have to > wonder if there's a better reason. Several reasons, not sure how good: - part of the code (was) supposed to be runnable on a target system without a full postgres backend arround - All of the allocations would basically have to be in TopMemoryContext or something equally longlived as we don't have a transaction context or anything like that - the first revision showed that allocating memory was the primary bottleneck so I added a small allocation cache to the critical pieces which solved that. After that there seems no point in using mctx''s for that kind of memory anymore. > applycache.c has an acute paucity of comments, which makes it really > hard to review well. Working on that. I don't think its internals are really all that interesting atm. > I'm going to not comment much on this here, except to say that > I think that the file should be renamed to reassembly.c or something > like that, to reflect its well-specified purpose, and not how it might > be used. Agreed. > Applycache is presumably where you're going to want to spill > transaction streams to disk, eventually. That seems like a > prerequisite to commit. Yes. > By the way, I see that you're doing this here: > > + /* most callers don't need snapshot.h */ > + typedef struct SnapshotData *Snapshot; > > Tom, Alvaro and I had a discussion about whether or not this was an > acceptable way to reduce build dependencies back in July [8] – I lost > that one. You're relying on a Gnu extension/C11 feature here (that is, > typedef redefinition). If you find it intensely irritating that you > cannot do this while following the standard to the letter, you are not > alone. Yuck :(. So I will just use struct SnapshotData directly instead of a typedef... > > * snapbuilder which builds catalog snapshots so that tuples from wal can > > be understood > > This component analyses xlog and builds a special kind of Snapshot. > This has been compared to the KnownAssignedXids machinery for Hot > Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is > meant by this). Since decoding only has to occur within a single > backend, I guess it's sufficient that it's all within local memory (in > contrast to the KnownAssignedXids array, which is in shared memory). 
I haven't found a convincing argument to share that information itself. If we want to parallelise this I think sharing the snapshots should be sufficient. Given the snapshot only covers system catalogs the changerate should be fine. > The design document [2] really just explains the problem (which is the > need for catalog metadata at a point in time to make sense of heap > tuples), without describing the solution that this patch offers with > any degree of detail. Rather, [2] says "How we build snapshots is > somewhat intricate and complicated and seems to be out of scope for > this document", which is unsatisfactory. I look forward to reading the > promised document that describes this mechanism in more detail. It's > really hard to guess what you might have meant to do, and why you > might have done it, much less verifying the codes correctness. Will concentrate on finishing that document. > This functionality is concentrated in snapbuild.c. A comment in decode.c > notes: > > + * Its possible that the separation between decode.c and snapbuild.c is > a + * bit too strict, in the end they just about have the same struct. > > I prefer the current separation. I think it's reasonable that decode.c > is sort of minimal glue code. Good. > Decoding means breaking up individual XLogRecord structs, and storing > them in an applycache (applycache does this, and stores them as > ApplyCacheChange records), while building a snapshot (which is needed > in advance of adding tuples from records). It can be thought of as the > small piece of glue between applycache and snapbuild that is called by > XLogReader (DecodeRecordIntoApplyCache() is the only public function, > which will be called by many xlogreader_state.finished_record-hooked > plugin functions in practice, including this example one). An example > of what belongs in decode.c is the way it ignores physical > XLogRecords, because they are not of interest. > > By the way, why not just use struct assignment here?: > > + memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode)); At the time I initially wrote the code that seemed to be the project standard. I think this only recently changed. > > * decode_xlog(lsn, lsn) debugging function > > You consider this to be a throw-away function that won't ever be > committed. Only in its current hacked-together form. > However, I strongly feel that you should move it into > /contrib, so that it can serve as a sort of reference implementation > for authors of decoder client code, in the same spirit as numerous > existing contrib modules (think contrib/spi). I think that such a > module could even be useful to people that were just a bit > intellectually curious about how WAL works, which is something I'd > like to encourage. Besides, simply having this code in a module will > more explicitly demarcate client code (just look at logicalfuncs.c – > it is technically client code, but that's too subtle right now). I definitely think we need something like this, but it might look a bit different. For reasons you found later (xmin horizon) I don't think we can run it in the context of a normal backend for one. > I don't like this code in decode_xlog(): > > + apply_state = (ReaderApplyState*)xlogreader_state->private_data; > > Why is it necessary to cast here? In other words, why is private_data > a void pointer at all? The reason is that for a long time I hoped to keep ApplyCache's generic enough to be usable outside of this exact scenario. 
I am not sure if thats a worthwile goal anymore, especially as there are some layering violations now. > > The applycache provides 3 major callbacks: > > * apply_begin > > * apply_change > > * apply_commit > > These are callbacks intended to be used by third-party modules, > perhaps including a full multi-master replication implementation > (though this patch isn't directly concerned with that), or even a > speculative future version of a logical replication system like Slony. > [2] refers to these under "4.7. Output Plugin". These are the typedef > for the hook types involved (incidentally, don't like the name of > these types – I think you should lose the CB): Not particularly attached to the CB, so I can loose it. > +/* XXX: were currently passing the originating subtxn. Not sure thats > necessary */ > +typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, > ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change); > +typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn); > +typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn); > > So we register these callbacks like this in the patch: > > + /* > + * allocate an ApplyCache that will apply data using lowlevel calls > + * without type conversion et al. This requires binary compatibility > + * between both systems. > + * XXX: This would be the place too hook different apply methods, like > + * producing sql and applying it. > + */ > + apply_cache = ApplyCacheAllocate(); > + apply_cache->begin = decode_begin_txn; > + apply_cache->apply_change = decode_change; > + apply_cache->commit = decode_commit_txn; > > The decode_xlog(lsn, lsn) debugging function that Andres has played > with [6] (that this patch makes available, for now) is where this code > comes from. > > Whenever ApplyCache calls an "apply_change" callback for a single > change (that is, a INSERT|UPDATE|DELETE) it locally overrides the > normal SnapshotNow semantics used for catalog access with a previously > built snapshot. Behaviour should now be consistent with a normal > snapshotNow acquired when the tuple change was originally written to > WAL. I additionally want pass a mvcc-ish Snapshot (should just need a a change in function signatures) so the output plugins can query the catalog manually. > Having commented specifically on modules that Andres highlighted, I'd > like to highlight one myself: tqual.c . This module has had > significant new functionalty added, so it would be an omission to not > pass remark on it in this opening e-mail, having mentioned all other > modules with significant new pieces of functionality. The file has had > new utility functions added, that pertain to snapshot visibility > during decoding - "time travel". For me that mentally was part of catalog timetravel, thats why I didn't highlight it separately. But yes, this needs to be discussed. > I draw attention to this. This code is located within the new function > HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is > done for "dirty snapshots" (dirty snapshots are used with > ri_triggers.c, for example, when even uncommitted tuple should be > visible). Not sure where you see the similarity with dirty snapshots? > Both of these functions are generally accessed through > function pointers. Anyway, here's the code: > > + /* > + * FIXME: The not yet existing decoding infrastructure will need to > force + * the xmin to stay lower than what they are currently decoding. 
> + */ > + bool fixme_xmin_horizon = false; > > I'm sort of wondering what this is going to look like in the finished > patch. This FIXME is rather hard for me to take at face value. It > seems to me that the need to coordinate decoding with xmin horizon > itself represents a not insignificant engineering challenge. So, with > all due respect, it would be nice if I wasn't asked to make that leap > of faith. The xmin horizon prepared transaction hack needs to go. The idea is to do the decoding inside (or in something very similar) like walsenders. Those already have the capability to keep the xmin horizon nailed to some point for the hot_standby_feedback feature. There is a need for something similar like what prepared statements do for restartability after a restart though. > Within tqual.h, shouldn't you have something like this, but for time > travel snapshots during decoding?: > > #define InitDirtySnapshot(snapshotdata) \ > ((snapshotdata).satisfies = HeapTupleSatisfiesDirty) Hm. Ok. > Alternatively, adding a variable to these two might be appropriate: > > static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC}; > static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC}; Not sure where youre going from this? > My general feeling is that the code is very under-commented, and in > need of a polish, though I'm sure that you are perfectly well aware of > that. Totally aggreed. > The basic way all of these components that I have described > separately fit together is: (if others want to follow this, refer to > decode_xlog()) > > 1. Start with some client code “output plugin” (for now, a throw-away > debugging function “decode_xlog()”) The general idea is to integrate this into the walsender framework in a command very roughly looking like: START_LOGICAL_REPLICATION $plugin LSN > \ / > 2. Client allocates an XLogReaderState. (This module is a black box to > me, though it's well encapsulated so that shouldn't matter much. > Heikki is reviewing this [7]. Like I said, this isn't quite an atomic > unit I'm reviewing.) > > \ / > 3. Plugin registers various callbacks (within logicalfuncs.c). These > callbacks, while appearing in this patch, are mostly NO-OPS, and are > somewhat specific to XLogReader's concerns. I mostly defer to Heikki > here. > > \ / > 4. Plugin allocates an “ApplyCache”. Plugin assigns some more > callbacks to “ApplyCache”. This time, they're the aforementioned 3 > apply cache functions. > > \ / > 5. Plugin assigns this new ApplyCache to variable within private state > of the XLogReader (this private state is a subset of its main state, > and is opaque to XLogReader). > > \ / > 6. Finally, plugin calls XLogReader(main_state). > > \ / > 7. At some point during its magic, > XLogReader calls the hook registered in step 3, finished_record. > This is all it does directly > with the plugin, which it makes minimal assumptions about. > > \ / > 8. finished_record (which is > logically a part of the “plugin”) knows what type the opaque > private_data > actually is. It casts it > to an apply_state, and calls the decoder (supplying the apply_state as > an argument to > DecodeRecordIntoApplyCache()). > > \ / > 9. During the first call > (within the first record within a call to decode_xlog()), we allocate > a snapshot reader. > > \ / > 10. Builds snapshot callback. > This scribbles on our snapshot state, which essentially encapsulates a > snapshot. > The state (and > snapshot) changes continually, once per call. 
It changes only if there were catalog changes in the transaction and/or we haven't yet build an initial snapshot. > \ / > 11. Looks at XLogRecordBuffer > (an XLogReader struct). Looks at an XLogRecord. Decodes based on > record type. > Let's assume it's an > XLOG_HEAP_INSERT. > > \ / > 12. DecodeInsert() called. > This in turn calls DecodeXLogTuple(). We store the tuple metadata in > our > ApplyCache. (some > ilists, somewhere, each corresponding to an XID). We don't store the > relation oid, because we don't > know it yet (only > relfilenode is known from WAL). > / > / > \ / > 13. We're back in XLogReader(). It > calls the only callback of interest to > us covered in step 3 (and > not of interest to XlogReader()/Heikki) – decode_change(). It does > this through the > apply_cache.apply_change > hook. This happens because we encounter another record, this time a > commit record (in the same > codepath as discussed in step 12). > > \ / > 14. In decode_change(), the actual > function that raises the interesting WARNINGs within Andres' > earlier example [6], showing > actual integer/varchar Datum value for tuples previously inserted. > Resolve table oid based on > relfilenode (albeit unsatisfactorily). > Using a StringInfo, > tupledescs, syscache and typcache, build WARNING string. > > So my general opinion of how all this fits together is that it isn't > quite right. Problems include: I think its more understandable if you think that normally the initialization code shouldn't be repeated in the plugins but the plugins are just called from a walsender. > * Why does the ReaderApplyState get magically initialised in two > stages? apply_cache is initialised in decode_xlog (or whatever > plugin). Snapstate is allocated within DecodeRecordIntoApplyCache() > (with a dependency on apply_cache). Shouldn't this all be happening > within a single function? As you yourself have point out, not everyone > needs to know about these snapshots. The reason is/was that I didn't want the outside know anything about the Snapshot building, that seems to be only relevant to decode.c (and somewhat applycache.c). I can look at it though. > * Maybe I've missed something, but I think you need a more > satisfactory example plugin. What happens in step 14 is plainly > unacceptable. You haven't adequately communicated to me how this is > going to be used in logical replication. Maybe I just haven't got that > far yet. Well, what would you like to happen instead? Does it make more sense with the concept that the output plugin will eventually run inside walsender(-ish) and stream changes via COPY? I just couldn't finish a POC for that in time. > I'm not impressed by the InvalidateSystemCaches() calls here > and elsewhere. Yes, those definitely got to go and we have nearly all the infrastructure for it from Hot Standby. It was noted as a deficiency in the overview mail & the commit message though ;). I have that locally. The rough idea here is to reuse xl_xact_commit->nmsgs to handle the cache behaviour. There are some more complications but let me first write the timetravel document. > * Please break-out the client code as a contrib module. That > separation would increase the readability of the patch. Well, as I said above the debugging function in its current form is not going to walk, as soon we have aggreed on how to integrate this into walsender I am definitely going to provide a example/debugging plugin which I definitely intend to submit. Thanks! 
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
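For reference, the tqual.h addition Peter suggested and Andres accepted in the exchange above would amount to something like this minimal sketch; the macro name is invented, while HeapTupleSatisfiesMVCCDuringDecoding is the function the patch adds.

/* Sketch only; mirrors InitDirtySnapshot() in tqual.h. */
#define InitDecodingSnapshot(snapshotdata) \
    ((snapshotdata).satisfies = HeapTupleSatisfiesMVCCDuringDecoding)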
On Thu, Oct 11, 2012 at 03:16:39AM +0100, Greg Stark wrote: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think I've mentioned it before, but in the interest of not being > >> seen to critique the bikeshed only after it's been painted: this > >> design gives up something very important that exists in our current > >> built-in replication solution, namely pipelining. > > > > Isn't there an even more serious problem, namely that this assumes > > *all* transactions are serializable? What happens when they aren't? > > Or even just that the effective commit order is not XID order? > > Firstly, I haven't read the code but I'm confident it doesn't make the > elementary error of assuming commit order == xid order. I assume it's > applying the reassembled transactions in commit order. > > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. The data written by > this transaction is either written or not when the commit happens and > it's all written or not at that time. Even in non-serializable mode > updates take row locks and nobody can see the data or modify it until > the transaction commits. How does Slony write its changes without causing serialization replay conflicts? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Stark <stark@mit.edu> writes: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Isn't there an even more serious problem, namely that this assumes >> *all* transactions are serializable? What happens when they aren't? >> Or even just that the effective commit order is not XID order? > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. I would believe that argument if the "apply" operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in "source" terms. Even something as simple as a few column renames could break it, let alone anything as esoteric as changing the meaning of datatype literals. regards, tom lane
On Thursday, October 11, 2012 03:10:48 AM Robert Haas wrote: > On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > > The purpose of ApplyCache/transaction reassembly is to reassemble > > interlaced records, and organise them by XID, so that the consumer > > client code sees only streams (well, lists) of records split by XID. > > I think I've mentioned it before, but in the interest of not being > seen to critique the bikeshed only after it's been painted: this > design gives up something very important that exists in our current > built-in replication solution, namely pipelining. With streaming > replication as it exists today, a transaction that modifies a huge > amount of data (such as a bulk load) can be applied on the standby as > it happens. The rows thus inserted will become visible only if and > when the transaction commits on the master and the commit record is > replayed on the standby. This has a number of important advantages, > perhaps most importantly that the lag between commit and data > visibility remains short. With the proposed system, we can't start > applying the changes until the transaction has committed and the > commit record has been replayed, so a big transaction is going to have > a lot of apply latency. I don't think there is a fundamental problem here, just an incremental ones. The major problems are: * transactions with DDL & DML currently need to be reassembled, it might be possible to resolve this though, haven't thought about it too much * subtransaction are only assigned to toplevel transactions at commit time * you need a variable amount of backends/parallel transactions open at the target system to apply all the transactions concurrently. You can't smash them together because one of them might rollback. All of those seem solveable to me, so I am not too worried about addition of a streaming mode somewhere down the line. I don't want to focus on it right now though. Ok? > Here's you making the same point in different words: > > Applycache is presumably where you're going to want to spill > > transaction streams to disk, eventually. That seems like a > > prerequisite to commit. > > Second, crash recovery. I think whatever we put in place here has to > be able to survive a crash on any node. Decoding must be able to > restart successfully after a system crash, and it has to be able to > apply exactly the set of transactions that were committed but not > applied prior to the crash. Maybe an appropriate mechanism for this > already exists or has been discussed, but I haven't seen it go by; > sorry if I have missed the boat. I have discussed it privately & roughly prototyped, but not publically. There are two pieces to this: 1) restartable after a crash/disconnection/shutdown 2) pick of exactly where it stopped Those are somewhat different because 1) is relevant on the source side and be solved there. 2) depends on the target system because it needs to ensure that it safely received the changes up to some point. The idea for 1) is to serialize the applycache whenever we reach a checkpoint and have that as a starting point for every confirmed flush location of 2). Obviously 2) will need cooperation by the receiving side. > > You consider this to be a throw-away function that won't ever be > > committed. 
However, I strongly feel that you should move it into > > /contrib, so that it can serve as a sort of reference implementation > > for authors of decoder client code, in the same spirit as numerous > > existing contrib modules (think contrib/spi). > > Without prejudice to the rest of this review which looks quite > well-considered, I'd like to add a particular +1 to this point. So were in violent agreement here ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
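A hypothetical illustration (names invented, nothing from the patch) of the two pieces of restart bookkeeping described above: the point up to which the applycache was last serialized on the source side, and the flush position the receiving side has confirmed.

typedef struct DecodingRestartState
{
    XLogRecPtr  serialized_at;      /* checkpoint LSN at which the applycache was last spilled (point 1) */
    XLogRecPtr  confirmed_flush;    /* newest LSN the receiver has acknowledged as safely applied (point 2) */
} DecodingRestartState;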
On Thursday, October 11, 2012 04:16:39 AM Greg Stark wrote: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think I've mentioned it before, but in the interest of not being > >> seen to critique the bikeshed only after it's been painted: this > >> design gives up something very important that exists in our current > >> built-in replication solution, namely pipelining. > > > > Isn't there an even more serious problem, namely that this assumes > > *all* transactions are serializable? What happens when they aren't? > > Or even just that the effective commit order is not XID order? > > Firstly, I haven't read the code but I'm confident it doesn't make the > elementary error of assuming commit order == xid order. I assume it's > applying the reassembled transactions in commit order. Yes, its commit order. Imo commit order is more like assuming all transactions are done in read committed and not above than assuming serializable? Or am I missing something? > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. The data written by > this transaction is either written or not when the commit happens and > it's all written or not at that time. Even in non-serializable mode > updates take row locks and nobody can see the data or modify it until > the transaction commits. Yes. There will be problems if you want to make serializable work across nodes, but that seems like something fiendishly complex anyway. I don't plan to work on it in the forseeable future. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Isn't there an even more serious problem, namely that this assumes
> >> *all* transactions are serializable? What happens when they aren't?
> >> Or even just that the effective commit order is not XID order?
> >
> > I don't think it assumes the transactions are serializable because
> > it's only concerned with writes, not reads. So the transaction it's
> > replaying may or may not have been able to view the data written by
> > other transactions that commited earlier but it doesn't matter when
> > trying to reproduce the effects using constants.
>
> I would believe that argument if the "apply" operations were at a
> similar logical level to our current WAL records, namely drop these bits
> into that spot. Unfortunately, they're not. I think this argument
> falls to the ground entirely as soon as you think about DDL being
> applied by transactions A,B,C and then needing to express what
> concurrent transactions X,Y,Z did in "source" terms. Even something as
> simple as a few column renames could break it,

Not sure what you're getting at here? Are you talking about the problems at the source side or the target side?

During decoding such problems should be handled already. As we reconstruct a Snapshot that lets catalog access look like it did back when the tuple was thrown into the WAL, we have the exact column names, data types and everything. The locking used when making the original changes prevents the data types and column names from being changed mid-transaction.

If you're talking about the receiving/apply side: Sure, if you do careless DDL and you don't replicate DDL (not included here, separate project), you're going to have problems. I don't think there is much we can do about that.

> let alone anything as esoteric as changing the meaning of datatype literals.

Hm. You mean like changing the input format of a datatype? Yes, sure, that will cause havoc.

Greetings,

Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thursday, October 11, 2012 04:49:20 AM Andres Freund wrote: > On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote: > > Greg Stark <stark@mit.edu> writes: > > > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> Isn't there an even more serious problem, namely that this assumes > > >> all transactions are serializable? What happens when they aren't? > > >> Or even just that the effective commit order is not XID order? > > > > > > > > > > > > I don't think it assumes the transactions are serializable because > > > it's only concerned with writes, not reads. So the transaction it's > > > replaying may or may not have been able to view the data written by > > > other transactions that commited earlier but it doesn't matter when > > > trying to reproduce the effects using constants. > > > > > > > > I would believe that argument if the "apply" operations were at a > > similar logical level to our current WAL records, namely drop these bits > > into that spot. Unfortunately, they're not. I think this argument > > falls to the ground entirely as soon as you think about DDL being > > applied by transactions A,B,C and then needing to express what > > concurrent transactions X,Y,Z did in "source" terms. Even something as > > simple as a few column renames could break it, > > Not sure what youre getting at here? Are you talking about the problems at > the source side or the target side? > > During decoding such problems should be handled already Btw, the introductionary email upthread shows a trivial example. As submitted the code cannot handle intermingled DDL/DML transactions, but I fixed that now. There are some problems with CLUSTER/VACUUM FULL on system catalogs, but thats going to be separate thread... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From: Heikki Linnakangas
On 22.09.2012 20:00, Andres Freund wrote: > [[basic-schema]] > .Architecture Schema > ["ditaa"] > ------------------------------------------------------------------------------ > Traditional Stuff > > +---------+---------+---------+---------+----+ > | Backend | Backend | Backend | Autovac | ...| > +----+----+---+-----+----+----+----+----+-+--+ > | | | | | > +------+ | +--------+ | | > +-+ | | | +----------------+ | > | | | | | | > | v v v v | > | +------------+ | > | | WAL writer |<------------------+ > | +------------+ > | | | | | | > v v v v v v +-------------------+ > +--------+ +---------+ +->| Startup/Recovery | > |{s} | |{s} | | +-------------------+ > |Catalog | | WAL |---+->| SR/Hot Standby | > | | | | | +-------------------+ > +--------+ +---------+ +->| Point in Time | > ^ | +-------------------+ > ---|----------|-------------------------------- > | New Stuff > +---+ | > | v Running separately > | +----------------+ +=-------------------------+ > | | Walsender | | | | > | | v | | +-------------------+ | > | +-------------+ | | +->| Logical Rep. | | > | | WAL | | | | +-------------------+ | > +-| decoding | | | +->| Multimaster | | > | +------+------/ | | | +-------------------+ | > | | | | | +->| Slony | | > | | v | | | +-------------------+ | > | +-------------+ | | +->| Auditing | | > | | TX | | | | +-------------------+ | > +-| reassembly | | | +->| Mysql/... | | > | +-------------/ | | | +-------------------+ | > | | | | | +->| Custom Solutions | | > | | v | | | +-------------------+ | > | +-------------+ | | +->| Debugging | | > | | Output | | | | +-------------------+ | > +-| Plugin |--|--|-+->| Data Recovery | | > +-------------/ | | +-------------------+ | > | | | | > +----------------+ +--------------------------| > ------------------------------------------------------------------------------ This diagram triggers a pet-peeve of mine: What do all the boxes and lines mean? An architecture diagram should always include a key. I find that when I am drawing a diagram myself, adding the key clarifies my own thinking too. This looks like a data-flow diagram, where the arrows indicate the data flows between components, and the boxes seem to represent processes. But in that case, I think the arrows pointing from the plugins in walsender to Catalog are backwards. The catalog information flows from the Catalog to walsender, walsender does not write to the catalogs. Zooming out to look at the big picture, I think the elephant in the room with this whole effort is how it fares against trigger-based replication. You list a number of disadvantages that trigger-based solutions have, compared to the proposed logical replication. Let's take a closer look at them: > * essentially duplicates the amount of writes (or even more!) True. > * synchronous replication hard or impossible to implement I don't see any explanation it could be implemented in the proposed logical replication either. > * noticeable CPU overhead > * trigger functions > * text conversion of data Well, I'm pretty sure we could find some micro-optimizations for these if we put in the effort. And the proposed code isn't exactly free, either. > * complex parts implemented in several solutions Not sure what this means, but the proposed code is quite complex too. > * not in core IMHO that's a good thing, and I'd hope this new logical replication to live outside core as well, as much as possible. But whether or not something is in core is just a political decision, not a reason to implement something new. 
If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says. Sorry to be a naysayer, but I'm pretty scared of all the new code and complexity these patches bring into core.

PS. I'd love to see a basic Slony plugin for this, for example, to see how much extra code on top of the posted patches you need to write in a plugin like that to make it functional. I'm worried that it's a lot..

- Heikki
On 10/11/2012 04:31 AM, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order?
>> I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier, but it doesn't matter when trying to reproduce the effects using constants.
> I would believe that argument if the "apply" operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in "source" terms. Even something as simple as a few column renames could break it, let alone anything as esoteric as changing the meaning of datatype literals.

This is the whole reason for moving the reassembly to the source side and having the possibility to use old snapshots to get the catalog information. Also, the locks that protect you from the effects of field name changes by concurrent DDL transactions also protect the logical reassembly, if it is done in commit order.

> regards, tom lane
On 10/11/2012 03:10 AM, Robert Haas wrote: > On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> The purpose of ApplyCache/transaction reassembly is to reassemble >> interlaced records, and organise them by XID, so that the consumer >> client code sees only streams (well, lists) of records split by XID. > I think I've mentioned it before, but in the interest of not being > seen to critique the bikeshed only after it's been painted: this > design gives up something very important that exists in our current > built-in replication solution, namely pipelining. The lack of pipelining (and the following complexity of applycache and spilling to disk) is something we have discussed with Andres and to my understanding it is not a final design decision but just stepping stones in how this quite large development is structured. The pipelining (or parallel apply as I described it) requires either a large number of apply backends and code to manage them or autonomous transactions. It could (arguably !) be easier to implement autonomous transactions instead of apply cache, but Andres had valid reasons to start with apply cache and move to parallel apply later . As I understand it the parallel apply is definitely one of the things that will be coming and after that the performance characteristics (fast AND smooth) will be very similar to current physical WAL streaming. > With streaming > replication as it exists today, a transaction that modifies a huge > amount of data (such as a bulk load) can be applied on the standby as > it happens. The rows thus inserted will become visible only if and > when the transaction commits on the master and the commit record is > replayed on the standby. This has a number of important advantages, > perhaps most importantly that the lag between commit and data > visibility remains short. With the proposed system, we can't start > applying the changes until the transaction has committed and the > commit record has been replayed, so a big transaction is going to have > a lot of apply latency. > > Now, I am not 100% opposed to a design that surrenders this property > in exchange for other important benefits, but I think it would be > worth thinking about whether there is any way that we can design this > that either avoids giving that property up at all, or gives it up for > the time being but allows us to potentially get back to it in a later > version. Reassembling complete transactions is surely cool and some > clients will want that, but being able to apply replicated > transactions *without* reassembling them in their entirety is even > cooler, and some clients will want that, too. > > If we're going to stick with a design that reassembles transactions, I > think there are a number of issues that deserve careful thought. > First, memory usage. I don't think it's acceptable for the decoding > process to assume that it can allocate enough backend-private memory > to store all of the in-flight changes (either as WAL or in some more > decoded form). We have assiduously avoided such assumptions thus far; > you can write a terabyte of data in one transaction with just a > gigabyte of shared buffers if you so desire (and if you're patient). > Here's you making the same point in different words: > >> Applycache is presumably where you're going to want to spill >> transaction streams to disk, eventually. That seems like a >> prerequisite to commit. > Second, crash recovery. 
I think whatever we put in place here has to > be able to survive a crash on any node. Decoding must be able to > restart successfully after a system crash, and it has to be able to > apply exactly the set of transactions that were committed but not > applied prior to the crash. Maybe an appropriate mechanism for this > already exists or has been discussed, but I haven't seen it go by; > sorry if I have missed the boat. > >> You consider this to be a throw-away function that won't ever be >> committed. However, I strongly feel that you should move it into >> /contrib, so that it can serve as a sort of reference implementation >> for authors of decoder client code, in the same spirit as numerous >> existing contrib modules (think contrib/spi). > Without prejudice to the rest of this review which looks quite > well-considered, I'd like to add a particular +1 to this point. >
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
> On 22.09.2012 20:00, Andres Freund wrote:
> > [[basic-schema]]
> > .Architecture Schema
> > ["ditaa"]
> > [quoted ASCII architecture diagram omitted; see the previous message]
>
> This diagram triggers a pet-peeve of mine: What do all the boxes and lines mean? An architecture diagram should always include a key. I find that when I am drawing a diagram myself, adding the key clarifies my own thinking too.

Hm. Ok.

> This looks like a data-flow diagram, where the arrows indicate the data flows between components, and the boxes seem to represent processes. But in that case, I think the arrows pointing from the plugins in walsender to Catalog are backwards. The catalog information flows from the Catalog to walsender, walsender does not write to the catalogs.

The reason I drew it that way is that it actively needs to go back to the catalog and query it, which is somewhat different from the rest, which could basically be seen as a unidirectional pipeline.

> Zooming out to look at the big picture, I think the elephant in the room with this whole effort is how it fares against trigger-based replication. You list a number of disadvantages that trigger-based solutions have, compared to the proposed logical replication. Let's take a closer look at them:
>
> > * essentially duplicates the amount of writes (or even more!)
>
> True.

By now I think it's essentially unfixable.

> > * synchronous replication hard or impossible to implement
>
> I don't see any explanation of how it could be implemented in the proposed logical replication either.
It's basically the same as it is for synchronous streaming replication. At the place where SyncRepWaitForLSN() is done, you instead/also wait for the decoding to reach that LSN (it's in the WAL, so everything is decodable) and for the other side to have confirmed reception of those changes. I think this should be doable with only minor code modifications. The existing support for all that is basically the reason we want to reuse the walsender framework. (Will start a thread about that soon.)

> > * noticeable CPU overhead
> >   * trigger functions
> >   * text conversion of data
>
> Well, I'm pretty sure we could find some micro-optimizations for these if we put in the effort.

Any improvements there are a good idea independent of this proposal, but I don't see how we can fundamentally improve on the status quo.

> And the proposed code isn't exactly free, either.

If you don't have frequent DDL it's really not all that expensive. In the version without DDL support I didn't manage to saturate the ApplyCache with either parallel COPY in individual transactions (repeated 100MB files) or with pgbench. Also, it's basically doing work that the trigger/queue based solutions have to do as well, just that they do it via far less optimized SQL statements. DDL support doesn't really change much, as the overhead for transactions without DDL and without concurrently running DDL should be fairly minor (the submitted version is *not* finalized there; it builds a new snapshot instead of copying/referencing the old one).

> > * complex parts implemented in several solutions
>
> Not sure what this means, but the proposed code is quite complex too.

It is, agreed. What I mean is that significantly complex logic is buried in the encoding, queuing and decoding/ordering logic of every trigger-based replication solution. That's not a good thing.

> > * not in core
>
> IMHO that's a good thing, and I'd hope this new logical replication to live outside core as well, as much as possible.

I don't agree there, but I would like to keep that a separate discussion. For now I/we only want to submit the changes that technically need in-core support to work sensibly (this, background workers, some walsender integration). The goal of working nearly completely without special in-core support held the existing solutions back quite a bit imo.

> But whether or not something is in core is just a political decision, not a reason to implement something new.

Isn't it both? There are things you simply cannot do unless you're inside core. Politically, I think the external status of all those logical replication projects grew to be an adoption barrier. I don't even want to think about how many bad home-grown logical replication solutions I have seen out there that implement everything from the get-go.

> If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says.

You're welcome to make suggestions, but everything I could think of that didn't fall short of reality ended up basically duplicating the amount of writes & fsyncs, even if not going through the WAL. You need to be crash safe/restartable (=> writes, fsyncs) and you need to reduce the writes (in memory, => !writes).
There is only one authoritative point where you can rely on a commit to have been successful, and that's when the commit record has been written to the WAL. You can't send out the data to be committed before that's written, because that could result in spuriously committed transactions on the remote side, and you can't easily do it afterwards, because you can crash after the commit.

> Sorry to be a naysayer, but I'm pretty scared of all the new code and complexity these patches bring into core.

Understandable. I tried to keep the introduction of complexity into existing code paths relatively minor, and I think I mostly succeeded there, but it still needs to be maintained.

> PS. I'd love to see a basic Slony plugin for this, for example, to see how much extra code on top of the posted patches you need to write in a plugin like that to make it functional. I'm worried that it's a lot..

I think before it's possible to do something like that a few more design decisions need to be made. Mostly the walsender(ish) integration needs to be done. After that I can imagine writing a demo plugin that outputs changes in a Slony-compatible format, but I would like to see some Slony/Londiste person cooperating on receiving/applying those. What complications are you imagining?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10/10/12 7:26 PM, Bruce Momjian wrote:
> How does Slony write its changes without causing serialization replay conflicts?

Since nobody from the Slony team answered this:

a) Slony replicates *rows*, not *statements*
b) Slony uses serializable mode internally for row replication
c) it's possible (though difficult) for creative usage to get Slony into a deadlock situation

FWIW, I have always assumed that it is impossible (even theoretically) to have statement-based replication without some constraints on the statements you can run, or some replication failures. I think we should expect 9.3's logical replication out of the gate to have some issues and impose constraints on users, and improve with time but never be perfect.

The design Andres and Simon have advanced already eliminates a lot of the common failure cases (now(), random(), nextval()) suffered by pgPool and similar tools. But remember, this feature doesn't have to be *perfect*, it just has to be *better* than the alternatives.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
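To make the now()/random()/nextval() point concrete, here is a minimal sketch of why replaying the original statement diverges while replaying the decoded row does not (the events table and all literal values are invented purely for illustration):

    -- Hypothetical table used only for this example.
    CREATE TABLE events (id serial PRIMARY KEY, created timestamptz, token float8);

    -- Statement-based replay: each node re-evaluates the volatile functions,
    -- so origin and replica end up with different created/token values.
    INSERT INTO events (created, token) VALUES (now(), random());

    -- Row-based (decoded) replay: the replica applies the concrete values the
    -- origin actually wrote, i.e. something equivalent to:
    INSERT INTO events (id, created, token)
    VALUES (1, '2012-10-11 21:15:00+02', 0.42);

The second INSERT is the kind of thing a row-level change stream hands to the subscriber, which is why this class of failure goes away.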
On Wed, Oct 10, 2012 at 10:26 PM, Bruce Momjian <bruce@momjian.us> wrote: > How does Slony write its changes without causing serialization replay > conflicts? It uses a sequence to break any ordering conflicts at the time that data is inserted into its log tables. If there are two transactions, A and B, that were "fighting" over a tuple on the origin, then either: a) A went first, B went second, and the ordering in the log makes that order clear; b) A succeeds, then B fails, so there's no conflict; c) A is doing its thing, and B is blocked behind it for a while, then A fails, and B gets to go through, and there's no conflict. Switch A and B as needed. The sequence that is used establishes what is termed a "compatible ordering." There are multiple possible compatible orderings; ours happen to interleave transactions together, with the sequence guaranteeing absence of conflict. If we could get commit orderings, then a different but still "compatible ordering" would be to have each transaction establish its own internal sequence, and apply things in order based on (commit_tx_order, sequence_within). -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
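As a rough sketch of the two "compatible orderings" described above (the table and column names follow the sl_log_* layout discussed later in this thread; commit_tx_order and sequence_within are hypothetical, as in Christopher's description):

    -- Today: interleaved rows from concurrent transactions, made conflict-free
    -- by ordering on the sequence assigned when each row was logged.
    SELECT * FROM sl_log_1
     WHERE log_origin = 1
     ORDER BY log_actionseq;

    -- With commit-order information available, an alternative compatible
    -- ordering would group each transaction's changes together, i.e. roughly:
    --   ORDER BY commit_tx_order, sequence_within;

Either way the subscriber applies changes in an order that agrees with the row locks taken on the origin.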
On 11 October 2012 03:16, Greg Stark <stark@mit.edu> wrote: > On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think I've mentioned it before, but in the interest of not being >>> seen to critique the bikeshed only after it's been painted: this >>> design gives up something very important that exists in our current >>> built-in replication solution, namely pipelining. >> >> Isn't there an even more serious problem, namely that this assumes >> *all* transactions are serializable? What happens when they aren't? >> Or even just that the effective commit order is not XID order? > > Firstly, I haven't read the code but I'm confident it doesn't make the > elementary error of assuming commit order == xid order. I assume it's > applying the reassembled transactions in commit order. > > I don't think it assumes the transactions are serializable because > it's only concerned with writes, not reads. So the transaction it's > replaying may or may not have been able to view the data written by > other transactions that commited earlier but it doesn't matter when > trying to reproduce the effects using constants. The data written by > this transaction is either written or not when the commit happens and > it's all written or not at that time. Even in non-serializable mode > updates take row locks and nobody can see the data or modify it until > the transaction commits. This uses Commit Serializability, which is valid, as you say. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12-10-11 06:27 PM, Josh Berkus wrote:
> On 10/10/12 7:26 PM, Bruce Momjian wrote:
>> How does Slony write its changes without causing serialization replay conflicts?
> Since nobody from the Slony team answered this:
>
> a) Slony replicates *rows*, not *statements*

True, but the proposed logical replication would also replicate rows, not the original statements. I don't consider this proposal to be an example of 'statement' replication like pgpool is. If the original SQL was 'update foo set x=x+1 where id > 10', there will be a WAL record to decode for each row modified in the table. In a million row table I'd expect the replica will have to apply a million records (whether they be binary heap tuples or SQL statements).

> b) Slony uses serializable mode internally for row replication

Actually, recent versions of Slony apply transactions against the replica in read committed mode. Older versions used serializable mode, but with the SSI changes in 9.1 we found Slony tended to have serialization conflicts with itself on the Slony internal tables, resulting in a lot of aborted transactions.

When Slony applies changes on a replica table it does so in a single transaction. Slony finds a set of transactions that committed on the master in between two SYNC events. It then applies all of the rows changed by any of those transactions as part of a single transaction on the replica (a sketch of this follows below). Chris's post explains this in more detail. Conflicts with user transactions on the replica are possible.

> c) it's possible (though difficult) for creative usage to get Slony into a deadlock situation
>
> FWIW, I have always assumed that it is impossible (even theoretically) to have statement-based replication without some constraints on the statements you can run, or some replication failures. I think we should expect 9.3's logical replication out of the gate to have some issues and impose constraints on users, and improve with time but never be perfect.
>
> The design Andres and Simon have advanced already eliminates a lot of the common failure cases (now(), random(), nextval()) suffered by pgPool and similar tools. But remember, this feature doesn't have to be *perfect*, it just has to be *better* than the alternatives.
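A minimal sketch of that apply pattern, purely illustrative: the txid list is invented and the column names are assumed from the sl_log_* description later in this thread, not taken from Slony's actual apply queries:

    BEGIN;
    -- Fetch every logged row change from the transactions that committed on the
    -- origin between the two SYNC events, in a compatible order, and apply them.
    SELECT log_tablenspname, log_tablerelname, log_cmdtype, log_cmdargs
      FROM sl_log_1
     WHERE log_origin = 1
       AND log_txid IN (1001, 1002, 1005)   -- txids between the SYNCs (hypothetical)
     ORDER BY log_actionseq;
    -- ... each returned row is turned into an INSERT/UPDATE/DELETE on the replica ...
    COMMIT;

One replica transaction thus covers a whole group of origin transactions, which is also why conflicts with concurrent user transactions on the replica can occur.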
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Robert Haas
Date:
On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > IMHO that's a good thing, and I'd hope this new logical replication to live > outside core as well, as much as possible. But whether or not something is > in core is just a political decision, not a reason to implement something > new. > > If the only meaningful advantage is reducing the amount of WAL written, I > can't help thinking that we should just try to address that in the existing > solutions, even if it seems "easy to solve at a first glance, but a solution > not using a normal transactional table for its log/queue has to solve a lot > of problems", as the document says. Sorry to be a naysayer, but I'm pretty > scared of all the new code and complexity these patches bring into core. I think what we're really missing at the moment is a decent way of decoding WAL. There are a decent number of customers who, when presented with replication system, start by asking whether it's trigger-based or WAL-based. When you answer that it's trigger-based, their interest goes... way down. If you tell them the triggers are written in anything but C, you lose a bunch more points. Sure, some people's concerns are overblown, but it's hard to escape the conclusion that a WAL-based solution can be a lot more efficient than a trigger-based solution, and EnterpriseDB has gotten comments from a number of people who upgraded to 9.0 or 9.1 to the effect that SR was way faster than Slony. I do not personally believe that a WAL decoding solution adequate to drive logical replication can live outside of core, at least not unless core exposes a whole lot more interface than we do now, and probably not even then. Even if it could, I don't see the case for making every replication solution reinvent that wheel. It's a big wheel to be reinventing, and everyone needs pretty much the same thing. That having been said, I have to agree that the people working on this project seem to be wearing rose-colored glasses when it comes to the difficulty of implementing a full-fledged solution in core. I'm right on board with everything up to the point where we start kicking out a stream of decoded changes to the user... and that's about it. To pick on Slony for the moment, as the project that has been around for the longest and has probably the largest user base (outside of built-in SR, perhaps), they've got a project that they have been developing for years and years and years. What have they been doing all that time? Maybe they are just stupid, but Chris and Jan and Steve don't strike me that way, so I think the real answer is that they are solving problems that we haven't even started to think about yet, especially around control logic: how do you turn it on? how do you turn it off? how do you handle node failures? how do you handle it when a node gets behind? We are not going to invent good solutions to all of those problems between now and January, or even between now and next January. > PS. I'd love to see a basic Slony plugin for this, for example, to see how > much extra code on top of the posted patches you need to write in a plugin > like that to make it functional. I'm worried that it's a lot.. I agree. I would go so far as to say that if Slony can't integrate with this work and use it in place of their existing change-capture facility, that's sufficient grounds for unconditional rejection. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote:
> On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> > IMHO that's a good thing, and I'd hope this new logical replication to live outside core as well, as much as possible. But whether or not something is in core is just a political decision, not a reason to implement something new.
> >
> > If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says. Sorry to be a naysayer, but I'm pretty scared of all the new code and complexity these patches bring into core.
>
> I do not personally believe that a WAL decoding solution adequate to drive logical replication can live outside of core, at least not unless core exposes a whole lot more interface than we do now, and probably not even then. Even if it could, I don't see the case for making every replication solution reinvent that wheel. It's a big wheel to be reinventing, and everyone needs pretty much the same thing.

Unsurprisingly, I agree.

> That having been said, I have to agree that the people working on this project seem to be wearing rose-colored glasses when it comes to the difficulty of implementing a full-fledged solution in core.

That very well might be true. Sometimes rose-colored glasses can be quite productive in getting something started...

Note that at this point we only want WAL decoding, background workers and related things to get integrated...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Bruce Momjian
Date:
On Mon, Oct 15, 2012 at 09:57:19AM +0200, Andres Freund wrote: > On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote: > > On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas > > > > <hlinnakangas@vmware.com> wrote: > > > IMHO that's a good thing, and I'd hope this new logical replication to > > > live outside core as well, as much as possible. But whether or not > > > something is in core is just a political decision, not a reason to > > > implement something new. > > > > > > If the only meaningful advantage is reducing the amount of WAL written, I > > > can't help thinking that we should just try to address that in the > > > existing solutions, even if it seems "easy to solve at a first glance, > > > but a solution not using a normal transactional table for its log/queue > > > has to solve a lot of problems", as the document says. Sorry to be a > > > naysayer, but I'm pretty scared of all the new code and complexity these > > > patches bring into core. > > > I do not personally believe that a WAL decoding solution adequate to > > drive logical replication can live outside of core, at least not > > unless core exposes a whole lot more interface than we do now, and > > probably not even then. Even if it could, I don't see the case for > > making every replication solution reinvent that wheel. It's a big > > wheel to be reinventing, and everyone needs pretty much the same > > thing. > Unsurprisingly I aggree. > > > That having been said, I have to agree that the people working on this > > project seem to be wearing rose-colored glasses when it comes to the > > difficulty of implementing a full-fledged solution in core. > That very well might be true. Sometimes rose-colored glasses can be quite > productive in getting something started... > > Note at this point were only want wal decoding, background workers and related > things to get integrated... Well, TODO does have: Move pgfoundry's xlogdump to /contrib and have it rely more closely onthe WAL backend code I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 08:19:54 PM Bruce Momjian wrote: > On Mon, Oct 15, 2012 at 09:57:19AM +0200, Andres Freund wrote: > > On Monday, October 15, 2012 04:54:20 AM Robert Haas wrote: > > > On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas > > > > > > <hlinnakangas@vmware.com> wrote: > > > > IMHO that's a good thing, and I'd hope this new logical replication > > > > to live outside core as well, as much as possible. But whether or > > > > not something is in core is just a political decision, not a reason > > > > to implement something new. > > > > > > > > If the only meaningful advantage is reducing the amount of WAL > > > > written, I can't help thinking that we should just try to address > > > > that in the existing solutions, even if it seems "easy to solve at a > > > > first glance, but a solution not using a normal transactional table > > > > for its log/queue has to solve a lot of problems", as the document > > > > says. Sorry to be a naysayer, but I'm pretty scared of all the new > > > > code and complexity these patches bring into core. > > > > > > I do not personally believe that a WAL decoding solution adequate to > > > drive logical replication can live outside of core, at least not > > > unless core exposes a whole lot more interface than we do now, and > > > probably not even then. Even if it could, I don't see the case for > > > making every replication solution reinvent that wheel. It's a big > > > wheel to be reinventing, and everyone needs pretty much the same > > > thing. > > > > Unsurprisingly I aggree. > > > > > That having been said, I have to agree that the people working on this > > > project seem to be wearing rose-colored glasses when it comes to the > > > difficulty of implementing a full-fledged solution in core. > > > > That very well might be true. Sometimes rose-colored glasses can be quite > > productive in getting something started... > > > > Note at this point were only want wal decoding, background workers and > > related things to get integrated... > > Well, TODO does have: > > Move pgfoundry's xlogdump to /contrib and have it rely more closely on > the WAL backend code Uhm. How does that relate to my statement? The xlogreader code I submitted does contain a very small POC xlogdump with almost no code duplication. It needs some work to be really useful though. > I think Robert is right that if Slony can't use the API, it is unlikely > any other replication system could use it. I aggree and I don't think I have ever said something contrary. I just don't want to be the only one working on slony integration. I am ready to do a good part of that, but somebody with slony experience needs to help, especially on consuming those changes. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/11/2012 01:42 PM, Andres Freund wrote:
> On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
> ...
>> If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says.
>
> You're welcome to make suggestions, but everything I could think of that didn't fall short of reality ended up basically duplicating the amount of writes & fsyncs, even if not going through the WAL.
>
> You need to be crash safe/restartable (=> writes, fsyncs) and you need to reduce the writes (in memory, => !writes). There is only one authoritative point where you can rely on a commit to have been successful, and that's when the commit record has been written to the WAL. You can't send out the data to be committed before that's written because that could result in spuriously committed transactions on the remote side, and you can't easily do it afterwards because you can crash after the commit.

Just curious here, but do you know how this part is solved in current sync WAL replication - you can get "spurious" commits on the slave side if the master dies while waiting for confirmation.

> What complications are you imagining?
>
> Greetings, Andres

Hannu
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 08:38:07 PM Hannu Krosing wrote: > On 10/11/2012 01:42 PM, Andres Freund wrote: > > On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote: > > ... > > If the only meaningful advantage is reducing the amount of WAL written, > > I can't help thinking that we should just try to address that in the > > existing solutions, even if it seems "easy to solve at a first glance, > > but a solution not using a normal transactional table for its log/queue > > has to solve a lot of problems", as the document says. > > Youre welcome to make suggestions, but everything I could think of that > > didn't fall short of reality ended up basically duplicating the amount > > of writes & fsyncs, even if not going through the WAL. > > > > You need to be crash safe/restartable (=> writes, fsyncs) and you need to > > reduce the writes (in memory, => !writes). There is only one > > authoritative point where you can rely on a commit to have been > > successfull and thats when the commit record has been written to the > > WAL. You can't send out the data to be committed before thats written > > because that could result in spuriously committed transactions on the > > remote side and you can't easily do it afterwards because you can crash > > after the commit. > > Just curious here, but do you know how is this part solved in current sync > wal replication - you can get "spurious" commits on slave side id master > dies while waiting for confirmation. Synchronous replication is only synchronous in respect to the COMMIT reply sent to the user. First the commit is written to WAL locally, so it persists across a crash (c.f. RecordTransactionCommit). Only then we wait for the standby (SyncRepWaitForLSN). After that finished the shared memory on the primary gets updated (c.f. ProcArrayEndTransaction in CommitTransaction) and soon after that the user gets the response to the COMMIT back. I am not really sure what you were asking for, does the above explanation answer this? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/15/2012 04:54 AM, Robert Haas wrote: > On Thu, Oct 11, 2012 at 3:15 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> IMHO that's a good thing, and I'd hope this new logical replication to live >> outside core as well, as much as possible. But whether or not something is >> in core is just a political decision, not a reason to implement something >> new. >> >> If the only meaningful advantage is reducing the amount of WAL written, I >> can't help thinking that we should just try to address that in the existing >> solutions, even if it seems "easy to solve at a first glance, but a solution >> not using a normal transactional table for its log/queue has to solve a lot >> of problems", as the document says. Sorry to be a naysayer, but I'm pretty >> scared of all the new code and complexity these patches bring into core. > I think what we're really missing at the moment is a decent way of > decoding WAL. There are a decent number of customers who, when > presented with replication system, start by asking whether it's > trigger-based or WAL-based. When you answer that it's trigger-based, > their interest goes... way down. If you tell them the triggers are > written in anything but C, you lose a bunch more points. Sure, some > people's concerns are overblown, but it's hard to escape the > conclusion that a WAL-based solution can be a lot more efficient than > a trigger-based solution, and EnterpriseDB has gotten comments from a > number of people who upgraded to 9.0 or 9.1 to the effect that SR was > way faster than Slony. > > I do not personally believe that a WAL decoding solution adequate to > drive logical replication can live outside of core, at least not > unless core exposes a whole lot more interface than we do now, and > probably not even then. Even if it could, I don't see the case for > making every replication solution reinvent that wheel. It's a big > wheel to be reinventing, and everyone needs pretty much the same > thing. > > That having been said, I have to agree that the people working on this > project seem to be wearing rose-colored glasses when it comes to the > difficulty of implementing a full-fledged solution in core. I'm right > on board with everything up to the point where we start kicking out a > stream of decoded changes to the user... and that's about it. To pick > on Slony for the moment, as the project that has been around for the > longest and has probably the largest user base (outside of built-in > SR, perhaps), they've got a project that they have been developing for > years and years and years. What have they been doing all that time? > Maybe they are just stupid, but Chris and Jan and Steve don't strike > me that way, so I think the real answer is that they are solving > problems that we haven't even started to think about yet, especially > around control logic: how do you turn it on? how do you turn it off? > how do you handle node failures? how do you handle it when a node > gets behind? We are not going to invent good solutions to all of > those problems between now and January, or even between now and next > January. I understand the the current minimal target is to get on par to current WAL streaming in terms of setup ease and performance with additional benefit of having read-write subscribers with at least conflict detection and logging. Hoping that we have something by january that solves all possible replication scenarios out of the box is unrealistic. >> PS. 
I'd love to see a basic Slony plugin for this, for example, to see how >> much extra code on top of the posted patches you need to write in a plugin >> like that to make it functional. I'm worried that it's a lot.. > I agree. I would go so far as to say that if Slony can't integrate > with this work and use it in place of their existing change-capture > facility, that's sufficient grounds for unconditional rejection. >
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/15/2012 08:44 PM, Andres Freund wrote:
> On Monday, October 15, 2012 08:38:07 PM Hannu Krosing wrote:
>> On 10/11/2012 01:42 PM, Andres Freund wrote:
>>> On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
>>> ...
>>>> If the only meaningful advantage is reducing the amount of WAL written, I can't help thinking that we should just try to address that in the existing solutions, even if it seems "easy to solve at a first glance, but a solution not using a normal transactional table for its log/queue has to solve a lot of problems", as the document says.
>>>
>>> You're welcome to make suggestions, but everything I could think of that didn't fall short of reality ended up basically duplicating the amount of writes & fsyncs, even if not going through the WAL.
>>>
>>> You need to be crash safe/restartable (=> writes, fsyncs) and you need to reduce the writes (in memory, => !writes). There is only one authoritative point where you can rely on a commit to have been successful, and that's when the commit record has been written to the WAL. You can't send out the data to be committed before that's written because that could result in spuriously committed transactions on the remote side, and you can't easily do it afterwards because you can crash after the commit.
>>
>> Just curious here, but do you know how this part is solved in current sync WAL replication - you can get "spurious" commits on the slave side if the master dies while waiting for confirmation.
>
> Synchronous replication is only synchronous in respect to the COMMIT reply sent to the user. First the commit is written to WAL locally, so it persists across a crash (c.f. RecordTransactionCommit). Only then we wait for the standby (SyncRepWaitForLSN). After that has finished, the shared memory on the primary gets updated (c.f. ProcArrayEndTransaction in CommitTransaction) and soon after that the user gets the response to the COMMIT back.
>
> I am not really sure what you were asking for, does the above explanation answer this?

I think I mostly got it: if the master crashes before the commit confirmation comes back, then the commit _will_ still have happened after restart. To the client it looks like it did not commit, but it is no different in this respect than any other crash-before-confirmation, and thus the client cannot rely on the commit not having happened and has to check it.

> Greetings,
>
> Andres
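A small sketch of the check-after-reconnect pattern that last sentence implies (the commit_tokens table and token value are hypothetical; this is a generic client-side technique, not something from the patch):

    BEGIN;
    -- ... the application's actual work ...
    INSERT INTO commit_tokens (token) VALUES ('job-4711');  -- bookkeeping row written in the same transaction
    COMMIT;

    -- After a crash and reconnect, the client checks whether the commit made it:
    SELECT 1 FROM commit_tokens WHERE token = 'job-4711';

If the row is there, the commit survived the crash even though the client never received the confirmation.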
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Bruce Momjian
Date:
On Mon, Oct 15, 2012 at 08:26:08PM +0200, Andres Freund wrote: > > > > I do not personally believe that a WAL decoding solution adequate to > > > > drive logical replication can live outside of core, at least not > > > > unless core exposes a whole lot more interface than we do now, and > > > > probably not even then. Even if it could, I don't see the case for > > > > making every replication solution reinvent that wheel. It's a big > > > > wheel to be reinventing, and everyone needs pretty much the same > > > > thing. > > > > > > Unsurprisingly I aggree. > > > > > > > That having been said, I have to agree that the people working on this > > > > project seem to be wearing rose-colored glasses when it comes to the > > > > difficulty of implementing a full-fledged solution in core. > > > > > > That very well might be true. Sometimes rose-colored glasses can be quite > > > productive in getting something started... > > > > > > Note at this point were only want wal decoding, background workers and > > > related things to get integrated... > > > > Well, TODO does have: > > > > Move pgfoundry's xlogdump to /contrib and have it rely more closely on > > the WAL backend code > > Uhm. How does that relate to my statement? > > The xlogreader code I submitted does contain a very small POC xlogdump with > almost no code duplication. It needs some work to be really useful though. I just meant that dumping xlog contents is something we want to improve. > > I think Robert is right that if Slony can't use the API, it is unlikely > > any other replication system could use it. > > I aggree and I don't think I have ever said something contrary. I just don't > want to be the only one working on slony integration. I am ready to do a good > part of that, but somebody with slony experience needs to help, especially on > consuming those changes. Agreed. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Hannu Krosing
Date:
On 10/15/2012 04:54 AM, Robert Haas wrote: >> PS. I'd love to see a basic Slony plugin for this, for example, to see how >> >much extra code on top of the posted patches you need to write in a plugin >> >like that to make it functional. I'm worried that it's a lot.. > I agree. I would go so far as to say that if Slony can't integrate > with this work and use it in place of their existing change-capture > facility, that's sufficient grounds for unconditional rejection. The fact that current work starts with "apply cache" instead of streaming makes the semantics very close to how londiste and slony do this. Therefore I don't think there will be any problems with "can't" though it may be that there will be nobody actually doing it, at least not before January. ------------ Hannu
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Peter Geoghegan
Date:
On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:
> I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and someone has to go first - why should the Slony guys invest in adopting this technology if it is going to necessitate using a forked Postgres with an uncertain future? That would be (with respect to the Slony guys) a commercial risk that is fairly heavily concentrated with Afilias.

So, if you're going to attach as a condition to its acceptance that the Slony guys be able to use it immediately (because "can integrate" really means "will integrate", right?), you're attaching it to a rather arbitrary condition that has nothing much to do with the technical merit of the patches proposed. The fact of the matter is that Slony was originally designed with a somewhat different set of constraints to those that exist today, so I don't doubt that this is something that they're going to need to integrate over time, probably in a separate release branch, to get the upsides of in-core logical replication, along with the great flexibility that Slony currently offers (and that Afilias undoubtedly depend upon today).

Another way of putting this is that Postgres should go first, because we will get huge benefits even if only one of the trigger-based logical replication systems adopts the technology. Though I hope and expect that the Slony guys will be able to work with what we're doing, surely a logical replication system with all the benefits implied by being logical, but with only some subset of Slony's functionality, is still going to be of great benefit.

My view is that the only reasonable approach is to build something solid, well-integrated and generic, in core. I'd certainly like to hear what the Slony guys have to say here, though.

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Robert Haas
Date:
On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> I think Robert is right that if Slony can't use the API, it is unlikely >> any other replication system could use it. > > I don't accept that. Clearly there is a circular dependency, and > someone has to go first - why should the Slony guys invest in adopting > this technology if it is going to necessitate using a forked Postgres > with an uncertain future? Clearly, core needs to go first. However, before we commit, I would like to hear the Slony guys say something like this: We read the documentation that is part of this patch and if the feature behaves as advertised, we believe we will be able to use it in place of the change-capture mechanism that we have now, and that it will be at least as good as what we have now if not a whole lot better. If they say something like "I'm not sure we have the right design for this" or "this wouldn't be sufficient to replace this portion of what we have now because it lacks critical feature X", I would be very concerned about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote:
> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote:
> > I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it.
>
> I don't accept that. Clearly there is a circular dependency, and someone has to go first - why should the Slony guys invest in adopting this technology if it is going to necessitate using a forked Postgres with an uncertain future?

Well. I don't think (hope) anybody proposed making something release-worthy for Slony, but rather a POC patch that proves the API is generic enough to be used by them. If I (or somebody else familiar with this) work together with somebody familiar with Slony internals, I think such a POC shouldn't be too hard to do.

I think some more input from that side is a good idea. I plan to send out an email to possibly interested parties in about two weeks...

Regards,

Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >>> I think Robert is right that if Slony can't use the API, it is unlikely >>> any other replication system could use it. >> I don't accept that. Clearly there is a circular dependency, and >> someone has to go first - why should the Slony guys invest in adopting >> this technology if it is going to necessitate using a forked Postgres >> with an uncertain future? > Clearly, core needs to go first. However, before we commit, I would > like to hear the Slony guys say something like this: We read the > documentation that is part of this patch and if the feature behaves as > advertised, we believe we will be able to use it in place of the > change-capture mechanism that we have now, and that it will be at > least as good as what we have now if not a whole lot better. > If they say something like "I'm not sure we have the right design for > this" or "this wouldn't be sufficient to replace this portion of what > we have now because it lacks critical feature X", I would be very > concerned about that. The other point here is that core code without any implemented use-cases is unlikely to be worth a tinker's damn. Regardless of what time-frame the Slony guys are able to work on, I think we need to see working code (of at least prototype quality) before we believe that we've got it right. Or if not code from them, code from some other replication project. A possibly-useful comparison is to the FDW APIs we've been slowly implementing over the past couple releases. Even though we *did* have some use-cases written right off the bat, we got it wrong and had to change it in 9.2, and I wouldn't bet against having to change it again in 9.3 (even without considering the need for extensions for non-SELECT operations). And, despite our very clear warnings that all that stuff was in flux, people have been griping because the APIs changed. So if we ship core hooks for logical replication in advance of proof that they're actually usable by at least one (preferably more than one) replication project, I confidently predict that they'll be wrong and will need revision and the potential users will complain about the API instability. regards, tom lane
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Christopher Browne
Date:
On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> I think Robert is right that if Slony can't use the API, it is unlikely >> any other replication system could use it. > > I don't accept that. Clearly there is a circular dependency, and > someone has to go first - why should the Slony guys invest in adopting > this technology if it is going to necessitate using a forked Postgres > with an uncertain future? That would be (with respect to the Slony > guys) a commercial risk that is fairly heavily concentrated with > Afilias. Yep, there's something a bit too circular there. I'd also not be keen on reimplementing the "Slony integration" over and over if it turns out that the API churns for a while before stabilizing. That shouldn't be misread as "I expect horrible amounts of churn", just that *any* churn comes at a cost. And if anything unfortunate happens, that can easily multiply into a multiplicity of painfulness(es?). -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 10:03:40 PM Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > >> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: > >>> I think Robert is right that if Slony can't use the API, it is unlikely > >>> any other replication system could use it. > >> > >> I don't accept that. Clearly there is a circular dependency, and > >> someone has to go first - why should the Slony guys invest in adopting > >> this technology if it is going to necessitate using a forked Postgres > >> with an uncertain future? > > > > Clearly, core needs to go first. However, before we commit, I would > > like to hear the Slony guys say something like this: We read the > > documentation that is part of this patch and if the feature behaves as > > advertised, we believe we will be able to use it in place of the > > change-capture mechanism that we have now, and that it will be at > > least as good as what we have now if not a whole lot better. > > > > If they say something like "I'm not sure we have the right design for > > this" or "this wouldn't be sufficient to replace this portion of what > > we have now because it lacks critical feature X", I would be very > > concerned about that. > > The other point here is that core code without any implemented use-cases > is unlikely to be worth a tinker's damn. Regardless of what time-frame > the Slony guys are able to work on, I think we need to see working code > (of at least prototype quality) before we believe that we've got it > right. Or if not code from them, code from some other replication > project. FWIW we (as in 2ndq), unsurprisingly, have a user of this which is in development atm. > A possibly-useful comparison is to the FDW APIs we've been slowly > implementing over the past couple releases. Even though we *did* have > some use-cases written right off the bat, we got it wrong and had to > change it in 9.2, and I wouldn't bet against having to change it again > in 9.3 (even without considering the need for extensions for non-SELECT > operations). And, despite our very clear warnings that all that stuff > was in flux, people have been griping because the APIs changed. On the other hand, I don't think we would have FDWs today at all if it wouldn't have been done that way. So I really cannot see that as an indication of not working incrementally. Obviously thats not an argument for not trying to get the API correct right off the bat. I seriously hope the userlevel API continues to be simpler than what FDWs need. Regards, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Simon Riggs
Date:
On 15 October 2012 21:03, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >>> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >>>> I think Robert is right that if Slony can't use the API, it is unlikely >>>> any other replication system could use it. > >>> I don't accept that. Clearly there is a circular dependency, and >>> someone has to go first - why should the Slony guys invest in adopting >>> this technology if it is going to necessitate using a forked Postgres >>> with an uncertain future? > >> Clearly, core needs to go first. However, before we commit, I would >> like to hear the Slony guys say something like this: We read the >> documentation that is part of this patch and if the feature behaves as >> advertised, we believe we will be able to use it in place of the >> change-capture mechanism that we have now, and that it will be at >> least as good as what we have now if not a whole lot better. > >> If they say something like "I'm not sure we have the right design for >> this" or "this wouldn't be sufficient to replace this portion of what >> we have now because it lacks critical feature X", I would be very >> concerned about that. > > The other point here is that core code without any implemented use-cases > is unlikely to be worth a tinker's damn. Regardless of what time-frame > the Slony guys are able to work on, I think we need to see working code > (of at least prototype quality) before we believe that we've got it > right. Or if not code from them, code from some other replication > project. > > A possibly-useful comparison is to the FDW APIs we've been slowly > implementing over the past couple releases. Even though we *did* have > some use-cases written right off the bat, we got it wrong and had to > change it in 9.2, and I wouldn't bet against having to change it again > in 9.3 (even without considering the need for extensions for non-SELECT > operations). And, despite our very clear warnings that all that stuff > was in flux, people have been griping because the APIs changed. > > So if we ship core hooks for logical replication in advance of proof > that they're actually usable by at least one (preferably more than one) > replication project, I confidently predict that they'll be wrong and > will need revision and the potential users will complain about the > API instability. The prototype we showed at PgCon illustrated a working system, so we're on the right track. We've split that in two now, specifically to allow other projects to use what is being built. The exact API of that split is for discussion and has been massively redesigned on community advice for the sole purpose of including other approaches. We can't guarantee that external open source or commercial vendors will use the API. But we can say that in-core use cases exist for multiple approaches. We shouldn't put the decision on that in the hands of others. Jan spoke at length at PgCon, for all to hear, that what we are building is a much better way than the trigger logging approach Slony uses. I don't take that as carte blanche for approval of everything being done, but its going in the right direction with an open heart, which is about as good as it gets. There will be a working system again soon, once we have re-built things around the new API. The longer it takes to get a stable API the longer we take to rebuild things around it again. 
The goal of the project is to release everything open source, PGDG copyrighted and TPL licenced and to submit to core. We are signed up to that in various ways, not least of all our word given publicly. Please give this your backing, so an open source outcome can be possible. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote: > On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: > >> I think Robert is right that if Slony can't use the API, it is unlikely > >> any other replication system could use it. > > > > I don't accept that. Clearly there is a circular dependency, and > > someone has to go first - why should the Slony guys invest in adopting > > this technology if it is going to necessitate using a forked Postgres > > with an uncertain future? That would be (with respect to the Slony > > guys) a commercial risk that is fairly heavily concentrated with > > Afilias. > > Yep, there's something a bit too circular there. > > I'd also not be keen on reimplementing the "Slony integration" over > and over if it turns out that the API churns for a while before > stabilizing. That shouldn't be misread as "I expect horrible amounts > of churn", just that *any* churn comes at a cost. And if anything > unfortunate happens, that can easily multiply into a multiplicity of > painfulness(es?). Well, as a crosscheck, could you list your requirements? Do you need anything more than outputting data in a format compatible to whats stored in sl_log_*? You wouldn't have sl_actionseq, everything else should be there (Well, you would need to do lookups to get the tableid, but thats not really much of a problem). The results would be ordered in complete transactions, in commit order. I guess the other tables would stay as they are as they contain the "added value" of slony? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Christopher Browne
Date:
On Mon, Oct 15, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote: >> On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> > wrote: >> > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> >> I think Robert is right that if Slony can't use the API, it is unlikely >> >> any other replication system could use it. >> > >> > I don't accept that. Clearly there is a circular dependency, and >> > someone has to go first - why should the Slony guys invest in adopting >> > this technology if it is going to necessitate using a forked Postgres >> > with an uncertain future? That would be (with respect to the Slony >> > guys) a commercial risk that is fairly heavily concentrated with >> > Afilias. >> >> Yep, there's something a bit too circular there. >> >> I'd also not be keen on reimplementing the "Slony integration" over >> and over if it turns out that the API churns for a while before >> stabilizing. That shouldn't be misread as "I expect horrible amounts >> of churn", just that *any* churn comes at a cost. And if anything >> unfortunate happens, that can easily multiply into a multiplicity of >> painfulness(es?). > > Well, as a crosscheck, could you list your requirements? > > Do you need anything more than outputting data in a format compatible to whats > stored in sl_log_*? You wouldn't have sl_actionseq, everything else should be > there (Well, you would need to do lookups to get the tableid, but thats not > really much of a problem). The results would be ordered in complete > transactions, in commit order. Hmm. We need to have log data that's in a compatible ordering. We use sl_actionseq, and can mix data from multiple transactions together; if what you're providing is, instead, in order based on transaction commit order followed by some sequencing within each transaction, then that should be acceptable. The stylized query on sl_log_* looks like... select log_origin, log_txid, log_tableid, log_actionseq, log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols, log_cmdargs from %s.sl_log_%d where log_origin = %d How about I "quibble" about each of these: a) log_origin - this indicates the node from which the data originates. Presumably, this is implicit in a "chunk" of data that is coming in. b) log_txid - indicating the transaction ID. I presume you've got this available. It's less important with the WAL-based scheme in that we'd probably not be using it as a basis for querying as is the case today with Slony. c) log_tableid - indicating the ID of the table. Are you capturing an OID equivalent to this? Or what? d) log_actionseq - indicating relative sequences of updates. You don't have this, but if you're capturing commit ordering, we don't need it. e) log_tablenspname, log_tablerelname - some small amount of magic needful to get this. Or perhaps you are already capturing it? f) log_cmdtype - I/U/D/T - indicating the action (insert/update/delete/truncate). Hopefully you have something like this? g) log_cmdupdncols - for UPDATE action, the number of updated columns.Probably not mandatory; this was a new 2.1 thing... h) log_cmdargs - the actual data needed to do the I/U/D. The form of this matters a fair bit. Before Slony 2.1, this was a portion of a SQL statement, omitting the operation (provided in log_cmdtype) and the table name (in log_tablerelname et al). 
In Slony 2.1, this changes to be a text[] array that essentially consists of pairs of [column name, column value] values. I see one place, very notable in Slony 2.2, that would also be more complicated, which is the handling of DDL. In 2.1 and earlier, we handled DDL as "events", essentially out of band. This isn't actually correct; it could mix very badly if you had replication activity mixing with DDL requests. (More detail than you want is in a bug on this... <http://www.slony.info/bugzilla/show_bug.cgi?id=137>). In Slony 2.2, we added a third "log table" where DDL gets captured. sl_log_script has much the same schema as sl_log_{1,2}; it needs to get "mixed in" in compatible order. What I imagine would pointedly complicate life is if a single transaction contained both DDL and "regular replicable activity." Slony 2.2 mixes this in using XID + log_actionseq; how this would play out with your log capture mechanism isn't completely clear to me. That's the place where I'd expect the very messiest interaction. > I guess the other tables would stay as they are as they contain the "added > value" of slony? A fair bit of Slony is about the "event infrastructure," and some of that ceases to be as needful. The configuration bits probably continue to remain interesting. The parts that seem notably mysterious to me at the moment are: a) How do we pull result sets (e.g. - sl_log_* data)? b) How is the command data represented? c) If we have a need to mix together your 'raw logs' and other material (e.g. - our sl_log_script that captures DDL-like changes to be mixed back in), how easy|impossible is this? -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Andres Freund
Date:
On Tuesday, October 16, 2012 12:13:14 AM Christopher Browne wrote: > On Mon, Oct 15, 2012 at 4:51 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote: > >> On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <peter@2ndquadrant.com> > > > > wrote: > >> > On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: > >> >> I think Robert is right that if Slony can't use the API, it is > >> >> unlikely any other replication system could use it. > >> > > >> > I don't accept that. Clearly there is a circular dependency, and > >> > someone has to go first - why should the Slony guys invest in adopting > >> > this technology if it is going to necessitate using a forked Postgres > >> > with an uncertain future? That would be (with respect to the Slony > >> > guys) a commercial risk that is fairly heavily concentrated with > >> > Afilias. > >> > >> Yep, there's something a bit too circular there. > >> > >> I'd also not be keen on reimplementing the "Slony integration" over > >> and over if it turns out that the API churns for a while before > >> stabilizing. That shouldn't be misread as "I expect horrible amounts > >> of churn", just that *any* churn comes at a cost. And if anything > >> unfortunate happens, that can easily multiply into a multiplicity of > >> painfulness(es?). > > > > Well, as a crosscheck, could you list your requirements? > > > > Do you need anything more than outputting data in a format compatible to > > whats stored in sl_log_*? You wouldn't have sl_actionseq, everything > > else should be there (Well, you would need to do lookups to get the > > tableid, but thats not really much of a problem). The results would be > > ordered in complete transactions, in commit order. > > Hmm. We need to have log data that's in a compatible ordering. > > We use sl_actionseq, and can mix data from multiple transactions > together; if what you're providing is, instead, in order based on > transaction commit order followed by some sequencing within each > transaction, then that should be acceptable. Inside the transaction its sequenced by the order the XLogInsert calls were made which is the order the client sent the commands. That sounds like it should be compatible. > The stylized query on sl_log_* looks like... > > select log_origin, log_txid, log_tableid, > log_actionseq, log_tablenspname, > log_tablerelname, log_cmdtype, > log_cmdupdncols, log_cmdargs > from %s.sl_log_%d > where log_origin = %d > > How about I "quibble" about each of these: > > a) log_origin - this indicates the node from which the data > originates. Presumably, this is implicit in a "chunk" of data that is > coming in. I think we can just reuse whatever method you use in slony to get the current node's id to get it in the output plugin. > b) log_txid - indicating the transaction ID. I presume you've got > this available. It's less important with the WAL-based scheme in that > we'd probably not be using it as a basis for querying as is the case > today with Slony. Its directly available. The plugin will have to call txid_out, but thats obviously no problem. > c) log_tableid - indicating the ID of the table. Are you capturing an > OID equivalent to this? Or what? You get the TupleDesc of the table. > d) log_actionseq - indicating relative sequences of updates. You > don't have this, but if you're capturing commit ordering, we don't > need it. Good. > e) log_tablenspname, log_tablerelname - some small amount of magic > needful to get this. 
Or perhaps you are already capturing it? The relevant backend functions are available, so it's no problem (RelationGetNamespace(change->new_tuple->table)). > f) log_cmdtype - I/U/D/T - indicating the action > (insert/update/delete/truncate). Hopefully you have something like > this? Yes: enum ApplyCacheChangeType { APPLY_CACHE_CHANGE_INSERT, APPLY_CACHE_CHANGE_UPDATE, APPLY_CACHE_CHANGE_DELETE, .. } > g) log_cmdupdncols - for UPDATE action, the number of updated columns. > Probably not mandatory; this was a new 2.1 thing... Hm. NO. We don't have that. But then, you can just use normal C code there, so it shouldn't be too hard to compute if needed. > h) log_cmdargs - the actual data needed to do the I/U/D. The form of > this matters a fair bit. Before Slony 2.1, this was a portion of a > SQL statement, omitting the operation (provided in log_cmdtype) and > the table name (in log_tablerelname et al). In Slony 2.1, this > changes to be a text[] array that essentially consists of pairs of > [column name, column value] values. The existing C code to generate this should be copy&pasteable into this with a relatively small amount of changes. > I see one place, very notable in Slony 2.2, that would also be more > complicated, which is the handling of DDL. > > In 2.1 and earlier, we handled DDL as "events", essentially out of > band. This isn't actually correct; it could mix very badly if you had > replication activity mixing with DDL requests. (More detail than you > want is in a bug on this... > <http://www.slony.info/bugzilla/show_bug.cgi?id=137>). > > In Slony 2.2, we added a third "log table" where DDL gets captured. > sl_log_script has much the same schema as sl_log_{1,2}; it needs to > get "mixed in" in compatible order. What I imagine would pointedly > complicate life is if a single transaction contained both DDL and > "regular replicable activity." Slony 2.2 mixes this in using XID + > log_actionseq; how this would play out with your log capture mechanism > isn't completely clear to me. That's the place where I'd expect the > very messiest interaction. Hm. I expect some complications here as well. But then, unless you do something special, changes to those tables (e.g. sl_log_script) will be streamed out as well, so you could just insert events into their respective tables on the receiving side and deal with them there. > > I guess the other tables would stay as they are as they contain the > > "added value" of slony? > > A fair bit of Slony is about the "event infrastructure," and some of > that ceases to be as needful. The configuration bits probably > continue to remain interesting. Quite a bit of the event infrastructure seems to deal with configuration changes and such, all that's probably going to stay... > The parts that seem notably mysterious to me at the moment are: > > a) How do we pull result sets (e.g. - sl_log_* data)? The details of this are in a bit of flux as of now but I hope we will nail this down soon. You open a replication connection to the primary 'replication=1 dbname=...' and issue START_LOGICAL_REPLICATION slony $slot_id 0/DEADBEEF With 0/DEADBEEF being the location you've stopped getting changes the last time. That will start streaming out changes via the COPY protocol. The contents of what's streamed out are entirely up to you. The first time you start replication you need to do: INIT_LOGICAL_REPLICATION which will return a $slot_id, a SET TRANSACTION SNAPSHOT compatible snapshot and the initial starting LSN.
The 'slony' in START_LOGICAL_REPLICATION above denotes which output plugin is to be used. > b) How is the command data represented? Command data is the old/new tuple? That's up to the output plugin. You get a HeapTuple with the old/new tuple, and compatible TupleDesc's. You could simply stream out your current format for example. > c) If we have a need to mix together your 'raw logs' and other > material (e.g. - our sl_log_script that captures DDL-like changes to > be mixed back in), how easy|impossible is this? As described above, in general that seems easy enough. Just insert data into e.g. sl_log_script and when you receive the changes on the other side decide in which table to redirect those. Where I see a bit of a problem is the handling of replication sets, configuration and similar. Currently there is a dichotomy between 'catalog tables' and 'data tables'. The former are not replicated but can be queried in an output plugin (that's the timetravel part). The latter are replicated but cannot be queried. All system catalog tables are in the 'catalog' category by their nature, but I have played with a system column that allows other tables to be treated as catalog tables as well. If you would want to filter data on the source - which probably makes sense? - you currently would need to have such an additional catalog table which is not replicated but can be queried by the output plugin. But I guess the contents of that table would also need to be replicated... I wonder if we should replicate changes to such user-defined catalog tables as well; that should be relatively easy, and if it's not wanted the output plugin can easily filter that (if (class_form->relusercat)). Does that clear things up? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
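To make the tuple-handling side of this concrete, here is a minimal sketch (in the spirit of, but not taken from, the patch's output plugin interface) of how a plugin handed a HeapTuple plus its TupleDesc could emit Slony-style [column name, column value] pairs. The function name and the assumption that it is invoked from an apply_change-style callback are made up for illustration only.

/*
 * Illustrative sketch only -- not the patch's actual API.  Renders the new
 * tuple of an INSERT as "colname=value" pairs, roughly what sl_log_*'s
 * log_cmdargs carries in Slony 2.1+.
 */
#include "postgres.h"
#include "access/htup_details.h"	/* heap_deform_tuple(); access/htup.h on older trees */
#include "access/tupdesc.h"
#include "fmgr.h"					/* OidOutputFunctionCall() */
#include "lib/stringinfo.h"
#include "utils/lsyscache.h"		/* getTypeOutputInfo() */

static void
format_new_tuple(StringInfo out, TupleDesc desc, HeapTuple newtup)
{
	Datum	   *values = palloc(desc->natts * sizeof(Datum));
	bool	   *nulls = palloc(desc->natts * sizeof(bool));
	bool		first = true;
	int			i;

	heap_deform_tuple(newtup, desc, values, nulls);

	for (i = 0; i < desc->natts; i++)
	{
		/* on 9.2-era sources this would be desc->attrs[i] rather than TupleDescAttr() */
		Form_pg_attribute attr = TupleDescAttr(desc, i);
		Oid			outfunc;
		bool		isvarlena;

		if (attr->attisdropped)
			continue;

		getTypeOutputInfo(attr->atttypid, &outfunc, &isvarlena);

		if (!first)
			appendStringInfoString(out, ", ");
		first = false;

		/* one [column name, column value] pair, NULLs rendered explicitly */
		appendStringInfo(out, "%s=%s",
						 NameStr(attr->attname),
						 nulls[i] ? "NULL" : OidOutputFunctionCall(outfunc, values[i]));
	}
}

An UPDATE would do the same thing twice, once for the changed columns and once for the old tuple's key columns, which is where Slony's log_cmdupdncols split comes back into play.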
> The design Andres and Simon have advanced already eliminates a lot of > the common failure cases (now(), random(), nextval()) suffered by pgPool > and similar tools. But remember, this feature doesn't have to be Well, pgpool-II already solved the now() case by using a query rewriting technique. The technique could be applied to random() as well but I'm not convinced it is worth the trouble. nextval() would be a little harder because pgpool needs assistance from PostgreSQL core. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Steve Singer
Date:
On 12-10-15 04:51 PM, Andres Freund wrote: > > Well, as a crosscheck, could you list your requirements? > > Do you need anything more than outputting data in a format compatible to whats > stored in sl_log_*? You wouldn't have sl_actionseq, everything else should be > there (Well, you would need to do lookups to get the tableid, but thats not > really much of a problem). The results would be ordered in complete > transactions, in commit order. > > I guess the other tables would stay as they are as they contain the "added > value" of slony? > > Greetings, I actually had spent some time a few weeks ago looking over the documents and code. I never did get around to writing a review as elegant as Peter's. I have not seen any red flags that make me think that what you're proposing wouldn't be suitable for slony, but sometimes you don't see details until you start implementing something. My initial approach to modifying slony to work with this might be something like: * Leave sl_event as is for non-SYNC events; slon would still generate SYNC events in sl_event * We would modify the remote_worker thread in slon so that instead of selecting from sl_event it would get the next 'committed' transaction from your apply cache. For each ApplyChange record we would check to see if it is an insert into sl_event; if so we would trigger our existing event processing logic based on the contents of the ev_type column. * If the change involves an insert/update/delete/truncate to a replicated table we would translate that change into SQL and apply it on the replica; we would not commit changes on the replica until we encounter a SYNC being added to sl_event for the current origin. * SQL will be applied in a slightly different order than slony does today. Today if two concurrent transactions are inserting into the same replicated table and they commit one after the other there is a good chance that the apply order on the replica will also be intermixed (assuming both commits were in between two SYNC events). My thinking is that we would just replay them one after the other on the replica in commit order. (Slony doesn't use commit order because we don't have it, not because we don't like it.) This would mean we do away with tracking the action id. * If a node is configured as a 'forwarder' node it would store the processed output of each ApplyChange record in a table on the replica. If a slon is pulling data from a non-origin (i.e. if remoteWorkerThread_1 is pulling data from node 2) then it would need to query this table instead of calling the functions that process the ApplyCache contents. * To subscribe a node we would generate a SYNC event on the provider and do the copy_set. We would keep track of that SYNC event. The remote worker would then ignore any data that comes before that SYNC event when it starts pulling data from the apply cache. * DDL events in 2.2+ go into sl_ddl_script (or something like that); when we see INSERT commands to that table we would know to then apply the DDL on the node. * We would need to continue to populate sl_confirm because knowing what SYNC events have already been processed by a node is pretty important in a MOVE SET or FAILOVER. It is possible that we might need to still track the xip lists of each SYNC for MOVE SET/FAILOVER but I'm not sure why/why not. This is all easier said than implemented. Steve > Andres
Hi All, On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: > The design document [2] really just explains the problem (which is the > need for catalog metadata at a point in time to make sense of heap > tuples), without describing the solution that this patch offers with > any degree of detail. Rather, [2] says "How we build snapshots is > somewhat intricate and complicated and seems to be out of scope for > this document", which is unsatisfactory. I look forward to reading the > promised document that describes this mechanism in more detail. Here's the first version of the promised document. I hope it answers most of the questions. Input welcome! Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10/15/2012 4:43 PM, Simon Riggs wrote: > Jan spoke at length at PgCon, for all to hear, that what we are > building is a much better way than the trigger logging approach Slony > uses. I don't take that as carte blanche for approval of everything > being done, but its going in the right direction with an open heart, > which is about as good as it gets. The mechanism you are building for capturing changes is certainly a lot better than what Bucardo, Londiste and Slony are doing today. That much is true. The flip side of the coin however is that all of today's logical replication systems are designed Postgres version agnostic to a degree. This means that the transition time from the existing, trigger based approach to the new WAL based mechanism will see both technologies in parallel, which is no small thing to support. And that transition time may last for a good while. We still have people installing Slony 1.2 because 2.0 (3 years old by now) requires Postgres 8.3 minimum. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 10/15/2012 3:25 PM, Andres Freund wrote: > On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote: >> On 15 October 2012 19:19, Bruce Momjian <bruce@momjian.us> wrote: >> > I think Robert is right that if Slony can't use the API, it is unlikely >> > any other replication system could use it. >> >> I don't accept that. Clearly there is a circular dependency, and >> someone has to go first - why should the Slony guys invest in adopting >> this technology if it is going to necessitate using a forked Postgres >> with an uncertain future? > > Well. I don't think (hope) anybody proposed making something release worthy for > slony but rather a POC patch that proofs the API is generic enough to be used > by them. If I (or somebody else familiar with this) work together with somebody > familiar with with slony internals I think such a POC shouldn't be too hard to > do. > I think some more input from that side is a good idea. I plan to send out an > email to possibly interested parties in about two weeks... What Slony essentially sends to the receiver node is a COPY stream in the format Christopher described. That stream is directly copied into the receiving node's sl_log_N table and picked up there by an apply trigger BEFORE INSERT that performs the corresponding INSERT/UPDATE/DELETE operation via prepared plans to the user tables. For a POC I think it is sufficient to demonstrate that this copy stream can be generated out of the WAL decoding. Note that Slony today does not touch columns in an UPDATE that have not changed in the original UPDATE on the origin. Sending toasted column values that haven't changed would be a substantial change to the storage efficiency on the replica. The consequence of this is that the number of columns that need to be in the UPDATE's SET clause varies. The log_cmdupdncols is there to separate the new column/value pairs from the column/key pairs of the updated row. The old row "key" in Slony is based on a unique index (preferably a PK, but any unique key will do). This makes cmdupdncols simply the number of column/value pairs minus the number of key columns. So it isn't too hard to figure out. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
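Spelled out as (hypothetical) code, the arithmetic described above is just the following; the function and parameter names are made up for illustration, with npairs being however many column/value pairs the plugin has emitted for the UPDATE.

/*
 * Sketch only, not Slony source: log_cmdargs pairs for an UPDATE are laid
 * out as [changed columns ...][key columns ...], so log_cmdupdncols is the
 * size of the first group, i.e. total pairs minus the key columns.
 */
static int
compute_cmdupdncols(int npairs, int nkeycols)
{
	return npairs - nkeycols;
}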
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Peter Geoghegan
Date:
On 16 October 2012 15:26, Jan Wieck <JanWieck@yahoo.com> wrote: > This means that the transition time from the existing, trigger based > approach to the new WAL based mechanism will see both technologies in > parallel, which is no small thing to support. So, you're talking about a shim between the two in order to usefully support inter-version replication, or are you just thinking about making a clean break in compatibility for Postgres versions prior to 9.3 in a new release branch? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: >> The design document [2] really just explains the problem (which is the >> need for catalog metadata at a point in time to make sense of heap >> tuples), without describing the solution that this patch offers with >> any degree of detail. Rather, [2] says "How we build snapshots is >> somewhat intricate and complicated and seems to be out of scope for >> this document", which is unsatisfactory. I look forward to reading the >> promised document that describes this mechanism in more detail. > > Here's the first version of the promised document. I hope it answers most of > the questions. > > Input welcome! I haven't grokked all of this in its entirety, but I'm kind of uncomfortable with the relfilenode -> OID mapping stuff. I'm wondering if we should, when logical replication is enabled, find a way to cram the table OID into the XLOG record. It seems like that would simplify things. If we don't choose to do that, it's worth noting that you actually need 16 bytes of data to generate a unique identifier for a relation, as in database OID + tablespace OID + relfilenode# + backend ID. Backend ID might be ignorable because WAL-based logical replication is going to ignore temporary relations anyway, but you definitely need the other two. There's nothing, for example, to keep you from having two relations with the same value in pg_class.relfilenode in the same database but in different tablespaces. It's unlikely to happen, because for new relations we set OID = relfilenode, but a subsequent rewrite can bring it about if the stars align just right. (Such situations are, of course, a breeding ground for bugs, which might make you question whether our current scheme for assigning relfilenodes has much of anything to recommend it.) Another thing to think about is that, like catalog snapshots, relfilenode mappings have to be time-relativized; that is, you need to know what the mapping was at the proper point in the WAL sequence, not what it is now. In practice, the risk here seems to be minimal, because it takes a while to churn through 4 billion OIDs. However, I suspect it pays to think about this fairly carefully because if we do ever run into a situation where the OID counter wraps during a time period comparable to the replication lag, the bugs will be extremely difficult to debug. Anyhow, adding the table OID to the WAL header would chew up a few more bytes of WAL space, but it seems like it might be worth it to avoid having to think very hard about all of these issues. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
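For reference, the fields Robert enumerates here are the ones core already groups into RelFileNode plus a backend ID, roughly as defined in src/include/storage/relfilenode.h (the comments below are mine, not the header's), which is where the 16 bytes come from:

typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace OID */
	Oid			dbNode;			/* database OID */
	Oid			relNode;		/* relation: pg_class.relfilenode */
} RelFileNode;

typedef struct RelFileNodeBackend
{
	RelFileNode node;
	BackendId	backend;		/* InvalidBackendId except for temp relations */
} RelFileNodeBackend;

/* 3 OIDs + 1 BackendId = 16 bytes to identify a relation's storage uniquely */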
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Christopher Browne
Date:
On Thu, Oct 18, 2012 at 9:49 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 16 October 2012 15:26, Jan Wieck <JanWieck@yahoo.com> wrote: >> This means that the transition time from the existing, trigger based >> approach to the new WAL based mechanism will see both technologies in >> parallel, which is no small thing to support. > > So, you're talking about a shim between the two in order to usefully > support inter-version replication, or are you just thinking about > making a clean break in compatibility for Postgres versions prior to > 9.3 in a new release branch? It's early to assume either. In Slony 2.0, we accepted that we were breaking compatibility with versions of Postgres before 8.3; we accepted that because there were considerable 'manageability' benefits (e.g. - system catalogues no longer hacked up, so pg_dump works against all nodes, and some dramatically reduced locking). But that had the attendant cost that we have had to continue fixing bugs on 1.2, to a degree, even until now, because people on Postgres versions earlier than 8.3 have no way to use version 2.0. Those merits and demerits apply pretty clearly to this. It would be somewhat attractive for a "version 2.3" (or, more likely, to indicate the break from earlier versions, "3.0" to make the clean break to the new-in-PG-9.3 facilities. It is attractive in that we could: a) Safely remove the trigger-based log capture apparatus (or, at least, I'm assuming so), and b) Consciously upgrade to take advantage of all the latest cool stuff found in Postgres 9.3. (I haven't got any particular features in mind; perhaps we add RANGE comparators for xid to 9.3, and make extensive use of xid_range types? That would be something that couldn't reasonably get hacked to work in anything before 9.2...) c) Drop out any special cases having to do with support of versions 8.3, 8.4, 9.0, 9.1, and 9.2. But, of course, we'd be leaving everyone running 8.3 thru 9.2 behind, if we did so, and would corresponding shackle ourselves to need to support the 2.x branches for still longer. And this would mean that this Slony "3.0" would expressly NOT support one of our intended use cases, namely to support upgrading from elder versions of Postgres. A "shim" adds complexity, but retains the "upgrade across versions" use case, and reduces the need to keep supporting elder versions of Slony. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote: > On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: > >> The design document [2] really just explains the problem (which is the > >> need for catalog metadata at a point in time to make sense of heap > >> tuples), without describing the solution that this patch offers with > >> any degree of detail. Rather, [2] says "How we build snapshots is > >> somewhat intricate and complicated and seems to be out of scope for > >> this document", which is unsatisfactory. I look forward to reading the > >> promised document that describes this mechanism in more detail. > > > > Here's the first version of the promised document. I hope it answers most > > of the questions. > > > > Input welcome! > > I haven't grokked all of this in its entirety, but I'm kind of > uncomfortable with the relfilenode -> OID mapping stuff. I'm > wondering if we should, when logical replication is enabled, find a > way to cram the table OID into the XLOG record. It seems like that > would simplify things. > > If we don't choose to do that, it's worth noting that you actually > need 16 bytes of data to generate a unique identifier for a relation, > as in database OID + tablespace OID + relfilenode# + backend ID. > Backend ID might be ignorable because WAL-based logical replication is > going to ignore temporary relations anyway, but you definitely need > the other two. ... Hm. I should take look at the way temporary tables are represented. As you say I is not going to matter for WAL decoding, but still... > Another thing to think about is that, like catalog snapshots, > relfilenode mappings have to be time-relativized; that is, you need to > know what the mapping was at the proper point in the WAL sequence, not > what it is now. In practice, the risk here seems to be minimal, > because it takes a while to churn through 4 billion OIDs. However, I > suspect it pays to think about this fairly carefully because if we do > ever run into a situation where the OID counter wraps during a time > period comparable to the replication lag, the bugs will be extremely > difficult to debug. I think with a rollbacks + restarts we might even be able to see the same relfilenode earlier. > Anyhow, adding the table OID to the WAL header would chew up a few > more bytes of WAL space, but it seems like it might be worth it to > avoid having to think very hard about all of these issues. I don't think its necessary to change wal logging here. The relfilenode mapping is now looked up using the timetravel snapshot we've built using (spcNode, relNode) as the key, so the time-relativized lookup is "builtin". If we screw that up way much more is broken anyway. Two problems are left: 1) (reltablespace, relfilenode) is not unique in pg_class because InvalidOid is stored for relfilenode if its a shared or nailed table. That not a problem for the lookup because weve already checked the relmapper before that, so we never look those up anyway. But it violates documented requirements of syscache.c. Even after some looking I haven't found any problem that that could cause. 2) We need to decide whether a HEAP[1-2]_* record did catalog changes when building/updating snapshots. Unfortunately we also need to do this *before* we built the first snapshot. For now treating all tables as catalog modifying before we built the snapshot seems to work fine. 
I think encoding the oid in the xlog header wouldn't help all that much here, because I am pretty sure we want to have the set of "catalog tables" to be extensible at some point... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
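As a sketch of the lookup order being described here (the RELFILENODE syscache comes from patch 05 and is not in core, and the function name below is made up; any catalog scan behind the syscache is assumed to run under the decoding backend's timetravel snapshot):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_tablespace.h"	/* GLOBALTABLESPACE_OID, DEFAULTTABLESPACE_OID */
#include "storage/relfilenode.h"
#include "utils/relmapper.h"		/* RelationMapFilenodeToOid() */
#include "utils/syscache.h"

/*
 * Illustrative only.  Shared and nailed relations have relfilenode = 0 in
 * pg_class, so the relmapper is consulted first; everything else goes
 * through the (reltablespace, relfilenode) syscache added by the patch.
 */
static Oid
relfilenode_to_reloid(RelFileNode *rfn)
{
	Oid			reloid;
	Oid			spc;
	HeapTuple	tuple;

	reloid = RelationMapFilenodeToOid(rfn->relNode,
									  rfn->spcNode == GLOBALTABLESPACE_OID);
	if (OidIsValid(reloid))
		return reloid;

	/* pg_class stores reltablespace = 0 for the default tablespace */
	spc = (rfn->spcNode == DEFAULTTABLESPACE_OID) ? InvalidOid : rfn->spcNode;

	tuple = SearchSysCache2(RELFILENODE,		/* syscache id from the patch, not core */
							ObjectIdGetDatum(spc),
							ObjectIdGetDatum(rfn->relNode));
	if (!HeapTupleIsValid(tuple))
		return InvalidOid;

	reloid = HeapTupleGetOid(tuple);	/* pg_class rows still carried OIDs in this era */
	ReleaseSysCache(tuple);
	return reloid;
}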
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
From
Peter Geoghegan
Date:
On 18 October 2012 16:18, Christopher Browne <cbbrowne@gmail.com> wrote: > A "shim" adds complexity, but retains the "upgrade across versions" > use case, and reduces the need to keep supporting elder versions of > Slony. Right. Upgrading across major versions is likely to continue to remain a very important use-case for Slony. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From
Alvaro Herrera
Date:
This patch doesn't seem to be going anywhere, sadly. Since we're a bit late in the commitfest and this patch hasn't seen any activity for a long time, I'll mark it as returned-with-feedback. I hope one or both versions are resubmitted (with additional fixes?) for the next commitfest, and that the discussion continues to determine which of the two approaches is the best. Thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 18, 2012 at 11:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote: >> On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> > wrote: >> > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: >> >> The design document [2] really just explains the problem (which is the >> >> need for catalog metadata at a point in time to make sense of heap >> >> tuples), without describing the solution that this patch offers with >> >> any degree of detail. Rather, [2] says "How we build snapshots is >> >> somewhat intricate and complicated and seems to be out of scope for >> >> this document", which is unsatisfactory. I look forward to reading the >> >> promised document that describes this mechanism in more detail. >> > >> > Here's the first version of the promised document. I hope it answers most >> > of the questions. >> > >> > Input welcome! >> >> I haven't grokked all of this in its entirety, but I'm kind of >> uncomfortable with the relfilenode -> OID mapping stuff. I'm >> wondering if we should, when logical replication is enabled, find a >> way to cram the table OID into the XLOG record. It seems like that >> would simplify things. >> >> If we don't choose to do that, it's worth noting that you actually >> need 16 bytes of data to generate a unique identifier for a relation, >> as in database OID + tablespace OID + relfilenode# + backend ID. >> Backend ID might be ignorable because WAL-based logical replication is >> going to ignore temporary relations anyway, but you definitely need >> the other two. ... > > Hm. I should take look at the way temporary tables are represented. As you say > I is not going to matter for WAL decoding, but still... > >> Another thing to think about is that, like catalog snapshots, >> relfilenode mappings have to be time-relativized; that is, you need to >> know what the mapping was at the proper point in the WAL sequence, not >> what it is now. In practice, the risk here seems to be minimal, >> because it takes a while to churn through 4 billion OIDs. However, I >> suspect it pays to think about this fairly carefully because if we do >> ever run into a situation where the OID counter wraps during a time >> period comparable to the replication lag, the bugs will be extremely >> difficult to debug. > > I think with a rollbacks + restarts we might even be able to see the same > relfilenode earlier. > >> Anyhow, adding the table OID to the WAL header would chew up a few >> more bytes of WAL space, but it seems like it might be worth it to >> avoid having to think very hard about all of these issues. > > I don't think its necessary to change wal logging here. The relfilenode mapping > is now looked up using the timetravel snapshot we've built using (spcNode, > relNode) as the key, so the time-relativized lookup is "builtin". If we screw > that up way much more is broken anyway. > > Two problems are left: > > 1) (reltablespace, relfilenode) is not unique in pg_class because InvalidOid is > stored for relfilenode if its a shared or nailed table. That not a problem for > the lookup because weve already checked the relmapper before that, so we never > look those up anyway. But it violates documented requirements of syscache.c. > Even after some looking I haven't found any problem that that could cause. > > 2) We need to decide whether a HEAP[1-2]_* record did catalog changes when > building/updating snapshots. 
Unfortunately we also need to do this *before* we > built the first snapshot. For now treating all tables as catalog modifying > before we built the snapshot seems to work fine. > I think encoding the oid in the xlog header wouln't help all that much here, > because I am pretty sure we want to have the set of "catalog tables" to be > extensible at some point... I don't like catalog-only snapshots at all. I think that's just a recipe for subtle or not-so-subtle breakage down the road... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Friday, October 19, 2012 06:38:30 PM Robert Haas wrote: > On Thu, Oct 18, 2012 at 11:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > 2) We need to decide whether a HEAP[1-2]_* record did catalog changes > > when building/updating snapshots. Unfortunately we also need to do this > > *before* we built the first snapshot. For now treating all tables as > > catalog modifying before we built the snapshot seems to work fine. > > I think encoding the oid in the xlog header wouldn't help all that much > > here, because I am pretty sure we want to have the set of "catalog > > tables" to be extensible at some point... > > I don't like catalog-only snapshots at all. I think that's just a > recipe for subtle or not-so-subtle breakage down the road... I really don't see this changing for now. At some point we need to draw the line, otherwise we will never ever get anywhere. Naively building a snapshot that can access normal tables is just too expensive because it changes all the time. Sure, obvious breakage will be there if you have a datatype that accesses other user tables during decoding (as we talked about previously). But subtle breakage should be easily catchable by simply not allowing user relations to be opened. If an extension needs this it will have to mark those tables as catalog tables for now. I find this to be a reasonable restriction. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote: > Here's the first version of the promised document. I hope it answers most of > the questions. This makes for interesting reading. So, I've taken a closer look at the snapshot building code in light of this information. What follows are my thoughts on that aspect of the patch (in particular, the merits of snapshot time-travel as a method of solving the general problem of making sense of WAL during what I'll loosely call "logical recovery", performed by what one design document [1] calls an "output plugin"), which I previously skirted over somewhat. You can think of this review as a critical exposition of what happens during certain steps of my high-level schematic of "how this all fits together", which covers this entire patchset (this comes under my remit as the reviewer of one of the most important and complex patches in that patchset, "0008-Introduce-wal-decoding-via-catalog-timetravel.patch"), as described in my original, high-level review [2]. To recap on what those steps are:

***SNIP*** (from within plugin, currently about to decode a record for the first time)

9. During the first call (within the first record within a call to decode_xlog()), we allocate a snapshot reader.

10. Builds snapshot callback. This scribbles on our snapshot state, which essentially encapsulates a snapshot. The state (and snapshot) changes continually, once per call.

11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an XLogRecord. Decodes based on record type. Let's assume it's an XLOG_HEAP_INSERT.

12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We store the tuple metadata in our ApplyCache (some ilists, somewhere, each corresponding to an XID). We don't store the relation oid, because we don't know it yet (only relfilenode is known from WAL).

13. We're back in XLogReader(). It calls the only callback of interest to us covered in step 3 (and not of interest to XlogReader()/Heikki) – decode_change(). It does this through the apply_cache.apply_change hook. This happens because we encounter another record, this time a commit record (in the same codepath as discussed in step 12).

14. In decode_change(), the actual function that raises the interesting WARNINGs within Andres' earlier example [3], showing actual integer/varchar Datum values for tuples previously inserted. Resolve table oid based on relfilenode (albeit unsatisfactorily). Using a StringInfo, tupledescs, syscache and typcache, build WARNING string.

(No further steps. Aside: If I've done a good job of writing my initial review [2], I should be able to continually refer back to this as I drill down on other steps in later reviews.) It's fairly obvious why steps 9 and 10 are interesting to us here. Step 11 (the first call of SnapBuildCallback() - this is a bit of a misnomer, since the function isn't ever called through a pointer) is where the visibility-wise decision to decode a particular XLogRecord/XLogRecordBuffer occurs. Here is how the patch describes our reasons for needing this call (curiously, this comment appears not above SnapBuildCallback() itself, but above the decode.c call of said function, which may hint at a lapse in separation of concerns): + /*--------- + * Call the snapshot builder.
It needs to be called before we analyze + * tuples for two reasons: + * + * * Only in the snapshot building logic we know whether we have enough + * information to decode a particular tuple + * + * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be + * added to the ApplyCache before we add tuples using them + *--------- + */ Step 12 is a step that I'm not going into for this review. That's all decoding related. Step 13 is not really worthy of separate consideration here, because it just covers what happens when we call DecodeCommit() within decode.c , where text representations of tuples are ultimately printed in simple elog WARNINGs, as in Andres' original example [3], due to the apply_cache.apply_change hook being called. Step 14 *is* kind-of relevant, because this is one place where we resolve relation OID based on relfilenode, which is part of snapbuild.c, since it has a LookupTableByRelFileNode() call (the only other such call is within snapbuild.c itself, as part of checking if a particular relation + xid have had catalogue changes, which can be a concern due to DDL, which is the basic problem that snapbuild.c seeks to solve). Assuming that it truly is necessary to have a LookupTableByRelFileNode() function, I don't think your plugin (which will soon actually be a contrib module, right?) has any business calling it. Rather, this should all be part of some higher-level abstraction, probably within applycache, that spoonfeeds your example contrib module changesets without it having to care about system cache invalidation and what-not. As I've already noted, snapbuild.c (plus one or two other isolated places) have rather heavy-handed and out of place InvalidateSystemCaches() calls like these: + HeapTuple + LookupTableByRelFileNode(RelFileNode *relfilenode) + { + Oid spc; + + InvalidateSystemCaches(); However, since you've privately told me that your next revision will address the need to execute sinval messages granularly, rather than using this heavy-handed kludge, I won't once again stress the need to do better. If I've understood correctly, you've indicated that this can be done by processing the shared invalidation messages in each xl_xact_commit (i.e. each XLOG_XACT_COMMIT entry) during decoding (which I guess means that decoding's reponsibility's have been widened a bit, but that's still the place to do it - within the decoding "dispatcher"/glue code). Apparently we can expect this revision in the next couple of days. Thankfully, I *think* that this implies that plugins don't need to directly concern themselves with cache invalidation. By the way, shouldn't this use InvalidOid?: + /* + * relations in the default tablespace are stored with a reltablespace = 0 + * for some reason. + */ + spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ? + 0 : relfilenode->spcNode; The objectionable thing about having your “plugin” handle cache invalidation, apart from the fact that, as we all agree, the way you're doing it is just not acceptable, is that your plugin is doing it directly *at all*. You need to analyse the requirements of the universe of possible logical changeset subscriber plugins, and whittle them down to a lowest common denominator interface that likely generalises cache invalidation, and ideally doesn't require plugin authors to even know what a relfilenode or syscache is – some might say this shouldn't be a priority, but I take the view that it should. 
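For what it's worth, the granular alternative being asked for above would amount to something like the following sketch in the decoding glue code. It assumes the decoder has already located the SharedInvalidationMessage array appended to the xl_xact_commit record it is processing, and that LocalExecuteInvalidationMessage() in inval.c is (or is made) available for this purpose; neither the function name below nor its caller exists in the patch as posted.

#include "postgres.h"
#include "storage/sinval.h"		/* SharedInvalidationMessage */
#include "utils/inval.h"		/* LocalExecuteInvalidationMessage(), assumed exported */

/*
 * Sketch only: replay a commit record's sinval messages locally instead of
 * calling InvalidateSystemCaches() wholesale.
 */
static void
execute_commit_invalidations(SharedInvalidationMessage *msgs, int nmsgs)
{
	int			i;

	for (i = 0; i < nmsgs; i++)
		LocalExecuteInvalidationMessage(&msgs[i]);
}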
Exposing innards like this to these plugins is wrong headed, and to do so will likely paint us into a corner. Now, I grant that catcache + relcache can only be considered private innards in a relative sense with Postgres, and indeed a few contrib modules do use syscache directly for simple stuff, like hstore, which needs syscache + typcache for generating text representations in a way that is not completely unlike what you have here. Still, my feeling is that all the heavy lifting should be encapsulated elsewhere, in core. I think that you could easily justify adding another module/translation unit that is tasked with massaging this data in a form amenable to serving the needs of client code/plugins. If you don't get things quite right, plugin authors can still do it all for themselves just as well. I previously complained about having to take a leap of faith in respect of xmin horizon handling [2]. This new document also tries to allay some of those concerns. It says: > == xmin Horizon Handling == > > Reusing MVCC for timetravel access has one obvious major problem: > VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also > obviously, we want autovacuum/manual vacuum to work as before. > > The idea here is to reuse the infrastrcuture built for hot_standby_feedback > which allows us to keep the xmin horizon of a walsender backend artificially > low. We keep it low enough so we can restart decoding from the last location > the client has confirmed to be safely received. The means that we keep it low > enough to contain the last checkpoints oldestXid value. These ideas are still underdeveloped. For one thing, it seems kind of weak to me that we're obliged to have vacuum grind to a halt across the cluster because some speculative plugin has its proc's xmin pegged to some value due to a logical standby still needing it for reading catalogues only. Think of the failure modes – what happens when the standby participating in a plugin-based logical replication system croaks for indeterminate reasons? Doing this with the WAL sender as part of hot_standby_feedback is clearly much less hazardous, because there *isn't* a WAL sender when streaming replication isn't active in respect of some corresponding standby, and hot_standby_feeback need only support deferring vacuum for the streaming replication standby whose active snapshot's xmin is furthest in the past. If a standby is taken out of commission for an hour, it can probably catch up without difficulty (difficulty like needing a base-backup), and it has no repercussions for the master or anyone else. However, I believe that logical change records would be rendered meaningless in the same scenario, because VACUUM, having not seen a “pegged” xmin horizon due to the standby's unavailability, goes ahead and vacuums past a point still needed to keep the standby in a consistent state. Maybe you can invent a new type of xmin horizon that applies only to catalogues so this isn't so bad, and I see you've suggested as much in follow-up mail to Robert, but it might be unsatisfactory to have that impact the PGAXT performance optimisation introduced in commit ed0b409d, or obfuscate that code. You could have the xmin be a special xmin, say though PGAXT.vacuumFlags indicating this, but wouldn't that necessarily preclude the backend from having a non-special notion of its xmin? How does that bode for this actually becoming maximally generic infrastructure? 
You do have some ideas about how to re-sync a speculative in-core logical replication system standby, and indeed you talk about this in the design document posted a few weeks back [1] (4.8. Setup of replication nodes). This process is indeed similar to a base backup, and we'd hope to avoid having to do it for similar reasons - it would be undesirable to have to do it much more often in practice due to these types of concerns. You go on to say: > That also means we need to make that value persist across restarts/crashes in a > very similar manner to twophase.c's. That infrastructure actually also useful > to make hot_standby_feedback work properly across primary restarts. So here you anticipating my criticism above about needing to peg the xmin horizon. That's fine, but I still don't know yet is how you propose to make this work in a reasonable way, free from all the usual caveats about leaving prepared transactions open for an indefinitely long time (what happens when we need to XID wraparound?, how does the need to hold a transaction open interfere with a given plugin backend's ability to execute regular queries?, etc). Furthermore, I don't know how it's going to be independently useful to make hot_standby_feedback work across primary restarts, because the primary cannot then generate VACUUM cleanup WAL records that the standby might replay, resulting in a hard conflict. Maybe there's something incredibly obvious that I'm missing, but doesn't that problem almost take care of itself? Granted, those cleanup records aren't the only reason for a hard conflict, but they've got to be by far the most important, and are documented as such. Either the cleanup record already exists and you're usually already out of luck anyway, or it doesn't and you're not. Are you thinking about race conditions? You talk about the relfilenode/oid mapping problem some: > == Catalog/User Table Detection == > > To detect whether a record/transaction does catalog modifications - which we > need to do for memory/performance reasons - we need to resolve the > RelFileNode's in xlog records back to the original tables. Unfortunately > RelFileNode's only contain the tables relfilenode, not their table oid. We only > can do catalog access once we reached FULL_SNAPSHOT, before that we can use > some heuristics but otherwise we have to assume that every record changes the > catalog. What exactly are the implications of having to assume catalogue changes? I see that right now, you're just skipping actual decoding once you've taken care of snapshot building chores within SnapBuildCallback(): + if (snapstate->state == SNAPBUILD_START) + return SNAPBUILD_SKIP; > The heuristics we can use are: > * relfilenode->spcNode == GLOBALTABLESPACE_OID > * relfilenode->relNode <= FirstNormalObjectId > * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid > > Those detect some catalog tables but not all (think VACUUM FULL), but if they > detect one they are correct. If the heuristics aren't going to be completely reliable, why is that acceptable? You've said it "seems to work fine", but I don't quite follow. I see this within SnapBuildCallback() (the function whose name I said was a misnomer). > After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us > a table might not be a catalog table. For that we use the new RELFILENODE > syscache with (spcNode, relNode). I share some of Robert's concerns here. The fact that relfilenode mappings have to be time-relativised may tip the scales against this approach. 
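Spelled out, the three quoted heuristics amount to something like this sketch (the function name is mine). Note the asymmetry: a true result is trustworthy, but a false result does not prove the relation is not a catalog, since a non-mapped catalog that has been rewritten by VACUUM FULL gets a new, high relfilenode and is missed here - which is exactly why everything has to be treated as catalog-modifying until a full snapshot exists.

#include "postgres.h"
#include "access/transam.h"			/* FirstNormalObjectId */
#include "catalog/pg_tablespace.h"	/* GLOBALTABLESPACE_OID */
#include "storage/relfilenode.h"
#include "utils/relmapper.h"		/* RelationMapFilenodeToOid() */

static bool
heuristic_is_catalog_relation(RelFileNode *rfn)
{
	/* anything in the global tablespace is a shared catalog */
	if (rfn->spcNode == GLOBALTABLESPACE_OID)
		return true;

	/* relfilenodes below FirstNormalObjectId were assigned at initdb time */
	if (rfn->relNode <= FirstNormalObjectId)
		return true;

	/* relations managed by the relmapper (pg_class itself, etc.) */
	if (RelationMapFilenodeToOid(rfn->relNode, false) != InvalidOid)
		return true;

	return false;
}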
As Robert has said, it may be that simply adding the table OID to the WAL header is the way forward. It's not as if we can't optimise that later. One compromise might be to only do that when we haven't yet reached FULL_SNAPSHOT On the subject of making decoding restartable, you say: > == Restartable Decoding == > > As we want to generate a consistent stream of changes we need to have the > ability to start from a previously decoded location without going to the whole > multi-phase setup because that would make it very hard to calculate up to where > we need to keep information. Indeed, it would. > To make that easier everytime a decoding process finds an online checkpoint > record it exlusively takes a global lwlock and checks whether visibility > information has been already been written out for that checkpoint and does so > if not. We only need to do that once as visibility information is the same > between all decoding backends. Where and how would that visibility information be represented? So, typically you'd expect it to say “no catalogue changes for entire checkpoint“ much of the time? The patch we've seen (0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't address the question of transactions that contain DDL: + /* FIXME: For now skip transactions with catalog changes entirely */ + if (ent && ent->does_timetravel) + does_timetravel = true; + else + does_timetravel = SnapBuildHasCatalogChanges(snapstate, xid, relfilenode); but you do address this question (or a closely related question, which, I gather is the crux of the issue: How to mix DDL and DML in transactions?) in the new doc (README.SNAPBUILD.txt [6]): > == mixed DDL/DML transaction handling == > > When a transactions uses DDL and DML in the same transaction things get a bit > more complicated because we need to handle CommandIds and ComboCids as we need > to use the correct version of the catalog when decoding the individual tuples. Right, so it becomes necessary to think about time-travelling not just to a particular transaction, but to a particular point in a particular transaction – the exact point at which the catalogue showed a structure consistent with sanely interpreting logical WAL records created during that window after the last DDL command (if any), but before the next (if any). This intelligence is only actually needed when decoding tuples created in that actual transaction, because only those tuples have their format change throughout a single transaction. > CommandId handling itself is relatively simple, we can figure out the current > CommandId relatively easily by looking at the currently used one in > changes. The problematic part is that those CommandId frequently will not be > actual cmin or cmax values but ComboCids. Those are used to minimize space in > the heap. During normal operation cmin/cmax values are only used within the > backend emitting those rows and only during one toplevel transaction, so > instead of storing cmin/cmax only a reference to an in-memory value is stored > that contains both. Whenever we see a new CommandId we call > ApplyCacheAddNewCommandId. Right. So in general, transaction A doesn't have to concern itself with the order that other transactions had tuples become visible or invisible (cmin and cmax); transaction A need only concern itself with whether they're visible or invisible based on if relevant transactions (xids) committed, its own xid, plus each tuple's xmin and xmax. 
It is this property of cmin/cmax that enabled the combocid optimisation in 8.3, which introduces an array in *backend local* memory, to map a single HeapTupleHeader field (where previously there were 2 – cmin and cmax) to an entry in that array, under the theory that it's unusual for a tuple to be created and then deleted in the same transaction. Most of the time, that one HeapTupleHeader field wouldn't have a mapping to the local array – rather, it would simply have a cmin or a cmax. That's how we save heap space.

> To resolve this problem during heap_* whenever we generate a new combocid > (detected via an new parameter to HeapTupleHeaderAdjustCmax) in a catalog table > we log the new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During > decoding this ComboCid is added to the applycache > (ApplyCacheAddNewComboCid). They are only guaranteed to be visible within a > single transaction, so we cannot simply setup all of them globally.

This seems more or less reasonable. The fact that the combocid optimisation uses a local memory array isn't actually an important property of combocids as a performance optimisation – it's just that there was no need for the actual cmin and cmax values to be visible to other backends, until now. Comments in combocids.c go on at length about how infrequently actual combocids are actually used in practice.

For ease of implementation, but also because real combocids are expected to be needed infrequently, I suggest that rather than logging the mapping, you log the values directly in a record (i.e. the full cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You could easily exhaust the combocid space otherwise, and besides, you cannot do anything with the mapping from outside of the backend that originated the combocid for that transaction (you don't have the local array, or the local hashtable used for combocids).

> Before calling the output plugin ComboCids are temporarily setup and torn > down afterwards.

How? You're using a combocid-like array + hashtable local to the plugin backend? Anyway, for now, this is unimplemented, which is perhaps the biggest concern about it:

+	/* check if its one of our txids, toplevel is also in there */
+	else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+	{
+		CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+		/* no support for that yet */
+		if (tuple->t_infomask & HEAP_COMBOCID){
+			elog(WARNING, "combocids not yet supported");
+			return false;
+		}
+		if (cmin >= snapshot->curcid)
+			return false;	/* inserted after scan started */
+	}

Above, you aren't taking this into account (code from HeapTupleHeaderGetCmax()):

	/* We do not store cmax when locking a tuple */
	Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Sure, you're only interested in cmin, so this doesn't look like it applies (isn't this just a sanity check?), but actually, based on this it seems to me that the current sane representation of cmin needs to be obtained in a more concurrency aware fashion - having the backend local data-structures that originate the tuple isn't even good enough. The paucity of other callers to HeapTupleHeaderGetRawCommandId() seems to support this. Apart from contrib/pageinspect, there is only this one other direct caller, within heap_getsysattr():

	/*
	 * cmin and cmax are now both aliases for the same field, which
	 * can in fact also be a combo command id.  XXX perhaps we should
	 * return the "real" cmin or cmax if possible, that is if we are
	 * inside the originating transaction?
	 */
	result = CommandIdGetDatum(HeapTupleHeaderGetRawCommandId(tup->t_data));

So these few direct call-sites that inspect CommandId outside of their originating backend are all non-critical observers of the CommandId, that are roughly speaking allowed to be wrong. Callers to the higher level abstractions (HeapTupleHeaderGetCmin()/HeapTupleHeaderGetCmax()) know that the CommandId must be a cmin or cmax respectively, because they have as their contract that TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)) and TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tup)) respectively. I guess this can all happen when you write that XLOG_HEAP2_NEW_COMBOCID record within the originating transaction (that is, when your xmin is that of the tuple in the originating transaction, you are indeed dealing with a cmin), but these are hazards that you need to consider in addition to the more obvious ComboCid hazard. I have an idea that the HeapTupleHeaderGetRawCommandId(tuple) call in your code could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0.

I look forward to seeing your revision that addresses the issue with sinval messages.

[1] http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
[2] http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com
[3] http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com
[4] http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com
[5] http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com
[6] http://archives.postgresql.org/message-id/201210161330.37967.andres@2ndquadrant.com

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
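The three heuristics quoted above are small enough to read as code. A minimal sketch, in which only the three checks are taken from the thread; the helper name and surrounding function are assumptions, and a false result only means "cannot tell", not "user table":

#include "postgres.h"
#include "access/transam.h"
#include "catalog/pg_tablespace.h"
#include "storage/relfilenode.h"
#include "utils/relmapper.h"

/*
 * Return true only if the relfilenode certainly belongs to a catalog table.
 * Returning false does not prove the opposite - e.g. a catalog rewritten by
 * VACUUM FULL gets a new, high relfilenode and slips through.
 */
static bool
MaybeCatalogRelFileNode(const RelFileNode *rnode)
{
	/* everything in the global tablespace is a shared catalog */
	if (rnode->spcNode == GLOBALTABLESPACE_OID)
		return true;

	/* relfilenodes below FirstNormalObjectId were assigned during initdb */
	if (rnode->relNode <= FirstNormalObjectId)
		return true;

	/* relations tracked by the relation mapper are catalogs as well */
	if (RelationMapFilenodeToOid(rnode->relNode, false) != InvalidOid)
		return true;

	return false;
}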
Hi, On Friday, October 19, 2012 10:53:06 PM Peter Geoghegan wrote: > On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote: > > Here's the first version of the promised document. I hope it answers most > > of the questions. > > This makes for interesting reading. Thanks. > Step 14 *is* kind-of relevant, because this is one place where we > resolve relation OID based on relfilenode, which is part of > snapbuild.c, since it has a LookupTableByRelFileNode() call (the only > other such call is within snapbuild.c itself, as part of checking if a > particular relation + xid have had catalogue changes, which can be a > concern due to DDL, which is the basic problem that snapbuild.c seeks > to solve). Assuming that it truly is necessary to have a > LookupTableByRelFileNode() function, I don't think your plugin (which > will soon actually be a contrib module, right?) has any business > calling it. That shouldn't be needed anymore, that was just because I didn't finish some loose ends quickly enough. The callback now gets passed the TupleDesc. > Thankfully, I *think* that this implies that > plugins don't need to directly concern themselves with cache > invalidation. Correct. They don't need to know anything about it, its all handled transparently now. > By the way, shouldn't this use InvalidOid?: Yes. Fixed. > allay some of those concerns. It says: > > == xmin Horizon Handling == > > > > Reusing MVCC for timetravel access has one obvious major problem: > > VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also > > obviously, we want autovacuum/manual vacuum to work as before. > > > > The idea here is to reuse the infrastrcuture built for > > hot_standby_feedback which allows us to keep the xmin horizon of a > > walsender backend artificially low. We keep it low enough so we can > > restart decoding from the last location the client has confirmed to be > > safely received. The means that we keep it low enough to contain the > > last checkpoints oldestXid value. > > These ideas are still underdeveloped. For one thing, it seems kind of > weak to me that we're obliged to have vacuum grind to a halt across > the cluster because some speculative plugin has its proc's xmin pegged > to some value due to a logical standby still needing it for reading > catalogues only. Think of the failure modes – what happens when the > standby participating in a plugin-based logical replication system > croaks for indeterminate reasons? Doing this with the WAL sender as > part of hot_standby_feedback is clearly much less hazardous, because > there *isn't* a WAL sender when streaming replication isn't active in > respect of some corresponding standby, and hot_standby_feeback need > only support deferring vacuum for the streaming replication standby > whose active snapshot's xmin is furthest in the past. > If a standby is taken out of commission for an hour, it can probably catch > up without difficulty (difficulty like needing a base-backup), and it has no > repercussions for the master or anyone else. Only if you have setup an archive_command with sufficient retention. Otherwise it all depends on the wal_keep_segments value. > However, I believe that logical change records would be rendered meaningless > in the same scenario, because VACUUM, having not seen a “pegged” xmin > horizon due to the standby's unavailability, goes ahead and vacuums past a > point still needed to keep the standby in a consistent state. Well, thats why I want to persist them similar to twophase.c. 
> Maybe you can invent a new type of xmin horizon that applies only to > catalogues so this isn't so bad

Yes, I thought about this. It seems like a very sensible optimization, but I really, really, would like to get the simpler version finished.

> and I see you've suggested as much in follow-up mail to Robert > but it might be unsatisfactory to have that impact the PGAXT performance > optimisation introduced in commit ed0b409d, or obfuscate that code.

Hm. I don't foresee any need to pessimize that. But I guess the burden of proof lies on me writing up a patch for this.

> You could have the xmin be a special xmin, say though PGAXT.vacuumFlags indicating this, but wouldn't that necessarily preclude the backend from > having a non-special notion of its xmin? How does that bode for this actually becoming maximally generic infrastructure?

Uhm. Why should the decoding backend *ever* need its own xmin? It will *never* be allowed to do writes itself or similar?

> ... > You go on to say: > > That also means we need to make that value persist across > > restarts/crashes in a very similar manner to twophase.c's. That > > infrastructure actually also useful to make hot_standby_feedback work > > properly across primary restarts. > > So here you anticipating my criticism above about needing to peg the > xmin horizon. That's fine, but I still don't know yet is how you > propose to make this work in a reasonable way, free from all the usual > caveats about leaving prepared transactions open for an indefinitely > long time (what happens when we need to XID wraparound?, how does the > need to hold a transaction open interfere with a given plugin > backend's ability to execute regular queries?, etc).

Well, its not like its introducing a problem that wasn't there before. I really don't see this as something all that problematic. If people protest we can add some tiny bit of code to autovacuum that throws everything away thats older than some autovacuum_vacuum_freeze_max_age or such. We could even do the same to prepared transactions. But in the end the answer is that you need to monitor *any* replication system carefully.

> Furthermore, I don't know how it's going to be independently useful to make > hot_standby_feedback work across primary restarts, because the primary > cannot then generate VACUUM cleanup WAL records that the standby might > replay, resulting in a hard conflict. Maybe there's something > incredibly obvious that I'm missing, but doesn't that problem almost > take care of itself?

Autovacuum immediately starts after a restart. Once it has started some workers, a reconnecting standby cannot lower the xmin again, so you have a high likelihood of conflicts. I have seen that problem in the field (ironically "fixed" by creating a prepared xact before restarting ...).

> You talk about the relfilenode/oid mapping problem some: > > == Catalog/User Table Detection == > > > > To detect whether a record/transaction does catalog modifications - which > > we need to do for memory/performance reasons - we need to resolve the > > RelFileNode's in xlog records back to the original tables. Unfortunately > > RelFileNode's only contain the tables relfilenode, not their table oid. > > We only can do catalog access once we reached FULL_SNAPSHOT, before that > > we can use some heuristics but otherwise we have to assume that every > > record changes the catalog. > > What exactly are the implications of having to assume catalogue > changes?

Higher cpu/storage overhead, nothing else.
> I see that right now, you're just skipping actual decoding > once you've taken care of snapshot building chores within > SnapBuildCallback(): > > + if (snapstate->state == SNAPBUILD_START) > + return SNAPBUILD_SKIP; When were in SNAPBUILD_START state we don't have *any* knowledge about the system yet, so we can't do anything with collected records anyway (very likely the record we just read was part of an already started transaction). Once were in SNAPBUILD_FULL_SNAPSHOT state we can start collecting changes if the respective transaction started *after* we have reached SNAPBUILD_FULL_SNAPSHOT: > > The heuristics we can use are: > > * relfilenode->spcNode == GLOBALTABLESPACE_OID > > * relfilenode->relNode <= FirstNormalObjectId > > * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid > > > > Those detect some catalog tables but not all (think VACUUM FULL), but if > > they detect one they are correct. > > If the heuristics aren't going to be completely reliable, why is that > acceptable? You've said it "seems to work fine", but I don't quite > follow. Should have left that out, its a small internal optimization... If the above heuristic detect that a relfilenode relates to a catalog table its guaranteed to be correct. It cannot detect all catalog changes though, so you can only use it to skip doing work for catalog tables, not for skipping work if !catalog. > > After reaching FULL_SNAPSHOT we can do catalog access if our heuristics > > tell us a table might not be a catalog table. For that we use the new > > RELFILENODE syscache with (spcNode, relNode). > > I share some of Robert's concerns here. The fact that relfilenode > mappings have to be time-relativised may tip the scales against this > approach. Whats the problem with the time-relativized access? > As Robert has said, it may be that simply adding the table > OID to the WAL header is the way forward. It's not as if we can't > optimise that later. One compromise might be to only do that when we > haven't yet reached FULL_SNAPSHOT When writing the WAL we don't have any knowledge about what state some potential decoding process could be in, so its an all or nothing thing. I don't have a problem with writing the table oid into the records somewhere, I just think its not required. One reason for storing it in there independent from this patchset/feature is debugging. I wished for that in the past. We need to build the snapshot to access catalog anyway, so its not like doing the relfilenode lookup time-relativzed incurs any additional costs. Also, we need to do the table-oid lookup time-relatived as well, because table oids can be reused. > > To make that easier everytime a decoding process finds an online > > checkpoint record it exlusively takes a global lwlock and checks whether > > visibility information has been already been written out for that > > checkpoint and does so if not. We only need to do that once as > > visibility information is the same between all decoding backends. > > Where and how would that visibility information be represented? Some extra pg_* directory, like pg_decode/$LSN_OF_CHECKPOINT. > So, typically you'd expect it to say “no catalogue changes for entire > checkpoint“ much of the time? No, not really. It will probably look similar to the files ExportSnapshot currently produces. Even if no catalog changes happened we still need to keep knowledge about committed transactions and such. Btw, I doubt all that many checkpoint<->checkpoint windows will have absolutely no catalog changes. 
At least some pg_class.relfrozenxid, pg_class.reltuples changes are to be expected. > The patch we've seen > (0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't > address the question of transactions that contain DDL: > ... > but you do address this question (or a closely related question, > which, I gather is the crux of the issue: How to mix DDL and DML in > > transactions?) in the new doc (README.SNAPBUILD.txt [6]): > > == mixed DDL/DML transaction handling == > > > > When a transactions uses DDL and DML in the same transaction things get a > > bit more complicated because we need to handle CommandIds and ComboCids > > as we need to use the correct version of the catalog when decoding the > > individual tuples. > > Right, so it becomes necessary to think about time-travelling not just > to a particular transaction, but to a particular point in a particular > transaction – the exact point at which the catalogue showed a > structure consistent with sanely interpreting logical WAL records > created during that window after the last DDL command (if any), but > before the next (if any). This intelligence is only actually needed > when decoding tuples created in that actual transaction, because only > those tuples have their format change throughout a single transaction. Exactly. > > CommandId handling itself is relatively simple, we can figure out the > > current CommandId relatively easily by looking at the currently used one > > in changes. The problematic part is that those CommandId frequently will > > not be actual cmin or cmax values but ComboCids. Those are used to > > minimize space in the heap. During normal operation cmin/cmax values are > > only used within the backend emitting those rows and only during one > > toplevel transaction, so instead of storing cmin/cmax only a reference > > to an in-memory value is stored that contains both. Whenever we see a > > new CommandId we call > > ApplyCacheAddNewCommandId. > > Right. So in general, transaction A doesn't have to concern itself > with the order that other transactions had tuples become visible or > invisible (cmin and cmax); transaction A need only concern itself with > whether they're visible or invisible based on if relevant transactions > (xids) committed, its own xid, plus each tuple's xmin and xmax. It is > this property of cmin/cmax that enabled the combocid optimisation in > 8.3, which introduces an array in *backend local* memory, to map a > single HeapTupleHeader field (where previously there were 2 – cmin and > cmax) to an entry in that array, under the theory that it's unusual > for a tuple to be created and then deleted in the same transaction. > Most of the time, that one HeapTupleHeader field wouldn't have a > mapping to the local array – rather, it would simply have a cmin or a > cmax. That's how we save heap space. Yes. The whole handling here is nearly completely analogous to the normal handling of CommandIds. > > To resolve this problem during heap_* whenever we generate a new combocid > > (detected via an new parameter to HeapTupleHeaderAdjustCmax) in a catalog > > table we log the new XLOG_HEAP2_NEW_COMBOCID record containing the > > mapping. During decoding this ComboCid is added to the applycache > > (ApplyCacheAddNewComboCid). They are only guaranteed to be visible within > > a single transaction, so we cannot simply setup all of them globally. > > This seems more or less reasonable. 
The fact that the combocid > optimisation uses a local memory array isn't actually an important > property of combocids as a performance optimisation It is an important property here because ComboCids from different toplevel transactions conflict with each other which means we have to deal with them on a per toplevel-xid basis. > For ease of implementation, but also because real combocids are > expected to be needed infrequently, I suggest that rather than logging > the mapping, you log the values directly in a record (i.e. The full > cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You > could easily exhaust the combocid space otherwise, and besides, you > cannot do anything with the mapping from outside of the backend that > originated the combocid for that transaction (you don't have the local > array, or the local hashtable used for combocids). I can't really follow here. Obviously we need to generate the XLOG_HEAP2_NEW_COMBOCID locally in the transaction/backend that generated the change? > > Before calling the output plugin ComboCids are temporarily setup and torn > > down afterwards. > > How? You're using a combocid-like array + hashtable local to the > plugin backend? I added extern void PutComboCommandId(CommandId combocid, CommandId cmin, CommandId cmax); which in combination with the existing extern void AtEOXact_ComboCid(void); is enough. > Anyway, for now, this is unimplemented, which is perhaps the biggest > concern about it: > > + /* check if its one of our txids, toplevel is also in there */ > + else if (TransactionIdInArray(xmin, snapshot->subxip, > snapshot->subxcnt)) + { > + CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple); > + /* no support for that yet */ > + if (tuple->t_infomask & HEAP_COMBOCID){ > + elog(WARNING, "combocids not yet supported"); > + return false; > + } > + if (cmin >= snapshot->curcid) > + return false; /* inserted after scan started */ > + } > > Above, you aren't taking this into account (code from > HeapTupleHeaderGetCmax()): > > /* We do not store cmax when locking a tuple */ > Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED))); > > Sure, you're only interested in cmin, so this doesn't look like it > applies (isn't this just a sanity check?), but actually, based on this > it seems to me that the current sane representation of cmin needs to > be obtained in a more concurrency aware fashion - having the backend > local data-structures that originate the tuple isn't even good enough. You completely lost me here and in the following paragraphs. The infomask is available for everyone, and we only read/write cmin|cmax|comboid when were inside the transaction or when we have already logged a HEAP2_NEW_COMBOCID and thus have the necessary information? Which concurrency concerns are you referring to? > I have an idea that the HeapTupleHeaderGetRawCommandId(tuple) call in your > code could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0. Ah? The other .satisfies routines do HeapTupleHeaderGetCmin(tuple) which returns exactly that if !(tup->t_infomask & HEAP_COMBOCID). But anyway, with the new combocid handling the code uses the usual HeapTupleHeaderGetCmin/Cmax calls, so it looks even more like the normal routines. Thanks for the extensive review! I am pretty sure this a lot to take in ;). Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
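As a hedged illustration of that setup/teardown around an output-plugin call: only PutComboCommandId() and AtEOXact_ComboCid() are taken from the mail above; the applycache structure fields and the apply_change callback shown here are assumptions, not the patch's actual API.

/*
 * Sketch: install the ComboCid mappings logged (via XLOG_HEAP2_NEW_COMBOCID)
 * for one toplevel transaction, call the output plugin, then discard them.
 */
static void
decode_change_with_combocids(ApplyCacheTXN *txn, ApplyCacheChange *change)
{
	int		i;

	/* set up every mapping collected for this transaction */
	for (i = 0; i < txn->ncombocids; i++)
		PutComboCommandId(txn->combocids[i].combocid,
						  txn->combocids[i].cmin,
						  txn->combocids[i].cmax);

	/* the plugin's catalog timetravel now sees correct cmin/cmax values */
	apply_change(txn, change);

	/* throw the per-transaction mappings away again */
	AtEOXact_ComboCid();
}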
> Comments about the approach or even the general direction of the > implementation? Questions? This patch series has gotten a serious amount of discussion and useful feedback; even some parts of it have been committed. I imagine lots more feedback, discussion and spawning of new ideas will take place in Prague. I am marking it as Returned with Feedback for now. Updated, rebased, modified versions are expected for the next commitfest. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
After some fooling around to provide the discussed backend functionality to xlogdump (StringInfo atop PQExpBuffer and elog_start/elog_finish), the following items still need work:

1. rmgr tables

We're linking rmgr.c so that we can obtain the appropriate rm_desc function pointer for each rmgr. However the table also includes the rm_redo, startup, etc function pointers, which the linker wants resolved at xlogdump link time. The idea I have to handle this is to use a macro similar to PG_KEYWORD: at compile time we define it differently on xlogdump than on backend, so that the symbols we don't want are hidden.

2. ereport() functionality

Currently the xlogreader.c I'm using (the latest version posted by Andres) has both elog() calls and ereport(). I have provided trivial elog_start and elog_finish implementations, which covers the first. I am not really sure about implementing the whole errstart/errfinish stack, because that'd be pretty duplicative, though I haven't tried. The other alternative suggested elsewhere is to avoid elog/ereport entirely in xlogreader.c and instead pass a function pointer for error reportage. The backend would normally use ereport(), but xlogdump could do something simple with fprintf. I think that would end up being cleaner overall.

3. timestamptz_to_str

xact_desc uses this, which involves a couple of messy backend files (because there's palloc in them, among other problems). Alternatively we could tweak xact_desc to use EncodeDateTime (probably through some simple wrapper); given the constraints imposed on the values, that might be simpler, and we can provide a simple implementation of EncodeDateTime or of its hypothetical wrapper in xlogdump.

4. relpathbackend and pfree of its return value

This is messy. Maybe we should use a caller-supplied buffer instead of palloc to solve this.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
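Item 2's function-pointer alternative could look roughly like the following; XLogReaderState is the struct from the patch under discussion, but the callback type, the field and the frontend implementation here are merely illustrative assumptions:

#include <stdarg.h>
#include <stdio.h>

typedef void (*XLogReaderErrorCB) (const char *fmt,...);

typedef struct XLogReaderState
{
	XLogReaderErrorCB report_error;	/* backend: wraps ereport(); xlogdump: fprintf */
	/* ... remaining reader state ... */
} XLogReaderState;

/* something simple that xlogdump could plug in */
static void
xlogdump_report_error(const char *fmt,...)
{
	va_list		args;

	va_start(args, fmt);
	vfprintf(stderr, fmt, args);
	va_end(args);
	fputc('\n', stderr);
}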
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Heikki Linnakangas escribió: > Hmm. I was thinking that making this work in a non-backend context > would be too hard, so I didn't give that much thought, but I guess > there isn't many dependencies to backend functions after all. > palloc/pfree are straightforward to replace with malloc/free. That's > what we could easily do with the error messages too, just malloc a > suitably sized buffer. > > How does a non-backend program get access to xlogreader.c? Copy > xlogreader.c from the source tree at build time and link into the > program? Or should we turn it into a shared library? It links the object file into the src/bin/ subdir or whatever. I don't think a shared library is warranted at this point. One further (relatively minor) problem with what you propose here is that the xlogreader.c code is using emode_for_corrupt_record(), which not only lives in xlog.c but also depends on readSource. I guess we still want the message-supressing abilities of that function, so some more thinking is required in this area. I think you may have converted some malloc() calls from Andres' patch into palloc() -- because you have some palloc() calls which are later checked for NULL results, which obviously doesn't make sense. At the same time, if we're going to use malloc() instead of palloc(), we need to check for NULL return value in XLogReaderAllocate() callers. This seems easy to fix at first glance, but what is the correct response if it fails during StartupXLOG()? Should we just elog(FATAL) and hope it never happens in practice? Andres commented elsewhere about reading xlog records, processing them as they came in, and do a running CRC while we're still reading it. I think this is a mistake; we shouldn't do anything with a record until the CRC has been verified. Otherwise we risk reading arbitrarily corrupt data. Overall, I think I like Heikki's minimal patch better than Andres' original proposal, but I'll keep looking at both. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
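Purely as a sketch of the caller-side question raised here: in StartupXLOG() the check could end up looking like the following, where the XLogReaderAllocate() signature and callback name are guesses rather than the patch's settled API, and whether FATAL is the right reaction is exactly what is being asked.

	/* private_data for the read callback omitted; names are assumptions */
	xlogreader = XLogReaderAllocate(&XLogPageRead, NULL);
	if (xlogreader == NULL)
		ereport(FATAL,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("out of memory"),
				 errdetail("Failed while allocating an XLog reading processor.")));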
Re: Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I think you may have converted some malloc() calls from Andres' patch > into palloc() -- because you have some palloc() calls which are later > checked for NULL results, which obviously doesn't make sense. At the > same time, if we're going to use malloc() instead of palloc(), we need > to check for NULL return value in XLogReaderAllocate() callers. This > seems easy to fix at first glance, but what is the correct response if > it fails during StartupXLOG()? Should we just elog(FATAL) and hope it > never happens in practice? Um, surely we can still let those functions use palloc? It should just be #define'd as pg_malloc() (ie something with an error exit) in non-backend contexts. regards, tom lane
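In other words, a minimal sketch of Tom's suggestion, assuming the frontend build has error-exiting pg_malloc()/pg_free() wrappers available:

#ifdef FRONTEND
#define palloc(sz)	pg_malloc(sz)
#define pfree(ptr)	pg_free(ptr)
#endif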
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote: > Heikki Linnakangas escribió: > > Hmm. I was thinking that making this work in a non-backend context > > would be too hard, so I didn't give that much thought, but I guess > > there isn't many dependencies to backend functions after all. > > palloc/pfree are straightforward to replace with malloc/free. That's > > what we could easily do with the error messages too, just malloc a > > suitably sized buffer. > > > > How does a non-backend program get access to xlogreader.c? Copy > > xlogreader.c from the source tree at build time and link into the > > program? Or should we turn it into a shared library? > Andres commented elsewhere about reading xlog records, processing them > as they came in, and do a running CRC while we're still reading it. I > think this is a mistake; we shouldn't do anything with a record until > the CRC has been verified. Otherwise we risk reading arbitrarily > corrupt data. Uhm. xlog.c does just the same. It reads the header and if it looks valid it uses its length information to read the full record and only computes the CRC at the end. > Overall, I think I like Heikki's minimal patch better than Andres' > original proposal, but I'll keep looking at both. I'll defer to you two on that, no point in fighting that many experienced people ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Andres Freund escribió: > On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote: > > Heikki Linnakangas escribió: > > Andres commented elsewhere about reading xlog records, processing them > > as they came in, and do a running CRC while we're still reading it. I > > think this is a mistake; we shouldn't do anything with a record until > > the CRC has been verified. Otherwise we risk reading arbitrarily > > corrupt data. > > Uhm. xlog.c does just the same. It reads the header and if it looks valid it > uses its length information to read the full record and only computes the CRC > at the end. Uh. Correct. Am I the only one who finds this rather bizarre? Maybe this was okay when xlog data would only come from WAL files stored in the data directory at recovery, but if we're now receiving these from a remote sender over the network I wonder if we should be protecting against malicious senders. (This is not related to this patch anyway.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: > Andres Freund escribió: > > On Monday, October 29, 2012 08:58:53 PM Alvaro Herrera wrote: > > > Heikki Linnakangas escribió: > > > > > > Andres commented elsewhere about reading xlog records, processing them > > > as they came in, and do a running CRC while we're still reading it. I > > > think this is a mistake; we shouldn't do anything with a record until > > > the CRC has been verified. Otherwise we risk reading arbitrarily > > > corrupt data. > > > > Uhm. xlog.c does just the same. It reads the header and if it looks valid > > it uses its length information to read the full record and only computes > > the CRC at the end. > > Uh. Correct. > > Am I the only one who finds this rather bizarre? Maybe this was okay > when xlog data would only come from WAL files stored in the data > directory at recovery, but if we're now receiving these from a remote > sender over the network I wonder if we should be protecting against > malicious senders. (This is not related to this patch anyway.) How should this work otherwise? The CRC is over the whole data so we obviously need to read the whole data to compute the CRC? Would you prefer protecting the header with a separate CRC? You can't use a CRC against malicous users anyway, its not cryptographically secure in any meaning of the word, its trivial to generate different content resulting in the same CRC. The biggest user of the CRC checking code we have is making sure were not reading beyond the end of the WAL... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes: > On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: >> Am I the only one who finds this rather bizarre? Maybe this was okay >> when xlog data would only come from WAL files stored in the data >> directory at recovery, but if we're now receiving these from a remote >> sender over the network I wonder if we should be protecting against >> malicious senders. (This is not related to this patch anyway.) > You can't use a CRC against malicous users anyway, its not cryptographically > secure in any meaning of the word, More to the point, if a bad guy has got control of your WAL stream, crashing the startup process with a ridiculous length word is several orders of magnitude less than the worst thing he can find to do to you. So the above argument seems completely nonsensical to me. Anybody who's worried about that type of scenario is better advised to be setting up SSL to ensure that the stream is coming from the server they think it's coming from. regards, tom lane
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Alvaro Herrera
Tom Lane escribió: > Andres Freund <andres@2ndquadrant.com> writes: > > On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: > >> Am I the only one who finds this rather bizarre? Maybe this was okay > >> when xlog data would only come from WAL files stored in the data > >> directory at recovery, but if we're now receiving these from a remote > >> sender over the network I wonder if we should be protecting against > >> malicious senders. (This is not related to this patch anyway.) > > > You can't use a CRC against malicous users anyway, its not cryptographically > > secure in any meaning of the word, > > More to the point, if a bad guy has got control of your WAL stream, > crashing the startup process with a ridiculous length word is several > orders of magnitude less than the worst thing he can find to do to you. > So the above argument seems completely nonsensical to me. Well, I wasn't talking just about the record length, but about the record in general. The length just came up because it's what I noticed. And yeah, I was thinking in one sum for the header and another one for the data. If we're using CRC to detect end of WAL, what sense does it make to have to read the whole record if we can detect the end by just looking at the header? (I obviously see that we need to checksum the data as well; and having a second CRC field would bloat the record.) > Anybody who's worried about that type of scenario is better advised to > be setting up SSL to ensure that the stream is coming from the server > they think it's coming from. Okay. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
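To make the bloat argument concrete, a purely hypothetical record header with the checksum split in two might look like this; nobody is proposing this layout, and it costs an extra pg_crc32 in every single record:

typedef struct XLogRecordTwoCrc
{
	pg_crc32	xl_hdr_crc;		/* CRC over the header fields below */
	pg_crc32	xl_data_crc;	/* CRC over rmgr data and backup blocks */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	TransactionId xl_xid;		/* xact id */
	uint32		xl_tot_len;		/* total len of entire record */
	uint32		xl_len;			/* total len of rmgr data */
	uint8		xl_info;		/* flag bits */
	RmgrId		xl_rmid;		/* resource manager for this record */
} XLogRecordTwoCrc;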
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > And yeah, I was thinking in one sum for the header and another one for > the data. I don't think it's worth the space. > If we're using CRC to detect end of WAL, what sense does it > make to have to read the whole record if we can detect the end by just > looking at the header? Er, what? The typical case where CRC shows us it's end of WAL is that the overall CRC doesn't match, ie, torn record. Record header corruption as such is just about nonexistent AFAIK. regards, tom lane
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
From: Andres Freund
On Tuesday, October 30, 2012 04:24:21 PM Alvaro Herrera wrote: > Tom Lane escribió: > > Andres Freund <andres@2ndquadrant.com> writes: > > > On Tuesday, October 30, 2012 03:20:03 PM Alvaro Herrera wrote: > > >> Am I the only one who finds this rather bizarre? Maybe this was okay > > >> when xlog data would only come from WAL files stored in the data > > >> directory at recovery, but if we're now receiving these from a remote > > >> sender over the network I wonder if we should be protecting against > > >> malicious senders. (This is not related to this patch anyway.) > > > > > > You can't use a CRC against malicous users anyway, its not > > > cryptographically secure in any meaning of the word, > > > > More to the point, if a bad guy has got control of your WAL stream, > > crashing the startup process with a ridiculous length word is several > > orders of magnitude less than the worst thing he can find to do to you. > > So the above argument seems completely nonsensical to me. > > Well, I wasn't talking just about the record length, but about the > record in general. The length just came up because it's what I noticed. > > And yeah, I was thinking in one sum for the header and another one for > the data. If we're using CRC to detect end of WAL, what sense does it > make to have to read the whole record if we can detect the end by just > looking at the header? (I obviously see that we need to checksum the > data as well; and having a second CRC field would bloat the record.) Well, the header is written first. In the header we can detect somewhat accurately that were beyond the current end-of-wal by looking at ->xl_prev and doing some validity checks, but thats not applicable for the data. A valid looking header doesn't mean that the whole, potentially multi-megabyte, record data is valid, we could have crashed while writing the data. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
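The header-only validity checking referred to here (in the spirit of what xlog.c already does before trusting xl_tot_len) might be sketched as follows; the function name and the exact set of checks are simplified assumptions:

/*
 * Decide whether a just-read record header looks sane enough to trust its
 * length; the CRC over the full record data can only be verified later.
 */
static bool
header_looks_valid(XLogRecord *record, XLogRecPtr expected_prev)
{
	if (record->xl_tot_len < SizeOfXLogRecord)
		return false;			/* implausibly short */

	if (record->xl_rmid > RM_MAX_ID)
		return false;			/* unknown resource manager */

	if (record->xl_prev != expected_prev)
		return false;			/* back-link does not match the previous record */

	return true;
}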
I mentioned the remaining issues in a previous email (see message-id 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables xlogdump to #include xlog_internal.h by way of removing that file's inclusion of fmgr.h, which is problematic. I don't think this should be too contentious.

The other interesting question remaining is what to do about the rm_desc function in rmgr.c. We're split between these two ideas:

1. Have this in rmgr.c:

#ifdef FRONTEND
#define RMGR_REDO_FUNC(func) NULL
#else
#define RMGR_REDO_FUNC(func) func
#endif /* FRONTEND */

and then use RMGR_REDO_FUNC() in the table.

2. Have this in rmgr.c:

#ifndef RMGR_REDO_FUNC
#define RMGR_REDO_FUNC(func) func
#endif

And then have the xlogdump Makefile use -D to define a suitable RMGR_REDO_FUNC.

Opinions please?

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote: > I mentioned the remaining issues in a previous email (see message-id > 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables > xlogdump to #include xlog_internal.h by way of removing that file's > inclusion of fmgr.h, which is problematic. I don't think this should be > too contentious. I have tried to go through Xlogdump patch provided in mail chain of message-id provided. It seems there is no appropriate file/function header which makes it little difficult to understand the purpose. This is just a suggestion and not related to your this mail. > The other interesting question remaining is what to do about the rm_desc > function in rmgr.c. We're split between these two ideas: > > 1. Have this in rmgr.c: > > #ifdef FRONTEND > #define RMGR_REDO_FUNC(func) NULL > #else > #define RMGR_REDO_FUNC(func) func > #endif /* FRONTEND */ > > and then use RMGR_REDO_FUNC() in the table. > > > 2. Have this in rmgr.c: > > #ifndef RMGR_REDO_FUNC > #define RMGR_REDO_FUNC(func) func > #endif > > And then have the xlogdump Makefile use -D to define a suitable > RMGR_REDO_FUNC. > What I understood is that as there is only a need of rm_desc function in xlogdump, so we want to separate it out such that it can be easily used. In Approach-1, define for some of functions (redo, startup, cleanup,..) as NULL for frontends so that corresponding functions for frontends become hidden. In Approach-2, frontend (in this case Xlogdump) need to define value for that particular Macro by using -D in makefile. If my above thinking is right, then I think Approach-2 might be better as in that Frontend will have more flexibility if it wants to use some other functionality of rmgr. With Regards, Amit Kapila.
On 2012-11-28 18:58:45 +0530, Amit Kapila wrote: > On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote: > > I mentioned the remaining issues in a previous email (see message-id > > 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that enables > > xlogdump to #include xlog_internal.h by way of removing that file's > > inclusion of fmgr.h, which is problematic. I don't think this should be > > too contentious. > > I have tried to go through Xlogdump patch provided in mail chain of > message-id provided. > It seems there is no appropriate file/function header which makes it little > difficult to understand the purpose. > This is just a suggestion and not related to your this mail. An updated version of xlogdump with some initial documentation, sensible building, and some more is available at http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlogreader_v3 > > The other interesting question remaining is what to do about the rm_desc > > function in rmgr.c. We're split between these two ideas: > > > > 1. Have this in rmgr.c: > > > > #ifdef FRONTEND > > #define RMGR_REDO_FUNC(func) NULL > > #else > > #define RMGR_REDO_FUNC(func) func > > #endif /* FRONTEND */ > > > > and then use RMGR_REDO_FUNC() in the table. > > > > > > 2. Have this in rmgr.c: > > > > #ifndef RMGR_REDO_FUNC > > #define RMGR_REDO_FUNC(func) func > > #endif > > > > And then have the xlogdump Makefile use -D to define a suitable > > RMGR_REDO_FUNC. > > > > What I understood is that as there is only a need of rm_desc function in > xlogdump, so we want to separate it out such that > it can be easily used. > In Approach-1, define for some of functions (redo, startup, cleanup,..) as > NULL for frontends so that corresponding functions for frontends become > hidden. > In Approach-2, frontend (in this case Xlogdump) need to define value for > that particular Macro by using -D in makefile. > > If my above thinking is right, then I think Approach-2 might be better as in > that Frontend will have more flexibility if it wants > to use some other functionality of rmgr. I personally favor approach-1 because I cannot see any potential other use. You basically need to have the full backend code available just to successfully link the other functions. Running is even more complex, and there's no real point in doing that standalone anyway, so, what for? Its not like thats something that annot be changed should an actual usecase emerge. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, November 28, 2012 7:07 PM Andres Freund wrote: > On 2012-11-28 18:58:45 +0530, Amit Kapila wrote: > > On Wednesday, November 28, 2012 12:17 AM Alvaro Herrera wrote: > > > I mentioned the remaining issues in a previous email (see message-id > > > 20121025161751.GE6442@alvh.no-ip.org). Attached is a patch that > enables > > > xlogdump to #include xlog_internal.h by way of removing that file's > > > inclusion of fmgr.h, which is problematic. I don't think this > should be > > > too contentious. > > > > I have tried to go through Xlogdump patch provided in mail chain of > > message-id provided. > > It seems there is no appropriate file/function header which makes it > little > > difficult to understand the purpose. > > This is just a suggestion and not related to your this mail. > > An updated version of xlogdump with some initial documentation, sensible > building, and some more is available at > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=sh > ortlog;h=refs/heads/xlogreader_v3 Oops.. looked at wrong version. > > > The other interesting question remaining is what to do about the > rm_desc > > > function in rmgr.c. We're split between these two ideas: > > > > > > 1. Have this in rmgr.c: > > > > > > #ifdef FRONTEND > > > #define RMGR_REDO_FUNC(func) NULL > > > #else > > > #define RMGR_REDO_FUNC(func) func > > > #endif /* FRONTEND */ > > > > > > and then use RMGR_REDO_FUNC() in the table. > > > > > > > > > 2. Have this in rmgr.c: > > > > > > #ifndef RMGR_REDO_FUNC > > > #define RMGR_REDO_FUNC(func) func > > > #endif > > > > > > And then have the xlogdump Makefile use -D to define a suitable > > > RMGR_REDO_FUNC. > > > > > > > What I understood is that as there is only a need of rm_desc function > in > > xlogdump, so we want to separate it out such that > > it can be easily used. > > In Approach-1, define for some of functions (redo, startup, > cleanup,..) as > > NULL for frontends so that corresponding functions for frontends > become > > hidden. > > In Approach-2, frontend (in this case Xlogdump) need to define value > for > > that particular Macro by using -D in makefile. > > > > If my above thinking is right, then I think Approach-2 might be better > as in > > that Frontend will have more flexibility if it wants > > to use some other functionality of rmgr. > > I personally favor approach-1 because I cannot see any potential other > use. You basically need to have the full backend code available just to > successfully link the other functions. Running is even more complex, and > there's no real point in doing that standalone anyway, so, what for? Such functionality might be used if somebody wants to write independent test for storage engine, but not sure if such a thing (Approach-2) can be helpful. As I can see that Approach-1 has advantage, there will be no dependency in makefiles for exposing rm_desc functionality. And for Approach-2, it is unnecessary for makefile to define value if there is actually no other relevant use case for it. Can you think of any other pros-cons of both approaches, or do you think we already have sufficient reasoning to conclude on Approach-1. With Regards, Amit Kapila.
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > The other interesting question remaining is what to do about the rm_desc > function in rmgr.c. We're split between these two ideas: Why try to link rmgr.c into frontend versions at all? Just make a new table file that only includes the desc function pointers. Yeah, then there would be two table files to maintain, but it's not clear to me that it's uglier than these proposals ... regards, tom lane
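Tom's alternative would amount to a second, desc-only table along these lines; the struct name and the idea of a separate frontend-only file are assumptions, and the real list would have one entry per rmgr:

/* kept in sync with rmgr.c by hand; only rm_desc is needed by xlogdump */
typedef struct RmgrDescData
{
	const char *rm_name;
	void		(*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
} RmgrDescData;

static const RmgrDescData RmgrDescTable[] = {
	{"XLOG", xlog_desc},
	{"Transaction", xact_desc},
	{"Storage", smgr_desc},
	{"CLOG", clog_desc},
	{"Heap", heap_desc},
	{"Btree", btree_desc},
	/* ... and so on, in the same order as the backend's RmgrTable ... */
};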
On 2012-11-29 15:03:48 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > The other interesting question remaining is what to do about the rm_desc > > function in rmgr.c. We're split between these two ideas: > > Why try to link rmgr.c into frontend versions at all? Just make > a new table file that only includes the desc function pointers. > Yeah, then there would be two table files to maintain, but it's > not clear to me that it's uglier than these proposals ... Seems more likely to get out of sync. But then, rmgr.c isn't changing that heavily... I still prefer the -DFRONTEND solutions, but once more, its only a slight preference. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 29, 2012 at 3:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> The other interesting question remaining is what to do about the rm_desc >> function in rmgr.c. We're split between these two ideas: > > Why try to link rmgr.c into frontend versions at all? Just make > a new table file that only includes the desc function pointers. > Yeah, then there would be two table files to maintain, but it's > not clear to me that it's uglier than these proposals ... +1. It's not likely to get updated very often, we can add comments to remind people to keep them all in sync, and if you manage to screw it up without noticing then you are adding recovery code that you have not tested in any way whatsoever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company