Re: logical changeset generation v3 - git repository - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: logical changeset generation v3 - git repository
Msg-id: CAEYLb_W10YBH8uKb0hU0n8jjiUu8Cbbdxhq9z84zA8gwibqG5g@mail.gmail.com
In response to: Re: logical changeset generation v3 - git repository (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On 9 December 2012 19:14, Andres Freund <andres@2ndquadrant.com> wrote:
> I pushed a new version which
>
> - is rebased ontop of master
> - is based ontop of the new xlogreader (biggest part)
> - is based ontop of the new binaryheap.h
> - some fixes
> - some more comments

I decided to take another look at this, following my earlier reviews of a substantial subset of earlier versions of this patch, including my earlier review of WAL decoding [1] and a focused review of snapshot building [2] (itself a subset of WAL decoding). I think it was the right time to consolidate your multiple earlier patches, because some of the earlier BDR patches were committed (including "Rearrange storage of data in xl_running_xacts" [3], "Basic binary heap implementation" [4], "Embedded list interface" [5], and, though it isn't touched on here and is technically entirely distinct functionality, "Background worker processes" [6]). Furthermore, now that we've gotten past some early rounds of reviewing, it makes sense to build a *perfectly* (rather than just approximately) atomic unit, as we work towards something that is actually committable.

So what's the footprint of this big, newly rebased feature branch? Well, though some of these changes are uncommitted stuff from Heikki (i.e. XLogReader, which you've modified), and some of this is README documentation, the footprint is very large. I merged master with your dev branch (last commit of yours, 743f3af081209f784a30270bdf49301e9e242b78, made on Mon 10th Dec 15:35), and the stats are:

91 files changed, 9736 insertions(+), 956 deletions(-)

Note that there is a relatively large number of files affected in part because the tqual interface was bashed around a bit for the benefit of logical decoding - a lot of the changes to each of those 91 files are completely trivial. I'm very glad that you followed my earlier recommendation of splitting your demo logical changeset consumer into a contrib module, in the spirit of contrib/spi, etc.
This module, "test_decoding", represents a logical entry point, if you will, for the entire patch. As unwieldy as it may appear to be, the patch is (or at least *should be*) ultimately reducible to some infrastructural changes to core to facilitate this example logical change-set consumer.

test_decoding contrib module
============================

contrib/Makefile                      |   1 +
contrib/test_decoding/Makefile        |  16 +
contrib/test_decoding/test_decoding.c | 192 ++++

Once again, because test_decoding is a kind of "entry point", it gives me a nice point to continually refer back to when talking about this patch. (Incidentally, maybe test_decoding should be called pg_decoding?)

The regression tests pass, though this isn't all that surprising, since frankly the test coverage of this patch appears to be quite low. I know that you're working with Abhijit on improvements to the isolation tester to verify the correctness of the patch as it relates to supporting actual, practical logical replication systems. I would very much welcome any such test coverage (even if it wasn't in a committable state), since in effect you're asking me to take a leap of faith in respect of how well this infrastructure will support such systems – previously, I obliged you and didn't focus on concurrency and serializability concerns (it was sufficient to print out values/do some decoding in a toy function), but it's time to take a closer look at those now, I feel.

test_decoding is a client of the logical change-set producing infrastructure, and there appears to be broad agreement that that infrastructure needs to treat such consumers in a way that is maximally abstract. My question is, just how abstract does this interface have to be, really? How well are you going to support the use-case of a real logical replication system?
Now, maybe it's just that I haven't been paying attention (in particular, to the discussion surrounding [3] – though that commit doesn't appear to have been justified in terms of commit ordering in BDR at all), but I would like you to be more demonstrative of certain things, like:

1. Just what does a logical change-set consumer look like? What things are always true of one, and never true of one?

2. Please describe in as much detail as possible the concurrency issues with respect to logical replication systems. Please make verifiable, testable claims as to how well these issues are considered here, perhaps with reference to the previous remarks of subject-matter experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9] following my earlier review.

I'm not all that impressed with where test_decoding is at right now. There is still essentially no documentation. I think it's notable that you don't really touch the ReorderBufferTXN passed by the core system in the test_decoding plugin.

test_decoding and pg_receivellog
================================

I surmised that the way the test_decoding module is intended to be used is as a client of receivellog.c (*not* receivelog.c – that naming is *confusing*; perhaps call it receivelogiclog.c or something. Better still, make receivexlog handle the logical case rather than inventing a new tool). The reason for receivellog.c existing, as you yourself put it, is:

+ /*
+  * We have to use postgres.h not postgres_fe.h here, because there's so much
+  * backend-only stuff in the XLOG include files we need. But we need a
+  * frontend-ish environment otherwise. Hence this ugly hack.
+  */

So receivellog.c is part of a new utility called pg_receivellog, in much the same way as receivexlog.c is part of the existing pg_receivexlog utility (see commit b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree).
We're talking about these changes:

src/backend/utils/misc/guc.c           |  11 +
src/bin/pg_basebackup/Makefile         |   7 +-
src/bin/pg_basebackup/pg_basebackup.c  |   4 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 ++++++++++++
src/bin/pg_basebackup/pg_receivexlog.c |   4 +-
src/bin/pg_basebackup/receivelog.c     |   4 +-
src/bin/pg_basebackup/streamutil.c     |   3 +-
src/bin/pg_basebackup/streamutil.h     |   1 +

So far, so good. Incidentally, you forgot to do this:

install: all installdirs
	$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
	$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+	$(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

So this creates a new binary executable, pg_receivellog, which is described as “the pg_receivexlog equivalent for logical changes”. Much like pg_receivexlog, pg_receivellog issues special new replication protocol commands for logical replication, which account for your changes to the replication protocol grammar and lexer (i.e. walsender):

src/backend/replication/repl_gram.y    | 32 +-
src/backend/replication/repl_scanner.l |  2 +

You say:

+ /* This is is just for demonstration, don't ever use this code for anything real! */

uh, why not? What is the purpose of a contrib module, if not to serve as a minimal example?

So, I went to play with pg_receivellog, and I got lots of output like this:

[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
pg_receivellog: could not init logical rep: got 0 rows and 0 fields, expected 1 rows and 4 fields
pg_receivellog: disconnected. Waiting 5 seconds to try again.

Evidently you expected me to see this message:

+ if (!walsnd)
+ {
+ 	elog(ERROR, "couldn't find free logical slot. free one or increase max_logical_slots");
+ }

If I did, that might have been okay. I didn't though, presumably because the “walsnd” variable was wild/uninitialised.
So, I went and set max_logical_slots to something higher than 0, and restarted. pg_receivellog behaved itself this time.

In one terminal:

[peter@peterlaptop decode]$ tty
/dev/pts/0
[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication

In another:

[peter@peterlaptop decode]$ tty
/dev/pts/1
[peter@peterlaptop decode]$ psql
Expanded display is used automatically.
psql (9.3devel)
Type "help" for help.

postgres=# insert into b values(66,64);
INSERT 0 1
postgres=# \q
[peter@peterlaptop decode]$ cat test.log
BEGIN 1910
table "b": INSERT: i[int4]:66 j[int4]:64
COMMIT 1910

We're subscribed to logical changes, and everything looks about right. We have a toy demo of a logical change-set subscriber.

I wondered how this had actually worked. Since test_decoding had done nothing more than expose some functions, without registering any callback in the conventional way (hooks, etc), how could it have worked? That brings me to the interface used by plugins like this test_decoding.

Plugin interface
================

So test_decoding uses various types of caches and catalogs. I'm mostly worried about the core BDR interface that it uses, more so than this other stuff. I'm talking about:

src/include/replication/output_plugin.h | 76 ++

One minor gripe is that output_plugin.h isn't going to pass muster with cpluspluscheck (private is a C++ keyword). There are more serious problems, though. In particular, I'm quite perplexed at some of the code that “installs” the test_decoding plugin.
The test_decoding module is hard-coded within pg_receivellog thusly (the SCONST token here could name an arbitrary module):

+ res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");

Furthermore, the names of particular test_decoding routines are hard coded into core, using libdl/PG_MODULE_MAGIC introspection:

+ XLogReaderState *
+ normal_snapshot_reader(XLogRecPtr startpoint, TransactionId xmin,
+ 	char *plugin, XLogRecPtr valid_after)
+ {
+ 	/* to simplify things we reuse initial_snapshot_reader */
+ 	XLogReaderState *xlogreader = initial_snapshot_reader(startpoint, xmin);

*** SNIP ***

+ 	/* lookup symbols in the shared libarary */
+
+ 	/* optional */
+ 	apply_state->init_cb = (LogicalDecodeInitCB)
+ 		load_external_function(plugin, "pg_decode_init", false, NULL);
+
+ 	/* required */
+ 	apply_state->begin_cb = (LogicalDecodeBeginCB)
+ 		load_external_function(plugin, "pg_decode_begin_txn", true, NULL);

*** SNIP ***

This seems fairly wrong-headed. Comments above this function say:

+ /*
+  * Build a snapshot reader with callbacks found in the shared library "plugin"
+  * under the symbol names found in output_plugin.h.
+  * It wraps those callbacks so they send out their changes via an logical
+  * walsender.
+  */

So the idea is that the names of all functions with public linkage within test_decoding (their symbols) have magical significance, and that the core system resolves those magic symbols dynamically. I'm not aware of this pattern appearing anywhere else within Postgres. Furthermore, it seems kind of short-sighted. Have we not painted ourselves into a corner with regard to using multiple plugins at once? That use-case doesn't seem terribly unreasonable - for example, we might want to use test_decoding in production to debug a problem, while running a proper logical replication system and some other logical change-set consumer in tandem.
Idiomatic use of “hooks” allows multiple plugins to be called for the same call of the authoritative hook by the core system, as for example when using auto_explain and pg_stat_statements at the same time. Why not just use hooks? It isn't obvious that you shouldn't be able to do this.

The signature of the function pg_decode_change (imposed by the function pointer typedef LogicalDecodeChangeCB) assumes that everything should go through a passed StringInfo, but I have a hard time believing that that's a good idea. It's like your plugin functions as a way of filtering reorder buffers. It's not as if the core system just passes logical change-sets off, as one might expect. It is actually the case that clients have to connect to the server in replication mode, and get their change-sets (as filtered by their plugin) streamed by a walsender over the wire protocol directly. What of making changeset subscribers generic abstractions? Again, maybe you don't have to do anything with the StringInfo, but that is far from clear from the extant code and documentation.

Snapshot builder
================

We've seen [1] that the snapshot builder is concerned with building snapshots for the purposes of timetravel. This is needed to see the contents of the catalog at a point in time when decoding (see design documents for more).

src/backend/access/transam/xact.c                    |    4 +-
src/backend/replication/logical/DESIGN.txt           |  603 ++++++++++
src/backend/replication/logical/Makefile             |   25 +
src/backend/replication/logical/README.SNAPBUILD.txt |  298 +++++
src/backend/replication/logical/snapbuild.c          | 1181 +++++++++++++++++++
src/backend/utils/time/tqual.c                       |  299 ++++-
src/include/access/heapam_xlog.h                     |   23 +
src/include/c.h                                      |    1 +
src/include/replication/snapbuild.h                  |  129 +++
src/include/utils/snapshot.h                         |    4 +-
src/include/utils/tqual.h                            |   51 +-

I've lumped the tqual/snapshot visibility changes under “Snapshot builder” too, and anything mostly to do with ComboCids.
The README.SNAPBUILD document (and the code described by it) was previously the focus of an entire review of its own [2]. I still don't see why you're allocating snapshots within DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots would be better allocated alongside the ReadApplyState that is directly concerned with snapshots, to better encapsulate the snapshot stuff. Now, you do at least acknowledge this problem this time around:

+ /*
+  * FIXME: The existance of the snapshot builder is pretty obvious to the
+  * outside right now, that doesn't seem to be very good...
+  */

However, the fact is that this function:

+ Snapstate *
+ AllocateSnapshotBuilder(ReorderBuffer *reorder)
+ {

doesn't actually do anything with the ReorderBuffer pointer that it is passed. So I don't see why you've put off doing this, as if it's something that would require a non-trivial effort.

One of my major concerns during that review was the need for this “peg xmin horizon” hack - you presented an example that required the use of a prepared transaction to artificially peg the global xmin horizon, and I wasn't happy about that. We were worried about catalog tables getting vacuumed in a way that prevented us from correctly interpreting data about types in the face of transactions that mix DML and DDL. If the catalog tables were vacuumed, we'd be out of luck - we needed to do something somewhat analogous to hot_standby_feedback. At the same time, we need to manage the risk of bloat on the primary due to non-availability of a standby in some speculative replication system using this infrastructure. One proposal floated around was to have a special notion of xmin horizon - a more granular xmin horizon applicable to only the necessary catalog tables. You didn't pursue that idea yet, preferring to solve the simpler case.

You say of xmin horizon handling:

+ == xmin Horizon Handling ==
+
+ Reusing MVCC for timetravel access has one obvious major problem:
+ VACUUM.
+ Obviously we cannot keep data in the catalog indefinitely. Also
+ obviously, we want autovacuum/manual vacuum to work as before.
+
+ The idea here is to reuse the infrastrcuture built for hot_standby_feedback
+ which allows us to keep the xmin horizon of a walsender backend artificially
+ low. We keep it low enough so we can restart decoding from the last location
+ the client has confirmed to be safely received. The means that we keep it low
+ enough to contain the last checkpoints oldestXid value.
+
+ That also means we need to make that value persist across restarts/crashes in a
+ very similar manner to twophase.c's. That infrastructure actually also useful
+ to make hot_standby_feedback work properly across primary restarts.

So we jury-rig the actual xmin horizon by doing this:

+ /*
+  * inrease shared memory state, so vacuum can work
+  * on tuples we prevent from being purged.
+  */
+ IncreaseLogicalXminForSlot(buf->origptr,
+ 	running->oldestRunningXid);

We switch the WAL sender proc's xmin while the walsender replies to a message, while preserving the “real” xmin horizon. Presumably this is crash safe, since we do this as part of XLOG_RUNNING_XACTS replay (iff we're doing “logical recovery”; that is, decoding is being performed as we reach SNAPBUILD_CONSISTENT):

recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata);

I continue to be quite concerned about the failure modes here. I do not accept that this is no worse than using hot_standby_feedback. hot_standby_feedback can see a standby bloat up the master because it has a long-running transaction - it's a process that the standby must actively engage in. However, what you have here will bloat up the master passively; standbys have to actively work to *prevent* that from happening. That's a *fundamental* distinction. Maybe it's actually reasonable to do that, at least for now, but I think that you should at least acknowledge the distinction as an important one.
We also use this new tqual.c infrastructure to time-travel during decoding, with the snapshot built for us by the snapshot builder:

+ /*
+  * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+  * obeys.
+  *
+  * Only usable on tuples from catalog tables!
+  *
+  * We don't need to support HEAP_MOVED_(IN|OFF) for now because we only support
+  * reading catalog pages which couldn't have been created in an older version.
+  *
+  * We don't set any hint bits in here as it seems unlikely to be beneficial as
+  * those should already be set by normal access and it seems to be too
+  * dangerous to do so as the semantics of doing so during timetravel are more
+  * complicated than when dealing "only" with the present.
+  */
+ bool
+ HeapTupleSatisfiesMVCCDuringDecoding(HeapTuple htup, Snapshot snapshot,
+ 	Buffer buffer)

Are you sure that ReorderBuffer.private_data should be a void*? Maybe we'd be better off if it was a minimal “abstract base class” pointer, that contained a MemoryContext? This whole area could use a lot more scrutiny. That's all I have for now, though.

I'm happy to note that the overhead of computing the pegged Recent(Global)Xmin is one TransactionIdIsValid, one TransactionIdPrecedes and, potentially, one assignment. I am also pleased to see that you're invalidating system caches in a more granular fashion (for transactions that contain both DDL and DML, where we cannot rely on the usual Hot Standby behaviour where sinval messages are applied for commit records). That is a subject worthy of another e-mail, though.

Decoding (“glue code”)
======================

We've seen [1] that decoding is concerned with decoding WAL records from an xlogreader.h callback into a reorderbuffer.
Decoding means breaking up individual XLogRecord structs, reading them through an XLogReaderState, and storing them in a reorder buffer (reorderbuffer.c does this, and stores them as ReorderBufferChange records), while building a snapshot (which is needed in advance of adding tuples from records). It can be thought of as the small piece of glue between reorderbuffer and snapbuild that is called by XLogReader (DecodeRecordIntoReorderBuffer() is the only public function, which will be called by the WAL sender – previously, this was called by plugins directly). An example of what belongs in decode.c is the way it ignores physical XLogRecords, because they are not of interest.

src/backend/replication/logical/decode.c       | 494 ++++++++
src/backend/replication/logical/logicalfuncs.c | 224 ++++
src/backend/utils/adt/dbsize.c                 |  79 ++
src/include/catalog/indexing.h                 |   2 +
src/include/catalog/pg_proc.h                  |   2 +
src/include/replication/decode.h               |  21 +
src/include/replication/logicalfuncs.h         |  45 +
src/include/storage/itemptr.h                  |   3 +
src/include/utils/builtins.h                   |   1 +

The pg_proc-accessible utility function pg_relation_by_filenode() - which you've documented - doesn't appear to be used at present (it's just a way of exposing the core infrastructure, as described under “Miscellaneous thoughts”). A new index is created on pg_class (reltablespace oid_ops, relfilenode oid_ops). We've seen that we need a whole new infrastructure for resolving relfilenodes to relation OIDs, because only relfilenodes are available from the WAL stream, and in general the mapping isn't stable, as for example when we need to do a table rewrite. We have a new syscache for this. We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table relfilenode and combocids. I'm still not clear on how you're managing the corner case with relfilenode/table oid mapping that Robert spoke of previously [17]. Could you talk about that?

Reorder buffer
==============

Last time around [1], this was known as ApplyCache.
It's still concerned with the management of the logical replay cache - it reassembles transactions from a stream of interspersed changes. This is what a design doc previously talked about under “4.5 - TX reassembly” [14].

src/backend/replication/logical/reorderbuffer.c | 1185 ++++++++++++++++++++
src/include/replication/reorderbuffer.h         |  284 +++++

Last time around, I described spooling to disk, like a tuplestore, as a probable prerequisite to commit - I raise that now because I thought that this was the place where you'd most likely want to do that. Concerns about the crash-safety of buffered change-sets were raised too. You say this in a README:

+ * crash safety, restartability & spilling to disk
+ * consistency with the commit status of transactions
+ * only a minimal amount of synchronous work should be done inside individual
+   transactions
+
+ In our opinion those problems are restricting progress/wider distribution of
+ these class of solutions. It is our aim though that existing solutions in this
+ space - most prominently slony and londiste - can benefit from the work we are
+ doing & planning to do by incorporating at least parts of the changeset
+ generation infrastructure.

So, have I understood correctly - are you proposing that we simply outsource this to something else? I'm not sure how I feel about that, but I'd like clarity on this matter.

reorderbuffer.h should have way, way more comments for each of the structs. I want to see detailed comments, like those you see for the structs in parsenodes.h - you shouldn't have to jump to some design document to see how each struct fits within the overall design of reorder buffering.

XLog stuff (in particular, the new XLogReader)
==============================================

Andres rebased on top of Heikki's XLogReader patch for the purposes of BDR, and privately identified this whole area to me as a particular concern for this review.
The version that I'm reviewing here is not the version that Andres described last week, v3.0 [10], but a slight revision thereof, v3.1 [11]. See the commit message in Andres' feature branch for full details [12].

doc/src/sgml/ref/pg_xlogdump.sgml       |   76 ++
src/backend/access/transam/Makefile     |    2 +-
src/backend/access/transam/xlog.c       | 1084 ++++--------------
src/backend/access/transam/xlogfuncs.c  |    1 +
src/backend/access/transam/xlogreader.c |  962 ++++++++++++++++
src/bin/Makefile                        |    2 +-
src/bin/pg_xlogdump/Makefile            |   87 ++
src/bin/pg_xlogdump/compat.c            |  173 +++
src/bin/pg_xlogdump/nls.mk              |    4 +
src/bin/pg_xlogdump/pg_xlogdump.c       |  462 ++++++++
src/bin/pg_xlogdump/pqexpbuf_strinfo.c  |   76 ++
src/bin/pg_xlogdump/tables.c            |   78 ++
src/include/access/heapam_xlog.h        |   23 +
src/include/access/transam.h            |    5 +
src/include/access/xlog.h               |    3 +-
src/include/access/xlog_fn.h            |   35 +
src/include/access/xlog_internal.h      |   23 -
src/include/access/xlogdefs.h           |    1 +
src/include/access/xlogreader.h         |  159 +++

There was some controversy over the approach to implementing a “generic xlog reader” [13]. This revision of Andres' work presumably resolves that controversy, since it heavily incorporates Heikki's own work. Heikki has described the design of his original XLogReader patch [18].

pg_xlogdump is a hacker-orientated utility that has been around in various forms for quite some time (i.e. at least since the 8.3 days), concerned with reading and writing Postgres transaction logs for debugging purposes. It has long been obvious that it would be useful to maintain along with Postgres (there has been a tendency for xlogdump to fall behind, and only stable releases are supported), but the XLogReader-related refactoring makes adding an official xlogdump tool quite compelling (we're talking about 462 lines of wrapper code for pg_xlogdump.c, against several thousands of lines of code for the version in common use [15] that has hard-coded per-version knowledge of catalog oids and things like that).
I think that some of the refactoring that Simon did to xlog.c last year [16] makes things easier here, and kind of anticipates this. Again, with pg_xlogdump you forgot to do this:

pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
	$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)

+ install: all installdirs
+	$(INSTALL_PROGRAM) pg_xlogdump$(X) '$(DESTDIR)$(bindir)'/pg_xlogdump$(X)
+
+ installdirs:
+	$(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+ uninstall:
+	rm -f $(addprefix '$(DESTDIR)$(bindir)'/, pg_xlogdump$(X))

pg_xlogdump could be considered a useful way of testing the XLogReader and decoding functionality, independent of the test_decoding plugin. It is something that I'll probably use to debug this patch over the next few weeks. Example usage:

[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
xlog record: rmgr: Heap2 , record_len: 34, tot_len: 66, tx: 1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc: new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295, combo: 4294967295
xlog record: rmgr: Heap , record_len: 175, tot_len: 207, tx: 1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc: insert: rel 1663/12933/12671; tid 7/44
xlog record: rmgr: Btree , record_len: 34, tot_len: 66, tx: 1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc: insert: rel 1663/12933/12673; tid 1/355

In another thread, Robert and Heikki remarked that pg_xlogdump ought to be in contrib, and not in src/bin. As you know, I am inclined to agree.

[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
fatal_error: requested WAL segment 012345670000000000000009 has already been removed

This error message seems a bit presumptuous to me; as it happens, there never was such a WAL segment. Saying that there was introduces the possibility of operator error.
This appears to be superfluous:

*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 18,23 ****
--- 18,24 ----
  #include "access/htup_details.h"
  #include "access/xlog.h"
+ #include "access/xlog_fn.h"
  #include "access/xlog_internal.h"

The real heavyweight here is xlogreader.c, at 962 lines. The module refactors xlog.c, moving ReadRecord and some supporting functions to xlogreader.c. Those supporting functions now operate on a *generic* XLogReaderState rather than various global variables. The idea here is that the client of the API calls ReadRecord repeatedly to get each record. There is a callback of type XLogPageReadCB, which is used by the client to obtain a given page in the WAL stream. The XLogReader facility is responsible for decoding the WAL into records, but the client is responsible for supplying the physical bytes via the callback within XLogReader state. There is an error-handling callback too, added by Andres. Andres added a new function, XLogFindNextRecord(), which is used for checking whether RecPtr is a valid XLog address for reading, and for finding the first valid address after some address when dumping records, for debugging purposes.

Why did you move the page validation handling into XLogReader? Support was added for reading pages which are only partially valid, which seems reasonable. The callback that acts as a replacement for emode_for_corrupt_record might be a bit questionable. I'd like to have more to say on this. I'll leave that for another day. I note that there are many mallocs in this module (see note below under “Miscellaneous thoughts”).

heapam and other executor stuff
===============================

One aspect of this patch that I feel certainly warrants another of these subsections is the changes to heapam.c and related executor changes.
These are essentially changes to functions called by nodeModifyTable.c frequently, including functions like heap_hot_search_buffer, heap_insert, heap_multi_insert and heap_delete. We now have to do extra logical logging, and we need primary key values to be looked up. Files changed include:

src/backend/access/heap/heapam.c       | 284 ++++-
src/backend/access/heap/pruneheap.c    |  16 +-
src/backend/catalog/index.c            |  76 +-
src/backend/access/rmgrdesc/heapdesc.c |   9 +
src/include/access/heapam_xlog.h       |  23 +
src/include/catalog/index.h            |   4 +

What of this? (I'm using the dellstore sample database, as always):

postgres=# \d+ orders
                 Table "public.orders"
*** SNIP ***
Indexes:
    "orders_pkey" PRIMARY KEY, btree (orderid)
    "ix_order_custid" btree (customerid)
*** SNIP ***

postgres=# delete from orders where orderid = 77;
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE $1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE $1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE $1 OPERATOR(pg_catalog.=) "orderid""
DELETE 1

I don't have time to figure out what this issue is right now.

Hot Standby, Replication and libpq stuff
========================================

Not forgetting existing replication infrastructure and libpq stuff affected by this patch.
Files under this category that have been modified are:

src/backend/access/rmgrdesc/xlogdesc.c                      |   1 +
src/backend/postmaster/bgwriter.c                           |  35 +
src/backend/postmaster/postmaster.c                         |   7 +-
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c |   4 +-
src/backend/replication/Makefile                            |   2 +
src/backend/replication/walsender.c                         | 732 +++++++++++-
src/backend/storage/ipc/procarray.c                         |  23 +
src/backend/storage/ipc/standby.c                           |   8 +
src/backend/utils/init/postinit.c                           |   5 +
src/bin/pg_controldata/pg_controldata.c                     |   2 +
src/include/nodes/nodes.h                                   |   2 +
src/include/nodes/replnodes.h                               |  22 +
src/include/replication/walsender.h                         |   1 +
src/include/replication/walsender_private.h                 |  43 +-
src/interfaces/libpq/exports.txt                            |   1 +
src/interfaces/libpq/pqexpbuffer.c                          |  40 +
src/interfaces/libpq/pqexpbuffer.h                          |   5 +

I take particular interest in bgwriter.c here. You're doing this:

+ * Log a new xl_running_xacts every now and then so replication can get
+ * into a consistent state faster and clean up resources more
+ * frequently. The costs of this are relatively low, so doing it 4
+ * times a minute seems fine.

What about the power consumption of the bgwriter? I think that the way you try to interact with the existing loop logic is ill-considered. Just why is the bgwriter the compelling auxiliary process in which to do this extra work?

Quite a lot of code has been added to walsender. This is mostly down to some new functions, responsible for initialising logical replication:

! typedef void (*WalSndSendData)(bool *);
! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
  static void InitWalSenderSlot(void);
  static void WalSndKill(int code, Datum arg);
! static void XLogSendPhysical(bool *caughtup);
! static void XLogSendLogical(bool *caughtup);
  static void IdentifySystem(void);
  static void StartReplication(StartReplicationCmd *cmd);
+ static void CheckLogicalReplicationRequirements(void);
+ static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
+ static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+ static void ComputeLogicalXmin(void);

This is mostly infrastructure for initialising and starting logical replication. Initialisation means finding a free “logical slot” from shared memory, then looping until the new magic xmin horizon for logical walsenders (stored in their “slot”) is that of the weakest link (think local global xmin).

+ * FIXME: think about solving the race conditions in a nicer way.
+ */
+ recompute_xmin:
+ walsnd->xmin = GetOldestXmin(true, true);
+ ComputeLogicalXmin();
+ if (walsnd->xmin != GetOldestXmin(true, true))
+ 	goto recompute_xmin;

Apart from the race conditions that I'm not confident are addressed here, I think that the above could easily get stuck indefinitely in the event of contention.

Initialisation occurs due to an “INIT_LOGICAL_REPLICATION” replication command. Initialisation also means that decoding state is allocated (a snapshot reader is initialised), and we report back success or failure to the client that's using the streaming replication protocol (i.e. in our toy example, pg_receivellog).

Starting logical replication means we load the previously initialised slot, and find a snapshot reader plugin (using the “magic symbols” pattern described above, under “Plugin interface”). Why do we have to “find” a logical slot twice (both during initialisation and starting)?

Since I've already described the “peg xmin horizon” stuff under “Snapshot builder”, I won't belabour the point. I think that I have more to say about this, but not today.
Minor point: This is a terrible name for the variable in question:

+ LogicalWalSnd *walsnd;

Miscellaneous thoughts
======================

You're still using C stdlib functions like malloc, free and calloc quite a bit. My concern is that this points to a lack of thought about the memory management strategy; why are you still not using memory contexts in some places? If it's so difficult to anticipate what clients of, say, XLogReaderAllocate() want for the lifetime of their memory, then likely as not those clients should be doing their own memory allocation, and passing the allocated buffer directly. If it is obvious that the memory ought to persist indefinitely (and I think it's your contention that it is in the case of XLogReaderAllocate()), I'd just allocate it in the top memory context. Now, I am aware that there are a trivial number of backend mallocs that you can point to as precedent here, but I'm still not satisfied with your explanation for using malloc(). At the very least, you ought to be handling the case where malloc returns NULL, and you're not doing so consistently.

Memory contexts are very handy for debugging. As you know, I wrote a little Python script with GDB bindings that walks the tree of memory contexts and prints out statistics about them using the aset.c/AllocSetStats() infrastructure. It isn't difficult to imagine that something like that could be quite useful with this patch - I'd like to be able to easily determine how many snapshot builders have been allocated from within a given backend, for example (though I see you refcount that anyway for reasons that are not immediately apparent - just debugging?).

Minor gripes:

* There is no need to use a *.txt extension for README files; we don't currently use those anywhere else.

* If you only credit the PGDG and not the Berkeley guys (as you should, for the most part), there is no need to phrase the notice “Portions Copyright...”. You should just say “Copyright...”.
* You're still calling function pointer typedefs things like LogicalDecodeInitCB. As I've already pointed out, you should prefer the existing conventions (call it something like LogicalDecodeInit_hook_type).

Under this section are all modifications to files that are not separately described under some dedicated section header. I'll quickly pass remark on them. System caches were knocked around a bit:

// LocalExecuteInvalidationMessage now exposed:
src/backend/utils/cache/inval.c     |  2 +-
// relcache.c has stray whitespace:
src/backend/utils/cache/relcache.c  |  1 -
// New RelationMapFilenodeToOid() function:
src/backend/utils/cache/relmapper.c | 53 +
// New RELFILENODE syscache added:
src/backend/utils/cache/syscache.c  | 11 +
// Headers:
src/include/storage/sinval.h        |  2 +
src/include/utils/relmapper.h       |  2 +
src/include/utils/syscache.h        |  1 +

These are only of tangential interest to snapshot building, and so are not described separately. Essentially, just “add new syscache” boilerplate.

There's also a little documentation, covering only the pg_relation_by_filenode() utility function (this exposes RelationMapFilenodeToOid()/the RELFILENODE syscache):

doc/src/sgml/func.sgml         | 23 +-
doc/src/sgml/ref/allfiles.sgml |  1 +
doc/src/sgml/reference.sgml    |  1 +

The following files were only changed due to the change in the tqual.c interfaces of HeapTupleSatisfies*():

contrib/pgrowlocks/pgrowlocks.c      | 2 +-
src/backend/commands/analyze.c       | 3 +-
src/backend/commands/cluster.c       | 2 +-
src/backend/commands/vacuumlazy.c    | 3 +-
src/backend/storage/lmgr/predicate.c | 2 +-

That's all the feedback that I have for now. I'd have liked to have gone into more detail in many cases, but I can only do so much. I always like to start off rounds of review with “this is the current state of play as I see it” type e-mails. There will be more to follow, now that I have that out of the way.
References
==========

[1] Earlier WAL decoding review:
http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com
[2] Earlier snapshot building doc review:
http://archives.postgresql.org/message-id/CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com
[3] "Rearrange storage of data in xl_running_xacts" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5c11725867ac3cb06db065f7940143114280649c
[4] "Basic binary heap implementation" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7a2fe9bd0371b819aacc97a007ec1d955237d207
[5] "Embedded list interface" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a66ee69add6e129c7674a59f8c3ba010ed4c9386
[6] "Background worker processes" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da07a1e856511dca59cbb1357616e26baa64428e
[7] Chris Browne on Slony and ordering conflicts:
http://archives.postgresql.org/message-id/CAFNqd5VY9aKZtPSEyzOTMsGAhfFHKaGNCgY0D0wZvqjC0Dtt1g@mail.gmail.com
[8] Steve Singer on Slony and transaction isolation level:
http://archives.postgresql.org/message-id/BLU0-SMTP6402AA6F3A1F850EDFA1B2DC8D0@phx.gbl
[9] Kevin Grittner on commit ordering:
http://archives.postgresql.org/message-id/20121022141701.224550@gmx.com
[10] v3.0 of the XLogReader (Andres' revision):
http://archives.postgresql.org/message-id/20121204175212.GB12055@awork2.anarazel.de
[11] v3.1 of the XLogReader (Andres' slight tweak of [10]):
http://archives.postgresql.org/message-id/20121209190532.GD4694@awork2.anarazel.de
[12] Andres' XLogReader commit:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=3ea7ec5eea2cf890c14075b559e77a25a4130efc
[13] Heikki objects to XLogReader approach, proposes alternative:
http://archives.postgresql.org/message-id/5056D3E1.3060108@vmware.com
[14] “WAL decoding, attempt #2” design documents:
http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
[15] xlogdump satellite project:
https://github.com/snaga/xlogdump
[16] Numerous refactoring commits. Main split was commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9aceb6ab3c202a5bf00d5f00436bb6ad285fc0bf
[17] Robert of relfilenodes:
http://archives.postgresql.org/message-id/CA+TgmoZXkCo5FAbU=3JHuXXF0Op2SLhGJcVuFM3tkmcBnmhBMQ@mail.gmail.com
[18] Heikki on his XLogReader:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php