Re: Changeset Extraction v7.5 - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Changeset Extraction v7.5
Date
Msg-id CA+TgmoYTc0gHm_ca=wDWqc8TLY1VQ9vx6jxPSSE6=OifAtW=cQ@mail.gmail.com
Whole thread Raw
In response to Re: Changeset Extraction v7.5  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Changeset Extraction v7.5  (Andres Freund <andres@2ndquadrant.com>)
Re: Changeset Extraction v7.5  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Fri, Feb 7, 2014 at 2:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> attached you can find the next version of the patchset.

As usual, I'm going to be reviewing patch 1.  The definition of "patch
1" has changed quite a few times over the past year, but that's
usually the one I'm reviewing.

+ *      contents of records in here xexcept turning them into a more usable

Typo.

+                       /*
+                        * XXX: There doesn't seem to be a usecase for decoding
+                        * HEAP_NEWPAGE's. Its only used in various
indexam's and CLUSTER,
+                        * neither of which should be relevant for the logical
+                        * changestream.
+                        */

There's a level of uncertainty here that doesn't seem consistent with
calling this a finished patch.  It's also not a complete list of
places where log_newpage() is called, but frankly I don't think that
should be the aim of this comment.  The only relevant question is
whether we ever use XLOG_HEAP_NEWPAGE to log heap changes that are
relevant to logical replication.  I think we don't.

+                               /* FIXME: skip if wrong db? */

It's time to fish or cut bait.

+                       /*
+                        * XXX: As a future feature, we could replay
the transaction and
+                        * prepare it as well, allowing for 2PC via
logical decoding.
+                        */

Let's try to avoid using XXX (or FIXME) for things that really mean TODO.

I think this comment deserves to be expanded a bit, too.  Maybe
something like: "Right now, logical decoding ignores PREPARE
TRANSACTION and simply decodes the subsequent COMMIT TRANSACTION or
ROLLBACK TRANSACTION just as it would a regular COMMIT or ROLLBACK.
In the future, we might want to change this.  Decoding PREPARE might
enable future code to prepare each locally prepared transaction on the
remote side before doing a COMMIT TRANSACTION locally, allowing for
logical synchronous replication."

+                       /*
+                        * If the record wasn't part of a transaction,
it will not have
+                        * caused invalidations and thus isn't
important when building
+                        * snapshots. If it was part of a transaction,
that transaction
+                        * just performed DDL because those are the
only codepaths using
+                        * inplace updates.
+                        */

Under what circumstances do we issue in-place updates not associated
with a transaction?  And under what circumstances do we issue in-place
updates that ARE associated with a transaction?


+                * XXX: At some point we might want to execute the transaction's

The XXX again seems needless; the comment is fine as it stands.

+                               /*
+                                * Abort all transactions that we keep
track of that are older
+                                * than ->oldestRunningXid. This is
the most convenient spot

I think writing ->membername is poor commenting style.  Just leave out
the arrow, or write "the WAL record's oldestRunningXid."

+/*
+ * Get the data from the various forms of commit records and pass it
+ * on to snapbuild.c and reorderbuffer.c
+ */

This is a lousy comment.  I suggest something like: "Currently, each
transaction is decoded only once it commit, so the arrival of a commit
record means that we can now decode the changes made by this toplevel
transaction and all of its committed subtransactions, unless we have
to skip it because the replication system isn't fully initialized yet.Whether decoding the transaction or not, we must
takenote of any
 
invalidations it issues, as those will affect the snapshot used for
decoding of *other* transactions."

+/*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */

Suggest: "When a transaction abort is detected, we throw away any data
we've stashed away for possible future decoding of that transaction.
Knowledge of the abort may also help us establish our initial snapshot
when logical decoding is first initiated."

+/*
+ * Set the xmin required for decoding snapshots for the specific decoding
+ * slot.
+ */
+void
+IncreaseLogicalXminForSlot(XLogRecPtr lsn, TransactionId xmin)

I'm thinking this and everything that follows, up through
LogicalDecodingCountDBSlots, probably should be moved to slot.c.

+       /* XXX: Add the current LSN? */

+1.

+       /* shorter lines... */
+       slot = MyReplicationSlot;

If you're going to do this, which seems like it's probably a good
idea, do it at the top of the function and use it all the way through
instead of doing it in the middle.

+       if (MyReplicationSlot == NULL)
+               elog(ERROR, "need a current slot");
+
+       if (is_init && start_lsn != InvalidXLogRecPtr)
+               elog(ERROR, "Cannot INIT_LOGICAL_REPLICATION at a
specified LSN");
+
+       if (is_init && plugin == NULL)
+               elog(ERROR, "Cannot INIT_LOGICAL_REPLICATION without a
specified plugin");

One of these error messages is not like the others.

+       context = AllocSetContextCreate(CurrentMemoryContext,
+"Changeset Extraction Context",
+ALLOCSET_DEFAULT_MINSIZE,
+ALLOCSET_DEFAULT_INITSIZE,
+ALLOCSET_DEFAULT_MAXSIZE);

I have my doubts about whether it's wise to make this the child of
CurrentMemoryContext.  Certainly, if we do that, then expectations
around what that context is need to be documented.  Short-lived
contexts are presumably unsuitable.

+                * Lets start with enough information if we can, so
log a standby
+                * snapshot and start decoding at exactl that position.

Let's.  Exactly.

+                        * the xlog records didn't result in anyting
relevant for

anything.

+                       elog(LOG, "cannot stream from %X/%X, minimum
is %X/%X, forwarding",

This isn't very clear, and I don't think it follows style guidelines either.

+               /* register output plugin name with slot */
+               strncpy(NameStr(MyReplicationSlot->data.plugin), plugin,
+                               NAMEDATALEN);
+               NameStr(MyReplicationSlot->data.plugin)[NAMEDATALEN - 1] = '\0';

Hmm.  Shouldn't this be delegated to something in slot.c?  Why don't
we need the lock?  Why don't we need to fsync() the change?

+               /*
+                * Acquire the current global xmin value and directly
set the logical
+                * xmin before releasing the lock if necessary. We do
this so wal
+                * decoding is guaranteed to have all catalog rows
produced by xacts
+                * with an xid > walsnd->xmin available.
+                *
+                * We can't let ReplicationSlotsComputeRequiredXmin() lock the
+                * procarray as that acquires ProcArrayLock separately
which would
+                * open a short window for the global xmin to advance
above our xmin.
+                */
+               LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+               slot->effective_catalog_xmin =
GetOldestSafeDecodingTransactionId();
+               slot->data.catalog_xmin = slot->effective_catalog_xmin;
+
+               ReplicationSlotsComputeRequiredXmin(true);
+
+               LWLockRelease(ProcArrayLock);

I don't understand this.

+                       (errmsg("changeset extraction started,
extracting changes after %X/%X, reading from %X/%X",

Needs some style work.  What does "extracting changes after %X/%X"
really mean?  Also, how about including the slot name in there?

+ * Performoutput plugin write into tuplestore.

Space.

+       /*
+        * XXX: maybe we ought to assert ctx->out is in database encoding when
+        * we're writing textual output.
+        */

Good idea.

+               /*
+                * FIXME: we're going to have to do something more
intelligent about
+                * timelines on standby's. Use readTimeLineHistory() and
+                * tliOfPointInHistory() to get the proper LSN?
+                */

So what's the plan for that?

+               /*
+                * XXX: It'd be way nicer to be able to use the
walsender waiting logic
+                * here, but that's not available in all environments.
+                */

I don't understand this.

pg_create_decoding_replication_slot() should go in slotfuncs.c.

+/* number of changes kept in memory, per transaction */
+const Size     max_memtries = 4096;
+
+/* Size of the slab caches used for frequently allocated objects */
+const Size     max_cached_changes = 4096 * 2;
+const Size     max_cached_tuplebufs = 4096 * 2;                /* ~8MB */
+const Size     max_cached_transactions = 512;

Hmm.  Is max_memtries the number of "tries" you keep in "mem"?  Or is
"tries" an inadvertent abbreviation for "entries" or something?  Also,
there's no real discussion here of the logic behind these values, or
the performance or memory impact of changing them.

+ * Free an ReorderBufferTXN. Deallocation might be delayed for efficiency
+ * purposes.

Wow, so we have a bespoke dllist for caching these objects, and that's
actually material to performance?  What is the allocation/deallocation
rate here?

+/*
+ * FIXME: better comment and/or name
+ */

If not now, then when?

+               ReorderBufferChange *next_change =
+               dlist_container(ReorderBufferChange, node, next);

Formatting.

+        * ->subxip contains all txids that belong to our transaction which we

snap->subxip

+        * XXX: ->nsubxcnt can be out of date when subtransactions abort, count
+        * manually.

Why is this an XXX?

+                       if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+                               elog(ERROR, "cannot replay using sub,
already allocated xid %u",
+                                        GetTopTransactionIdIfAny());

Message style.  "cannot replay using top", too.  Also, what makes this
elog() material?  If decoding can be performed from a user session
this seems likely to be (blech!) user-facing.

+                                       /* XXX: we could skip
snapshots in non toplevel txns */

TODO.  Or fix it.

+               /*
+                * don't do a ReorderBufferCleanupTXN here, with the
vague idea of
+                * allowing to retry decoding.
+                */

It's a bit late in the cycle for such vagueness.

+ * Rejigger change->newtuple to point to in-memory toast tuples instead to
+ * on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM).
+ *
+ * We cannot replace unchanged toast tuples though, so those will still point
+ * to on-disk toast data.

Why is that OK?

+ * location indicated by 'lsn'. Returns true if successfull, false otherwise.

Extra "l".

My eyes are glazing over, so I'm stopping here for now.

Generally, I think there's a lot of good stuff in here, but I think
you need to make a serious pass through here and try to get rid of all
the things marked XXX and FIXME, either by changing them to say TODO
(or nothing), or by deciding that they're OK and removing the comment,
or by fixing them.  It's fine to have some XXX comments on points
where you're hoping for reviewer feedback, but the time to have notes
in there for yourself is long past.  Also, you need to either go
through and make a studious attempt to fix all the typographical
errors and message style issues, or you need to find someone else who
can help you with that.  Preferably not me, because I'd rather focus
on things of somewhat greater significance.

Generally, I find decode.c to be relatively straightforward, and it's
even got a nice long header comment explaining what it does.
logical.c, on the other hand, has a one-line header comment.  As I
mentioned above, I think a lot of the stuff in this file properly
belongs elsewhere, so maybe that's not so bad.  But it could probably
use at least a little more work.

To the extent that it's possible, the documentation should be
incorporated into the patches that introduce the corresponding
facilities rather than being a separate patch all of its own.

The meat of the patch seems to be reorderbuffer.c.  That file needs
more and better explanations of what is being done and why it is being
done.  For example:

+/*
+ * Abort a transaction that possibly has previous changes. Needs to be done
+ * independently for toplevel and subtransactions.
+ */
+void
+ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)

It's already clear from the name of the function and how it's used
elsewhere that it gets called when a transaction aborts.  Mentioning
that the function gets invoked for both toplevel and subtransactions
is, surely, worthwhile.  But the function has no comments explaining
what it's doing in response to that abort, or why it's doing those
things, or why it's doing those things in that order rather than some
other order.  Here is an example of a better comment:

+/*
+ * Setup the base snapshot of a transaction. That is the snapshot that is used
+ * to decode all changes until either this transaction modifies the catalog or
+ * another catalog modifying transaction commits.
+ */

Now, the grammar there might need a tad of work, but there's a lot of
useful information in that comment.  It would be even better if there
were a comment explaining, either here or in the caller, how we decide
*when* to call this function.

Both reorderbuffer.c and snapbuild.c contain a significant number of
functions that are quite short.  I can't decide whether this is
excellent attention to abstraction boundaries or a sign that the
abstraction boundary is too permeable in the first place.

This email is a bit down in the trenches; I will try to write another
with some higher-level considerations.  But I think there is plenty of
stuff here for you to get started fixing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [PERFORM] encouraging index-only scans
Next
From: Greg Stark
Date:
Subject: Re: Recovery inconsistencies, standby much larger than primary