Re: Changeset Extraction v7.5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Changeset Extraction v7.5
Date
Msg-id 20140211175734.GI15246@awork2.anarazel.de
Whole thread Raw
In response to Re: Changeset Extraction v7.5  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi!

On 2014-02-11 11:22:24 -0500, Robert Haas wrote:
> + *      contents of records in here xexcept turning them into a more usable
> 
> Typo.
> 
> +                       /*
> +                        * XXX: There doesn't seem to be a usecase for decoding
> +                        * HEAP_NEWPAGE's. Its only used in various
> indexam's and CLUSTER,
> +                        * neither of which should be relevant for the logical
> +                        * changestream.
> +                        */
> 
> There's a level of uncertainty here that doesn't seem consistent with
> calling this a finished patch.  It's also not a complete list of
> places where log_newpage() is called, but frankly I don't think that
> should be the aim of this comment.  The only relevant question is
> whether we ever use XLOG_HEAP_NEWPAGE to log heap changes that are
> relevant to logical replication.  I think we don't.

You're right, we currently don't. I guess we should add a comment to
log_newpage()/buffer to make sure that's not violated in future code,
that seems like the only place that's somewhat likely to be read?

> Decoding PREPARE might
> enable future code to prepare each locally prepared transaction on the
> remote side before doing a COMMIT TRANSACTION locally, allowing for
> logical synchronous replication."

We do support logical synchronous replication over the walsender
interface, what it would allow us to do would be some form of 2pc...

I'll adapt your version.

> +                       /*
> +                        * If the record wasn't part of a transaction,
> it will not have
> +                        * caused invalidations and thus isn't
> important when building
> +                        * snapshots. If it was part of a transaction,
> that transaction
> +                        * just performed DDL because those are the
> only codepaths using
> +                        * inplace updates.
> +                        */
> 
> Under what circumstances do we issue in-place updates not associated
> with a transaction?  And under what circumstances do we issue in-place
> updates that ARE associated with a transaction?

vacuum updates relfrozenxid outside a transaction, e.g. create/drop
index, analyze inside one.

> +/*
> + * Get the data from the various forms of commit records and pass it
> + * on to snapbuild.c and reorderbuffer.c
> + */
> 
> This is a lousy comment.  I suggest something like: "Currently, each
> transaction is decoded only once it commit, so the arrival of a commit
> record means that we can now decode the changes made by this toplevel
> transaction and all of its committed subtransactions, unless we have
> to skip it because the replication system isn't fully initialized yet.
>  Whether decoding the transaction or not, we must take note of any
> invalidations it issues, as those will affect the snapshot used for
> decoding of *other* transactions."
>
> +/*
> + * Get the data from the various forms of abort records and pass it on to
> + * snapbuild.c and reorderbuffer.c
> + */
> 
> Suggest: "When a transaction abort is detected, we throw away any data
> we've stashed away for possible future decoding of that transaction.
> Knowledge of the abort may also help us establish our initial snapshot
> when logical decoding is first initiated."

Hm, those should go into reorderbuffer.c instead, the interesting part
of DecodeCommit/DecodeAbort is just to centralize the handling of the
various forms of commit/abort records. decode.c really shouldn't need to
be changed much when we start to optionally support streaming out changes
immediately.

> +               /* register output plugin name with slot */
> +               strncpy(NameStr(MyReplicationSlot->data.plugin), plugin,
> +                               NAMEDATALEN);
> +               NameStr(MyReplicationSlot->data.plugin)[NAMEDATALEN - 1] = '\0';
> 
> Hmm.  Shouldn't this be delegated to something in slot.c?  Why don't
> we need the lock?

I wasn't sure, we can place it there, but it doesn't really need to know
about these details either. Lockingwise I don't see it needing more,
nobody but the slot that has it acquired is interested in it.

> Why don't we need to fsync() the change?

I've since pushed a patch that does the fsyncing. Not doing so was part
of a rebase screwup.

> +               /*
> +                * FIXME: we're going to have to do something more
> intelligent about
> +                * timelines on standby's. Use readTimeLineHistory() and
> +                * tliOfPointInHistory() to get the proper LSN?
> +                */
> 
> So what's the plan for that?

Prohibit decoding on the standby for now. Not sure how to deal with the
relevant code, leave it there, #ifdef it out, remove it?

> +               /*
> +                * XXX: It'd be way nicer to be able to use the
> walsender waiting logic
> +                * here, but that's not available in all environments.
> +                */
> 
> I don't understand this.

The walsender get's notified when flushing WAL, which allows the
walsender specific read_page callback to wait on the walsender
latch. For normal backends that's not available, so we have to do a
check/sleep/repeat logic.

> pg_create_decoding_replication_slot() should go in slotfuncs.c.

I wasn't sure about where to place it.

> + * Free an ReorderBufferTXN. Deallocation might be delayed for efficiency
> + * purposes.
> 
> Wow, so we have a bespoke dllist for caching these objects, and that's
> actually material to performance?  What is the allocation/deallocation
> rate here?

It's absolutely massively relevant for performance, I was really
surprised. reorderbuffer.c was the original reason for developing
ilist.h...
Before doing this, memory allocation was by far the top profile,
afterwards it has left the top ten.

I've tried my memory manager rewrite on it, but it didn't fix things
sufficiently. I don't think there's anything as good as a per-type slab
allocation (which this essentially boils down to, even if it has a
higher memory overhead) for frequently allocated types.

> +        * XXX: ->nsubxcnt can be out of date when subtransactions abort, count
> +        * manually.
> 
> Why is this an XXX?

For me XXX really is "watch out", that's the only reason.

> +               /*
> +                * don't do a ReorderBufferCleanupTXN here, with the
> vague idea of
> +                * allowing to retry decoding.
> +                */
> 
> It's a bit late in the cycle for such vagueness.

Well, I sure think people (including me) will continue to work on
this. There's not much downside to not doing a cleanup here, it will be
cleaned up later.

> + * Rejigger change->newtuple to point to in-memory toast tuples instead to
> + * on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM).
> + *
> + * We cannot replace unchanged toast tuples though, so those will still point
> + * to on-disk toast data.
> 
> Why is that OK?

Because we currently only allow accessing changed toast tuples. We'd
discussed that a long time back and it seemed people agree that that's
fair enough initially. In fact, more people were interested in that
behaviour than in doing it differently.
Alternatively we can vacuum toast tables less agressively or WAL log
more.

> This email is a bit down in the trenches; I will try to write another
> with some higher-level considerations.  But I think there is plenty of
> stuff here for you to get started fixing.

Yes, working on it. I'll push the smaller increments to git, ok?

Working on all the issues now, thanks!

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [PERFORM] encouraging index-only scans
Next
From: Merlin Moncure
Date:
Subject: Re: jsonb and nested hstore