Re: logical changeset generation v6.2 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical changeset generation v6.2
Date
Msg-id 20131022170843.GD7435@awork2.anarazel.de
In response to Re: logical changeset generation v6.2  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: logical changeset generation v6.2  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2013-10-22 11:59:35 -0400, Robert Haas wrote:
> >> So I have a new idea for handling this problem, which seems obvious in
> >> retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
> >> CTID -> new CTID mappings?  This would only need to be done for
> >> catalog tables, and maybe could be skipped for tuples whose XIDs are
> >> old enough that we know those transactions must already be decoded.
> >
> > Ah. If it only were so simple ;). That was my first idea, and after I'd
> > bragged in a 2ndq internal chat that I'd found a simple idea I
> > obviously had to realize it doesn't work.
> >
> > Consider:
> > INIT_LOGICAL_REPLICATION;
> > CREATE TABLE foo(...);
> > BEGIN;
> > INSERT INTO foo;
> > ALTER TABLE foo ...;
> > INSERT INTO foo;
> > COMMIT TX 3;
> > VACUUM FULL pg_class;
> > START_LOGICAL_REPLICATION;
> >
> > When we decode tx 3 we haven't yet read the mapping from the VACUUM
> > FULL. That scenario can happen either because decoding was stopped for
> > a moment, or because decoding couldn't keep up (slow connection,
> > whatever).

> It seems to me that you have to think of the CTID map as tied to a
> relfilenode; if you try to use one relfilenode's map with a different
> relfilenode, it's obviously not going to work.  So don't do that.

It has to be tied to relfilenode (+ctid) *and* transaction,
unfortunately.

> That strikes me as a flaw in the implementation rather than the idea.
> You're presupposing a patch where the necessary information is
> available in WAL yet you don't make use of it at the proper time.

The problem is that the mapping would be somewhere *ahead* of the
transaction/WAL we're currently decoding. We'd need to read ahead until
we find the correct one.

But I think I mainly misunderstood what you proposed. That mapping could
be written beside the relfilenode instead of into the WAL. Then my
imagined problem doesn't exist anymore.

We would only need to write out mappings for tuples modified since the
xmin horizon, so it wouldn't even be *too* bad for bigger relations.
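
To make the filtering idea concrete, here is a minimal Python sketch. It is
not PostgreSQL code: `CatalogTuple`, `mappings_to_log` and the plain integer
comparison of XIDs are all assumptions made for illustration (real XID
comparison has to account for wraparound).

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

Ctid = Tuple[int, int]  # (block number, line pointer offset)


class CatalogTuple(NamedTuple):
    """Hypothetical view of a catalog tuple during a rewrite."""
    old_ctid: Ctid
    new_ctid: Ctid
    xmin: int
    xmax: Optional[int]  # None if the tuple was never updated or deleted


def mappings_to_log(tuples: Iterable[CatalogTuple],
                    horizon: int) -> Dict[Ctid, Ctid]:
    """Collect old -> new ctid mappings only for recently modified tuples.

    Assumption: XIDs are compared as plain integers, ignoring wraparound.
    """
    mapping: Dict[Ctid, Ctid] = {}
    for t in tuples:
        inserted_recently = t.xmin >= horizon
        deleted_recently = t.xmax is not None and t.xmax >= horizon
        if inserted_recently or deleted_recently:
            mapping[t.old_ctid] = t.new_ctid
    return mapping
```

Under these assumptions, a large relation in which only a few rows were
touched after the horizon produces a correspondingly small mapping.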

This won't easily work for two or more rewrites, because we'd need to
apply all mappings in order and thus would have to keep a history of
intermediate nodes/mappings. But it'd be perfectly doable to simply wait
until the decoders have caught up before allowing another rewrite.
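
The ordering requirement for repeated rewrites can be sketched in Python as
follows. This is only an illustration with hypothetical names
(`resolve_ctid`, one dict per rewrite); it does not reflect any actual
on-disk format or PostgreSQL API.

```python
from typing import Dict, List, Optional, Tuple

Ctid = Tuple[int, int]  # (block number, line pointer offset)


def resolve_ctid(ctid: Ctid,
                 rewrites: List[Dict[Ctid, Ctid]]) -> Optional[Ctid]:
    """Follow a tuple location through successive rewrites, oldest first.

    Returns None as soon as one rewrite has no entry for the current
    location, i.e. the chain of mappings is incomplete.
    """
    current = ctid
    for mapping in rewrites:
        if current not in mapping:
            return None
        current = mapping[current]
    return current


# Two hypothetical rewrites of the same catalog relation.
first_rewrite = {(0, 1): (0, 3), (0, 2): (0, 1)}
second_rewrite = {(0, 3): (1, 1), (0, 1): (0, 2)}

in_order = resolve_ctid((0, 1), [first_rewrite, second_rewrite])
out_of_order = resolve_ctid((0, 1), [second_rewrite, first_rewrite])
```

Here `in_order` ends up at the tuple's final location, while `out_of_order`
lands somewhere else, which is why every intermediate mapping would have to
be kept and applied in sequence.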

I still "feel" that simply storing both cmin and cmax is cleaner, but if
that's not acceptable, I can certainly live with something like this.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


