On 2013-10-22 11:59:35 -0400, Robert Haas wrote:
> >> So I have a new idea for handling this problem, which seems obvious in
> >> retrospect. What if we make the VACUUM FULL or CLUSTER log the old
> >> CTID -> new CTID mappings? This would only need to be done for
> >> catalog tables, and maybe could be skipped for tuples whose XIDs are
> >> old enough that we know those transactions must already be decoded.
> >
> > Ah. If it only were so simple ;). That was my first idea, and after I'd
> > bragged in a 2ndq internal chat that I'd found a simple idea I
> > obviously had to realize it doesn't work.
> >
> > Consider:
> > INIT_LOGICAL_REPLICATION;
> > CREATE TABLE foo(...);
> > BEGIN;
> > INSERT INTO foo;
> > ALTER TABLE foo ...;
> > INSERT INTO foo;
> > COMMIT TX 3;
> > VACUUM FULL pg_class;
> > START_LOGICAL_REPLICATION;
> >
> > When we decode tx 3 we haven't yet read the mapping from the VACUUM
> > FULL. That scenario can happen either because decoding was stopped for
> > a moment, or because decoding couldn't keep up (slow connection,
> > whatever).
>
> It seems to me that you have to think of the CTID map as tied to a
> relfilenode; if you try to use one relfilenode's map with a different
> relfilenode, it's obviously not going to work. So don't do that.

It has to be tied to relfilenode (+ctid) *and* transaction
unfortunately.

> That strikes me as a flaw in the implementation rather than the idea.
> You're presupposing a patch where the necessary information is
> available in WAL yet you don't make use of it at the proper time.

The problem is that the mapping would be somewhere *ahead* of the
transaction/WAL we're currently decoding. We'd need to read ahead till
we find the correct one.

But I think I mainly misunderstood what you proposed. That mapping could
be written beside the relfilenode, instead of into the WAL. Then my
imagined problem doesn't exist anymore.

We would only need to write out mappings for tuples modified since the
xmin horizon, so it wouldn't even be *too* bad for bigger relations.
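
To make the xmin horizon point concrete, here is a rough sketch in Python,
used purely as pseudocode. Everything in it is an assumption for
illustration: the names RewrittenTuple and mappings_to_write are made up,
CTIDs are simplified to (block, offset) pairs, and XIDs are compared as
plain integers, whereas real XIDs wrap around and need a wraparound-aware
comparison.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Ctid = Tuple[int, int]  # simplified stand-in: (block number, offset)


@dataclass(frozen=True)
class RewrittenTuple:
    """Hypothetical record of one tuple moved by a rewrite."""
    old_ctid: Ctid
    new_ctid: Ctid
    xmin: int                   # inserting transaction
    xmax: Optional[int] = None  # deleting/updating transaction, if any


def mappings_to_write(tuples: Iterable[RewrittenTuple],
                      xmin_horizon: int) -> Dict[Ctid, Ctid]:
    """Keep old->new CTID pairs only for tuples touched at or after the horizon."""
    out: Dict[Ctid, Ctid] = {}
    for t in tuples:
        # Assumption: plain integer comparison; real XIDs wrap around.
        touched_recently = (t.xmin >= xmin_horizon or
                            (t.xmax is not None and t.xmax >= xmin_horizon))
        if touched_recently:
            out[t.old_ctid] = t.new_ctid
    return out
```

Tuples last touched before the horizon produce no entry at all, which is
why the amount written would depend on recent activity rather than on the
size of the relation.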

This won't easily work for two+ rewrites because we'd need to apply all
mappings in order and thus would have to keep a history of intermediate
nodes/mappings. But it'd be perfectly doable to simply wait till
decoders are caught up.
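
As an illustration of why ordering matters with two or more rewrites, a
minimal sketch, again in Python as pseudocode. The function resolve_ctid
and the dict-per-rewrite representation are hypothetical; the only point
is that each rewrite's map consumes the CTIDs produced by the previous
one, so every intermediate map has to be kept and applied oldest first.

```python
from typing import Dict, Sequence, Tuple

Ctid = Tuple[int, int]  # simplified stand-in: (block number, offset)


def resolve_ctid(ctid: Ctid, rewrite_maps: Sequence[Dict[Ctid, Ctid]]) -> Ctid:
    """Translate a CTID through each rewrite's map, oldest rewrite first."""
    for step, mapping in enumerate(rewrite_maps):
        if ctid not in mapping:
            # A missing hop means the history is incomplete or out of order.
            raise LookupError(f"no mapping for {ctid} in rewrite #{step}")
        ctid = mapping[ctid]
    return ctid
```

Waiting until decoders have caught up before allowing another rewrite
would mean this list never holds more than one map.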

I still "feel" that simply storing both cmin, cmax is cleaner, but if
that's not acceptable, I can certainly live with something like this.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services