Re: Catalog/Metadata consistency during changeset extraction from wal - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Catalog/Metadata consistency during changeset extraction from wal
Date
Msg-id 201206251950.08949.andres@2ndquadrant.com
Whole thread Raw
In response to Re: Catalog/Metadata consistency during changeset extraction from wal  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Catalog/Metadata consistency during changeset extraction from wal
List pgsql-hackers
On Monday, June 25, 2012 05:34:13 PM Robert Haas wrote:
> On Mon, Jun 25, 2012 at 9:43 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> >> > The only theoretical way I see against that problem would be to
> >> > postpone all relation unlinks untill everything that could possibly
> >> > read them has finished. Doesn't seem to alluring although it would be
> >> > needed if we ever move more things of SnapshotNow.
> >> > 
> >> > Input/Ideas/Opinions?
> >> 
> >> Yeah, this is slightly nasty.  I'm not sure whether or not there's a
> >> way to make it work.
> > 
> > Postponing all non-rollback unlinks to the next "logical checkpoint" is
> > the only thing I can think of...
> There are a number of cool things we could do if we postponed unlinks.
>  Like, why can't we allow concurrent read-only queries while a CLUSTER
> operation is in progress?  Well, two reasons.  The first is that we
> currently can't do ANY DDL with less than a full table lock because of
> SnapshotNow-related race conditions.  The second is that people might
> still need to look at the old heap after the CLUSTER transaction
> commits.  Some kind of delayed unlink facility where we
> garbage-collect relation backing files when their refcount falls to
> zero would solve the second problem - not that that's any help by
> itself without a solution to the first one, but hey.
Its an argument why related infrastructure would be interesting to more than 
that patch and thats not bad.
If the garbage collecting is done in a very simplistic manner it doesn't sound 
too hard... The biggest problem is probably crash-recovery of that knowledge 
and how to hook knowledge into it that logical rep needs that data...

> >> I had another idea.  Suppose decoding happens directly on the primary,
> >> because I'm still hoping there's a way to swing that.  Suppose further
> >> that we handle DDL by insisting that (1) any backend which wants to
> >> add columns or change the types of existing columns must first wait
> >> for logical replication to catch up and (2) if a backend which has
> >> added columns or changed the types of existing columns then writes to
> >> the modified table, decoding of those writes will be postponed until
> >> transaction commit.  I think that's enough to guarantee that the
> >> decoding process can just use the catalogs as they stand, with plain
> >> old SnapshotNow.
> > 
> > I don't think its that easy. If you e.g. have multiple ALTER's in the
> > same transaction interspersed with inserted rows they will all have
> > different TupleDesc's.
> 
> If new columns were added, then tuples created with those older
> tuple-descriptors can still be interpreted with the latest
> tuple-descriptor.
But you need to figure that out. If you have just the before-after images of 
the tupledescs you don't know what happened in there... That would mean either 
doing special things on catalog changes or reassembling the meaning from the 
changed pg_* rows. Neither seems enticing.

> Columns that are dropped or retyped are a little trickier, but
> honestly... how much do we care about those cases?  How practical is
> it to suppose we're going to be able to handle them sanely anyway?
> Suppose that the user defines a type which works just like int4 except
> that the output functions writes out each number in pig latin (and the
> input function parses pig latin).  The user defines the types as
> binary coercible to each other and then does ALTER TABLE on a large
> table with an int4 column, transforming it into an int4piglatin
> column.  Due to Noah Misch's fine work, we will conclude that no table
> rewrite is needed.  But if logical replication is in use, then in
> theory we should scan the whole table and generate an LCR for each row
> saying "the row with primary key X was updated, and column Y, which
> used to contain 42, now contains ourty-two-fay".  Otherwise, if we're
> doing heterogenous replication into a system that just stores that
> column as text, it'll end up with the wrong contents.  On the other
> hand, if we're trying to ship data to another PostgreSQL instance
> where the column hasn't yet been updated, then all of those LCRs are
> just going to error out when we try to apply them.

> A more realistic scenario where you have the same problem is with
> something like ALTER TABLE .. ADD COLUMN .. DEFAULT.   If you add a
> column with a default in a single step (as opposed to first adding the
> column and then setting its default), we rewrite the table and set
> every row to the default value.  Should that generate LCRs showing
> every row being updated to add that new value, or should we generate
> no LCRs and assume that the DBA will independently do the same
> operation on the remote side?  Either answer could be correct,
> depending on how the LCRs are being used.  If you're just rewriting
> with a constant default, then perhaps the sensible thing is to
> generate no LCRs, since it will be more efficient to mimic the
> operation on the remote side than to replay the changes row-by-row.
> But what if the default isn't a constant, like maybe it's
> nextval('new_synthetic_pkey_seq') or even something like now().  In
> those cases, it seems quite likely that if you don't generate LCRs,
> manual user intervention will be required to get things back on track.
>  On the other hand, if you do generate LCRs, the remote side will
> become horribly bloated on replay, unless the LCRs also instruct the
> far side that they should be applied via a full-table rewrite.
I think that heavily depends on your use-case. In the settings that I/we have 
in mind the (partial) schemas are tightly coupled so command triggers can just 
do the equivalent changes.
But there will definitely be cases where user errors will cause problems (now, 
nextval as you pointed out) or where we just have to punt.

In other scenarios like Kevins the application code will need to make that 
decisison.

> Can we just agree to punt all this complexity for version 1 (and maybe
> versions 2, 3, and 4)?  I'm not sure what Slony does in situations
> like this but I bet for a lot of replication systems, the answer is
> "do a full resync".  In other words, we either forbid the operation
> outright when the table is enabled for logical replication, or else we
> emit an LCR that says, in effect, "transaction 12345 monkeyed with the
> table, please resync".  It strikes me that it's really the job of some
> higher-level control logic to decide what the "correct" behavior is in
> these cases; the decoding process doesn't really have enough
> information about what the user is trying to do to make a sensible
> decision anyway.  It would be nice to be able to support some simple
> cases like "adding a column that has no default" or "dropping a
> column" without punting, but going much further than that seems like
> it will require embedding policy decisions that should really be
> happening at a higher level.
I am totally fine with saying that we do not support everything from the 
start. But we need to choose an architecture where its possible to add that 
support gradually and I don't think without looking inside transaction makes 
that possible.

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services


pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: WAL format changes
Next
From: Fujii Masao
Date:
Subject: Re: WAL format changes