Re: logical changeset generation v6 - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: logical changeset generation v6
Date:
Msg-id: 20130920103347.GA5287@awork2.anarazel.de
In response to: Re: logical changeset generation v6 (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: logical changeset generation v6
List: pgsql-hackers
Hi,

On 2013-09-19 12:05:35 -0400, Robert Haas wrote:
> No question.  I'm not saying that that optimization shouldn't go in
> right after the main patch does, but IMHO right now there are too many
> things going in the 0004 patch to discuss them all simultaneously.
> I'd like to find a way of splitting this up that will let us
> block-and-tackle individual pieces of it, even if we end up committing
> them all one right after the other.

Fine with me. I was criticized for splitting up stuff too much before ;)

Expect a finer-grained series.

> But that raises an interesting question: why is the overhead so bad?
> I mean, this shouldn't be any worse than having a series of
> transactions running concurrently with pgbench that take a snapshot
> and hold it for as long as it takes the decoding process to decode the
> most-recently committed transaction.

Pgbench really slows down scarily if there are some slightly
longer-running transactions around...

> Is the issue here that we can't
> advance xmin until we've fsync'd the fruits of decoding down to disk?

Basically yes. We only advance the slot's xmin far enough that we can
still build a valid snapshot to decode the first transaction not yet
confirmed by the client to have been synced to disk.
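
To make that concrete, here's a minimal sketch of that rule with
made-up names (LogicalSlot, candidate_xmin, effective_xmin are
illustrative, not the patch's actual symbols):

/* Hypothetical slot state, for illustration only. */
typedef struct LogicalSlot
{
    TransactionId effective_xmin;     /* horizon GetOldestXmin() honors */
    TransactionId candidate_xmin;     /* pending, not yet safe to use */
    XLogRecPtr    candidate_xmin_lsn; /* client must confirm flushing
                                       * past this point first */
} LogicalSlot;

static void
SlotConfirmFlush(LogicalSlot *slot, XLogRecPtr confirmed_flush)
{
    /*
     * Only once the client has confirmed syncing everything up to
     * candidate_xmin_lsn may the slot's xmin horizon advance; until
     * then, rows needed to re-decode unconfirmed transactions have to
     * survive.
     */
    if (slot->candidate_xmin_lsn != InvalidXLogRecPtr &&
        slot->candidate_xmin_lsn <= confirmed_flush)
    {
        slot->effective_xmin = slot->candidate_xmin;
        slot->candidate_xmin_lsn = InvalidXLogRecPtr;
    }
}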

> If so, that's mighty painful.  But we'd really only need to hold back
> xmin in that situation when some catalog change has occurred
> meanwhile, which for pgbench means never.  So something seems fishy
> here.

It's less simple than that. We need to protect against concurrent DDL
producing deleted rows that we will still need. We need
HeapTupleSatisfiesVacuum() to return HEAPTUPLE_RECENTLY_DEAD, not
HEAPTUPLE_DEAD, for such rows, right?
The way to do that is to guarantee that, for any such committed xmax,
TransactionIdPrecedes(xmax, OldestXmin) returns false.
So, we need to peg OldestXmin (as passed to HTSV) to the xid of the
oldest transaction we're still decoding.
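
For reference, once the deleting transaction is known to have
committed, the tail of HeapTupleSatisfiesVacuum() (tqual.c) boils down
to this (condensed sketch; hint bits and multixacts omitted):

    if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
        return HEAPTUPLE_RECENTLY_DEAD;     /* deleter >= OldestXmin */

    /* otherwise the deleter is older than every horizon: removable */
    return HEAPTUPLE_DEAD;

Pegging OldestXmin to the oldest xid still being decoded keeps the
first branch taken for every row the decoder might yet need.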

I am not sure how you could do that only when DDL is actually involved,
since DDL might start at any point in the future and there's no
interlock preventing anyone from doing so.

> >> - It still bothers me that we're going to have mandatory slots for
> >> logical replication and no slots for physical replication.

> > If people think this needs to be a general facility from the start, I
> > can be convinced that way, but I think there's so much to discuss around
> > the semantics and different use cases that I'd much prefer to discuss
> > that later.
> 
> I'm worried that if we don't know how the physical replication slots
> are going to work, they'll end up being randomly different from the
> logical replication slots, and that'll be an API wart which we'll have
> a hard time getting rid of later.

Hm. I actually think that, apart from some s/Logical//g and a mv, there
won't be much need to change the slot interface itself.

What we need for physical rep is basically to a) store the position up
to which the standby has fsynced the WAL and b) store the xmin horizon
the standby currently has.
Sure, we can store more stats (most of pg_stat_replication, perhaps some
more), but that's not functionally critical and not hard to extend.
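
As a back-of-the-envelope sketch, the persistent part of such a slot
might be little more than this (struct and field names invented for
illustration, not from the patch):

/* Hypothetical persistent state of a physical replication slot. */
typedef struct PhysicalSlotData
{
    NameData      name;        /* slot identifier */
    XLogRecPtr    restart_lsn; /* a) oldest WAL the standby still needs,
                                *    per its fsync feedback */
    TransactionId xmin;        /* b) xmin horizon the standby holds */
} PhysicalSlotData;

Stats on top of that are just additional fields and don't affect the
on-disk contract.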

The points I find daunting are the semantics, like:
* How do we control whether a standby is allowed to prevent WAL file removal? What if archiving is configured?
* How do we control whether a standby is allowed to peg xmin?
* How long do we keep xmin pegged/WAL files retained if the standby is gone?
* What does the user interface for removing a slot look like once a standby is gone?
* How do we decide/control which commands use a slot in which cases?


> >> - What is the purpose of (Un)SuspendDecodingSnapshots?  It seems that
> >> should be explained somewhere.  I have my doubts about how safe that
> >> is.
> >
> > I'll document the details if they aren't right now. Consider what
> > happens if somebody does something like: "VACUUM FULL pg_am;". If we
> > were to build the relation descriptor of pg_am in an "historical
> > snapshot", as you coin it, we'd have the wrong filenode in there. And
> > consequently any future lookups in pg_am will open a file that doesn't
> > exist.
> > That problem only exists for non-nailed relations that are accessed
> > during decoding.
> 
> But if it's some user table flagged with the terribly-named
> treat_as_catalog_table flag, then they could have not only changed the
> relfilenode but also the tupledesc.  And then you can't just wave your
> hands at the problem.

Heh. Well caught.

There's a comment about that somewhere... Those cases are problematic;
my plan so far is to throw up my hands and forbid ALTER TABLEs that
rewrite such tables.
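
Roughly, that could be a guard in ALTER TABLE's rewrite phase along
these lines (the flag accessor is hypothetical, named after whatever
the flag ends up being called):

    /* hypothetical check in the table-rewrite path */
    if (RelationIsTreatedAsCatalogTable(rel))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot rewrite table \"%s\" which is used as a catalog table",
                        RelationGetRelationName(rel))));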

I know you don't like that flag, and especially its name. I am open to
suggestions to a) rename it or b) find a better solution. I am pretty
sure a) is possible, but I have severe doubts about any realistic b).

> > Yes, I don't like it either. I am not sure what to replace it with
> > though. It's easy enough to fit something in GetCatalogSnapshot() and I
> > don't have a problem with that, but I am less happy with adding code
> > like that to GetSnapshotData() for callers that use explicit snapshots.
> 
> I'm not sure exactly what a good solution would look like, either.  I just
> think this isn't it.  :-)

I know that feeling ;)

Greetings,

Andres Freund


