Re: logical changeset generation v6.8 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical changeset generation v6.8
Date 2013-12-16 11:01:38
Msg-id 20131216110138.GA6019@awork2.anarazel.de
In response to Re: logical changeset generation v6.8  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: logical changeset generation v6.8
List pgsql-hackers
Hi Robert,

On 2013-12-16 00:53:10 -0500, Robert Haas wrote:
> On Wed, Dec 11, 2013 at 7:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think there's much point in including "remapping" in all of
> >> the error messages.  It adds burden for translators and users won't
> >> know what a remapping file is anyway.
> >
> > It helps in locating which part of the code caused a problem. I utterly
> > hate to get reports with error messages that I can't correlate to a
> > source file. Yes, I know that verbose error output exists, but usually
> > you don't get it that way... That said, I'll try to make the messages
> > simpler.
> 
> Well, we could adopt a policy of not making messages originating from
> different locations in the code the same.  However, it looks to me
> like that's NOT the current policy, because some care has been taken
> to reuse messages rather than distinguish them.

To me that mostly looks like cases where we either don't want to give
away more detail for various security-ish reasons, or where messages
have been copy-and-pasted...

> There's no hard and
> fast rule here, because some cases are distinguished, but my gut
> feeling is that all of the errors your patch introduces are
> sufficiently obscure cases that separate messages with separate
> translations are not warranted.

Perhaps we should just introduce a marker that some such strings are not
to be translated if they are of the unexpected kind. That would probably
make debugging easier too ;)

> I need to examine that logic in more detail,
> but I had trouble following it and was hoping the next version would
> be better-commented.

Yea, I've started expanding the comments about the higher-level concerns
- I've been so knee-deep in this that I didn't realize they weren't
there.

> > b) make hot_standby_feedback work across disconnections of the walsender
> >    connection (i.e peg xmin, not just for catalogs though)
> 
> Check; might need to be optional.

Yea, I am pretty sure it will. It'd probably be pretty nasty to set
min_recovery_apply_delay=7d and have the xmin held back that long...
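As a sketch of the problematic combination (using the GUC names as
currently proposed, the delay value made up):

    # on the standby
    hot_standby_feedback = on
    min_recovery_apply_delay = 7d    # apply WAL a week late

If the fed-back xmin were unconditionally kept across disconnects, the
primary couldn't clean up a week's worth of dead rows.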

> > c) Make sure we can transport those across cascading
> >    replication.
> 
> Not sure I follow.

Consider a replication scenario like primary <-> standby-1 <->
standby-2. The primary must retain not only the data that standby-1
requires, but also the data that standby-2 needs. That's not really
necessary for WAL, since that will also reside on standby-1, but it
definitely is for the xmin horizon.
So standby-1 will need to signal not only its own needs, but also those
of the nodes below it. E.g. if standby-1's own xmin is 1000 but
standby-2 reports an xmin of 990, standby-1 has to pass min(1000, 990)
= 990 up to the primary.

> > The hard questions that I see are like:
> > * How do we manage standby registration? Does the admin have to do that
> >   manually? Or does a standby register itself automatically if some config
> >   paramter is set?
> > * If automatically, how do we deal with the situation that registrant
> >   dies before noting his own identifier somewhere persistent? My best idea
> >   is a two phase registration process where registration in phase 1 are
> >   thrown away after a restart, but yuck.
>
> If you don't know the answers to these questions for the kind of
> replication that we have now, then how do you know the answers for
> logical replication?  Conversely, what makes the answers that you've
> selected for logical replication unsuitable for our existing
> replication?

There's a pretty fundamental difference imo - with the logical decoding
stuff we only supply support for change-producing nodes, with physical
rep we supply both sides.
There's no need to decide how node ids are stored for in-core logical
rep consumers since there are no in-core ones. Yet. Also, physical rep
by now is a pretty established thing, so we need to be much more careful
about compatibility there.

> I have to admit that before I saw your design for the logical
> replication slots, the problem of making this work for our existing
> replication stuff seemed almost intractable to me; I had no idea how
> that was going to work.

Believe me, it caused me some headaches to devise it for decoding
too. Oh, and I think I watched just about all episodes of some stupid TV
show during it ;)

> Keeping the data in shared memory,
> persisting them across shutdowns, and managing them via either
> function calls or the replication command language seems perfect.

Thanks. I think the concept has quite some merit. The implementation is
a bit simplistic atm - e.g. we might want to work harder at coalescing
fsync()s and such - but that's a further step once we see whether it's
worthwhile in the real world.

> Now, in terms of how registration works, whether for physical
> replication or logical, it seems to me that the DBA will have to be
> responsible for telling each client the name of the slot to which that
> client should connect (omitting it at said DBA's option if, as for
> physical replication, the slot is not mandatory).

It seems reasonable to me to reuse the application_name for the slot's
name, similar to the way it's used for synchronous rep. It seems odd to
use two different ways to identify nodes. It should probably only be
part of the final slot name though, with the rest being autogenerated.
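As a sketch, with the slot name derived from the connection's
application_name (the derived name purely hypothetical):

    # standby's recovery.conf
    primary_conninfo = 'host=primary user=rep application_name=standby1'

which would then e.g. end up as an autogenerated slot name like
"phys_standby1" on the primary.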

> Assuming such a design, clients could elect for one of two possible
> behaviors when the anticipated slot doesn't exist: either they error
> out, or they create it.  Either is reasonable; we could even support
> both.

For physical rep, I don't see too much argument for not autogenerating
it. The one I can see is that it makes accidentally changing a slot's
name easier, with the consequence of leaving an unused slot around.

>  A third
> alternative is to make each client responsible for generating a name,
> but I think that's probably not as good.  If we went that route, the
> name would presumably be some kind of random string, which will
> probably be a lot less usable than a DBA-assigned name.  The client
> would first generate it, second save it somewhere persistent (like a
> file in the data directory), and third create a slot by that name.  If
> the client crashes before creating the slot, it will find the name in
> its persistent store after restart and, upon discovering that no such
> slot exists, try again to create it.

I thought we'd need to go that route for a while, but I struggled
exactly with the kind of races you describe. I'd already toyed with
ideas of making slots "ephemeral" initially, until they get confirmed by
the standby. Essentially reinventing 2PC...
Not needing to solve those problems sounds like a good idea.

> But note that no matter which of those three options we pick, the
> server support really need not work any differently.

Yea, that part isn't worrying me overly much - there's really not much
besides passing the slot name before/in START_REPLICATION that needs to
be done.
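I.e. on the wire it'd just be something roughly like (syntax by no
means final):

    START_REPLICATION SLOT "standby1" 0/3000000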

> > * How do we deal with the usability wart that people *will* forget to
> >   delete a slot when removing a node?
> 
> Aside from the point mentioned in the previous paragraph, we don't.
> The fact that a standby which gets too far behind can't recover is a
> usability wart, too.  I think this is a bit like asking what happens
> if a user keeps inserting data of only ephemeral value into a table
> and never deletes any of it as they ought to have done.  Well, they'll
> be wasting disk space; potentially, they will fill the disk.  Sucks to
> be them.  The solution is not to prohibit inserting data.

I am perfectly happy with taking that stance - in previous discussions
some (most notably Peter G.) argued ardently against it though.

I think we need to improve the monitoring facilities a bit, and that
should be it. Like
* expose xmin in pg_stat_activity, pg_prepared_xacts, pg_replication_slots (or whatever it's going to be called)
* expose the last restartpoint's redo pointer in pg_stat_replication, pg_replication_slots
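With those in place, finding what's holding the horizon back could be a
plain query, e.g. (column and view names obviously not final):

    SELECT slot_name, xmin, age(xmin) AS xmin_age
    FROM pg_replication_slots
    ORDER BY age(xmin) DESC;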

That said, the consequences can be a bit harsher than a full disk - the
anti-wraparound security measures might kick in, requiring a restart
into single-user mode. That's way more confusing than having to clean up
a bit of space on the disk.

> > * What requirements do we have for transporting knowledge about this
> >   across a failover?

> I wouldn't consider this a requirement for a useful feature, and if it
> is a requirement, then how are you solving it for logical replication?

Failover for physical rep and for logical rep are pretty different
beasts imo - I don't think they can easily be compared. Especially since
we won't deliver a full logical rep solution and, as you say, it will
look pretty different depending on the individual solution.

Consider what happens, though, if you promote a node for physical rep.
As soon as it's promoted, it will accept writes and then start a
checkpoint. Unless other standbys have started to follow that node
before either that checkpoint happens (removing WAL) or
autovacuuming/hot-pruning is performed (creating recovery conflicts),
we'll possibly lose the data required to let the standbys follow the
promotion. Note that wal_keep_segments and vacuum_defer_cleanup_age both
sorta work for that...
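I.e. the current stopgap essentially amounts to (values made up):

    # on the primary / WAL-serving node
    wal_keep_segments = 100             # retain extra WAL segments
    vacuum_defer_cleanup_age = 100000   # delay row cleanup by that many xids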

Could somebody please deliver me a time dilation device?

Regards,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


