Re: unite recovery.conf and postgresql.conf - Mailing list pgsql-hackers

From Greg Smith
Subject Re: unite recovery.conf and postgresql.conf
Date
Msg-id 4EE91248.8010505@2ndQuadrant.com
Whole thread Raw
In response to Re: unite recovery.conf and postgresql.conf  (Josh Berkus <josh@agliodbs.com>)
Responses Re: unite recovery.conf and postgresql.conf
Re: unite recovery.conf and postgresql.conf
Re: unite recovery.conf and postgresql.conf
List pgsql-hackers
I've submitted two patches for adding new include features to the 
postgresql.conf file.  While not quite commit quality yet, I hope 
everyone will agree their reviews this week suggest both are close 
enough that any number of people could finish them off.  Before 
re-opening this can of worms, I wanted people to be comfortable that we 
can expect them to be available as building blocks before 9.2 
development is finished.  Both of those came out of requests from this 
unification thread, and they're a helpful part of what I'd like to propose.

I don't see any path forward here that still expects the recovery.conf 
file to function as it used to.  The "make replication easy" crowd will 
eventually be larger than the pre-9.0 user base, if it isn't already.  
And they clearly want no parts of that thing.  There's been over a year 
of arguing around how to cope with it that will satisfy everyone, so 
many messages I couldn't even read them all usefully in our archives and 
had to go here:

http://postgresql.1045698.n5.nabble.com/recovery-conf-location-td2854644.html
http://postgresql.1045698.n5.nabble.com/unite-recovery-conf-and-postgresql-conf-td4785717.html

I don't think it's possible.  What I would propose is a specification 
based on forced failure if there's any sign of recovery.conf, combined 
with the simplest migration path we can design to ease upgrades from 
older versions.  I think we can make the transition easy enough.  Users 
and tool vendors can make relatively simple changes to support 9.2 
without changing everything they're used to just yet--while still being 
very clear deprecation has arrived and they should reconsider their 
approach.  Only bug-compatible levels of backwards compatibility would 
make this transparent to them, and there's way too many issues to allow 
moving forward that way--a forward path that as far as I can see is 
desired by the majority of users, and just as importantly for all of the 
potential new users we're impacting with the current mess.

There's another, perhaps under considered, concern I want to highlight 
as well.  Tom has repeatedly noted that one of the worst problems here 
would go away if the "existence means do recovery" nature of 
recovery.conf went elsewhere.  And we know some packagers want to 
separate the necessary to manipulate configuration files from the 
database directory, for permissions and management reasons.  As Heikki 
nicely put it (over a year ago), "You don't want to give monitoring 
tools that decide on failover write access to the data directory".  Any 
information that the user is supplying for the purpose of specifying 
things needs to be easy to relocate to a separate config file area, 
instead of treating it more like a control file in $PGDATA.  Some 
chatting this morning with Simon pointed out a second related concern 
there, which makes ideas like "specify the path to the recovery.conf 
file" infeasible.  The data_directory is itself a parameter, so anything 
tied to that or a new GUC means that config files specified there we 
would need two passes.  First identify the data directory, then go back 
again to read recovery.conf from somewhere else.  And nobody wants to 
wander that way.  If it doesn't fit cleanly into the existing 
postgresql.conf parsing, it's gotta go.

Here's the rough outline of what I think would work here:

-All settings move into the postgresql.conf.

-As of 9.2, relocating the postgresql.conf means there are no user 
writable files needed in the data directory.

-Creating a standby.enabled file in the directory that houses the 
postgresql.conf (same logic as "include" uses to locate things) puts the 
system into recovery mode.  That feature needs to save some state, and 
making those decisions based on existence of a file is already a thing 
we do.  Rather than emulating the rename to recovery.done that happens 
now, the server can just delete it, to keep from incorrectly returning 
to a state it's exited.  A UI along the lines of the promote one, 
allowing "pg_ctl standby", should fall out of here.  I think this is 
enough that people who just want to use replication features need never 
hear about this file at all, at least during the important to simplify 
first run through.

-If startup finds a recovery.conf file where it used to live at, 
abort--someone is expecting the old behavior.  Hint to RTFM or include a 
short migration guide right on the spot.  That can have a nice section 
about how you might use the various postgresql.conf include* features if 
they want to continue managing those files separately.  Example: rename 
it as replication.conf and use include_if_exists if you want to be able 
to rename it to recovery.done like before.  Or drop it into a conf.d 
directory where the rename will make it then skipped.

-Tools such as pgpool that want to write a simple configuration file, 
only touching the things that used to go into recovery.conf, can tell 
people to do the same trick.  End their postgresql.conf with a call to 
\include_if_exists replication.conf as part of setup.  While I don't 
like pushing problems toward tool vendors, as one I think validating if 
this has been done doesn't require the sort of fully GUC compatible 
parser people (rightly) want to avoid.  A simple scan of the 
postgresql.conf looking for the recommended text at the front of a line 
could confirm whether that bit is there.  And adding a single 
"include_if_exists" line to the end of the postgresql.conf is not a 
terrible edit job to consider pushing toward tools.  None of this is any 
more complicated than the little search and replace job that initdb does 
right now.

-[Optional:  for 9.2 but not forever, report incompatibilities, or 
potentially stop working altogether, if the things that used to go into 
recovery.conf show up when they aren't going to do anything.  That would 
match the 'stop if the old way is used' theme, but may introduce its own 
new operational issues.]

That's basically it for new ideas.  The mechanics of making everything 
in recovery.conf turn into a GUC seemed pretty uncontroversial when I 
scanned the archives here.  Another idea not to forget is getting the 
read permissions on the new GUCs correct, like making primary_conninfo 
superuser visible only.  Seems like some of the restart error handling 
might still have a tricky part or two left in it.  But if you've 
accepted that a clear break from existing behavior is happening, some of 
those simplify I think.

If we're moving forward this way for 9.2, we do need to be careful that 
a spec is nailed down and people take responsibility for their 
respective parts of it.  There's a few interlocking parts that won't 
work well if partially completed.  I marked myself as reviewer for this 
patch hoping I could crack the deadlock around the specification.  One 
open item I'm not going to do is a more serious review of the already 
submitted code.

To step back toward the thinking that brought me to here (which has 
helped pull Simon a bit closer to the middle of the opinion pack too), 
an application facing change to PostgreSQL has serious backward 
compatibility requirements.  It's really bad if the server runs but will 
quietly malfunction, such that you may not notice the change until much 
later when a rare code path is hit.  But when we're talking about 
something that impacts whether the server will start or not, that can be 
different.  You have a new option.  Make sure the break is obvious and 
total when it happens:  do not function at all if someone tries to 
operate the old way.  The migration path toward the new one should be as 
painless as possible.  If people only have to tweak one or two simple 
things to get an approximation of the old behavior, provide that, give 
migration instructions.  But this is a new major version; if people 
expected all their old configurations to just move over with zero 
changes, that's full of fail in all directions.  The best we can do is 
try to improve the odds their test migrations will fail early, and make
the path to upgrade as simple and well defined as possible.

(It can be argued that people will not actually run into the proposed 
"server blocks if recovery.conf exists" fail-safe during migration 
testing.  It might hit them only when the first fail-over happens in the 
field.  Then they've got a production server down and are staring at the 
new error message, at best; if they don't see the logs even that may not 
be facing them.  I would say that someone who migrates a 
high-availability system over to a new major version, and doesn't read 
the release notes nor do the simplest of fail-over testing, is doomed 
regardless of what we do to try and help them.)

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: review: CHECK FUNCTION statement
Next
From: Robert Haas
Date:
Subject: Re: Command Triggers