Thread: Simplifying replication

Simplifying replication

From
Josh Berkus
Date:
Robert asked me to write this up, so here it is.

It is critical that we make replication easier to set up, administrate 
and monitor than it currently is.  In my conversations with people, this 
is more important to our users and the adoption of PostgreSQL than 
synchronous replication is.

First, I'm finding myself constantly needing to tutor people on how to 
set up replication.  The mere fact that it requires a minimum 1-hour 
class to explain how to use it, or a 10-page tutorial, tells us it's too 
complex.  As further evidence, Bruce and I explained binary replication 
to several MySQL geeks at OpenSQLCamp last weekend, and they were 
horrified at the number and complexity of the steps required.  As it 
currently is, binary replication is not going to win us a lot of new 
users from the web development or virtualization world.

I had to write it up a couple of times; I started with a critique of the 
various current commands and options, but that seemed to miss the point.  
So instead, let me lay out how I think replication should work in my 
dream world 9.1:

1. Any postgresql standalone server can become a replication master 
simply by enabling replication connections in pg_hba.conf.  No other 
configuration is required, and no server restart is required.

2. Should I choose to adjust master configuration, for say performance 
reasons, most replication variables (including ones like 
wal_keep_segments) should be changeable without a server restart.

3. I can configure a standby by copying the postgresql.conf from the 
master.  I only have to change a single configuration variable (the 
primary_conninfo, or maybe a replication_mode setting) in order to start 
the server in standby mode.  GUCs which apply only to masters are ignored.

4. I can start a new replica off the master by running a single 
command-line utility on the standby and giving it connection information 
to the master.  Using this connection, it should be able to start a 
backup snapshot, copy the entire database and any required logs, and 
then come up in standby mode.  All that should be required for this is 
one or two highport connections to the master.  No recovery.conf file is 
required, or exists.

5. I can cause the standby to fail over with a single command to the 
failover server.  If this is a trigger file, then it already has a 
default path to the trigger file in postgresql.conf, so that this does 
not require reconfiguration and restart of the standby at crisis time. 
Ideally, I use a "pg_failover" command or something similar.

6. Should I decide to make the standby the new master, this should also 
be possible with a single command and a one-line configuration on the 
other standbys.  To aid this, we have an easy way to tell which standby 
in a group is most "caught up".  If I try to promote the wrong standby 
(it's behind or somehow incompatible), it should fail with an 
appropriate message.

7. Should I choose to use archive files as well as streaming 
replication, the utilities to manage them (such as pg_archivecleanup and 
pg_standby) are built and installed with PostgreSQL by default, and do 
not require complex settings with escape codes.

That's my vision of "simple replication".  It is also 100% achievable.  
We just have to prioritize ease-of-use over having, and requiring the 
user to set, 1,000 little knobs.  (A rough sketch of what this could 
look like from the standby's side follows.)
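
To make that concrete, here's the sort of session I'm imagining on the 
standby side.  The command names and flags below are invented purely for 
illustration; only pg_failover is something actually proposed above:

    # clone a new standby from the master and start it -- no recovery.conf
    $ pg_clonestandby -h master.example.com -U replication -D /var/lib/pgsql/data
    $ pg_ctl -D /var/lib/pgsql/data start

    # at crisis time, promote this standby with one command
    $ pg_failover -D /var/lib/pgsql/data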

Speaking of knobs .... (next message)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Brendan Jurd
Date:
On 19 October 2010 11:16, Josh Berkus <josh@agliodbs.com> wrote:
> 4. I can start a new replica off the master by running a single command-line
> utility on the standby and giving it connection information to the master.
>  Using this connection, it should be able to start a backup snapshot, copy
> the entire database and any required logs, and then come up in standby mode.
>  All that should be required for this is one or two highport connections to
> the master.  No recovery.conf file is required, or exists.

Having just configured replication for the first time with 9.0, I
agree in general with your whole message, but in particular, I want to
give a double-thumbs-up to the above.

Resolving this one item would subtract a great deal of pain -- and
potential for error -- from the process.

Cheers,
BJ


Re: Simplifying replication

From
Dimitri Fontaine
Date:
Hi,

Josh Berkus <josh@agliodbs.com> writes:
> It is critical that we make replication easier to set up, administrate and
> monitor than it currently is.  In my conversations with people, this is more
> important to our users and the adoption of PostgreSQL than synchronous
> replication is.

I want to say a big big +1 here. The way replication and PITR setup are
implemented now is a very good prototype; it's time to consolidate and
get to something usable by normal people, as opposed to PostgreSQL full
time geeks.

Well, the current setup offers lots of flexibility which we'd better not
lose in the process, but the simple setup simply does not exist yet.

> 1. Any postgresql standalone server can become a replication master simply
> by enabling replication connections in pg_hba.conf.  No other configuration
> is required, and no server restart is required.

That sounds as simple as changing the default wal_level to hot_standby,
and the default max_wal_senders to non-zero.
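
For reference, the whole master-side setup today boils down to this (the 
pg_hba.conf line is illustrative), plus a restart, because those first 
two GUCs can't yet be changed on the fly -- which is exactly what point 1 
asks for:

    # postgresql.conf on the master
    wal_level = hot_standby
    max_wal_senders = 5
    wal_keep_segments = 32          # optional safety margin

    # pg_hba.conf on the master
    host  replication  repuser  192.168.0.0/24  md5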

> 2. Should I choose to adjust master configuration, for say performance
> reasons, most replication variables (including ones like wal_keep_segments)
> should be changeable without a server restart.

Anybody know how difficult that is without having to spend lots of time
studying the source code with the question in mind?

> 3. I can configure a standby by copying the same postgresql.conf on the
> master.  I only have to change a single configuration variable (the
> primary_conninfo, or maybe a replication_mode setting) in order to start the
> server in standby mode.  GUCs which apply only to masters are ignored.
>
> 4. I can start a new replica off the master by running a single command-line
> utility on the standby and giving it connection information to the master.
> Using this connection, it should be able to start a backup snapshot, copy
> the entire database and any required logs, and then come up in standby mode.
> All that should be required for this is one or two highport connections to
> the master.  No recovery.conf file is required, or exists.

There's a prototype to stream a base backup from a libpq connection, I
think someone here wanted to integrate that into the replication
protocol itself. It should be doable with a simple libpq connection and
all automated.

The pg_basebackup python client software is 100 lines of code. It's
mainly a recursive query to get the list of files on the master, then 
two server-side functions to get binary file chunks, compressed. Then, 
on the client side, a loop decompresses and writes the chunks in the 
right place. That's it.
 http://github.com/dimitri/pg_basebackup/blob/master/pg_basebackup.py
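
The shape of the client-side loop is roughly this -- a sketch only, not 
the code from the script above; get_file_chunk() stands in for the 
server-side function the script installs, assumed here to return a 
zlib-compressed bytea chunk of the file starting at the given offset:

    import os, zlib, psycopg2

    CHUNK = 8 * 1024 * 1024   # bytes of (uncompressed) file per round trip

    def clone(conninfo, files, pgdata):
        # 'files' is the (path, size) list from the recursive query
        cur = psycopg2.connect(conninfo).cursor()
        for path, size in files:
            dest = os.path.join(pgdata, path)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with open(dest, 'wb') as out:
                offset = 0
                while offset < size:
                    cur.execute("SELECT get_file_chunk(%s, %s, %s)",
                                (path, offset, CHUNK))
                    out.write(zlib.decompress(bytes(cur.fetchone()[0])))
                    offset += CHUNK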

I could prepare a patch given some advice on the replication protocol
integration. For one, is streaming a base backup something that
walsender should care about?

> 5. I can cause the standby to fail over with a single command to the
> failover server.  If this is a trigger file, then it already has a default
> path to the trigger file in postgresql.conf, so that this does not require
> reconfiguration and restart of the standby at crisis time. Ideally, I use a
> "pg_failover" command or something similar.

This feature is in walmgr.py from Skytools and it's something necessary
to have in -core now that we have failover standby capacity. Much
agreed, and the pg_failover command is a good idea.

BTW, do we have a clear idea of how to implement pg_ping, and should it
report the current WAL location(s) of a standby?

> 6. Should I decide to make the standby the new master, this should also be
> possible with a single command and a one-line configuration on the other
> standbys.  To aid this, we have an easy way to tell which standby in a group
> is most "caught up".  If I try to promote the wrong standby (it's behind or
> somehow incompatible), it should fail with an appropriate message.

That needs a way to define a group of standbys. There's nothing there 
that makes them know about each other. That could fall out of automated 
registration of them in a shared catalog on the master, with this shared 
catalog spread over (hard-coded) asynchronous replication (sync == 
disaster here). But there's no agreement on this feature yet.

Then you need a way to organise them in groups in this shared catalog,
and you need to ask your network admins to make it so that they can
communicate with each other.

Now say we have pg_ping (or another tool) returning the current recv, 
applied and synced LSNs: it would be possible for any standby to figure 
out which other ones must be shot in case you fail over here. The 
failover command could list those other standbys in the group that 
you're behind, and a force option would allow you to fail over to this 
one anyway. Then you have to STONITH the ones listed, but that's your 
problem after all.
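
Today you can already get at two of those three positions with the 
existing functions; what pg_ping would add is the synced position and a 
single, standard interface:

    -- on each standby
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();
    -- on the master, for comparison
    SELECT pg_current_xlog_location();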

Then, of course, any standby that's not in the same group as the one
that you failed over to has to be checked and resynced.

> 7. Should I choose to use archive files as well as streaming replication,
> the utilities to manage them (such as pg_archivecleanup and pg_standby) are
> built and installed with PostgreSQL by default, and do not require complex
> settings with escape codes.

Now that PITR has been in for a long enough time, we *need* to take it
to the next step integration-wise. By that I mean that we have to support
internal commands and provide reasonable default implementation of the
different scripts needed (in portable C, hence "internal").

There are too many pitfalls in this part of the setup for us to be 
serious about documenting them all and expecting people to come up with 
bash or perl implementations that avoid them all. That used to be good 
enough, but Josh is right, we need to get even better!
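
For reference, this is the kind of thing every user currently has to 
hand-roll, split across the master's postgresql.conf and the standby's 
recovery.conf (paths illustrative):

    # master, postgresql.conf
    archive_mode    = on
    archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'

    # standby, recovery.conf
    restore_command         = 'cp /archive/%f %p'
    archive_cleanup_command = 'pg_archivecleanup /archive %r'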

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Simplifying replication

From
Greg Smith
Date:
Josh Berkus wrote:
> It is critical that we make replication easier to set up, administrate 
> and monitor than it currently is.  In my conversations with people, 
> this is more important to our users and the adoption of PostgreSQL 
> than synchronous replication is.

You should enjoy one of the patches we're furiously working on then, 
which is aiming at some of the administration and monitoring pieces 
here.  I have my own grand vision of how easy replication should be to 
setup too.  Visions and plans are nice, but building functional pieces 
of them and delivering them to the community is what actually moves 
PostgreSQL forward.  So far, multiple people have done that for sync 
rep, and what we're supposed to be focused on at this stage in the 
development cycle is finishing the work related to the open CommitFest 
item that includes that.

I find this launch into a new round of bike-shedding a bit distracting.  
If you want this to be easier to use, which it's obvious to any observer 
it should be because what's delivered in 9.0 is way too complicated, 
please work on finding development resources to assign to that problem.  
Because that's the bottleneck on simplifying things, not ideas about 
what to do.  I would recommend finding or assigning a developer to work 
on integrating base backup in to the streaming protocol as the biggest 
single thing that would improve the built-in replication.  All of the 
rest of the trivia about what knobs to set and such are tiny details 
that make for only a minor improvement until that's taken care of.

-- 
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us




Re: Simplifying replication

From
Josh Berkus
Date:
> You should enjoy one of the patches we're furiously working on then,
> which is aiming at some of the administration and monitoring pieces
> here.

Great, glad to hear it!  Would you be willing to go into detail?

> I have my own grand vision of how easy replication should be to
> setup too.

So, share it.  I'd look forward to hearing it, especially since your 
vision probably takes synch rep and quorum commit into account, which 
mine doesn't.   If not here, then on your blog.

> Visions and plans are nice, but building functional pieces of
> them and delivering them to the community is what actually moves
> PostgreSQL forward.

*shrug*.  Robert asked me to write it up for the list based on the 
discussions around synch rep.  Now you're going to bash me for doing so?

Many of the goals I described will mean removing knobs and changing 
defaults, or even foregoing fine-grained control entirely.  If we don't 
have agreement that simplifying replication is a high-priority goal, 
then it won't happen; anyone submitting a patch will be 
this-or-that-use-cased to death and will give up.

For that matter, I'm not sure that everyone agrees that simplification 
is a worthwhile goal.  For example, somewhere between 9.0beta4 and final 
release, someone changed the defaults for max_wal_senders and 
hot_standby to "0" and "off".  I don't remember there even being 
discussion about it.

The discussion around synch rep certainly showed that the "natural" 
tendency of this list is to add complexity with each incarnation of a 
feature.  It's the easiest way to accommodate conflicting use cases, but 
it's not the best way.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Greg Smith
Date:
Josh Berkus wrote:
> *shrug*.  Robert asked me to write it up for the list based on the 
> discussions around synch rep.  Now you're going to bash me for doing so?

Sorry, next time I'll make sure to bash Robert too.  I don't have any 
problems with the basic ideas you're proposing, just concerns about when 
the right time to get into that whole giant subject is and who is going 
to work on it.

-- 
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us




Re: Simplifying replication

From
Josh Berkus
Date:
Dimitri, Greg,

> I want to say a big big +1 here. The way replication and PITR setup are
> implemented now is a very good prototype; it's time to consolidate and
> get to something usable by normal people, as opposed to PostgreSQL full
> time geeks.

Well, one thing to be addressed is separating the PITR functionality 
from replication.  PITR needs a lot of features -- timelines, recovery 
stop points, etc. -- which replication doesn't need or want.  I think 
that focussing on streaming replication functionality and ignoring the 
archive logs case is probably the best way to logically separate these 
two.  Presumably anyone who needs archive logs as well will be a 
professional DBA.

> I could prepare a patch given some advice on the replication protocol
> integration. For one, is streaming a base backup something that
> walsender should care about?

Yeah, I thought there was a prototype for this somewhere.   From a user 
perspective, using a 2nd pgport connection for the initial clone is 
fine.   I don't know if we want to worry about it otherwise from a 
resource management perspective; presumably the cloning process is going 
to be a pretty big performance hit on the master.

> BTW, do we have a clear idea of how to implement pg_ping, and should it
> reports current WAL location(s) of a standby?

pg_ping?

> That needs a way to define a group of standby. There's nothing there
> that makes them know about each other.

Let me clarify.  I meant that if I try to make a *single* standby point 
to a new master, and that new master was behind the standby when it 
failed over, then the attempt to remaster should fail with an error.

I do *not* want to get into standby groups.  That way lies madness.  ;-)

> Now say we have pg_ping (or another tool) returning the current recv,
> applied and synced LSNs, it would be possible for any standby to figure
> out which other ones must be shot in case you failover here. The
> failover command could list those other standby in the group that you're
> behind of, and with a force command allow you to still failover to this
> one. Now you have to STONITH the one listed, but that's your problem
> after all.

The LSN isn't enough; as others have pointed out, we have a fairly 
serious failure case if a standby comes up as a master, accepts 
transactions, and then we try to remaster a 2nd standby which was 
actually ahead of the first standby at the time of master failure.  I 
haven't seen a solution posted to that yet; maybe I missed it?
> Sorry, next time I'll make sure to bash Robert too. I don't have any 
> problems with the basic ideas you're proposing, just concerns about 
> when the right time to get into that whole giant subject is and who 
> is going to work on it.

If not now, when?  The 2nd CommitFest is almost complete.   If we're 
going to make any substantial changes, we need to have patches for the 
3rd commitfest.  And I didn't see anyone discussing simplification until 
I brought it up.

I don't realistically think that we're going to get 100% simplification 
for 9.1.  But it would be nice to at least get some components, which 
means getting agreement on how things should work, at least roughly.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Greg Stark
Date:
On Tue, Oct 19, 2010 at 9:16 AM, Josh Berkus <josh@agliodbs.com> wrote:
> Well, one thing to be addressed is separating the PITR functionality from
> replication.  PITR needs a lot of features -- timelines, recovery stop
> points, etc. -- which replication doesn't need or want.  I think that
> focussing on streaming replication functionality and ignoring the archive
> logs case is probably the best way to logically separate these two.
>  Presumably anyone who needs archive logs as well will be a professional
> DBA.

The way things stand you *always* need archived logs. Even if you have
streaming set up it might try to use archived logs if it falls too far
behind.

Also all the features PITR needs are needed by replication as well.
Recovery stop points are absolutely critical. Otherwise if your
replica crashed it would have to start over from the original clone
time and replay all logs since then.

Timelines are not as obvious but perhaps that's our own mistake. When
you fail over to your replica shouldn't the new master get a new
timelineid? Isn't that the answer to the failure case when a slave
finds it's ahead of the master? If it has already replayed logs from a
different timelineid in the same lsn range then it can't switch
timelines to follow the new master. But if it hasn't then it can.

--
greg


Re: Simplifying replication

From
Josh Berkus
Date:
Greg,

> The way things stand you *always* need archived logs. Even if you have
> streaming set up it might try to use archived logs if it falls too far
> behind.

Actually, you don't.  If you're willing to accept possible
desynchronization and recloning of the standbys, then you can skip the
archive logs.

> Timelines are not as obvious but perhaps that's our own mistake. When
> you fail over to your replica shouldn't the new master get a new
> timelineid? Isn't that the answer to the failure case when a slave
> finds it's ahead of the master? If it has already replayed logs from a
> different timelineid in the same lsn range then it can't switch
> timelines to follow the new master. But if it hasn't then it can.

Oh?  Do we have this information (i.e. what LSNs are associated with
which timeline)?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Dimitri Fontaine
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Well, one thing to be addressed is separating the PITR functionality from
> replication.  PITR needs a lot of features -- timelines, recovery stop
> points, etc. -- which replication doesn't need or want.  I think that
> focussing on streaming replication functionality and ignoring the archive
> logs case is probably the best way to logically separate these two.
> Presumably anyone who needs archive logs as well will be a professional DBA.

So, I've been thinking some more about this.

We now have two modes of operation when starting up a PostgreSQL
cluster, either it's a primary/master or it's a standby (in recovery or
hot_standby). What I think would make sense here would be to add another
mode of operation, archiving.

A cluster that is archiving will accept replication connections and WAL 
files streamed to it, which it will put in its archive directory, 
defaulting to $PGDATA/pg_xlog_archive. It should also be able to take a 
base backup from its primary server, maybe with a new pg_ctl command. 
The base backup location defaults to $PGDATA/pg_basebackup/$label.

Then, it would also accept replication connections in the other 
direction, starting a walsender to publish its archive or its base backup,
so that you could prepare a new standby against the archive server, then
switch the primary_conninfo to the master and that's it.

Now, the archive and restore commands in this setup would be internal
commands pg_walsender and pg_walreceiver. That would mean we need to
extend those special backends to be able to send and receive a full WAL
file but it sounds simple enough, right?

Then we could add support for those to compress the WAL files before
streaming, as we're talking "internal commands" here it sounds easy
enough (we're already linking against gzip I think).

The base backup streaming support would be there for the archiving mode
but also for the normal standby "empty $PGDATA" initial start up.
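
To make the shape of it concrete, a sketch -- every setting and command 
name below is either from the proposal above or invented for 
illustration; none of it exists today:

    # on the archive node
    cluster_mode      = 'archiving'        # vs. 'primary' / 'standby'
    archive_directory = 'pg_xlog_archive'  # relative to $PGDATA

    # on the primary, internal commands instead of shell scripts
    archive_mode    = on
    archive_command = 'pg_walsender --to archivenode:5433'

    # bootstrapping a new standby from the archive node
    $ pg_ctl basebackup -h archivenode -p 5433 -D /var/lib/pgsql/standby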

Comments?
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Simplifying replication

From
Robert Treat
Date:
On Tue, Oct 19, 2010 at 11:16 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Josh Berkus wrote:
>> It is critical that we make replication easier to set up, administrate
>> and monitor than it currently is.  In my conversations with people,
>> this is more important to our users and the adoption of PostgreSQL
>> than synchronous replication is.
>
> <snip>
>
> I find this launch into a new round of bike-shedding a bit distracting.
> If you want this to be easier to use, which it's obvious to any observer
> it should be because what's delivered in 9.0 is way too complicated,
> please work on finding development resources to assign to that problem.
> Because that's the bottleneck on simplifying things, not ideas about
> what to do.  I would recommend finding or assigning a developer to work
> on integrating base backup in to the streaming protocol as the biggest
> single thing that would improve the built-in replication.  All of the
> rest of the trivia about what knobs to set and such are tiny details
> that make for only a minor improvement until that's taken care of.


Yeah, I'm sure we all think it should be easier, but figuring out what 
that means is certainly a moving target. The idea of being able to 
create a base backup automagically sounds good, but comparatively it's 
not significantly more difficult than what many other systems make you 
do, and actually if done incorrectly could be something rather limiting. 
On the whole the customers we are talking with are far more concerned 
about things like managing failover scenarios when you have multiple 
slaves, and it's the lack of capabilities around those kinds of things 
that hurts postgres adoption much more than it being hard to set up.


Robert Treat

Re: Simplifying replication

From
Mark Kirkwood
Date:
On 19/10/10 13:16, Josh Berkus wrote:
> Robert asked me to write this up, so here it is.
>
> It is critical that we make replication easier to set up, administrate 
> and monitor than it currently is.  In my conversations with people, 
> this is more important to our users and the adoption of PostgreSQL 
> than synchronous replication is.
>
> First, I'm finding myself constantly needing to tutor people on how to 
> set up replication.  The mere fact that it requires a minimum 1-hour 
> class to explain how to use it, or a 10-page tutoral, tells us it's 
> too complex.  As further evidence, Bruce and I explained binary 
> replication to several MySQL geeks at OpenSQLCamp last weekend, and 
> they were horrified at the number and complexity of the steps 
> required.  As it currently is, binary replication is not going to win 
> us a lot of new users from the web development or virtualization world.
>

+1

I've been having the same experience - how to set this up and do 
failover and failback etc occupies quite a bit of time in courses I've 
been teaching here in NZ and Australia. Making this whole replication 
business much simpler is definitely the way to go.

A good example of how simple it can be is mongodb, where it is 
essentially one command to set up a two-replica system with a voting arbiter:

$ mongo
> rs.initiate( {
    _id     : "replication_set0",
    members : [
        { _id : 0, host : "192.168.2.100" },
        { _id : 1, host : "192.168.2.101" },
        { _id : 2, host : "192.168.2.103", arbiterOnly : true }
    ]
  } )




Re: Simplifying replication

From
Bruce Momjian
Date:
Josh Berkus wrote:
> Greg,
> 
> > The way things stand you *always* need archived logs. Even if you have
> > streaming set up it might try to use archived logs if it falls too far
> > behind.
> 
> Actually, you don't.  If you're willing to accept possible
> desynchronization and recloning of the standbys, then you can skip the
> archive logs.

Agreed, but as a reality check:  when I proposed that wal_keep_segments
= -1 would keep all WAL segments (for use while the file system was
being backed up), I was told administrators should compute how much free
disk space they had.  Obviously ease of use is not our #1 priority.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Simplifying replication

From
Robert Haas
Date:
On Thu, Oct 21, 2010 at 8:22 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Josh Berkus wrote:
>> Greg,
>>
>> > The way things stand you *always* need archived logs. Even if you have
>> > streaming set up it might try to use archived logs if it falls too far
>> > behind.
>>
>> Actually, you don't.  If you're willing to accept possible
>> desynchronization and recloning of the standbys, then you can skip the
>> archive logs.
>
> Agreed, but as a reality check:  when I proposed that wal_keep_segments
> = -1 would keep all WAL segments (for use while the file system was
> being backed up), I was told administrators should compute how much free
> disk space they had.  Obviously ease of use is not our #1 priority.

Amen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Josh Berkus
Date:
> Agreed, but as a reality check:  when I proposed that wal_keep_segments
> = -1 would keep all WAL segments (for use while the file system was
> being backed up), I was told administrators should compute how much free
> disk space they had.  Obviously ease of use is not our #1 priority.

Depends.  Running out of disk space isn't exactly user-friendly either.  
And detecting how much free space is available would be a painful bit 
of platform-dependent code ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Greg Stark
Date:
On Thu, Oct 21, 2010 at 5:46 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> Agreed, but as a reality check:  when I proposed that wal_keep_segments
>> = -1 would keep all WAL segments (for use while the file system was
>> being backed up), I was told administrators should compute how much free
>> disk space they had.  Obviously ease of use is not our #1 priority.
>
> Depends.  Running out of disk space isn't exactly user-friendly either.
>  And detecting how much free space is available would be a painful bit
> of platform-dependent code ...

Nor can we assume we're the only thing using disk space.

However the user-unfriendliness isn't the fact that administrators
need to determine how much disk they're willing to dedicate to
Postgres. The user-unfriendliness is that they then have to specify
this in terms of WAL log files and also have to know that we sometimes
keep more than that and so on.

We've done a good job in the past of converting GUC variables to
meaningful units for administrators and users but it's an ongoing
effort. If we need a GUC to control the amount of disk space we use it
should be in units of MB/GB/TB. If we need a GUC for controlling how
much WAL history to keep for recovering standbys or replicas then it
should be specified in units of time.

Units like "number of wal files" or worse in the case of
checkpoint_segments "number of wal files / 2 - 1" or something like
that.... are terrible. They require arcane knowledge for the
administrator to have a clue how to set.


--
greg


Re: Simplifying replication

From
Robert Haas
Date:
On Thu, Oct 21, 2010 at 8:52 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Thu, Oct 21, 2010 at 5:46 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>
>>> Agreed, but as a reality check:  when I proposed that wal_keep_segments
>>> = -1 would keep all WAL segments (for use while the file system was
>>> being backed up), I was told administrators should compute how much free
>>> disk space they had.  Obviously ease of use is not our #1 priority.
>>
>> Depends.  Running out of disk space isn't exactly user-friendly either.
>>  And detecting how much free space is available would be a painful bit
> of platform-dependent code ...
>
> Nor can we assume we're the only thing using disk space.
>
> However the user-unfriendliness isn't the fact that administrators
> need to determine how much disk they're willing to dedicate to
> Postgres. The user-unfriendliness is that they then have to specify
> this in terms of WAL log files and also have to know that we sometimes
> keep more than that and so on.
>
> We've done a good job in the past of converting GUC variables to
> meaningful units for administrators and users but it's an ongoing
> effort. If we need a GUC to control the amount of disk space we use it
> should be in units of MB/GB/TB. If we need a GUC for controlling how
> much WAL history to keep for recovering standbys or replicas then it
> should be specified in units of time.
>
> Units like "number of wal files" or worse in the case of
> checkpoint_segments "number of wal files / 2 - 1" or something like
> that.... are terrible. They require arcane knowledge for the
> administrator to have a clue how to set.

Very true.  But the lack of a -1 setting for wal_keep_segments means
that if you would like to take a backup without archiving, you must
set wal_keep_segments to a value greater than or equal to the rate at
which you generate WAL segments multiplied by the time it takes you to
run a backup.  If that doesn't qualify as requiring arcane knowledge,
I'm mystified as to what ever could.
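
To put rough numbers on that arcane knowledge (purely illustrative):

    WAL generation rate  : one 16MB segment every 30 seconds (~1.9 GB/hour)
    base backup duration : 2 hours
    wal_keep_segments   >= 2 * 3600 / 30 = 240 segments (~3.8 GB in pg_xlog)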

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Bruce Momjian
Date:
Robert Haas wrote:
> > However the user-unfriendliness isn't the fact that administrators
> > need to determine how much disk they're willing to dedicate to
> > Postgres. The user-unfriendliness is that they then have to specify
> > this in terms of WAL log files and also have to know that we sometimes
> > keep more than that and so on.
> >
> > We've done a good job in the past of converting GUC variables to
> > meaningful units for administrators and users but it's an ongoing
> > effort. If we need a GUC to control the amount of disk space we use it
> > should be in units of MB/GB/TB. If we need a GUC for controlling how
> > much WAL history to keep for recovering standbys or replicas then it
> > should be specified in units of time.
> >
> > Units like "number of wal files" or worse in the case of
> > checkpoint_segments "number of wal files / 2 - 1" or something like
> > that.... are terrible. They require arcane knowledge for the
> > administrator to have a clue how to set.
> 
> Very true.  But the lack of a -1 setting for wal_keep_segments means
> that if you would like to take a backup without archiving, you must
> set wal_keep_segments to a value greater than or equal to the rate at
> which you generate WAL segments multiplied by the time it takes you to
> run a backup.  If that doesn't qualify as requiring arcane knowledge,
> I'm mystified as to what ever could.

LOL.  Time machine required (both forward and backward time options).

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Simplifying replication

From
Josh Berkus
Date:
>> Very true.  But the lack of a -1 setting for wal_keep_segments means
>> that if you would like to take a backup without archiving, you must
>> set wal_keep_segments to a value greater than or equal to the rate at
>> which you generate WAL segments multiplied by the time it takes you to
>> run a backup.  If that doesn't qualify as requiring arcane knowledge,
>> I'm mystified as to what ever could.

Speaking of which, what's the relationship between checkpoint_segments
and wal_keep_segments?  PG seems perfectly willing to let me set the
latter higher than the former, and it's not documented.

If checkpoint_segments were a hard limit, then we could let admins set
wal_keep_segments to -1, knowing that they'd set checkpoint_segments to
the max space they had available.

Although we might want to rename those.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Robert Haas
Date:
On Thu, Oct 21, 2010 at 9:09 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>>> Very true.  But the lack of a -1 setting for wal_keep_segments means
>>> that if you would like to take a backup without archiving, you must
>>> set wal_keep_segments to a value greater than or equal to the rate at
>>> which you generate WAL segments multiplied by the time it takes you to
>>> run a backup.  If that doesn't qualify as requiring arcane knowledge,
>>> I'm mystified as to what ever could.
>
> Speaking of which, what's the relationship between checkpoint_segments
> and wal_keep_segments?  PG seems perfectly willing to let me set the
> latter higher than the former, and it's not documented.

I think it's pretty well explained in the fine manual.

http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS

> If checkpoint_segments were a hard limit, then we could let admins set
> wal_keep_segments to -1, knowing that they'd set checkpoint_segments to
> the max space they had available.

This assumes that more checkpoint segments is always better, which
isn't true.  I might have 100 GB of disk space free, but not want to
replay WAL for 4 days if I have a crash.

I do think that the current default of checkpoint_segments=3 is
pathologically insane, but that's another can of worms.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Josh Berkus
Date:
> I think it's pretty well explained in the fine manual.
> 
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS

Nope.  No relationship to checkpoint_segments is explained there.  Try
again?

>> If checkpoint_segments were a hard limit, then we could let admins set
>> wal_keep_segments to -1, knowing that they'd set checkpoint_segments to
>> the max space they had available.
> 
> This assumes that more checkpoint segments is always better, which
> isn't true.  I might have 100 GB of disk space free, but not want to
> replay WAL for 4 days if I have a crash.

No, it assumes no such thing.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Robert Haas
Date:
On Thu, Oct 21, 2010 at 10:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> I think it's pretty well explained in the fine manual.
>>
>> http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS
>
> Nope.  No relationship to checkpoint_segments is explained there.  Try
> again?

Well, it says "This sets only the minimum number of segments retained
in pg_xlog; the system might need to retain more segments for WAL
archival or to recover from a checkpoint."  So in other words, the
relationship with checkpoint segments is that whichever one currently
requires retaining a larger number of segments applies.  That's all
the relationship there is.  I'm not sure I understand the question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Fujii Masao
Date:
On Fri, Oct 22, 2010 at 11:03 AM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> I think it's pretty well explained in the fine manual.
>>
>> http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS
>
> Nope.  No relationship to checkpoint_segments is explained there.  Try
> again?

Please see
http://archives.postgresql.org/pgsql-docs/2010-10/msg00038.php

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Simplifying replication

From
Josh Berkus
Date:
> Please see
> http://archives.postgresql.org/pgsql-docs/2010-10/msg00038.php

Ye gods and little fishes!

You really want to talk arcane formulas.  I've re-read that
three times, and am still not sure that I could tell someone 
definitively how much disk space WAL needs for a given group of 
settings.  I'll also point out that that formula is not in our docs -- 
what's an appropriate location?
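
For anyone following along, the formula being proposed there is roughly 
that pg_xlog will normally hold no more than

    (2 + checkpoint_completion_target) * checkpoint_segments + 1
or
    checkpoint_segments + wal_keep_segments + 1

16MB segment files, whichever is larger (and possibly more if archiving 
falls behind).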

I think this needs to be corrected in 9.1, *even if it means breaking 
backwards compatibility*.

What would be sensible for DBAs is to have two settings:

max_wal_size
min_wal_size

These would be expressed in MB or GB and would be simple, direct 
quantities, which our formulas would work backwards from.  max_wal_size 
would be a hard limit (i.e. Postgres would stop accepting writes if we 
hit it), and Admins would not be allowed to set min_wal_size to more 
than max_wal_size - 2.

Even better would be to replace min_wal_size with min_wal_time, which 
would set a time span for the oldest WAL segment to be kept (up to 
max_wal_size - 2).   Hmmm.  That doesn't seem that hard to implement. 
Is it?
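
In postgresql.conf terms, what I'm picturing is simply (values and exact 
spelling purely illustrative):

    max_wal_size = 4GB     # hard cap on pg_xlog; never exceeded
    min_wal_time = 2h      # keep WAL at least this long, subject to the cap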

(BTW, Robert, that e-mail is what I meant by "relationship")

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> What would be sensible for DBAs is to have two settings:

> max_wal_size
> min_wal_size

[ scratches head... ]  What's the functional effect of min_wal_size, exactly?

> Even better would be to replace min_wal_size with min_wal_time, which 
> would set a time span for the oldest WAL segment to be kept (up to 
> max_wal_size - 2).   Hmmm.  That doesn't seem that hard to implement. 
> Is it?

Um, what happens when honoring min_wal_time conflicts with honoring
max_wal_size?
        regards, tom lane


Re: Simplifying replication

From
Josh Berkus
Date:
>> max_wal_size
>> min_wal_size
>
> [ scratches head... ]  What's the functional effect of min_wal_size, exactly?

Replaces wal_keep_segments. The rename is to make the GUCs obviously 
symmetrical, and to make it clear that the *meaning* of the variable has 
changed.

>> Even better would be to replace min_wal_size with min_wal_time, which
>> would set a time span for the oldest WAL segment to be kept (up to
>> max_wal_size - 2).   Hmmm.  That doesn't seem that hard to implement.
>> Is it?
>
> Um, what happens when honoring min_wal_time conflicts with honoring
> max_wal_size?

When we get close enough to max_wal_size (we'll need a couple segments 
of leeway, I think), we start recycling WAL segments even if they are 
less than min_wal_time old.  This is under the presumption that most 
DBAs will prefer having the standby desynchronize to having the master 
lock up due to running out of disk space.  Presumably if such recycling 
happens we'd also write a WARNING to the logs.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Fujii Masao
Date:
On Sat, Oct 23, 2010 at 2:34 AM, Josh Berkus <josh@agliodbs.com> wrote:
> When we get close enough to max_wal_size (we'll need a couple segments of
> leeway, I think), we start recycling WAL segments even if they are less than
> min_wal_time old.

What happens if max_wal_size is less than checkpoint_segments?
Currently a checkpoint tries to leave the WAL files which were generated
from the prior checkpoint's start to the current checkpoint's end,
because those WAL files are required for crash recovery. But should we
delete some of them according to max_wal_size?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Simplifying replication

From
Josh Berkus
Date:
> What happens if max_wal_size is less than checkpoint_segments?
> Currently a checkpoint tries to leave the WAL files which were generated
> from the prior checkpoint's start to the current checkpoint's end,
> because those WAL files are required for crash recovery. But should we
> delete some of them according to max_wal_size?

The idea is that max_wal_size would *replace* checkpoint_segments.  The
checkpoint_segments setting is baffling to most PG DBAs.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Simon Riggs
Date:
On Thu, 2010-10-21 at 20:57 -0400, Robert Haas wrote:

> Very true.  But the lack of a -1 setting for wal_keep_segments means
> that if you would like to take a backup without archiving, you must
> set wal_keep_segments to a value greater than or equal to the rate at
> which you generate WAL segments multiplied by the time it takes you to
> run a backup.  If that doesn't qualify as requiring arcane knowledge,
> I'm mystified as to what ever could.

People are missing the point here:

You have to put the WAL files *somewhere* while you do the base backup.
PostgreSQL can't itself work out where that is, nor can it work out
ahead of time how big it will need to be, since it is up to you how you
do your base backup. Setting a parameter to -1 doesn't make the problem
go away, it just pretends and hopes it doesn't exist, but screws you
badly if you do hit the wall. 

My view is that this is irresponsible, even if I share people's wish that the
problem did not exist.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Simplifying replication

From
Robert Haas
Date:
On Tue, Oct 26, 2010 at 8:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2010-10-21 at 20:57 -0400, Robert Haas wrote:
>
>> Very true.  But the lack of a -1 setting for wal_keep_segments means
>> that if you would like to take a backup without archiving, you must
>> set wal_keep_segments to a value greater than or equal to the rate at
>> which you generate WAL segments multiplied by the time it takes you to
>> run a backup.  If that doesn't qualify as requiring arcane knowledge,
>> I'm mystified as to what ever could.
>
> People are missing the point here:
>
> You have to put the WAL files *somewhere* while you do the base backup.
> PostgreSQL can't itself work out where that is, nor can it work out
> ahead of time how big it will need to be, since it is up to you how you
> do your base backup. Setting a parameter to -1 doesn't make the problem
> go away, it just pretends and hopes it doesn't exist, but screws you
> badly if you do hit the wall.

If you set wal_keep_segments=0, archive_mode=on, and
archive_command=<something>, you might run out of disk space.

If you set wal_keep_segments=-1, you might run out of disk space.

Are you any more screwed in the second case than you are in the first
case?  Why?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Josh Berkus
Date:
> If you set wal_keep_segments=0, archive_mode=on, and
> archive_command=<something>, you might run out of disk space.
> 
> If you set wal_keep_segments=-1, you might run out of disk space.
> 
> Are you any more screwed in the second case than you are in the first
> case?

It is the same to the user either way.  In either case you have to
change some settings and restart the master.

Well, for the archive case, you could conceivably mass-delete the
archive files.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Robert Haas
Date:
On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> If you set wal_keep_segments=0, archive_mode=on, and
>> archive_command=<something>, you might run out of disk space.
>>
>> If you set wal_keep_segments=-1, you might run out of disk space.
>>
>> Are you any more screwed in the second case than you are in the first
>> case?
>
> It is the same to the user either way.  In either case you have to
> change some settings and restart the master.

Except that changing wal_keep_segments doesn't require restarting the master.

The point of allowing -1 was to allow someone to set it to that value
temporarily, to be able to do a hot backup without having to guess how
large to set it.  If you don't have enough disk space for a backup to
complete, you're kind of hosed either way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Simon Riggs
Date:
On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote:
> On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:
> >
> >> If you set wal_keep_segments=0, archive_mode=on, and
> >> archive_command=<something>, you might run out of disk space.
> >>
> >> If you set wal_keep_segments=-1, you might run out of disk space.
> >>
> >> Are you any more screwed in the second case than you are in the first
> >> case?
> >
> > It is the same to the user either way.  In either case you have to
> > change some settings and restart the master.
> 
> Except that changing wal_keep_segments doesn't require restarting the master.
> 
> The point of allowing -1 was to allow someone to set it to that value
> temporarily, to be able to do a hot backup without having to guess how
> large to set it.  If you don't have enough disk space for a backup to
> complete, you're kind of hosed either way.

You're not hosed either way. Fujii designed this carefully to avoid that
and it works. The case of archive_command failing isn't comparable
because that is a failure case, not a normal working server.

You don't need to guess the setting of wal_keep_segments. It's a safety
net that has been deliberately created to avoid the crash that would
otherwise happen. I've not heard a better proposal, yet, though I too am
hopeful there is a better one.

This is all described in my new book on PostgreSQL Administration,
available from the link below. I'm told that everything you need is also
in the docs.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Simplifying replication

From
Josh Berkus
Date:
>> It is the same to the user either way.  In either case you have to
>> change some settings and restart the master.
>
> Except that changing wal_keep_segments doesn't require restarting the master.

Our docs say that it does:
This parameter can only be set in the postgresql.conf file or on the 
server command line.

http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
"Kevin Grittner"
Date:
Josh Berkus <josh@agliodbs.com> wrote:
>> Except that changing wal_keep_segments doesn't require restarting
>> the master.
> 
> Our docs say that it does:
> This parameter can only be set in the postgresql.conf file or on
> the server command line.
That sounds as though a reload would do it; I don't see that
indicating that a restart is needed.
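
For the record, a reload here is just a SIGHUP to the postmaster, e.g.:

    $ pg_ctl -D $PGDATA reload
    -- or, from psql:
    SELECT pg_reload_conf();
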
-Kevin


Re: Simplifying replication

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Josh Berkus <josh@agliodbs.com> wrote:
>>> Except that changing wal_keep_segments doesn't require restarting
>>> the master.
>> 
>> Our docs say that it does:
>> This parameter can only be set in the postgresql.conf file or on
>> the server command line.
> That sounds as though a reload would do it; I don't see that
> indicating that a restart is needed.

That is, in fact, our standard boilerplate wording for SIGHUP
parameters.
        regards, tom lane


Re: Simplifying replication

From
Josh Berkus
Date:
> You have to put the WAL files *somewhere* while you do the base backup.
> PostgreSQL can't itself work out where that is, nor can it work out
> ahead of time how big it will need to be, since it is up to you how you
> do your base backup. Setting a parameter to -1 doesn't make the problem
> go away, it just pretends and hopes it doesn't exist, but screws you
> badly if you do hit the wall. 

Agreed.  That's why I like the idea of having a
max_wal_size/min_wal_time instead of wal_keep_segments or
checkpoint_segments.  It's relatively simple for a DBA to know how much
disk space s/he has for WAL, total, before locking up the system.

And to answer Robert's question, because now I understand what he was
getting at.  The reason we want a min_wal_time is because we don't want
to keep a larger WAL around always.  If more WAL were always better,
then we'd only need max_wal_size and we'd only recycle when we hit it.
Instead, we'd recycle whenever we passed min_wal_time.  That's why I
said that I was assuming nothing of the sort.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Simplifying replication

From
Robert Haas
Date:
On Wed, Oct 27, 2010 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote:
>> On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> >
>> >> If you set wal_keep_segments=0, archive_mode=on, and
>> >> archive_command=<something>, you might run out of disk space.
>> >>
>> >> If you set wal_keep_segments=-1, you might run out of disk space.
>> >>
>> >> Are you any more screwed in the second case than you are in the first
>> >> case?
>> >
>> > It is the same to the user either way.  In either case you have to
>> > change some settings and restart the master.
>>
>> Except that changing wal_keep_segments doesn't require restarting the master.
>>
>> The point of allowing -1 was to allow someone to set it to that value
>> temporarily, to be able to do a hot backup without having to guess how
>> large to set it.  If you don't have enough disk space for a backup to
>> complete, you're kind of hosed either way.
>
> You're not hosed either way. Fujii designed this carefully to avoid that
> and it works. The case of archive_command failing isn't comparable
> because that is a failure case, not a normal working server.
>
> You don't need to guess the setting of wal_keep_segments. It's a safety
> net that has been deliberately created to avoid the crash that would
> otherwise happen. I've not heard a better proposal, yet, though I too am
> hopeful there is a better one.

I think you might be confused about what the use case Bruce and I are
imagining, because this doesn't make any sense at all in that context.  
The specific use case is that you have archive_mode=off, 
wal_level=archive or wal_level=hot_standby, and you want to take a hot
backup.  If you do pg_start_backup(), copy the data directory, and do
pg_stop_backup(), you won't necessarily end up with enough xlog to
reach a consistent state.  To do that, you must copy all the WAL files
that exist after pg_stop_backup() has completed, but you may not
actually be able to do that, because they might get recycled before
you can copy them.  You can fix this by temporarily increasing
wal_keep_segments to a sufficiently large value, and then dropping it
back down to whatever you like after you've copied the files out of
pg_xlog.  If -1 isn't an option, you can always use a million, which
is almost certainly large enough to be safe.  But what you definitely
DON'T want to do is temporarily set wal_keep_segments to a value like
7, because if you should manage to go through 8 segments while running
the backup, you may find that you now have an unusable backup.  And
you may not realize this until you try to restore it.

I wonder if we should document this procedure.  I see that it is not
in the docs at present, and it might be useful to somebody.
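
A sketch of that procedure, for the archives (the 1000000 is just "big 
enough", and the paths are illustrative):

    # 1. in postgresql.conf:  wal_keep_segments = 1000000   ... then pg_ctl reload
    # 2. start the backup
    $ psql -c "SELECT pg_start_backup('nightly')"
    # 3. copy the data directory, excluding pg_xlog
    $ rsync -a --exclude pg_xlog $PGDATA/ /backup/base/
    # 4. finish the backup
    $ psql -c "SELECT pg_stop_backup()"
    # 5. grab the WAL generated while the backup ran
    $ rsync -a $PGDATA/pg_xlog/ /backup/base/pg_xlog/
    # 6. put wal_keep_segments back to its old value and reload again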

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Robert Haas
Date:
On Wed, Oct 27, 2010 at 5:01 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> You have to put the WAL files *somewhere* while you do the base backup.
>> PostgreSQL can't itself work out where that is, nor can it work out
>> ahead of time how big it will need to be, since it is up to you how you
>> do your base backup. Setting a parameter to -1 doesn't make the problem
>> go away, it just pretends and hopes it doesn't exist, but screws you
>> badly if you do hit the wall.
>
> Agreed.  That's why I like the idea of having a
> max_wal_size/min_wal_time instead of wal_keep_segments or
> checkpoint_segments.  It's relatively simple for a DBA to know how much
> disk space s/he has for WAL, total, before locking up the system.
>
> And to answer Robert's question, because now I understand what he was
> getting at.  The reason we want a min_wal_time is because we don't want
> to keep a larger WAL around always.  If more WAL were always better,
> then we'd only need max_wal_size and we'd only recycle when we hit it.
> Instead, we'd recycle whenever we passed min_wal_time.  That's why I
> said that I was assuming nothing of the sort.

I sort of agree with you that the current checkpoint_segments
parameter is a bit hard to tune, at least if your goal is to control
the amount of disk space that will be used by WAL files.  But I'm not
sure your proposal is better.  Instead of having a complicated formula
for predicting how much disk space would get used by a given value for
checkpoint_segments, we'd have a complicated formula for the amount of
WAL that would force a checkpoint based on max_wal_size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Josh Berkus
Date:
> I sort of agree with you that the current checkpoint_segments
> parameter is a bit hard to tune, at least if your goal is to control
> the amount of disk space that will be used by WAL files.  But I'm not
> sure your proposal is better.  Instead of having a complicated formula
> for predicting how much disk space would get used by a given value for
> checkpoint_segments, we'd have a complicated formula for the amount of
> WAL that would force a checkpoint based on max_wal_size.

Yes, but the complicated formula would then be *in our code* instead of 
being inflicted on the user, as it now is.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com


Re: Simplifying replication

From
Robert Haas
Date:
On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> I sort of agree with you that the current checkpoint_segments
>> parameter is a bit hard to tune, at least if your goal is to control
>> the amount of disk space that will be used by WAL files.  But I'm not
>> sure your proposal is better.  Instead of having a complicated formula
>> for predicting how much disk space would get used by a given value for
>> checkpoint_segments, we'd have a complicated formula for the amount of
>> WAL that would force a checkpoint based on max_wal_size.
>
> Yes, but the complicated formula would then be *in our code* instead of
> being inflicted on the user, as it now is.

I don't think so - I think it will just be inflicted on the user in a
different way.  We'd still have to document what the formula is,
because people will want to understand how often a checkpoint is going
to get forced.

So here's an example of how this could happen.  Someone sets
max_wal_size = 480MB.  Then, they hear about the
checkpoint_completion_target parameter, and say, ooh, goody, let me
boost that.  So they raise it from 0.5 to 0.9.  Now, all of a sudden,
they're getting more frequent checkpoints.  Performance may get worse
rather than better.  To figure out what value for max_wal_size forces
a checkpoint after the same amount of WAL that forced a checkpoint
before, they need to work backwards from max_wal_size to
checkpoint_segments, and then work forward again to figure out the new
value for the max_wal_size parameter.

Here's the math.  max_wal_size = 480MB = 30 segments.  With
checkpoint_completion_target = 0.5, that means that
checkpoint_segments is (30 - 1) / (2 + 0.5) = 11 (rounded down).  With
checkpoint_completion_target = 0.9, that means they'll need to set
max_wal_size to (2 + 0.9) * 11 + 1 = 32.9, rounded up to 33 segments, or
33 * 16MB = 528MB.
Whew!
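
A quick, purely illustrative script, using the documented bound of
(2 + checkpoint_completion_target) * checkpoint_segments + 1 WAL segments,
reproduces those numbers:

import math

SEGMENT_MB = 16

def wal_cap_segments(checkpoint_segments, completion_target):
    # Documented upper bound on WAL segments kept on disk.
    return (2 + completion_target) * checkpoint_segments + 1

def segments_for_cap(cap_segments, completion_target):
    # Inverse: largest checkpoint_segments that stays under the cap.
    return int((cap_segments - 1) / (2 + completion_target))

cap = 480 // SEGMENT_MB                         # 480MB -> 30 segments
ckpt = segments_for_cap(cap, 0.5)               # 11
new_cap = int(math.ceil(wal_cap_segments(ckpt, 0.9)))
print(ckpt, new_cap, new_cap * SEGMENT_MB)      # 11 33 528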

My theory is that most tuning of checkpoint_segments is based on a
worry about recovery time or performance, not disk consumption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Dimitri Fontaine
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> You have to put the WAL files *somewhere* while you do the base backup.
>
> Agreed.  That's why I like the idea of having a
> max_wal_size/min_wal_time instead of keep_wal_segments or
> checkpoint_segments.  It's relatively simple for a DBA to know how much
> disk space s/he has for WAL, total, before locking up the system.

What if that somewhere were as easy to set up as a PostgreSQL archive
cluster: set a GUC or two, start the server, then on the production
server have archive_mode = on and use some internal archive and restore
commands, like 'pg_archivewal -h host -p port …'?

It's only pushing the problem away, but in my mind the only reason why
we're still talking about the problem is *not* the WAL-related settings
but the current complexity of setting up a trustworthy archive server,
and the number of external tools required in the operation (shell, scp,
rsync, rm, etc…).

Or is it just me?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Simplifying replication

From
Hannu Krosing
Date:
On Tue, 2010-10-19 at 10:31 +0200, Dimitri Fontaine wrote:

> > 4. I can start a new replica off the master by running a single command-line
> > utility on the standby and giving it connection information to the master.
> > Using this connection, it should be able to start a backup snapshot, copy
> > the entire database and any required logs, and then come up in standby mode.
> > All that should be required for this is one or two highport connections to
> > the master.  No recovery.conf file is required, or exists.
> 
> There's a prototype to stream a base backup from a libpq connection, I
> think someone here wanted to integrate that into the replication
> protocol itself. It should be doable with a simple libpq connection and
> all automated.
> 
> The pg_basebackup python client software is 100 lines of code. It's
> mainly a recursive query to get the list of files within the master,
> then two server side functions to get binary file chunks,
> compressed. Then client side, a loop to decompress and write the chunks
> at the right place. That's it.
> 
>   http://github.com/dimitri/pg_basebackup/blob/master/pg_basebackup.py
> 
> I could prepare a patch given some advice on the replication protocol
> integration. For one, is streaming a base backup something that
> walsender should care about?

To make pg_basebackup.py self-sufficient it should also open a 2nd
connection to the same master and make sure that all WAL files are
copied for the duration of the base copy.

This way you don't need to do anything extra to make sure you have
enough WAL files.  And if you fail because of filling up disks, you fail
on the slave side, where it is less of a problem.
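
A rough sketch of what that second connection could do (not the actual
pg_basebackup.py code; pg_ls_dir() and pg_stat_file() are standard
superuser functions, while copy_file_from_master() and base_copy_done()
stand in for whatever the base-backup side already provides):

import time
import psycopg2

def watch_pg_xlog(conninfo, copy_file_from_master, base_copy_done,
                  poll_seconds=1):
    # Second connection to the master: while the base copy runs, keep
    # copying any WAL segment that appears or changes, so the standby ends
    # up with all the WAL it needs regardless of wal_keep_segments.
    conn = psycopg2.connect(conninfo)
    conn.autocommit = True
    cur = conn.cursor()
    seen = {}                                   # segment name -> last mtime
    while not base_copy_done():
        # 24-character names in pg_xlog are WAL segments; skip everything
        # else (archive_status/, *.history, *.backup).
        cur.execute("""
            SELECT f, (pg_stat_file('pg_xlog/' || f)).modification
              FROM pg_ls_dir('pg_xlog') AS f
             WHERE length(f) = 24
        """)
        for name, mtime in cur.fetchall():
            if seen.get(name) != mtime:
                copy_file_from_master('pg_xlog/' + name)  # (re)copy it whole
                seen[name] = mtime
        time.sleep(poll_seconds)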

> > 
-- 
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: Simplifying replication

From
Dimitri Fontaine
Date:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> To make pg_basebackup.py self-sufficient it should also open 2nd
> connection to the same master and make sure that all WAL files are
> copied for the duration of base copy.

Excellent idea, will make that happen soon'ish.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Simplifying replication

From
Hannu Krosing
Date:
On Sat, 2010-11-06 at 18:02 +0100, Dimitri Fontaine wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > To make pg_basebackup.py self-sufficient it should also open 2nd
> > connection to the same master and make sure that all WAL files are
> > copied for the duration of base copy.
> 
> Excellent idea, will make that happen soon'ish.

Until I learned better, I thought that this is how SR is supposed to work ;)

btw, as a next step you could backport this to 8.x and have most of the
benefits of SR. It should not be very hard to keep track of the WAL position
inside a pl/pythonu function and send one or more records back in the form of

(walfile_name text, start_pos int, data bytea)

and then call this function from the client every second to keep possible data
loss down to 1 sec.

This lets you set up a warm standby with nothing more than PostgreSQL
superuser access to the master (assuming the right defaults in postgresql.conf).
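
Something along those lines could look roughly like the sketch below. The
function name pg_read_wal_chunk, the chunk size, and the connection handling
are invented for illustration, and the chunk is returned base64-encoded as
text rather than bytea to sidestep version-specific bytea conversion in
PL/Python; a real version would also cap reads at pg_current_xlog_location()
and handle switching to the next segment, which this naive loop does not.

import base64
import time
import psycopg2

# Server-side helper: return a chunk of one WAL segment starting at an
# offset.  plpythonu is untrusted, so it can read pg_xlog directly; the
# path is relative to the data directory, the backend's working directory.
CREATE_FUNC = """
CREATE OR REPLACE FUNCTION pg_read_wal_chunk(walfile text, start_pos int)
RETURNS text AS $$
    import base64
    f = open('pg_xlog/' + walfile, 'rb')
    try:
        f.seek(start_pos)
        return base64.b64encode(f.read(1024 * 1024))   # up to 1MB per call
    finally:
        f.close()
$$ LANGUAGE plpythonu;
"""

def poll_wal(conninfo, walfile, out_path, interval=1):
    # Naive client loop: append whatever new bytes exist in one segment.
    conn = psycopg2.connect(conninfo)
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute(CREATE_FUNC)
    start_pos = 0
    out = open(out_path, 'ab')
    while True:
        cur.execute("SELECT pg_read_wal_chunk(%s, %s)", (walfile, start_pos))
        chunk = base64.b64decode(cur.fetchone()[0])
        if chunk:
            out.write(chunk)
            out.flush()
            start_pos += len(chunk)
        time.sleep(interval)   # roughly 1 second of possible data loss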

-- 
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/





Re: Simplifying replication

From
Dimitri Fontaine
Date:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
>> > To make pg_basebackup.py self-sufficient it should also open 2nd
>> > connection to the same master and make sure that all WAL files are
>> > copied for the duration of base copy.

Done now, please have a look and try it if possible:
 https://github.com/dimitri/pg_basebackup

> btw, as next step you could backport this to 8.x and have most of the
> benefits of SR. It should not be very hard to keep track of wal position
> inside a pl/pythonu function and send one or more records back in form of

You now have the -x and -D options to set that up, but I didn't check
the backport part: it still depends on the pg_bb_list_files() function
to get the recursive listing of the pg_xlog directory, and it does that
using a WITH RECURSIVE query.

The way I did it is to only copy the (whole) WAL segment again if its ctime
has changed since the last loop.  Also, pg_basebackup won't start a backup if
you run it on its own; I don't think that's what you want here.

Oh, as I needed to fork() a process to take care of pg_xlog in a loop
while the base backup is still ongoing, I added a -j/--jobs option so
that you can hammer the master some more by having more than one process
doing the copying.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Simplifying replication

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:
> >
> >> I sort of agree with you that the current checkpoint_segments
> >> parameter is a bit hard to tune, at least if your goal is to control
> >> the amount of disk space that will be used by WAL files.  But I'm not
> >> sure your proposal is better.  Instead of having a complicated formula
> >> for predicting how much disk space would get used by a given value for
> >> checkpoint_segments, we'd have a complicated formula for the amount of
> >> WAL that would force a checkpoint based on max_wal_size.
> >
> > Yes, but the complicated formula would then be *in our code* instead of
> > being inflicted on the user, as it now is.
> 
> I don't think so - I think it will just be inflicted on the user in a
> different way.  We'd still have to document what the formula is,
> because people will want to understand how often a checkpoint is going
> to get forced.
> 
> So here's an example of how this could happen.  Someone sets
> max_wal_size = 480MB.  Then, they hear about the
> checkpoint_completion_target parameter, and say, ooh, goody, let me
> boost that.  So they raise it from 0.5 to 0.9.  Now, all of a sudden,
> they're getting more frequent checkpoints.  Performance may get worse

Uh, checkpoint_completion_target only controls flushing of buffers
between checkpoints, not the frequency of checkpoints.

It is hard to believe that, for tuning, the number of 16MB files is more
meaningful than the raw file size.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Simplifying replication

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Wed, Oct 27, 2010 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote:
> >> On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:
> >> >
> >> >> If you set wal_keep_segments=0, archive_mode=on, and
> >> >> archive_command=<something>, you might run out of disk space.
> >> >>
> >> >> If you set wal_keep_segments=-1, you might run out of disk space.
> >> >>
> >> >> Are you any more screwed in the second case than you are in the first
> >> >> case?
> >> >
> >> > It is the same to the user either way.  In either case you have to
> >> > change some settings and restart the master.
> >>
> >> Except that changing wal_keep_segments doesn't require restarting the master.
> >>
> >> The point of allowing -1 was to allow someone to set it to that value
> >> temporarily, to be able to do a hot backup without having to guess how
> >> large to set it.  If you don't have enough disk space for a backup to
> >> complete, you're kind of hosed either way.
> >
> > You're not hosed either way. Fujii designed this carefully to avoid that
> > and it works. The case of archive_command failing isn't comparable
> > because that is a failure case, not a normal working server.
> >
> > You don't need to guess the setting of wal_keep_segments. It's a safety
> > net that has been deliberately created to avoid the crash that would
> > otherwise happen. I've not heard a better proposal, yet, though I too am
> > hopeful there is a better one.
> 
> I think you might be confused about what the use case Bruce and I are
> imagining, because this doesn't make any sense at all in that context.
>  The specific use case is that you have archive_mode=off,
> wal_level=archive or wal_level=hot_standby, and you want to take a hot
> backup.  If you do pg_start_backup(), copy the data directory, and do
> pg_stop_backup(), you won't necessarily end up with enough xlog to

This is a clear case of the tradeoff between protecting people from
themselves (making them specify a max WAL size) and making the feature
easy to use.  We can't have both, folks.  For 9.0, we picked the former.
The same tradeoff often exists between flexibility and ease of use.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Simplifying replication

From
Robert Haas
Date:
On Thu, Nov 11, 2010 at 10:13 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> >
>> >> I sort of agree with you that the current checkpoint_segments
>> >> parameter is a bit hard to tune, at least if your goal is to control
> >> >> the amount of disk space that will be used by WAL files.  But I'm not
> >> >> sure your proposal is better.  Instead of having a complicated formula
>> >> for predicting how much disk space would get used by a given value for
>> >> checkpoint_segments, we'd have a complicated formula for the amount of
>> >> WAL that would force a checkpoint based on max_wal_size.
>> >
>> > Yes, but the complicated formula would then be *in our code* instead of
>> > being inflicted on the user, as it now is.
>>
>> I don't think so - I think it will just be inflicted on the user in a
>> different way.  We'd still have to document what the formula is,
>> because people will want to understand how often a checkpoint is going
>> to get forced.
>>
>> So here's an example of how this could happen.  Someone sets
>> max_wal_size = 480MB.  Then, they hear about the
>> checkpoint_completion_target parameter, and say, ooh, goody, let me
>> boost that.  So they raise it from 0.5 to 0.9.  Now, all of a sudden,
>> they're getting more frequent checkpoints.  Performance may get worse
>
> Uh, checkpoint_completion_target only controls flushing of buffers
> between checkpoints, not the frequency of checkpoints.

According to the formula in our fine documentation, if you increase
checkpoint_completion_target, the maximum number of WAL files also
increases.  This makes sense: the files from the last checkpoint can't
be removed until further along into the next cycle.  Therefore, if you
wanted to increase the checkpoint_completion_target while keeping the
maximum amount of WAL on disk the same, you'd need to trigger
checkpoints more frequently.
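
Concretely (same documented formula, purely illustrative numbers): holding
the on-disk cap at 30 segments, the checkpoint_segments value that fits
under it shrinks as the target rises, which is exactly a more frequent
forced checkpoint:

cap = 30                                   # e.g. a 480MB cap of 16MB segments
for target in (0.5, 0.9):
    checkpoint_segments = int((cap - 1) / (2 + target))
    print(target, checkpoint_segments)     # 0.5 -> 11, 0.9 -> 10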

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Simplifying replication

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Thu, Nov 11, 2010 at 10:13 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Robert Haas wrote:
> >> On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:
> >> >
> >> >> I sort of agree with you that the current checkpoint_segments
> >> >> parameter is a bit hard to tune, at least if your goal is to control
> >> >> the amount of disk space that will be used by WAL files.  But I'm not
> >> >> sure your proposal is better.  Instead of having a complicated formula
> >> >> for predicting how much disk space would get used by a given value for
> >> >> checkpoint_segments, we'd have a complicated formula for the amount of
> >> >> WAL that would force a checkpoint based on max_wal_size.
> >> >
> >> > Yes, but the complicated formula would then be *in our code* instead of
> >> > being inflicted on the user, as it now is.
> >>
> >> I don't think so - I think it will just be inflicted on the user in a
> >> different way.  We'd still have to document what the formula is,
> >> because people will want to understand how often a checkpoint is going
> >> to get forced.
> >>
> >> So here's an example of how this could happen.  Someone sets
> >> max_wal_size = 480MB.  Then, they hear about the
> >> checkpoint_completion_target parameter, and say, ooh, goody, let me
> >> boost that.  So they raise it from 0.5 to 0.9.  Now, all of a sudden,
> >> they're getting more frequent checkpoints.  Performance may get worse
> >
> > Uh, checkpoint_completion_target only controls flushing of buffers
> > between checkpoints, not the frequency of checkpoints.
> 
> According to the formula in our fine documentation, if you increase
> checkpoint_completion_target, the maximum number of WAL files also
> increases.  This makes sense: the files from the last checkpoint can't
> be removed until further along into the next cycle.  Therefore, if you
> wanted to increase the checkpoint_completion_target while keeping the
> maximum amount of WAL on disk the same, you'd need to trigger
> checkpoints more frequently.

Do we recycle WAL files between checkpoints or just at checkpoint time? 
I thought it was only at checkpoint time.

Also, there was talk that a larger WAL directory would slow recovery,
but I thought it was only the time since the last checkpoint that
controlled that.

[ Again, sorry for my late reading of this and other threads. ]

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +