Thread: Simplifying replication
Robert asked me to write this up, so here it is.

It is critical that we make replication easier to set up, administrate and monitor than it currently is. In my conversations with people, this is more important to our users and the adoption of PostgreSQL than synchronous replication is.

First, I'm finding myself constantly needing to tutor people on how to set up replication. The mere fact that it requires a minimum 1-hour class to explain how to use it, or a 10-page tutorial, tells us it's too complex. As further evidence, Bruce and I explained binary replication to several MySQL geeks at OpenSQLCamp last weekend, and they were horrified at the number and complexity of the steps required. As it currently is, binary replication is not going to win us a lot of new users from the web development or virtualization world.

I had to write it up a couple of times; I started with a critique of the various current commands and options, but that seemed to miss the point. So instead, let me lay out how I think replication should work in my dream-world 9.1:

1. Any PostgreSQL standalone server can become a replication master simply by enabling replication connections in pg_hba.conf. No other configuration is required, and no server restart is required.

2. Should I choose to adjust master configuration, for, say, performance reasons, most replication variables (including ones like wal_keep_segments) should be changeable without a server restart.

3. I can configure a standby by copying the same postgresql.conf as the master. I only have to change a single configuration variable (the primary_conninfo, or maybe a replication_mode setting) in order to start the server in standby mode. GUCs which apply only to masters are ignored.

4. I can start a new replica off the master by running a single command-line utility on the standby and giving it connection information to the master. Using this connection, it should be able to start a backup snapshot, copy the entire database and any required logs, and then come up in standby mode. All that should be required for this is one or two high-port connections to the master. No recovery.conf file is required, or exists.

5. I can cause the standby to fail over with a single command to the failover server. If this is a trigger file, then a default path for the trigger file is already set in postgresql.conf, so that this does not require reconfiguration and restart of the standby at crisis time. Ideally, I use a "pg_failover" command or something similar.

6. Should I decide to make the standby the new master, this should also be possible with a single command and a one-line configuration change on the other standbys. To aid this, we have an easy way to tell which standbys in a group are most "caught up". If I try to promote the wrong standby (it's behind or somehow incompatible), it should fail with an appropriate message.

7. Should I choose to use archive files as well as streaming replication, the utilities to manage them (such as pg_archivecleanup and pg_standby) are built and installed with PostgreSQL by default, and do not require complex settings with escape codes.

That's my vision of "simple replication". It is also 100% achievable. We just have to prioritize ease of use over having, and requiring the user to set, 1,000 little knobs.

Speaking of knobs .... (next message)

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On 19 October 2010 11:16, Josh Berkus <josh@agliodbs.com> wrote: > 4. I can start a new replica off the master by running a single command-line > utility on the standby and giving it connection information to the master. > Using this connection, it should be able to start a backup snapshot, copy > the entire database and any required logs, and then come up in standby mode. > All that should be required for this is one or two highport connections to > the master. No recovery.conf file is required, or exists. Having just configured replication for the first time with 9.0, I agree in general with your whole message, but in particular, I want to give a double-thumbs-up to the above. Resolving this one item would subtract a great deal of pain -- and potential for error -- from the process. Cheers, BJ
Hi, Josh Berkus <josh@agliodbs.com> writes: > It is critical that we make replication easier to set up, administrate and > monitor than it currently is. In my conversations with people, this is more > important to our users and the adoption of PostgreSQL than synchronous > replication is. I want to say a big big +1 here. The way replication and PITR setup are implemented now is a very good prototype; it's time to consolidate and get to something usable by normal people, as opposed to PostgreSQL full-time geeks. Well, the current setup offers lots of flexibility which we'd better not lose in the process, but the simple setup simply does not exist yet. > 1. Any postgresql standalone server can become a replication master simply > by enabling replication connections in pg_hba.conf. No other configuration > is required, and no server restart is required. That sounds as simple as changing the default wal_level to hot_standby, and the default max_wal_senders to non-zero. > 2. Should I choose to adjust master configuration, for say performance > reasons, most replication variables (including ones like wal_keep_segments) > should be changeable without a server restart. Anybody know how difficult that is without having to spend lots of time studying the source code with the question in mind? > 3. I can configure a standby by copying the same postgresql.conf on the > master. I only have to change a single configuration variable (the > primary_conninfo, or maybe a replication_mode setting) in order to start the > server in standby mode. GUCs which apply only to masters are ignored. > > 4. I can start a new replica off the master by running a single command-line > utility on the standby and giving it connection information to the master. > Using this connection, it should be able to start a backup snapshot, copy > the entire database and any required logs, and then come up in standby mode. > All that should be required for this is one or two highport connections to > the master. No recovery.conf file is required, or exists. There's a prototype to stream a base backup from a libpq connection; I think someone here wanted to integrate that into the replication protocol itself. It should be doable with a simple libpq connection and all automated. The pg_basebackup python client software is 100 lines of code. It's mainly a recursive query to get the list of files within the master, then two server-side functions to get binary file chunks, compressed. Then client side, a loop to decompress and write the chunks at the right place. That's it. http://github.com/dimitri/pg_basebackup/blob/master/pg_basebackup.py I could prepare a patch given some advice on the replication protocol integration. For one, is streaming a base backup something that walsender should care about? > 5. I can to cause the standby to fail over with a single command to the > failover server. If this is a trigger file, then it already has a default > path to the trigger file in postgresql.conf, so that this does not require > reconfiguration and restart of the standby at crisis time. Ideally, I use a > "pg_failover" command or something similar. This feature is in walmgr.py from Skytools and it's something necessary to have in -core now that we have failover standby capacity. Much agreed, and the pg_failover command is a good idea. BTW, do we have a clear idea of how to implement pg_ping, and should it report current WAL location(s) of a standby? > 6.
Should I decide to make the standby the new master, this should also be > possible with a single command and a one-line configuration on the other > standbys. To aid this, we have an easy way to tell which standby in a group > are most "caught up". If I try to promote the wrong standby (it's behind or > somehow incompatible), it should fail with an appropriate message. That needs a way to define a group of standbys. There's nothing there that makes them know about each other. That could fall out of the automated registration of them in a shared catalog on the master, with this shared catalog spread over (hard-coded) asynchronous replication (sync == disaster here). But there's no agreement on this feature yet. Then you need a way to organise them in groups in this shared catalog, and you need to ask your network admins to make it so that they can communicate with each other. Now, say we have pg_ping (or another tool) returning the current recv, applied and synced LSNs; it would be possible for any standby to figure out which other ones must be shot in case you fail over here. The failover command could list those other standbys in the group that you're behind of, and with a force command allow you to still fail over to this one. Now you have to STONITH the one listed, but that's your problem after all. Then, of course, any standby that's not in the same group as the one that you failed over to has to be checked and resynced. > 7. Should I choose to use archive files as well as streaming replication, > the utilities to manage them (such as pg_archivecleanup and pg_standby) are > built and installed with PostgreSQL by default, and do not require complex > settings with escape codes. Now that PITR has been in for a long enough time, we *need* to take it to the next step integration-wise. By that I mean that we have to support internal commands and provide reasonable default implementations of the different scripts needed (in portable C, hence "internal"). There are too many pitfalls in this part of the setup to be serious in documenting them all and expecting people to come up with bash or perl implementations that avoid them all. That used to be good enough, but Josh is right, we need to get even better! Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
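For readers who want to see the shape of the client loop pg_basebackup.py is described as using earlier in this message, here is a rough sketch in Python. The two server-side helpers (pg_bb_list_files, pg_bb_read_chunk) and their signatures are placeholders rather than the script's actual API, and psycopg2 simply stands in for whatever driver is used:

import os
import zlib

import psycopg2

CHUNK_SIZE = 8 * 1024 * 1024   # bytes of file data requested per round trip (arbitrary)

def clone_base_backup(master_dsn, target_dir):
    """Copy every file of the master's data directory into target_dir.

    pg_bb_list_files() and pg_bb_read_chunk() stand in for the two
    server-side functions described above (a recursive file listing and
    a compressed binary chunk reader); their names and signatures are
    assumptions, not the real pg_basebackup.py API.
    """
    conn = psycopg2.connect(master_dsn)
    cur = conn.cursor()

    # 1. Recursive listing of every file under $PGDATA on the master.
    cur.execute("SELECT path, size FROM pg_bb_list_files()")
    files = cur.fetchall()

    for path, size in files:
        local_path = os.path.join(target_dir, path)
        parent = os.path.dirname(local_path)
        if parent and not os.path.isdir(parent):
            os.makedirs(parent)

        # 2. Pull the file down in compressed chunks and write them in order.
        with open(local_path, "wb") as out:
            offset = 0
            while offset < size:
                cur.execute("SELECT pg_bb_read_chunk(%s, %s, %s)",
                            (path, offset, CHUNK_SIZE))
                compressed = cur.fetchone()[0]
                out.write(zlib.decompress(bytes(compressed)))
                offset += CHUNK_SIZE

    conn.close()

Since the whole transfer rides on an ordinary libpq connection, the only server-side requirement is a connection privileged enough to read the data directory, which is part of what makes folding this into walsender or the replication protocol attractive.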
Josh Berkus wrote: > It is critical that we make replication easier to set up, administrate > and monitor than it currently is. In my conversations with people, > this is more important to our users and the adoption of PostgreSQL > than synchronous replication is. You should enjoy one of the patches we're furiously working on then, which is aiming at some of the administration and monitoring pieces here. I have my own grand vision of how easy replication should be to setup too. Visions and plans are nice, but building functional pieces of them and delivering them to the community is what actually moves PostgreSQL forward. So far, multiple people have done that for sync rep, and what we're supposed to be focused on at this stage in the development cycle is finishing the work related to the open CommitFest item that includes that. I find this launch into a new round of bike-shedding a bit distracting. If you want this to be easier to use, which it's obvious to any observer it should be because what's delivered in 9.0 is way too complicated, please work on finding development resources to assign to that problem. Because that's the bottleneck on simplifying things, not ideas about what to do. I would recommend finding or assigning a developer to work on integrating base backup in to the streaming protocol as the biggest single thing that would improve the built-in replication. All of the rest of the trivia about what knobs to set and such are tiny details that make for only a minor improvement until that's taken care of. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
> You should enjoy one of the patches we're furiously working on then, > which is aiming at some of the administration and monitoring pieces > here. Great, glad to hear it! Would you be willing to go into detail? > I have my own grand vision of how easy replication should be to > setup too. So, share it. I'd look forward to hearing it, especially since your vision probably takes synch rep and quorum commit into account, which mine doesn't. If not here, then on your blog. > Visions and plans are nice, but building functional pieces of > them and delivering them to the community is what actually moves > PostgreSQL forward. *shrug*. Robert asked me to write it up for the list based on the discussions around synch rep. Now you're going to bash me for doing so? Many of the goals I described will mean removing knobs and changing defaults, or even forgoing fine-grained control entirely. If we don't have agreement that simplifying replication is a high-priority goal, then it won't happen; anyone submitting a patch will be this-or-that-use-cased to death and will give up. For that matter, I'm not sure that everyone agrees that simplification is a worthwhile goal. For example, somewhere between 9.0beta4 and final release, someone changed the defaults for max_wal_senders and hot_standby to "0" and "off". I don't remember there even being discussion about it. The discussion around synch rep certainly showed that the "natural" tendency of this list is to add complexity with each incarnation of a feature. It's the easiest way to accommodate conflicting use cases, but it's not the best way. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus wrote: > *shrug*. Robert asked me to write it up for the list based on the > discussions around synch rep. Now you're going to bash me for doing so? Sorry, next time I'll make sure to bash Robert too. I don't have any problems with the basic ideas you're proposing, just concerns about when the right time to get into that whole giant subject is and who is going to work on it. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Dimitri, Greg, > I want to say a big big +1 here. The way replication and PITR setup are > implemented now are a very good prototype, it's time to consolidate and > get to something usable by normal people, as opposed to PostgreSQL full > time geeks. Well, one thing to be addressed is separating the PITR functionality from replication. PITR needs a lot of features -- timelines, recovery stop points, etc. -- which replication doesn't need or want. I think that focussing on streaming replication functionality and ignoring the archive logs case is probably the best way to logically separate these two. Presumably anyone who needs archive logs as well will be a professional DBA. > I could prepare a patch given some advice on the replication protocol > integration. For one, is streaming a base backup something that > walsender should care about? Yeah, I thought there was a prototype for this somewhere. From a user perspective, using a 2nd pgport connection for the initial clone is fine. I don't know if we want to worry about it otherwise from a resource management perspective; presumably the cloning process is going to be a pretty big performance hit on the master. > BTW, do we have a clear idea of how to implement pg_ping, and should it > reports current WAL location(s) of a standby? pg_ping? > That needs a way to define a group of standby. There's nothing there > that makes them know about each other. Let me clarify. I meant that if I try to make a *single* standby point to a new master, and that new master was behind the standby when it failed over, then the attempt to remaster should fail with an error. I do *not* want to get into standby groups. That way lies madness. ;-) > Now say we have pg_ping (or another tool) returning the current recv, > applied and synced LSNs, it would be possible for any standby to figure > out which other ones must be shot in case you failover here. The > failover command could list those other standby in the group that you're > behind of, and with a force command allow you to still failover to this > one. Now you have to STONITH the one listed, but that's your problem > after all. The LSN isn't enough; as others have pointed out, we have a fairly serious failure case if a standby comes up as a master, accepts transactions, and then we try to remaster a 2nd standby which was actually ahead of the first standby at the time of master failure. I haven't seen a solution posted to that yet; maybe I missed it? > Sorry, next time I'll make sure to bash Robert too. I don't have any > problems with the basic ideas you're proposing, just > concerns about when > the right time to get into that whole giant subject is and who is going > to work on. If not now, when? The 2nd CommitFest is almost complete. If we're going to make any substantial changes, we need to have patches for the 3rd commitfest. And I didn't see anyone discussing simplification until I brought it up. I don't realistically think that we're going to get 100% simplification for 9.1. But it would be nice to at least get some components, which means getting agreement on how things should work, at least roughly. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Oct 19, 2010 at 9:16 AM, Josh Berkus <josh@agliodbs.com> wrote: > Well, one thing to be addressed is separating the PITR functionality from > replication. PITR needs a lot of features -- timelines, recovery stop > points, etc. -- which replication doesn't need or want. I think that > focussing on streaming replication functionality and ignoring the archive > logs case is probably the best way to logically separate these two. > Presumably anyone who needs archive logs as well will be a professional > DBA. The way things stand you *always* need archived logs. Even if you have streaming set up it might try to use archived logs if it falls too far behind. Also all the features PITR needs are needed by replication as well. Recovery stop points are absolutely critical. Otherwise if your replica crashed it would have to start over from the original clone time and replay all logs since then. Timelines are not as obvious but perhaps that's our own mistake. When you fail over to your replica shouldn't the new master get a new timelineid? Isn't that the answer to the failure case when a slave finds it's ahead of the master? If it has already replayed logs from a different timelineid in the same lsn range then it can't switch timelines to follow the new master. But if it hasn't then it can. -- greg
Greg, > The way things stand you *always* need archived logs. Even if you have > streaming set up it might try to use archived logs if it falls too far > behind. Actually, you don't. If you're willing to accept possible desynchronization and recloning of the standbys, then you can skip the archive logs. > Timelines are not as obvious but perhaps that's our own mistake. When > you fail over to your replica shouldn't the new master get a new > timelineid? Isn't that the answer to the failure case when a slave > finds it's ahead of the master? If it has already replayed logs from a > different timelineid in the same lsn range then it can't switch > timelines to follow the new master. But if it hasn't then it can. Oh? Do we have this information (i.e. what LSNs are associated with which timeline)? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > Well, one thing to be addressed is separating the PITR functionality from > replication. PITR needs a lot of features -- timelines, recovery stop > points, etc. -- which replication doesn't need or want. I think that > focussing on streaming replication functionality and ignoring the archive > logs case is probably the best way to logically separate these two. > Presumably anyone who needs archive logs as well will be a professional DBA. So, I've been thinking some more about this. We now have two modes of operation when starting up a PostgreSQL cluster: either it's a primary/master or it's a standby (in recovery or hot_standby). What I think would make sense here would be to add another mode of operation, archiving. A cluster that is archiving will accept replication connections and will accept WAL files streamed there, which it will put in its archive directory, defaulting to $PGDATA/pg_xlog_archive. It should also be able to take a base backup from its primary server, maybe with a new pg_ctl command. The base backup location defaults to $PGDATA/pg_basebackup/$label. Then, it would also accept replication connections the other way around, starting a walsender to publish its archive or its base backup, so that you could prepare a new standby against the archive server, then switch the primary_conninfo to the master and that's it. Now, the archive and restore commands in this setup would be internal commands pg_walsender and pg_walreceiver. That would mean we need to extend those special backends to be able to send and receive a full WAL file, but it sounds simple enough, right? Then we could add support for those to compress the WAL files before streaming; as we're talking "internal commands" here it sounds easy enough (we're already linking against gzip I think). The base backup streaming support would be there for the archiving mode but also for the normal standby "empty $PGDATA" initial start up. Comments? -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Tue, Oct 19, 2010 at 11:16 AM, Greg Smith <greg@2ndquadrant.com> wrote:
Josh Berkus wrote: It is critical that we make replication easier to set up, administrate and monitor than it currently is. In my conversations with people, this is more important to our users and the adoption of PostgreSQL than synchronous replication is.
<snip>
I find this launch into a new round of bike-shedding a bit distracting. If you want this to be easier to use, which it's obvious to any observer it should be because what's delivered in 9.0 is way too complicated, please work on finding development resources to assign to that problem. Because that's the bottleneck on simplifying things, not ideas about what to do. I would recommend finding or assigning a developer to work on integrating base backup in to the streaming protocol as the biggest single thing that would improve the built-in replication. All of the rest of the trivia about what knobs to set and such are tiny details that make for only a minor improvement until that's taken care of.
Yeah, I'm sure we all think it should be easier, but figuring out what that means is certainly a moving target. The idea of being able to create a base backup automagically sounds good, but comparatively it's not significantly more difficult than what many other systems make you do, and actually if done incorrectly could be something rather limiting. On the whole the customers we are talking with are far more concerned about things like managing failover scenarios when you have multiple slaves, and it's the lack of capabilities around those kinds of things that hurts postgres adoption much more than it being hard to set up.
Robert Treat
play: http://www.xzilla.net
On 19/10/10 13:16, Josh Berkus wrote: > Robert asked me to write this up, so here it is. > > It is critical that we make replication easier to set up, administrate > and monitor than it currently is. In my conversations with people, > this is more important to our users and the adoption of PostgreSQL > than synchronous replication is. > > First, I'm finding myself constantly needing to tutor people on how to > set up replication. The mere fact that it requires a minimum 1-hour > class to explain how to use it, or a 10-page tutoral, tells us it's > too complex. As further evidence, Bruce and I explained binary > replication to several MySQL geeks at OpenSQLCamp last weekend, and > they were horrified at the number and complexity of the steps > required. As it currently is, binary replication is not going to win > us a lot of new users from the web development or virtualization world. +1. I've been having the same experience - how to set this up and do failover and failback etc occupies quite a bit of time in courses I've been teaching here in NZ and Australia. Making this whole replication business much simpler is definitely the way to go. A good example of how simple it can be is mongodb, where it is essentially one command to set up a 2-replica system with a voting arbiter:

$ mongo
> rs.initiate( { _id : "replication_set0", members : [
    { _id : 0, host : "192.168.2.100" },
    { _id : 1, host : "192.168.2.101" },
    { _id : 2, host : "192.168.2.103", arbiterOnly : true } ] } )
Josh Berkus wrote: > Greg, > > > The way things stand you *always* need archived logs. Even if you have > > streaming set up it might try to use archived logs if it falls too far > > behind. > > Actually, you don't. If you're willing to accept possible > desynchronization and recloning of the standbys, then you can skip the > archive logs. Agreed, but as a reality check: when I proposed that wal_keep_segments = -1 would keep all WAL segments (for use while the file system was being backed up), I was told administrators should compute how much free disk space they had. Obviously ease of use is not our #1 priority. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, Oct 21, 2010 at 8:22 PM, Bruce Momjian <bruce@momjian.us> wrote: > Josh Berkus wrote: >> Greg, >> >> > The way things stand you *always* need archived logs. Even if you have >> > streaming set up it might try to use archived logs if it falls too far >> > behind. >> >> Actually, you don't. If you're willing to accept possible >> desynchronization and recloning of the standbys, then you can skip the >> archive logs. > > Agreed, but as a reality check: when I proposed that wal_keep_segments > = -1 would keep all WAL segments (for use while the file system was > being backed up), I was told administrators shoud compute how much free > disk space they had. Obviously easy of use is not our #1 priority. Amen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Agreed, but as a reality check: when I proposed that wal_keep_segments > = -1 would keep all WAL segments (for use while the file system was > being backed up), I was told administrators shoud compute how much free > disk space they had. Obviously easy of use is not our #1 priority. Depends. Running out of disk space isn't exactly user-friendly either. And detecting how much free space is available would be a painful bit of platform-dependent code ... -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, Oct 21, 2010 at 5:46 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> Agreed, but as a reality check: when I proposed that wal_keep_segments >> = -1 would keep all WAL segments (for use while the file system was >> being backed up), I was told administrators shoud compute how much free >> disk space they had. Obviously easy of use is not our #1 priority. > > Depends. Running out of disk space isn't exactly user-friendly either. > And detecting how much free space is available would be a painful bit > of platform-dependant code ... Nor can we assume we're the only thing using disk space. However the user-unfriendliness isn't the fact that administrators need to determine how much disk they're willing to dedicate to Postgres. The user-unfriendliness is that they then have to specify this in terms of WAL log files and also have to know that we sometimes keep more than that and so on. We've done a good job in the past of converting GUC variables to meaningful units for administrators and users but it's an ongoing effort. If we need a GUC to control the amount of disk space we use it should be in units of MB/GB/TB. If we need a GUC for controlling how much WAL history to keep for recovering standbys or replicas then it should be specified in units of time. Units like "number of wal files" or worse in the case of checkpoint_segments "number of wal files / 2 - 1" or something like that.... are terrible. They require arcane knowledge for the administrator to have a clue how to set. -- greg
On Thu, Oct 21, 2010 at 8:52 PM, Greg Stark <gsstark@mit.edu> wrote: > On Thu, Oct 21, 2010 at 5:46 PM, Josh Berkus <josh@agliodbs.com> wrote: >> >>> Agreed, but as a reality check: when I proposed that wal_keep_segments >>> = -1 would keep all WAL segments (for use while the file system was >>> being backed up), I was told administrators shoud compute how much free >>> disk space they had. Obviously easy of use is not our #1 priority. >> >> Depends. Running out of disk space isn't exactly user-friendly either. >> And detecting how much free space is available would be a painful bit >> of platform-dependant code ... > > Nor can we assume we're the only thing using disk space. > > However the user-unfriendliness isn't the fact that administrators > need to determine how much disk they're willing to dedicate to > Postgres. The user-unfriendliness is that they then have to specify > this in terms of WAL log files and also have to know that we sometimes > keep more than that and so on. > > We've done a good job in the past of converting GUC variables to > meaningful units for administrators and users but it's an ongoing > effort. If we need a GUC to control the amount of disk space we use it > should be in units of MB/GB/TB. If we need a GUC for controlling how > much WAL history to keep for recovering standbys or replicas then it > should be specified in units of time. > > Units like "number of wal files" or worse in the case of > checkpoint_segments "number of wal files / 2 - 1" or something like > that.... are terrible. They require arcane knowledge for the > administrator to have a clue how to set. Very true. But the lack of a -1 setting for wal_keep_segments means that if you would like to take a backup without archiving, you must set wal_keep_segments to a value greater than or equal to the rate at which you generate WAL segments multiplied by the time it takes you to run a backup. If that doesn't qualify as requiring arcane knowledge, I'm mystified as to what ever could. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
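As a concrete illustration of the estimate Robert describes, the arithmetic looks roughly like this; the WAL generation rate and backup duration are invented numbers that an administrator would have to measure or guess:

# Rough sizing of wal_keep_segments for a hot backup without archiving
# (a sketch of the estimate described above; the inputs are invented).
import math

WAL_SEGMENT_SIZE = 16 * 1024 * 1024        # 16MB per WAL segment

wal_bytes_per_minute = 48 * 1024 * 1024    # e.g. the master writes ~48MB of WAL per minute
backup_duration_minutes = 90               # e.g. the base backup takes ~1.5 hours

segments_needed = math.ceil(
    wal_bytes_per_minute * backup_duration_minutes / float(WAL_SEGMENT_SIZE))
print("set wal_keep_segments to at least %d" % segments_needed)   # -> 270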
Robert Haas wrote: > > However the user-unfriendliness isn't the fact that administrators > > need to determine how much disk they're willing to dedicate to > > Postgres. The user-unfriendliness is that they then have to specify > > this in terms of WAL log files and also have to know that we sometimes > > keep more than that and so on. > > > > We've done a good job in the past of converting GUC variables to > > meaningful units for administrators and users but it's an ongoing > > effort. If we need a GUC to control the amount of disk space we use it > > should be in units of MB/GB/TB. If we need a GUC for controlling how > > much WAL history to keep for recovering standbys or replicas then it > > should be specified in units of time. > > > > Units like "number of wal files" or worse in the case of > > checkpoint_segments "number of wal files / 2 - 1" or something like > > that.... are terrible. They require arcane knowledge for the > > administrator to have a clue how to set. > > Very true. But the lack of a -1 setting for wal_keep_segments means > that if you would like to take a backup without archiving, you must > set wal_keep_segments to a value greater than or equal to the rate at > which you generate WAL segments multiplied by the time it takes you to > run a backup. If that doesn't qualify as requiring arcane knowledge, > I'm mystified as to what ever could. LOL. Time machine required (both forward and backward time options). -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
>> Very true. But the lack of a -1 setting for wal_keep_segments means >> that if you would like to take a backup without archiving, you must >> set wal_keep_segments to a value greater than or equal to the rate at >> which you generate WAL segments multiplied by the time it takes you to >> run a backup. If that doesn't qualify as requiring arcane knowledge, >> I'm mystified as to what ever could. Speaking of which, what's the relationship between checkpoint_segments and wal_keep_segments? PG seems perfectly willing to let me set the latter higher than the former, and it's not documented. If checkpoint_segments were a hard limit, then we could let admins set wal_keep_segments to -1, knowing that they'd set checkpoint_segments to the max space they had available. Although we might want to rename those. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, Oct 21, 2010 at 9:09 PM, Josh Berkus <josh@agliodbs.com> wrote: > >>> Very true. But the lack of a -1 setting for wal_keep_segments means >>> that if you would like to take a backup without archiving, you must >>> set wal_keep_segments to a value greater than or equal to the rate at >>> which you generate WAL segments multiplied by the time it takes you to >>> run a backup. If that doesn't qualify as requiring arcane knowledge, >>> I'm mystified as to what ever could. > > Speaking of which, what's the relationship between checkpoint_segments > and wal_keep_segments? PG seems perfectly willing to let me set the > latter higher than the former, and it's not documented. I think it's pretty well explained in the fine manual. http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS > If checkpoint_segments were a hard limit, then we could let admins set > wal_keep_segments to -1, knowing that they'd set checkpoint_segments to > the max space they had available. This assumes that more checkpoint segments is always better, which isn't true. I might have 100 GB of disk space free, but not want to replay WAL for 4 days if I have a crash. I do think that the current default of checkpoint_segments=3 is pathologically insane, but that's another can of worms. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> I think it's pretty well explained in the fine manual. > > http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS Nope. No relationship to checkpoint_segments is explained there. Try again? >> If checkpoint_segments were a hard limit, then we could let admins set >> wal_keep_segments to -1, knowing that they'd set checkpoint_segments to >> the max space they had available. > > This assumes that more checkpoint segments is always better, which > isn't true. I might have 100 GB of disk space free, but not want to > replay WAL for 4 days if I have a crash. No, it assumes no such thing. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, Oct 21, 2010 at 10:03 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> I think it's pretty well explained in the fine manual. >> >> http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS > > Nope. No relationship to checkpoint_segments is explained there. Try > again? Well, it says "This sets only the minimum number of segments retained in pg_xlog; the system might need to retain more segments for WAL archival or to recover from a checkpoint." So in other words, the relationship with checkpoint segments is that whichever one currently requires retaining a larger number of segments applies. That's all the relationship there is. I'm not sure I understand the question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 22, 2010 at 11:03 AM, Josh Berkus <josh@agliodbs.com> wrote: > >> I think it's pretty well explained in the fine manual. >> >> http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS > > Nope. No relationship to checkpoint_segments is explained there. Try > again? Please see http://archives.postgresql.org/pgsql-docs/2010-10/msg00038.php Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
> Please see > http://archives.postgresql.org/pgsql-docs/2010-10/msg00038.php Ye gods and little fishes! You really want to talk arcane formulas. I've re-read that three times, and am still not sure that I could tell someone definitively how much disk space WAL needs for a given group of settings. I'll also point out that that formula is not in our docs -- what's an appropriate location? I think this needs to be corrected in 9.1, *even if it means breaking backwards compatibility*. What would be sensible for DBAs is to have two settings: max_wal_size and min_wal_size. These would be expressed in MB or GB and would be simple direct quantities, which our formulas would work backwards from. max_wal_size would be a hard limit (i.e. Postgres would stop accepting writes if we hit it), and Admins would not be allowed to set min_wal_size to more than max_wal_size - 2. Even better would be to replace min_wal_size with min_wal_time, which would set a time span for the oldest WAL segment to be kept (up to max_wal_size - 2). Hmmm. That doesn't seem that hard to implement. Is it? (BTW, Robert, that e-mail is what I meant by "relationship") -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > What would be sensible for DBAs is to have two settings: > max_wal_size > min_wal_size [ scratches head... ] What's the functional effect of min_wal_size, exactly? > Even better would be to replace min_wal_size with min_wal_time, which > would set a time span for the oldest WAL segment to be kept (up to > max_wal_size - 2). Hmmm. That doesn't seem that hard to implement. > Is it? Um, what happens when honoring min_wal_time conflicts with honoring max_wal_size? regards, tom lane
>> max_wal_size >> min_wal_size > > [ scratches head... ] What's the functional effect of min_wal_size, exactly? Replaces wal_keep_segments. The rename is to make the GUCs obviously symmetrical, and to make it clear that the *meaning* of the variable has changed. >> Even better would be to replace min_wal_size with min_wal_time, which >> would set a time span for the oldest WAL segment to be kept (up to >> max_wal_size - 2). Hmmm. That doesn't seem that hard to implement. >> Is it? > > Um, what happens when honoring min_wal_time conflicts with honoring > max_wal_size? When we get close enough to max_wal_size (we'll need a couple segments of leeway, I think), we start recycling WAL segments even if they are less than min_wal_time old. This is under the presumption that most DBAs will prefer having the standby desynchronize to having the master lock up due to running out of disk space. Presumably if such recycling happens we'd also write a WARNING to the logs. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
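A sketch of the recycling rule Josh is proposing; none of this exists in PostgreSQL, min_wal_time and max_wal_size are hypothetical settings, and the couple-of-segments leeway is his suggestion:

SEGMENT_SIZE = 16 * 1024 * 1024   # bytes per WAL segment
LEEWAY_SEGMENTS = 2               # the "couple segments of leeway" mentioned above

def may_recycle(segment_age_seconds, total_wal_bytes, min_wal_time, max_wal_size):
    """Decide whether an old WAL segment may be recycled.

    Keep segments younger than min_wal_time, unless total WAL size is
    within a couple of segments of max_wal_size, in which case recycle
    anyway (and the server would presumably log a WARNING).
    """
    near_limit = total_wal_bytes >= max_wal_size - LEEWAY_SEGMENTS * SEGMENT_SIZE
    old_enough = segment_age_seconds >= min_wal_time
    return old_enough or near_limit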
On Sat, Oct 23, 2010 at 2:34 AM, Josh Berkus <josh@agliodbs.com> wrote: > When we get close enough to max_wal_size (we'll need a couple segments of > leeway, I think), we start recycling WAL segments even if they are less that > min_wal_time old. What happens if max_wal_size is less than checkpoint_segments? Currently a checkpoint tries to leave WAL files which were generated from the prior ckpt start to current ckpt end. Because those WAL files are required for crash recovery. But we should delete some of them according to max_wal_size? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
> What happens if max_wal_size is less than checkpoint_segments? > Currently a checkpoint tries to leave WAL files which were generated > from the prior ckpt start to current ckpt end. Because those WAL files > are required for crash recovery. But we should delete some of them > according to max_wal_size? The idea is that max_wal_size would *replace* checkpoint_segments. The checkpoint_segments setting is baffling to most PG DBAs. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, 2010-10-21 at 20:57 -0400, Robert Haas wrote: > Very true. But the lack of a -1 setting for wal_keep_segments means > that if you would like to take a backup without archiving, you must > set wal_keep_segments to a value greater than or equal to the rate at > which you generate WAL segments multiplied by the time it takes you to > run a backup. If that doesn't qualify as requiring arcane knowledge, > I'm mystified as to what ever could. People are missing the point here: You have to put the WAL files *somewhere* while you do the base backup. PostgreSQL can't itself work out where that is, nor can it work out ahead of time how big it will need to be, since it is up to you how you do your base backup. Setting a parameter to -1 doesn't make the problem go away, it just pretends and hopes it doesn't exist, but screws you badly if you do hit the wall. My view is that that is irresponsible, even if I share people's wish that the problem did not exist. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Oct 26, 2010 at 8:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2010-10-21 at 20:57 -0400, Robert Haas wrote: > >> Very true. But the lack of a -1 setting for wal_keep_segments means >> that if you would like to take a backup without archiving, you must >> set wal_keep_segments to a value greater than or equal to the rate at >> which you generate WAL segments multiplied by the time it takes you to >> run a backup. If that doesn't qualify as requiring arcane knowledge, >> I'm mystified as to what ever could. > > People are missing the point here: > > You have to put the WAL files *somewhere* while you do the base backup. > PostgreSQL can't itself work out where that is, nor can it work out > ahead of time how big it will need to be, since it is up to you how you > do your base backup. Setting a parameter to -1 doesn't make the problem > go away, it just pretends and hopes it doesn't exist, but screws you > badly if you do hit the wall. If you set wal_keep_segments=0, archive_mode=on, and archive_command=<something>, you might run out of disk space. If you set wal_keep_segments=-1, you might run out of disk space. Are you any more screwed in the second case than you are in the first case? Why? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> If you set wal_keep_segments=0, archive_mode=on, and > archive_command=<something>, you might run out of disk space. > > If you set wal_keep_segments=-1, you might run out of disk space. > > Are you any more screwed in the second case than you are in the first > case? It is the same to the user either way. In either case you have to change some settings and restart the master. Well, for the archive case, you could conceivably mass-delete the archive files. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> If you set wal_keep_segments=0, archive_mode=on, and >> archive_command=<something>, you might run out of disk space. >> >> If you set wal_keep_segments=-1, you might run out of disk space. >> >> Are you any more screwed in the second case than you are in the first >> case? > > It is the same to the user either way. In either case you have to > change some settings and restart the master. Except that changing wal_keep_segments doesn't require restarting the master. The point of allowing -1 was to allow someone to set it to that value temporarily, to be able to do a hot backup without having to guess how large to set it. If you don't have enough disk space for a backup to complete, you're kind of hosed either way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote: > On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote: > > > >> If you set wal_keep_segments=0, archive_mode=on, and > >> archive_command=<something>, you might run out of disk space. > >> > >> If you set wal_keep_segments=-1, you might run out of disk space. > >> > >> Are you any more screwed in the second case than you are in the first > >> case? > > > > It is the same to the user either way. In either case you have to > > change some settings and restart the master. > > Except that changing wal_keep_segments doesn't require restarting the master. > > The point of allowing -1 was to allow someone to set it to that value > temporarily, to be able to do a hot backup without having to guess how > large to set it. If you don't have enough disk space for a backup to > complete, you're kind of hosed either way. You're not hosed either way. Fujii designed this carefully to avoid that and it works. The case of archive_command failing isn't comparable because that is a failure case, not a normal working server. You don't need to guess the setting of wal_keep_segments. It's a safety net that has been deliberately created to avoid the crash that would otherwise happen. I've not heard a better proposal, yet, though I too am hopeful there is a better one. This is all described in my new book on PostgreSQL Administration, available from the link below. I'm told that everything you need is also in the docs. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
>> It is the same to the user either way. In either case you have to >> change some settings and restart the master. > > Except that changing wal_keep_segments doesn't require restarting the master. Our docs say that it does: This parameter can only be set in the postgresql.conf file or on the server command line. http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> wrote: >> Except that changing wal_keep_segments doesn't require restarting >> the master. > > Our docs say that it does: > This parameter can only be set in the postgresql.conf file or on > the server command line. That sounds as though a reload would do it; I don't see that indicating that a restart is needed. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Josh Berkus <josh@agliodbs.com> wrote: >>> Except that changing wal_keep_segments doesn't require restarting >>> the master. >> >> Our docs say that it does: >> This parameter can only be set in the postgresql.conf file or on >> the server command line. > That sounds as though a reload would do it; I don't see that > indicating that a restart is needed. That is, in fact, our standard boilerplate wording for SIGHUP parameters. regards, tom lane
> You have to put the WAL files *somewhere* while you do the base backup. > PostgreSQL can't itself work out where that is, nor can it work out > ahead of time how big it will need to be, since it is up to you how you > do your base backup. Setting a parameter to -1 doesn't make the problem > go away, it just pretends and hopes it doesn't exist, but screws you > badly if you do hit the wall. Agreed. That's why I like the idea of having a max_wal_size/min_wal_time instead of wal_keep_segments or checkpoint_segments. It's relatively simple for a DBA to know how much disk space s/he has for WAL, total, before locking up the system. And to answer Robert's question, because now I understand what he was getting at. The reason we want a min_wal_time is because we don't want to keep a larger WAL around always. If more WAL were always better, then we'd only need max_wal_size and we'd only recycle when we hit it. Instead, we'd recycle whenever we passed min_wal_time. That's why I said that I was assuming nothing of the sort. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, Oct 27, 2010 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote: >> On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote: >> > >> >> If you set wal_keep_segments=0, archive_mode=on, and >> >> archive_command=<something>, you might run out of disk space. >> >> >> >> If you set wal_keep_segments=-1, you might run out of disk space. >> >> >> >> Are you any more screwed in the second case than you are in the first >> >> case? >> > >> > It is the same to the user either way. In either case you have to >> > change some settings and restart the master. >> >> Except that changing wal_keep_segments doesn't require restarting the master. >> >> The point of allowing -1 was to allow someone to set it to that value >> temporarily, to be able to do a hot backup without having to guess how >> large to set it. If you don't have enough disk space for a backup to >> complete, you're kind of hosed either way. > > You're not hosed either way. Fujii designed this carefully to avoid that > and it works. The case of archive_command failing isn't comparable > because that is a failure case, not a normal working server. > > You don't need to guess the setting of wal_keep_segments. It's a safety > net that has been deliberately created to avoid the crash that would > otherwise happen. I've not heard a better proposal, yet, though I too am > hopeful there is a better one. I think you might be confused about what the use case Bruce and I are imagining, because this doesn't make any sense at all in that context. The specific use case is that you have archive_mode=off, wal_level=archive or wal_level=hot_standby, and you want to take a hot backup. If you do pg_start_backup(), copy the data directory, and do pg_stop_backup(), you won't necessarily end up with enough xlog to reach a consistent state. To do that, you must copy all the WAL files that exist after pg_stop_backup() has completed, but you may not actually be able to do that, because they might get recycled before you can copy them. You can fix this by temporarily increasing wal_keep_segments to a sufficiently large value, and then dropping it back down to whatever you like after you've copied the files out of pg_xlog. If -1 isn't an option, you can always use a million, which is almost certainly large enough to be safe. But what you definitely DON'T want to do is temporarily set wal_keep_segments to a value like 7, because if you should manage to go through 8 segments while running the backup, you may find that you now have an unusable backup. And you may not realize this until you try to restore it. I wonder if we should document this procedure. I see that it is not in the docs at present, and it might be useful to somebody. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
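Sketched out, the procedure Robert describes might look like the following. pg_start_backup() and pg_stop_backup() are the real 9.0 functions; the paths, the rsync invocation, and the step of bumping wal_keep_segments beforehand are assumptions about one way to do it, not a documented recipe:

import subprocess
import psycopg2

DSN = "dbname=postgres"               # assumption: run locally on the master as a superuser
PGDATA = "/var/lib/pgsql/data"        # assumption
BACKUP_DIR = "/backups/base"          # assumption

# 0. Beforehand: raise wal_keep_segments high enough (e.g. 1000000 if -1 is
#    not available) and reload, so the WAL needed by the backup is not
#    recycled while we copy it.

conn = psycopg2.connect(DSN)
conn.autocommit = True
cur = conn.cursor()

# 1. Tell the server a base backup is starting.
cur.execute("SELECT pg_start_backup('nightly base backup')")

# 2. Copy the data directory (excluding pg_xlog) with any file-level tool.
subprocess.check_call(["rsync", "-a", "--exclude=pg_xlog",
                       PGDATA + "/", BACKUP_DIR + "/"])

# 3. Tell the server the backup is done.
cur.execute("SELECT pg_stop_backup()")

# 4. Copy the WAL segments generated during the backup; without these the
#    backup cannot reach a consistent state.
subprocess.check_call(["rsync", "-a", PGDATA + "/pg_xlog/",
                       BACKUP_DIR + "/pg_xlog/"])

# 5. Afterwards: drop wal_keep_segments back down and reload.
conn.close()

The fragile step is the final copy: if wal_keep_segments was not raised high enough for the whole run, some of those segments may already have been recycled, which is exactly the failure mode described above.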
On Wed, Oct 27, 2010 at 5:01 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> You have to put the WAL files *somewhere* while you do the base backup. >> PostgreSQL can't itself work out where that is, nor can it work out >> ahead of time how big it will need to be, since it is up to you how you >> do your base backup. Setting a parameter to -1 doesn't make the problem >> go away, it just pretends and hopes it doesn't exist, but screws you >> badly if you do hit the wall. > > Agreed. That's why I like the idea of having a > max_wal_size/min_wal_time instead of keep_wal_segments or > checkpoint_segments. It's relatively simple for a DBA to know how much > disk space s/he has for WAL, total, before locking up the system. > > And to answer Robert's question, because now I understand what he was > getting at. The reason we want a min_wal_time is because we don't want > to keep a larger WAL around always. If more WAL were always better, > then we'd only need max_wal_size and we'd only recycle when we hit it. > Instead, we'd recycle whenever we passed max_wal_time. That's why I > said that I was assuming nothing of the sort. I sort of agree with you that the current checkpoint_segments parameter is a bit hard to tune, at least if your goal is to control the amount of disk space that will be used by WAL files. But I'm not sure your proposal is better. Instead of having a complicated formula for predicting how much disk space would get used by a given value for checkpoint_segments, we'd have a complicated formula for the amount of WAL that would force a checkpoint based on max_wal_size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> I sort of agree with you that the current checkpoint_segments > parameter is a bit hard to tune, at least if your goal is to control > the amount of disk space that will be used by WAL files. But I'm not > sure your proposal is better. Instead of having a complicated formula > for predicting how much disk space would get used by a given value for > checkpoint_segments, we'd have a complicated formula for the amount of > WAL that would force a checkpoint based on max_wal_size. Yes, but the complicated formula would then be *in our code* instead of being inflicted on the user, as it now is. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote: > >> I sort of agree with you that the current checkpoint_segments >> parameter is a bit hard to tune, at least if your goal is to control >> the amount of disk space that will be used by WAL files. But I'm not >> sure your proposal is better. Instead of having a complicated formula >> for predicting how much disk space would get used by a given value for >> checkpoint_segments, we'd have a complicated formula for the amount of >> WAL that would force a checkpoint based on max_wal_size. > > Yes, but the complicated formula would then be *in our code* instead of > being inflicted on the user, as it now is. I don't think so - I think it will just be inflicted on the user in a different way. We'd still have to document what the formula is, because people will want to understand how often a checkpoint is going to get forced. So here's an example of how this could happen. Someone sets max_wal_size = 480MB. Then, they hear about the checkpoint_completion_target parameter, and say, ooh, goody, let me boost that. So they raise it from 0.5 to 0.9. Now, all of a sudden, they're getting more frequent checkpoints. Performance may get worse rather than better. To figure out what value for max_wal_size forces a checkpoint after the same amount of WAL that forced a checkpoint before, they need to work backwards from max_wal_size to checkpoint_segments, and then work forward again to figure out the new value for the max_wal_size parameter. Here's the math. max_wal_size = 480MB = 30 segments. With checkpoint_completion_target = 0.5, that means that checkpoint_segments is (30 - 1) / (2 + 0.5) = 11 (rounded down). With checkpoint_completion_target = 0.9, that means they'll need to set max_wal_size to (2 + 0.9) * 11 + 1 = 33 (rounded up) * 16MB = 528MB. Whew! My theory is that most tuning of checkpoint_segments is based on a worry about recovery time or performance, not disk consumption. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
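Robert's arithmetic, written out using the upper bound from the 9.0 documentation of roughly (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments of 16MB each:

import math

SEGMENT_MB = 16

def segments_for_budget(max_wal_size_mb, completion_target):
    """Work backwards: the checkpoint_segments that fits a given WAL budget."""
    segments = max_wal_size_mb / float(SEGMENT_MB)
    return int(math.floor((segments - 1) / (2 + completion_target)))

def budget_for_segments(checkpoint_segments, completion_target):
    """Work forwards: WAL budget in MB, rounding the segment count up as in
    Robert's example."""
    segments = int(math.ceil((2 + completion_target) * checkpoint_segments + 1))
    return segments * SEGMENT_MB

print(segments_for_budget(480, 0.5))   # -> 11: a 480MB budget behaves like checkpoint_segments = 11
print(segments_for_budget(480, 0.9))   # -> 10: same budget, slightly more frequent checkpoints
print(budget_for_segments(11, 0.9))    # -> 528: budget needed to keep the old checkpoint spacing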
Josh Berkus <josh@agliodbs.com> writes: >> You have to put the WAL files *somewhere* while you do the base backup. > > Agreed. That's why I like the idea of having a > max_wal_size/min_wal_time instead of keep_wal_segments or > checkpoint_segments. It's relatively simple for a DBA to know how much > disk space s/he has for WAL, total, before locking up the system. What if that somewhere is as easy to set up as a PostgreSQL archive cluster: set a GUC or two, start the server, then in the production server have archive_mode = on and use some internal archive and restore commands, like 'pg_archivewal -h host -p port …'? It's only pushing the problem away, but in my mind the only reason why we're still talking about the problem is *not* the WAL-related settings but the current complexity of setting up a trustworthy archive server, and the number of external tools required in the operation (shell, scp, rsync, rm, etc…). Or is it just me? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Tue, 2010-10-19 at 10:31 +0200, Dimitri Fontaine wrote: > > 4. I can start a new replica off the master by running a single command-line > > utility on the standby and giving it connection information to the master. > > Using this connection, it should be able to start a backup snapshot, copy > > the entire database and any required logs, and then come up in standby mode. > > All that should be required for this is one or two highport connections to > > the master. No recovery.conf file is required, or exists. > > There's a prototype to stream a base backup from a libpq connection, I > think someone here wanted to integrate that into the replication > protocol itself. It should be doable with a simple libpq connection and > all automated. > > The pg_basebackup python client software is 100 lines of code. It's > mainly a recursive query to get the list of files within the master, > then two server side functions to get binary file chunks, > compressed. Then client side, a loop to decompress and write the chunks > at the right place. That's it. > > http://github.com/dimitri/pg_basebackup/blob/master/pg_basebackup.py > > I could prepare a patch given some advice on the replication protocol > integration. For one, is streaming a base backup something that > walsender should care about? To make pg_basebackup.py self-sufficient it should also open a 2nd connection to the same master and make sure that all WAL files are copied for the duration of the base copy. This way you don't need to do anything extra to make sure you have enough wal files. And if you fail because of filling up disks, you fail on the slave side, where it is less of a problem. -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
Hannu Krosing <hannu@2ndQuadrant.com> writes: > To make pg_basebackup.py self-sufficient it should also open 2nd > connection to the same master and make sure that all WAL files are > copied for the duration of base copy. Excellent idea, will make that happen soon'ish. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sat, 2010-11-06 at 18:02 +0100, Dimitri Fontaine wrote: > Hannu Krosing <hannu@2ndQuadrant.com> writes: > > To make pg_basebackup.py self-sufficient it should also open a 2nd > > connection to the same master and make sure that all WAL files are > > copied for the duration of the base copy. > > Excellent idea, will make that happen soon'ish. Until I learned better, I thought this was how SR was supposed to work ;) btw, as a next step you could backport this to 8.x and have most of the benefits of SR. It should not be very hard to keep track of the WAL position inside a pl/pythonu function and send one or more records back in the form of (walfile_name text, start_pos int, data bytea) and then call this function from the client every second to keep possible data loss down to 1 sec. This lets you set up a warm standby with nothing more than PostgreSQL superuser access to the master (assuming the right defaults in postgresql.conf). -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
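The client side of that 8.x idea could look roughly like the loop below. get_wal_since() is a made-up name standing in for the pl/pythonu function described above, assumed to return the (walfile_name text, start_pos int, data bytea) rows written since the position the client last saw.

    import os
    import time
    import psycopg2

    def poll_wal(dsn, destdir, interval=1.0):
        # Superuser connection to the master; poll roughly once a second so the
        # window for data loss stays around one second.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        walfile, pos = None, 0            # last WAL position we have locally
        while True:
            cur.execute("SELECT walfile_name, start_pos, data "
                        "FROM get_wal_since(%s, %s)", (walfile, pos))
            for name, start, data in cur.fetchall():
                path = os.path.join(destdir, name)
                mode = 'r+b' if os.path.exists(path) else 'w+b'
                with open(path, mode) as f:
                    f.seek(start)
                    f.write(data)         # bytea arrives as a binary buffer
                walfile, pos = name, start + len(data)
            time.sleep(interval)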
Hannu Krosing <hannu@2ndQuadrant.com> writes: >> > To make pg_basebackup.py self-sufficient it should also open a 2nd >> > connection to the same master and make sure that all WAL files are >> > copied for the duration of the base copy. Done now, please have a look and try it if possible: https://github.com/dimitri/pg_basebackup > btw, as a next step you could backport this to 8.x and have most of the > benefits of SR. It should not be very hard to keep track of the WAL position > inside a pl/pythonu function and send one or more records back in the form of You now have the -x and -D options to set that up, but I didn't check the backport part: it still depends on the pg_bb_list_files() function to get the recursive listing of the pg_xlog directory, and it does that using a WITH RECURSIVE query. The way I did it is to only copy the (whole) WAL again if its ctime changed since the last loop. Also, pg_basebackup won't start a backup if you run it on its own; I don't think that's what you want here. Oh, as I needed to fork() a process to take care of pg_xlog in a loop while the base backup is still ongoing, I added a -j --jobs option so that you can hammer the master some more by having more than one process doing the copying. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
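The ctime check mentioned above amounts, as far as I can tell, to roughly this loop; pg_ls_dir() and pg_stat_file() are real server-side functions, while copy_file() stands in for the chunked copy of a single file and the done flag for whatever signals the end of the base copy.

    import time
    import psycopg2

    def copy_file(cur, path, destdir):
        # Placeholder for the chunked, compressed single-file copy.
        raise NotImplementedError

    def mirror_pg_xlog(dsn, destdir, done, poll=1.0):
        conn = psycopg2.connect(dsn)            # 2nd connection to the same master
        cur = conn.cursor()
        seen = {}                               # WAL file name -> last ctime we copied
        while not done.is_set():                # e.g. a threading.Event set after the base copy
            cur.execute("SELECT f, (pg_stat_file('pg_xlog/' || f)).change "
                        "FROM pg_ls_dir('pg_xlog') AS t(f) "
                        "WHERE f ~ '^[0-9A-F]{24}$'")
            for name, ctime in cur.fetchall():
                if seen.get(name) != ctime:     # new file, or changed since the last loop
                    copy_file(cur, 'pg_xlog/' + name, destdir)
                    seen[name] = ctime
            time.sleep(poll)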
Robert Haas wrote: > On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote: > > > >> I sort of agree with you that the current checkpoint_segments > >> parameter is a bit hard to tune, at least if your goal is to control > >> the amount of disk space that will be used by WAL files. But I'm not > >> sure your proposal is better. Instead of having a complicated formula > >> for predicting how much disk space would get used by a given value for > >> checkpoint_segments, we'd have a complicated formula for the amount of > >> WAL that would force a checkpoint based on max_wal_size. > > > > Yes, but the complicated formula would then be *in our code* instead of > > being inflicted on the user, as it now is. > > I don't think so - I think it will just be inflicted on the user in a > different way. We'd still have to document what the formula is, > because people will want to understand how often a checkpoint is going > to get forced. > > So here's an example of how this could happen. Someone sets > max_wal_size = 480MB. Then, they hear about the > checkpoint_completion_target parameter, and say, ooh, goody, let me > boost that. So they raise it from 0.5 to 0.9. Now, all of a sudden, > they're getting more frequent checkpoints. Performance may get worse Uh, checkpoint_completion_target only controls flushing of buffers between checkpoints, not the frequency of checkpoints. It is hard to believe that, for tuning, the number of 16MB files is more meaningful than raw file size. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas wrote: > On Wed, Oct 27, 2010 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote: > >> On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> > > >> >> If you set wal_keep_segments=0, archive_mode=on, and > >> >> archive_command=<something>, you might run out of disk space. > >> >> > >> >> If you set wal_keep_segments=-1, you might run out of disk space. > >> >> > >> >> Are you any more screwed in the second case than you are in the first > >> >> case? > >> > > >> > It is the same to the user either way. In either case you have to > >> > change some settings and restart the master. > >> > >> Except that changing wal_keep_segments doesn't require restarting the master. > >> > >> The point of allowing -1 was to allow someone to set it to that value > >> temporarily, to be able to do a hot backup without having to guess how > >> large to set it. If you don't have enough disk space for a backup to > >> complete, you're kind of hosed either way. > > > > You're not hosed either way. Fujii designed this carefully to avoid that > > and it works. The case of archive_command failing isn't comparable > > because that is a failure case, not a normal working server. > > > > You don't need to guess the setting of wal_keep_segments. It's a safety > > net that has been deliberately created to avoid the crash that would > > otherwise happen. I've not heard a better proposal, yet, though I too am > > hopeful there is a better one. > > I think you might be confused about what the use case Bruce and I are > imagining, because this doesn't make any sense at all in that context. > The specific use case is that you have archive_mode=off, > wal_level=archive or wal_level=hot_standby, and you want to take a hot > backup. If you do pg_start_backup(), copy the data directory, and do > pg_stop_backup(), you won't necessarily end up with enough xlog to This is a clear case of protecting people from themselves (make them specify a max wal size) and making the feature easy to use. We can't have both, folks. For 9.0, we picked the former. The same tradeoff often exists between flexibility and ease of use. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
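For anyone following along, the hot-backup sequence being discussed is roughly the following; the connection string, paths and label are placeholders, and the copy step is deliberately naive, the point being only the ordering of the calls. pg_start_backup() and pg_stop_backup() are the real server functions; nothing in this sequence by itself guarantees that the WAL written between the two calls is still there afterwards, which is exactly the gap the wal_keep_segments=-1 proposal was meant to cover.

    import shutil
    import psycopg2

    conn = psycopg2.connect("dbname=postgres user=postgres")   # local connection on the master
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("SELECT pg_start_backup('nightly')")    # checkpoint + backup label
    try:
        # Copy the data directory while the cluster keeps running.
        shutil.copytree('/var/lib/postgresql/9.0/main', '/backups/base')
    finally:
        cur.execute("SELECT pg_stop_backup()")          # end backup mode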
On Thu, Nov 11, 2010 at 10:13 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote: >> > >> >> I sort of agree with you that the current checkpoint_segments >> >> parameter is a bit hard to tune, at least if your goal is to control >> >> the amount of disk space that will be used by WAL files. But I'm not >> >> sure your proposal is better. Instead of having a complicated formula >> >> for predicting how much disk space would get used by a given value for >> >> checkpoint_segments, we'd have a complicated formula for the amount of >> >> WAL that would force a checkpoint based on max_wal_size. >> > >> > Yes, but the complicated formula would then be *in our code* instead of >> > being inflicted on the user, as it now is. >> >> I don't think so - I think it will just be inflicted on the user in a >> different way. We'd still have to document what the formula is, >> because people will want to understand how often a checkpoint is going >> to get forced. >> >> So here's an example of how this could happen. Someone sets >> max_wal_size = 480MB. Then, they hear about the >> checkpoint_completion_target parameter, and say, ooh, goody, let me >> boost that. So they raise it from 0.5 to 0.9. Now, all of a sudden, >> they're getting more frequent checkpoints. Performance may get worse > > Uh, checkpoint_completion_target only controls flushing of buffers > between checkpoints, not the frequency of checkpoints. According to the formula in our fine documentation, if you increase checkpoint_completion_target, the maximum number of WAL files also increases. This makes sense: the files from the last checkpoint can't be removed until further along into the next cycle. Therefore, if you wanted to increase the checkpoint_completion_target while keeping the maximum amount of WAL on disk the same, you'd need to trigger checkpoints more frequently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Thu, Nov 11, 2010 at 10:13 PM, Bruce Momjian <bruce@momjian.us> wrote: > > Robert Haas wrote: > >> On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote: > >> > > >> >> I sort of agree with you that the current checkpoint_segments > >> >> parameter is a bit hard to tune, at least if your goal is to control > >> >> the amount of disk space that will be used by WAL files. But I'm not > >> >> sure your proposal is better. Instead of having a complicated formula > >> >> for predicting how much disk space would get used by a given value for > >> >> checkpoint_segments, we'd have a complicated formula for the amount of > >> >> WAL that would force a checkpoint based on max_wal_size. > >> > > >> > Yes, but the complicated formula would then be *in our code* instead of > >> > being inflicted on the user, as it now is. > >> > >> I don't think so - I think it will just be inflicted on the user in a > >> different way. We'd still have to document what the formula is, > >> because people will want to understand how often a checkpoint is going > >> to get forced. > >> > >> So here's an example of how this could happen. Someone sets > >> max_wal_size = 480MB. Then, they hear about the > >> checkpoint_completion_target parameter, and say, ooh, goody, let me > >> boost that. So they raise it from 0.5 to 0.9. Now, all of a sudden, > >> they're getting more frequent checkpoints. Performance may get worse > > > > Uh, checkpoint_completion_target only controls flushing of buffers > > between checkpoints, not the frequency of checkpoints. > > According to the formula in our fine documentation, if you increase > checkpoint_completion_target, the maximum number of WAL files also > increases. This makes sense: the files from the last checkpoint can't > be removed until further along into the next cycle. Therefore, if you > wanted to increase the checkpoint_completion_target while keeping the > maximum amount of WAL on disk the same, you'd need to trigger > checkpoints more frequently. Do we recycle WAL files between checkpoints or just at checkpoint time? I thought it was only at checkpoint time. Also, there was talk that a larger WAL directory would slow recovery, but I thought it was only the time since the last checkpoint that controlled that. [ Again, sorry for my late reading of this and other threads. ] -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +