Thread: Streaming replication status
I've gone through the patch in detail now. Here's my list of remaining
issues:

* If there's no WAL to send, walsender doesn't notice if the client has
closed connection already. This is the issue Fujii reported already.
We'll need to add a select() call to the walsender main loop to check if
the socket has been closed.

* I removed the feature that archiver was started during recovery. The
idea of that was to enable archiving from a standby server, to relieve
the master server of that duty, but I found it annoying because it
causes trouble if the standby and master are configured to archive to
the same location; they will fight over which copies the file to the
archive first. Frankly the feature doesn't seem very useful as the patch
stands, because you still have to configure archiving in the master in
practice; you can't take an online base backup otherwise, and you have
the risk of standby falling too much behind and having to restore from
base backup whenever the standby is disconnected for any reason. Let's
revisit this later when it's truly useful.

* We still have a related issue, though: if standby is configured to
archive to the same location as master (as it always is on my laptop,
where I use the postgresql.conf of the master unmodified in the server),
right after failover the standby server will try to archive all the old
WAL files that were streamed from the master; but they exist already in
the archive, as the master archived them already. I'm not sure if this
is a pilot error, or if we should do something in the server to tell
apart WAL segments streamed from master and those generated in the
standby server after failover. Maybe we should immediately create a
.done file for every file received from master?

* I don't think we should require superuser rights for replication.
Although you see all WAL and potentially all data in the system through
that, a standby doesn't need any write access to the master, so it would
be good practice to create a dedicated account with limited privileges
for replication.

* A standby that connects to master, initiates streaming, and then sits
idle without stalls recycling of old WAL files in the master. That will
eventually lead to a full disk in master. Do we need some kind of a
emergency valve on that?

* Do we really need REPLICATION_DEBUG_ENABLED? The output doesn't seem
very useful to me.

* Need to add comments somewhere to note that ReadRecord depends on the
fact that a WAL record is always send as whole, never split across two
messages.

* Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
increments?

* Walreceiver should flush less aggresively than after each received
piece of WAL as noted by XXX comment.

* Consider renaming PREPARE_REPLICATION to IDENTIFY_SYSTEM or something.

* What's the change in bgwriter.c for?

* ReadRecord/FetchRecord is a bit of a mess. I earlier tried to refactor
it into something simpler a couple of times, but failed. So I'm going to
leave it as it is, but if someone else wants to give it a shot, that
would be good.

* Documentation. The patch used to move around some sections, but I
think that has been partially reverted so that it now just duplicates
them. It probably needs other work too, I haven't looked at the docs in
any detail.

These are all the issues I know of right now. Assuming no new issues
crop up (which often does happen), the patch is ready for committing
after those have been addressed.

Attached is my latest version as a patch, also available in my git
repository.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
> * A standby that connects to master, initiates streaming, and then sits
> idle without stalls recycling of old WAL files in the master. That will
> eventually lead to a full disk in master. Do we need some kind of a
> emergency valve on that?

WARNING: I haven't thought about how this would work together with HS yes.

I think this needs to be administrator-configurable.

I'd suggest a GUC approach:

archiving_lag_action = { ignore, shutdown, stop }

"Ignore" would be the default. Some users would rather have the master
shut down if the slave has stopped taking segments; that's "shutdown".
Otherwise, it's "stop" which simply stops archiving and starts recylcing
when we reach that number of segments.

Better name for the GUC very welcome ...

--Josh Berkus
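Written out as a hypothetical postgresql.conf fragment, the proposal above would look something like this (the GUC name and value list are a suggestion from this mail, not an existing setting):

```
# Hypothetical GUC -- not an actual PostgreSQL setting.
# What to do when a connected standby stops consuming WAL and the
# master reaches its retained-segment limit:
#   ignore   - keep retaining segments (default; disk may fill up)
#   shutdown - shut the master down
#   stop     - stop archiving and start recycling old segments
archiving_lag_action = ignore
```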
On Fri, Jan 8, 2010 at 9:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> * We still have a related issue, though: if standby is configured to
> archive to the same location as master (as it always is on my laptop,
> where I use the postgresql.conf of the master unmodified in the server),
> right after failover the standby server will try to archive all the old
> WAL files that were streamed from the master; but they exist already in
> the archive, as the master archived them already. I'm not sure if this
> is a pilot error, or if we should do something in the server to tell
> apart WAL segments streamed from master and those generated in the
> standby server after failover. Maybe we should immediately create a
> .done file for every file received from master?

How do we know the master has finished archiving them? If the master
crashes suddenly and you fail over couldn't it have failed to archive
segments that have been received by the standby via streaming
replication?

> * Need to add comments somewhere to note that ReadRecord depends on the
> fact that a WAL record is always send as whole, never split across two
> messages.

What happens in the case of the very large records Tom was describing
recently. If the entire record doesn't fit in a WAL segment is it the
whole record or the partial record with the continuation bit that
needs to fit in a message?

-- 
greg
On Sat, Jan 9, 2010 at 6:16 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I've gone through the patch in detail now. Here's my list of remaining
> issues:

Great! Thanks a lot!

> * If there's no WAL to send, walsender doesn't notice if the client has
> closed connection already. This is the issue Fujii reported already.
> We'll need to add a select() call to the walsender main loop to check if
> the socket has been closed.

We should reactivate pq_wait() and secure_poll()?

> * I removed the feature that archiver was started during recovery. The
> idea of that was to enable archiving from a standby server, to relieve
> the master server of that duty, but I found it annoying because it
> causes trouble if the standby and master are configured to archive to
> the same location; they will fight over which copies the file to the
> archive first. Frankly the feature doesn't seem very useful as the patch
> stands, because you still have to configure archiving in the master in
> practice; you can't take an online base backup otherwise, and you have
> the risk of standby falling too much behind and having to restore from
> base backup whenever the standby is disconnected for any reason. Let's
> revisit this later when it's truly useful.

Okey.

> * We still have a related issue, though: if standby is configured to
> archive to the same location as master (as it always is on my laptop,
> where I use the postgresql.conf of the master unmodified in the server),
> right after failover the standby server will try to archive all the old
> WAL files that were streamed from the master; but they exist already in
> the archive, as the master archived them already. I'm not sure if this
> is a pilot error, or if we should do something in the server to tell
> apart WAL segments streamed from master and those generated in the
> standby server after failover. Maybe we should immediately create a
> .done file for every file received from master?

There is no guarantee that such file has already been archived by master.
This is just an idea, but a new WAL record indicating the completion of
the archiving would be useful for the standby to create the .done file.
But, this idea might kill the "archiving during recovery" idea discussed
above.

Personally, I'm OK with that issue because we can avoid it by tweaking
archive_command. Could we revisit this discussion with the "archiving
during recovery" discussion later?

> * I don't think we should require superuser rights for replication.
> Although you see all WAL and potentially all data in the system through
> that, a standby doesn't need any write access to the master, so it would
> be good practice to create a dedicated account with limited privileges
> for replication.

Okey to just drop the superuser() check from walsender.c.

> * A standby that connects to master, initiates streaming, and then sits
> idle without stalls recycling of old WAL files in the master. That will
> eventually lead to a full disk in master. Do we need some kind of a
> emergency valve on that?

I think that we need the GUC parameter to specify the maximum number
of log file segments held in pg_xlog directory to send to the standby
server. The replication to the standby which falls more than that GUC
value behind is just terminated.
http://archives.postgresql.org/pgsql-hackers/2009-12/msg01901.php

> * Do we really need REPLICATION_DEBUG_ENABLED? The output doesn't seem
> very useful to me.

This was useful for me to debug the code. But, right now, Okey to drop it.

> * Need to add comments somewhere to note that ReadRecord depends on the
> fact that a WAL record is always send as whole, never split across two
> messages.

Okey.

> * Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
> increments?

Yes. It's required for some platforms (probably HP-UX) in which signals
cannot interrupt the sleep.

> * Walreceiver should flush less aggresively than after each received
> piece of WAL as noted by XXX comment.
>
> * XXX: Flushing after each received message is overly aggressive. Should
> * implement some sort of lazy flushing. Perhaps check in the main loop
> * if there's any more messages before blocking and waiting for one, and
> * flush the WAL if there isn't, just blocking.

In this approach, if messages continuously arrive from master, the fsync
would be delayed until WAL segment is switched. Likewise, recovery also
would be delayed, which seems to be a problem.

How about the straightforward approach; let the process which wants to
flush the buffer send the fsync-request to walreceiver and wait until
WAL is flushed up to the buffer's LSN?

> * Consider renaming PREPARE_REPLICATION to IDENTIFY_SYSTEM or something.

Okey.

> * What's the change in bgwriter.c for?

It's for the bgwriter to know the current timeline for recycling the WAL
files.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Jan 9, 2010 at 10:38 AM, Greg Stark <gsstark@mit.edu> wrote:
>> * Need to add comments somewhere to note that ReadRecord depends on the
>> fact that a WAL record is always send as whole, never split across two
>> messages.
>
> What happens in the case of the very large records Tom was describing
> recently. If the entire record doesn't fit in a WAL segment is it the
> whole record or the partial record with the continuation bit that
> needs to fit in a message?

It's the partial record with the continuation bit.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Sat, Jan 9, 2010 at 6:16 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> * If there's no WAL to send, walsender doesn't notice if the client has
>> closed connection already. This is the issue Fujii reported already.
>> We'll need to add a select() call to the walsender main loop to check if
>> the socket has been closed.
>
> We should reactivate pq_wait() and secure_poll()?

I don't think we need all that, a simple select() should be enough.
Though I must admit I'm not very familiar with select/poll().

>> * We still have a related issue, though: if standby is configured to
>> archive to the same location as master (as it always is on my laptop,
>> where I use the postgresql.conf of the master unmodified in the server),
>> right after failover the standby server will try to archive all the old
>> WAL files that were streamed from the master; but they exist already in
>> the archive, as the master archived them already. I'm not sure if this
>> is a pilot error, or if we should do something in the server to tell
>> apart WAL segments streamed from master and those generated in the
>> standby server after failover. Maybe we should immediately create a
>> .done file for every file received from master?
>
> There is no guarantee that such file has already been archived by master.
> This is just an idea, but new WAL record indicating the completion of the
> archiving would be useful for the standby to create .done file. But, this
> idea might kill the "archiving during recovery" idea discussed above.
>
> Personally, I'm OK with that issue because we can avoid it by tweaking
> archive_command. Could we revisit this discussion with the "archiving
> during recovery" discussion later?

Ok. The workaround is to configure standby to archive to a different
location. If you need to restore from that, you'll need to stitch
together the logs from the old master and the new one.

>> * A standby that connects to master, initiates streaming, and then sits
>> idle without stalls recycling of old WAL files in the master. That will
>> eventually lead to a full disk in master. Do we need some kind of a
>> emergency valve on that?
>
> I think that we need the GUC parameter to specify the maximum number
> of log file segments held in pg_xlog directory to send to the standby
> server. The replication to the standby which falls more than that GUC
> value behind is just terminated.
> http://archives.postgresql.org/pgsql-hackers/2009-12/msg01901.php

Oh yes, sounds good.

>> * Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
>> increments?
>
> Yes. It's required for some platforms (probably HP-UX) in which signals
> cannot interrupt the sleep.

I'm thinking that the wal_sender_delay is so small that maybe it's not
worth worrying about.

>> * Walreceiver should flush less aggresively than after each received
>> piece of WAL as noted by XXX comment.
>
>> * XXX: Flushing after each received message is overly aggressive. Should
>> * implement some sort of lazy flushing. Perhaps check in the main loop
>> * if there's any more messages before blocking and waiting for one, and
>> * flush the WAL if there isn't, just blocking.
>
> In this approach, if messages continuously arrive from master, the fsync
> would be delayed until WAL segment is switched. Likewise, recovery also
> would be delayed, which seems to be problem.

That seems OK to me. If messages are really coming in that fast,
fsyncing the whole WAL segment at a time is probably most efficient. But
if that really is too much, you could still do extra flushes within
XLogRecv() every few megabytes for example.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2010-01-08 at 23:16 +0200, Heikki Linnakangas wrote:
> * I removed the feature that archiver was started during recovery. The
> idea of that was to enable archiving from a standby server, to relieve
> the master server of that duty, but I found it annoying because it
> causes trouble if the standby and master are configured to archive to
> the same location; they will fight over which copies the file to the
> archive first. Frankly the feature doesn't seem very useful as the patch
> stands, because you still have to configure archiving in the master in
> practice; you can't take an online base backup otherwise, and you have
> the risk of standby falling too much behind and having to restore from
> base backup whenever the standby is disconnected for any reason. Let's
> revisit this later when it's truly useful.

Agreed

> * We still have a related issue, though: if standby is configured to
> archive to the same location as master (as it always is on my laptop,
> where I use the postgresql.conf of the master unmodified in the server),
> right after failover the standby server will try to archive all the old
> WAL files that were streamed from the master; but they exist already in
> the archive, as the master archived them already. I'm not sure if this
> is a pilot error, or if we should do something in the server to tell
> apart WAL segments streamed from master and those generated in the
> standby server after failover. Maybe we should immediately create a
> .done file for every file received from master?

That sounds like the right thing to do.

> * I don't think we should require superuser rights for replication.
> Although you see all WAL and potentially all data in the system through
> that, a standby doesn't need any write access to the master, so it would
> be good practice to create a dedicated account with limited privileges
> for replication.

Agreed. I think we should have a predefined user, called "replication"
that has only the correct rights.

> * A standby that connects to master, initiates streaming, and then sits
> idle without stalls recycling of old WAL files in the master. That will
> eventually lead to a full disk in master. Do we need some kind of a
> emergency valve on that?

Can you explain how this could occur? My understanding was that the
walreceiver and startup processes were capable of independent action
specifically to avoid this kind of effect.

> * Documentation. The patch used to move around some sections, but I
> think that has been partially reverted so that it now just duplicates
> them. It probably needs other work too, I haven't looked at the docs in
> any detail.

I believe the docs need urgent attention. We need more people to read
the docs and understand the implications so that people can then
comment. It is extremely non-obvious from the patch how things work at a
behaviour level.

I am very concerned that there is no thought given to monitoring
replication. This will make the feature difficult to use in practice.

-- 
Simon Riggs
www.2ndQuadrant.com
On Fri, 2010-01-08 at 14:20 -0800, Josh Berkus wrote:
> On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
> > * A standby that connects to master, initiates streaming, and then sits
> > idle without stalls recycling of old WAL files in the master. That will
> > eventually lead to a full disk in master. Do we need some kind of a
> > emergency valve on that?
>
> WARNING: I haven't thought about how this would work together with HS yes.

I've been reviewing things as we go along, so I'm not that tense
overall. Having said that I don't understand why the problem above would
occur and the sentence seems to be missing a verb between "without" and
"stalls". More explanation please.

What could happen is that the standby could slowly lag behind master. We
don't have any way of monitoring that, as yet. Setting ps display is not
enough here.

-- 
Simon Riggs
www.2ndQuadrant.com
Simon Riggs wrote:
> On Fri, 2010-01-08 at 14:20 -0800, Josh Berkus wrote:
>> On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
>>> * A standby that connects to master, initiates streaming, and then sits
>>> idle without stalls recycling of old WAL files in the master. That will
>>> eventually lead to a full disk in master. Do we need some kind of a
>>> emergency valve on that?
>> WARNING: I haven't thought about how this would work together with HS yes.
>
> I've been reviewing things as we go along, so I'm not that tense
> overall. Having said that I don't understand why the problem above would
> occur and the sentence seems to be missing a verb between "without" and
> "stalls". More explanation please.

Yeah, that sentence was broken.

> What could happen is that the standby could slowly lag behind master.

Right, that's what I'm worried about. In the worst case the walreceiver
process in the standby might stall completely for some reason, e.g. a
hardware problem or SIGSTOP by an administrator.

> We don't have any way of monitoring that, as yet. Setting ps display is
> not enough here.

Yeah, monitoring would be nice too. But what I was wondering is whether
we need some way of stopping that from filling the disk in master.
(Fujii-san's suggestion of a GUC to set the max. amount of WAL to keep
in the master for standbys feels good to me).

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Sun, 2010-01-10 at 18:40 +0200, Heikki Linnakangas wrote:
> > We don't have any way of monitoring that, as yet. Setting ps display
> > is not enough here.
>
> Yeah, monitoring would be nice too. But what I was wondering is whether
> we need some way of stopping that from filling the disk in master.
> (Fujii-san's suggestion of a GUC to set the max. amount of WAL to keep
> in the master for standbys feels good to me).

OK, now I got you. I thought that was already agreed; guess it is now.

We need monitoring anywhere we have a max_* parameter. Otherwise we
won't know how close we are to disaster until we hit the limit and
things break down. Otherwise we will have to set parameters by trial and
error, or set them so high they are meaningless.

-- 
Simon Riggs
www.2ndQuadrant.com
> We need monitoring anywhere we have a max_* parameter. Otherwise we
> won't know how close we are to disaster until we hit the limit and
> things break down. Otherwise we will have to set parameters by trial and
> error, or set them so high they are meaningless.

I agree.

Thing is, though, we have a de-facto max already ... when pgxlog runs
out of disk space. And no monitoring *in postgresql* for that, although
obviously you can use OS monitoring for it.

I'm saying, even for plain PITR, it would be an improvement in
manageability if the DBA could set a maximum number of checkpoint
segments before replication is abandoned or the master shuts down. It's
something we've been missing.

--Josh Berkus
On Sun, 2010-01-10 at 12:10 -0800, Josh Berkus wrote:
> > We need monitoring anywhere we have a max_* parameter. Otherwise we
> > won't know how close we are to disaster until we hit the limit and
> > things break down. Otherwise we will have to set parameters by trial and
> > error, or set them so high they are meaningless.
>
> I agree.
>
> Thing is, though, we have a de-facto max already ... when pgxlog runs
> out of disk space.

What I mean is this: The purpose of monitoring is to avoid bad things
happening by being able to predict that a bad thing will happen before
it actually does happen. Cars have windows to allow us to see we are
about to hit something.

> And no monitoring *in postgresql* for that, although
> obviously you can use OS monitoring for it.

PostgreSQL doesn't need to monitor that. If the user wants to avoid
out-of-space they can write a script to monitor files/space. The info is
accessible, if you wish to monitor it.

Currently there is no way of knowing what the average/current transit
time is on replication, no way of knowing what is happening if we go
idle etc.. Those things need to be included because they are not
otherwise accessible. Cars need windows, not just a finely tuned engine.

-- 
Simon Riggs
www.2ndQuadrant.com
> Currently there is no way of knowing what the average/current transit
> time is on replication, no way of knowing what is happening if we go
> idle etc.. Those things need to be included because they are not
> otherwise accessible. Cars need windows, not just a finely tuned engine.

Like I said, I agree. I'm just pointing out that the monitoring
deficiency already exists whether or not we add a max_* parameter.

--Josh Berkus
On 9/01/2010 6:20 AM, Josh Berkus wrote:
> On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
>> * A standby that connects to master, initiates streaming, and then sits
>> idle without stalls recycling of old WAL files in the master. That will
>> eventually lead to a full disk in master. Do we need some kind of a
>> emergency valve on that?
>
> WARNING: I haven't thought about how this would work together with HS yes.
>
> I think this needs to be administrator-configurable.
>
> I'd suggest a GUC approach:
>
> archiving_lag_action = { ignore, shutdown, stop }
>
> "Ignore" would be the default. Some users would rather have the master
> shut down if the slave has stopped taking segments; that's "shutdown".
> Otherwise, it's "stop" which simply stops archiving and starts recylcing
> when we reach that number of segments.

IMO "stop" would be *really* bad without some sort of administrator
alert support (scream for help) and/or the ability to refresh the
slave's base backup when it started responding again. We'd start seeing
mailing list posts along the lines of "my master failed over to the
slave, and it's missing the last 3 months of data! Help!".

Personally, I'd be uncomfortable enabling something like that without
_both_ an admin alert _and_ the ability to refresh the slave's base
backup without admin intervention.

It'd also be necessary to define what exactly "lag" means here,
preferably in a way that doesn't generally need admin tuning for most
users. Ideally there'd be separate thresholds for "scream to the admin
for help, something's wrong" and "forced to act, slave is holding up the
master".

-- 
Craig Ringer
Simon Riggs wrote:
> > * I don't think we should require superuser rights for replication.
> > Although you see all WAL and potentially all data in the system through
> > that, a standby doesn't need any write access to the master, so it would
> > be good practice to create a dedicated account with limited privileges
> > for replication.
>
> Agreed. I think we should have a predefined user, called "replication"
> that has only the correct rights.

I am concerned that knowledge of this new read-only replication user
would have to be spread all over the backend code, which is really not
something we should be doing at this stage in 8.5 development. I am also
thinking such a special user might fall out of work on mandatory access
control, so maybe we should just require super-user for 8.5 and revisit
this for 8.6.

-- 
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Sat, Jan 9, 2010 at 4:25 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I don't think we need all that, a simple select() should be enough.
> Though I must admit I'm not very familiar with select/poll().

I'm not sure whether poll(2) should be called for this purpose. But
poll(2) and select(2) seem to often come together in the existing code.
We should follow such custom?

>>> * Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
>>> increments?
>>
>> Yes. It's required for some platforms (probably HP-UX) in which signals
>> cannot interrupt the sleep.
>
> I'm thinking that the wal_sender_delay is so small that maybe it's not
> worth worrying about.

The same problem exists in walwriter.c, too. Though we can expect that
wal_writer_delay is small, its sleep has been broken down into smaller
bits. We should follow such existing code? Or just remove that feature
from walwriter?

>>> * Walreceiver should flush less aggresively than after each received
>>> piece of WAL as noted by XXX comment.
>>
>>> * XXX: Flushing after each received message is overly aggressive. Should
>>> * implement some sort of lazy flushing. Perhaps check in the main loop
>>> * if there's any more messages before blocking and waiting for one, and
>>> * flush the WAL if there isn't, just blocking.
>>
>> In this approach, if messages continuously arrive from master, the fsync
>> would be delayed until WAL segment is switched. Likewise, recovery also
>> would be delayed, which seems to be problem.
>
> That seems OK to me. If messages are really coming in that fast,
> fsyncing the whole WAL segment at a time is probably most efficient.

OK, I'll implement your idea. But that seems to be inefficient in the
synchronous replication (especially "wait WAL-replay" mode). So let's
revisit this discussion later.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Jan 10, 2010 at 8:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> What could happen is that the standby could slowly lag behind master. We
> don't have any way of monitoring that, as yet. Setting ps display is not
> enough here.

I agree that the statistical information about replication activity is
very useful. But I think that it's not an urgent issue. Shall we think
it later?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jan 11, 2010 at 5:36 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> Personally, I'd be uncomfortable enabling something like that without _both_
> an admin alert _and_ the ability to refresh the slave's base backup without
> admin intervention.

What feature do you specifically need as an alert? Just writing the
warning into the logfile is enough? Or need to notify by using SNMP trap
message? Though I'm not sure if this is a role of Postgres.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Sun, Jan 10, 2010 at 8:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What could happen is that the standby could slowly lag behind master. We
>> don't have any way of monitoring that, as yet. Setting ps display is not
>> enough here.
>
> I agree that the statistical information about replication activity is
> very useful. But I think that it's not an urgent issue. Shall we think
> it later?

I don't think anybody can deploy this feature without at least some very
basic monitoring here. I like the basic proposal you made back in
September for adding a pg_standbys_xlog_location to replace what you
have to get from ps right now:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php

That's basic, but enough that people could get by for a V1.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Fujii Masao wrote:
> On Mon, Jan 11, 2010 at 5:36 PM, Craig Ringer
> <craig@postnewspapers.com.au> wrote:
>> Personally, I'd be uncomfortable enabling something like that without _both_
>> an admin alert _and_ the ability to refresh the slave's base backup without
>> admin intervention.
>
> What feature do you specifically need as an alert? Just writing the
> warning into the logfile is enough? Or need to notify by using SNMP trap
> message? Though I'm not sure if this is a role of Postgres.

It's impossible for the database to have any idea whatsoever how people
are going to want to be alerted. Provide functions to monitor things
like replication lag, like the number of segments queued up to feed to
archive_command, and let people build their own alerting mechanism for
now. They're going to do that anyway, so why waste precious time here
building something that's unlikely to fit any but a very narrow use
case?

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > I don't think anybody can deploy this feature without at least some very > basic monitoring here. I like the basic proposal you made back in September > for adding a pg_standbys_xlog_location to replace what you have to get from > ps right now: > http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > > That's basic, but enough that people could get by for a V1. Yeah, I have no objection to add such simple capability which monitors the lag into the first release. But I guess that, in addition to that, Simon wanted the capability to collect the statistical information about replication activity (e.g., a transfer time, a write time, replay time). So I'd like to postpone it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 12, 2010 at 1:24 PM, Greg Smith <greg@2ndquadrant.com> wrote: > It's impossible for the database to have any idea whatsoever how people are > going to want to be alerted. Provide functions to monitor things like > replication lag, like the number of segments queued up to feed to > archive_command, and let people build their own alerting mechanism for now. > They're going to do that anyway, so why waste precious time here building > someone that's unlikely to fit any but a very narrow use case? Agreed. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Greg Smith wrote: > I don't think anybody can deploy this feature without at least some very > basic monitoring here. I like the basic proposal you made back in > September for adding a pg_standbys_xlog_location to replace what you > have to get from ps right now: > http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > > That's basic, but enough that people could get by for a V1. It would be more straightforward to have a function in the standby to return the current replay location. It feels more logical to poll the standby to get the status of the standby, instead of indirectly from the master. Besides, the master won't know how far the standby is if the connection to the standby is broken. Maybe we should just change the existing pg_current_xlog_location() function to return that when recovery is in progress. It currently throws an error during hot standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Fujii Masao wrote: > On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> I don't think anybody can deploy this feature without at least some very >> basic monitoring here. I like the basic proposal you made back in September >> for adding a pg_standbys_xlog_location to replace what you have to get from >> ps right now: >> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >> >> That's basic, but enough that people could get by for a V1. > > Yeah, I have no objection to add such simple capability which monitors > the lag into the first release. But I guess that, in addition to that, > Simon wanted the capability to collect the statistical information about > replication activity (e.g., a transfer time, a write time, replay time). > So I'd like to postpone it. yeah getting that would all be nice and handy but we have to remember that this is really our first cut at integrated replication. Being able to monitor lag is what is needed as a minimum, more advanced stuff can and will emerge once we get some actual feedback from the field. Stefan
On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: > Fujii Masao wrote: > > On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > >> I don't think anybody can deploy this feature without at least some very > >> basic monitoring here. I like the basic proposal you made back in September > >> for adding a pg_standbys_xlog_location to replace what you have to get from > >> ps right now: > >> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > >> > >> That's basic, but enough that people could get by for a V1. > > > > Yeah, I have no objection to add such simple capability which monitors > > the lag into the first release. But I guess that, in addition to that, > > Simon wanted the capability to collect the statistical information about > > replication activity (e.g., a transfer time, a write time, replay time). > > So I'd like to postpone it. > > yeah getting that would all be nice and handy but we have to remember > that this is really our first cut at integrated replication. Being able > to monitor lag is what is needed as a minimum, more advanced stuff can > and will emerge once we get some actual feedback from the field. Though there won't be any feedback from the field because there won't be any numbers to discuss. Just "it appears to be working". Then we will go into production and the problems will begin to be reported. We will be able to do nothing to resolve them because we won't know how many people are affected. -- Simon Riggs www.2ndQuadrant.com
Heikki Linnakangas wrote:
> Greg Smith wrote:
>> I don't think anybody can deploy this feature without at least some very
>> basic monitoring here. I like the basic proposal you made back in
>> September for adding a pg_standbys_xlog_location to replace what you
>> have to get from ps right now:
>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php
>>
>> That's basic, but enough that people could get by for a V1.
>
> It would be more straightforward to have a function in the standby to
> return the current replay location. It feels more logical to poll the
> standby to get the status of the standby, instead of indirectly from the
> master. Besides, the master won't know how far the standby is if the
> connection to the standby is broken.

This is one reason I was talking in my other message about getting simple stats on how bad the archive_command backlog is, which I'd think is an easy way to inform the DBA "the standby isn't keeping up and disk is filling" in a way that's more database-centric than just looking at disk space getting gobbled.

I think that it's important to be able to get whatever useful information you can from both the primary and the standby, because most of the interesting (read: painful) situations here are when one or the other is down. The fundamental questions here are:

-When things are running normally, how much is the standby lagging by? This is needed for a baseline of good performance, by which you can detect problems before they get too bad.
-If the standby is down altogether, how can I get more information about the state of things from the primary?
-If the primary is down, how can I tell more from the standby?

Predicting what people are going to want to do when one of these bad conditions pops up is a large step ahead of where I think this discussion should be focusing on now. You have to show how you're going to measure the badness here in the likely failure situations before you can then take action on them. If you do the former well enough, admins will figure out how to deal with the latter in a way compatible with their business processes in the first version.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
On Tue, Jan 12, 2010 at 4:22 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > It would be more straightforward to have a function in the standby to > return the current replay location. It feels more logical to poll the > standby to get the status of the standby, instead of indirectly from the > master. Besides, the master won't know how far the standby is if the > connection to the standby is broken. > > Maybe we should just change the existing pg_current_xlog_location() > function to return that when recovery is in progress. It currently > throws an error during hot standby. Sounds good. I'd like to hear from someone which location should be returned by that function (WAL receive/write/flush/replay location?). I vote for WAL flush location because it's important for me to know how far the standby can replay the WAL, i.e., how many transactions might be lost at failover. And, it's also OK to provide a dedicated function for the WAL replay location. Thoughts? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 12, 2010 at 08:22, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Greg Smith wrote: >> I don't think anybody can deploy this feature without at least some very >> basic monitoring here. I like the basic proposal you made back in >> September for adding a pg_standbys_xlog_location to replace what you >> have to get from ps right now: >> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >> >> That's basic, but enough that people could get by for a V1. > > It would be more straightforward to have a function in the standby to > return the current replay location. It feels more logical to poll the > standby to get the status of the standby, instead of indirectly from the > master. Besides, the master won't know how far the standby is if the > connection to the standby is broken. > > Maybe we should just change the existing pg_current_xlog_location() > function to return that when recovery is in progress. It currently > throws an error during hot standby. > Not sure. I don't really like monitoring functions that return different things depending on the scenario. Assume I monitor it, and then do a failover. Suddenly the values I monitor mean something else. I think I'd prefer a separate function to monitor this status on the slave. Oh, and it'd be nice if that one worked in HS mode both when in streaming and non-streaming mode :-) -- Magnus Hagander  Me: http://www.hagander.net/  Work: http://www.redpill-linpro.com/
Fujii Masao <masao.fujii@gmail.com> writes: > I'm not sure whether poll(2) should be called for this purpose. But > poll(2) and select(2) seem to often come together in the existing code. > We should follow such custom? Yes. poll() is usually more efficient, so it's preferred, but not all platforms have it. (On the other side, I think Windows might have only poll and not select.) regards, tom lane
Tom Lane wrote: > Fujii Masao <masao.fujii@gmail.com> writes: > >> I'm not sure whether poll(2) should be called for this purpose. But >> poll(2) and select(2) seem to often come together in the existing code. >> We should follow such custom? >> > > Yes. poll() is usually more efficient, so it's preferred, but not all > platforms have it. (On the other side, I think Windows might have > only poll and not select.) > > > No, other way around, I'm fairly sure. cheers andrew
On Tue, Jan 12, 2010 at 15:13, Andrew Dunstan <andrew@dunslane.net> wrote: > > > Tom Lane wrote: >> >> Fujii Masao <masao.fujii@gmail.com> writes: >> >>> >>> I'm not sure whether poll(2) should be called for this purpose. But >>> poll(2) and select(2) seem to often come together in the existing code. >>> We should follow such custom? >>> >> >> Yes. poll() is usually more efficient, so it's preferred, but not all >> platforms have it. (On the other side, I think Windows might have >> only poll and not select.) >> >> >> > > No, other way around, I'm fairly sure. Yeah, the emulation layer has select, not poll. It basically translates the select into what looks very much like a poll, so maybe we should consider implementing poll as well/instead. But for now, select() is what we have. -- Magnus Hagander  Me: http://www.hagander.net/  Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Tue, Jan 12, 2010 at 08:22, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Maybe we should just change the existing pg_current_xlog_location() >> function to return that when recovery is in progress. It currently >> throws an error during hot standby. > Not sure. I don't really like to monitor functions that return > different things depending on a scenario. Yeah. We should only use that function if we can define it to mean something on the slave that is very close to what it means on the master. Otherwise, pick another name. It seems to me that we should have at least two functions available on the slave: latest xlog location received and synced to disk by walreceiver (ie, we are guaranteed to be able to replay up to here); and latest xlog location actually replayed (ie, the state visible to queries on the slave). The latter perhaps could be pg_current_xlog_location(). regards, tom lane
Simon Riggs wrote: > On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: >> Fujii Masao wrote: >>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>>> I don't think anybody can deploy this feature without at least some very >>>> basic monitoring here. I like the basic proposal you made back in September >>>> for adding a pg_standbys_xlog_location to replace what you have to get from >>>> ps right now: >>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >>>> >>>> That's basic, but enough that people could get by for a V1. >>> Yeah, I have no objection to add such simple capability which monitors >>> the lag into the first release. But I guess that, in addition to that, >>> Simon wanted the capability to collect the statistical information about >>> replication activity (e.g., a transfer time, a write time, replay time). >>> So I'd like to postpone it. >> yeah getting that would all be nice and handy but we have to remember >> that this is really our first cut at integrated replication. Being able >> to monitor lag is what is needed as a minimum, more advanced stuff can >> and will emerge once we get some actual feedback from the field. > > Though there won't be any feedback from the field because there won't be > any numbers to discuss. Just "it appears to be working". Then we will go > into production and the problems will begin to be reported. We will be > able to do nothing to resolve them because we won't know how many people > are affected. field is also production usage in my pov, and I'm not sure how we would know how many people are affected by some imaginary issue just because there is a column that has some numbers in it. All of the large features we added in the past got finetuned and improved in the following releases, and I expect SR to be one of them that will see a lot of improvement in 8.5+n. 
Adding detailed monitoring of some random stuff (I don't think there was a clear proposal of what kind of stuff you would like to see) while we don't really know what the performance characteristics are might easily lead to us providing a ton of data and nothing relevant :( What I really think we should do for this first cut is to make it as foolproof and easy to set up as possible and add the minimum required monitoring knobs, without going overboard with doing too many stats. Stefan
On 1/12/10, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: > > I'm not sure whether poll(2) should be called for this purpose. But > > poll(2) and select(2) seem to often come together in the existing code. > > We should follow such custom? > > > Yes. poll() is usually more efficient, so it's preferred, but not all > platforms have it. (On the other side, I think Windows might have > only poll and not select.) FYI: on PL/Proxy we use poll() exclusively and on platforms that don't have it (win32) we emulate poll() with select(): http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/plproxy/plproxy/src/poll_compat.c?rev=1.3&content-type=text/x-cvsweb-markup End result is efficient and clean #ifdef-less code. Something to consider. -- marko
Stefan Kaltenbrunner wrote: > Simon Riggs wrote: > > On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: > >> Fujii Masao wrote: > >>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > >>>> I don't think anybody can deploy this feature without at least some very > >>>> basic monitoring here. I like the basic proposal you made back in September > >>>> for adding a pg_standbys_xlog_location to replace what you have to get from > >>>> ps right now: > >>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > >>>> > >>>> That's basic, but enough that people could get by for a V1. > >>> Yeah, I have no objection to add such simple capability which monitors > >>> the lag into the first release. But I guess that, in addition to that, > >>> Simon wanted the capability to collect the statistical information about > >>> replication activity (e.g., a transfer time, a write time, replay time). > >>> So I'd like to postpone it. > >> yeah getting that would all be nice and handy but we have to remember > >> that this is really our first cut at integrated replication. Being able > >> to monitor lag is what is needed as a minimum, more advanced stuff can > >> and will emerge once we get some actual feedback from the field. > > > > Though there won't be any feedback from the field because there won't be > > any numbers to discuss. Just "it appears to be working". Then we will go > > into production and the problems will begin to be reported. We will be > > able to do nothing to resolve them because we won't know how many people > > are affected. > > field is also production usage in my pov, and I'm not sure how we would > know how many people are affected by some imaginary issue just because > there is a column that has some numbers in it. > All of the large features we added in the past got finetuned and > improved in the following releases, and I expect SR to be one of them > that will see a lot of improvement in 8.5+n. 
> Adding detailed monitoring of some random stuff (I don't think there was > a clear proposal of what kind of stuff you would like to see) while we > don't really know what the performance characteristics are might easily > lead to us provding a ton of data and nothing relevant :( > What I really think we should do for this first cut is to make it as > foolproof and easy to set up as possible and add the minimum required > monitoring knobs but not going overboard with doing too many stats. I totally agree. If SR isn't going to be useful without being feature-complete, we might as well just drop it for 8.5 right now. Let's get a reasonable feature set implemented and then come back in 8.6 to improve it. For example, there is no need for a special 'replication' user (just use super-user), and monitoring should be minimal until we have field experience of exactly what monitoring we need. The final commit-fest is in 5 days --- this is not the time for design discussion and feature additions. If we wait for SR to be feature complete, with design discussions, etc, we will hopelessly delay 8.5 and people will get frustrated. I am not saying we can't talk about design, but none of this should be a requirement for 8.5. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Jan 12, 2010 at 3:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > The final commit-fest is in 5 days --- this is not the time for design Actually just over 2 days at this point... ...Robert
Marko Kreen <markokr@gmail.com> writes: > FYI: on PL/Proxy we use poll() exclusively and on platforms > that dont have it (win32) we emulate poll() with select(): Yeah, maybe. At the time we started adding poll() support there were enough platforms with only select() that it didn't make sense to impose any sort of penalty on the latter. But by now maybe it'd make sense. Especially if someone fixes the Windows code --- two levels of emulation on Windows probably won't fly ... regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > The final commit-fest is in 5 days --- this is not the time for design > discussion and feature additions. +10 --- the one reason I can see for deciding to bounce SR is that there still seem to be design discussions going on. It is WAY TOO LATE for that folks. It's time to be thinking "what's the least we have to do to make this shippable?" regards, tom lane
On Tue, 2010-01-12 at 15:11 -0500, Bruce Momjian wrote: > Stefan Kaltenbrunner wrote: > > Simon Riggs wrote: > > > On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: > > >> Fujii Masao wrote: > > >>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > > >>>> I don't think anybody can deploy this feature without at least some very > > >>>> basic monitoring here. I like the basic proposal you made back in September > > >>>> for adding a pg_standbys_xlog_location to replace what you have to get from > > >>>> ps right now: > > >>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > > >>>> > > >>>> That's basic, but enough that people could get by for a V1. > > >>> Yeah, I have no objection to add such simple capability which monitors > > >>> the lag into the first release. But I guess that, in addition to that, > > >>> Simon wanted the capability to collect the statistical information about > > >>> replication activity (e.g., a transfer time, a write time, replay time). > > >>> So I'd like to postpone it. > > >> yeah getting that would all be nice and handy but we have to remember > > >> that this is really our first cut at integrated replication. Being able > > >> to monitor lag is what is needed as a minimum, more advanced stuff can > > >> and will emerge once we get some actual feedback from the field. > > > > > > Though there won't be any feedback from the field because there won't be > > > any numbers to discuss. Just "it appears to be working". Then we will go > > > into production and the problems will begin to be reported. We will be > > > able to do nothing to resolve them because we won't know how many people > > > are affected. > > > > field is also production usage in my pov, and I'm not sure how we would > > know how many people are affected by some imaginary issue just because > > there is a column that has some numbers in it. 
> > All of the large features we added in the past got finetuned and > > improved in the following releases, and I expect SR to be one of them > > that will see a lot of improvement in 8.5+n. > > Adding detailed monitoring of some random stuff (I don't think there was > > a clear proposal of what kind of stuff you would like to see) while we > > don't really know what the performance characteristics are might easily > > lead to us provding a ton of data and nothing relevant :( > > What I really think we should do for this first cut is to make it as > > foolproof and easy to set up as possible and add the minimum required > > monitoring knobs but not going overboard with doing too many stats. > > I totally agree. If SR isn't going to be useful without being > feature-complete, we might as well just drop it for 8.5 right now. > > Let's get a reasonable feature set implemented and then come back in 8.6 > to improve it. For example, there is no need for a special > 'replication' user (just use super-user), and monitoring should be > minimal until we have field experience of exactly what monitoring we > need. > > The final commit-fest is in 5 days --- this is not the time for design > discussion and feature additions. If we wait for SR to be feature > complete, with design discussions, etc, we will hopelessly delay 8.5 and > people will get frustrated. I am not saying we can't talk about design, > but none of this should be a requirement for 8.5. We can't add monitoring until we know what the performance characteristics are. Hmmm. And how will we know what the performance characteristics are, I wonder? Anyway, I'll leave it to you now. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2010-01-12 at 15:11 -0500, Bruce Momjian wrote: >> Stefan Kaltenbrunner wrote: >>> Simon Riggs wrote: >>>> On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: >>>>> Fujii Masao wrote: >>>>>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>>>>>> I don't think anybody can deploy this feature without at least some very >>>>>>> basic monitoring here. I like the basic proposal you made back in September >>>>>>> for adding a pg_standbys_xlog_location to replace what you have to get from >>>>>>> ps right now: >>>>>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >>>>>>> >>>>>>> That's basic, but enough that people could get by for a V1. >>>>>> Yeah, I have no objection to add such simple capability which monitors >>>>>> the lag into the first release. But I guess that, in addition to that, >>>>>> Simon wanted the capability to collect the statistical information about >>>>>> replication activity (e.g., a transfer time, a write time, replay time). >>>>>> So I'd like to postpone it. >>>>> yeah getting that would all be nice and handy but we have to remember >>>>> that this is really our first cut at integrated replication. Being able >>>>> to monitor lag is what is needed as a minimum, more advanced stuff can >>>>> and will emerge once we get some actual feedback from the field. >>>> Though there won't be any feedback from the field because there won't be >>>> any numbers to discuss. Just "it appears to be working". Then we will go >>>> into production and the problems will begin to be reported. We will be >>>> able to do nothing to resolve them because we won't know how many people >>>> are affected. >>> field is also production usage in my pov, and I'm not sure how we would >>> know how many people are affected by some imaginary issue just because >>> there is a column that has some numbers in it. 
>>> All of the large features we added in the past got finetuned and >>> improved in the following releases, and I expect SR to be one of them >>> that will see a lot of improvement in 8.5+n. >>> Adding detailed monitoring of some random stuff (I don't think there was >>> a clear proposal of what kind of stuff you would like to see) while we >>> don't really know what the performance characteristics are might easily >>> lead to us provding a ton of data and nothing relevant :( >>> What I really think we should do for this first cut is to make it as >>> foolproof and easy to set up as possible and add the minimum required >>> monitoring knobs but not going overboard with doing too many stats. >> I totally agree. If SR isn't going to be useful without being >> feature-complete, we might as well just drop it for 8.5 right now. >> >> Let's get a reasonable feature set implemented and then come back in 8.6 >> to improve it. For example, there is no need for a special >> 'replication' user (just use super-user), and monitoring should be >> minimal until we have field experience of exactly what monitoring we >> need. >> >> The final commit-fest is in 5 days --- this is not the time for design >> discussion and feature additions. If we wait for SR to be feature >> complete, with design discussions, etc, we will hopelessly delay 8.5 and >> people will get frustrated. I am not saying we can't talk about design, >> but none of this should be a requirement for 8.5. > > We can't add monitoring until we know what the performance > characteristics are. Hmmm. And how will we know what the performance > characteristics are, I wonder? well I would say we do exactly how we have done in the past with other features - by debugging the stuff with low level tools until we fully understand what it really is and then we can always add more "accessible" stats. Stefan
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > The final commit-fest is in 5 days --- this is not the time for design > > discussion and feature additions. > > +10 --- the one reason I can see for deciding to bounce SR is that there > still seem to be design discussions going on. It is WAY TOO LATE for > that folks. It's time to be thinking "what's the least we have to do to > make this shippable?" I didn't know the plus meter went that high. ;-) LOL -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Stefan Kaltenbrunner wrote: > >> Let's get a reasonable feature set implemented and then come back in 8.6 > >> to improve it. For example, there is no need for a special > >> 'replication' user (just use super-user), and monitoring should be > >> minimal until we have field experience of exactly what monitoring we > >> need. > >> > >> The final commit-fest is in 5 days --- this is not the time for design > >> discussion and feature additions. If we wait for SR to be feature > >> complete, with design discussions, etc, we will hopelessly delay 8.5 and > >> people will get frustrated. I am not saying we can't talk about design, > >> but none of this should be a requirement for 8.5. > > > > We can't add monitoring until we know what the performance > > characteristics are. Hmmm. And how will we know what the performance > > characteristics are, I wonder? > > well I would say we do exactly how we have done in the past with other > features - by debugging the stuff with low level tools until we fully > understand what it really is and then we can always add more > "accessible" stats. Right, so what is the risk of shipping without any fancy monitoring? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2010-01-12 at 16:34 -0500, Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > The final commit-fest is in 5 days --- this is not the time for design > > > discussion and feature additions. > > > > +10 --- the one reason I can see for deciding to bounce SR is that there > > still seem to be design discussions going on. It is WAY TOO LATE for > > that folks. It's time to be thinking "what's the least we have to do to > > make this shippable?" > > I didn't know the plus meter went that high. ;-) LOL Well, it is Tom. He has karma points. JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
> Right, so what is the risk of shipping without any fancy monitoring? We add monitoring in 9.1. er, 8.6. --Josh Berkus
On Tue, 2010-01-12 at 15:42 -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > The final commit-fest is in 5 days --- this is not the time for design > > discussion and feature additions. > > +10 --- the one reason I can see for deciding to bounce SR is that there > still seem to be design discussions going on. It is WAY TOO LATE for > that folks. It's time to be thinking "what's the least we have to do to > make this shippable?" I've not asked to bounce SR, I am strongly in favour of it going in, having been supporting the project on and off for 18 months. There is not much sense being talked here. I have asked for sufficient monitoring to allow us to manage it in production, which is IMHO the minimum required to make it shippable. This is a point I have mentioned over the course of many months, not a sudden additional thought. If the majority thinks that being able to find out the current replay point of recovery is all we need to manage replication then I will happily defer to that view, without changing my opinion that we need more. It should be clear that we didn't even have that before I raised the point. Overall, it isn't sensible or appropriate to oppose my viewpoint by putting words into my mouth that have never been said, which applies to most people's comments to me on this recent thread. -- Simon Riggs www.2ndQuadrant.com
Bruce Momjian wrote: > Right, so what is the risk of shipping without any fancy monitoring? > You can monitor the code right now by watching the output shown in the ps display and by trolling the database logs. If I had to I could build a whole monitoring system out of those components, it would just be very fragile. I'd rather see one or two very basic bits of internals exposed beyond those to reduce that effort. I think it's a stretch to say that request represents a design change; a couple of UDFs to expose some internals is all I think it would take to dramatically drop the amount of process/log scraping required here to support a SR system. I guess the slightly more ambitious performance monitoring bits that Simon was suggesting may cross the line as being too late to implement now though (depends on how productive the people actually coding on this are I guess), and certainly the ideas thrown out for implementing any smart behavior or alerting when replication goes bad like Josh's "archiving_lag_action" seem past the deadline to get addressed now--even though I agree with the basic idea. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Tue, 2010-01-12 at 17:41 -0500, Greg Smith wrote: > Bruce Momjian wrote: > > Right, so what is the risk of shipping without any fancy monitoring? > > > > You can monitor the code right now by watching the output shown in the > ps display and by trolling the database logs. If I had to I could build > a whole monitoring system out of those components, it would just be very > fragile. I'd rather see one or two very basic bits of internals exposed > beyond those to reduce that effort. Considering that is pretty much the best we can do with log shipping, I would have to agree. We should either provide real monitoring facilities (not necessarily tools, but at least queries or an api) for the feature or the feature isn't ready to go in. > I think it's a stretch to say that > request represents a design change; a couple of UDFs to expose some > internals is all I think it would take to dramatically drop the amount > of process/log scraping required here to support a SR system. Bingo. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
Greg Smith wrote: > Bruce Momjian wrote: >> Right, so what is the risk of shipping without any fancy monitoring? >> > > You can monitor the code right now by watching the output shown in the > ps display and by trolling the database logs. If I had to I could build > a whole monitoring system out of those components, it would just be very > fragile. I'd rather see one or two very basic bits of internals exposed > beyond those to reduce that effort. I think it's a stretch to say that > request represents a design change; a couple of UDFs to expose some > internals is all I think it would take to dramatically drop the amount > of process/log scraping required here to support a SR system. so is there an actual concrete proposal of _what_ internals to expose? > > I guess the slightly more ambitious performance monitoring bits that > Simon was suggesting may cross the line as being too late to implement > now though (depends on how productive the people actually coding on this > are I guess), and certainly the ideas thrown out for implementing any > smart behavior or alerting when replication goes bad like Josh's > "archiving_lag_action" seem past the deadline to get addressed > now--even though I agree with the basic idea. I'm not convinced that embedding actual alerting functionality in the database is a good idea. Any reasonable production deployment is probably using a dedicated monitoring and alerting system that is aggregating and qualifying all monitoring results (as well as proper rate-limiting and stuff) that just needs a way to read in basic data. Initially something like archiving_lag_action sounds like an invitation to do a send_mail_to_admin() thingy which is really the wrong way to approach monitoring in large scale environments... The database needs to provide very basic information like "we are 10min behind in replication" or "3 wal files behind" - the decision if any of that is an actual issue or not should be left to the actual monitoring system. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > The database needs to provide very basic information like "we are > 10min behind in replication" or "3 wal files behind" - the > decision if any of that is an actual issue or not should be left > to the actual monitoring system. +1 -Kevin
On Monday 11 January 2010 23:24:24 Greg Smith wrote: > Fujii Masao wrote: > > On Mon, Jan 11, 2010 at 5:36 PM, Craig Ringer > > > > <craig@postnewspapers.com.au> wrote: > >> Personally, I'd be uncomfortable enabling something like that without > >> _both_ an admin alert _and_ the ability to refresh the slave's base > >> backup without admin intervention. > > > > What feature do you specifically need as an alert? Just writing > > the warning into the logfile is enough? Or need to notify by > > using SNMP trap message? Though I'm not sure if this is a role > > of Postgres. > > It's impossible for the database to have any idea whatsoever how people > are going to want to be alerted. Provide functions to monitor things > like replication lag, like the number of segments queued up to feed to > archive_command, and let people build their own alerting mechanism for > now. They're going to do that anyway, so why waste precious time here > building something that's unlikely to fit any but a very narrow use case? That said, emitting the information to a log file makes for a crappy way to retrieve the information. The ideal api is that I can find the information out via the result of some SELECT query; view, table, function doesn't matter, as long as I can select it out. Bonus points for being able to get information from the hot standby. -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Tuesday 12 January 2010 17:37:11 Simon Riggs wrote: > There is not much sense being talked here. I have asked for sufficient > monitoring to allow us to manage it in production, which is IMHO the > minimum required to make it shippable. This is a point I have mentioned > over the course of many months, not a sudden additional thought. > Even subscribing to this viewpoint, there is sure to be significant wiggle room in what people find to be "sufficient monitoring". If I had to score the monitoring facilities we have for PITR standby, I'd give it about a crap out of 5, and yet somehow we seem to manage it. > If the majority thinks that being able to find out the current replay > point of recovery is all we need to manage replication then I will > happily defer to that view, without changing my opinion that we need > more. It should be clear that we didn't even have that before I raised > the point. > I'm certainly interested in specifics of what you think need to be exposed for monitoring, and I'd be interested in whether those things can be exposed as either trace points or possibly as C functions. My guess is that we won't get them into core for 8.5, but that we might be able to provide some additional facilities after the fact as we get more of these systems deployed. -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
> I guess the slightly more ambitious performance monitoring bits that > Simon was suggesting may cross the line as being too late to implement > now though (depends on how productive the people actually coding on this > are I guess), and certainly the ideas thrown out for implementing any > smart behavior or alerting when replication goes bad like Josh's > "archiving_lag_action" seem past the deadline to get addressed > now--even though I agree with the basic idea. Well, honestly, I wasn't talking about monitoring at all. I was talking about the general issue of "how should the system behave when it runs out of disk space". For an installation where data integrity is paramount, when replication becomes impossible because there is no more room for logs, then the whole system, master and slaves, should shut down. For most people, they'd just want the master to start ignoring the slave and recycling logs. Presumably, the slave would notice this and shut down. So I was talking about data integrity, not monitoring. However, it's probably a better thing to simply expose a way to query how much extra log data we have, in raw form (bytes or pages). From this, an administration script could take appropriate action. --Josh Berkus
> However, it's probably a better thing to simply expose a way to query > how much extra log data we have, in raw form (bytes or pages). From > this, an administration script could take appropriate action. Also: I think we could release without having this facility. We did with PITR, after all. --Josh
Simon Riggs wrote: > On Tue, 2010-01-12 at 15:42 -0500, Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > The final commit-fest is in 5 days --- this is not the time for design > > > discussion and feature additions. > > > > +10 --- the one reason I can see for deciding to bounce SR is that there > > still seem to be design discussions going on. It is WAY TOO LATE for > > that folks. It's time to be thinking "what's the least we have to do to > > make this shippable?" > > I've not asked to bounce SR, I am strongly in favour of it going in, > having been supporting the project on and off for 18 months. > > There is not much sense being talked here. I have asked for sufficient > monitoring to allow us to manage it in production, which is IMHO the > minimum required to make it shippable. This is a point I have mentioned Let me explain why Simon feels he is misquoted --- Simon, you are saying above that "sufficient monitoring" is a minimum requirement, meaning it is necessary, and I and others are saying if we need to design a monitoring system at this stage to ship SR, then let's forget about this feature for 8.5. In summary, by requiring monitoring, you are encouraging others to just abandon SR completely for 8.5. We didn't say you were suggesting abandoning SR; it is just that the monitoring requirement is making abandonment of SR for 8.5 more likely, because the addition of monitoring could hopelessly delay 8.5 because we have no idea even how to implement monitoring. > over the course of many months, not a sudden additional thought. > > Overall, it isn't sensible or appropriate to oppose my viewpoint by > putting words into my mouth that have never been said, which applies to > most people's comments to me on this recent thread. Yea, yea, everyone seems to misquote you Simon, at least from your perspective. You must admit that you seem to feel that way a lot.
> If the majority thinks that being able to find out the current replay > point of recovery is all we need to manage replication then I will > happily defer to that view, without changing my opinion that we need > more. It should be clear that we didn't even have that before I raised > the point. Good --- let's move forward with a minimal feature set to get SR in 8.5 in a reasonable timeframe. If we have extra time we can add stuff but let's not require it from the start. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Jan 12, 2010 at 10:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> I'm not sure whether poll(2) should be called for this purpose. But >> poll(2) and select(2) seem to often come together in the existing code. >> We should follow such custom? > > Yes. poll() is usually more efficient, so it's preferred, but not all > platforms have it. (On the other side, I think Windows might have > only poll and not select.) OK. I reactivated pq_wait() and secure_poll(), which use poll(2) to check the socket where it's available, and select(2) otherwise. Also, the capability to check the socket for data to be written is not used for SR right now (it was provided previously), so I dropped it for simplification. http://archives.postgresql.org/pgsql-hackers/2010-01/msg00827.php > Oh, I think we need to fix that, I'm thinking of doing a select() in the > loop to check that the socket hasn't been closed yet. I meant we don't > need to try reading the 'X' to tell apart e.g a network problem from a > standby that's shut down cleanly. Without reading the 'X' message from the standby, the walsender doesn't detect the close of the connection immediately in my environment. So I also reactivated the subset of ProcessStreamMessage(). git://git.postgresql.org/git/users/fujii/postgres.git branch: replication Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
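The poll(2)-preferred, select(2)-fallback check described above can be sketched outside the server too. The following is an illustrative Python sketch of the pattern only (the real code is C, inside pq_wait()/secure_poll()): a zero-timeout poll tells us whether the socket has an event pending, and a zero-byte MSG_PEEK read distinguishes a cleanly closed peer from one that simply has nothing to say yet.

```python
import select
import socket


def connection_closed(sock):
    """Check, without blocking, whether the peer has closed the connection.

    Mirrors the poll(2)-preferred / select(2)-fallback pattern: a socket
    that reports readable but returns zero bytes from a peek has been
    closed by the peer.
    """
    if hasattr(select, "poll"):  # prefer poll(2) where the platform has it
        p = select.poll()
        p.register(sock, select.POLLIN | select.POLLHUP)
        if not p.poll(0):        # zero timeout: just peek at the state
            return False
    else:                        # fall back to select(2)
        readable, _, _ = select.select([sock], [], [], 0)
        if not readable:
            return False
    # Readable with no data means EOF, i.e. the peer closed cleanly.
    return sock.recv(1, socket.MSG_PEEK) == b""
```

The platform split Tom mentions is exactly why the fallback exists; Python's select module happens to mirror it, which makes the sketch portable.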
On Tue, Jan 12, 2010 at 10:16 AM, Bruce Momjian <bruce@momjian.us> wrote: > I am concerned that knowledge of this new read-only replication user > would have to be spread all over the backend code, which is really not > something we should be doing at this stage in 8.5 development. I am > also thinking such a special user might fall out of work on mandatory > access control, so maybe we should just require super-user for 8.5 and > revisit this for 8.6. OK. I leave that code as it is. If the majority feel it's overkill to require a superuser privilege when authenticating the standby, we can just drop it later. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Stefan Kaltenbrunner wrote: > so is there an actual concrete proposal of _what_ internals to expose? The pieces are coming together...summary: -Status quo: really bad, but could probably ship anyway because existing PITR is no better and people manage to use it -Add slave pg_current_xlog_location() and something like pg_standby_received_xlog_location(): Much better, gets rid of the worst issues here. -Also add pg_standbys_xlog_location() on the master: while they could live without it, this really helps out the "alert/monitor" script writer whose use cases keep popping up here. Details...the original idea from Fujii was: "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns one row per standby servers, showing pid of walsender, host name/ port number/user OID of the standby, the location where the standby has written/flushed WAL. DBA can measure the gap from the combination of pg_current_xlog_location() and pg_standbys_xlog_location() via one query on the primary." After some naming quibbles and questions about what direction that should happen in, Tom suggested the initial step here is: "It seems to me that we should have at least two functions available on the slave: latest xlog location received and synced to disk by walreceiver (ie, we are guaranteed to be able to replay up to here); and latest xlog location actually replayed (ie, the state visible to queries on the slave). The latter perhaps could be pg_current_xlog_location()." So there's the first two of them: on the slave, pg_current_xlog_location() giving the latest location replayed, and a new one named something like pg_standby_received_xlog_location(). If you take the position that an unreachable standby does provide answers to these questions too (you just won't like them), this pair might be sufficient to ship.
To help a lot at dealing with all the error situations where the standby isn't reachable and segments are piling up (possibly leading to full disk), the next figure that seems to answer the most questions is asking the primary "what's the location of the last WAL segment file in the pile of ones to be archived/distributed that has been requested (or processed if that's the easier thing to note) by the standby?". That's what is named pg_standbys_xlog_location() in the first paragraph I quoted. If you know enough to identify that segment file on disk, you can always look at its timestamp (and the ones on the rest of the files in that directory) in a monitoring script to turn that information into segments or a time measurement instead--xlog segments are nicely ordered after all. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, Jan 13, 2010 at 5:47 PM, Greg Smith <greg@2ndquadrant.com> wrote: > The pieces are coming together...summary: Thanks for the summary! > -Also add pg_standbys_xlog_location() on the master: while they could live without it, this really helps out the "alert/monitor" script writer whose use cases keep popping up here. > > Details...the original idea from Fujii was: > > "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns > one row per standby servers, showing pid of walsender, host name/ > port number/user OID of the standby, the location where the standby > has written/flushed WAL. DBA can measure the gap from the > combination of pg_current_xlog_location() and pg_standbys_xlog_location() > via one query on the primary." This function is useful but not essential for troubleshooting, I think. So I'd like to postpone it. > "It seems to me that we should have at least two functions available > on the slave: latest xlog location received and synced to disk by > walreceiver (ie, we are guaranteed to be able to replay up to here); > and latest xlog location actually replayed (ie, the state visible > to queries on the slave). The latter perhaps could be > pg_current_xlog_location()." > > So there's the first two of them: on the slave, pg_current_xlog_location() > giving the latest location replayed, and a new one named something like > pg_standby_received_xlog_location(). If you take the position that an > unreachable standby does provide answers to these questions too (you just > won't like them), this pair might be sufficient to ship. Done. git://git.postgresql.org/git/users/fujii/postgres.git branch: replication I added two new functions: (1) pg_last_xlog_receive_location() reports the last WAL location received and synced by walreceiver. If streaming replication is still in progress this will increase monotonically.
If streaming replication has completed then this value will remain static at the value of the last WAL record received and synced. When the server has been started without streaming replication then the return value will be InvalidXLogRecPtr (0/0). (2) pg_last_xlog_replay_location() reports the last WAL location replayed during recovery. If recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last WAL record applied. When the server has been started normally without recovery then the return value will be InvalidXLogRecPtr (0/0). Since it's somewhat odd for me that pg_current_xlog_location() reports the WAL replay location, I didn't do that. But if the majority feel that it's sane, I'll merge pg_last_xlog_replay_location() into pg_current_xlog_location(). Thoughts? Better name? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
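With these two functions, measuring replay lag from a monitoring script reduces to arithmetic on the textual locations they return. A hedged sketch in Python follows; the 0xFF000000 bytes-per-xlogid multiplier is an assumption reflecting how servers of this era skip the last segment of each logical xlog file, so verify it against your server version before trusting absolute numbers.

```python
def lsn_to_bytes(loc, bytes_per_xlogid=0xFF000000):
    """Convert a textual WAL location like '16/B374D848' to a byte offset.

    bytes_per_xlogid assumes 0xFF000000 usable bytes per logical xlog
    file (servers of this vintage skip the last segment of each file).
    """
    xlogid, xrecoff = (int(part, 16) for part in loc.split("/"))
    return xlogid * bytes_per_xlogid + xrecoff


def replication_lag_bytes(received, replayed):
    """Bytes between the pg_last_xlog_receive_location() and
    pg_last_xlog_replay_location() results for one standby."""
    return lsn_to_bytes(received) - lsn_to_bytes(replayed)
```

A monitoring script would SELECT both locations on the standby and feed them through replication_lag_bytes() to get a single number to graph or alert on.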
Fujii Masao wrote: >> "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns >> one row per standby servers, showing pid of walsender, host name/ >> port number/user OID of the standby, the location where the standby >> has written/flushed WAL. DBA can measure the gap from the >> combination of pg_current_xlog_location() and pg_standbys_xlog_location() >> via one query on the primary." > This function is useful but not essential for troubleshooting, I think. > So I'd like to postpone it. Sure; in a functional system where primary and secondary are both up, you can assemble the info using the new functions you just added, so this other one is certainly optional. I just took a brief look at the code of the features you added, and it looks like it exposes the minimum necessary to make this whole thing possible to manage. I think it's OK if you postpone this other bit, more important stuff for you to work on.

So: the one piece of information I thought was most important to expose here at an absolute minimum is there now. Good progress. The other popular request that keeps popping up here is providing an easy way to see how backlogged the archive_command is, to make it easier to monitor for out of disk errors that might prove catastrophic to replication.

I just spent some time looking through the WAL/archiving code in that context. It looks to me that this information isn't really stored anywhere right now. The only thing that knows what segment is currently queued up to copy over is pgarch_ArchiverCopyLoop via its call to pgarch_readyXlog. Now, this is a pretty brute-force piece of code: it doesn't remember its previous work at all, it literally walks the archive_status directory looking for *.ready files that have names that look like xlog files, then returns the earliest. That unfortunately means that it's not even thinking in the same terms as all these other functions, which are driven by the xlog_location advancing, and then the filename is computed from that. All you've got is the filename at this point, and it's not even guaranteed to be real--you could easily fool this code if you dropped an inappropriately named file into that directory.

I could easily update this code path to save the name of the last archived file in memory while all this directory scanning is going on anyway, and then provide a UDF to expose that bit of information. The result would need to have documentation that disclaims it like this:

pg_last_archived_xlogfile() text: Get the name of the last file the archive_command [tried to|successfully] archived since the server was started. If archiving is disabled or no xlog files have become ready to archive since startup, a blank line will be returned. It is possible for this function to return a result that does not reflect an actual xlogfile if files are manually added to the server's archive_status directory.

I'd find this extremely handy as a hook for monitoring scripts that want to watch the server but don't have access to the filesystem directly, even given those limitations. I'd prefer to have the "tried to" version, because it will populate with the name of the troublesome file it's stuck on even if archiving never gets its first segment delivered.

I'd happily write a patch to handle all that if I thought it would be accepted. I fear that the whole approach will be considered a bit too hackish and get rejected on that basis though. Not really sure of a "right" way to handle this though. Anything better is going to be more complicated because it requires passing more information into the archiver, with little gain for that work beyond improving the quality of this diagnostic routine. And I think most people would find what I described above useful enough.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
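The pgarch_readyXlog() behavior described above--rescan archive_status on every call, keep only *.ready entries whose names look like xlog files, return the earliest--amounts to the following. This is a Python rendering of the logic for illustration only, not the actual C source.

```python
import os
import re

# An xlog segment name is 24 uppercase hex digits (timeline, log id,
# segment number); we match the plain case here, ignoring .backup files.
READY_RE = re.compile(r"^[0-9A-F]{24}\.ready$")


def earliest_ready_xlog(archive_status_dir):
    """Return the earliest xlog file with a .ready marker, or None.

    Like pgarch_readyXlog(), this remembers nothing between calls: it
    rescans the directory every time and relies on the fact that xlog
    file names sort lexicographically in generation order.
    """
    ready = [name[:-len(".ready")]
             for name in os.listdir(archive_status_dir)
             if READY_RE.match(name)]
    return min(ready) if ready else None
```

This also makes the fragility visible: the filter is purely name-based, so a bogus file whose name happens to be 24 hex digits would be picked up whether or not a real segment exists behind it.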
On Thu, 2010-01-14 at 23:07 -0500, Greg Smith wrote: > pg_last_archived_xlogfile() text: Get the name of the last file the > archive_command [tried to|successfully] archived since the server was > started. If archiving is disabled or no xlog files have become ready > to archive since startup, a blank line will be returned. OK > It is possible for this function to return a result that does not > reflect an actual xlogfile if files are manually added to the server's > archive_status directory. > I'd find this extremely handy as a hook for monitoring scripts that > want to watch the server but don't have access to the filesystem > directly, even given those limitations. I'd prefer to have the "tried > to" version, because it will populate with the name of the troublesome > file it's stuck on even if archiving never gets its first segment > delivered. > > I'd happily write a patch to handle all that if I thought it would be > accepted. I fear that the whole approach will be considered a bit too > hackish and get rejected on that basis though. Not really sure of a > "right" way to handle this though. Anything better is going to be > more complicated because it requires passing more information into the > archiver, with little gain for that work beyond improving the quality > of this diagnostic routine. And I think most people would find what I > described above useful enough. Yes, please write it. It's separate from SR, so will not interfere. -- Simon Riggs www.2ndQuadrant.com
Greg Smith wrote: > Fujii Masao wrote: >>> "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns >>> one row per standby servers, showing pid of walsender, host name/ >>> port number/user OID of the standby, the location where the standby >>> has written/flushed WAL. DBA can measure the gap from the >>> combination of pg_current_xlog_location() and pg_standbys_xlog_location() >>> via one query on the primary." >>> >> >> This function is useful but not essential for troubleshooting, I think. >> So I'd like to postpone it. >> > > Sure; in a functional system where primary and secondary are both up, > you can assemble the info using the new functions you just added, so > this other one is certainly optional. I just took a brief look at the > code of the features you added, and it looks like it exposes the minimum > necessary to make this whole thing possible to manage. I think it's OK > if you postpone this other bit, more important stuff for you to work on. agreed > > So: the one piece of information I thought was most important to expose > here at an absolute minimum is there now. Good progress. The other > popular request that keeps popping up here is providing an easy way to > see how backlogged the archive_command is, to make it easier to monitor > for out of disk errors that might prove catastrophic to replication. I tend to disagree - in any reasonable production setup basic stuff like disk space usage is monitored by non-application specific means. While monitoring backlog might be interesting for other reasons, citing disk space usage/exhaustion seems just wrong. [...] > > I'd find this extremely handy as a hook for monitoring scripts that want > to watch the server but don't have access to the filesystem directly, > even given those limitations.
I'd prefer to have the "tried to" > version, because it will populate with the name of the troublesome file > it's stuck on even if archiving never gets its first segment delivered. While fancy and all, I think this goes way too far for the first cut at SR (or, say, this release); monitoring disk usage and tracking log files for errors are SOLVED issues in established production setups. If you are in an environment that does neither for each and every server, independent of what you have running on it, or a setup where the sysadmins are clueless and the poor DBA has to hack around that fact, you have way bigger issues anyway. Stefan
Greg Smith wrote: > to make it easier to monitor for out of disk errors that might > prove catastrophic to replication. We handle that with the fsutil functions (in pgfoundry). This can actually measure free space on each volume. These weren't portable enough to include in core, but maybe they could be made more portable? -Kevin
Stefan Kaltenbrunner wrote: > Greg Smith wrote: >> >> The other popular request that keeps popping up here is providing an >> easy way to see how backlogged the archive_command is, to make it >> easier to monitor for out of disk errors that might prove >> catastrophic to replication. > > I tend to disagree - in any reasonable production setup basic stuff > like disk space usage is monitored by non-application specific means. > While monitoring backlog might be interesting for other reasons, > citing disk space usage/exhaustion seems just wrong. I was just mentioning that one use of the data, but there are others. Let's say that your archive_command works by copying things over to a NFS mount, and the mount goes down. It could be a long time before you noticed this via disk space monitoring. But if you were monitoring "how long has it been since the last time pg_last_archived_xlogfile() changed?", this would jump right out at you. Another popular question is "how far behind real-time is the archiver process?" You can do this right now by duplicating the same xlog file name scanning and sorting that the archiver does in your own code, looking for .ready files. It would be simpler if you could call pg_last_archived_xlogfile() and then just grab that file's timestamp. I think it's also important to consider the fact that diagnostic internals exposed via the database are far more useful to some people than things you have to set up outside of it. You talk about reasonable configurations above, but some production setups are not so reasonable. In many of the more secure environments I've worked in (finance, defense), there is *no* access to the database server beyond what comes out of port 5432 without getting a whole separate team of people involved. If the DBA can write a simple monitoring program themselves that presents data via the one port that is exposed, that makes life easier for them.
This same issue pops up sometimes when we consider the shared hosting case too, where the user may not have the option of running a full-fledged monitoring script. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith <greg@2ndquadrant.com> wrote: > In many of the more secure environments I've worked in (finance, > defense), there is *no* access to the database server beyond what > comes out of port 5432 without getting a whole separate team of > people involved. If the DBA can write a simple monitoring program > themselves that presents data via the one port that is exposed, > that makes life easier for them. Right, we don't want to give the monitoring software an OS login for the database servers, for security reasons. -Kevin
Greg Smith wrote: > Stefan Kaltenbrunner wrote: >> Greg Smith wrote: >>> >>> The other popular request that keeps popping up here is providing an >>> easy way to see how backlogged the archive_command is, to make it >>> easier to monitor for out of disk errors that might prove >>> catastrophic to replication. >> >> I tend to disagree - in any reasonable production setup basic stulff >> like disk space usage is monitored by non-application specific matters. >> While monitoring backlog might be interesting for other reasons, >> citing disk space usage/exhaustions seems just wrong. > > I was just mentioning that one use of the data, but there are others. > Let's say that your archive_command works by copying things over to a > NFS mount, and the mount goes down. It could be a long time before you > noticed this via disk space monitoring. But if you were monitoring "how > long has it been since the last time pg_last_archived_xlogfile() > changed?", this would jump right out at you. well, from a sysadmin perspective you have to monitor the NFS mount anyway - so why do you need the database to do it too (and not in a sane way, because there is no way the database can even figure out what the real problem is and if there is one)? > > Another popular question is "how far behind real-time is the archiver > process?" You can do this right now by duplicating the same xlog file > name scanning and sorting that the archiver does in your own code, > looking for .ready files. It would be simpler if you could call > pg_last_archived_xlogfile() and then just grab that file's timestamp. well that one seems a more reasonable reasoning to me however I'm not so sure that the proposed implementation feels right - though can't come up with a better suggestion for now. > > I think it's also important to consider the fact that diagnostic > internals exposed via the database are far more useful to some people > than things you have to setup outside of it.
You talk about reasonable > configurations above, but some production setups are not so reasonable. > In many of the more secure environments I've worked in (finance, > defense), there is *no* access to the database server beyond what comes > out of port 5432 without getting a whole separate team of people > involved. If the DBA can write a simple monitoring program themselves > that presents data via the one port that is exposed, that makes life > easier for them. This same issue pops up sometimes when we consider the > shared hosting case too, where the user may not have the option of > running a full-fledged monitoring script. well, again, I consider stuff like "available disk space" or "NFS mount available" completely in the realm of OS-level management. The database side should focus on the stuff that concerns the internal state and operation of the database app itself. If you continue your line of thought you will have to add all kind of stuff to the database, like CPU usage tracking, getting information about running processes, storage health. As soon as you are done you have reimplemented nagios-plugins over SQL on port 5432 instead of NRPE (or SNMP or whatnot). Again, I fully understand and know that there are environments where the DBA does not have OS-level access (be it root or no shell at all) to the OS, but even if you had that "archiving is hanging" function you would still have to go back to that "completely different group" and have them diagnose again. So my point is that even if you have disparate groups of people responsible for different parts of a system solution, you can't really work around incompetence (or slowness or whatever) of the group responsible for the lower layer by adding partial and inexact functionality at the upper layer that can only guess what the real issue is. Stefan
Kevin Grittner wrote: > Greg Smith <greg@2ndquadrant.com> wrote: > >> In many of the more secure environments I've worked in (finance, >> defense), there is *no* access to the database server beyond what >> comes out of port 5432 without getting a whole separate team of >> people involved. If the DBA can write a simple monitoring program >> themselves that presents data via the one port that is exposed, >> that makes life easier for them. > > Right, we don't want to give the monitoring software an OS login for > the database servers, for security reasons. depending on what you exactly mean by that I do have to wonder how you monitor more complex stuff (or stuff that requires elevated privs) - say raid health, multipath configuration, status of OS level updates, "are certain processes running or not" as well as basic parameters like CPU or IO load. as in stuff you cannot know unless you have it exported through "some" port. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > Kevin Grittner wrote: >> Right, we don't want to give the monitoring software an OS login >> for the database servers, for security reasons. > > depending on what you exactly mean by that I do have to wonder how > you monitor more complex stuff (or stuff that require elevated > privs) - say raid health, multipath configuration, status of OS > level updates, "are certain processes running or not" as well as > basic parameters like CPU or IO load. as in stuff you cannot know > usless you have it exported through "some" port. Many of those are monitored on the server one way or another, through a hardware card accessible only to the DBAs. The card sends an email to the DBAs for any sort of distress, including impending or actual drive failure, ambient temperature out of bounds, internal or external power out of bounds, etc. OS updates are managed by the DBAs through scripts. Ideally we would tie these in to our opcenter software, which displays status through hundreds of "LED" boxes on big plasma displays in our support areas (and can send emails and jabber messages when things get to a bad state), but since the messages are getting to the right people in a timely manner, this is a low priority as far as monitoring enhancement requests go. Only the DBAs have OS logins to database servers. Monitoring software must deal with application ports (which have to be open anyway, so that doesn't add any security risk). Since the hardware monitoring doesn't know about file systems, and the disk space on database servers is primarily an issue for the database, it made sense to us to add the ability to check the space available to the database through a database connection. Hence, fsutil. -Kevin
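The check Kevin describes boils down to asking the OS for free space on the volume a given path lives on. A minimal sketch of the underlying call (Python here for brevity; this is illustrative, not fsutil's actual code, and statvfs assumes a POSIX system, which is exactly the portability problem mentioned upthread):

```python
import os

def free_bytes(path):
    """Free disk space, in bytes, available to unprivileged users
    on the volume containing path (blocks available * fragment size)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize
```

Wrapped in a SQL-callable function, this is the figure the DBA can then watch through the one port that is exposed.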
Stefan Kaltenbrunner wrote: >> >> Another popular question is "how far behind real-time is the archiver >> process?" You can do this right now by duplicating the same xlog >> file name scanning and sorting that the archiver does in your own >> code, looking for .ready files. It would be simpler if you could >> call pg_last_archived_xlogfile() and then just grab that file's >> timestamp. > > well that one seems a more reasonable reasoning to me however I'm not > so sure that the proposed implementation feels right - though can't > come up with a better suggestion for now. That's basically where I'm at, and I was looking more for feedback on that topic rather than to get lost defending use-cases here. There are a few of them, and you can debate their individual merits all day. As a general comment on your line of criticism here, I feel the idea that "we're monitoring that already via <x>" does not mean that an additional check is without value. The kind of people who like redundancy in their database like it in their monitoring, too. I feel there's at least one unique thing exposing this bit buys you, and the fact that it can be a useful secondary source of information too for systems monitoring is a welcome bonus--regardless of whether good practice already supplies a primary one. > If you continue your line of thought you will have to add all kind of > stuff to the database, like CPU usage tracking, getting information > about running processes, storage health. I'm looking to expose something that only the database knows for sure--"what is the archiver working on?"--via the standard way you ask the database questions, a SELECT call. The database doesn't know anything about the CPU, running processes, or storage, so suggesting this path leads in that direction doesn't make any sense. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith wrote: > Stefan Kaltenbrunner wrote: >>> >>> Another popular question is "how far behind real-time is the archiver >>> process?" You can do this right now by duplicating the same xlog >>> file name scanning and sorting that the archiver does in your own >>> code, looking for .ready files. It would be simpler if you could >>> call pg_last_archived_xlogfile() and then just grab that file's >>> timestamp. >> >> well that one seems a more reasonable reasoning to me however I'm not >> so sure that the proposed implementation feels right - though can't >> come up with a better suggestion for now. > > That's basically where I'm at, and I was looking more for feedback on > that topic rather than to get lost defending use-cases here. There are > a few of them, and you can debate their individual merits all day. As a > general comment on your line of criticism here, I feel the idea that > "we're monitoring that already via <x>" does not mean that an additional > check is without value. The kind of people who like redundancy in their > database like it in their monitoring, too. I feel there's at least one > unique thing exposing this bit buys you, and the fact that it can be a > useful secondary source of information too for systems monitoring is a > welcome bonus--regardless of whether good practice already supplies a > primary one. well, that might be true - but as somebody with an extensive sysadmin background I was specifically ticked off by the "disk full" stuff mentioned upthread. Monitoring also means standardization, and somebody who runs hundreds (or dozens) of servers is much better off getting the basics monitored the same way on all systems and getting more specific as you move up the (application) stack. > >> If you continue your line of thought you will have to add all kind of >> stuff to the database, like CPU usage tracking, getting information >> about running processes, storage health.
> > I'm looking to expose something that only the database knows for > sure--"what is the archiver working on?"--via the standard way you ask > the database questions, a SELECT call. The database doesn't know > anything about the CPU, running processes, or storage, so suggesting > this path leads in that direction doesn't make any sense. well, the database does not really know much about "free disk space" in reality either - the only thing it knows is that it might not be able to write data or execute a script, and unless you have shell/logfile access you cannot diagnose those anyway, even with all the proposed functions. However, what I was really trying to say is that we should focus on getting the code stable first, and that prettying it up with fancy stat functions is something that really can and should be done in a follow-up release, once we understand how the code behaves and maybe also how it is likely going to evolve... Stefan
Kevin Grittner wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >> Kevin Grittner wrote: > >>> Right, we don't want to give the monitoring software an OS login >>> for the database servers, for security reasons. >> depending on what you exactly mean by that I do have to wonder how >> you monitor more complex stuff (or stuff that requires elevated >> privs) - say raid health, multipath configuration, status of OS >> level updates, "are certain processes running or not" as well as >> basic parameters like CPU or IO load. as in stuff you cannot know >> unless you have it exported through "some" port. > > Many of those are monitored on the server one way or another, > through a hardware card accessible only to the DBAs. The card sends > an email to the DBAs for any sort of distress, including impending > or actual drive failure, ambient temperature out of bounds, internal > or external power out of bounds, etc. OS updates are managed by the > DBAs through scripts. Ideally we would tie these in to our opcenter > software, which displays status through hundreds of "LED" boxes on > big plasma displays in our support areas (and can send emails and > jabber messages when things get to a bad state), but since the > messages are getting to the right people in a timely manner, this is > a low priority as far as monitoring enhancement requests go. well, a lot of people (including myself) consider it a necessity to aggregate all that stuff in your system monitoring; only that way can you guarantee proper dependency handling (i.e. no need to page for "webserver not running" if the whole server is down). There is also a case to be made for statistics tracking and long-term monitoring of stuff. > > Only the DBAs have OS logins to database servers. Monitoring > software must deal with application ports (which have to be open > anyway, so that doesn't add any security risk).
Since the hardware > monitoring doesn't know about file systems, and the disk space on > database servers is primarily an issue for the database, it made > sense to us to add the ability to check the space available to the > database through a database connection. Hence, fsutil. still seems very backwards - there is much, much more that can only be monitored from within the OS (and not from an external iLO/RSA/IMM/DRAC/whatever) and that you cannot really do from within the database (or any other application), so I'm still puzzled... Stefan
> I'd happily write a patch to handle all that if I thought it would be > accepted. I fear that the whole approach will be considered a bit too > hackish and get rejected on that basis though. Not really sure of a > "right" way to handle this though. Anything better is going to be more > complicated because it requires passing more information into the > archiver, with little gain for that work beyond improving the quality of > this diagnostic routine. And I think most people would find what I > described above useful enough. Yeah, I think we should focus right now on "what monitoring can we get into this version without holding up release?" Your proposal sounds like a good one in that respect. In future versions, I think we'll want a host of granular data, including: * amount of *time* since last successful archive (this would be a good trigger for alerts) * number of failed archive attempts * number of archive files awaiting processing (presumably monitored by the slave) * last archive file processed by the slave, and when * for HS: frequency and length of conflict delays in log processing, as a stat * for HS: number of query cancels due to write/lock conflicts from the master, as a stat However, *all* of the above can wait for the next version, especially since by then we'll have user feedback from the field on required monitoring. If we try to nail this all down now, not only will it delay the release, but we'll get it wrong and have to re-do it anyway. Release early and often, y'know? I think it's key to keep our data as granular and low-level as possible; with good low-level data people can write good tools, but if we over-summarize they can't. Also, it would be nice to have all of our archiving stuff grouped into something like pg_stat_archive rather than being a bunch of disconnected functions. --Josh Berkus
On Sun, Jan 17, 2010 at 8:53 AM, Josh Berkus <josh@agliodbs.com> wrote: > * amount of *time* since last successful archive (this would be a good > trigger for alerts) > * number of failed archive attempts > * number of archive files awaiting processing (presumably monitored by > the slave) > * last archive file processed by the slave, and when Are these for warm-standby, not SR? At least SR isn't so much involved in WAL archiving, i.e., WAL is sent to the standby by walsender instead of an archiver. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-01-14 at 17:33 +0900, Fujii Masao wrote: > I added two new functions; > > (1) pg_last_xlog_receive_location() reports the last WAL location received > and synced by walreceiver. If streaming replication is still in progress > this will increase monotonically. If streaming replication has completed > then this value will remain static at the value of the last WAL record > received and synced. When the server has been started without a streaming > replication then the return value will be InvalidXLogRecPtr (0/0). > > (2) pg_last_xlog_replay_location() reports the last WAL location replayed > during recovery. If recovery is still in progress this will increase > monotonically. If recovery has completed then this value will remain > static at the value of the last WAL record applied. When the server has > been started normally without a recovery then the return value will be > InvalidXLogRecPtr (0/0). I just noticed that these functions have almost the same name as functions I wrote for Hot Standby and Heikki removed from that patch. The function code and docs are 99% identical. I'm happy that the code was used and it is BSD, though it does seem strange to have this credited to others in the release notes. From May 2 2009 the patch included + <entry> + <literal><function>pg_last_recovered_xlog_location</function>()</literal> + </entry> + <entry><type>text</type></entry> + <entry>Returns the transaction log location of the last WAL record + in the current recovery. If recovery is still in progress this + will increase monotonically. If recovery is complete then this value will + remain static at the value of the last transaction applied during that + recovery. When the server has been started normally this will return + InvalidXLogRecPtr (0/0). + </entry> with code + /* + * Returns xlog location of last recovered WAL record.
+ */ + Datum + pg_last_recovered_xlog_location(PG_FUNCTION_ARGS) + { + char location[MAXFNAMELEN]; + + { + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + SpinLockAcquire(&xlogctl->info_lck); + + LastRec = xlogctl->recoveryLastRecPtr; + + SpinLockRelease(&xlogctl->info_lck); + } + + snprintf(location, sizeof(location), "%X/%X", + LastRec.xlogid, LastRec.xrecoff); + PG_RETURN_TEXT_P(cstring_to_text(location)); + } -- Simon Riggs www.2ndQuadrant.com
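A monitoring script polling these two functions can turn the returned "xlogid/xrecoff" strings into a byte count of WAL received but not yet replayed. A sketch of the arithmetic (the 0xFF000000 bytes-per-xlogid constant reflects the current WAL addressing, 255 segments of 16 MB per log id; treat that constant as an assumption of this sketch, as are the function names):

```python
BYTES_PER_XLOGID = 0xFF000000  # 255 segments of 16 MB per log id (assumed)

def lsn_to_bytes(lsn):
    """Map an 'xlogid/xrecoff' location string, e.g. the output of
    pg_last_xlog_receive_location(), to a linear byte position."""
    xlogid, xrecoff = (int(part, 16) for part in lsn.split("/"))
    return xlogid * BYTES_PER_XLOGID + xrecoff

def replay_lag_bytes(receive_lsn, replay_lsn):
    """WAL received by walreceiver but not yet replayed, in bytes."""
    return lsn_to_bytes(receive_lsn) - lsn_to_bytes(replay_lsn)
```

This is the kind of low-level, granular data Josh argues for: the server exposes two raw locations, and the tooling decides how to summarize them.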
Simon Riggs wrote: > On Thu, 2010-01-14 at 17:33 +0900, Fujii Masao wrote: > > > I added two new functions; > > > > (1) pg_last_xlog_receive_location() reports the last WAL location received > > and synced by walreceiver. If streaming replication is still in progress > > this will increase monotonically. If streaming replication has completed > > then this value will remain static at the value of the last WAL record > > received and synced. When the server has been started without a streaming > > replication then the return value will be InvalidXLogRecPtr (0/0). > > > > (2) pg_last_xlog_replay_location() reports the last WAL location replayed > > during recovery. If recovery is still in progress this will increase > > monotonically. If recovery has completed then this value will remain > > static at the value of the last WAL record applied. When the server has > > been started normally without a recovery then the return value will be > > InvalidXLogRecPtr (0/0). > > I just noticed that these functions have almost the same name as > functions I wrote for Hot Standby and Heikki removed from that patch. > The function code and docs are 99% identical. > > I'm happy that the code was used and it is BSD, though it does seem > strange to have this credited to others in the release notes. Sorry, release notes updated: Add <link linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> and <function>pg_last_xlog_replay_location()</>, which can be used to monitor standby server <acronym>WAL</> activity (Simon) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
On Tue, Mar 23, 2010 at 7:56 AM, Bruce Momjian <bruce@momjian.us> wrote: > Sorry, release notes updated: > > Add <link > linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> > and <function>pg_last_xlog_replay_location()</>, which > can be used to monitor standby server <acronym>WAL</> > activity (Simon) Umm... though I'm not sure about the policy on credit, I think that all three names should be listed. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Tue, Mar 23, 2010 at 7:56 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Sorry, release notes updated: > > > > Add <link > > linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> > > and <function>pg_last_xlog_replay_location()</>, which > > can be used to monitor standby server <acronym>WAL</> > > activity (Simon) > > Umm... though I'm not sure about the policy on credit, I think that > all three names should be listed. OK, all three are there now: Add <link linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> and <function>pg_last_xlog_replay_location()</>, which can be used to monitor standby server <acronym>WAL</> activity (Simon, Fujii Masao, Heikki) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do