Thread: Streaming replication status
I've gone through the patch in detail now. Here's my list of remaining
issues:

* If there's no WAL to send, walsender doesn't notice if the client has
closed connection already. This is the issue Fujii reported already.
We'll need to add a select() call to the walsender main loop to check if
the socket has been closed.

* I removed the feature that archiver was started during recovery. The
idea of that was to enable archiving from a standby server, to relieve
the master server of that duty, but I found it annoying because it
causes trouble if the standby and master are configured to archive to
the same location; they will fight over which copies the file to the
archive first. Frankly the feature doesn't seem very useful as the patch
stands, because you still have to configure archiving in the master in
practice; you can't take an online base backup otherwise, and you have
the risk of standby falling too much behind and having to restore from
base backup whenever the standby is disconnected for any reason. Let's
revisit this later when it's truly useful.

* We still have a related issue, though: if standby is configured to
archive to the same location as master (as it always is on my laptop,
where I use the postgresql.conf of the master unmodified in the server),
right after failover the standby server will try to archive all the old
WAL files that were streamed from the master; but they exist already in
the archive, as the master archived them already. I'm not sure if this
is a pilot error, or if we should do something in the server to tell
apart WAL segments streamed from master and those generated in the
standby server after failover. Maybe we should immediately create a
.done file for every file received from master?

* I don't think we should require superuser rights for replication.
Although you see all WAL and potentially all data in the system through
that, a standby doesn't need any write access to the master, so it would
be good practice to create a dedicated account with limited privileges
for replication.

* A standby that connects to master, initiates streaming, and then sits
idle without stalls recycling of old WAL files in the master. That will
eventually lead to a full disk in master. Do we need some kind of a
emergency valve on that?

* Do we really need REPLICATION_DEBUG_ENABLED? The output doesn't seem
very useful to me.

* Need to add comments somewhere to note that ReadRecord depends on the
fact that a WAL record is always send as whole, never split across two
messages.

* Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
increments?

* Walreceiver should flush less aggresively than after each received
piece of WAL as noted by XXX comment.

* Consider renaming PREPARE_REPLICATION to IDENTIFY_SYSTEM or something.

* What's the change in bgwriter.c for?

* ReadRecord/FetchRecord is a bit of a mess. I earlier tried to refactor
it into something simpler a couple of times, but failed. So I'm going to
leave it as it is, but if someone else wants to give it a shot, that
would be good.

* Documentation. The patch used to move around some sections, but I
think that has been partially reverted so that it now just duplicates
them. It probably needs other work too, I haven't looked at the docs in
any detail.

These are all the issues I know of right now. Assuming no new issues
crop up (which often does happen), the patch is ready for committing
after those have been addressed.

Attached is my latest version as a patch, also available in my git
repository.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
> * A standby that connects to master, initiates streaming, and then sits
> idle without stalls recycling of old WAL files in the master. That will
> eventually lead to a full disk in master. Do we need some kind of a
> emergency valve on that?

WARNING: I haven't thought about how this would work together with HS yes.

I think this needs to be administrator-configurable.

I'd suggest a GUC approach:

archiving_lag_action = { ignore, shutdown, stop }

"Ignore" would be the default. Some users would rather have the master
shut down if the slave has stopped taking segments; that's "shutdown".
Otherwise, it's "stop" which simply stops archiving and starts recylcing
when we reach that number of segments.

Better name for the GUC very welcome ...

--Josh Berkus
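Written out as a hypothetical postgresql.conf fragment, the proposal above would look something like this (the GUC name and value list are a suggestion from this mail, not an existing setting):

```
# Hypothetical GUC -- not an actual PostgreSQL setting.
# What to do when a connected standby stops consuming WAL and the
# master reaches its retained-segment limit:
#   ignore   - keep retaining segments (default; disk may fill up)
#   shutdown - shut the master down
#   stop     - stop archiving and start recycling old segments
archiving_lag_action = ignore
```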
On Fri, Jan 8, 2010 at 9:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> * We still have a related issue, though: if standby is configured to
> archive to the same location as master (as it always is on my laptop,
> where I use the postgresql.conf of the master unmodified in the server),
> right after failover the standby server will try to archive all the old
> WAL files that were streamed from the master; but they exist already in
> the archive, as the master archived them already. I'm not sure if this
> is a pilot error, or if we should do something in the server to tell
> apart WAL segments streamed from master and those generated in the
> standby server after failover. Maybe we should immediately create a
> .done file for every file received from master?

How do we know the master has finished archiving them? If the master
crashes suddenly and you fail over couldn't it have failed to archive
segments that have been received by the standby via streaming
replication?

> * Need to add comments somewhere to note that ReadRecord depends on the
> fact that a WAL record is always send as whole, never split across two
> messages.

What happens in the case of the very large records Tom was describing
recently. If the entire record doesn't fit in a WAL segment is it the
whole record or the partial record with the continuation bit that
needs to fit in a message?

-- 
greg
On Sat, Jan 9, 2010 at 6:16 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I've gone through the patch in detail now. Here's my list of remaining
> issues:

Great! Thanks a lot!

> * If there's no WAL to send, walsender doesn't notice if the client has
> closed connection already. This is the issue Fujii reported already.
> We'll need to add a select() call to the walsender main loop to check if
> the socket has been closed.

We should reactivate pq_wait() and secure_poll()?

> * I removed the feature that archiver was started during recovery. The
> idea of that was to enable archiving from a standby server, to relieve
> the master server of that duty, but I found it annoying because it
> causes trouble if the standby and master are configured to archive to
> the same location; they will fight over which copies the file to the
> archive first. Frankly the feature doesn't seem very useful as the patch
> stands, because you still have to configure archiving in the master in
> practice; you can't take an online base backup otherwise, and you have
> the risk of standby falling too much behind and having to restore from
> base backup whenever the standby is disconnected for any reason. Let's
> revisit this later when it's truly useful.

Okey.

> * We still have a related issue, though: if standby is configured to
> archive to the same location as master (as it always is on my laptop,
> where I use the postgresql.conf of the master unmodified in the server),
> right after failover the standby server will try to archive all the old
> WAL files that were streamed from the master; but they exist already in
> the archive, as the master archived them already. I'm not sure if this
> is a pilot error, or if we should do something in the server to tell
> apart WAL segments streamed from master and those generated in the
> standby server after failover. Maybe we should immediately create a
> .done file for every file received from master?

There is no guarantee that such file has already been archived by master.
This is just an idea, but a new WAL record indicating the completion of
the archiving would be useful for the standby to create the .done file.
But, this idea might kill the "archiving during recovery" idea discussed
above.

Personally, I'm OK with that issue because we can avoid it by tweaking
archive_command. Could we revisit this discussion with the "archiving
during recovery" discussion later?

> * I don't think we should require superuser rights for replication.
> Although you see all WAL and potentially all data in the system through
> that, a standby doesn't need any write access to the master, so it would
> be good practice to create a dedicated account with limited privileges
> for replication.

Okey to just drop the superuser() check from walsender.c.

> * A standby that connects to master, initiates streaming, and then sits
> idle without stalls recycling of old WAL files in the master. That will
> eventually lead to a full disk in master. Do we need some kind of a
> emergency valve on that?

I think that we need the GUC parameter to specify the maximum number
of log file segments held in pg_xlog directory to send to the standby
server. The replication to the standby which falls more than that GUC
value behind is just terminated.
http://archives.postgresql.org/pgsql-hackers/2009-12/msg01901.php

> * Do we really need REPLICATION_DEBUG_ENABLED? The output doesn't seem
> very useful to me.

This was useful for me to debug the code. But, right now, Okey to drop it.

> * Need to add comments somewhere to note that ReadRecord depends on the
> fact that a WAL record is always send as whole, never split across two
> messages.

Okey.

> * Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
> increments?

Yes. It's required for some platforms (probably HP-UX) in which signals
cannot interrupt the sleep.

> * Walreceiver should flush less aggresively than after each received
> piece of WAL as noted by XXX comment.
>
> * XXX: Flushing after each received message is overly aggressive. Should
> * implement some sort of lazy flushing. Perhaps check in the main loop
> * if there's any more messages before blocking and waiting for one, and
> * flush the WAL if there isn't, just blocking.

In this approach, if messages continuously arrive from master, the fsync
would be delayed until WAL segment is switched. Likewise, recovery also
would be delayed, which seems to be a problem.

How about the straightforward approach; let the process which wants to
flush the buffer send the fsync-request to walreceiver and wait until
WAL is flushed up to the buffer's LSN?

> * Consider renaming PREPARE_REPLICATION to IDENTIFY_SYSTEM or something.

Okey.

> * What's the change in bgwriter.c for?

It's for the bgwriter to know the current timeline for recycling the WAL
files.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Jan 9, 2010 at 10:38 AM, Greg Stark <gsstark@mit.edu> wrote:
>> * Need to add comments somewhere to note that ReadRecord depends on the
>> fact that a WAL record is always send as whole, never split across two
>> messages.
>
> What happens in the case of the very large records Tom was describing
> recently. If the entire record doesn't fit in a WAL segment is it the
> whole record or the partial record with the continuation bit that
> needs to fit in a message?

It's the partial record with the continuation bit.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Sat, Jan 9, 2010 at 6:16 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> * If there's no WAL to send, walsender doesn't notice if the client has
>> closed connection already. This is the issue Fujii reported already.
>> We'll need to add a select() call to the walsender main loop to check if
>> the socket has been closed.
>
> We should reactivate pq_wait() and secure_poll()?

I don't think we need all that, a simple select() should be enough.
Though I must admit I'm not very familiar with select/poll().

>> * We still have a related issue, though: if standby is configured to
>> archive to the same location as master (as it always is on my laptop,
>> where I use the postgresql.conf of the master unmodified in the server),
>> right after failover the standby server will try to archive all the old
>> WAL files that were streamed from the master; but they exist already in
>> the archive, as the master archived them already. I'm not sure if this
>> is a pilot error, or if we should do something in the server to tell
>> apart WAL segments streamed from master and those generated in the
>> standby server after failover. Maybe we should immediately create a
>> .done file for every file received from master?
>
> There is no guarantee that such file has already been archived by master.
> This is just an idea, but new WAL record indicating the completion of the
> archiving would be useful for the standby to create .done file. But, this
> idea might kill the "archiving during recovery" idea discussed above.
>
> Personally, I'm OK with that issue because we can avoid it by tweaking
> archive_command. Could we revisit this discussion with the "archiving
> during recovery" discussion later?

Ok. The workaround is to configure standby to archive to a different
location. If you need to restore from that, you'll need to stitch
together the logs from the old master and the new one.

>> * A standby that connects to master, initiates streaming, and then sits
>> idle without stalls recycling of old WAL files in the master. That will
>> eventually lead to a full disk in master. Do we need some kind of a
>> emergency valve on that?
>
> I think that we need the GUC parameter to specify the maximum number
> of log file segments held in pg_xlog directory to send to the standby
> server. The replication to the standby which falls more than that GUC
> value behind is just terminated.
> http://archives.postgresql.org/pgsql-hackers/2009-12/msg01901.php

Oh yes, sounds good.

>> * Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
>> increments?
>
> Yes. It's required for some platforms (probably HP-UX) in which signals
> cannot interrupt the sleep.

I'm thinking that the wal_sender_delay is so small that maybe it's not
worth worrying about.

>> * Walreceiver should flush less aggresively than after each received
>> piece of WAL as noted by XXX comment.
>
>> * XXX: Flushing after each received message is overly aggressive. Should
>> * implement some sort of lazy flushing. Perhaps check in the main loop
>> * if there's any more messages before blocking and waiting for one, and
>> * flush the WAL if there isn't, just blocking.
>
> In this approach, if messages continuously arrive from master, the fsync
> would be delayed until WAL segment is switched. Likewise, recovery also
> would be delayed, which seems to be problem.

That seems OK to me. If messages are really coming in that fast,
fsyncing the whole WAL segment at a time is probably most efficient. But
if that really is too much, you could still do extra flushes within
XLogRecv() every few megabytes for example.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2010-01-08 at 23:16 +0200, Heikki Linnakangas wrote:
> * I removed the feature that archiver was started during recovery. The
> idea of that was to enable archiving from a standby server, to relieve
> the master server of that duty, but I found it annoying because it
> causes trouble if the standby and master are configured to archive to
> the same location; they will fight over which copies the file to the
> archive first. Frankly the feature doesn't seem very useful as the patch
> stands, because you still have to configure archiving in the master in
> practice; you can't take an online base backup otherwise, and you have
> the risk of standby falling too much behind and having to restore from
> base backup whenever the standby is disconnected for any reason. Let's
> revisit this later when it's truly useful.

Agreed

> * We still have a related issue, though: if standby is configured to
> archive to the same location as master (as it always is on my laptop,
> where I use the postgresql.conf of the master unmodified in the server),
> right after failover the standby server will try to archive all the old
> WAL files that were streamed from the master; but they exist already in
> the archive, as the master archived them already. I'm not sure if this
> is a pilot error, or if we should do something in the server to tell
> apart WAL segments streamed from master and those generated in the
> standby server after failover. Maybe we should immediately create a
> .done file for every file received from master?

That sounds like the right thing to do.

> * I don't think we should require superuser rights for replication.
> Although you see all WAL and potentially all data in the system through
> that, a standby doesn't need any write access to the master, so it would
> be good practice to create a dedicated account with limited privileges
> for replication.

Agreed. I think we should have a predefined user, called "replication"
that has only the correct rights.

> * A standby that connects to master, initiates streaming, and then sits
> idle without stalls recycling of old WAL files in the master. That will
> eventually lead to a full disk in master. Do we need some kind of a
> emergency valve on that?

Can you explain how this could occur? My understanding was that the
walreceiver and startup processes were capable of independent action
specifically to avoid this kind of effect.

> * Documentation. The patch used to move around some sections, but I
> think that has been partially reverted so that it now just duplicates
> them. It probably needs other work too, I haven't looked at the docs in
> any detail.

I believe the docs need urgent attention. We need more people to read
the docs and understand the implications so that people can then
comment. It is extremely non-obvious from the patch how things work at a
behaviour level.

I am very concerned that there is no thought given to monitoring
replication. This will make the feature difficult to use in practice.

-- 
Simon Riggs
www.2ndQuadrant.com
On Fri, 2010-01-08 at 14:20 -0800, Josh Berkus wrote:
> On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
> > * A standby that connects to master, initiates streaming, and then sits
> > idle without stalls recycling of old WAL files in the master. That will
> > eventually lead to a full disk in master. Do we need some kind of a
> > emergency valve on that?
>
> WARNING: I haven't thought about how this would work together with HS yes.

I've been reviewing things as we go along, so I'm not that tense
overall. Having said that I don't understand why the problem above would
occur and the sentence seems to be missing a verb between "without" and
"stalls". More explanation please.

What could happen is that the standby could slowly lag behind master. We
don't have any way of monitoring that, as yet. Setting ps display is not
enough here.

-- 
Simon Riggs
www.2ndQuadrant.com
Simon Riggs wrote:
> On Fri, 2010-01-08 at 14:20 -0800, Josh Berkus wrote:
>> On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
>>> * A standby that connects to master, initiates streaming, and then sits
>>> idle without stalls recycling of old WAL files in the master. That will
>>> eventually lead to a full disk in master. Do we need some kind of a
>>> emergency valve on that?
>> WARNING: I haven't thought about how this would work together with HS yes.
>
> I've been reviewing things as we go along, so I'm not that tense
> overall. Having said that I don't understand why the problem above would
> occur and the sentence seems to be missing a verb between "without" and
> "stalls". More explanation please.

Yeah, that sentence was broken.

> What could happen is that the standby could slowly lag behind master.

Right, that's what I'm worried about. In the worst case the walreceiver
process in the standby might stall completely for some reason, e.g. a
hardware problem or SIGSTOP by an administrator.

> We don't have any way of monitoring that, as yet. Setting ps display is
> not enough here.

Yeah, monitoring would be nice too. But what I was wondering is whether
we need some way of stopping that from filling the disk in master.
(Fujii-san's suggestion of a GUC to set the max. amount of WAL to keep
in the master for standbys feels good to me).

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Sun, 2010-01-10 at 18:40 +0200, Heikki Linnakangas wrote:
> > We don't have any way of monitoring that, as yet. Setting ps display
> > is not enough here.
>
> Yeah, monitoring would be nice too. But what I was wondering is whether
> we need some way of stopping that from filling the disk in master.
> (Fujii-san's suggestion of a GUC to set the max. amount of WAL to keep
> in the master for standbys feels good to me).

OK, now I got you. I thought that was already agreed; guess it is now.

We need monitoring anywhere we have a max_* parameter. Otherwise we
won't know how close we are to disaster until we hit the limit and
things break down. Otherwise we will have to set parameters by trial and
error, or set them so high they are meaningless.

-- 
Simon Riggs
www.2ndQuadrant.com
> We need monitoring anywhere we have a max_* parameter. Otherwise we
> won't know how close we are to disaster until we hit the limit and
> things break down. Otherwise we will have to set parameters by trial and
> error, or set them so high they are meaningless.

I agree.

Thing is, though, we have a de-facto max already ... when pgxlog runs
out of disk space. And no monitoring *in postgresql* for that, although
obviously you can use OS monitoring for it.

I'm saying, even for plain PITR, it would be an improvement in
manageability if the DBA could set a maximum number of checkpoint
segments before replication is abandoned or the master shuts down. It's
something we've been missing.

--Josh Berkus
On Sun, 2010-01-10 at 12:10 -0800, Josh Berkus wrote:
> > We need monitoring anywhere we have a max_* parameter. Otherwise we
> > won't know how close we are to disaster until we hit the limit and
> > things break down. Otherwise we will have to set parameters by trial and
> > error, or set them so high they are meaningless.
>
> I agree.
>
> Thing is, though, we have a de-facto max already ... when pgxlog runs
> out of disk space.

What I mean is this: The purpose of monitoring is to avoid bad things
happening by being able to predict that a bad thing will happen before
it actually does happen. Cars have windows to allow us to see we are
about to hit something.

> And no monitoring *in postgresql* for that, although
> obviously you can use OS monitoring for it.

PostgreSQL doesn't need to monitor that. If the user wants to avoid
out-of-space they can write a script to monitor files/space. The info is
accessible, if you wish to monitor it.

Currently there is no way of knowing what the average/current transit
time is on replication, no way of knowing what is happening if we go
idle etc.. Those things need to be included because they are not
otherwise accessible. Cars need windows, not just a finely tuned engine.

-- 
Simon Riggs
www.2ndQuadrant.com
> Currently there is no way of knowing what the average/current transit
> time is on replication, no way of knowing what is happening if we go
> idle etc.. Those things need to be included because they are not
> otherwise accessible. Cars need windows, not just a finely tuned engine.

Like I said, I agree. I'm just pointing out that the monitoring
deficiency already exists whether or not we add a max_* parameter.

--Josh Berkus
On 9/01/2010 6:20 AM, Josh Berkus wrote:
> On 1/8/10 1:16 PM, Heikki Linnakangas wrote:
>> * A standby that connects to master, initiates streaming, and then sits
>> idle without stalls recycling of old WAL files in the master. That will
>> eventually lead to a full disk in master. Do we need some kind of a
>> emergency valve on that?
>
> WARNING: I haven't thought about how this would work together with HS yes.
>
> I think this needs to be administrator-configurable.
>
> I'd suggest a GUC approach:
>
> archiving_lag_action = { ignore, shutdown, stop }
>
> "Ignore" would be the default. Some users would rather have the master
> shut down if the slave has stopped taking segments; that's "shutdown".
> Otherwise, it's "stop" which simply stops archiving and starts recylcing
> when we reach that number of segments.

IMO "stop" would be *really* bad without some sort of administrator
alert support (scream for help) and/or the ability to refresh the
slave's base backup when it started responding again. We'd start seeing
mailing list posts along the lines of "my master failed over to the
slave, and it's missing the last 3 months of data! Help!".

Personally, I'd be uncomfortable enabling something like that without
_both_ an admin alert _and_ the ability to refresh the slave's base
backup without admin intervention.

It'd also be necessary to define what exactly "lag" means here,
preferably in a way that doesn't generally need admin tuning for most
users. Ideally there'd be separate thresholds for "scream to the admin
for help, something's wrong" and "forced to act, slave is holding up the
master".

-- 
Craig Ringer
Simon Riggs wrote:
> > * I don't think we should require superuser rights for replication.
> > Although you see all WAL and potentially all data in the system through
> > that, a standby doesn't need any write access to the master, so it would
> > be good practice to create a dedicated account with limited privileges
> > for replication.
>
> Agreed. I think we should have a predefined user, called "replication"
> that has only the correct rights.

I am concerned that knowledge of this new read-only replication user
would have to be spread all over the backend code, which is really not
something we should be doing at this stage in 8.5 development. I am also
thinking such a special user might fall out of work on mandatory access
control, so maybe we should just require super-user for 8.5 and revisit
this for 8.6.

-- 
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Sat, Jan 9, 2010 at 4:25 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I don't think we need all that, a simple select() should be enough.
> Though I must admit I'm not very familiar with select/poll().

I'm not sure whether poll(2) should be called for this purpose. But
poll(2) and select(2) seem to often come together in the existing code.
We should follow such custom?

>>> * Do we really need to split the sleep in walsender to NAPTIME_PER_CYCLE
>>> increments?
>>
>> Yes. It's required for some platforms (probably HP-UX) in which signals
>> cannot interrupt the sleep.
>
> I'm thinking that the wal_sender_delay is so small that maybe it's not
> worth worrying about.

The same problem exists in walwriter.c, too. Though we can expect that
wal_writer_delay is small, its sleep has been broken down into smaller
bits. We should follow such existing code? Or just remove that feature
from walwriter?

>>> * Walreceiver should flush less aggresively than after each received
>>> piece of WAL as noted by XXX comment.
>>
>>> * XXX: Flushing after each received message is overly aggressive. Should
>>> * implement some sort of lazy flushing. Perhaps check in the main loop
>>> * if there's any more messages before blocking and waiting for one, and
>>> * flush the WAL if there isn't, just blocking.
>>
>> In this approach, if messages continuously arrive from master, the fsync
>> would be delayed until WAL segment is switched. Likewise, recovery also
>> would be delayed, which seems to be problem.
>
> That seems OK to me. If messages are really coming in that fast,
> fsyncing the whole WAL segment at a time is probably most efficient.

OK, I'll implement your idea. But that seems to be inefficient in the
synchronous replication (especially "wait WAL-replay" mode). So let's
revisit this discussion later.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Jan 10, 2010 at 8:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> What could happen is that the standby could slowly lag behind master. We
> don't have any way of monitoring that, as yet. Setting ps display is not
> enough here.

I agree that the statistical information about replication activity is
very useful. But I think that it's not an urgent issue. Shall we think
it later?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jan 11, 2010 at 5:36 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> Personally, I'd be uncomfortable enabling something like that without _both_
> an admin alert _and_ the ability to refresh the slave's base backup without
> admin intervention.

What feature do you specifically need as an alert? Just writing the
warning into the logfile is enough? Or need to notify by using SNMP trap
message? Though I'm not sure if this is a role of Postgres.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Sun, Jan 10, 2010 at 8:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What could happen is that the standby could slowly lag behind master. We
>> don't have any way of monitoring that, as yet. Setting ps display is not
>> enough here.
>
> I agree that the statistical information about replication activity is
> very useful. But I think that it's not an urgent issue. Shall we think
> it later?

I don't think anybody can deploy this feature without at least some very
basic monitoring here. I like the basic proposal you made back in
September for adding a pg_standbys_xlog_location to replace what you
have to get from ps right now:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php

That's basic, but enough that people could get by for a V1.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Fujii Masao wrote:
> On Mon, Jan 11, 2010 at 5:36 PM, Craig Ringer
> <craig@postnewspapers.com.au> wrote:
>> Personally, I'd be uncomfortable enabling something like that without _both_
>> an admin alert _and_ the ability to refresh the slave's base backup without
>> admin intervention.
>
> What feature do you specifically need as an alert? Just writing the
> warning into the logfile is enough? Or need to notify by using SNMP trap
> message? Though I'm not sure if this is a role of Postgres.

It's impossible for the database to have any idea whatsoever how people
are going to want to be alerted. Provide functions to monitor things
like replication lag, like the number of segments queued up to feed to
archive_command, and let people build their own alerting mechanism for
now. They're going to do that anyway, so why waste precious time here
building something that's unlikely to fit any but a very narrow use
case?

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > I don't think anybody can deploy this feature without at least some very > basic monitoring here. I like the basic proposal you made back in September > for adding a pg_standbys_xlog_location to replace what you have to get from > ps right now: > http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > > That's basic, but enough that people could get by for a V1. Yeah, I have no objection to add such simple capability which monitors the lag into the first release. But I guess that, in addition to that, Simon wanted the capability to collect the statistical information about replication activity (e.g., a transfer time, a write time, replay time). So I'd like to postpone it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 12, 2010 at 1:24 PM, Greg Smith <greg@2ndquadrant.com> wrote: > It's impossible for the database to have any idea whatsoever how people are > going to want to be alerted. Provide functions to monitor things like > replication lag, like the number of segments queued up to feed to > archive_command, and let people build their own alerting mechanism for now. > They're going to do that anyway, so why waste precious time here building > someone that's unlikely to fit any but a very narrow use case? Agreed. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Greg Smith wrote: > I don't think anybody can deploy this feature without at least some very > basic monitoring here. I like the basic proposal you made back in > September for adding a pg_standbys_xlog_location to replace what you > have to get from ps right now: > http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > > That's basic, but enough that people could get by for a V1. It would be more straightforward to have a function in the standby to return the current replay location. It feels more logical to poll the standby to get the status of the standby, instead of indirectly from the master. Besides, the master won't know how far the standby is if the connection to the standby is broken. Maybe we should just change the existing pg_current_xlog_location() function to return that when recovery is in progress. It currently throws an error during hot standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Fujii Masao wrote: > On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> I don't think anybody can deploy this feature without at least some very >> basic monitoring here. I like the basic proposal you made back in September >> for adding a pg_standbys_xlog_location to replace what you have to get from >> ps right now: >> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >> >> That's basic, but enough that people could get by for a V1. > > Yeah, I have no objection to add such simple capability which monitors > the lag into the first release. But I guess that, in addition to that, > Simon wanted the capability to collect the statistical information about > replication activity (e.g., a transfer time, a write time, replay time). > So I'd like to postpone it. yeah getting that would all be nice and handy but we have to remember that this is really our first cut at integrated replication. Being able to monitor lag is what is needed as a minimum, more advanced stuff can and will emerge once we get some actual feedback from the field. Stefan
On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: > Fujii Masao wrote: > > On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > >> I don't think anybody can deploy this feature without at least some very > >> basic monitoring here. I like the basic proposal you made back in September > >> for adding a pg_standbys_xlog_location to replace what you have to get from > >> ps right now: > >> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > >> > >> That's basic, but enough that people could get by for a V1. > > > > Yeah, I have no objection to add such simple capability which monitors > > the lag into the first release. But I guess that, in addition to that, > > Simon wanted the capability to collect the statistical information about > > replication activity (e.g., a transfer time, a write time, replay time). > > So I'd like to postpone it. > > yeah getting that would all be nice and handy but we have to remember > that this is really our first cut at integrated replication. Being able > to monitor lag is what is needed as a minimum, more advanced stuff can > and will emerge once we get some actual feedback from the field. Though there won't be any feedback from the field because there won't be any numbers to discuss. Just "it appears to be working". Then we will go into production and the problems will begin to be reported. We will be able to do nothing to resolve them because we won't know how many people are affected. -- Simon Riggs www.2ndQuadrant.com
Heikki Linnakangas wrote:
> Greg Smith wrote:
>> I don't think anybody can deploy this feature without at least some very
>> basic monitoring here. I like the basic proposal you made back in
>> September for adding a pg_standbys_xlog_location to replace what you
>> have to get from ps right now:
>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php
>>
>> That's basic, but enough that people could get by for a V1.
>
> It would be more straightforward to have a function in the standby to
> return the current replay location. It feels more logical to poll the
> standby to get the status of the standby, instead of indirectly from the
> master. Besides, the master won't know how far the standby is if the
> connection to the standby is broken.

This is one reason I was talking in my other message about getting simple stats on how bad the archive_command backlog is, which I'd think is an easy way to inform the DBA "the standby isn't keeping up and disk is filling" in a way that's more database-centric than just looking at disk space getting gobbled.

I think that it's important to be able to get whatever useful information you can from both the primary and the standby, because most of the interesting (read: painful) situations here are when one or the other is down. The fundamental questions here are:

-When things are running normally, how much is the standby lagging by? This is needed for a baseline of good performance, by which you can detect problems before they get too bad.
-If the standby is down altogether, how can I get more information about the state of things from the primary?
-If the primary is down, how can I tell more from the standby?

Predicting what people are going to want to do when one of these bad conditions pops up is a large step ahead of where I think this discussion should be focusing on now. You have to show how you're going to measure the badness here in the likely failure situations before you can then take action on them. If you do the former well enough, admins will figure out how to deal with the latter in a way compatible with their business processes in the first version.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
On Tue, Jan 12, 2010 at 4:22 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > It would be more straightforward to have a function in the standby to > return the current replay location. It feels more logical to poll the > standby to get the status of the standby, instead of indirectly from the > master. Besides, the master won't know how far the standby is if the > connection to the standby is broken. > > Maybe we should just change the existing pg_current_xlog_location() > function to return that when recovery is in progress. It currently > throws an error during hot standby. Sounds good. I'd like to hear from someone which location should be returned by that function (WAL receive/write/flush/replay location?). I vote for WAL flush location because it's important for me to know how far the standby can replay the WAL, i.e., how many transactions might be lost at failover. And, it's also OK to provide a dedicated function for the WAL replay location. Thoughts? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 12, 2010 at 08:22, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Greg Smith wrote: >> I don't think anybody can deploy this feature without at least some very >> basic monitoring here. I like the basic proposal you made back in >> September for adding a pg_standbys_xlog_location to replace what you >> have to get from ps right now: >> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >> >> That's basic, but enough that people could get by for a V1. > > It would be more straightforward to have a function in the standby to > return the current replay location. It feels more logical to poll the > standby to get the status of the standby, instead of indirectly from the > master. Besides, the master won't know how far the standby is if the > connection to the standby is broken. > > Maybe we should just change the existing pg_current_xlog_location() > function to return that when recovery is in progress. It currently > throws an error during hot standby. > Not sure. I don't really like monitoring functions that return different things depending on the scenario. Assume I monitor it, and then do a failover. Suddenly the values I monitor mean something else. I think I'd prefer a separate function to monitor this status on the slave. Oh, and it'd be nice if that one worked in HS mode both when in streaming and non-streaming mode :-) -- Magnus Hagander  Me: http://www.hagander.net/  Work: http://www.redpill-linpro.com/
Fujii Masao <masao.fujii@gmail.com> writes: > I'm not sure whether poll(2) should be called for this purpose. But > poll(2) and select(2) seem to often come together in the existing code. > We should follow such custom? Yes. poll() is usually more efficient, so it's preferred, but not all platforms have it. (On the other side, I think Windows might have only poll and not select.) regards, tom lane
Tom Lane wrote: > Fujii Masao <masao.fujii@gmail.com> writes: > >> I'm not sure whether poll(2) should be called for this purpose. But >> poll(2) and select(2) seem to often come together in the existing code. >> We should follow such custom? >> > > Yes. poll() is usually more efficient, so it's preferred, but not all > platforms have it. (On the other side, I think Windows might have > only poll and not select.) > > > No, other way around, I'm fairly sure. cheers andrew
On Tue, Jan 12, 2010 at 15:13, Andrew Dunstan <andrew@dunslane.net> wrote: > > > Tom Lane wrote: >> >> Fujii Masao <masao.fujii@gmail.com> writes: >> >>> >>> I'm not sure whether poll(2) should be called for this purpose. But >>> poll(2) and select(2) seem to often come together in the existing code. >>> We should follow such custom? >>> >> >> Yes. poll() is usually more efficient, so it's preferred, but not all >> platforms have it. (On the other side, I think Windows might have >> only poll and not select.) >> >> >> > > No, other way around, I'm fairly sure. Yeah, the emulation layer has select, not poll. It basically translates the select into what looks very much like a poll, so maybe we should consider implementing poll as well/instead. But for now, select() is what we have. -- Magnus Hagander  Me: http://www.hagander.net/  Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Tue, Jan 12, 2010 at 08:22, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Maybe we should just change the existing pg_current_xlog_location() >> function to return that when recovery is in progress. It currently >> throws an error during hot standby. > Not sure. I don't really like to monitor functions that return > different things depending on a scenario. Yeah. We should only use that function if we can define it to mean something on the slave that is very close to what it means on the master. Otherwise, pick another name. It seems to me that we should have at least two functions available on the slave: latest xlog location received and synced to disk by walreceiver (ie, we are guaranteed to be able to replay up to here); and latest xlog location actually replayed (ie, the state visible to queries on the slave). The latter perhaps could be pg_current_xlog_location(). regards, tom lane
Simon Riggs wrote: > On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: >> Fujii Masao wrote: >>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>>> I don't think anybody can deploy this feature without at least some very >>>> basic monitoring here. I like the basic proposal you made back in September >>>> for adding a pg_standbys_xlog_location to replace what you have to get from >>>> ps right now: >>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >>>> >>>> That's basic, but enough that people could get by for a V1. >>> Yeah, I have no objection to add such simple capability which monitors >>> the lag into the first release. But I guess that, in addition to that, >>> Simon wanted the capability to collect the statistical information about >>> replication activity (e.g., a transfer time, a write time, replay time). >>> So I'd like to postpone it. >> yeah getting that would all be nice and handy but we have to remember >> that this is really our first cut at integrated replication. Being able >> to monitor lag is what is needed as a minimum, more advanced stuff can >> and will emerge once we get some actual feedback from the field. > > Though there won't be any feedback from the field because there won't be > any numbers to discuss. Just "it appears to be working". Then we will go > into production and the problems will begin to be reported. We will be > able to do nothing to resolve them because we won't know how many people > are affected. field is also production usage in my pov, and I'm not sure how we would know how many people are affected by some imaginary issue just because there is a column that has some numbers in it. All of the large features we added in the past got finetuned and improved in the following releases, and I expect SR to be one of them that will see a lot of improvement in 8.5+n. 
Adding detailed monitoring of some random stuff (I don't think there was a clear proposal of what kind of stuff you would like to see) while we don't really know what the performance characteristics are might easily lead to us providing a ton of data and nothing relevant :( What I really think we should do for this first cut is to make it as foolproof and easy to set up as possible and add the minimum required monitoring knobs, without going overboard with doing too many stats. Stefan
On 1/12/10, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: > > I'm not sure whether poll(2) should be called for this purpose. But > > poll(2) and select(2) seem to often come together in the existing code. > > We should follow such custom? > > > Yes. poll() is usually more efficient, so it's preferred, but not all > platforms have it. (On the other side, I think Windows might have > only poll and not select.) FYI: on PL/Proxy we use poll() exclusively and on platforms that don't have it (win32) we emulate poll() with select(): http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/plproxy/plproxy/src/poll_compat.c?rev=1.3&content-type=text/x-cvsweb-markup End result is efficient and clean #ifdef-less code. Something to consider. -- marko
Stefan Kaltenbrunner wrote: > Simon Riggs wrote: > > On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: > >> Fujii Masao wrote: > >>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > >>>> I don't think anybody can deploy this feature without at least some very > >>>> basic monitoring here. I like the basic proposal you made back in September > >>>> for adding a pg_standbys_xlog_location to replace what you have to get from > >>>> ps right now: > >>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > >>>> > >>>> That's basic, but enough that people could get by for a V1. > >>> Yeah, I have no objection to add such simple capability which monitors > >>> the lag into the first release. But I guess that, in addition to that, > >>> Simon wanted the capability to collect the statistical information about > >>> replication activity (e.g., a transfer time, a write time, replay time). > >>> So I'd like to postpone it. > >> yeah getting that would all be nice and handy but we have to remember > >> that this is really our first cut at integrated replication. Being able > >> to monitor lag is what is needed as a minimum, more advanced stuff can > >> and will emerge once we get some actual feedback from the field. > > > > Though there won't be any feedback from the field because there won't be > > any numbers to discuss. Just "it appears to be working". Then we will go > > into production and the problems will begin to be reported. We will be > > able to do nothing to resolve them because we won't know how many people > > are affected. > > field is also production usage in my pov, and I'm not sure how we would > know how many people are affected by some imaginary issue just because > there is a column that has some numbers in it. > All of the large features we added in the past got finetuned and > improved in the following releases, and I expect SR to be one of them > that will see a lot of improvement in 8.5+n. 
> Adding detailed monitoring of some random stuff (I don't think there was > a clear proposal of what kind of stuff you would like to see) while we > don't really know what the performance characteristics are might easily > lead to us provding a ton of data and nothing relevant :( > What I really think we should do for this first cut is to make it as > foolproof and easy to set up as possible and add the minimum required > monitoring knobs but not going overboard with doing too many stats. I totally agree. If SR isn't going to be useful without being feature-complete, we might as well just drop it for 8.5 right now. Let's get a reasonable feature set implemented and then come back in 8.6 to improve it. For example, there is no need for a special 'replication' user (just use super-user), and monitoring should be minimal until we have field experience of exactly what monitoring we need. The final commit-fest is in 5 days --- this is not the time for design discussion and feature additions. If we wait for SR to be feature complete, with design discussions, etc, we will hopelessly delay 8.5 and people will get frustrated. I am not saying we can't talk about design, but none of this should be a requirement for 8.5. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Jan 12, 2010 at 3:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > The final commit-fest is in 5 days --- this is not the time for design Actually just over 2 days at this point... ...Robert
Marko Kreen <markokr@gmail.com> writes: > FYI: on PL/Proxy we use poll() exclusively and on platforms > that dont have it (win32) we emulate poll() with select(): Yeah, maybe. At the time we started adding poll() support there were enough platforms with only select() that it didn't make sense to impose any sort of penalty on the latter. But by now maybe it'd make sense. Especially if someone fixes the Windows code --- two levels of emulation on Windows probably won't fly ... regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > The final commit-fest is in 5 days --- this is not the time for design > discussion and feature additions. +10 --- the one reason I can see for deciding to bounce SR is that there still seem to be design discussions going on. It is WAY TOO LATE for that folks. It's time to be thinking "what's the least we have to do to make this shippable?" regards, tom lane
On Tue, 2010-01-12 at 15:11 -0500, Bruce Momjian wrote: > Stefan Kaltenbrunner wrote: > > Simon Riggs wrote: > > > On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: > > >> Fujii Masao wrote: > > >>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: > > >>>> I don't think anybody can deploy this feature without at least some very > > >>>> basic monitoring here. I like the basic proposal you made back in September > > >>>> for adding a pg_standbys_xlog_location to replace what you have to get from > > >>>> ps right now: > > >>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php > > >>>> > > >>>> That's basic, but enough that people could get by for a V1. > > >>> Yeah, I have no objection to add such simple capability which monitors > > >>> the lag into the first release. But I guess that, in addition to that, > > >>> Simon wanted the capability to collect the statistical information about > > >>> replication activity (e.g., a transfer time, a write time, replay time). > > >>> So I'd like to postpone it. > > >> yeah getting that would all be nice and handy but we have to remember > > >> that this is really our first cut at integrated replication. Being able > > >> to monitor lag is what is needed as a minimum, more advanced stuff can > > >> and will emerge once we get some actual feedback from the field. > > > > > > Though there won't be any feedback from the field because there won't be > > > any numbers to discuss. Just "it appears to be working". Then we will go > > > into production and the problems will begin to be reported. We will be > > > able to do nothing to resolve them because we won't know how many people > > > are affected. > > > > field is also production usage in my pov, and I'm not sure how we would > > know how many people are affected by some imaginary issue just because > > there is a column that has some numbers in it. 
> > All of the large features we added in the past got finetuned and > > improved in the following releases, and I expect SR to be one of them > > that will see a lot of improvement in 8.5+n. > > Adding detailed monitoring of some random stuff (I don't think there was > > a clear proposal of what kind of stuff you would like to see) while we > > don't really know what the performance characteristics are might easily > > lead to us provding a ton of data and nothing relevant :( > > What I really think we should do for this first cut is to make it as > > foolproof and easy to set up as possible and add the minimum required > > monitoring knobs but not going overboard with doing too many stats. > > I totally agree. If SR isn't going to be useful without being > feature-complete, we might as well just drop it for 8.5 right now. > > Let's get a reasonable feature set implemented and then come back in 8.6 > to improve it. For example, there is no need for a special > 'replication' user (just use super-user), and monitoring should be > minimal until we have field experience of exactly what monitoring we > need. > > The final commit-fest is in 5 days --- this is not the time for design > discussion and feature additions. If we wait for SR to be feature > complete, with design discussions, etc, we will hopelessly delay 8.5 and > people will get frustrated. I am not saying we can't talk about design, > but none of this should be a requirement for 8.5. We can't add monitoring until we know what the performance characteristics are. Hmmm. And how will we know what the performance characteristics are, I wonder? Anyway, I'll leave it to you now. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2010-01-12 at 15:11 -0500, Bruce Momjian wrote: >> Stefan Kaltenbrunner wrote: >>> Simon Riggs wrote: >>>> On Tue, 2010-01-12 at 08:24 +0100, Stefan Kaltenbrunner wrote: >>>>> Fujii Masao wrote: >>>>>> On Tue, Jan 12, 2010 at 1:21 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>>>>>> I don't think anybody can deploy this feature without at least some very >>>>>>> basic monitoring here. I like the basic proposal you made back in September >>>>>>> for adding a pg_standbys_xlog_location to replace what you have to get from >>>>>>> ps right now: >>>>>>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg00889.php >>>>>>> >>>>>>> That's basic, but enough that people could get by for a V1. >>>>>> Yeah, I have no objection to add such simple capability which monitors >>>>>> the lag into the first release. But I guess that, in addition to that, >>>>>> Simon wanted the capability to collect the statistical information about >>>>>> replication activity (e.g., a transfer time, a write time, replay time). >>>>>> So I'd like to postpone it. >>>>> yeah getting that would all be nice and handy but we have to remember >>>>> that this is really our first cut at integrated replication. Being able >>>>> to monitor lag is what is needed as a minimum, more advanced stuff can >>>>> and will emerge once we get some actual feedback from the field. >>>> Though there won't be any feedback from the field because there won't be >>>> any numbers to discuss. Just "it appears to be working". Then we will go >>>> into production and the problems will begin to be reported. We will be >>>> able to do nothing to resolve them because we won't know how many people >>>> are affected. >>> field is also production usage in my pov, and I'm not sure how we would >>> know how many people are affected by some imaginary issue just because >>> there is a column that has some numbers in it. 
>>> All of the large features we added in the past got finetuned and >>> improved in the following releases, and I expect SR to be one of them >>> that will see a lot of improvement in 8.5+n. >>> Adding detailed monitoring of some random stuff (I don't think there was >>> a clear proposal of what kind of stuff you would like to see) while we >>> don't really know what the performance characteristics are might easily >>> lead to us provding a ton of data and nothing relevant :( >>> What I really think we should do for this first cut is to make it as >>> foolproof and easy to set up as possible and add the minimum required >>> monitoring knobs but not going overboard with doing too many stats. >> I totally agree. If SR isn't going to be useful without being >> feature-complete, we might as well just drop it for 8.5 right now. >> >> Let's get a reasonable feature set implemented and then come back in 8.6 >> to improve it. For example, there is no need for a special >> 'replication' user (just use super-user), and monitoring should be >> minimal until we have field experience of exactly what monitoring we >> need. >> >> The final commit-fest is in 5 days --- this is not the time for design >> discussion and feature additions. If we wait for SR to be feature >> complete, with design discussions, etc, we will hopelessly delay 8.5 and >> people will get frustrated. I am not saying we can't talk about design, >> but none of this should be a requirement for 8.5. > > We can't add monitoring until we know what the performance > characteristics are. Hmmm. And how will we know what the performance > characteristics are, I wonder? well I would say we do exactly how we have done in the past with other features - by debugging the stuff with low level tools until we fully understand what it really is and then we can always add more "accessible" stats. Stefan
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > The final commit-fest is in 5 days --- this is not the time for design > > discussion and feature additions. > > +10 --- the one reason I can see for deciding to bounce SR is that there > still seem to be design discussions going on. It is WAY TOO LATE for > that folks. It's time to be thinking "what's the least we have to do to > make this shippable?" I didn't know the plus meter went that high. ;-) LOL -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Stefan Kaltenbrunner wrote: > >> Let's get a reasonable feature set implemented and then come back in 8.6 > >> to improve it. For example, there is no need for a special > >> 'replication' user (just use super-user), and monitoring should be > >> minimal until we have field experience of exactly what monitoring we > >> need. > >> > >> The final commit-fest is in 5 days --- this is not the time for design > >> discussion and feature additions. If we wait for SR to be feature > >> complete, with design discussions, etc, we will hopelessly delay 8.5 and > >> people will get frustrated. I am not saying we can't talk about design, > >> but none of this should be a requirement for 8.5. > > > > We can't add monitoring until we know what the performance > > characteristics are. Hmmm. And how will we know what the performance > > characteristics are, I wonder? > > well I would say we do exactly how we have done in the past with other > features - by debugging the stuff with low level tools until we fully > understand what it really is and then we can always add more > "accessible" stats. Right, so what is the risk of shipping without any fancy monitoring? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2010-01-12 at 16:34 -0500, Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > The final commit-fest is in 5 days --- this is not the time for design > > > discussion and feature additions. > > > > +10 --- the one reason I can see for deciding to bounce SR is that there > > still seem to be design discussions going on. It is WAY TOO LATE for > > that folks. It's time to be thinking "what's the least we have to do to > > make this shippable?" > > I didn't know the plus meter went that high. ;-) LOL Well, it is Tom. He has karma points. JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
> Right, so what is the risk of shipping without any fancy monitoring? We add monitoring in 9.1. er, 8.6. --Josh Berkus
On Tue, 2010-01-12 at 15:42 -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > The final commit-fest is in 5 days --- this is not the time for design > > discussion and feature additions. > > +10 --- the one reason I can see for deciding to bounce SR is that there > still seem to be design discussions going on. It is WAY TOO LATE for > that folks. It's time to be thinking "what's the least we have to do to > make this shippable?" I've not asked to bounce SR, I am strongly in favour of it going in, having been supporting the project on and off for 18 months. There is not much sense being talked here. I have asked for sufficient monitoring to allow us to manage it in production, which is IMHO the minimum required to make it shippable. This is a point I have mentioned over the course of many months, not a sudden additional thought. If the majority thinks that being able to find out the current replay point of recovery is all we need to manage replication then I will happily defer to that view, without changing my opinion that we need more. It should be clear that we didn't even have that before I raised the point. Overall, it isn't sensible or appropriate to oppose my viewpoint by putting words into my mouth that have never been said, which applies to most people's comments to me on this recent thread. -- Simon Riggs www.2ndQuadrant.com
Bruce Momjian wrote: > Right, so what is the risk of shipping without any fancy monitoring? > You can monitor the code right now by watching the output shown in the ps display and by trolling the database logs. If I had to I could build a whole monitoring system out of those components, it would just be very fragile. I'd rather see one or two very basic bits of internals exposed beyond those to reduce that effort. I think it's a stretch to say that request represents a design change; a couple of UDFs to expose some internals is all I think it would take to dramatically drop the amount of process/log scraping required here to support a SR system. I guess the slightly more ambitious performance monitoring bits that Simon was suggesting may cross the line as being too late to implement now though (depends on how productive the people actually coding on this are I guess), and certainly the ideas thrown out for implementing any smart behavior or alerting when replication goes bad like Josh's "archiving_lag_action" seem past the deadline to get addressed now--even though I agree with the basic idea. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Tue, 2010-01-12 at 17:41 -0500, Greg Smith wrote: > Bruce Momjian wrote: > > Right, so what is the risk of shipping without any fancy monitoring? > > > > You can monitor the code right now by watching the output shown in the > ps display and by trolling the database logs. If I had to I could build > a whole monitoring system out of those components, it would just be very > fragile. I'd rather see one or two very basic bits of internals exposed > beyond those to reduce that effort. Considering that is pretty much the best we can do with log shipping, I would have to agree. We should either provide real monitoring facilities (not necessarily tools, but at least queries or an api) for the feature or the feature isn't ready to go in. > I think it's a stretch to say that > request represents a design change; a couple of UDFs to expose some > internals is all I think it would take to dramatically drop the amount > of process/log scraping required here to support a SR system. Bingo. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
Greg Smith wrote: > Bruce Momjian wrote: >> Right, so what is the risk of shipping without any fancy monitoring? >> > > You can monitor the code right now by watching the output shown in the > ps display and by trolling the database logs. If I had to I could build > a whole monitoring system out of those components, it would just be very > fragile. I'd rather see one or two very basic bits of internals exposed > beyond those to reduce that effort. I think it's a stretch to say that > request represents a design change; a couple of UDFs to expose some > internals is all I think it would take to dramatically drop the amount > of process/log scraping required here to support a SR system. so is there an actual concrete proposal of _what_ internals to expose? > > I guess the slightly more ambitious performance monitoring bits that > Simon was suggesting may cross the line as being too late to implement > now though (depends on how productive the people actually coding on this > are I guess), and certainly the ideas thrown out for implementing any > smart behavior or alerting when replication goes bad like Josh's > "archiving_lag_action" seem past the deadline to get addressed > now--even though I agree with the basic idea. I'm not convinced that embedding actual alerting functionality in the database is a good idea. Any reasonable production deployment is probably using a dedicated monitoring and alerting system that is aggregating and qualifying all monitoring results (as well as proper rate-limiting and stuff) that just needs a way to read in basic data. Initially something like archiving_lag_action sounds like an invitation to do a send_mail_to_admin() thingy which is really the wrong way to approach monitoring in large scale environments... The database needs to provide very basic information like "we are 10min behind in replication" or "3 wal files behind" - the decision if any of that is an actual issue or not should be left to the actual monitoring system. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > The database needs to provide very basic information like "we are > 10min behind in replication" or "3 wal files behind" - the > decision if any of that is an actual issue or not should be left > to the actual monitoring system. +1 -Kevin
On Monday 11 January 2010 23:24:24 Greg Smith wrote: > Fujii Masao wrote: > > On Mon, Jan 11, 2010 at 5:36 PM, Craig Ringer > > > > <craig@postnewspapers.com.au> wrote: > >> Personally, I'd be uncomfortable enabling something like that without > >> _both_ an admin alert _and_ the ability to refresh the slave's base > >> backup without admin intervention. > > > > What feature do you specifically need as an alert? Just writing > > the warning into the logfile is enough? Or need to notify by > > using SNMP trap message? Though I'm not sure if this is a role > > of Postgres. > > It's impossible for the database to have any idea whatsoever how people > are going to want to be alerted. Provide functions to monitor things > like replication lag, like the number of segments queued up to feed to > archive_command, and let people build their own alerting mechanism for > now. They're going to do that anyway, so why waste precious time here > building something that's unlikely to fit any but a very narrow use case? That said, emitting the information to a log file makes for a crappy way to retrieve the information. The ideal api is that I can find the information out via the result of some SELECT query; view, table, function doesn't matter, as long as I can select it out. Bonus points for being able to get information from the hot standby. -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Tuesday 12 January 2010 17:37:11 Simon Riggs wrote: > There is not much sense being talked here. I have asked for sufficient > monitoring to allow us to manage it in production, which is IMHO the > minimum required to make it shippable. This is a point I have mentioned > over the course of many months, not a sudden additional thought. > Even subscribing to this viewpoint, there is sure to be significant wiggle room in what people find to be "sufficient monitoring". If I had to score the monitoring facilities we have for PITR standby, I'd give it about a crap out of 5, and yet somehow we seem to manage it. > If the majority thinks that being able to find out the current replay > point of recovery is all we need to manage replication then I will > happily defer to that view, without changing my opinion that we need > more. It should be clear that we didn't even have that before I raised > the point. > I'm certainly interested in specifics of what you think need to be exposed for monitoring, and I'd be interested in whether those things can be exposed as either trace points or possibly as C functions. My guess is that we won't get them into core for 8.5, but that we might be able to provide some additional facilities after the fact as we get more of these systems deployed. -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
> I guess the slightly more ambitious performance monitoring bits that > Simon was suggesting may cross the line as being too late to implement > now though (depends on how productive the people actually coding on this > are I guess), and certainly the ideas thrown out for implementing any > smart behavior or alerting when replication goes bad like Josh's > "archiving_lag_action" seem past the deadline to get addressed > now--even though I agree with the basic idea. Well, honestly, I wasn't talking about monitoring at all. I was talking about the general issue of "how should the system behave when it runs out of disk space". For an installation where data integrity is paramount, when replication becomes impossible because there is no more room for logs, then the whole system, master and slaves, should shut down. For most people, they'd just want the master to start ignoring the slave and recycling logs. Presumably, the slave would notice this and shut down. So I was talking about data integrity, not monitoring. However, it's probably a better thing to simply expose a way to query how much extra log data we have, in raw form (bytes or pages). From this, an administration script could take appropriate action. --Josh Berkus
> However, it's probably a better thing to simply expose a way to query > how much extra log data we have, in raw form (bytes or pages). From > this, an administration script could take appropriate action. Also: I think we could release without having this facility. We did with PITR, after all. --Josh
Simon Riggs wrote: > On Tue, 2010-01-12 at 15:42 -0500, Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > The final commit-fest is in 5 days --- this is not the time for design > > > discussion and feature additions. > > > > +10 --- the one reason I can see for deciding to bounce SR is that there > > still seem to be design discussions going on. It is WAY TOO LATE for > > that folks. It's time to be thinking "what's the least we have to do to > > make this shippable?" > > I've not asked to bounce SR, I am strongly in favour of it going in, > having been supporting the project on and off for 18 months. > > There is not much sense being talked here. I have asked for sufficient > monitoring to allow us to manage it in production, which is IMHO the > minimum required to make it shippable. This is a point I have mentioned Let me explain why Simon feels he is misquoted --- Simon, you are saying above that "sufficient monitoring" is a minimum requirement, meaning it is necessary, and I and others are saying if we need to design a monitoring system at this stage to ship SR, then let's forget about this feature for 8.5. In summary, by requiring monitoring, you are encouraging others to just abandon SR completely for 8.5. We didn't say you were suggesting abandoning SR; it is just that the monitoring requirement is making abandonment of SR for 8.5 more likely, because the addition of monitoring could hopelessly delay 8.5 because we have no idea even how to implement monitoring. > over the course of many months, not a sudden additional thought. > > Overall, it isn't sensible or appropriate to oppose my viewpoint by > putting words into my mouth that have never been said, which applies to > most people's comments to me on this recent thread. Yea, yea, everyone seems to misquote you Simon, at least from your perspective. You must admit that you seem to feel that way a lot.
> If the majority thinks that being able to find out the current replay > point of recovery is all we need to manage replication then I will > happily defer to that view, without changing my opinion that we need > more. It should be clear that we didn't even have that before I raised > the point. Good --- let's move forward with a minimal feature set to get SR in 8.5 in a reasonable timeframe. If we have extra time we can add stuff but let's not require it from the start. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Jan 12, 2010 at 10:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> I'm not sure whether poll(2) should be called for this purpose. But >> poll(2) and select(2) seem to often come together in the existing code. >> We should follow such custom? > > Yes. poll() is usually more efficient, so it's preferred, but not all > platforms have it. (On the other side, I think Windows might have > only poll and not select.) OK. I reactivated pq_wait() and secure_poll(), which use poll(2) to check the socket where it's available, and select(2) otherwise. Also, the capability to check the socket for data to be written is not used for SR right now (it was provided previously), so I dropped it for simplification. http://archives.postgresql.org/pgsql-hackers/2010-01/msg00827.php > Oh, I think we need to fix that, I'm thinking of doing a select() in the > loop to check that the socket hasn't been closed yet. I meant we don't > need to try reading the 'X' to tell apart e.g a network problem from a > standby that's shut down cleanly. Without reading the 'X' message from the standby, the walsender doesn't detect the close of the connection immediately in my environment. So I also reactivated the subset of ProcessStreamMessage(). git://git.postgresql.org/git/users/fujii/postgres.git branch: replication Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
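The poll(2)-preferred, select(2)-fallback check described above can be sketched outside the server too. The following is an illustrative Python sketch of the pattern only (the real code is C, inside pq_wait()/secure_poll()): a zero-timeout poll tells us whether the socket has an event pending, and a zero-byte MSG_PEEK read distinguishes a cleanly closed peer from one that simply has nothing to say yet.

```python
import select
import socket


def connection_closed(sock):
    """Check, without blocking, whether the peer has closed the connection.

    Mirrors the poll(2)-preferred / select(2)-fallback pattern: a socket
    that reports readable but returns zero bytes from a peek has been
    closed by the peer.
    """
    if hasattr(select, "poll"):  # prefer poll(2) where the platform has it
        p = select.poll()
        p.register(sock, select.POLLIN | select.POLLHUP)
        if not p.poll(0):        # zero timeout: just peek at the state
            return False
    else:                        # fall back to select(2)
        readable, _, _ = select.select([sock], [], [], 0)
        if not readable:
            return False
    # Readable with no data means EOF, i.e. the peer closed cleanly.
    return sock.recv(1, socket.MSG_PEEK) == b""
```

The platform split Tom mentions is exactly why the fallback exists; Python's select module happens to mirror it, which makes the sketch portable.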
On Tue, Jan 12, 2010 at 10:16 AM, Bruce Momjian <bruce@momjian.us> wrote: > I am concerned that knowledge of this new read-only replication user > would have to be spread all over the backend code, which is really not > something we should be doing at this stage in 8.5 development. I am > also thinking such a special user might fall out of work on mandatory > access control, so maybe we should just require super-user for 8.5 and > revisit this for 8.6. OK. I leave that code as it is. If the majority feel it's overkill to require a superuser privilege when authenticating the standby, we can just drop it later. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Stefan Kaltenbrunner wrote: > so is there an actual concrete proposal of _what_ internals to expose? The pieces are coming together...summary: -Status quo: really bad, but could probably ship anyway because existing PITR is no better and people manage to use it -Add slave pg_current_xlog_location() and something like pg_standby_received_xlog_location(): Much better, gets rid of the worst issues here. -Also add pg_standbys_xlog_location() on the master: while they could live without it, this really helps out the "alert/monitor" script writer whose use cases keep popping up here. Details...the original idea from Fujii was: "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns one row per standby servers, showing pid of walsender, host name/ port number/user OID of the standby, the location where the standby has written/flushed WAL. DBA can measure the gap from the combination of pg_current_xlog_location() and pg_standbys_xlog_location() via one query on the primary." After some naming quibbles and questions about what direction that should happen in, Tom suggested the initial step here is: "It seems to me that we should have at least two functions available on the slave: latest xlog location received and synced to disk by walreceiver (ie, we are guaranteed to be able to replay up to here); and latest xlog location actually replayed (ie, the state visible to queries on the slave). The latter perhaps could be pg_current_xlog_location()." So there's the first two of them: on the slave, pg_current_xlog_location() giving the latest location replayed, and a new one named something like pg_standby_received_xlog_location(). If you take the position that an unreachable standby does provide answers to these questions too (you just won't like them), this pair might be sufficient to ship.
To help a lot at dealing with all the error situations where the standby isn't reachable and segments are piling up (possibly leading to full disk), the next figure that seems to answer the most questions is asking the primary "what's the location of the last WAL segment file in the pile of ones to be archived/distributed that has been requested (or processed if that's the easier thing to note) by the standby?". That's what is named pg_standbys_xlog_location() in the first paragraph I quoted. If you know enough to identify that segment file on disk, you can always look at its timestamp (and the ones on the rest of the files in that directory) in a monitoring script to turn that information into segments or a time measurement instead--xlog segments are nicely ordered after all. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, Jan 13, 2010 at 5:47 PM, Greg Smith <greg@2ndquadrant.com> wrote: > The pieces are coming together...summary: Thanks for the summary! > -Also add pg_standbys_xlog_location() on the master: while they could live without it, this really helps out the "alert/monitor" script writer whose use cases keep popping up here. > > Details...the original idea from Fujii was: > > "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns > one row per standby servers, showing pid of walsender, host name/ > port number/user OID of the standby, the location where the standby > has written/flushed WAL. DBA can measure the gap from the > combination of pg_current_xlog_location() and pg_standbys_xlog_location() > via one query on the primary." This function is useful but not essential for troubleshooting, I think. So I'd like to postpone it. > "It seems to me that we should have at least two functions available > on the slave: latest xlog location received and synced to disk by > walreceiver (ie, we are guaranteed to be able to replay up to here); > and latest xlog location actually replayed (ie, the state visible > to queries on the slave). The latter perhaps could be > pg_current_xlog_location()." > > So there's the first two of them: on the slave, pg_current_xlog_location() > giving the latest location replayed, and a new one named something like > pg_standby_received_xlog_location(). If you take the position that an > unreachable standby does provide answers to these questions too (you just > won't like them), this pair might be sufficient to ship. Done. git://git.postgresql.org/git/users/fujii/postgres.git branch: replication I added two new functions: (1) pg_last_xlog_receive_location() reports the last WAL location received and synced by walreceiver. If streaming replication is still in progress this will increase monotonically.
If streaming replication has completed then this value will remain static at the value of the last WAL record received and synced. When the server has been started without streaming replication then the return value will be InvalidXLogRecPtr (0/0). (2) pg_last_xlog_replay_location() reports the last WAL location replayed during recovery. If recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last WAL record applied. When the server has been started normally without recovery then the return value will be InvalidXLogRecPtr (0/0). Since it's somewhat odd for me that pg_current_xlog_location() reports the WAL replay location, I didn't do that. But if the majority feel that it's sane, I'll merge pg_last_xlog_replay_location() into pg_current_xlog_location(). Thoughts? Better name? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
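With these two functions, measuring replay lag from a monitoring script reduces to arithmetic on the textual locations they return. A hedged sketch in Python follows; the 0xFF000000 bytes-per-xlogid multiplier is an assumption reflecting how servers of this era skip the last segment of each logical xlog file, so verify it against your server version before trusting absolute numbers.

```python
def lsn_to_bytes(loc, bytes_per_xlogid=0xFF000000):
    """Convert a textual WAL location like '16/B374D848' to a byte offset.

    bytes_per_xlogid assumes 0xFF000000 usable bytes per logical xlog
    file (servers of this vintage skip the last segment of each file).
    """
    xlogid, xrecoff = (int(part, 16) for part in loc.split("/"))
    return xlogid * bytes_per_xlogid + xrecoff


def replication_lag_bytes(received, replayed):
    """Bytes between the pg_last_xlog_receive_location() and
    pg_last_xlog_replay_location() results for one standby."""
    return lsn_to_bytes(received) - lsn_to_bytes(replayed)
```

A monitoring script would SELECT both locations on the standby and feed them through replication_lag_bytes() to get a single number to graph or alert on.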
Fujii Masao wrote: >> "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns >> one row per standby servers, showing pid of walsender, host name/ >> port number/user OID of the standby, the location where the standby >> has written/flushed WAL. DBA can measure the gap from the >> combination of pg_current_xlog_location() and pg_standbys_xlog_location() >> via one query on the primary." > This function is useful but not essential for troubleshooting, I think. > So I'd like to postpone it. Sure; in a functional system where primary and secondary are both up, you can assemble the info using the new functions you just added, so this other one is certainly optional. I just took a brief look at the code of the features you added, and it looks like it exposes the minimum necessary to make this whole thing possible to manage. I think it's OK if you postpone this other bit, more important stuff for you to work on.

So: the one piece of information I thought was most important to expose here at an absolute minimum is there now. Good progress. The other popular request that keeps popping up here is providing an easy way to see how backlogged the archive_command is, to make it easier to monitor for out of disk errors that might prove catastrophic to replication.

I just spent some time looking through the WAL/archiving code in that context. It looks to me that this information isn't really stored anywhere right now. The only thing that knows what segment is currently queued up to copy over is pgarch_ArchiverCopyLoop via its call to pgarch_readyXlog. Now, this is a pretty brute-force piece of code: it doesn't remember its previous work at all, it literally walks the archive_status directory looking for *.ready files that have names that look like xlog files, then returns the earliest. That unfortunately means that it's not even thinking in the same terms as all these other functions, which are driven by the xlog_location advancing, and then the filename is computed from that. All you've got is the filename at this point, and it's not even guaranteed to be real--you could easily fool this code if you dropped an inappropriately named file into that directory.

I could easily update this code path to save the name of the last archived file in memory while all this directory scanning is going on anyway, and then provide a UDF to expose that bit of information. The result would need to have documentation that disclaims it like this:

pg_last_archived_xlogfile() text: Get the name of the last file the archive_command [tried to|successfully] archived since the server was started. If archiving is disabled or no xlog files have become ready to archive since startup, a blank line will be returned. It is possible for this function to return a result that does not reflect an actual xlogfile if files are manually added to the server's archive_status directory.

I'd find this extremely handy as a hook for monitoring scripts that want to watch the server but don't have access to the filesystem directly, even given those limitations. I'd prefer to have the "tried to" version, because it will populate with the name of the troublesome file it's stuck on even if archiving never gets its first segment delivered.

I'd happily write a patch to handle all that if I thought it would be accepted. I fear that the whole approach will be considered a bit too hackish and get rejected on that basis though. Not really sure of a "right" way to handle this though. Anything better is going to be more complicated because it requires passing more information into the archiver, with little gain for that work beyond improving the quality of this diagnostic routine. And I think most people would find what I described above useful enough.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
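The pgarch_readyXlog() behavior described above--rescan archive_status on every call, keep only *.ready entries whose names look like xlog files, return the earliest--amounts to the following. This is a Python rendering of the logic for illustration only, not the actual C source.

```python
import os
import re

# An xlog segment name is 24 uppercase hex digits (timeline, log id,
# segment number); we match the plain case here, ignoring .backup files.
READY_RE = re.compile(r"^[0-9A-F]{24}\.ready$")


def earliest_ready_xlog(archive_status_dir):
    """Return the earliest xlog file with a .ready marker, or None.

    Like pgarch_readyXlog(), this remembers nothing between calls: it
    rescans the directory every time and relies on the fact that xlog
    file names sort lexicographically in generation order.
    """
    ready = [name[:-len(".ready")]
             for name in os.listdir(archive_status_dir)
             if READY_RE.match(name)]
    return min(ready) if ready else None
```

This also makes the fragility visible: the filter is purely name-based, so a bogus file whose name happens to be 24 hex digits would be picked up whether or not a real segment exists behind it.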
On Thu, 2010-01-14 at 23:07 -0500, Greg Smith wrote: > pg_last_archived_xlogfile() text: Get the name of the last file the > archive_command [tried to|successfully] archived since the server was > started. If archiving is disabled or no xlog files have become ready > to archive since startup, a blank line will be returned. OK > It is possible for this function to return a result that does not > reflect an actual xlogfile if files are manually added to the server's > archive_status directory. > I'd find this extremely handy as a hook for monitoring scripts that > want to watch the server but don't have access to the filesystem > directly, even given those limitations. I'd prefer to have the "tried > to" version, because it will populate with the name of the troublesome > file it's stuck on even if archiving never gets its first segment > delivered. > > I'd happily write a patch to handle all that if I thought it would be > accepted. I fear that the whole approach will be considered a bit too > hackish and get rejected on that basis though. Not really sure of a > "right" way to handle this though. Anything better is going to be > more complicated because it requires passing more information into the > archiver, with little gain for that work beyond improving the quality > of this diagnostic routine. And I think most people would find what I > described above useful enough. Yes, please write it. It's separate from SR, so will not interfere. -- Simon Riggs www.2ndQuadrant.com
Greg Smith wrote: > Fujii Masao wrote: >>> "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns >>> one row per standby servers, showing pid of walsender, host name/ >>> port number/user OID of the standby, the location where the standby >>> has written/flushed WAL. DBA can measure the gap from the >>> combination of pg_current_xlog_location() and pg_standbys_xlog_location() >>> via one query on the primary." >>> >> >> This function is useful but not essential for troubleshooting, I think. >> So I'd like to postpone it. >> > > Sure; in a functional system where primary and secondary are both up, > you can assemble the info using the new functions you just added, so > this other one is certainly optional. I just took a brief look at the > code of the features you added, and it looks like it exposes the minimum > necessary to make this whole thing possible to manage. I think it's OK > if you postpone this other bit, more important stuff for you to work on. agreed > > So: the one piece of information I thought was most important to expose > here at an absolute minimum is there now. Good progress. The other > popular request that keeps popping up here is providing an easy way to > see how backlogged the archive_command is, to make it easier to monitor > for out of disk errors that might prove catastrophic to replication. I tend to disagree - in any reasonable production setup basic stuff like disk space usage is monitored by non-application specific means. While monitoring backlog might be interesting for other reasons, citing disk space usage/exhaustion seems just wrong. [...] > > I'd find this extremely handy as a hook for monitoring scripts that want > to watch the server but don't have access to the filesystem directly, > even given those limitations.
I'd prefer to have the "tried to" > version, because it will populate with the name of the troublesome file > it's stuck on even if archiving never gets its first segment delivered. While fancy and all, I think this goes way too far for the first cut at SR (or, say, this release); monitoring disk usage and tracking log files for errors are SOLVED issues in established production setups. If you are in an environment that does neither for each and every server, independent of what you have running on it, or a setup where the sysadmins are clueless and the poor DBA has to hack around that fact, you have way bigger issues anyway. Stefan
Greg Smith wrote: > to make it easier to monitor for out of disk errors that might > prove catastrophic to replication. We handle that with the fsutil functions (in pgfoundry). This can actually measure free space on each volume. These weren't portable enough to include in core, but maybe they could be made more portable? -Kevin
Stefan Kaltenbrunner wrote: > Greg Smith wrote: >> >> The other popular request that keeps popping up here is providing an >> easy way to see how backlogged the archive_command is, to make it >> easier to monitor for out of disk errors that might prove >> catastrophic to replication. > > I tend to disagree - in any reasonable production setup basic stuff > like disk space usage is monitored by non-application specific means. > While monitoring backlog might be interesting for other reasons, > citing disk space usage/exhaustion seems just wrong. I was just mentioning that one use of the data, but there are others. Let's say that your archive_command works by copying things over to a NFS mount, and the mount goes down. It could be a long time before you noticed this via disk space monitoring. But if you were monitoring "how long has it been since the last time pg_last_archived_xlogfile() changed?", this would jump right out at you. Another popular question is "how far behind real-time is the archiver process?" You can do this right now by duplicating the same xlog file name scanning and sorting that the archiver does in your own code, looking for .ready files. It would be simpler if you could call pg_last_archived_xlogfile() and then just grab that file's timestamp. I think it's also important to consider the fact that diagnostic internals exposed via the database are far more useful to some people than things you have to set up outside of it. You talk about reasonable configurations above, but some production setups are not so reasonable. In many of the more secure environments I've worked in (finance, defense), there is *no* access to the database server beyond what comes out of port 5432 without getting a whole separate team of people involved. If the DBA can write a simple monitoring program themselves that presents data via the one port that is exposed, that makes life easier for them.
This same issue pops up sometimes when we consider the shared hosting case too, where the user may not have the option of running a full-fledged monitoring script. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith <greg@2ndquadrant.com> wrote: > In many of the more secure environments I've worked in (finance, > defense), there is *no* access to the database server beyond what > comes out of port 5432 without getting a whole separate team of > people involved. If the DBA can write a simple monitoring program > themselves that presents data via the one port that is exposed, > that makes life easier for them. Right, we don't want to give the monitoring software an OS login for the database servers, for security reasons. -Kevin
Greg Smith wrote: > Stefan Kaltenbrunner wrote: >> Greg Smith wrote: >>> >>> The other popular request that keeps popping up here is providing an >>> easy way to see how backlogged the archive_command is, to make it >>> easier to monitor for out of disk errors that might prove >>> catastrophic to replication. >> >> I tend to disagree - in any reasonable production setup basic stulff >> like disk space usage is monitored by non-application specific matters. >> While monitoring backlog might be interesting for other reasons, >> citing disk space usage/exhaustions seems just wrong. > > I was just mentioning that one use of the data, but there are others. > Let's say that your archive_command works by copying things over to a > NFS mount, and the mount goes down. It could be a long time before you > noticed this via disk space monitoring. But if you were monitoring "how > long has it been since the last time pg_last_archived_xlogfile() > changed?", this would jump right out at you. well, from a sysadmin perspective you have to monitor the NFS mount anyway - so why do you need the database to do it too (and not in a sane way, because there is no way the database can even figure out what the real problem is and if there is one)? > > Another popular question is "how far behind real-time is the archiver > process?" You can do this right now by duplicating the same xlog file > name scanning and sorting that the archiver does in your own code, > looking for .ready files. It would be simpler if you could call > pg_last_archived_xlogfile() and then just grab that file's timestamp. well that one seems a more reasonable reasoning to me however I'm not so sure that the proposed implementation feels right - though can't come up with a better suggestion for now. > > I think it's also important to consider the fact that diagnostic > internals exposed via the database are far more useful to some people > than things you have to setup outside of it.
You talk about reasonable > configurations above, but some production setups are not so reasonable. > In many of the more secure environments I've worked in (finance, > defense), there is *no* access to the database server beyond what comes > out of port 5432 without getting a whole separate team of people > involved. If the DBA can write a simple monitoring program themselves > that presents data via the one port that is exposed, that makes life > easier for them. This same issue pops up sometimes when we consider the > shared hosting case too, where the user may not have the option of > running a full-fledged monitoring script. well, again, I consider stuff like "available disk space" or "NFS mount available" completely in the realm of OS-level management. The database side should focus on the stuff that concerns the internal state and operation of the database app itself. If you continue your line of thought you will have to add all kind of stuff to the database, like CPU usage tracking, getting information about running processes, storage health. As soon as you are done you have reimplemented nagios-plugins over SQL on port 5432 instead of NRPE (or SNMP or whatnot). Again, I fully understand and know that there are environments where the DBA does not have OS-level access (be it root or no shell at all) to the OS, but even if you had that "archiving is hanging" function you would still have to go back to that "completely different group" and have them diagnose again. So my point is that even if you have disparate groups of people responsible for different parts of a system solution, you can't really work around incompetence (or slowness or whatever) of the group responsible for the lower layer by adding partial and inexact functionality at the upper layer that can only guess what the real issue is. Stefan
Kevin Grittner wrote: > Greg Smith <greg@2ndquadrant.com> wrote: > >> In many of the more secure environments I've worked in (finance, >> defense), there is *no* access to the database server beyond what >> comes out of port 5432 without getting a whole separate team of >> people involved. If the DBA can write a simple monitoring program >> themselves that presents data via the one port that is exposed, >> that makes life easier for them. > > Right, we don't want to give the monitoring software an OS login for > the database servers, for security reasons. depending on what you exactly mean by that I do have to wonder how you monitor more complex stuff (or stuff that requires elevated privs) - say raid health, multipath configuration, status of OS level updates, "are certain processes running or not" as well as basic parameters like CPU or IO load. as in stuff you cannot know unless you have it exported through "some" port. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > Kevin Grittner wrote: >> Right, we don't want to give the monitoring software an OS login >> for the database servers, for security reasons. > > depending on what you exactly mean by that I do have to wonder how > you monitor more complex stuff (or stuff that require elevated > privs) - say raid health, multipath configuration, status of OS > level updates, "are certain processes running or not" as well as > basic parameters like CPU or IO load. as in stuff you cannot know > usless you have it exported through "some" port. Many of those are monitored on the server one way or another, through a hardware card accessible only to the DBAs. The card sends an email to the DBAs for any sort of distress, including impending or actual drive failure, ambient temperature out of bounds, internal or external power out of bounds, etc. OS updates are managed by the DBAs through scripts. Ideally we would tie these in to our opcenter software, which displays status through hundreds of "LED" boxes on big plasma displays in our support areas (and can send emails and jabber messages when things get to a bad state), but since the messages are getting to the right people in a timely manner, this is a low priority as far as monitoring enhancement requests go. Only the DBAs have OS logins to database servers. Monitoring software must deal with application ports (which have to be open anyway, so that doesn't add any security risk). Since the hardware monitoring doesn't know about file systems, and the disk space on database servers is primarily an issue for the database, it made sense to us to add the ability to check the space available to the database through a database connection. Hence, fsutil. -Kevin
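The check Kevin describes boils down to asking the OS for free space on the volume a given path lives on. A minimal sketch of the underlying call (Python here for brevity; this is illustrative, not fsutil's actual code, and statvfs assumes a POSIX system, which is exactly the portability problem mentioned upthread):

```python
import os

def free_bytes(path):
    """Free disk space, in bytes, available to unprivileged users
    on the volume containing path (blocks available * fragment size)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize
```

Wrapped in a SQL-callable function, this is the figure the DBA can then watch through the one port that is exposed.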
Stefan Kaltenbrunner wrote: >> >> Another popular question is "how far behind real-time is the archiver >> process?" You can do this right now by duplicating the same xlog >> file name scanning and sorting that the archiver does in your own >> code, looking for .ready files. It would be simpler if you could >> call pg_last_archived_xlogfile() and then just grab that file's >> timestamp. > > well that one seems a more reasonable reasoning to me however I'm not > so sure that the proposed implementation feels right - though can't > come up with a better suggestion for now. That's basically where I'm at, and I was looking more for feedback on that topic rather than to get lost defending use-cases here. There are a few of them, and you can debate their individual merits all day. As a general comment on your line of criticism here, I feel the idea that "we're monitoring that already via <x>" does not mean that an additional check is without value. The kind of people who like redundancy in their database like it in their monitoring, too. I feel there's at least one unique thing exposing this bit buys you, and the fact that it can be a useful secondary source of information too for systems monitoring is a welcome bonus--regardless of whether good practice already supplies a primary one. > If you continue your line of thought you will have to add all kind of > stuff to the database, like CPU usage tracking, getting information > about running processes, storage health. I'm looking to expose something that only the database knows for sure--"what is the archiver working on?"--via the standard way you ask the database questions, a SELECT call. The database doesn't know anything about the CPU, running processes, or storage, so suggesting this path leads in that direction doesn't make any sense. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith wrote: > Stefan Kaltenbrunner wrote: >>> >>> Another popular question is "how far behind real-time is the archiver >>> process?" You can do this right now by duplicating the same xlog >>> file name scanning and sorting that the archiver does in your own >>> code, looking for .ready files. It would be simpler if you could >>> call pg_last_archived_xlogfile() and then just grab that file's >>> timestamp. >> >> well that one seems a more reasonable reasoning to me however I'm not >> so sure that the proposed implementation feels right - though can't >> come up with a better suggestion for now. > > That's basically where I'm at, and I was looking more for feedback on > that topic rather than to get lost defending use-cases here. There are > a few of them, and you can debate their individual merits all day. As a > general comment on your line of criticism here, I feel the idea that > "we're monitoring that already via <x>" does not mean that an additional > check is without value. The kind of people who like redundancy in their > database like it in their monitoring, too. I feel there's at least one > unique thing exposing this bit buys you, and the fact that it can be a > useful secondary source of information too for systems monitoring is a > welcome bonus--regardless of whether good practice already supplies a > primary one. well, that might be true - but as somebody with an extensive sysadmin background I was specifically ticked off by the "disk full" stuff mentioned upthread. Monitoring also means standardization, and somebody who runs hundreds (or dozens) of servers is much better off getting the basics monitored the same way on all systems and getting more specific as you move up the (application) stack. > >> If you continue your line of thought you will have to add all kind of >> stuff to the database, like CPU usage tracking, getting information >> about running processes, storage health.
> > I'm looking to expose something that only the database knows for > sure--"what is the archiver working on?"--via the standard way you ask > the database questions, a SELECT call. The database doesn't know > anything about the CPU, running processes, or storage, so suggesting > this path leads in that direction doesn't make any sense. well, the database does not really know much about "free disk space" in reality either - the only thing it knows is that it might not be able to write data or execute a script, and unless you have shell/logfile access you cannot diagnose those anyway, even with all the proposed functions. However, what I was really trying to say is that we should focus on getting the code stable first, and that prettying it up with fancy stat functions is something that really can and should be done in a follow-up release, once we understand how the code behaves and maybe also how it is likely going to evolve... Stefan
Kevin Grittner wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >> Kevin Grittner wrote: > >>> Right, we don't want to give the monitoring software an OS login >>> for the database servers, for security reasons. >> depending on what you exactly mean by that I do have to wonder how >> you monitor more complex stuff (or stuff that requires elevated >> privs) - say raid health, multipath configuration, status of OS >> level updates, "are certain processes running or not" as well as >> basic parameters like CPU or IO load. as in stuff you cannot know >> unless you have it exported through "some" port. > > Many of those are monitored on the server one way or another, > through a hardware card accessible only to the DBAs. The card sends > an email to the DBAs for any sort of distress, including impending > or actual drive failure, ambient temperature out of bounds, internal > or external power out of bounds, etc. OS updates are managed by the > DBAs through scripts. Ideally we would tie these in to our opcenter > software, which displays status through hundreds of "LED" boxes on > big plasma displays in our support areas (and can send emails and > jabber messages when things get to a bad state), but since the > messages are getting to the right people in a timely manner, this is > a low priority as far as monitoring enhancement requests go. well, a lot of people (including myself) consider it a necessity to aggregate all that stuff in your system monitoring; only that way can you guarantee proper dependency handling (i.e. no need to page for "webserver not running" if the whole server is down). There is also a case to be made for statistics tracking and long-term monitoring of stuff. > > Only the DBAs have OS logins to database servers. Monitoring > software must deal with application ports (which have to be open > anyway, so that doesn't add any security risk).
Since the hardware > monitoring doesn't know about file systems, and the disk space on > database servers is primarily an issue for the database, it made > sense to us to add the ability to check the space available to the > database through a database connection. Hence, fsutil. still seems very backwards - there is much, much more that can only be monitored from within the OS (and not from an external iLO/RSA/IMM/DRAC/whatever) and that you cannot really do from within the database (or any other application), so I'm still puzzled... Stefan
> I'd happily write a patch to handle all that if I thought it would be > accepted. I fear that the whole approach will be considered a bit too > hackish and get rejected on that basis though. Not really sure of a > "right" way to handle this though. Anything better is going to be more > complicated because it requires passing more information into the > archiver, with little gain for that work beyond improving the quality of > this diagnostic routine. And I think most people would find what I > described above useful enough. Yeah, I think we should focus right now on "what monitoring can we get into this version without holding up release?" Your proposal sounds like a good one in that respect. In future versions, I think we'll want a host of granular data, including: * amount of *time* since last successful archive (this would be a good trigger for alerts) * number of failed archive attempts * number of archive files awaiting processing (presumably monitored by the slave) * last archive file processed by the slave, and when * for HS: frequency and length of conflict delays in log processing, as a stat * for HS: number of query cancels due to write/lock conflicts from the master, as a stat However, *all* of the above can wait for the next version, especially since by then we'll have user feedback from the field on required monitoring. If we try to nail this all down now, not only will it delay the release, but we'll get it wrong and have to re-do it anyway. Release early and often, y'know? I think it's key to keep our data as granular and low-level as possible; with good low-level data people can write good tools, but if we over-summarize they can't. Also, it would be nice to have all of our archiving stuff grouped into something like pg_stat_archive rather than being a bunch of disconnected functions. --Josh Berkus
On Sun, Jan 17, 2010 at 8:53 AM, Josh Berkus <josh@agliodbs.com> wrote: > * amount of *time* since last successful archive (this would be a good > trigger for alerts) > * number of failed archive attempts > * number of archive files awaiting processing (presumably monitored by > the slave) > * last archive file processed by the slave, and when Are these for warm-standby, not SR? At least SR isn't so much involved in WAL archiving, i.e., WAL is sent to the standby by walsender instead of an archiver. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-01-14 at 17:33 +0900, Fujii Masao wrote: > I added two new functions; > > (1) pg_last_xlog_receive_location() reports the last WAL location received > and synced by walreceiver. If streaming replication is still in progress > this will increase monotonically. If streaming replication has completed > then this value will remain static at the value of the last WAL record > received and synced. When the server has been started without a streaming > replication then the return value will be InvalidXLogRecPtr (0/0). > > (2) pg_last_xlog_replay_location() reports the last WAL location replayed > during recovery. If recovery is still in progress this will increase > monotonically. If recovery has completed then this value will remain > static at the value of the last WAL record applied. When the server has > been started normally without a recovery then the return value will be > InvalidXLogRecPtr (0/0). I just noticed that these functions have almost the same name as functions I wrote for Hot Standby and Heikki removed from that patch. The function code and docs are 99% identical. I'm happy that the code was used and it is BSD, though it does seem strange to have this credited to others in the release notes. From May 2 2009 the patch included + <entry> + <literal><function>pg_last_recovered_xlog_location</function>()</literal> + </entry> + <entry><type>text</type></entry> + <entry>Returns the transaction log location of the last WAL record + in the current recovery. If recovery is still in progress this + will increase monotonically. If recovery is complete then this value will + remain static at the value of the last transaction applied during that + recovery. When the server has been started normally this will return + InvalidXLogRecPtr (0/0). + </entry> with code + /* + * Returns xlog location of last recovered WAL record.
+ */ + Datum + pg_last_recovered_xlog_location(PG_FUNCTION_ARGS) + { + char location[MAXFNAMELEN]; + + { + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + SpinLockAcquire(&xlogctl->info_lck); + + LastRec = xlogctl->recoveryLastRecPtr; + + SpinLockRelease(&xlogctl->info_lck); + } + + snprintf(location, sizeof(location), "%X/%X", + LastRec.xlogid, LastRec.xrecoff); + PG_RETURN_TEXT_P(cstring_to_text(location)); + } -- Simon Riggs www.2ndQuadrant.com
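A monitoring script polling these two functions can turn the returned "xlogid/xrecoff" strings into a byte count of WAL received but not yet replayed. A sketch of the arithmetic (the 0xFF000000 bytes-per-xlogid constant reflects the current WAL addressing, 255 segments of 16 MB per log id; treat that constant as an assumption of this sketch, as are the function names):

```python
BYTES_PER_XLOGID = 0xFF000000  # 255 segments of 16 MB per log id (assumed)

def lsn_to_bytes(lsn):
    """Map an 'xlogid/xrecoff' location string, e.g. the output of
    pg_last_xlog_receive_location(), to a linear byte position."""
    xlogid, xrecoff = (int(part, 16) for part in lsn.split("/"))
    return xlogid * BYTES_PER_XLOGID + xrecoff

def replay_lag_bytes(receive_lsn, replay_lsn):
    """WAL received by walreceiver but not yet replayed, in bytes."""
    return lsn_to_bytes(receive_lsn) - lsn_to_bytes(replay_lsn)
```

This is the kind of low-level, granular data Josh argues for: the server exposes two raw locations, and the tooling decides how to summarize them.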
Simon Riggs wrote: > On Thu, 2010-01-14 at 17:33 +0900, Fujii Masao wrote: > > > I added two new functions; > > > > (1) pg_last_xlog_receive_location() reports the last WAL location received > > and synced by walreceiver. If streaming replication is still in progress > > this will increase monotonically. If streaming replication has completed > > then this value will remain static at the value of the last WAL record > > received and synced. When the server has been started without a streaming > > replication then the return value will be InvalidXLogRecPtr (0/0). > > > > (2) pg_last_xlog_replay_location() reports the last WAL location replayed > > during recovery. If recovery is still in progress this will increase > > monotonically. If recovery has completed then this value will remain > > static at the value of the last WAL record applied. When the server has > > been started normally without a recovery then the return value will be > > InvalidXLogRecPtr (0/0). > > I just noticed that these functions have almost the same name as > functions I wrote for Hot Standby and Heikki removed from that patch. > The function code and docs are 99% identical. > > I'm happy that the code was used and it is BSD, though it does seem > strange to have this credited to others in the release notes. Sorry, release notes updated: Add <link linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> and <function>pg_last_xlog_replay_location()</>, which can be used to monitor standby server <acronym>WAL</> activity (Simon) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
On Tue, Mar 23, 2010 at 7:56 AM, Bruce Momjian <bruce@momjian.us> wrote: > Sorry, release notes updated: > > Add <link > linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> > and <function>pg_last_xlog_replay_location()</>, which > can be used to monitor standby server <acronym>WAL</> > activity (Simon) Umm... though I'm not sure about the policy on credit, I think that all three names should be listed. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Tue, Mar 23, 2010 at 7:56 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Sorry, release notes updated: > > > > Add <link > > linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> > > and <function>pg_last_xlog_replay_location()</>, which > > can be used to monitor standby server <acronym>WAL</> > > activity (Simon) > > Umm... though I'm not sure about the policy on credit, I think that > all three names should be listed. OK, all three are there now: Add <link linkend="functions-recovery-info-table"><function>pg_last_xlog_receive_location()</></link> and <function>pg_last_xlog_replay_location()</>, which can be used to monitor standby server <acronym>WAL</> activity (Simon, Fujii Masao, Heikki) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do