Thread: Expose checkpoint start/finish times into SQL.

Expose checkpoint start/finish times into SQL.

From

Theo Schlossnagle

Date:

03 April 2008, 19:10:32

First whack at exposing the start and finish checkpoint times into SQL.

--
Theo Schlossnagle
Esoteric Curio -- http://lethargy.org/
OmniTI Computer Consulting, Inc. -- http://omniti.com/

Attachment

checkpoint_exposed.patch

Re: Expose checkpoint start/finish times into SQL.

From

Alvaro Herrera

Date:

03 April 2008, 19:18:01

Theo Schlossnagle wrote:
> First whack at exposing the start and finish checkpoint times into SQL.

I suggest using GetCurrentTimestamp() directly instead of time_t and
converting.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Expose checkpoint start/finish times into SQL.

From

Heikki Linnakangas

Date:

03 April 2008, 19:22:13

Theo Schlossnagle wrote:
> First whack at exposing the start and finish checkpoint times into SQL.

Why is that useful?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Expose checkpoint start/finish times into SQL.

From

"Joshua D. Drake"

Date:

03 April 2008, 19:27:35

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 03 Apr 2008 23:21:49 +0100
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> Theo Schlossnagle wrote:
> > First whack at exposing the start and finish checkpoint times into
> > SQL.
> 
> Why is that useful?

For knowing how long checkpoints are taking. If they are taking too
long you may need to adjust your bgwriter settings, and it is a
serious drag to parse postgresql logs for this info.

Joshua D. Drake 



- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFH9VpcATb/zqfZUUQRAiFwAJ0W7uu4Xk4DgXph1JaL180XfsAKpQCghDRw
GYNE9ouPjlRhEqUmxwktDYc=
=DRsZ
-----END PGP SIGNATURE-----

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

03 April 2008, 19:34:41

Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Theo Schlossnagle wrote:
>> First whack at exposing the start and finish checkpoint times into SQL.

> Why is that useful?

Does this implementation even work?  It looks to me like the
globalStats.last_checkpoint_start/done fields will go back to zero the
very next time the bgwriter sends a stats message.  I'm not sure what
a sane behavior would be, but it seems unlikely that that's it.

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Andrew Dunstan

Date:

03 April 2008, 20:08:55

Joshua D. Drake wrote:
>> Theo Schlossnagle wrote:
>>
>>> First whack at exposing the start and finish checkpoint times into
>>> SQL.
>>>
>> Why is that useful?
>>
>
> For knowing how long checkpoints are taking. If they are taking too
> long you may need to adjust your bgwriter settings, and it is a
> serious drag to parse postgresql logs for this info.
>
>
>

Even if this were true, surely the answer is to improve the logging.

Has this feature been discussed on -hackers? I don't recall it (and my
memory has plenty of holes in it), but I'm sure that after attending my
talk last Sunday Theo hasn't sent in a patch for an undiscussed feature ;-)

cheers

andrew

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

03 April 2008, 21:29:28

"Joshua D. Drake" <jd@commandprompt.com> writes:
> Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>> Why is that useful?

> For knowing how long checkpoints are taking. If they are taking too
> long you may need to adjust your bgwriter settings, and it is a
> serious drag to parse postgresql logs for this info.

1. To do anything useful along those lines, you would need to look at a
lot of checkpoints over time, which is what log_checkpoints is good for.
This patch only tells you about the latest, which isn't very useful
for making any good decisions about parameters.

2. If I read the patch correctly, half of the time what you'd be seeing
is the start time of the currently-active checkpoint and the completion
time of the prior checkpoint.  I don't know what those numbers are good
for at all.

3. As of PG 8.3, the bgwriter tries very hard to make the elapsed time
of a checkpoint be just about checkpoint_timeout *
checkpoint_completion_target, regardless of load factors.  So unless
your settings are completely broken, measuring the actual time isn't
going to tell you much.

In short: Heikki's question is on point.

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Robert Treat

Date:

03 April 2008, 21:36:44

On Thursday 03 April 2008 19:08, Andrew Dunstan wrote:
> Joshua D. Drake wrote:
> >> Theo Schlossnagle wrote:
> >>> First whack at exposing the start and finish checkpoint times into
> >>> SQL.
> >>
> >> Why is that useful?
> >
> > For knowing how long checkpoints are taking. If they are taking too
> > long you may need to adjust your bgwriter settings, and it is a
> > serious drag to parse postgresql logs for this info.
>
> Even if this were true, surely the answer is to improve the logging.
>

Exposing everything into the log files isn't always sufficient (says the guy
who maintains a remote admin tool)

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Expose checkpoint start/finish times into SQL.

From

Andrew Dunstan

Date:

03 April 2008, 21:46:05


Robert Treat wrote:
> On Thursday 03 April 2008 19:08, Andrew Dunstan wrote:
>
>> Joshua D. Drake wrote:
>>
>>>> Theo Schlossnagle wrote:
>>>>
>>>>> First whack at exposing the start and finish checkpoint times into
>>>>> SQL.
>>>>>
>>>> Why is that useful?
>>>>
>>> For knowing how long checkpoints are taking. If they are taking too
>>> long you may need to adjust your bgwriter settings, and it is a
>>> serious drag to parse postgresql logs for this info.
>>>
>> Even if this were true, surely the answer is to improve the logging.
>>
>>
>
> Exposing everything into the log files isn't always sufficient (says the guy
> who maintains a remote admin tool)
>

It should be now that you can have machine readable logs (says the guy
who literally spent weeks making that happen) ;-)

cheers

andrew

Re: Expose checkpoint start/finish times into SQL.

From

"Joshua D. Drake"

Date:

03 April 2008, 22:12:50

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 03 Apr 2008 20:29:18 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> >> Why is that useful?
> 
> > For knowing how long checkpoints are taking. If they are taking too
> > long you may need to adjust your bgwriter settings, and it is a
> > serious drag to parse postgresql logs for this info.
> 
> 1. To do anything useful along those lines, you would need to look at
> a lot of checkpoints over time, which is what log_checkpoints is good
> for. This patch only tells you about the latest, which isn't very
> useful for making any good decisions about parameters.

I would agree with this. We would need a history of checkpoints that
didn't reset until we told it to.

> 
> 2. If I read the patch correctly, half of the time what you'd be
> seeing is the start time of the currently-active checkpoint and the
> completion time of the prior checkpoint.  I don't know what those
> numbers are good for at all.

IMO we should see start checkpoint, end checkpoint. As part of a single
record. Other possibly interesting info would be how many logs were
processed.

> 
> 3. As of PG 8.3, the bgwriter tries very hard to make the elapsed time
> of a checkpoint be just about checkpoint_timeout *
> checkpoint_completion_target, regardless of load factors.  So unless
> your settings are completely broken, measuring the actual time isn't
> going to tell you much.
> 

You assume that people don't have broken settings. :)

> In short: Heikki's question is on point.

I didn't say it wasn't. I was just stating why a patch that does what
was described is useful. I believe with changes it would still be
useful. Having to go to log for a lot of this stuff is painful.

Joshua D. Drake


- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFH9YETATb/zqfZUUQRAmiTAJ0Shc4rSIKRG5nabAv9RwW1MVi/BQCfUIiK
Nb8qyBonAlNl/Agp/wCyvTU=
=H8TF
-----END PGP SIGNATURE-----

Re: Expose checkpoint start/finish times into SQL.

From

"Joshua D. Drake"

Date:

03 April 2008, 22:14:42

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 03 Apr 2008 20:45:37 -0400
Andrew Dunstan <andrew@dunslane.net> wrote:

> > Exposing everything into the log files isn't always sufficient
> > (says the guy who maintains a remote admin tool)
> >   
> 
> It should be now that you can have machine readable logs (says the
> guy who literally spent weeks making that happen) ;-)

And how does the person get access to those? And what script do I need
to write to make it happen? Don't get me wrong, the feature you worked
entirely too hard on to get working is valuable but... being able to
say, "SELECT * FROM give_me_my_db_info;" is much more useful in this
context.

In short, I should never have to go to log for this class of
information. It should be available in the database.

Sincerely,

Joshua D. Drake

- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFH9YF9ATb/zqfZUUQRAp+fAKCIOOKCKv5FaWiOF7hZjtLr8PzfGQCfRpq+
lY353B+9fmWwxWppAkhncMY=
=BwYD
-----END PGP SIGNATURE-----

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

03 April 2008, 22:26:54

"Joshua D. Drake" <jd@commandprompt.com> writes:
> I would agree with this. We would need a history of checkpoints that
> didn't reset until we told it to.

Indeed, but the submitted patch has nought whatsoever to do with that.
It exposes some instantaneous state.

You could perhaps *build* a log facility on top of that, at the SQL
level; but I don't see the point, and I definitely disagree that it
would be "easier than trolling the logs".

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

"Joshua D. Drake"

Date:

03 April 2008, 22:31:09

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 03 Apr 2008 21:26:46 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > I would agree with this. We would need a history of checkpoints that
> > didn't reset until we told it to.
> 
> Indeed, but the submitted patch has nought whatsoever to do with that.
> It exposes some instantaneous state.
> 
> You could perhaps *build* a log facility on top of that, at the SQL
> level; but I don't see the point, and I definitely disagree that it
> would be "easier than trolling the logs".

Having the ability to do this:

SELECT * FROM pg_stat_bgwriter 
   WHERE last_checkpoint 
   BETWEEN (current_time - '1 Day'::interval) AND current_time;

Would be very useful. Which I can do with logs currently. You are
correct. However from a usability, remote reporting and manageability
perspective it certainly is not the same as something as I describe
above.

Note I am perfectly willing to table this until we have a full todo and
specification for a feature.

Sincerely,

Joshua D. Drake



- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFH9YVlATb/zqfZUUQRAnRxAKCab3O4dmBXctXTptDFwkRx+1zUQQCdFmsN
E3GWNoC90jS7ooFgArR8Nv0=
=CvMx
-----END PGP SIGNATURE-----

Re: Expose checkpoint start/finish times into SQL.

From

Theo Schlossnagle

Date:

03 April 2008, 22:31:30

On Apr 3, 2008, at 7:08 PM, Andrew Dunstan wrote:

>
>
> Joshua D. Drake wrote:
>>> Theo Schlossnagle wrote:
>>>
>>>> First whack at exposing the start and finish checkpoint times into
>>>> SQL.
>>>>
>>> Why is that useful?
>>>
>>
>> For knowing how long checkpoints are taking. If they are taking too
>> long you may need to adjust your bgwriter settings, and it is a
>> serious drag to parse postgresql logs for this info.
>>
>>
>>
>
> Even if this were true, surely the answer is to improve the logging.
>
> Has this feature been discussed on -hackers? I don't recall it (and
> my memory has plenty of holes in it), but I'm sure that after
> attending my talk last Sunday Theo hasn't sent in a patch for an
> undiscussed feature ;-)

Andrew: I don't think this feature has been discussed on hackers.  The
patch took about 15 minutes to author, so it sounds like the most
concise way to start a conversation.  Seems silly to start the
conversation on hackers with a patch. :-)

Alvaro: Thanks, I flip that to GetCurrentTimestamp()

Heikki: It it useful for knowing when the last checkpoint occurred.
Like Robert, we have situations where reading the log file is a PITA
-- so this provides that information.  I originally planned on only
adding the start time, but figured adding the end would make sense too.

Tom: It worked for me in my testing, though I did not extensively
tested.  I didn't see anywhere the stats are zero'd out, so I believe
the timestamp is zero at start and then only ever set to the starttime
during a checkpoint invocation.  I admittedly don't have a thorough
understanding of that code -- but that segment (before my patch)
looked pretty concise.

--
Theo Schlossnagle
Esoteric Curio -- http://lethargy.org/
OmniTI Computer Consulting, Inc. -- http://omniti.com/

Re: Expose checkpoint start/finish times into SQL.

From

Andrew Dunstan

Date:

03 April 2008, 22:44:22

Joshua D. Drake wrote:
>>> Exposing everything into the log files isn't always sufficient
>>> (says the guy who maintains a remote admin tool)
>>>
>>>
>> It should be now that you can have machine readable logs (says the
>> guy who literally spent weeks making that happen) ;-)
>>
>
> And how does the person get access to those? And what script do I need
> to write to make it happen? Don't get me wrong, the feature you worked
> entirely too hard on to get working is valuable but... being able to
> say, "SELECT * FROM give_me_my_db_info;" is much more useful in this
> context.
>

How to load the CSV logs is very clearly documented. It's really *very*
easy, so easy it's mostly C&P. See
http://www.postgresql.org/docs/current/static/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG

If you are trying to tell me that that's too hard for a DBA, then I have
to say you need better DBAs.

> In short, I should never have to go to log for this class of
> information. It should be available in the database.
>
>

What you haven't explained is why this information needs to be kept in
the db on a historical basis, as opposed to all the other possible
diagnostics where history might be useful (and, as Tom has pointed out,
this patch doesn keep it historically any way).

I think there is quite possibly a good case for keeping some diagnostics
in a table or tables, on a rolling basis, maybe. But then that's a
facility that needs to be properly designed.

cheers

andrew

Re: Expose checkpoint start/finish times into SQL.

From

"Joshua D. Drake"

Date:

03 April 2008, 22:46:03

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 03 Apr 2008 21:44:00 -0400
Andrew Dunstan <andrew@dunslane.net> wrote:
 
> I think there is quite possibly a good case for keeping some
> diagnostics in a table or tables, on a rolling basis, maybe. But then
> that's a facility that needs to be properly designed.

Please see my later email on the thread.

Joshua D. Drake

- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFH9YjfATb/zqfZUUQRAnrwAKCUW76EI+lnz+qGJXFPyp9QqWJ9DgCgpnmP
eiGM+P4P79O20VgBm6ew1u4=
=/abt
-----END PGP SIGNATURE-----

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

03 April 2008, 22:59:00

Theo Schlossnagle <jesus@omniti.com> writes:
> Heikki: It it useful for knowing when the last checkpoint occurred.

I guess I'm wondering why that's important.  In the current bgwriter
design, the system spends half its time checkpointing (or in general
checkpoint_completion_target % of the time).  So this seems fairly close
to wanting to know when the bgwriter last wrote a dirty buffer --- yeah,
I can imagine scenarios for wanting to know that, but they probably
require a whole pile of other knowledge as well.

JD seems to have gone off into the weeds imagining that this patch would
provide tracking of the last N checkpoints; which might start to
approach the level of an interesting feature, except that it's still not
clear *why* those numbers are interesting, given the bgwriter's
propensity to try to hold the checkpoint duration constant.

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Andrew Dunstan

Date:

03 April 2008, 23:00:36


Theo Schlossnagle wrote:
>
>>
>>
>> Has this feature been discussed on -hackers? I don't recall it (and
>> my memory has plenty of holes in it), but I'm sure that after
>> attending my talk last Sunday Theo hasn't sent in a patch for an
>> undiscussed feature ;-)
>
>
> Andrew: I don't think this feature has been discussed on hackers.  The
> patch took about 15 minutes to author, so it sounds like the most
> concise way to start a conversation.  Seems silly to start the
> conversation on hackers with a patch. :-)
>
>

Well, not really. I believe -hackers has a much larger readership than
-patches, so even for small features we generally want them discussed
there.

cheers

andrew

Re: Expose checkpoint start/finish times into SQL.

From

Robert Treat

Date:

03 April 2008, 23:11:23

On Thursday 03 April 2008 21:14, Joshua D. Drake wrote:
> On Thu, 03 Apr 2008 20:29:18 -0400
>
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> > > Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> > >> Why is that useful?
> > >
> > > For knowing how long checkpoints are taking. If they are taking too
> > > long you may need to adjust your bgwriter settings, and it is a
> > > serious drag to parse postgresql logs for this info.
> >
> > 1. To do anything useful along those lines, you would need to look at
> > a lot of checkpoints over time, which is what log_checkpoints is good
> > for. This patch only tells you about the latest, which isn't very
> > useful for making any good decisions about parameters.
>
> I would agree with this. We would need a history of checkpoints that
> didn't reset until we told it to.
>

You can plug a single item graphed over time into things like rrdtool to get
good trending information. And it's often easier to do this using sql
interfaces to get the data than pulling it out of log files (almost like the
db was designed for that :-)

> > 2. If I read the patch correctly, half of the time what you'd be
> > seeing is the start time of the currently-active checkpoint and the
> > completion time of the prior checkpoint.  I don't know what those
> > numbers are good for at all.
>

Knowing when the last checkpoint occured is certainly useful for monitoring
purposes (wrt pitr and as a general item for cuasing alerts if checkpoints
stop occuring frequently enough, or if they start taking too long.)

>
> > 3. As of PG 8.3, the bgwriter tries very hard to make the elapsed time
> > of a checkpoint be just about checkpoint_timeout *
> > checkpoint_completion_target, regardless of load factors.  So unless
> > your settings are completely broken, measuring the actual time isn't
> > going to tell you much.
>

How does one measure when the bgwriter is failing at this effort?

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

03 April 2008, 23:33:32

Robert Treat <xzilla@users.sourceforge.net> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 3. As of PG 8.3, the bgwriter tries very hard to make the elapsed time
> of a checkpoint be just about checkpoint_timeout *
> checkpoint_completion_target, regardless of load factors.  So unless
> your settings are completely broken, measuring the actual time isn't
> going to tell you much.

> How does one measure when the bgwriter is failing at this effort?

Well, not with *this* patch.  At least not without adding a lot of
infrastructure on top of it, and I'm failing to see why you'd build
such infrastructure in order to track just two numbers that are of
uncertain value.

JD seems to be on record that the existing logging mechanism sucks
and he needs something else.  That's fine, but I think it means that
we need to improve logging in general, not invent a single-purpose
mechanism for logging checkpoint times.

Theo claimed he had a reason for wanting to know the latest checkpoint
time, *without* any intention of time-extended tracking of that; but
he didn't say what it was.  If there is a credible reason for that
then it might justify a patch of this nature, but I don't see that
the reasons that have been stated so far in the thread hold any water.

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

"Joshua D. Drake"

Date:

03 April 2008, 23:39:05

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 03 Apr 2008 22:33:15 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> JD seems to be on record that the existing logging mechanism sucks
> and he needs something else.  That's fine, but I think it means that
> we need to improve logging in general, not invent a single-purpose
> mechanism for logging checkpoint times.

Alright hold on. That is *not* what I said. I said *in this context*
an SQL interface would be nicer. I also said that I was willing to
table the entire discussion based on your review of the patch in that
it didn't offer what I thought it did.

Sincerely,

Joshua D. Drake


- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFH9ZVSATb/zqfZUUQRAhDJAJ49gMASDA40N9ydhzTTaMBdr/KccQCfei3d
4QXp/4C6iH7GXatqwU5qnuk=
=slQS
-----END PGP SIGNATURE-----

Re: Expose checkpoint start/finish times into SQL.

From

Theo Schlossnagle

Date:

04 April 2008, 00:26:36

On Apr 3, 2008, at 10:33 PM, Tom Lane wrote:
> Theo claimed he had a reason for wanting to know the latest checkpoint
> time, *without* any intention of time-extended tracking of that; but
> he didn't say what it was.  If there is a credible reason for that
> then it might justify a patch of this nature, but I don't see that
> the reasons that have been stated so far in the thread hold any water.

We had a recent event where the system stopped checkpointing, which
eventually lead to Robert Treat's therapy session at PostgreSQL
Conference East.  If you have been good to be able to remotely (via
SQL) monitor the system to ensure that the last checkpoint start time
is within a reasonable window from current_timestamp.

My patch was clearly single purposed and never intended to be a SQL
accessible log of checkpoint times.

So, in reality, the "finish time" of the checkpoint is really not of
any importance to me either.

--
Theo Schlossnagle
Esoteric Curio -- http://lethargy.org/
OmniTI Computer Consulting, Inc. -- http://omniti.com/

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

04 April 2008, 00:58:44

Theo Schlossnagle <jesus@omniti.com> writes:
> On Apr 3, 2008, at 10:33 PM, Tom Lane wrote:
>> Theo claimed he had a reason for wanting to know the latest checkpoint
>> time, *without* any intention of time-extended tracking of that; but
>> he didn't say what it was.

> We had a recent event where the system stopped checkpointing, which
> eventually lead to Robert Treat's therapy session at PostgreSQL
> Conference East.

Therapy is great ;-) ... but "the system stopped checkpointing" does not
strike me as a routine occurrence that we should be making provisions
for DBAs to watch for.  What, pray tell, is the DBA supposed to do when
and if he notices that?  Not to mention that since checkpointing is
ingrained in the bgwriter, you've got worse problems than that if it's
not working.

I'd much rather be spending our time and effort on understanding what
broke for you, and fixing the code so it doesn't happen again.  An ounce
of prevention is worth a pound of cure, and all that...

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Greg Smith

Date:

04 April 2008, 01:09:46

On Thu, 3 Apr 2008, Robert Treat wrote:

> You can plug a single item graphed over time into things like rrdtool to
> get good trending information. And it's often easier to do this using
> sql interfaces to get the data than pulling it out of log files (almost
> like the db was designed for that :-)

The pg_stat_bgwriter value for buffers_checkpoint was intentionally
implemented in 8.3 such that it jumps in one big lump when the checkpoint
is done.  While it's not the ideal interface for what you're looking for,
the reason for that is to made it possible to build a "when was the last
checkpoint finished?" interface via some remote monitoring tool just by
determining the last time that the value jumped upwards.  You can easily
see them just by graphing that value, it shouldn't be too hard to teach
something with rrdtool guts to find them.

Since checkpoints have a fairly predictable duration in 8.3, as long as
you catch the start or end of them you can make a resonable guess where
the other side was.  The case you're trying to avoid here, the system
going a long time without checkpointing, can be implemented by looking for
a begin or end regularly, you don't need to track both.  As long as
there's a checkpoint finish "pulse" in buffers_checkpoint showing up
regularly you're fine.  The only situation I can think of where this might
be problematic is where the system has been idle enough to not have any
buffers to write at checkpoint time, but I recall a code path there where
checkpoints stop altogether unless there's been activity so even tracking
the time may not change that.

Ultimately a lot of the other questions you might ask (i.e. "how many
buffers have been written per hour by checkpoints?") require processing
the numbers in this way anyway, and I thought this implementation was good
enough to monitor the situation you're trying to avoid--presuming you're
using some sort of moderately powerful remote monitoring tool.  Theo's
patch would make it easier to answer with a simple command which has some
value; a little SQL in a cron job would be good enough to trigger an alert
rather than needing a real monitoring probe.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Expose checkpoint start/finish times into SQL.

From

Bruce Momjian

Date:

04 April 2008, 01:20:51

Greg Smith wrote:
> On Thu, 3 Apr 2008, Robert Treat wrote:
>
> > You can plug a single item graphed over time into things like rrdtool to
> > get good trending information. And it's often easier to do this using
> > sql interfaces to get the data than pulling it out of log files (almost
> > like the db was designed for that :-)
>
> The pg_stat_bgwriter value for buffers_checkpoint was intentionally
> implemented in 8.3 such that it jumps in one big lump when the checkpoint
> is done.  While it's not the ideal interface for what you're looking for,
> the reason for that is to made it possible to build a "when was the last
> checkpoint finished?" interface via some remote monitoring tool just by
> determining the last time that the value jumped upwards.  You can easily
> see them just by graphing that value, it shouldn't be too hard to teach
> something with rrdtool guts to find them.
>
> Since checkpoints have a fairly predictable duration in 8.3, as long as
> you catch the start or end of them you can make a resonable guess where
> the other side was.  The case you're trying to avoid here, the system
> going a long time without checkpointing, can be implemented by looking for
> a begin or end regularly, you don't need to track both.  As long as
> there's a checkpoint finish "pulse" in buffers_checkpoint showing up
> regularly you're fine.  The only situation I can think of where this might
> be problematic is where the system has been idle enough to not have any
> buffers to write at checkpoint time, but I recall a code path there where
> checkpoints stop altogether unless there's been activity so even tracking
> the time may not change that.

If we want to expose more of the internals of Postgres via functions,
fine, but a one-off function just to cover something that happened to
one user just doesn't make sense.  We need a plan on exactly what we
want to expose in a coherent way, and how.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Expose checkpoint start/finish times into SQL.

From

Greg Smith

Date:

04 April 2008, 01:30:54

On Thu, 3 Apr 2008, Joshua D. Drake wrote:

> For knowing how long checkpoints are taking. If they are taking too
> long you may need to adjust your bgwriter settings, and it is a
> serious drag to parse postgresql logs for this info.

There's some disconnect here between what I think you want here and what
Theo's patch does.

If I were running an 8.2 system, there's a whole lot of diagnostic
information I'd like to have handy in order to optimize checkpoints, as
you describe.  But this patch doesn't help 8.2 users, because the whole
pg_stat_bgwriter structure is 8.3 only--it provides much of what is needed
in this area.

In 8.3, there is very little background writer tuning to do.  What tuning
you can do is easy to figure out from what's in pg_stat_bgwriter.  The new
checkpoint_completion_target feature works better than tuning the 8.2
background writer to reduce checkpoint spikes ever did anyway, and that's
the main thing that drives how long checkpoints take to process now.

So in the only situation where this patch can be applied, 8.3, I don't
think the scenario you describe is likely to pop up.  And the combination
of two 8.3 features (log_checkpoint and the CSV logs, it's no coincidence
I worked on both) should allow a good enough way to hack around this area
when it does.  Just make the logs rotate fairly regularly, and every time
they do import the last CSV section that's now finished into a table so
you can compute statistics about how the checkpoints are doing there.  I
don't see why that sort of stuff should go into core when it's now easy to
do outside of it.  I have a whole stack of scripts in that area I plan to
release over the summer.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Expose checkpoint start/finish times into SQL.

From

Greg Smith

Date:

04 April 2008, 01:41:14

On Thu, 3 Apr 2008, Tom Lane wrote:

> As of PG 8.3, the bgwriter tries very hard to make the elapsed time of a
> checkpoint be just about checkpoint_timeout *
> checkpoint_completion_target, regardless of load factors.

In the cases where the timing on checkpoint writes are timeout driven.
When they're checkpoint_segments driven the time is derived from that
drumbeat instead.  And those cases are the more interesting ones, because
those are the ones that will speed up during really busy periods.

I think it's completely relevant to track how often checkpoints are
happening because they do still vary even in 8.3.  I'm just not sure if
the current methods available for that really aren't good enough, or if
it's just the case that not everyone is aware of all of them.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Expose checkpoint start/finish times into SQL.

From

Robert Treat

Date:

04 April 2008, 01:59:45

On Friday 04 April 2008 00:09, Greg Smith wrote:
> On Thu, 3 Apr 2008, Robert Treat wrote:
> > You can plug a single item graphed over time into things like rrdtool to
> > get good trending information. And it's often easier to do this using
> > sql interfaces to get the data than pulling it out of log files (almost
> > like the db was designed for that :-)
>
> The pg_stat_bgwriter value for buffers_checkpoint was intentionally
> implemented in 8.3 such that it jumps in one big lump when the checkpoint
> is done.  While it's not the ideal interface for what you're looking for,
> the reason for that is to made it possible to build a "when was the last
> checkpoint finished?" interface via some remote monitoring tool just by
> determining the last time that the value jumped upwards.  You can easily
> see them just by graphing that value, it shouldn't be too hard to teach
> something with rrdtool guts to find them.
>

the advantage of using a timestamp is that you get the incrementing counter
for free which is certainly helpful in third party tools like phppgadmin that
don't instrument tracking methods; you can look at the system and it's
settings and have a pretty good idea if something is awry.

>
> Ultimately a lot of the other questions you might ask (i.e. "how many
> buffers have been written per hour by checkpoints?") require processing
> the numbers in this way anyway, and I thought this implementation was good
> enough to monitor the situation you're trying to avoid--presuming you're
> using some sort of moderately powerful remote monitoring tool.  Theo's
> patch would make it easier to answer with a simple command which has some
> value; a little SQL in a cron job would be good enough to trigger an alert
> rather than needing a real monitoring probe.
>

Yes, the idea of making a basic nagios/munin check that is readily consumable
by the general public is certainly a bonus in my eyes.

I have to add, given that we already provide the time of last checkpoint
information via pg_controldata, I don't understand why people are against
making that information accesible to remote clients.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

04 April 2008, 02:16:41

Robert Treat <xzilla@users.sourceforge.net> writes:
> I have to add, given that we already provide the time of last checkpoint
> information via pg_controldata, I don't understand why people are against
> making that information accesible to remote clients.

So, I can expect to see a patch next week that invents a separate SQL
function to expose each field printed by pg_controldata?

You are failing to make the distinction between debug/developer support
(which is what pg_controldata is) and things that are interesting for
DBAs to track.  Would you expect someone to install handwritten
monitoring infrastructure to awaken them in the middle of the night when
pg_control_version changes?

I am not saying that there is no use for some sort of facility in this
area.  But I think it needs some more-general thinking and design than
has been evidenced so far in this thread.

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Greg Smith

Date:

04 April 2008, 02:29:14

On Thu, 3 Apr 2008, Tom Lane wrote:

> "the system stopped checkpointing" does not strike me as a routine
> occurrence that we should be making provisions for DBAs to watch for.
> What, pray tell, is the DBA supposed to do when and if he notices that?

Schedule downtime rather than wait for it to happen unpredictably when the
inevitable crash happens.

> I'd much rather be spending our time and effort on understanding what
> broke for you, and fixing the code so it doesn't happen again.

(Here I start laughing all over again as I recall Robert's talk, which we
really need to get you in particular the video of)  Their situation had
possible causes that included a bit flipping in bad memory, which is
pretty hard to code around (unless you're a Core Wars veteran).  I'm
familiar with that part of the checkpoint code path, and everything I was
able to think of when hearing the outline of events was already considered
and rejected as not being a likely cause.  This patch comes out of
pragmatic acceptance that, sometimes, stuff will happen you can't easily
explain, but that doesn't mean it's not worth keeping an eye on it anyway
so it doesn't sneak up on you again.

Anyway, I think this whole thing would be better handled by a larger
internals view that this whole codebase could use a dose of anyway.  What
I really want is an interface like this:

psql> select pg_internals('last_checkpoint_time');

and then start sprinkling exports of those probe points in some popular
places people would like to look at.

I will apologize now for suggesting this, followed by not having enough
time to code it in the near future.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

04 April 2008, 02:59:56

Greg Smith <gsmith@gregsmith.com> writes:
> On Thu, 3 Apr 2008, Tom Lane wrote:
>> I'd much rather be spending our time and effort on understanding what
>> broke for you, and fixing the code so it doesn't happen again.

> [ shit happens... ]

Completely fair, but I still don't see how this particular patch would
be a useful addition to the DBA's monitoring-tool arsenal.  The scope
seems too narrow.

(And you still didn't tell me what the actual failure case was.)

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Greg Smith

Date:

04 April 2008, 03:21:38

On Fri, 4 Apr 2008, Tom Lane wrote:

> (And you still didn't tell me what the actual failure case was.)

Database stops checkpointing.  WAL files pile up.  In the middle of
backup, system finally dies, and when it starts recovery there's a bad
record in the WAL files--which there are now thousands of to apply, and
the bad one is 4 hours of replay in.  Believe it or not, it goes downhill
from there.

It's what kicked off the first step that's the big mystery.  The only code
path I thought of that can block checkpoints like this is when the
archive_command isn't working anymore, and that wasn't being used.  Given
some of the other corruption found later and the bad memory issues
discovered, a bit flipping in the "do I need to checkpoint now?" code or
data seems just as likely as any other explanation.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

04 April 2008, 03:36:40

Greg Smith <gsmith@gregsmith.com> writes:
> On Fri, 4 Apr 2008, Tom Lane wrote:
>> (And you still didn't tell me what the actual failure case was.)

> Database stops checkpointing.  WAL files pile up.  In the middle of
> backup, system finally dies, and when it starts recovery there's a bad
> record in the WAL files--which there are now thousands of to apply, and
> the bad one is 4 hours of replay in.  Believe it or not, it goes downhill
> from there.

> It's what kicked off the first step that's the big mystery.

Indeed :-(.  But given those observations, I'd still have about zero
faith in the usefulness of this patch.  If the bgwriter is not able to
complete checkpoints, is it able to tell you the truth about what it's
doing?

The actual advice I'd give to a DBA faced with such a case is to
kill -ABRT the bgwriter and send the stack trace to -hackers.
That's not in the proposed patch though...

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Greg Smith

Date:

04 April 2008, 04:01:11

On Fri, 4 Apr 2008, Tom Lane wrote:

> The actual advice I'd give to a DBA faced with such a case is to
> kill -ABRT the bgwriter and send the stack trace to -hackers.

And that's a perfect example of where they're trying to get to.  They
didn't notice the problem until after the crash.  The server didn't come
back up (busy processing WALs) and that downtime was caught by a
monitoring system.  At that point it was too late to collect debugging
information on what was wrong inside the server processes that might have
given a clue what happened.

If they'd have noticed it while the server was up, perhaps because the
"last checkpoint" value hadn't changed in a long time (which seems like it
might be available via stats even if, as you say, the background writer is
out of its mind at that point), they could have done such a kill and
collected some actual useful information here.  That's the theory at
least.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

04 April 2008, 04:14:38

Greg Smith <gsmith@gregsmith.com> writes:
> ... If they'd have noticed it while the server was up, perhaps because the
> "last checkpoint" value hadn't changed in a long time (which seems like it
> might be available via stats even if, as you say, the background writer is
> out of its mind at that point), they could have done such a kill and
> collected some actual useful information here.  That's the theory at
> least.

Well, mebbe, but that still seems to require a lot of custom monitoring
infrastructure that is not present in this patch, and furthermore that
this patch doesn't especially aid the development of.

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Alvaro Herrera

Date:

04 April 2008, 15:30:13

Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > ... If they'd have noticed it while the server was up, perhaps because the
> > "last checkpoint" value hadn't changed in a long time (which seems like it
> > might be available via stats even if, as you say, the background writer is
> > out of its mind at that point), they could have done such a kill and
> > collected some actual useful information here.  That's the theory at
> > least.
>
> Well, mebbe, but that still seems to require a lot of custom monitoring
> infrastructure that is not present in this patch, and furthermore that
> this patch doesn't especially aid the development of.

These kind of things can be monitored externally very easily, say by
Nagios, when the values are available via the database.  If you have to
troll the logs, it's quite a bit harder to do it.

I'm not sure about the right values to export -- last checkpoint start
time is the most obvious idea, but I would also suggest exporting last
checkpoint end, or NULL if the checkpoint is ongoing; and also previous-
to-last checkpoint start and end.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Expose checkpoint start/finish times into SQL.

From

Robert Treat

Date:

04 April 2008, 15:45:39

On Friday 04 April 2008 01:59, Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > On Thu, 3 Apr 2008, Tom Lane wrote:
> >> I'd much rather be spending our time and effort on understanding what
> >> broke for you, and fixing the code so it doesn't happen again.
> >
> > [ shit happens... ]
>
> Completely fair, but I still don't see how this particular patch would
> be a useful addition to the DBA's monitoring-tool arsenal.  The scope
> seems too narrow.
>

So, thinking about this for a few minutes, here are some of the things that
knowing the last checkpoint time might help a DBA determine.

1) Alert if checkpointing stops occuring within a reasonable time frame (note
there are failure cases and normal use cases where this might occur)  (also
note I'll agree, this isn't common, but the results are pretty disatrous if
it does happen)

2) Can be graphed over time (using rrdtool and others) for trending checkpoint
activity

3) Ease monitoring of checkpoints across pitr setups

4) Help determine if your checkpoints are being timeout driven or segment
driven, or if you need to look at those settings

5) Determine the number of log files that will need to be replayed in the
event of a crash

6) Helps give an indication on if you should enter a manual checkpoint before
issuing a pg_start_backup call

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Expose checkpoint start/finish times into SQL.

From

Robert Treat

Date:

04 April 2008, 15:46:05

On Friday 04 April 2008 03:14, Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > ... If they'd have noticed it while the server was up, perhaps because
> > the "last checkpoint" value hadn't changed in a long time (which seems
> > like it might be available via stats even if, as you say, the background
> > writer is out of its mind at that point), they could have done such a
> > kill and collected some actual useful information here.  That's the
> > theory at least.
>
> Well, mebbe, but that still seems to require a lot of custom monitoring
> infrastructure that is not present in this patch, and furthermore that
> this patch doesn't especially aid the development of.
>

This patch makes that a 1 line sql call, which can be plugged into any number
of monitoring systems.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Expose checkpoint start/finish times into SQL.

From

Gregory Stark

Date:

04 April 2008, 16:09:12

"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Tom Lane wrote:
>> Greg Smith <gsmith@gregsmith.com> writes:
>> > ... If they'd have noticed it while the server was up, perhaps because the
>> > "last checkpoint" value hadn't changed in a long time (which seems like it
>> > might be available via stats even if, as you say, the background writer is
>> > out of its mind at that point), they could have done such a kill and
>> > collected some actual useful information here.  That's the theory at
>> > least.
>>
>> Well, mebbe, but that still seems to require a lot of custom monitoring
>> infrastructure that is not present in this patch, and furthermore that
>> this patch doesn't especially aid the development of.
>
> These kind of things can be monitored externally very easily, say by
> Nagios, when the values are available via the database.  If you have to
> troll the logs, it's quite a bit harder to do it.

I can see Tom's reluctance out of fear that really this is going to be the
first of hundreds of dials which have to be monitored so a single function to
handle that single dial is kind of short sighted. I would think eventually it
should be part of the Postgres SNMP MIB.

But I would say from my experience on the not-really-a-sysadmin side I think
the time of the last checkpoint is probably the *most* important thing to be
monitoring. Effectively it's monitoring how stale your data on disk is
potentially becoming by showing how much recovery will be necessary.

> I'm not sure about the right values to export -- last checkpoint start
> time is the most obvious idea, but I would also suggest exporting last
> checkpoint end, or NULL if the checkpoint is ongoing; and also previous-
> to-last checkpoint start and end.

I'm surprised y'all want the time of the last checkpoint *start* though. It
seems to me that only the last checkpoint *end* is actually interesting. A
checkpoint which has started but not ended yet is not really a checkpoint at
all. It's nothing.

In the future there could be multiple checkpoints which have "started" but not
finished. Or we could add support for lazy checkpoints in which case there
could be an infinite number of "potential" checkpoints which haven't finished.

Worse, the time the last checkpoint actually finished isn't actually
interesting either. What you want to know is what time the record which the
last finished checkpoint *checkpointed up to* was inserted. That is, the time
of the record that the last checkpoint record *points to*.

That represents the "guarantee" that the database is making to the sysadmin
about data integrity. Everything before that time is guaranteed to have
reached data files already. Everything after it may or may not be in the data
files and has to be checked against the WAL logs.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: Expose checkpoint start/finish times into SQL.

From

Tom Lane

Date:

04 April 2008, 17:16:27

Alvaro Herrera <alvherre@commandprompt.com> writes:
> These kind of things can be monitored externally very easily, say by
> Nagios, when the values are available via the database.  If you have to
> troll the logs, it's quite a bit harder to do it.

> I'm not sure about the right values to export -- last checkpoint start
> time is the most obvious idea, but I would also suggest exporting last
> checkpoint end, or NULL if the checkpoint is ongoing; and also previous-
> to-last checkpoint start and end.

Any Nagios-style monitoring would have to watch checkpoint end (and
we'd better define that field as only updating at *successful*
checkpoint end).  Consider the case where some dirty buffer has a
persistent write failure condition.

I'm almost inclined to say that the patch shouldn't expose checkpoint
start at all, just to make sure people won't get this wrong.  We've
pretty thoroughly trashed the notion that looking at the interval is
helpful anyway.

            regards, tom lane

Re: Expose checkpoint start/finish times into SQL.

From

Simon Riggs

Date:

04 April 2008, 20:36:57

On Fri, 2008-04-04 at 02:21 -0400, Greg Smith wrote:

> Database stops checkpointing.  WAL files pile up.  In the middle of
> backup, system finally dies, and when it starts recovery there's a bad
> record in the WAL files--which there are now thousands of to apply, and
> the bad one is 4 hours of replay in.  Believe it or not, it goes downhill
> from there.
>
> It's what kicked off the first step that's the big mystery.  The only code
> path I thought of that can block checkpoints like this is when the
> archive_command isn't working anymore, and that wasn't being used.  Given
> some of the other corruption found later and the bad memory issues
> discovered, a bit flipping in the "do I need to checkpoint now?" code or
> data seems just as likely as any other explanation.

A few additional comments here:

If you set checkpoint_segments very, very high you can avoid a
checkpoint via checkpoint_timeout for up to 60 minutes. If you did this
for performance reasons presumably you've got lots of WAL files and
might end up with 1000s of them in that time period.

If you set it too high, you hit the disk limits first and can then crash
the server if the pg_xlog directory's physical limits are unluckily low
enough.

Starvation of the checkpoint start lock has been witnessed previously,
so if you're running 8.2 or previous that could be a possible
explanation here. What can happen is that a checkpoint is triggered yet
the bgwriter needs to wait to get access to the CheckpointStartLock. I
witnessed a starvation of 3 minutes once during testing a server running
at max velocity with 200 users, in 2006. I assumed that was an outlier,
but its possible for that to be longer. I wouldn't believe too much
longer, though. That was patched in 8.3 as a result.

Anyway, either of those factors, or their combination, plus a small
pg_xlog disk would be sufficient to explain the crash and wal file build
up.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Re: Expose checkpoint start/finish times into SQL.

From

Heikki Linnakangas

Date:

05 April 2008, 12:37:25

Robert Treat wrote:
> 1) Alert if checkpointing stops occuring within a reasonable time frame (note
> there are failure cases and normal use cases where this might occur)  (also
> note I'll agree, this isn't common, but the results are pretty disatrous if
> it does happen)

What are the normal use cases where this would occur? I can't think of any.

Wrt. failure cases, there's a million things that can go wrong in a
system, and only few of them will give the symptom of "checkpoints
stopped happening", so I'm not excited about adding a facility to
monitor just that.

More hooks for monitoring purposes in general would be nice, and I would
like to see them exposed as SQL functions, but I'd like to see a much
bigger proposal for that.

> 2) Can be graphed over time (using rrdtool and others) for trending checkpoint
> activity

Hmm. You'd need the historical data to do that properly. In particular,
if two checkpoints happen between the polling interval, you'd miss that.
log_checkpoints=on, in CSV output, seems like a better approach for that.

> 3) Ease monitoring of checkpoints across pitr setups

How is that different from monitoring in a non-pitr setup?

> 4) Help determine if your checkpoints are being timeout driven or segment
> driven, or if you need to look at those settings

Seems like the log_checkpoints output is more useful for that, as it
directly tells you what triggered the checkpoint.

> 5) Determine the number of log files that will need to be replayed in the
> event of a crash

If you have a requirement for that, just set checkpoint_segments
accordingly. And again, you can get the information in the logs by
log_checkpoints=on already.

> 6) Helps give an indication on if you should enter a manual checkpoint before
> issuing a pg_start_backup call

If you want the backup to begin immediately, just do a manual checkpoint
unconditionally. It'll finish quickly if there hasn't been much activity
since the last one. We talked about adding a new variant of
pg_start_backup() that does that automatically (or rather, just hurries
the current checkpoint) in the 8.3 cycle, but didn't do it in the end.
Perhaps we should add that, after all?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Expose checkpoint start/finish times into SQL.

From

"Joshua D. Drake"

Date:

05 April 2008, 12:42:08

On Sat, 05 Apr 2008 16:37:15 +0100
Heikki Linnakangas <heikki@enterprisedb.com> wrote:


May I just say that every person that is currently talking on this
thread is offtopic? Move it to -hackers please.

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

Re: Expose checkpoint start/finish times into SQL.

From

Alvaro Herrera

Date:

05 April 2008, 15:00:33

Heikki Linnakangas wrote:
> Robert Treat wrote:

>> 2) Can be graphed over time (using rrdtool and others) for trending
>> checkpoint activity
>
> Hmm. You'd need the historical data to do that properly. In particular,
> if two checkpoints happen between the polling interval, you'd miss that.

Yes, Munin and other tools collect the values over time to create the
graph.  It's not like "top" saves historical system load data either,
yet the collector is able to present you with a graph.

> log_checkpoints=on, in CSV output, seems like a better approach for that.

It's not, because the tool would have a much harder time trolling the
logs.  CSV does not really make it much easier -- the DBA is still
tasked with importing the thing after the current file is complete; and
you don't have an up-to-date picture either.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support