Thread: Logging checkpoints and other slowdown causes

Logging checkpoints and other slowdown causes

From: Greg Smith

This patch puts entries in the log file around some backend events that
regularly cause client pauses, most of which are visible only when running
with log_min_messages = debug2.  The main focus is regularly occurring
checkpoints, from either running out of segments or hitting a timeout.
This is an example of what gets written to the logs at checkpoint time:

DEBUG2:  checkpoint required (timeout passed)
DEBUG2:  checkpoint starting
DEBUG2:  checkpoint flushing buffer pool
DEBUG2:  checkpoint database fsync starting
DEBUG2:  checkpoint database fsync complete
DEBUG1:  checkpoint buffers dirty=16.7 MB (13.0%) write=174.8 ms sync=3716.5 ms
DEBUG2:  checkpoint complete; 0 transaction log file(s) added, 0 removed, 0 recycled

The idea here is that if you want to really watch what your system is
doing, either to track down a problem or for benchmarking, you log at
DEBUG2 and timestamp the logs with something like log_line_prefix =
'[%m]:'.  Then line up your timestamped transactions with the logs to
figure out what happened during bad periods.  This is very easy to do with
the timestamped pgbench latency logs for example, so you can put them in
time sequence order and see something like "oh, those long transactions
all came in between the fsync start and end".  The summary line with the
write/sync timing information has a broader use and suggests when you need
to go deeper, which is why I put that one at DEBUG1.
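
To spell that out, the whole setup in postgresql.conf is just the two
settings already mentioned:

    log_min_messages = debug2
    log_line_prefix = '[%m]:'

(%m gives a timestamp with milliseconds, which is what makes the
fine-grained matching against the pgbench latency log possible.)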

I also adjust some other log messages to make it easier to run the system
at DEBUG2 without going completely crazy, like moving individual WAL
segment recycles to DEBUG3 and standardizing the format of the fork/exit
messages so they're easier to filter out.

My main concern about this code is how it adds several gettimeofday calls
and complicated string building inside FlushBufferPool, all of which is
wasted if this data isn't actually going to get logged.  I'd like to have
something that tells me whether I should even bother, but I didn't want to
poke into the elog variables for fear of adding a dependency on their
current structure here.  Ideally, it would be nice to call something
like WillLog(DEBUG2) and get a boolean back saying whether something at
that log level will be output anywhere; I don't know enough about the
logging code to add such a thing to it myself.
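
To illustrate, the shape I'm imagining is something like this; the name
and the direct comparison against the logging GUCs are made up for
illustration, not the actual elog internals (the real code also treats
LOG specially when ranking levels for server vs. client output):

    /* Hypothetical helper, not the real elog API: true if a message
     * at elevel would be output to at least one destination */
    static bool
    WillLog(int elevel)
    {
        return (elevel >= log_min_messages ||
                elevel >= client_min_messages);
    }

With something like that available, all the gettimeofday calls and
string building could be wrapped in an if (WillLog(DEBUG2)) block.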

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Logging checkpoints and other slowdown causes

From: Bruce Momjian

Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Greg Smith wrote:
> This patch puts entries in the log file around some backend events that
> regularly cause client pauses, most of which are visible only when running
> with log_min_messages = debug2.  The main focus is regularly occurring
> checkpoints, from either running out of segments or hitting a timeout.
> [...]

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Logging checkpoints and other slowdown causes

From: Peter Eisentraut

Greg Smith wrote:
> This patch puts entries in the log file around some backend events
> that regularly cause client pauses, most of which are visible only
> when running with log_min_messages = debug2.

Something that is aimed at a user should not be enabled at a "debug"
level.  Debug levels are for debugging, not for very high verbosity.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Logging checkpoints and other slowdown causes

From: Greg Smith

On Tue, 3 Apr 2007, Peter Eisentraut wrote:

> Something that is aimed at a user should not be enabled at a "debug"
> level.  Debug levels are for debugging, not for very high verbosity.

I asked for feedback about what level to log at when I initially sent in
the first version of this, and didn't hear anything back on that part, so
I kept these in line with other log messages I saw.  The messages for when
checkpoints start and stop were both logged at DEBUG2, so I put progress
reports on the other significant phases of the process there as well.

I don't expect these messages will be helpful for a normal user--that's
what the new data in pg_stat_bgwriter is for.  The main purpose of this
patch is debugging checkpoint-related performance issues at a level I'd
expect only a developer to work at; they're also helpful for someone
writing benchmark code.

There are several patches in progress floating around that aim to adjust
either the background writer or the checkpoint process to reduce the
impact of checkpoints.  This logging makes it possible to grade their
success at that.  As my tests with this patch in place suggest the
problem is far from solved by any of the current proposals, I'd like to
get other developers looking at it the same way I have been; that's why
I'd like to see some standardization on how checkpoints are instrumented.
The fact that really advanced users might also use this for
troubleshooting I consider a bonus rather than the main focus here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Logging checkpoints and other slowdown causes

From: Heikki Linnakangas

Greg Smith wrote:
> On Tue, 3 Apr 2007, Peter Eisentraut wrote:
>
>> Something that is aimed at a user should not be enabled at a "debug"
>> level.  Debug levels are for debugging, not for very high verbosity.
>
> I asked for feedback about what level to log at when I initially sent in
> the first version of this, and didn't hear anything back on that part, so
> I kept these in line with other log messages I saw.  The messages for
> when checkpoints start and stop were both logged at DEBUG2, so I put
> progress reports on the other significant phases of the process there as
> well.

I agree that debug levels are not suitable for this.

I'm thinking of INFO, NOTICE or LOG. The user manual says about LOG:

LOG

     Reports information of interest to administrators, e.g., checkpoint
activity.

But looking at the code, all the checkpoint-related messages are at DEBUG
levels; nothing gets printed at LOG level.  Printing the messages at LOG
level would bring the code in line with the documentation, but I
don't think we want to fill the log with checkpoint chatter unless the
DBA explicitly asks for that.

How about INFO?  It seems like the best level for information on normal
activity of the system.  The documentation, however, has this to say about it:

INFO

     Provides information implicitly requested by the user, e.g., during
VACUUM VERBOSE.

We should adjust the documentation, but INFO seems like the best level
to me.  Or we could add a GUC variable similar to log_connections or
log_statement to control whether the messages are printed, and use LOG.

> I don't expect these messages will be helpful for a normal user--that's
> what the new data in pg_stat_bgwriter is for.  The main purpose of this
> patch is debugging checkpoint-related performance issues at a level
> I'd expect only a developer to work at; they're also helpful for someone
> writing benchmark code.

I disagree.  They would be helpful to an administrator chasing down
checkpoint-related problems: checkpoints taking too long, checkpoints
occurring too often (though we already have checkpoint_warning for that),
or identifying whether transient performance problems that users complain
about coincide with checkpoints.  And I, at least, like having messages
like that in the log just to reassure me that everything is going OK.

> There are several patches in progress floating around that aim to adjust
> either the background writer or the checkpoint process to reduce the
> impact of checkpoints.  This logging makes it possible to grade their
> success at that.  As my tests with this patch in place suggest the
> problem is far from solved by any of the current proposals, I'd like to
> get other developers looking at it the same way I have been; that's why
> I'd like to see some standardization on how checkpoints are instrumented.
> The fact that really advanced users might also use this for
> troubleshooting I consider a bonus rather than the main focus here.

Agreed.


Looking at the patch, I think we want one line in the log when a
checkpoint starts, one when buffer flushing starts (or maybe not, if we
assume that checkpointing clog, subtrans and multixact doesn't take long),
one when the sync phase starts, and one when the checkpoint is done.  We
don't need to print the time elapsed in each phase on a separate
line--that's just information derived from the other lines--unless we
use different log levels for the detail lines (like you did in your patch).

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Logging checkpoints and other slowdown causes

From: "Guillaume Smet"

On 5/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> I agree that debug levels are not suitable for this.

Squid (the (reverse) proxy) has a nice feature for debugging purposes,
for both admins and developers.  It's possible to have different logging
levels for different sections of the code.  For example, if you want to
debug a particular part, you can have ALL,1 33,2 in your config file,
which gives you more verbose debugging of section 33 of the code (33 is
ACL).
It's really useful because you can even debug a particular code path
or a misconfiguration in production without generating a huge amount
of logs.

--
Guillaume

Re: Logging checkpoints and other slowdown causes

From: Heikki Linnakangas

Guillaume Smet wrote:
> On 5/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>> I agree that debug levels are not suitable for this.
>
> Squid (the (reverse) proxy) has a nice feature for debugging purposes,
> for both admins and developers.  It's possible to have different logging
> levels for different sections of the code.  For example, if you want to
> debug a particular part, you can have ALL,1 33,2 in your config file,
> which gives you more verbose debugging of section 33 of the code (33 is
> ACL).
> It's really useful because you can even debug a particular code path
> or a misconfiguration in production without generating a huge amount
> of logs.

Yes, that'd be really nice, though using numeric codes is a bit archaic.
In the Java world, a logging framework called log4j has become the de
facto standard for logging.  It's basically the same idea: you have
different log levels, and you can enable different levels of logging for
different modules.  I'm sure there are similar libraries for C; that's
something to think about for future releases.

8.3 will add some new log_* GUC variables, log_autovacuum and
log_temp_files, so it seems we're headed in that direction already.  I'm
now thinking that's the approach we should take for checkpoints as well:
add a new GUC variable, log_checkpoints, and print the messages at
LOG level if that's enabled.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Logging checkpoints and other slowdown causes

From: Greg Smith

On Fri, 11 May 2007, Heikki Linnakangas wrote:

> Printing the messages at LOG level would bring the code in line with the
> documentation, but I don't think we want to fill the log with checkpoint
> chatter unless the DBA explicitly asks for that.  How about INFO?

In putting that together, I pushed everything down to the lowest DEBUG
level it could run at without the messages I wanted to see being
completely lost.  I didn't want this to get rejected just on the basis of
being too chatty for regular users.  A single line for each checkpoint at
INFO, with all the individual progress points at DEBUG1 or 2, would be
close to optimal as far as I'm concerned.  More on this below.

> Or we could add a GUC variable similar to log_connections or log_statement
> to control whether the messages are printed, and use LOG.

If the levels for the individual messages are adjusted usefully, the
ability to control whether they show up falls out of the existing log
level settings; I haven't ever felt a strong need for a GUC when using
this code.  If, as you say, there's already a move in this direction,
then fine--another GUC would be better.

> We don't need to print the time elapsed in each phase on a separate
> line--that's just information derived from the other lines--unless we
> use different log levels for the detail lines

Let me turn this around for a second and ask you this:  what do you think
people are going to do with this information?  I can tell you what I do.
I parse the logs and look for that single line with the summary
information.  I then take the three pieces of information it provides (MB
written, time to write, time to fsync) and save them into a table.  From
there I generate stats, scatter plots, all sorts of useful stuff.

If you know when the checkpoint ended, and you know how long each of the
pieces took, you can reconstruct the other times easily.  What you
describe is true--the summary is redundant given the detail--but if you
put yourself in the shoes of a log file parser, the other way around is
easier to work with.  Piecing together log entries is a pain; splitting
them is easy.

If I had to only keep one line out of this, it would be the one with the
summary.  It would be nice to have it logged at INFO.
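
To show what I mean about splitting being easy, here's a toy standalone
parser for the summary line; the format string matches the DEBUG1 line
quoted at the top of the thread, and everything else is just illustration:

    #include <stdio.h>

    int
    main(void)
    {
        const char *line =
            "checkpoint buffers dirty=16.7 MB (13.0%) "
            "write=174.8 ms sync=3716.5 ms";
        double      mb, pct, write_ms, sync_ms;

        /* one sscanf pulls out the three numbers I save into a table */
        if (sscanf(line,
                   "checkpoint buffers dirty=%lf MB (%lf%%) "
                   "write=%lf ms sync=%lf ms",
                   &mb, &pct, &write_ms, &sync_ms) == 4)
            printf("%.1f MB, write %.1f ms, sync %.1f ms\n",
                   mb, write_ms, sync_ms);
        return 0;
    }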

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Logging checkpoints and other slowdown causes

From: Heikki Linnakangas

Greg Smith wrote:
> On Fri, 11 May 2007, Heikki Linnakangas wrote:
>> Or we could add a GUC variable similar to log_connections or
>> log_statement to control whether the messages are printed, and use LOG.
>
> If the levels for the individual messages are adjusted usefully, the
> ability to control whether they show up falls out of the existing log
> level settings; I haven't ever felt a strong need for a GUC when using
> this code.  If, as you say, there's already a move in this direction,
> then fine--another GUC would be better.

The need for a GUC becomes apparent when you start adding logging to
other modules as well, like we did for autovacuum a while back.  If there
weren't a separate log_autovacuum parameter, you'd have to log both
checkpoints and autovacuum actions, or neither.

>> We don't need to print the time elapsed in each phase on a separate
>> line--that's just information derived from the other lines--unless we
>> use different log levels for the detail lines
>
> Let me turn this around for a second and ask you this:  what do you
> think people are going to do with this information?  I can tell you what
> I do. I parse the logs and look for that single line with the summary
> information.  I then take the three pieces of information it provides
> (MB written, time to write, time to fsync) and save them into a table.
>  From there I generate stats, scatter plots, all sorts of useful stuff.
>
> If you know when the checkpoint ended, and you know how long each of the
> pieces took, you can reconstruct the other times easily.  What you
> describe is true--the summary is redundant given the detail--but if you
> put yourself in the shoes of a log file parser, the other way around is
> easier to work with.  Piecing together log entries is a pain; splitting
> them is easy.
>
> If I had to only keep one line out of this, it would be the one with the
> summary.  It would be nice to have it logged at INFO.

Yeah, if we have the summary line we don't need the other lines, and vice
versa.  I have sympathy for parsing log files--I've done a lot of that in
the past--and I can see what you mean.  Having the individual lines is
nice when you're monitoring a running system; you don't get the summary
line until the checkpoint is finished.  I suppose we can have both the
individual lines and the summary; the extra lines shouldn't hurt anyone,
and you won't get them unless you turn on the new log_checkpoints
parameter anyway.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Performance monitoring (was: Logging checkpoints and other slowdown causes)

From: "Jim C. Nasby"

Moving to -hackers.

On Fri, May 11, 2007 at 04:37:44PM +0100, Heikki Linnakangas wrote:
> >If you know when the checkpoint ended, and you know how long each of the
> >pieces took, you can reconstruct the other times easily.  What you
> >describe is true--the summary is redundant given the detail--but if you
> >put yourself in the shoes of a log file parser, the other way around is
> >easier to work with.  Piecing together log entries is a pain; splitting
> >them is easy.
> >
> >If I had to only keep one line out of this, it would be the one with the
> >summary.  It would be nice to have it logged at INFO.
>
> Yeah, if we have the summary line we don't need the other lines, and vice
> versa.  I have sympathy for parsing log files--I've done a lot of that in
> the past--and I can see what you mean.  Having the individual lines is
> nice when you're monitoring a running system; you don't get the summary
> line until the checkpoint is finished.  I suppose we can have both the
> individual lines and the summary; the extra lines shouldn't hurt anyone,
> and you won't get them unless you turn on the new log_checkpoints
> parameter anyway.

Not to beat a dead horse, but do we really want to force folks to parse
logs for performance monitoring?  Especially if that log parsing is just
going to result in the data being inserted into a table anyway?

I know there's concern about the performance of the stats system, and
maybe that needs to be addressed, but pushing users to log parsing is a
lot of extra effort, non-standard, likely to be overlooked, and doesn't
play well with other tools.  It is also at odds with the existing
statistics framework.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Logging checkpoints and other slowdown causes

From: Heikki Linnakangas

Here's a work-in-progress updated version of this patch.  It's not
complete, but I've settled on:

- a log_checkpoints GUC variable to control whether anything is printed
or not. This is in line with log_connections, log_autovacuum etc.

- when checkpoints are logged, LOG level is used

- the following log output:

LOG:  checkpoint starting (CHECKPOINT command)
LOG:  checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms sync=32.0 ms

where "checkpoint starting" is obviously printed when checkpoint starts,
and "checkpoint complete" when the checkpoint is fully complete.

- all information on the progress of the checkpoint is collected into a
global struct. A global struct sounds a bit ugly, but it avoids the
ugliness of littering all the functions with multiple return values to
pass back information to the caller that's only used for printing. This
makes it easy to add more data points. MoveOfflineLogs should probably be
changed to use it as well for the sake of consistency, instead of
returning nsegsremoved and nsegsrecycled through pointers.

- RequestCheckpoint and CreateCheckpoint were changed to take a
CheckpointReason enum as argument, instead of boolean flags. This allows
printing a different message depending on the cause of the checkpoint,
and is also cleaner IMHO.

I'm posting now to allow others, and Greg Smith in particular, to review
the changes thus far and take over from here.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c    30 Apr 2007 21:01:52 -0000    1.268
--- src/backend/access/transam/xlog.c    20 May 2007 15:21:49 -0000
***************
*** 71,76 ****
--- 71,78 ----
  bool        XLOG_DEBUG = false;
  #endif

+ bool log_checkpoints = false;
+
  /*
   * XLOGfileslop is used in the code as the allowed "fuzz" in the number of
   * preallocated XLOG segments --- we try to have at least XLOGfiles advance
***************
*** 120,125 ****
--- 122,130 ----
  static TimestampTz recoveryStopTime;
  static bool recoveryStopAfter;

+ CheckpointProgressData checkpointProgress;
+
+
  /*
   * During normal operation, the only timeline we care about is ThisTimeLineID.
   * During recovery, however, things are more complicated.  To simplify life
***************
*** 1595,1601 ****
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestCheckpoint(false, true);
                      }
                  }
              }
--- 1600,1606 ----
                          if (XLOG_DEBUG)
                              elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!                         RequestCheckpoint(false, CHKP_SEGMENTS);
                      }
                  }
              }
***************
*** 5067,5073 ****
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(true, true);

          /*
           * Close down recovery environment
--- 5072,5078 ----
           * the rule that TLI only changes in shutdown checkpoints, which
           * allows some extra error checking in xlog_redo.
           */
!         CreateCheckPoint(CHKP_SHUTDOWN); /* XXX startup rather than shutdown... */

          /*
           * Close down recovery environment
***************
*** 5340,5346 ****
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(true, true);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
--- 5345,5351 ----
      ereport(LOG,
              (errmsg("shutting down")));

!     CreateCheckPoint(CHKP_SHUTDOWN);
      ShutdownCLOG();
      ShutdownSUBTRANS();
      ShutdownMultiXact();
***************
*** 5356,5362 ****
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool force)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
--- 5361,5367 ----
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(CheckpointReason reason)
  {
      CheckPoint    checkPoint;
      XLogRecPtr    recptr;
***************
*** 5370,5375 ****
--- 5375,5435 ----
      int            nsegsrecycled = 0;
      TransactionId *inCommitXids;
      int            nInCommit;
+     struct timeval start_t;
+     int            elapsed_secs;
+
+     /* Log beginning of checkpoint */
+
+     gettimeofday(&start_t, NULL);
+
+     if (log_checkpoints)
+     {
+         switch(reason)
+         {
+             case CHKP_SHUTDOWN:
+                 /* ereport(LOG, (errmsg("checkpoint starting (shutdown)"))); */
+                 break;
+             case CHKP_COMMAND:
+                 ereport(LOG, (errmsg("checkpoint starting (CHECKPOINT command)")));
+                 break;
+             case CHKP_BACKUP:
+                 ereport(LOG, (errmsg("checkpoint starting (starting backup)")));
+                 break;
+             case CHKP_TIMEOUT:
+                 ereport(LOG, (errmsg("checkpoint starting (checkpoint_timeout reached)")));
+                 break;
+             case CHKP_SEGMENTS:
+                 ereport(LOG, (errmsg("checkpoint starting (checkpoint_segments reached)")));
+                 break;
+             case CHKP_CREATEDB:
+                 ereport(LOG, (errmsg("checkpoint starting (creating database)")));
+                 break;
+             case CHKP_DROPDB:
+                 ereport(LOG, (errmsg("checkpoint starting (dropping database)")));
+                 break;
+             default:
+                 elog(WARNING, "unexpected checkpoint reason %d", reason);
+                 break;
+         }
+     }
+
+     /*
+      * We will warn if (a) too soon since last checkpoint (whatever
+      * caused it) and (b) somebody has set the ckpt_time_warn flag
+      * since the last checkpoint start.  Note in particular that this
+      * implementation will not generate warnings caused by
+      * CheckPointTimeout < CheckPointWarning.
+      */
+     elapsed_secs = start_t.tv_sec - checkpointProgress.cs_start_t.tv_sec;
+     if (reason == CHKP_SEGMENTS && elapsed_secs < CheckPointWarning)
+         ereport(LOG,
+                 (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
+                         elapsed_secs),
+                  errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+
+
+     checkpointProgress.cs_reason = reason;
+     checkpointProgress.cs_start_t = start_t;

      /*
       * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
***************
*** 5384,5390 ****
       */
      START_CRIT_SECTION();

!     if (shutdown)
      {
          ControlFile->state = DB_SHUTDOWNING;
          ControlFile->time = time(NULL);
--- 5444,5450 ----
       */
      START_CRIT_SECTION();

!     if (reason == CHKP_SHUTDOWN)
      {
          ControlFile->state = DB_SHUTDOWNING;
          ControlFile->time = time(NULL);
***************
*** 5416,5422 ****
       * the end of the last checkpoint record, and its redo pointer must point
       * to itself.
       */
!     if (!shutdown && !force)
      {
          XLogRecPtr    curInsert;

--- 5476,5482 ----
       * the end of the last checkpoint record, and its redo pointer must point
       * to itself.
       */
!     if (reason == CHKP_TIMEOUT) /* XXX is the condition same as on head? */
      {
          XLogRecPtr    curInsert;

***************
*** 5479,5488 ****
       */
      LWLockRelease(WALInsertLock);

-     if (!shutdown)
-         ereport(DEBUG2,
-                 (errmsg("checkpoint starting")));
-
      /*
       * Before flushing data, we must wait for any transactions that are
       * currently in their commit critical sections.  If an xact inserted its
--- 5539,5544 ----
***************
*** 5530,5540 ****

      LWLockAcquire(OidGenLock, LW_SHARED);
      checkPoint.nextOid = ShmemVariableCache->nextOid;
!     if (!shutdown)
          checkPoint.nextOid += ShmemVariableCache->oidCount;
      LWLockRelease(OidGenLock);

!     MultiXactGetCheckptMulti(shutdown,
                               &checkPoint.nextMulti,
                               &checkPoint.nextMultiOffset);

--- 5586,5596 ----

      LWLockAcquire(OidGenLock, LW_SHARED);
      checkPoint.nextOid = ShmemVariableCache->nextOid;
!     if (reason != CHKP_SHUTDOWN)
          checkPoint.nextOid += ShmemVariableCache->oidCount;
      LWLockRelease(OidGenLock);

!     MultiXactGetCheckptMulti(reason == CHKP_SHUTDOWN,
                               &checkPoint.nextMulti,
                               &checkPoint.nextMultiOffset);

***************
*** 5561,5567 ****
      rdata.next = NULL;

      recptr = XLogInsert(RM_XLOG_ID,
!                         shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
                          XLOG_CHECKPOINT_ONLINE,
                          &rdata);

--- 5617,5623 ----
      rdata.next = NULL;

      recptr = XLogInsert(RM_XLOG_ID,
!                         (reason == CHKP_SHUTDOWN) ? XLOG_CHECKPOINT_SHUTDOWN :
                          XLOG_CHECKPOINT_ONLINE,
                          &rdata);

***************
*** 5571,5577 ****
       * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
       * = end of actual checkpoint record.
       */
!     if (shutdown && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
          ereport(PANIC,
                  (errmsg("concurrent transaction log activity while database system is shutting down")));

--- 5627,5633 ----
       * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
       * = end of actual checkpoint record.
       */
!     if (reason == CHKP_SHUTDOWN && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
          ereport(PANIC,
                  (errmsg("concurrent transaction log activity while database system is shutting down")));

***************
*** 5585,5591 ****
       * Update the control file.
       */
      LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
!     if (shutdown)
          ControlFile->state = DB_SHUTDOWNED;
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
      ControlFile->checkPoint = ProcLastRecPtr;
--- 5641,5647 ----
       * Update the control file.
       */
      LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
!     if (reason == CHKP_SHUTDOWN)
          ControlFile->state = DB_SHUTDOWNED;
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
      ControlFile->checkPoint = ProcLastRecPtr;
***************
*** 5626,5634 ****
       * Make more log segments if needed.  (Do this after deleting offline log
       * segments, to avoid having peak disk space usage higher than necessary.)
       */
!     if (!shutdown)
          nsegsadded = PreallocXlogFiles(recptr);

      /*
       * Truncate pg_subtrans if possible.  We can throw away all data before
       * the oldest XMIN of any running transaction.    No future transaction will
--- 5682,5694 ----
       * Make more log segments if needed.  (Do this after deleting offline log
       * segments, to avoid having peak disk space usage higher than necessary.)
       */
!     if (reason != CHKP_SHUTDOWN)
          nsegsadded = PreallocXlogFiles(recptr);

+     ereport(DEBUG2,
+             (errmsg("checkpoint: %d transaction log file(s) added, %d removed, %d recycled",
+                     nsegsadded, nsegsremoved, nsegsrecycled)));
+
      /*
       * Truncate pg_subtrans if possible.  We can throw away all data before
       * the oldest XMIN of any running transaction.    No future transaction will
***************
*** 5639,5650 ****
      if (!InRecovery)
          TruncateSUBTRANS(GetOldestXmin(true, false));

-     if (!shutdown)
-         ereport(DEBUG2,
-                 (errmsg("checkpoint complete; %d transaction log file(s) added, %d removed, %d recycled",
-                         nsegsadded, nsegsremoved, nsegsrecycled)));
-
      LWLockRelease(CheckpointLock);
  }

  /*
--- 5699,5725 ----
      if (!InRecovery)
          TruncateSUBTRANS(GetOldestXmin(true, false));

      LWLockRelease(CheckpointLock);
+
+     /* Checkpoint complete. All we have left is updating and printing stats */
+
+     gettimeofday(&checkpointProgress.cs_end_t, NULL);
+     if (log_checkpoints)
+     {
+         long write_usecs;
+         long sync_usecs;
+
+         write_usecs = 1000000 * (long) (checkpointProgress.cs_sync_t.tv_sec - checkpointProgress.cs_start_t.tv_sec) +
+         (long) (checkpointProgress.cs_sync_t.tv_usec - checkpointProgress.cs_start_t.tv_usec);
+         sync_usecs = 1000000 * (long) (checkpointProgress.cs_end_t.tv_sec - checkpointProgress.cs_sync_t.tv_sec) +
+         (long) (checkpointProgress.cs_end_t.tv_usec - checkpointProgress.cs_sync_t.tv_usec);
+
+         ereport(LOG,(errmsg(
+             "checkpoint complete; buffers written=%.1f MB (%.1f%%) write=%.1f ms sync=%.1f ms",
+             (float) (BLCKSZ * checkpointProgress.cs_buffers_written) / (1024*1024),
+             (float) 100 * checkpointProgress.cs_buffers_written / NBuffers,
+         (float) write_usecs / 1000, (float) sync_usecs / 1000)));
+     }
  }

  /*
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c    7 Mar 2007 13:35:02 -0000    1.233
--- src/backend/bootstrap/bootstrap.c    20 May 2007 10:03:27 -0000
***************
*** 489,495 ****

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(true, true);

      /* Clean up and exit */
      cleanup();
--- 489,495 ----

      /* Perform a checkpoint to ensure everything's down to disk */
      SetProcessingMode(NormalProcessing);
!     CreateCheckPoint(CHKP_SHUTDOWN);

      /* Clean up and exit */
      cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.194
diff -c -r1.194 dbcommands.c
*** src/backend/commands/dbcommands.c    12 Apr 2007 15:04:35 -0000    1.194
--- src/backend/commands/dbcommands.c    20 May 2007 10:04:27 -0000
***************
*** 503,509 ****
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestCheckpoint(true, false);

          /*
           * Close pg_database, but keep lock till commit (this is important to
--- 503,509 ----
           * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
           * we can avoid this.
           */
!         RequestCheckpoint(true, CHKP_CREATEDB);

          /*
           * Close pg_database, but keep lock till commit (this is important to
***************
*** 656,662 ****
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestCheckpoint(true, false);
  #endif

      /*
--- 656,662 ----
       * open files, which would cause rmdir() to fail.
       */
  #ifdef WIN32
!     RequestCheckpoint(true, CHKP_DROPDB);
  #endif

      /*
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.37
diff -c -r1.37 bgwriter.c
*** src/backend/postmaster/bgwriter.c    30 Mar 2007 18:34:55 -0000    1.37
--- src/backend/postmaster/bgwriter.c    20 May 2007 15:31:30 -0000
***************
*** 116,122 ****
      sig_atomic_t ckpt_done;        /* advances when checkpoint done */
      sig_atomic_t ckpt_failed;    /* advances when checkpoint fails */

!     sig_atomic_t ckpt_time_warn;    /* warn if too soon since last ckpt? */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
--- 116,122 ----
      sig_atomic_t ckpt_done;        /* advances when checkpoint done */
      sig_atomic_t ckpt_failed;    /* advances when checkpoint fails */

!     sig_atomic_t ckpt_reason;    /* XXX */

      int            num_requests;    /* current # of requests */
      int            max_requests;    /* allocated array size */
***************
*** 340,349 ****
      for (;;)
      {
          bool        do_checkpoint = false;
!         bool        force_checkpoint = false;
          time_t        now;
          int            elapsed_secs;
          long        udelay;

          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
--- 340,350 ----
      for (;;)
      {
          bool        do_checkpoint = false;
!         CheckpointReason reason = 0; /* init to keep compiler happy */
          time_t        now;
          int            elapsed_secs;
          long        udelay;
+

          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
***************
*** 366,372 ****
          {
              checkpoint_requested = false;
              do_checkpoint = true;
!             force_checkpoint = true;
              BgWriterStats.m_requested_checkpoints++;
          }
          if (shutdown_requested)
--- 367,373 ----
          {
              checkpoint_requested = false;
              do_checkpoint = true;
!             reason = BgWriterShmem->ckpt_reason;
              BgWriterStats.m_requested_checkpoints++;
          }
          if (shutdown_requested)
***************
*** 389,399 ****
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (elapsed_secs >= CheckPointTimeout)
          {
!             do_checkpoint = true;
!             if (!force_checkpoint)
                  BgWriterStats.m_timed_checkpoints++;
          }

          /*
--- 390,404 ----
           */
          now = time(NULL);
          elapsed_secs = now - last_checkpoint_time;
!         if (!do_checkpoint)
          {
!             elapsed_secs = now - last_checkpoint_time;
!             if (elapsed_secs >= CheckPointTimeout)
!             {
!                 do_checkpoint = true;
!                 reason = CHKP_TIMEOUT;
                  BgWriterStats.m_timed_checkpoints++;
+             }
          }

          /*
***************
*** 404,421 ****
          {
              /*
               * We will warn if (a) too soon since last checkpoint (whatever
!              * caused it) and (b) somebody has set the ckpt_time_warn flag
!              * since the last checkpoint start.  Note in particular that this
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (BgWriterShmem->ckpt_time_warn &&
!                 elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
-             BgWriterShmem->ckpt_time_warn = false;

              /*
               * Indicate checkpoint start to any waiting backends.
--- 409,424 ----
          {
              /*
               * We will warn if (a) too soon since last checkpoint (whatever
!              * caused it) and (b) checkpoint was initiated by filling WAL
!              * segments.  Note in particular that this
               * implementation will not generate warnings caused by
               * CheckPointTimeout < CheckPointWarning.
               */
!             if (reason == CHKP_SEGMENTS && elapsed_secs < CheckPointWarning)
                  ereport(LOG,
                          (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                                  elapsed_secs),
                           errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));

              /*
               * Indicate checkpoint start to any waiting backends.
***************
*** 423,429 ****
              ckpt_active = true;
              BgWriterShmem->ckpt_started++;

!             CreateCheckPoint(false, force_checkpoint);

              /*
               * After any checkpoint, close all smgr files.    This is so we
--- 426,432 ----
              ckpt_active = true;
              BgWriterShmem->ckpt_started++;

!             CreateCheckPoint(reason);

              /*
               * After any checkpoint, close all smgr files.    This is so we
***************
*** 637,648 ****
   * before returning; otherwise, just signal the request and return
   * immediately.
   *
!  * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
  void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
--- 640,651 ----
   * before returning; otherwise, just signal the request and return
   * immediately.
   *
!  * XXX If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
  void
! RequestCheckpoint(bool waitforit, CheckpointReason reason)
  {
      /* use volatile pointer to prevent code rearrangement */
      volatile BgWriterShmemStruct *bgs = BgWriterShmem;
***************
*** 654,660 ****
       */
      if (!IsPostmasterEnvironment)
      {
!         CreateCheckPoint(false, true);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
--- 657,663 ----
       */
      if (!IsPostmasterEnvironment)
      {
!         CreateCheckPoint(reason);

          /*
           * After any checkpoint, close all smgr files.    This is so we won't
***************
*** 666,673 ****
      }

      /* Set warning request flag if appropriate */
!     if (warnontime)
!         bgs->ckpt_time_warn = true;

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
--- 669,675 ----
      }

      /* Set warning request flag if appropriate */
!     bgs->ckpt_reason = reason;

      /*
       * Send signal to request checkpoint.  When waitforit is false, we
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.218
diff -c -r1.218 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    2 May 2007 23:34:48 -0000    1.218
--- src/backend/storage/buffer/bufmgr.c    20 May 2007 16:09:02 -0000
***************
*** 43,48 ****
--- 43,49 ----
  #include "postgres.h"

  #include <sys/file.h>
+ #include <sys/time.h>
  #include <unistd.h>

  #include "miscadmin.h"
***************
*** 1006,1011 ****
--- 1007,1013 ----
          if (SyncOneBuffer(buf_id, false))
          {
              BgWriterStats.m_buf_written_checkpoints++;
+             checkpointProgress.cs_buffers_written++;

              /*
               * If in bgwriter, absorb pending fsync requests after each
***************
*** 1323,1330 ****
--- 1325,1339 ----
  void
  FlushBufferPool(void)
  {
+     checkpointProgress.cs_buffers_written = 0;
+
      BufferSync();
+
+     gettimeofday(&checkpointProgress.cs_write_t, NULL);
+
      smgrsync();
+
+     gettimeofday(&checkpointProgress.cs_sync_t, NULL);
  }


Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.279
diff -c -r1.279 utility.c
*** src/backend/tcop/utility.c    27 Apr 2007 22:05:49 -0000    1.279
--- src/backend/tcop/utility.c    20 May 2007 10:22:16 -0000
***************
*** 1089,1095 ****
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestCheckpoint(true, false);
              break;

          case T_ReindexStmt:
--- 1089,1095 ----
                  ereport(ERROR,
                          (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                           errmsg("must be superuser to do CHECKPOINT")));
!             RequestCheckpoint(true, CHKP_COMMAND);
              break;

          case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.391
diff -c -r1.391 guc.c
*** src/backend/utils/misc/guc.c    8 May 2007 16:33:51 -0000    1.391
--- src/backend/utils/misc/guc.c    20 May 2007 10:11:45 -0000
***************
*** 104,109 ****
--- 104,110 ----
  extern int    CommitSiblings;
  extern char *default_tablespace;
  extern bool fullPageWrites;
+ extern bool log_checkpoints;

  #ifdef TRACE_SORT
  extern bool trace_sort;
***************
*** 596,601 ****
--- 597,610 ----
          false, NULL, NULL
      },
      {
+         {"log_checkpoints", PGC_BACKEND, LOGGING_WHAT,
+             gettext_noop("Logs beginning and end of checkpoints, and some stats XXX."),
+             NULL
+         },
+         &log_checkpoints,
+         false, NULL, NULL
+     },
+     {
          {"debug_assertions", PGC_USERSET, DEVELOPER_OPTIONS,
              gettext_noop("Turns on various assertion checks."),
              gettext_noop("This is a debugging aid."),
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.215
diff -c -r1.215 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    18 Apr 2007 16:44:18 -0000    1.215
--- src/backend/utils/misc/postgresql.conf.sample    20 May 2007 10:14:36 -0000
***************
*** 318,323 ****
--- 318,324 ----
  #debug_print_rewritten = off
  #debug_print_plan = off
  #debug_pretty_print = off
+ #log_checkpoints = off # XXX: Should this be under the checkpoints section?
  #log_connections = off
  #log_disconnections = off
  #log_duration = off
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h    5 Jan 2007 22:19:51 -0000    1.76
--- src/include/access/xlog.h    20 May 2007 15:10:43 -0000
***************
*** 128,133 ****
--- 128,164 ----
      struct XLogRecData *next;    /* next struct in chain, or NULL */
  } XLogRecData;

+ /* specifying CHKP_SHUTDOWN modifies behavior. The other reasons
+  * are distinct for informational purposes */
+ typedef enum CheckpointReason
+ {
+     CHKP_SHUTDOWN = 1,
+     CHKP_COMMAND,
+     CHKP_BACKUP,
+     CHKP_TIMEOUT,
+     CHKP_SEGMENTS,
+     CHKP_CREATEDB,
+     CHKP_DROPDB
+ } CheckpointReason;
+
+ /* CheckpointProgress is used to collect information about the progress
+  * of checkpoint. When there's no checkpoint in progress, the fields store
+  * the details of the last finished checkpoint.
+  *
+  * The fields are in the order they're filled in during a checkpoint.
+  */
+ typedef struct CheckpointProgressData
+ {
+     CheckpointReason cs_reason;
+     struct timeval    cs_start_t;
+     int                cs_buffers_written;
+     struct timeval    cs_write_t;
+     struct timeval    cs_sync_t;
+     struct timeval    cs_end_t;
+ } CheckpointProgressData;
+
+ extern CheckpointProgressData checkpointProgress;
+
  extern TimeLineID ThisTimeLineID;        /* current TLI */
  extern bool InRecovery;
  extern XLogRecPtr MyLastRecPtr;
***************
*** 162,168 ****
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
--- 193,199 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(CheckpointReason reason);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h    5 Jan 2007 22:19:57 -0000    1.9
--- src/include/postmaster/bgwriter.h    20 May 2007 10:05:03 -0000
***************
*** 12,17 ****
--- 12,18 ----
  #ifndef _BGWRITER_H
  #define _BGWRITER_H

+ #include "access/xlog.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"

***************
*** 23,29 ****

  extern void BackgroundWriterMain(void);

! extern void RequestCheckpoint(bool waitforit, bool warnontime);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 24,30 ----

  extern void BackgroundWriterMain(void);

! extern void RequestCheckpoint(bool waitforit, CheckpointReason reason);

  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);

Re: Logging checkpoints and other slowdown causes

From: Simon Riggs

On Mon, 2007-05-28 at 21:25 +0100, Heikki Linnakangas wrote:

> - RequestCheckpoint and CreateCheckpoint were changed to take a
> CheckpointReason enum as argument, instead of boolean flags.  This allows
> printing a different message depending on the cause of the checkpoint,
> and is also cleaner IMHO.

Looks good.

I'd be inclined to make the reason code for creating/dropping database
the same and just say "Cluster DDL" or something, since for analysis
purposes we just want to know it was DDL.

I'd rather see the timings in seconds rather than ms. Counting
significant digits will be a pain...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Logging checkpoints and other slowdown causes

From: Greg Smith

I'll take another stab at refining this can of worms I opened.  The one
thing I noticed on a quick review is that it's almost possible to skip all
the calls to gettimeofday if log_checkpoints is off now.  I'd like to make
that a specific goal, because that will make me feel better that adding
this code has almost no performance impact relative to now unless you turn
the feature on.
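
Concretely, I'm picturing each timestamp capture sitting behind the same
test the output already uses, e.g. in FlushBufferPool (a sketch using the
names from Heikki's patch above):

    /* skip the clock read entirely when nothing will be logged */
    if (log_checkpoints)
        gettimeofday(&checkpointProgress.cs_write_t, NULL);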

I agree with Simon that tracking create/drop separately is unnecessary.
As for why all the timing info is in ms, given the scale of the numbers
typically encountered I found it easier to work with.  I originally wanted
resolution down to 0.1ms if the underlying OS supports it, which means 4
figures to the right of the decimal point if the unit was switched to
seconds.  Quite often the times reported are less than 100ms, so you'll
normally be dealing with the fractional part of a second.  If we take Heikki's
example:

LOG:  checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms sync=32.0 ms

And switch it to seconds:

LOG:  checkpoint complete; buffers written=3.1 MB (9.6%) write=0.0968 s sync=0.0320 s

I don't find that as easy to work with.  The only way a timing in seconds
would look OK is if the resolution of the whole thing is reduced to ms,
which then means 3 decimal places--easy to read as ms instead.  Having
stared at a fair amount of this data now, that's probably fine; I'll
collect up some more data on it from a fast server this week to confirm
whether it's worthless precision or worth capturing.
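
For reference, the arithmetic involved boils down to a helper like this,
sketched from the usec math in Heikki's patch (the function name is mine):

    #include <sys/time.h>

    /* elapsed time between two timevals, in milliseconds */
    static double
    elapsed_ms(const struct timeval *start, const struct timeval *end)
    {
        return (double) (end->tv_sec - start->tv_sec) * 1000.0 +
               (double) (end->tv_usec - start->tv_usec) / 1000.0;
    }

Whichever unit we print in, the full resolution is there; the question is
just the divisor and how many decimal places to show.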

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Logging checkpoints and other slowdown causes

From: Heikki Linnakangas

Greg Smith wrote:
> I'll take another stab at refining this can of worms I opened.  The one
> thing I noticed on a quick review is that it's almost possible to skip
> all the calls to gettimeofday if log_checkpoints is off now.  I'd like
> to make that a specific goal, because that will make me feel better that
> adding this code has almost no performance impact relative to now unless
> you turn the feature on.

Saving a couple of gettimeofday calls on an event that happens as
infrequently as a checkpoint is not going to make any difference,
especially compared to all the other work that's done at checkpoint
time.

> I agree with Simon that tracking create/drop separately is unnecessary.
> As for why all the timing info is in ms, given the scale of the numbers
> typically encountered I found it easier to work with.  I originally
> wanted resolution down to 0.1ms if the underlying OS supports it, which
> means 4 figures to the right of the decimal point if the unit was
> switched to seconds.  Quite often the times reported are less than
> 100ms, so you'll normally be dealing with the fractional part of a second.
> If we take Heikki's example:
>
> LOG:  checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms sync=32.0 ms
>
> And switch it to seconds:
>
> LOG:  checkpoint complete; buffers written=3.1 MB (9.6%) write=0.0968 s sync=0.0320 s
>
> I don't find that as easy to work with.  The only way a timing in
> seconds would look OK is if the resolution of the whole thing is reduced
> to ms, which then means 3 decimal places--easy to read as ms instead.
> Having stared at a fair amount of this data now, that's probably fine;
> I'll collect up some more data on it from a fast server this week to
> confirm whether it's worthless precision or worth capturing.

The checkpoint will take at least a couple of seconds on any interesting
system, so 0.1 s resolution should be enough IMHO.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Logging checkpoints and other slowdown causes

From: Greg Smith

On Tue, 29 May 2007, Heikki Linnakangas wrote:

> The checkpoint will take at least a couple of seconds on any interesting
> system, so 0.1 s resolution should be enough IMHO.

You may be underestimating the resources some interesting systems are
willing to put into speeding up checkpoints.  I'm sometimes dumping into a
SAN whose cache is bigger than the shared_buffers cache in the server,
and 0.1 s isn't really enough resolution in that situation.  A second is a
really long checkpoint there.  Since even that's limited by Fibre Channel
speeds, I know it's possible to do better than what I'm seeing with
something like a PCIe host adapter with on-board cache in the GB range
(which isn't that expensive nowadays).

Also, even if the checkpoint as a whole takes seconds, much of that is in
the sync phase; the write time can still be in the small number of ms
range, and I wouldn't want to see that truncated too much.

Anyway, I have a bunch of data on this subject being collected at this
moment, and I'll rescale the results based on what I see after analyzing
that this week.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD