Thread: Instrument checkpoint sync calls
Attached patch adds some logging for each individual fsync call made during a checkpoint, along with a summary at the end. You need to have the following to see all of the detail:

log_checkpoints=on
log_min_messages=debug1

And here's a sample:

LOG:  checkpoint starting: immediate force wait
DEBUG:  checkpoint sync: file=1 time=1.946000 msec
DEBUG:  checkpoint sync: file=2 time=0.666000 msec
DEBUG:  checkpoint sync: file=3 time=0.004000 msec
LOG:  checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec
LOG:  checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s

I think that it's reasonable for the sort of people who turn log_checkpoints on to also get the sync summary line, thus it being logged at LOG level. The detail on individual syncs might go to DEBUG2 or lower instead of DEBUG1 where I put it, that part I don't have a strong opinion on. It's at DEBUG1 to make testing the patch without a gigantic amount of log data also coming in easier.

Right now the code is written such that all the calls that grab timing information are wrapped around "ifdef DEBUG_FSYNC", which is a variable set to 1 that could be a compile-time option like DEBUG_DEADLOCK, to allow turning this code path off at build time. I personally think that if you're already making an fsync call and have log_checkpoints on, the additional overhead of also timing that fsync is minimal even on platforms where timing is slow (I don't have such a system to test that assumption however). And I've seen enough value in troubleshooting nasty checkpoint sync problems using this patch to feel it's worth having even if it does add some overhead.

I'm a little concerned about log_checkpoints changing on me in the middle of the execution of a checkpoint, which would cause some problems here. Not sure if that's something I should explicitly code for, given that all I think it will do is screw up one of the timing results. It does seem a risk from the last-minute self-review I just did of the code.

I'll give a sample program that stresses the system, generating slow timing results and other types of bad behavior, along with the next patch I submit here shortly.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1219fcf..c72da72 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -20,6 +20,7 @@
 
 #include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
 #include "storage/bufmgr.h"
@@ -29,6 +30,11 @@
 #include "utils/memutils.h"
 #include "pg_trace.h"
 
+/*
+ * For now just leave this on all the time, eventually this will
+ * need to be a full compiler flag like DEBUG_DEADLOCK
+ */
+#define DEBUG_FSYNC 1
 
 /* interval for calling AbsorbFsyncRequests in mdsync */
 #define FSYNCS_PER_ABSORB	10
@@ -927,6 +933,15 @@ mdsync(void)
 	PendingOperationEntry *entry;
 	int			absorb_counter;
 
+#ifdef DEBUG_FSYNC
+	/* Statistics on sync times */
+	int			processed = 0;
+	instr_time	sync_start,sync_end,diff;
+	double		elapsed;
+	double		longest = 0;
+	double		total_elapsed = 0;
+#endif
+
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
 	 * occur in processes that have created a pendingOpsTable.
@@ -1069,9 +1084,32 @@ mdsync(void)
 				seg = _mdfd_getseg(reln, entry->tag.forknum,
 							  entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
 								   false, EXTENSION_RETURN_NULL);
+
+#ifdef DEBUG_FSYNC
+				if (log_checkpoints)
+					INSTR_TIME_SET_CURRENT(sync_start);
+#endif
 				if (seg != NULL &&
 					FileSync(seg->mdfd_vfd) >= 0)
+				{
+
+#ifdef DEBUG_FSYNC
+					if (log_checkpoints)
+					{
+						INSTR_TIME_SET_CURRENT(sync_end);
+						diff=sync_end;
+						INSTR_TIME_SUBTRACT(diff, sync_start);
+						elapsed = (double) INSTR_TIME_GET_MICROSEC(diff) / 1000.0;
+						if (elapsed > longest)
+							longest = elapsed;
+						total_elapsed+=elapsed;
+						processed++;
+						/* Reduce the log level here to DEBUG2 in final patch? */
+						elog(DEBUG1, "checkpoint sync: file=%d time=%f msec",
+							 processed, elapsed);
+					}
+#endif
 					break;		/* success; break out of retry loop */
+				}
 
 				/*
 				 * XXX is there any point in allowing more than one retry?
@@ -1113,6 +1151,13 @@ mdsync(void)
 			elog(ERROR, "pendingOpsTable corrupted");
 	}							/* end loop over hashtable entries */
 
+#ifdef DEBUG_FSYNC
+	/* Report on sync performance metrics */
+	if (log_checkpoints && (processed > 0))
+		elog(LOG, "checkpoint sync: files=%d longest=%f msec average=%f msec",
+			 processed, longest, total_elapsed / processed);
+#endif
+
 	/* Flag successful completion of mdsync */
 	mdsync_in_progress = false;
 }
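The measurement pattern the patch wraps around FileSync() can be tried outside the server source. Here is a minimal standalone sketch of the same idea; the file name is a hypothetical demo, and it uses plain gettimeofday() directly where the patch uses the instr_time macros (which reduce to gettimeofday() on most platforms):

    /* Time a single fsync() call and report elapsed milliseconds,
     * mirroring what the patch does around FileSync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval start, end;
        int fd = open("sync_demo.tmp", O_CREAT | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, "x", 1) != 1) { perror("write"); return 1; }

        gettimeofday(&start, NULL);           /* sync_start in the patch */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        gettimeofday(&end, NULL);             /* sync_end in the patch */

        double elapsed = (end.tv_sec - start.tv_sec) * 1000.0 +
                         (end.tv_usec - start.tv_usec) / 1000.0;
        printf("fsync time=%.3f msec\n", elapsed);
        close(fd);
        unlink("sync_demo.tmp");
        return 0;
    }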
On Sun, Nov 14, 2010 at 22:37, Greg Smith <greg@2ndquadrant.com> wrote:
> Attached patch adds some logging for each individual fsync call made during
> a checkpoint, along with a summary at the end. You need to have the
> following to see all of the detail:
>
> log_checkpoints=on
> log_min_messages=debug1
>
> And here's a sample:
>
> LOG: checkpoint starting: immediate force wait
> DEBUG: checkpoint sync: file=1 time=1.946000 msec
> DEBUG: checkpoint sync: file=2 time=0.666000 msec
> DEBUG: checkpoint sync: file=3 time=0.004000 msec
> LOG: checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec
> LOG: checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s)
> added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s
>
> I think that it's reasonable for the sort of people who turn log_checkpoints
> on to also get the sync summary line, thus it being logged at LOG level.

In that case, wouldn't it make more sense to add a couple of more fields to the existing line? Easier to post-process the logfile that way...

> The detail on individual syncs might go to DEBUG2 or lower instead of
> DEBUG1 where I put it, that part I don't have a strong opinion on. It's at
> DEBUG1 to make testing the patch without a gigantic amount of log data also
> coming in easier.
>
> Right now the code is written such that all the calls that grab timing
> information are wrapped around "ifdef DEBUG_FSYNC", which is a variable set
> to 1 that could be a compile-time option like DEBUG_DEADLOCK, to allow
> turning this code path off at build time. I personally think that if you're
> already making an fsync call and have log_checkpoints on, the additional
> overhead of also timing that fsync is minimal even on platforms where timing
> is slow (I don't have such a system to test that assumption however). And
> I've seen enough value in troubleshooting nasty checkpoint sync problems
> using this patch to feel it's worth having even if it does add some
> overhead.

It sounds like it should be - but that should be possible to measure, no?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
Magnus Hagander wrote:
>> I think that it's reasonable for the sort of people who turn
>> log_checkpoints on to also get the sync summary line, thus it being
>> logged at LOG level.
>
> In that case, wouldn't it make more sense to add a couple of more
> fields to the existing line? Easier to post-process the logfile that
> way...

It might. One trade-off is that if you're looking at the sync write detail, the summary comes out in a similar form. And it was easy to put in here--I'd have to return some new data out of the sync phase call in order for that to show up in the main log. If there's general buy-in on the idea, I could do all of that.

>> I personally think that if you're already making an fsync call and have
>> log_checkpoints on, the additional overhead of also timing that fsync is
>> minimal even on platforms where timing is slow (I don't have such a
>> system to test that assumption however)...
>
> It sounds like it should be - but that should be possible to measure, no?

What I was alluding to is that I know gettimeofday executes fast on my Linux system here, so even if I did measure the overhead and showed it's near zero, that doesn't mean it will be so on every platform. The "how long does it take to find out the current time on every supported PostgreSQL platform?" question is one I'd like to have an answer to, but it's hard to collect properly. All I know is that I don't have any system where it's slow to properly test again here.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
On Sun, Nov 14, 2010 at 7:04 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> It might. One trade-off is that if you're looking at the sync write detail,
> the summary comes out in a similar form. And it was easy to put in
> here--I'd have to return some new data out of the sync phase call in order
> for that to show up in the main log. If there's general buy-in on the idea,
> I could do all of that.

+1 for that approach.

> What I was alluding to is that I know gettimeofday executes fast on my Linux
> system here, so even if I did measure the overhead and showed it's near zero
> that doesn't mean it will be so on every platform. The "how long does it
> take to find out the current time on every supported PostgreSQL platform?"
> question is one I'd like to have an answer to, but it's hard to collect
> properly. All I know is that I don't have any system where it's slow to
> properly test again here.

I would be very surprised if we can find a system where gettimeofday() takes a significant amount of time compared with fsync(). It might be (probably is) too expensive to stick into code paths that are heavily CPU-bounded, but surely the cost here is going to be dwarfed by the fsync(), no? Unless maybe there's no I/O to be done anyway, but that case doesn't seem important to optimize for.

I think making this code dependent on ifdef DEBUG_FSYNC is going to make it an order of magnitude less useful, so I'd really rather not do that unless we have some hard evidence that there is a real problem to worry about here. Making it conditional on log_checkpoints seems entirely sufficient to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I would be very surprised if we can find a system where gettimeofday()
> takes a significant amount of time compared with fsync(). It might be
> (probably is) too expensive to stick into code paths that are heavily
> CPU-bounded, but surely the cost here is going to be dwarfed by the
> fsync(), no? Unless maybe there's no I/O to be done anyway, but that
> case doesn't seem important to optimize for.

I'm not sure I buy that --- the whole point of spread checkpoints is that we hope the I/O happens before we actually call fsync.

> Making it conditional on log_checkpoints seems entirely sufficient to me.

But I'll agree with that. If you're turning on log_checkpoints then you've given the system permission to indulge in extra overhead for monitoring.

			regards, tom lane
So my task list is:

0) Rebase against the HEAD that just had code related to this touched today.

1) Assume that log_checkpoints is sufficient control over whether the timing overhead added is worth collecting, and therefore remove the half-baked idea of also wrapping with a compile-time option.

2) Have the sync summary returned upwards, so it can be put onto the same line as the rest of the log_checkpoint info.

All seems reasonable to me. Will rev a new patch by tomorrow.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
On Mon, Nov 15, 2010 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I would be very surprised if we can find a system where gettimeofday()
>> takes a significant amount of time compared with fsync(). It might be
>> (probably is) too expensive to stick into code paths that are heavily
>> CPU-bounded, but surely the cost here is going to be dwarfed by the
>> fsync(), no? Unless maybe there's no I/O to be done anyway, but that
>> case doesn't seem important to optimize for.
>
> I'm not sure I buy that --- the whole point of spread checkpoints is
> that we hope the I/O happens before we actually call fsync.

Possibly. Even then, it's not guaranteed, because it might well be that fsync() has to do more grovelling through system data structures to determine that there is in fact no work to be done than what gettimeofday() has to do to read off a counter somewhere...

>> Making it conditional on log_checkpoints seems entirely sufficient to me.
>
> But I'll agree with that. If you're turning on log_checkpoints then
> you've given the system permission to indulge in extra overhead for
> monitoring.

...but if we agree on the way forward then the details don't matter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
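The per-call cost of the timing primitive is straightforward to measure in isolation. A rough standalone micro-benchmark sketch (hypothetical, not part of any patch here) of what one gettimeofday() call costs on a given box:

    /* Measure the average cost of one gettimeofday() call by timing a
     * tight loop of them with gettimeofday() itself. */
    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        const long iters = 1000000;
        struct timeval start, end, scratch;
        long i;

        gettimeofday(&start, NULL);
        for (i = 0; i < iters; i++)
            gettimeofday(&scratch, NULL);
        gettimeofday(&end, NULL);

        double total_usec = (end.tv_sec - start.tv_sec) * 1e6 +
                            (end.tv_usec - start.tv_usec);
        printf("%.1f nsec per gettimeofday() call\n",
               1000.0 * total_usec / iters);
        return 0;
    }

On systems with a fast clock source this typically reports tens of nanoseconds, far below the cost of even a no-op fsync(); the open question in the thread is platforms where the clock source is slow.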
On Mon, Nov 15, 2010 at 3:09 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> So my task list is:
>
> 0) Rebase against the HEAD that just had code related to this touched today.
>
> 1) Assume that log_checkpoints is sufficient control over whether the timing
> overhead added is worth collecting, and therefore remove the half-baked idea
> of also wrapping with a compile-time option.
>
> 2) Have the sync summary returned upwards, so it can be put onto the same
> line as the rest of the log_checkpoint info.
>
> All seems reasonable to me. Will rev a new patch by tomorrow.

Did this get eaten by the email goblin, or are you still working on it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 15, 2010 at 12:09 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> So my task list is:
>
> 0) Rebase against the HEAD that just had code related to this touched today.
>
> 1) Assume that log_checkpoints is sufficient control over whether the timing
> overhead added is worth collecting, and therefore remove the half-baked idea
> of also wrapping with a compile-time option.
>
> 2) Have the sync summary returned upwards, so it can be put onto the same
> line as the rest of the log_checkpoint info.
>
> All seems reasonable to me. Will rev a new patch by tomorrow.

For the individual file sync times emitted under debug1, it would be very handy if the file being synced was identified, for example "relation base/16384/16523", rather than being numbered sequentially within a given checkpoint.

Cheers,
Jeff
Robert Haas wrote:
> Did this get eaten by the email goblin, or are you still working on it?

Fell behind due to an unfortunately timed bit of pneumonia. Hurray for the health benefits of cross-country flights. Will fix this up, rebase my other patch, and head toward some more review/'Fest cleanup now that I'm feeling better.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
On Wed, Nov 24, 2010 at 1:02 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Did this get eaten by the email goblin, or are you still working on it?
>
> Fell behind due to an unfortunately timed bit of pneumonia. Hurray for the
> health benefits of cross-country flights. Will fix this up, rebase my other
> patch, and head toward some more review/'Fest cleanup now that I'm feeling
> better.

Ouch. Fringe benefits of consulting. Thanks for the update.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Jeff Janes wrote:
> For the individual file sync times emitted under debug1, it would be
> very handy if the file being synced was identified, for example
> "relation base/16384/16523".

I was in the middle of looking into adding that already, so consider that part of the target for the next update I'm working on.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Jeff Janes wrote:
> For the individual file sync times emitted under debug1, it would be
> very handy if the file being synced was identified, for example
> "relation base/16384/16523". Rather than being numbered sequentially
> within a given checkpoint.

I was numbering them sequentially so that it's straightforward to graph the sync times in an external analysis tool, but the relation data is helpful too. New patch reflecting all upthread suggestions is attached. The output looks like this now at DEBUG1:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.920000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec
DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec
LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s

I kept the units for the DEBUG level ones in msec because that's a better scale for the common really short syncs during good behavior. But the summary info in seconds now appears at the end of the existing "checkpoint complete" message, so only one line to parse for those looking to analyze the gross checkpoint data. That looks to work well enough for finding situations like the big ext3 spikes. You can easily see one in this example by the fact that "longest=11.589 s" is almost the entirety of "sync=11.715 s". That's the really key thing there's currently no visibility into, that's made obvious with this patch.

This might be ready for some proper review now. I know there's at least one blatant bug still in here I haven't found yet, related to how the averages are computed. I saw this once:

LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

After an immediate checkpoint, so at least one path not quite right yet.
--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede6ceb..6f6eb3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7063,10 +7063,15 @@ LogCheckpointEnd(bool restartpoint)
 {
 	long		write_secs,
 				sync_secs,
-				total_secs;
+				total_secs,
+				longest_secs,
+				average_secs;
 	int			write_usecs,
 				sync_usecs,
-				total_usecs;
+				total_usecs,
+				longest_usecs,
+				average_usecs;
+	double		average_sync_time;
 
 	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
 
@@ -7082,18 +7087,35 @@ LogCheckpointEnd(bool restartpoint)
 						CheckpointStats.ckpt_sync_end_t,
 						&sync_secs, &sync_usecs);
 
+	/*
+	 * Timing values returned from CheckpointStats are in milliseconds.
+	 * Convert to the second plus microsecond form that TimestampDifference
+	 * returns for homogeneous printing.
+	 */
+	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - longest_secs * 1000);
+
+	average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels;
+	average_secs = (long) (average_sync_time / 1000);
+	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+
 	if (restartpoint)
 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
-			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
 			 CheckpointStats.ckpt_bufs_written,
 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 			 write_secs, write_usecs / 1000,
 			 sync_secs, sync_usecs / 1000,
-			 total_secs, total_usecs / 1000);
+			 total_secs, total_usecs / 1000,
+			 CheckpointStats.ckpt_sync_rels,
+			 longest_secs, longest_usecs / 1000,
+			 average_secs, average_usecs / 1000);
 	else
 		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
 			 "%d transaction log file(s) added, %d removed, %d recycled; "
-			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
 			 CheckpointStats.ckpt_bufs_written,
 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 			 CheckpointStats.ckpt_segs_added,
@@ -7101,7 +7123,10 @@ LogCheckpointEnd(bool restartpoint)
 			 CheckpointStats.ckpt_segs_recycled,
 			 write_secs, write_usecs / 1000,
 			 sync_secs, sync_usecs / 1000,
-			 total_secs, total_usecs / 1000);
+			 total_secs, total_usecs / 1000,
+			 CheckpointStats.ckpt_sync_rels,
+			 longest_secs, longest_usecs / 1000,
+			 average_secs, average_usecs / 1000);
 }
 
 /*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 604db21..cadd938 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -20,6 +20,7 @@
 
 #include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
 #include "storage/bufmgr.h"
@@ -927,6 +928,13 @@ mdsync(void)
 	PendingOperationEntry *entry;
 	int			absorb_counter;
 
+	/* Statistics on sync times */
+	int			processed = 0;
+	instr_time	sync_start, sync_end, sync_diff;
+	double		elapsed;
+	double		longest = 0;
+	double		total_elapsed = 0;
+
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
 	 * occur in processes that have created a pendingOpsTable.
@@ -1069,9 +1077,29 @@ mdsync(void)
 				seg = _mdfd_getseg(reln, entry->tag.forknum,
 							  entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
 								   false, EXTENSION_RETURN_NULL);
+
+				if (log_checkpoints)
+					INSTR_TIME_SET_CURRENT(sync_start);
+
 				if (seg != NULL &&
 					FileSync(seg->mdfd_vfd) >= 0)
+				{
+					if (log_checkpoints)
+					{
+						INSTR_TIME_SET_CURRENT(sync_end);
+						sync_diff = sync_end;
+						INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+						elapsed = (double) INSTR_TIME_GET_MICROSEC(sync_diff) / 1000.0;
+						if (elapsed > longest)
+							longest = elapsed;
+						total_elapsed += elapsed;
+						processed++;
+						elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%f msec",
+							 processed, FilePathName(seg->mdfd_vfd), elapsed);
+					}
+					break;		/* success; break out of retry loop */
+				}
 
 				/*
 				 * XXX is there any point in allowing more than one retry?
@@ -1113,6 +1141,11 @@ mdsync(void)
 			elog(ERROR, "pendingOpsTable corrupted");
 	}							/* end loop over hashtable entries */
 
+	/* Return sync performance metrics for report at checkpoint end */
+	CheckpointStats.ckpt_sync_rels = processed;
+	CheckpointStats.ckpt_longest_sync = longest;
+	CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
 	/* Flag successful completion of mdsync */
 	mdsync_in_progress = false;
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index fa7ae2a..ec4d9db 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -257,6 +257,13 @@ typedef struct CheckpointStatsData
 	int			ckpt_segs_added;	/* # of new xlog segments created */
 	int			ckpt_segs_removed;	/* # of xlog segments deleted */
 	int			ckpt_segs_recycled; /* # of xlog segments recycled */
+
+	int			ckpt_sync_rels;		/* # of relations synced */
+	double		ckpt_longest_sync;	/* Longest sync for one relation */
+	double		ckpt_agg_sync_time; /* The sum of all the individual sync
+									   times, which is not necessarily the
+									   same as the total elapsed time for
+									   the entire sync phase. */
 } CheckpointStatsData;
 
 extern CheckpointStatsData CheckpointStats;
2010/11/30 Greg Smith <greg@2ndquadrant.com>:
> Jeff Janes wrote:
>> For the individual file sync times emitted under debug1, it would be
>> very handy if the file being synced was identified, for example
>> "relation base/16384/16523". Rather than being numbered sequentially
>> within a given checkpoint.
>
> I was numbering them sequentially so that it's straightforward to graph the
> sync times in an external analysis tool, but the relation data is helpful
> too. New patch reflecting all upthread suggestions is attached. The output
> looks like this now at DEBUG1:
>
> LOG: checkpoint starting: xlog
> DEBUG: checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
> DEBUG: checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
> DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
> DEBUG: checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
> DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
> DEBUG: checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec
> DEBUG: checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
> DEBUG: checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec
> DEBUG: checkpoint sync: number=9 file=base/16424/16442 time=7.920000 msec
> DEBUG: checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec
> DEBUG: checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
> DEBUG: checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec
> DEBUG: checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
> DEBUG: checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
> DEBUG: checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec
> DEBUG: checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec
> LOG: checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log
> file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s,
> total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s
>
> I kept the units for the DEBUG level ones in msec because that's a better
> scale for the common really short syncs during good behavior. But the
> summary info in seconds now appears at the end of the existing "checkpoint
> complete" message, so only one line to parse for those looking to analyze
> the gross checkpoint data. That looks to work well enough for finding
> situations like the big ext3 spikes. You can easily see one in this example
> by the fact that "longest=11.589 s" is almost the entirety of "sync=11.715
> s". That's the really key thing there's currently no visibility into,
> that's made obvious with this patch.

Wonderful.

> This might be ready for some proper review now. I know there's at least one
> blatant bug still in here I haven't found yet, related to how the averages
> are computed. I saw this once:
>
> LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
> sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s
>
> After an immediate checkpoint, so at least one path not quite right yet.

--
Cédric Villemain  2ndQuadrant
http://2ndQuadrant.fr/  PostgreSQL : Expertise, Formation et Support
On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Hi Greg,

Thanks for the update.

> This might be ready for some proper review now. I know there's at least one
> blatant bug still in here I haven't found yet, related to how the averages
> are computed.

Yes, the blatant bug:

    average_sync_time = CheckpointStats.ckpt_longest_sync /
                        CheckpointStats.ckpt_sync_rels;

That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.

> I saw this once:
>
> LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
> sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s
>
> After an immediate checkpoint, so at least one path not quite right yet.

Not clear what the right thing to do here is. I guess we should special case the div by zero to yield zero for the average?

The patch is in unified diff rather than context diff. Doesn't bother me, but the wiki on doing a review says...

Cheers,
Jeff
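The special case would presumably look something like this minimal sketch (the helper function is hypothetical, pulled out here only for illustration; it mirrors the guard that shows up in the next revision of the patch):

    /* Average sync time per file, yielding zero instead of dividing
     * by zero when an immediate checkpoint synced no files at all. */
    static double
    checkpoint_average_sync(double agg_sync_time, int sync_rels)
    {
        if (sync_rels <= 0)
            return 0;
        return agg_sync_time / sync_rels;
    }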
On Tue, Nov 30, 2010 at 12:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Yes, the blatant bug:
>
>     average_sync_time = CheckpointStats.ckpt_longest_sync /
>                         CheckpointStats.ckpt_sync_rels;
>
> That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.

I've attached a tiny patch to apply over yours, to deal with this and with the case where no files are synced.

Combining this instrumentation patch with the backend sync one already committed, the net result under log_min_messages=debug1 is somewhat undesirable in that I can now see the individual sync times for the syncs done by the checkpoint writer, but I do not get times for the syncs done by the backend (I only get informed of their existence).

I don't know what I would propose to fix this. Having the reportage of sync time of backend syncs be controlled by log_checkpoints seems somewhat perverse, but the only alternative I see is to have log_min_messages=debug1 always report times for the backend syncs. Or to have them go unreported altogether.

Cheers,
Jeff
Jeff Janes wrote:
> I've attached a tiny patch to apply over yours, to deal with this and
> with the case where no files are synced.

Thanks for that. That obvious error eluded me because in most of the patch update testing I was doing (on ext3), the longest sync was always about the same length as the total sync time.

Attached patch (in correct diff form this time!) collects up all changes. That includes elimination of a potential race condition if someone changes log_checkpoints while a long sync phase is executing. I don't know whether that can happen, and it obviously won't give accurate stats going back to the beginning of the checkpoint in that case, but it tries to defend against producing complete garbage if that value changes out from under it.

This is the first version of this patch I feel fairly good about; no open concerns left on my side. Jeff, since you're now the de-facto credited reviewer of this one by virtue of suggesting bug fixes, could you take this update out for a spin too?

> Combining this instrumentation patch with the backend sync one already
> committed, the net result under log_min_messages=debug1 is somewhat
> undesirable in that I can now see the individual sync times for the
> syncs done by the checkpoint writer, but I do not get times for the
> syncs done by the backend (I only get informed of their existence).

On a properly working system, backend syncs shouldn't be happening. So if you see them, I think the question shouldn't be "how long are they taking?", it's "how do I get rid of them?" From that perspective, knowing of their existence is sufficient to suggest the necessary tuning changes, such as dropping bgwriter_delay. When you get into a situation where they are showing up, a whole lot of them can happen in a very brief period; enough that I'd actually be concerned about the added timing overhead, something I normally don't care about very much. That's the main reason I didn't bother instrumenting those too--the situations where they happen are ones already running low on resources.

Big writes for things that can only be written out at checkpoint time, on the other hand, are unavoidable in the current design. Providing detail on them is going to be relevant unless there's a major refactoring of how those happen.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ae9ed8f..6adc139 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** LogCheckpointEnd(bool restartpoint)
*** 6955,6964 ****
  {
  	long		write_secs,
  				sync_secs,
! 				total_secs;
  	int			write_usecs,
  				sync_usecs,
! 				total_usecs;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
--- 6955,6969 ----
  {
  	long		write_secs,
  				sync_secs,
! 				total_secs,
! 				longest_secs,
! 				average_secs;
  	int			write_usecs,
  				sync_usecs,
! 				total_usecs,
! 				longest_usecs,
! 				average_usecs;
! 	double		average_sync_time;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
*************** LogCheckpointEnd(bool restartpoint)
*** 6974,6991 ****
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  	else
  		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
  			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 CheckpointStats.ckpt_segs_added,
--- 6979,7017 ----
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
+ 	/*
+ 	 * Timing values returned from CheckpointStats are in milliseconds.
+ 	 * Convert to the second plus microsecond form that TimestampDifference
+ 	 * returns for homogeneous printing.
+ 	 */
+ 	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+ 	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync -
+ 							longest_secs * 1000);
+ 
+ 	average_sync_time = 0;
+ 	if (CheckpointStats.ckpt_sync_rels > 0)
+ 		average_sync_time = CheckpointStats.ckpt_agg_sync_time /
+ 							CheckpointStats.ckpt_sync_rels;
+ 	average_secs = (long) (average_sync_time / 1000);
+ 	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+ 
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! 			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000,
! 			 CheckpointStats.ckpt_sync_rels,
! 			 longest_secs, longest_usecs / 1000,
! 			 average_secs, average_usecs / 1000);
  	else
  		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
  			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! 			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 CheckpointStats.ckpt_segs_added,
*************** LogCheckpointEnd(bool restartpoint)
*** 6993,6999 ****
  			 CheckpointStats.ckpt_segs_recycled,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  }
  
  /*
--- 7019,7028 ----
  			 CheckpointStats.ckpt_segs_recycled,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000,
! 			 CheckpointStats.ckpt_sync_rels,
! 			 longest_secs, longest_usecs / 1000,
! 			 average_secs, average_usecs / 1000);
  }
  
  /*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 604db21..1f602a7 100644
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 20,25 ****
--- 20,26 ----
  
  #include "catalog/catalog.h"
  #include "miscadmin.h"
+ #include "portability/instr_time.h"
  #include "postmaster/bgwriter.h"
  #include "storage/fd.h"
  #include "storage/bufmgr.h"
*************** mdsync(void)
*** 927,932 ****
--- 928,940 ----
  	PendingOperationEntry *entry;
  	int			absorb_counter;
  
+ 	/* Statistics on sync times */
+ 	int			processed = 0;
+ 	instr_time	sync_start, sync_end, sync_diff;
+ 	double		elapsed;
+ 	double		longest = 0;
+ 	double		total_elapsed = 0;
+ 
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
  	 * occur in processes that have created a pendingOpsTable.
*************** mdsync(void)
*** 1069,1077 ****
--- 1077,1107 ----
  				seg = _mdfd_getseg(reln, entry->tag.forknum,
  							  entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
  								   false, EXTENSION_RETURN_NULL);
+ 
+ 				if (log_checkpoints)
+ 					INSTR_TIME_SET_CURRENT(sync_start);
+ 				else
+ 					INSTR_TIME_SET_ZERO(sync_start);
+ 
  				if (seg != NULL &&
  					FileSync(seg->mdfd_vfd) >= 0)
+ 				{
+ 					if (log_checkpoints && (! INSTR_TIME_IS_ZERO(sync_start)))
+ 					{
+ 						INSTR_TIME_SET_CURRENT(sync_end);
+ 						sync_diff = sync_end;
+ 						INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ 						elapsed = (double) INSTR_TIME_GET_MICROSEC(sync_diff) / 1000.0;
+ 						if (elapsed > longest)
+ 							longest = elapsed;
+ 						total_elapsed += elapsed;
+ 						processed++;
+ 						elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%f msec",
+ 							 processed, FilePathName(seg->mdfd_vfd), elapsed);
+ 					}
+ 					break;		/* success; break out of retry loop */
+ 				}
 
  				/*
  				 * XXX is there any point in allowing more than one retry?
*************** mdsync(void)
*** 1113,1118 ****
--- 1143,1153 ----
  			elog(ERROR, "pendingOpsTable corrupted");
  	}							/* end loop over hashtable entries */
  
+ 	/* Return sync performance metrics for report at checkpoint end */
+ 	CheckpointStats.ckpt_sync_rels = processed;
+ 	CheckpointStats.ckpt_longest_sync = longest;
+ 	CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+ 
  	/* Flag successful completion of mdsync */
  	mdsync_in_progress = false;
  }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index fa7ae2a..d95f843 100644
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** typedef struct CheckpointStatsData
*** 257,262 ****
--- 257,269 ----
  	int			ckpt_segs_added;	/* # of new xlog segments created */
  	int			ckpt_segs_removed;	/* # of xlog segments deleted */
  	int			ckpt_segs_recycled; /* # of xlog segments recycled */
+ 
+ 	int			ckpt_sync_rels;		/* # of relations synced */
+ 	double		ckpt_longest_sync;	/* Longest sync for one relation */
+ 	double		ckpt_agg_sync_time; /* The sum of all the individual sync
+ 									   times, which is not necessarily the
+ 									   same as the total elapsed time for
+ 									   the entire sync phase. */
  } CheckpointStatsData;
  
  extern CheckpointStatsData CheckpointStats;
On Sun, Dec 5, 2010 at 1:23 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>> I've attached a tiny patch to apply over yours, to deal with this and
>> with the case where no files are synced.
>
> Thanks for that. That obvious error eluded me because in most of the patch
> update testing I was doing (on ext3), the longest sync was always about the
> same length as the total sync time.
>
> Attached patch (in correct diff form this time!) collects up all changes.
> That includes elimination of a potential race condition if someone changes
> log_checkpoints while a long sync phase is executing. I don't know whether
> that can happen, and it obviously won't give accurate stats going back to
> the beginning of the checkpoint in that case, but it tries to defend against
> producing complete garbage if that value changes out from under it.

It looks to me that the checkpoint writer only reloads settings between checkpoints, so log_checkpoints cannot change under it. But if it can, the behavior of the patch is about as reasonable as one can expect, I think.

> This is the first version of this patch I feel fairly good about; no open
> concerns left on my side. Jeff, since you're now the de-facto credited
> reviewer of this one by virtue of suggesting bug fixes, could you take this
> update out for a spin too?

It applies and builds cleanly, and passes make check with --enable-cassert. The details of log_checkpoints output are not documented, so no document patches are needed. I cannot think of any reasonable way to add a check into the make check framework, so I don't think one is needed.

It does what it says and what we want, and I don't see any dangers. I haven't noticed any slow-downs. It is listed under performance, but it is intended to diagnose, not fix, performance problems.

It seems to meet the coding style. It uses existing macros, so it should be as portable as the existing PG code is. Comments seem sufficient. The use of INSTR_TIME_SET_ZERO as a flag for detecting changing log_checkpoints maybe could use a comment, but it doesn't take much thought to figure out that that is what is going on.

I could not make it crash.

I'm marking it as ready for committer.

>> Combining this instrumentation patch with the backend sync one already
>> committed, the net result under log_min_messages=debug1 is somewhat
>> undesirable in that I can now see the individual sync times for the
>> syncs done by the checkpoint writer, but I do not get times for the
>> syncs done by the backend (I only get informed of their existence).
>
> On a properly working system, backend syncs shouldn't be happening. So if
> you see them, I think the question shouldn't be "how long are they taking?",
> it's "how do I get rid of them?" From that perspective, knowing of their
> existence is sufficient to suggest the necessary tuning changes, such as
> dropping bgwriter_delay.

OK. I was trying to figure out if the syncs were entangled (all ended at about the same time, regardless of when they started). But I can see that that is basically a hacker thing to do, not production, so I can add in my own logging at the end of the backend sync for my purposes.

> When you get into a situation where they are showing up, a whole lot of them
> can happen in a very brief period;

In my test cases, the syncs that the backends were doing were almost always to the same file that the checkpoint writer was already choking on (so they are entangled simply by virtue of that). So very quickly all the backends hit the same wall and thunked to a halt. This is probably a feature of trying to use pgbench as the basis to get a very artificial model. I need to find a way to get dozens or hundreds of files going. Maybe pgbench could be hacked so that each connection used a different schema.

Cheers,
Jeff
Jeff Janes wrote:
> In my test cases, the syncs that the backends were doing were almost
> always to the same file that the checkpoint writer was already choking
> on (so they are entangled simply by virtue of that). So very quickly
> all the backends hit the same wall and thunked to a halt. This is
> probably a feature of trying to use pgbench as the basis to get a very
> artificial model.

Yes--pgbench has some problems like you describe, ones that are a bit different than the way I've seen fsync writes get in each other's way in the production systems I've looked at. That's good if you really want to provoke this behavior, which is one reason why I've used it as an example for my patches so far (the other being that it's already available in everyone's installation). But it's tough to get it to act more like a real-world system, which doesn't have quite so many localized updates, without cranking the scale way up. And that then tends to aggravate other problems too. The 8.3 checkpoint spreading work also got some useful results using the dbt-2 benchmark. I'm at the point where I think I need to return to that test program for what I'm doing now. I'd encourage you to try that out too if you get a chance.

Thanks for the feedback and the review. I hope you appreciate now why I suggested you wait for the stuff I was submitting before getting back into the sorted checkpoint topic again. That should be a lot easier to make sense of with this instrumentation in place.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Sun, Dec 5, 2010 at 4:23 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>> I've attached a tiny patch to apply over yours, to deal with this and
>> with the case where no files are synced.
>
> Thanks for that. That obvious error eluded me because in most of the patch
> update testing I was doing (on ext3), the longest sync was always about the
> same length as the total sync time.
>
> Attached patch (in correct diff form this time!) collects up all changes.
> That includes elimination of a potential race condition if someone changes
> log_checkpoints while a long sync phase is executing. I don't know whether
> that can happen, and it obviously won't give accurate stats going back to
> the beginning of the checkpoint in that case, but it tries to defend against
> producing complete garbage if that value changes out from under it.
>
> This is the first version of this patch I feel fairly good about; no open
> concerns left on my side. Jeff, since you're now the de-facto credited
> reviewer of this one by virtue of suggesting bug fixes, could you take this
> update out for a spin too?

I took a look at this and it looks generally good, but I'm wondering why md.c is converting the results from an exact value to a floating point, only to have xlog.c turn around and convert back to an integer. I think it could just return milliseconds directly, or if you're worried about a checkpoint that takes more than 24 days to complete, seconds and microseconds. Or am I missing something?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> I'm wondering why md.c is converting the results from an exact value to a
> floating point, only to have xlog.c turn around and convert back to an
> integer. I think it could just return milliseconds directly, or if you're
> worried about a checkpoint that takes more than 24 days to complete,
> seconds and microseconds.

md.c is printing the value as a float, so I converted early to a double and then percolated it upward from there. More an artifact of how the code grew from its initial form than an intentional decision. I see your point that making elapsed, total_elapsed, ckpt_agg_sync_time, and ckpt_longest_sync all the same type of integer that INSTR_TIME_GET_MICROSEC returns would reduce the potential for rounding aberrations. I could do another rev of the patch tomorrow with that change if you'd prefer it that way. I don't have a strong opinion about that implementation detail; given that xlog.c is printing a less fine-grained time anyway (seconds with 3 digits vs. msec with 3 digits) it seems unlikely to run into a real problem here.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Mon, Dec 13, 2010 at 3:19 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> md.c is printing the value as a float, so I converted early to a double and
> then percolated it upward from there. More an artifact of how the code grew
> from its initial form than an intentional decision. I see your point that
> making elapsed, total_elapsed, ckpt_agg_sync_time, and ckpt_longest_sync all
> the same type of integer that INSTR_TIME_GET_MICROSEC returns would reduce
> the potential for rounding aberrations. I could do another rev of the patch
> tomorrow with that change if you'd prefer it that way. I don't have a
> strong opinion about that implementation detail; given that xlog.c is
> printing a less fine-grained time anyway (seconds with 3 digits vs. msec
> with 3 digits) it seems unlikely to run into a real problem here.

Yeah, it'd probably be OK anyway, but I think it doesn't really cost us anything to avoid the unnecessary conversion steps, so count me as a vote for doing it that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> I took a look at this and it looks generally good, but I'm wondering
> why md.c is converting the results from an exact value to a floating
> point, only to have xlog.c turn around and convert back to an integer.
> I think it could just return milliseconds directly, or if you're
> worried about a checkpoint that takes more than 24 days to complete,
> seconds and microseconds.

Attached patch now does something like this, except without the roll-over concern. INSTR_TIME_GET_MICROSEC returns a uint64 value. I just made that the storage format for all these values until they're converted for display. Test output:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16385/16480 time=10422.859 msec
DEBUG:  checkpoint sync: number=2 file=base/16385/16475_vm time=2896.614 msec
DEBUG:  checkpoint sync: number=3 file=base/16385/16475.1 time=57.836 msec
DEBUG:  checkpoint sync: number=4 file=base/16385/16466 time=20.080 msec
DEBUG:  checkpoint sync: number=5 file=base/16385/16463 time=74.926 msec
DEBUG:  checkpoint sync: number=6 file=base/16385/16482 time=74.263 msec
DEBUG:  checkpoint sync: number=7 file=base/16385/16475_fsm time=7.062 msec
DEBUG:  checkpoint sync: number=8 file=base/16385/16475 time=35.164 msec
LOG:  checkpoint complete: wrote 2143 buffers (52.3%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=1.213 s, sync=13.589 s, total=24.744 s; sync files=8, longest=10.422 s, average=1.698 s

This shows the hard truncation used, so 10422.859 msec becomes 10.422 s. I don't think allowing up to 0.999 ms of error there is a problem given the expected scale. But since none of the precision is lost until the end, that could be changed with only touching the final display formatting conversion of the value. Following your general idea further, why throw away any resolution inside of md.c; let xlog.c decide how to show it.

Note that I also fixed the DEBUG level lines to only show their actual precision. Before that was printing 6 digits to the right of the decimal point each time, the last three of which were always 0.

--
Greg Smith  2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1ed9687..c9778df 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** LogCheckpointEnd(bool restartpoint)
*** 6955,6964 ****
  {
  	long		write_secs,
  				sync_secs,
! 				total_secs;
  	int			write_usecs,
  				sync_usecs,
! 				total_usecs;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
--- 6955,6969 ----
  {
  	long		write_secs,
  				sync_secs,
! 				total_secs,
! 				longest_secs,
! 				average_secs;
  	int			write_usecs,
  				sync_usecs,
! 				total_usecs,
! 				longest_usecs,
! 				average_usecs;
! 	uint64		average_sync_time;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
*************** LogCheckpointEnd(bool restartpoint)
*** 6974,6991 ****
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  	else
  		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
  			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 CheckpointStats.ckpt_segs_added,
--- 6979,7017 ----
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
+ 	/*
+ 	 * Timing values returned from CheckpointStats are in microseconds.
+ 	 * Convert to the second plus microsecond form that TimestampDifference
+ 	 * returns for homogeneous printing.
+ 	 */
+ 	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000000);
+ 	longest_usecs = CheckpointStats.ckpt_longest_sync -
+ 					(uint64) longest_secs * 1000000;
+ 
+ 	average_sync_time = 0;
+ 	if (CheckpointStats.ckpt_sync_rels > 0)
+ 		average_sync_time = CheckpointStats.ckpt_agg_sync_time /
+ 							CheckpointStats.ckpt_sync_rels;
+ 	average_secs = (long) (average_sync_time / 1000000);
+ 	average_usecs = average_sync_time - (uint64) average_secs * 1000000;
+ 
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! 			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000,
! 			 CheckpointStats.ckpt_sync_rels,
! 			 longest_secs, longest_usecs / 1000,
! 			 average_secs, average_usecs / 1000);
  	else
  		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
  			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! 			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 CheckpointStats.ckpt_segs_added,
*************** LogCheckpointEnd(bool restartpoint)
*** 6993,6999 ****
  			 CheckpointStats.ckpt_segs_recycled,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  }
  
  /*
--- 7019,7028 ----
  			 CheckpointStats.ckpt_segs_recycled,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000,
! 			 CheckpointStats.ckpt_sync_rels,
! 			 longest_secs, longest_usecs / 1000,
! 			 average_secs, average_usecs / 1000);
  }
  
  /*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 604db21..64f9027 100644
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 20,25 ****
--- 20,26 ----
  
  #include "catalog/catalog.h"
  #include "miscadmin.h"
+ #include "portability/instr_time.h"
  #include "postmaster/bgwriter.h"
  #include "storage/fd.h"
  #include "storage/bufmgr.h"
*************** mdsync(void)
*** 927,932 ****
--- 928,940 ----
  	PendingOperationEntry *entry;
  	int			absorb_counter;
  
+ 	/* Statistics on sync times */
+ 	int			processed = 0;
+ 	instr_time	sync_start, sync_end, sync_diff;
+ 	uint64		elapsed;
+ 	uint64		longest = 0;
+ 	uint64		total_elapsed = 0;
+ 
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
  	 * occur in processes that have created a pendingOpsTable.
*************** mdsync(void)
*** 1069,1077 ****
--- 1077,1107 ----
  				seg = _mdfd_getseg(reln, entry->tag.forknum,
  							  entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
  								   false, EXTENSION_RETURN_NULL);
+ 
+ 				if (log_checkpoints)
+ 					INSTR_TIME_SET_CURRENT(sync_start);
+ 				else
+ 					INSTR_TIME_SET_ZERO(sync_start);
+ 
  				if (seg != NULL &&
  					FileSync(seg->mdfd_vfd) >= 0)
+ 				{
+ 					if (log_checkpoints && (! INSTR_TIME_IS_ZERO(sync_start)))
+ 					{
+ 						INSTR_TIME_SET_CURRENT(sync_end);
+ 						sync_diff = sync_end;
+ 						INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ 						elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ 						if (elapsed > longest)
+ 							longest = elapsed;
+ 						total_elapsed += elapsed;
+ 						processed++;
+ 						elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ 							 processed, FilePathName(seg->mdfd_vfd), (double) elapsed / 1000);
+ 					}
+ 					break;		/* success; break out of retry loop */
+ 				}
 
  				/*
  				 * XXX is there any point in allowing more than one retry?
*************** mdsync(void)
*** 1113,1118 ****
--- 1143,1153 ----
  			elog(ERROR, "pendingOpsTable corrupted");
  	}							/* end loop over hashtable entries */
  
+ 	/* Return sync performance metrics for report at checkpoint end */
+ 	CheckpointStats.ckpt_sync_rels = processed;
+ 	CheckpointStats.ckpt_longest_sync = longest;
+ 	CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+ 
  	/* Flag successful completion of mdsync */
  	mdsync_in_progress = false;
  }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index fa7ae2a..484cb6c 100644
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** typedef struct CheckpointStatsData
*** 257,262 ****
--- 257,269 ----
  	int			ckpt_segs_added;	/* # of new xlog segments created */
  	int			ckpt_segs_removed;	/* # of xlog segments deleted */
  	int			ckpt_segs_recycled; /* # of xlog segments recycled */
+ 
+ 	int			ckpt_sync_rels;		/* # of relations synced */
+ 	uint64		ckpt_longest_sync;	/* Longest sync for one relation */
+ 	uint64		ckpt_agg_sync_time; /* The sum of all the individual sync
+ 									   times, which is not necessarily the
+ 									   same as the total elapsed time for
+ 									   the entire sync phase. */
 } CheckpointStatsData;
 
 extern CheckpointStatsData CheckpointStats;
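The display-side conversion can be checked in isolation. A tiny standalone sketch (hypothetical demo value, not the committed code) of the uint64 microsecond split and the truncation that turns 10422.859 msec into 10.422 s:

    /* Split a microsecond count into seconds plus microseconds, then
     * print it the way the checkpoint-complete log line does: the
     * fractional part is truncated to milliseconds, not rounded. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t longest = 10422859;    /* 10422.859 msec, in microseconds */
        long secs = (long) (longest / 1000000);
        int  usecs = (int) (longest - (uint64_t) secs * 1000000);
        printf("longest=%ld.%03d s\n", secs, usecs / 1000);
        return 0;
    }

This prints "longest=10.422 s", matching the sample output above; at most 0.999 ms of precision is dropped, and only at display time.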
I gave this patch a look and it seems pretty good to me, except that I'm uncomfortable with the idea of mdsync filling in the details for CheckpointStats fields directly. Would it work to pass a struct (say SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to mdsync, have this function fill it, and return it back so that CheckPointBuffers copies the data from this struct into CheckpointStats? Another minor nitpick: inside the block when you call FileSync, why check for log_checkpoints at all? Seems to me that just checking for zero of sync_start should be enough. Alternatively, seems simpler to just have a local var with the value of log_checkpoints at the start of mdsync and use that throughout the function. (Surely if someone turns off log_checkpoints in the middle of a checkpoint, it's not really a problem that we collect and report stats during that checkpoint.) -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
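To make that proposal concrete, the plumbing might look something like the following minimal sketch. SmgrSyncStats and the revised smgrsync() signature are hypothetical names for illustration, not anything in the patch as posted, and the CheckPointBuffers() body is abbreviated:

typedef struct SmgrSyncStats
{
	int			sync_rels;		/* # of relations synced */
	uint64		longest_sync;	/* longest single file sync, in usec */
	uint64		agg_sync_time;	/* sum of individual sync times, in usec */
} SmgrSyncStats;

extern void smgrsync(SmgrSyncStats *stats);	/* passes through to mdsync() */

void
CheckPointBuffers(int flags)
{
	SmgrSyncStats sync_stats;

	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
	BufferSync(flags);
	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
	smgrsync(&sync_stats);		/* mdsync() fills in the counters */
	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();

	/* bufmgr.c, not md.c, becomes the only code touching CheckpointStats */
	CheckpointStats.ckpt_sync_rels = sync_stats.sync_rels;
	CheckpointStats.ckpt_longest_sync = sync_stats.longest_sync;
	CheckpointStats.ckpt_agg_sync_time = sync_stats.agg_sync_time;
}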
On Tue, Dec 14, 2010 at 9:29 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > I gave this patch a look and it seems pretty good to me, except Err, woops. I just committed this as-is. Sorry. > that I'm > uncomfortable with the idea of mdsync filling in the details for > CheckpointStats fields directly. Would it work to pass a struct (say > SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to > mdsync, have this function fill it, and return it back so that > CheckPointBuffers copies the data from this struct into CheckpointStats? > > Another minor nitpick: inside the block when you call FileSync, why > check for log_checkpoints at all? Seems to me that just checking for > zero of sync_start should be enough. Alternatively, seems simpler to > just have a local var with the value of log_checkpoints at the start of > mdsync and use that throughout the function. (Surely if someone turns > off log_checkpoints in the middle of a checkpoint, it's not really a > problem that we collect and report stats during that checkpoint.) Neither of these things bothers me, but we can certainly discuss... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of Tue Dec 14 11:34:55 -0300 2010: > On Tue, Dec 14, 2010 at 9:29 AM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > I gave this patch a look and it seems pretty good to me, except > > Err, woops. I just committed this as-is. Sorry. I noticed :-) > > that I'm > > uncomfortable with the idea of mdsync filling in the details for > > CheckpointStats fields directly. Would it work to pass a struct (say > > SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to > > mdsync, have this function fill it, and return it back so that > > CheckPointBuffers copies the data from this struct into CheckpointStats? > > > > Another minor nitpick: inside the block when you call FileSync, why > > check for log_checkpoints at all? Seems to me that just checking for > > zero of sync_start should be enough. Alternatively, seems simpler to > > just have a local var with the value of log_checkpoints at the start of > > mdsync and use that throughout the function. (Surely if someone turns > > off log_checkpoints in the middle of a checkpoint, it's not really a > > problem that we collect and report stats during that checkpoint.) > > Neither of these things bothers me, but we can certainly discuss... Well, the second one was just about simplifying it, so never mind that. But referring to CheckpointStats in md.c seems to me to be a violation of modularity that ought to be fixed. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Dec 14, 2010 at 11:47 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> > that I'm >> > uncomfortable with the idea of mdsync filling in the details for >> > CheckpointStats fields directly. Would it work to pass a struct (say >> > SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to >> > mdsync, have this function fill it, and return it back so that >> > CheckPointBuffers copies the data from this struct into CheckpointStats? > > But referring to CheckpointStats in md.c seems to me to be a violation > of modularity that ought to be fixed. Hmm. I guess I can't get worked up about it. We could do what you propose, but I'm not sure what purpose it would serve. It's not as if mdsync() can serve any purpose other than being the guts of a checkpoint. It seems to me that if we really wanted to get this disentangled from the checkpoint stats stuff, we'd also need to think about moving that elog(DEBUG1) I added out of the function somehow, but I think that would just be a lot of notational complexity for no particular benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Alvaro Herrera wrote: > I gave this patch a look and it seems pretty good to me, except that I'm > uncomfortable with the idea of mdsync filling in the details for > CheckpointStats fields directly. Would it work to pass a struct (say > SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to > mdsync, have this function fill it, and return it back so that > CheckPointBuffers copies the data from this struct into CheckpointStats? > That was originally how I planned to write this bit of code. When I realized that the CheckpointStats structure was already visible there and stuffed with details that ultimately go into the same output line at the end, it just didn't seem worth the extra code complexity. The abstraction layer around md.c was not exactly airtight before I poked that extra little hole in it, and I was working from the principle that a smaller diff usually makes for a better patch. If you feel strongly that the result is a bad abstraction violation, I'll submit a patch to refactor it to pass a structure instead before the next CF. I appreciate your concern; I'm just not sure it's worth spending time on. What I'd really like to do instead is refactor out major parts of the leaky md/smgr layers altogether, but that's obviously a bigger project. > Another minor nitpick: inside the block when you call FileSync, why > check for log_checkpoints at all? Seems to me that just checking for > zero of sync_start should be enough. Alternatively, seems simpler to > just have a local var with the value of log_checkpoints at the start of > mdsync and use that throughout the function. (Surely if someone turns > off log_checkpoints in the middle of a checkpoint, it's not really a > problem that we collect and report stats during that checkpoint.) > And now you're just getting picky! This is a useful observation though, and I'll try to include that fix along with the next general checkpoint overhaul patch I submit. It doesn't seem worth the trouble of committing that minor rework on its own; I'll slip it into the next useful change that touches this area. Thanks for the hint; this would work better than what I did. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
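For concreteness, the local-variable simplification being agreed to above would amount to something like this untested sketch against the committed md.c code (collect_stats is an illustrative name, not from any posted patch):

  	int			absorb_counter;
+ 	/* Sample the GUC once so every file in this checkpoint is treated
+ 	 * consistently even if log_checkpoints changes mid-checkpoint */
+ 	bool		collect_stats = log_checkpoints;
  ...
- 				if (log_checkpoints)
+ 				if (collect_stats)
  					INSTR_TIME_SET_CURRENT(sync_start);
- 				else
- 					INSTR_TIME_SET_ZERO(sync_start);
  ...
- 					if (log_checkpoints && (!INSTR_TIME_IS_ZERO(sync_start)))
+ 					if (collect_stats)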
On Wed, Dec 15, 2010 at 9:22 AM, Greg Smith <greg@2ndquadrant.com> wrote: > patch I submit. Doesn't seem worth going through the trouble of committing > that minor rework on its own, I'll slip it into the next useful thing that > touches this area I do. Thanks for the hint, this would work better than > what I did. Well, if I'm the one committing it, I'll pull that part out again and commit it separately. Not sure if that affects your calculus, but I much prefer patches that don't try to do ancillary things along the way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > Well, if I'm the one committing it, I'll pull that part out again and > commit it separately. Not sure if that affects your calculus, but I > much prefer patches that don't try to do ancillary things along the way. I meant that I'd bundle it into the block of time I spend on that, and likely submit with something else that touches the same area. Obviously the correction patch would be better on its own when being handed over to a committer. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Thu, Dec 16, 2010 at 6:48 AM, Greg Smith <greg@2ndquadrant.com> wrote: > I meant that I'd bundle it into the block of time I spend on that, and > likely submit with something else that touches the same area. Obviously the > correction patch would be better on its own when being handed over to a > committer. Oh, that's very different. Never mind! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company