Thread: Load Distributed Checkpoints, take 3
Here's an updated WIP patch for load distributed checkpoints.

I added a spinlock to protect the signaling fields between bgwriter and
backends. The current non-locking approach gets really difficult as the patch
adds two new flags, and both are more important than the existing
ckpt_time_warn flag.

In fact, I think there's a small race condition in CVS HEAD:

1. pg_start_backup() is called, which calls RequestCheckpoint
2. RequestCheckpoint takes note of the old value of ckpt_started
3. bgwriter wakes up from pg_usleep, and sees that we've exceeded
   checkpoint_timeout
4. bgwriter increases ckpt_started to note that a new checkpoint has started
5. RequestCheckpoint signals bgwriter to start a new checkpoint
6. bgwriter calls CreateCheckpoint, with the force-flag set to false because
   this checkpoint was triggered by timeout
7. RequestCheckpoint sees that ckpt_started has increased, and starts to wait
   for ckpt_done to reach the new value
8. CreateCheckpoint finishes immediately, because there was no XLOG activity
   since the last checkpoint
9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns
10. pg_start_backup() continues, with potentially the same redo location, and
    thus the same history filename, as the previous backup

Now I admit that the chances of that happening are extremely small; people
don't usually issue two pg_start_backup calls without *any* WAL-logged
activity in between them, for example. But as we add the new flags, avoiding
scenarios like that becomes harder.

Since the last patch, I did some cleanup and refactoring, added a bunch of
comments, and wrote user documentation. I haven't yet changed GetInsertRecPtr
to use the almost up-to-date value protected by info_lck, per Simon's
suggestion, and I still need to do some correctness testing. After that, I'm
done with the patch.

PS. In case you wonder what took me so long since the last revision: I've
spent a lot of time reviewing HOT.
-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml	7 Jun 2007 19:19:56 -0000	1.126
--- doc/src/sgml/config.sgml	19 Jun 2007 14:24:31 -0000
***************
*** 1565,1570 ****
--- 1565,1608 ----
      </listitem>
     </varlistentry>
 
+    <varlistentry id="guc-checkpoint-smoothing" xreflabel="checkpoint_smoothing">
+     <term><varname>checkpoint_smoothing</varname> (<type>floating point</type>)</term>
+     <indexterm>
+      <primary><varname>checkpoint_smoothing</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       Specifies the target length of checkpoints, as a fraction of
+       the checkpoint interval. The default is 0.3.
+ 
+       This parameter can only be set in the <filename>postgresql.conf</>
+       file or on the server command line.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry id="guc-checkpoint-rate" xreflabel="checkpoint_rate">
+     <term><varname>checkpoint_rate</varname> (<type>floating point</type>)</term>
+     <indexterm>
+      <primary><varname>checkpoint_rate</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       Specifies the minimum I/O rate used to flush dirty buffers during a
+       checkpoint, when there are not many dirty buffers in the buffer cache.
+       The default is 512 KB/s.
+ 
+       Note: the accuracy of this setting depends on
+       <varname>bgwriter_delay</varname>. This value is converted internally
+       to pages / bgwriter_delay, so if for example the minimum allowed
+       bgwriter_delay setting of 10 ms is used, the effective minimum
+       checkpoint I/O rate is 1 page / 10 ms, or 800 KB/s.
+ 
+       This parameter can only be set in the <filename>postgresql.conf</>
+       file or on the server command line.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
     <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
      <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
      <indexterm>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml	31 Jan 2007 20:56:19 -0000	1.43
--- doc/src/sgml/wal.sgml	19 Jun 2007 14:26:45 -0000
***************
*** 217,225 ****
   </para>
 
   <para>
    There will be at least one WAL segment file, and will normally
    not be more than 2 * <varname>checkpoint_segments</varname> + 1
!   files.  Each segment file is normally 16 MB (though this size can be
   altered when building the server).  You can use this to estimate space
   requirements for <acronym>WAL</acronym>.
   Ordinarily, when old log segment files are no longer needed, they
--- 217,245 ----
   </para>
 
   <para>
+   If there are a lot of dirty buffers in the buffer cache, flushing them
+   all at checkpoint time will cause a heavy burst of I/O that can disrupt
+   other activity in the system. To avoid that, the checkpoint I/O can be
+   distributed over a longer period of time, defined with
+   <varname>checkpoint_smoothing</varname>. It's given as a fraction of the
+   checkpoint interval, as defined by <varname>checkpoint_timeout</varname>
+   and <varname>checkpoint_segments</varname>. The WAL segment consumption
+   and elapsed time are monitored and the I/O rate is adjusted during the
+   checkpoint so that it's finished when the given fraction of elapsed time
+   or WAL segments has passed, whichever is sooner. However, that could lead
+   to unnecessarily prolonged checkpoints when there are not many dirty
+   buffers in the cache. To avoid that, <varname>checkpoint_rate</varname>
+   can be used to set the minimum I/O rate used. Note that prolonging
+   checkpoints affects recovery time, because the longer the checkpoint
+   takes, the more WAL needs to be kept around and replayed in recovery.
+  </para>
+ 
+  <para>
    There will be at least one WAL segment file, and will normally
    not be more than 2 * <varname>checkpoint_segments</varname> + 1
!   files, though there can be more if a large
!   <varname>checkpoint_smoothing</varname> setting is used.
!   Each segment file is normally 16 MB (though this size can be
   altered when building the server).  You can use this to estimate space
   requirements for <acronym>WAL</acronym>.
   Ordinarily, when old log segment files are no longer needed, they
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c	31 May 2007 15:13:01 -0000	1.272
--- src/backend/access/transam/xlog.c	20 Jun 2007 10:44:40 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
  		if (XLOG_DEBUG)
  			elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 		RequestCheckpoint(false, true);
  	}
  }
  }
--- 1608,1614 ----
  		if (XLOG_DEBUG)
  			elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 
RequestXLogFillCheckpoint(); } } } *************** *** 5110,5116 **** * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. */ ! CreateCheckPoint(true, true); /* * Close down recovery environment --- 5110,5116 ---- * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. */ ! CreateCheckPoint(true, true, true); /* * Close down recovery environment *************** *** 5319,5324 **** --- 5319,5340 ---- } /* + * GetInsertRecPtr -- Returns the current insert position. + */ + XLogRecPtr + GetInsertRecPtr(void) + { + XLogCtlInsert *Insert = &XLogCtl->Insert; + XLogRecPtr recptr; + + LWLockAcquire(WALInsertLock, LW_SHARED); + INSERT_RECPTR(recptr, Insert, Insert->curridx); + LWLockRelease(WALInsertLock); + + return recptr; + } + + /* * Get the time of the last xlog segment switch */ time_t *************** *** 5383,5389 **** ereport(LOG, (errmsg("shutting down"))); ! CreateCheckPoint(true, true); ShutdownCLOG(); ShutdownSUBTRANS(); ShutdownMultiXact(); --- 5399,5405 ---- ereport(LOG, (errmsg("shutting down"))); ! CreateCheckPoint(true, true, true); ShutdownCLOG(); ShutdownSUBTRANS(); ShutdownMultiXact(); *************** *** 5395,5405 **** /* * Perform a checkpoint --- either during shutdown, or on-the-fly * * If force is true, we force a checkpoint regardless of whether any XLOG * activity has occurred since the last one. */ void ! CreateCheckPoint(bool shutdown, bool force) { CheckPoint checkPoint; XLogRecPtr recptr; --- 5411,5424 ---- /* * Perform a checkpoint --- either during shutdown, or on-the-fly * + * If immediate is true, we try to finish the checkpoint as fast as we can, + * ignoring checkpoint_smoothing parameter. + * * If force is true, we force a checkpoint regardless of whether any XLOG * activity has occurred since the last one. */ void ! 
CreateCheckPoint(bool shutdown, bool immediate, bool force) { CheckPoint checkPoint; XLogRecPtr recptr; *************** *** 5591,5597 **** */ END_CRIT_SECTION(); ! CheckPointGuts(checkPoint.redo); START_CRIT_SECTION(); --- 5610,5616 ---- */ END_CRIT_SECTION(); ! CheckPointGuts(checkPoint.redo, immediate); START_CRIT_SECTION(); *************** *** 5693,5708 **** /* * Flush all data in shared memory to disk, and fsync * * This is the common code shared between regular checkpoints and * recovery restartpoints. */ static void ! CheckPointGuts(XLogRecPtr checkPointRedo) { CheckPointCLOG(); CheckPointSUBTRANS(); CheckPointMultiXact(); ! FlushBufferPool(); /* performs all required fsyncs */ /* We deliberately delay 2PC checkpointing as long as possible */ CheckPointTwoPhase(checkPointRedo); } --- 5712,5730 ---- /* * Flush all data in shared memory to disk, and fsync * + * If immediate is true, try to finish as quickly as possible, ignoring + * the GUC variables to throttle checkpoint I/O. + * * This is the common code shared between regular checkpoints and * recovery restartpoints. */ static void ! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate) { CheckPointCLOG(); CheckPointSUBTRANS(); CheckPointMultiXact(); ! FlushBufferPool(immediate); /* performs all required fsyncs */ /* We deliberately delay 2PC checkpointing as long as possible */ CheckPointTwoPhase(checkPointRedo); } *************** *** 5710,5716 **** /* * Set a recovery restart point if appropriate * ! * This is similar to CreateCheckpoint, but is used during WAL recovery * to establish a point from which recovery can roll forward without * replaying the entire recovery log. This function is called each time * a checkpoint record is read from XLOG; it must determine whether a --- 5732,5738 ---- /* * Set a recovery restart point if appropriate * ! 
* This is similar to CreateCheckPoint, but is used during WAL recovery * to establish a point from which recovery can roll forward without * replaying the entire recovery log. This function is called each time * a checkpoint record is read from XLOG; it must determine whether a *************** *** 5751,5757 **** /* * OK, force data out to disk */ ! CheckPointGuts(checkPoint->redo); /* * Update pg_control so that any subsequent crash will restart from this --- 5773,5779 ---- /* * OK, force data out to disk */ ! CheckPointGuts(checkPoint->redo, true); /* * Update pg_control so that any subsequent crash will restart from this *************** *** 6177,6183 **** * have different checkpoint positions and hence different history * file names, even if nothing happened in between. */ ! RequestCheckpoint(true, false); /* * Now we need to fetch the checkpoint record location, and also its --- 6199,6205 ---- * have different checkpoint positions and hence different history * file names, even if nothing happened in between. */ ! RequestLazyCheckpoint(); /* * Now we need to fetch the checkpoint record location, and also its Index: src/backend/bootstrap/bootstrap.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v retrieving revision 1.233 diff -c -r1.233 bootstrap.c *** src/backend/bootstrap/bootstrap.c 7 Mar 2007 13:35:02 -0000 1.233 --- src/backend/bootstrap/bootstrap.c 19 Jun 2007 15:29:51 -0000 *************** *** 489,495 **** /* Perform a checkpoint to ensure everything's down to disk */ SetProcessingMode(NormalProcessing); ! CreateCheckPoint(true, true); /* Clean up and exit */ cleanup(); --- 489,495 ---- /* Perform a checkpoint to ensure everything's down to disk */ SetProcessingMode(NormalProcessing); ! 
CreateCheckPoint(true, true, true); /* Clean up and exit */ cleanup(); Index: src/backend/commands/dbcommands.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v retrieving revision 1.195 diff -c -r1.195 dbcommands.c *** src/backend/commands/dbcommands.c 1 Jun 2007 19:38:07 -0000 1.195 --- src/backend/commands/dbcommands.c 20 Jun 2007 09:36:24 -0000 *************** *** 404,410 **** * up-to-date for the copy. (We really only need to flush buffers for the * source database, but bufmgr.c provides no API for that.) */ ! BufferSync(); /* * Once we start copying subdirectories, we need to be able to clean 'em --- 404,410 ---- * up-to-date for the copy. (We really only need to flush buffers for the * source database, but bufmgr.c provides no API for that.) */ ! BufferSync(true); /* * Once we start copying subdirectories, we need to be able to clean 'em *************** *** 507,513 **** * Perhaps if we ever implement CREATE DATABASE in a less cheesy way, * we can avoid this. */ ! RequestCheckpoint(true, false); /* * Close pg_database, but keep lock till commit (this is important to --- 507,513 ---- * Perhaps if we ever implement CREATE DATABASE in a less cheesy way, * we can avoid this. */ ! RequestImmediateCheckpoint(); /* * Close pg_database, but keep lock till commit (this is important to *************** *** 661,667 **** * open files, which would cause rmdir() to fail. */ #ifdef WIN32 ! RequestCheckpoint(true, false); #endif /* --- 661,667 ---- * open files, which would cause rmdir() to fail. */ #ifdef WIN32 ! RequestImmediateCheckpoint(); #endif /* *************** *** 1427,1433 **** * up-to-date for the copy. (We really only need to flush buffers for * the source database, but bufmgr.c provides no API for that.) */ ! BufferSync(); /* * Copy this subdirectory to the new location --- 1427,1433 ---- * up-to-date for the copy. 
(We really only need to flush buffers for * the source database, but bufmgr.c provides no API for that.) */ ! BufferSync(true); /* * Copy this subdirectory to the new location Index: src/backend/postmaster/bgwriter.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v retrieving revision 1.38 diff -c -r1.38 bgwriter.c *** src/backend/postmaster/bgwriter.c 27 May 2007 03:50:39 -0000 1.38 --- src/backend/postmaster/bgwriter.c 20 Jun 2007 12:58:20 -0000 *************** *** 44,49 **** --- 44,50 ---- #include "postgres.h" #include <signal.h> + #include <sys/time.h> #include <time.h> #include <unistd.h> *************** *** 59,64 **** --- 60,66 ---- #include "storage/pmsignal.h" #include "storage/shmem.h" #include "storage/smgr.h" + #include "storage/spin.h" #include "tcop/tcopprot.h" #include "utils/guc.h" #include "utils/memutils.h" *************** *** 112,122 **** { pid_t bgwriter_pid; /* PID of bgwriter (0 if not started) */ ! sig_atomic_t ckpt_started; /* advances when checkpoint starts */ ! sig_atomic_t ckpt_done; /* advances when checkpoint done */ ! sig_atomic_t ckpt_failed; /* advances when checkpoint fails */ ! sig_atomic_t ckpt_time_warn; /* warn if too soon since last ckpt? */ int num_requests; /* current # of requests */ int max_requests; /* allocated array size */ --- 114,128 ---- { pid_t bgwriter_pid; /* PID of bgwriter (0 if not started) */ ! slock_t ckpt_lck; /* protects all the ckpt_* fields */ ! int ckpt_started; /* advances when checkpoint starts */ ! int ckpt_done; /* advances when checkpoint done */ ! int ckpt_failed; /* advances when checkpoint fails */ ! ! bool ckpt_rqst_time_warn; /* warn if too soon since last ckpt */ ! bool ckpt_rqst_immediate; /* an immediate ckpt has been requested */ ! 
bool ckpt_rqst_force; /* checkpoint even if no WAL activity */ int num_requests; /* current # of requests */ int max_requests; /* allocated array size */ *************** *** 131,136 **** --- 137,143 ---- int BgWriterDelay = 200; int CheckPointTimeout = 300; int CheckPointWarning = 30; + double CheckPointSmoothing = 0.3; /* * Flags set by interrupt handlers for later service in the main loop. *************** *** 146,154 **** --- 153,176 ---- static bool ckpt_active = false; + /* Current time and WAL insert location when checkpoint was started */ + static time_t ckpt_start_time; + static XLogRecPtr ckpt_start_recptr; + + static double ckpt_cached_elapsed; + static time_t last_checkpoint_time; static time_t last_xlog_switch_time; + /* Prototypes for private functions */ + + static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force); + static void CheckArchiveTimeout(void); + static void BgWriterNap(void); + static bool IsCheckpointOnSchedule(double progress); + static bool ImmediateCheckpointRequested(void); + + /* Signal handlers */ static void bg_quickdie(SIGNAL_ARGS); static void BgSigHupHandler(SIGNAL_ARGS); *************** *** 170,175 **** --- 192,198 ---- Assert(BgWriterShmem != NULL); BgWriterShmem->bgwriter_pid = MyProcPid; + SpinLockInit(&BgWriterShmem->ckpt_lck); am_bg_writer = true; /* *************** *** 281,288 **** --- 304,314 ---- /* use volatile pointer to prevent code rearrangement */ volatile BgWriterShmemStruct *bgs = BgWriterShmem; + SpinLockAcquire(&BgWriterShmem->ckpt_lck); bgs->ckpt_failed++; bgs->ckpt_done = bgs->ckpt_started; + SpinLockRelease(&bgs->ckpt_lck); + ckpt_active = false; } *************** *** 328,337 **** for (;;) { bool do_checkpoint = false; - bool force_checkpoint = false; time_t now; int elapsed_secs; - long udelay; /* * Emergency bailout if postmaster has died. 
This is to avoid the --- 354,361 ---- *************** *** 354,360 **** { checkpoint_requested = false; do_checkpoint = true; - force_checkpoint = true; BgWriterStats.m_requested_checkpoints++; } if (shutdown_requested) --- 378,383 ---- *************** *** 377,387 **** */ now = time(NULL); elapsed_secs = now - last_checkpoint_time; ! if (elapsed_secs >= CheckPointTimeout) { do_checkpoint = true; ! if (!force_checkpoint) ! BgWriterStats.m_timed_checkpoints++; } /* --- 400,409 ---- */ now = time(NULL); elapsed_secs = now - last_checkpoint_time; ! if (!do_checkpoint && elapsed_secs >= CheckPointTimeout) { do_checkpoint = true; ! BgWriterStats.m_timed_checkpoints++; } /* *************** *** 390,395 **** --- 412,445 ---- */ if (do_checkpoint) { + /* use volatile pointer to prevent code rearrangement */ + volatile BgWriterShmemStruct *bgs = BgWriterShmem; + bool time_warn; + bool immediate; + bool force; + + /* + * Atomically check the request flags to figure out what + * kind of a checkpoint we should perform, and increase the + * started-counter to acknowledge that we've started + * a new checkpoint. + */ + + SpinLockAcquire(&bgs->ckpt_lck); + + time_warn = bgs->ckpt_rqst_time_warn; + bgs->ckpt_rqst_time_warn = false; + + immediate = bgs->ckpt_rqst_immediate; + bgs->ckpt_rqst_immediate = false; + + force = bgs->ckpt_rqst_force; + bgs->ckpt_rqst_force = false; + + bgs->ckpt_started++; + + SpinLockRelease(&bgs->ckpt_lck); + /* * We will warn if (a) too soon since last checkpoint (whatever * caused it) and (b) somebody has set the ckpt_time_warn flag *************** *** 397,417 **** * implementation will not generate warnings caused by * CheckPointTimeout < CheckPointWarning. */ ! if (BgWriterShmem->ckpt_time_warn && elapsed_secs < CheckPointWarning) ereport(LOG, (errmsg("checkpoints are occurring too frequently (%d seconds apart)", elapsed_secs), errhint("Consider increasing the configuration parameter \"checkpoint_segments\"."))); ! 
BgWriterShmem->ckpt_time_warn = false; /* * Indicate checkpoint start to any waiting backends. */ ckpt_active = true; - BgWriterShmem->ckpt_started++; ! CreateCheckPoint(false, force_checkpoint); /* * After any checkpoint, close all smgr files. This is so we --- 447,474 ---- * implementation will not generate warnings caused by * CheckPointTimeout < CheckPointWarning. */ ! if (time_warn && elapsed_secs < CheckPointWarning) ereport(LOG, (errmsg("checkpoints are occurring too frequently (%d seconds apart)", elapsed_secs), errhint("Consider increasing the configuration parameter \"checkpoint_segments\"."))); ! /* * Indicate checkpoint start to any waiting backends. */ ckpt_active = true; ! ckpt_start_recptr = GetInsertRecPtr(); ! ckpt_start_time = now; ! ckpt_cached_elapsed = 0; ! ! elog(DEBUG1, "CHECKPOINT: start"); ! ! CreateCheckPoint(false, immediate, force); ! ! elog(DEBUG1, "CHECKPOINT: end"); /* * After any checkpoint, close all smgr files. This is so we *************** *** 422,428 **** /* * Indicate checkpoint completion to any waiting backends. */ ! BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started; ckpt_active = false; /* --- 479,487 ---- /* * Indicate checkpoint completion to any waiting backends. */ ! SpinLockAcquire(&bgs->ckpt_lck); ! bgs->ckpt_done = bgs->ckpt_started; ! SpinLockRelease(&bgs->ckpt_lck); ckpt_active = false; /* *************** *** 433,446 **** last_checkpoint_time = now; } else ! BgBufferSync(); /* ! * Check for archive_timeout, if so, switch xlog files. First we do a ! * quick check using possibly-stale local state. */ ! if (XLogArchiveTimeout > 0 && ! (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout) { /* * Update local state ... note that last_xlog_switch_time is the --- 492,530 ---- last_checkpoint_time = now; } else ! { ! BgAllSweep(); ! BgLruSweep(); ! } /* ! * Check for archive_timeout and switch xlog files if necessary. */ ! CheckArchiveTimeout(); ! ! /* Nap for the configured time. */ ! BgWriterNap(); ! } ! } ! ! 
/* ! * CheckArchiveTimeout -- check for archive_timeout and switch xlog files ! * if needed ! */ ! static void ! CheckArchiveTimeout(void) ! { ! time_t now; ! ! if (XLogArchiveTimeout <= 0) ! return; ! ! now = time(NULL); ! ! /* First we do a quick check using possibly-stale local state. */ ! if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout) ! return; ! { /* * Update local state ... note that last_xlog_switch_time is the *************** *** 450,459 **** last_xlog_switch_time = Max(last_xlog_switch_time, last_time); - /* if we did a checkpoint, 'now' might be stale too */ - if (do_checkpoint) - now = time(NULL); - /* Now we can do the real check */ if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout) { --- 534,539 ---- *************** *** 478,483 **** --- 558,572 ---- last_xlog_switch_time = now; } } + } + + /* + * BgWriterNap -- Nap for the configured time or until a signal is received. + */ + static void + BgWriterNap(void) + { + long udelay; /* * Send off activity statistics to the stats collector *************** *** 496,502 **** * We absorb pending requests after each short sleep. */ if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) || ! (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0)) udelay = BgWriterDelay * 1000L; else if (XLogArchiveTimeout > 0) udelay = 1000000L; /* One second */ --- 585,592 ---- * We absorb pending requests after each short sleep. */ if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) || ! (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) || ! ckpt_active) udelay = BgWriterDelay * 1000L; else if (XLogArchiveTimeout > 0) udelay = 1000000L; /* One second */ *************** *** 505,522 **** while (udelay > 999999L) { ! if (got_SIGHUP || checkpoint_requested || shutdown_requested) break; pg_usleep(1000000L); AbsorbFsyncRequests(); udelay -= 1000000L; } ! 
if (!(got_SIGHUP || checkpoint_requested || shutdown_requested)) pg_usleep(udelay); } } /* -------------------------------- * signal handler routines --- 595,766 ---- while (udelay > 999999L) { ! /* If a checkpoint is active, postpone reloading the config ! * until the checkpoint is finished, and don't care about ! * non-immediate checkpoint requests. ! */ ! if (shutdown_requested || ! (!ckpt_active && (got_SIGHUP || checkpoint_requested)) || ! (ckpt_active && ImmediateCheckpointRequested())) break; + pg_usleep(1000000L); AbsorbFsyncRequests(); udelay -= 1000000L; } ! ! if (!(shutdown_requested || ! (!ckpt_active && (got_SIGHUP || checkpoint_requested)) || ! (ckpt_active && ImmediateCheckpointRequested()))) pg_usleep(udelay); + } + + /* + * Returns true if an immediate checkpoint request is pending. + */ + static bool + ImmediateCheckpointRequested() + { + if (checkpoint_requested) + { + volatile BgWriterShmemStruct *bgs = BgWriterShmem; + + /* + * We're only looking at a single field, so we don't need to + * acquire the lock in this case. + */ + if (bgs->ckpt_rqst_immediate) + return true; } + return false; } + /* + * CheckpointWriteDelay -- periodical sleep in checkpoint write phase + * + * During checkpoint, this is called periodically by the buffer manager while + * writing out dirty buffers from the shared buffer cache. We estimate if we've + * made enough progress so that we're going to finish this checkpoint in time + * before the next one is due, taking checkpoint_smoothing into account. + * If so, we perform one round of normal bgwriter activity including LRU- + * cleaning of buffer cache, switching xlog segment if archive_timeout has + * passed, and sleeping for BgWriterDelay msecs. + * + * 'progress' is an estimate of how much of the writes has been done, as a + * fraction between 0.0 meaning none, and 1.0 meaning all done. + */ + void + CheckpointWriteDelay(double progress) + { + /* + * Return immediately if we should finish the checkpoint ASAP. 
+ */ + if (!am_bg_writer || CheckPointSmoothing <= 0 || shutdown_requested || + ImmediateCheckpointRequested()) + return; + + elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress); + + /* Take a nap and perform the usual bgwriter duties, unless we're behind + * schedule, in which case we just try to catch up as quickly as possible. + */ + if (IsCheckpointOnSchedule(progress)) + { + CheckArchiveTimeout(); + BgLruSweep(); + BgWriterNap(); + } + } + + /* + * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint + * in time? + * + * Compares the current progress against the time/segments elapsed since last + * checkpoint, and returns true if the progress we've made this far is greater + * than the elapsed time/segments. + * + * If another checkpoint has already been requested, always return false. + */ + static bool + IsCheckpointOnSchedule(double progress) + { + struct timeval now; + XLogRecPtr recptr; + double progress_in_time, + progress_in_xlog; + + Assert(ckpt_active); + + /* scale progress according to CheckPointSmoothing */ + progress *= CheckPointSmoothing; + + /* + * Check against the cached value first. Only do the more expensive + * calculations once we reach the target previously calculated. Since + * neither time or WAL insert pointer moves backwards, a freshly + * calculated value can only be greater than or equal to the cached value. + */ + if (progress < ckpt_cached_elapsed) + { + elog(DEBUG2, "IsCheckpointOnSchedule: Still behind cached=%.3f, progress=%.3f", + ckpt_cached_elapsed, progress); + return false; + } + + gettimeofday(&now, NULL); + + /* + * Check progress against time elapsed and checkpoint_timeout. 
+ */ + progress_in_time = ((double) (now.tv_sec - ckpt_start_time) + + now.tv_usec / 1000000.0) / CheckPointTimeout; + + if (progress < progress_in_time) + { + elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, progress=%.3f", + progress_in_time, progress); + + ckpt_cached_elapsed = progress_in_time; + + return false; + } + + /* + * Check progress against WAL segments written and checkpoint_segments. + * + * We compare the current WAL insert location against the location + * computed before calling CreateCheckPoint. The code in XLogInsert that + * actually triggers a checkpoint when checkpoint_segments is exceeded + * compares against RedoRecptr, so this is not completely accurate. + * However, it's good enough for our purposes, we're only calculating + * an estimate anyway. + */ + recptr = GetInsertRecPtr(); + progress_in_xlog = + (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile + + ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) / + CheckPointSegments; + + if (progress < progress_in_xlog) + { + elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, progress=%.3f", + progress_in_xlog, progress); + + ckpt_cached_elapsed = progress_in_xlog; + + return false; + } + + + /* It looks like we're on schedule. */ + + elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f progress=%.3f", + progress_in_time, progress_in_xlog, progress); + + return true; + } /* -------------------------------- * signal handler routines *************** *** 618,625 **** } /* * RequestCheckpoint ! * Called in backend processes to request an immediate checkpoint * * If waitforit is true, wait until the checkpoint is completed * before returning; otherwise, just signal the request and return --- 862,910 ---- } /* + * RequestImmediateCheckpoint + * Called in backend processes to request an immediate checkpoint. + * + * Returns when the checkpoint is finished. 
+  */
+ void
+ RequestImmediateCheckpoint()
+ {
+ 	RequestCheckpoint(true, false, true, true);
+ }
+ 
+ /*
+  * RequestLazyCheckpoint
+  *		Called in backend processes to request a lazy checkpoint.
+  *
+  * This is essentially the same as RequestImmediateCheckpoint, except
+  * that this form obeys the checkpoint_smoothing GUC variable, and
+  * can therefore take a lot longer.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestLazyCheckpoint()
+ {
+ 	RequestCheckpoint(true, false, false, true);
+ }
+ 
+ /*
+  * RequestXLogFillCheckpoint
+  *		Signals the bgwriter that we've reached checkpoint_segments
+  *
+  * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, this returns
+  * immediately without waiting for the checkpoint to finish.
+  */
+ void
+ RequestXLogFillCheckpoint()
+ {
+ 	RequestCheckpoint(false, true, false, false);
+ }
+ 
  /*
   * RequestCheckpoint
!  *		Common subroutine for all the above Request*Checkpoint variants.
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
***************
*** 628,648 ****
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
! void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 	sig_atomic_t old_failed = bgs->ckpt_failed;
! 	sig_atomic_t old_started = bgs->ckpt_started;
 
  	/*
  	 * If in a standalone backend, just do it ourselves.
  	 */
  	if (!IsPostmasterEnvironment)
  	{
! 		CreateCheckPoint(false, true);
 
  		/*
  		 * After any checkpoint, close all smgr files.  This is so we won't
--- 913,942 ----
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.
This should be true only for checkpoints
  * caused due to xlog filling, else the warning will be misleading.
+ *
+ * If immediate is true, the checkpoint should be finished ASAP.
+ *
+ * If force is true, force a checkpoint even if no XLOG activity has occurred
+ * since the last one.
  */
! static void
! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 	int			old_failed, old_started;
  
  	/*
  	 * If in a standalone backend, just do it ourselves.
  	 */
  	if (!IsPostmasterEnvironment)
  	{
! 		/*
! 		 * There's no point in doing lazy checkpoints in a standalone
! 		 * backend, because there are no other backends the checkpoint
! 		 * could disrupt.
! 		 */
! 		CreateCheckPoint(false, true, true);
  
  		/*
  		 * After any checkpoint, close all smgr files.  This is so we won't
***************
*** 653,661 ****
  		return;
  	}
  
! 	/* Set warning request flag if appropriate */
  	if (warnontime)
! 		bgs->ckpt_time_warn = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
--- 947,974 ----
  		return;
  	}
  
! 	/*
! 	 * Atomically set the request flags, and take a snapshot of the counters.
! 	 * This ensures that when we see that ckpt_started > old_started,
! 	 * we know the flags we set here have been seen by bgwriter.
! 	 *
! 	 * Note that we effectively OR the flags with any existing flags, to
! 	 * avoid overriding a "stronger" request by another backend.
! 	 */
! 	SpinLockAcquire(&bgs->ckpt_lck);
! 
! 	old_failed = bgs->ckpt_failed;
! 	old_started = bgs->ckpt_started;
! 
! 	/* Set request flags as appropriate */
  	if (warnontime)
! 		bgs->ckpt_rqst_time_warn = true;
! 	if (immediate)
! 		bgs->ckpt_rqst_immediate = true;
! 	if (force)
! 		bgs->ckpt_rqst_force = true;
! 
! 	SpinLockRelease(&bgs->ckpt_lck);
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
***************
*** 674,701 ****
  	 */
  	if (waitforit)
  	{
!
while (bgs->ckpt_started == old_started)
  		{
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
- 		old_started = bgs->ckpt_started;
  
  		/*
! 		 * We are waiting for ckpt_done >= old_started, in a modulo sense.
! 		 * This is a little tricky since we don't know the width or signedness
! 		 * of sig_atomic_t.  We make the lowest common denominator assumption
! 		 * that it is only as wide as "char".  This means that this algorithm
! 		 * will cope correctly as long as we don't sleep for more than 127
! 		 * completed checkpoints.  (If we do, we will get another chance to
! 		 * exit after 128 more checkpoints...)
  		 */
! 		while (((signed char) (bgs->ckpt_done - old_started)) < 0)
  		{
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
! 		if (bgs->ckpt_failed != old_failed)
  			ereport(ERROR,
  					(errmsg("checkpoint request failed"),
  					 errhint("Consult recent messages in the server log for details.")));
--- 987,1031 ----
  	 */
  	if (waitforit)
  	{
! 		int			new_started, new_failed;
! 
! 		/* Wait for a new checkpoint to start. */
! 		for (;;)
  		{
+ 			SpinLockAcquire(&bgs->ckpt_lck);
+ 			new_started = bgs->ckpt_started;
+ 			SpinLockRelease(&bgs->ckpt_lck);
+ 
+ 			if (new_started != old_started)
+ 				break;
+ 
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
  
  		/*
! 		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
! 		 * This algorithm will cope correctly as long as we don't sleep for
! 		 * more than INT_MAX completed checkpoints.  (If we do, we will get
! 		 * another chance to exit after INT_MAX more checkpoints...)
  		 */
! 		for (;;)
  		{
+ 			int			new_done;
+ 
+ 			SpinLockAcquire(&bgs->ckpt_lck);
+ 			new_done = bgs->ckpt_done;
+ 			new_failed = bgs->ckpt_failed;
+ 			SpinLockRelease(&bgs->ckpt_lck);
+ 
+ 			if (new_done - new_started >= 0)
+ 				break;
+ 
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
! 
!
if (new_failed != old_failed) ereport(ERROR, (errmsg("checkpoint request failed"), errhint("Consult recent messages in the server log for details."))); Index: src/backend/storage/buffer/bufmgr.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.220 diff -c -r1.220 bufmgr.c *** src/backend/storage/buffer/bufmgr.c 30 May 2007 20:11:58 -0000 1.220 --- src/backend/storage/buffer/bufmgr.c 20 Jun 2007 12:47:43 -0000 *************** *** 32,38 **** * * BufferSync() -- flush all dirty buffers in the buffer pool. * ! * BgBufferSync() -- flush some dirty buffers in the buffer pool. * * InitBufferPool() -- Init the buffer module. * --- 32,40 ---- * * BufferSync() -- flush all dirty buffers in the buffer pool. * ! * BgAllSweep() -- write out some dirty buffers in the pool. ! * ! * BgLruSweep() -- write out some lru dirty buffers in the pool. * * InitBufferPool() -- Init the buffer module. * *************** *** 74,79 **** --- 76,82 ---- double bgwriter_all_percent = 0.333; int bgwriter_lru_maxpages = 5; int bgwriter_all_maxpages = 5; + int checkpoint_rate = 512; /* in pages/s */ long NDirectFileRead; /* some I/O's are direct file access. bypass *************** *** 645,651 **** * at 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR); buf->flags |= BM_TAG_VALID; buf->usage_count = 1; --- 648,654 ---- * at 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR); buf->flags |= BM_TAG_VALID; buf->usage_count = 1; *************** *** 1000,1037 **** * BufferSync -- Write out all dirty buffers in the pool. * * This is called at checkpoint time to write out all dirty shared buffers. */ void ! BufferSync(void) { ! 
int			buf_id;
  	int			num_to_scan;
  	int			absorb_counter;
  
  	/*
  	 * Find out where to start the circular scan.
  	 */
! 	buf_id = StrategySyncStart();
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * Loop over all buffers.
  	 */
  	num_to_scan = NBuffers;
  	absorb_counter = WRITES_PER_ABSORB;
  	while (num_to_scan-- > 0)
  	{
! 		if (SyncOneBuffer(buf_id, false))
  		{
  			BgWriterStats.m_buf_written_checkpoints++;
  
  			/*
  			 * If in bgwriter, absorb pending fsync requests after each
  			 * WRITES_PER_ABSORB write operations, to prevent overflow of the
  			 * fsync request queue.  If not in bgwriter process, this is a
  			 * no-op.
  			 */
  			if (--absorb_counter <= 0)
  			{
--- 1003,1127 ----
  * BufferSync -- Write out all dirty buffers in the pool.
  *
  * This is called at checkpoint time to write out all dirty shared buffers.
+ * If 'immediate' is true, write them all ASAP, otherwise throttle the
+ * I/O rate according to the checkpoint_rate GUC variable, and perform
+ * normal bgwriter duties periodically.
  */
  void
! BufferSync(bool immediate)
  {
! 	int			buf_id, start_id;
  	int			num_to_scan;
+ 	int			num_to_write;
+ 	int			num_written;
  	int			absorb_counter;
+ 	int			num_written_since_nap;
+ 	int			writes_per_nap;
+ 
+ 	/*
+ 	 * Convert checkpoint_rate (pages/s) to the number of writes to perform
+ 	 * in a period of BgWriterDelay (ms).  The result is an integer, so we
+ 	 * lose some precision here.  There are a lot of other factors as well
+ 	 * that affect the real rate, for example granularity of the OS timer
+ 	 * used for BgWriterDelay, whether any of the writes block, and time
+ 	 * spent in CheckpointWriteDelay performing normal bgwriter duties.
+ 	 */
+ 	writes_per_nap = Max(1, checkpoint_rate * BgWriterDelay / 1000);
  
  	/*
  	 * Find out where to start the circular scan.
  	 */
! 	start_id = StrategySyncStart();
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * Loop over all buffers, and mark the ones that need to be written with
! 	 * BM_CHECKPOINT_NEEDED.
Count them as we go (num_to_write), so that we ! * can estimate how much work needs to be done. ! * ! * This allows us to only write those pages that were dirty when the ! * checkpoint began, and haven't been flushed to disk since. Whenever a ! * page with BM_CHECKPOINT_NEEDED is written out by normal backends or ! * the bgwriter LRU-scan, the flag is cleared, and any pages dirtied after ! * this scan don't have the flag set. ! */ ! num_to_scan = NBuffers; ! num_to_write = 0; ! buf_id = start_id; ! while (num_to_scan-- > 0) ! { ! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id]; ! ! /* ! * Header spinlock is enough to examine BM_DIRTY, see comment in ! * SyncOneBuffer. ! */ ! LockBufHdr(bufHdr); ! ! if (bufHdr->flags & BM_DIRTY) ! { ! bufHdr->flags |= BM_CHECKPOINT_NEEDED; ! num_to_write++; ! } ! ! UnlockBufHdr(bufHdr); ! ! if (++buf_id >= NBuffers) ! buf_id = 0; ! } ! ! elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers); ! ! /* ! * Loop over all buffers again, and write the ones (still) marked with ! * BM_CHECKPOINT_NEEDED. */ num_to_scan = NBuffers; + num_written = num_written_since_nap = 0; absorb_counter = WRITES_PER_ABSORB; + buf_id = start_id; while (num_to_scan-- > 0) { ! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id]; ! bool needs_flush; ! ! /* We don't need to acquire the lock here, because we're ! * only looking at a single bit. It's possible that someone ! * else writes the buffer and clears the flag right after we ! * check, but that doesn't matter. This assumes that no-one ! * clears the flag and sets it again while holding info_lck, ! * expecting no-one to see the intermediary state. ! */ ! needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0; ! ! if (needs_flush && SyncOneBuffer(buf_id, false)) { BgWriterStats.m_buf_written_checkpoints++; + num_written++; + + /* + * Perform normal bgwriter duties and sleep to throttle + * our I/O rate. 
+ */ + if (!immediate && ++num_written_since_nap >= writes_per_nap) + { + num_written_since_nap = 0; + CheckpointWriteDelay((double) (num_written) / num_to_write); + } /* * If in bgwriter, absorb pending fsync requests after each * WRITES_PER_ABSORB write operations, to prevent overflow of the * fsync request queue. If not in bgwriter process, this is a * no-op. + * + * AbsorbFsyncRequests is also called inside CheckpointWriteDelay, + * so this is partially redundant. However, we can't totally trust + * on the call in CheckpointWriteDelay, because it's only made + * before sleeping. In case CheckpointWriteDelay doesn't sleep, + * we need to absorb pending requests ourselves. */ if (--absorb_counter <= 0) { *************** *** 1045,1059 **** } /* ! * BgBufferSync -- Write out some dirty buffers in the pool. * * This is called periodically by the background writer process. */ void ! BgBufferSync(void) { static int buf_id1 = 0; - int buf_id2; int num_to_scan; int num_written; --- 1135,1152 ---- } /* ! * BgAllSweep -- Write out some dirty buffers in the pool. * + * Runs the bgwriter all-sweep algorithm to write dirty buffers to + * minimize work at checkpoint time. * This is called periodically by the background writer process. + * + * XXX: Is this really needed with load distributed checkpoints? */ void ! BgAllSweep(void) { static int buf_id1 = 0; int num_to_scan; int num_written; *************** *** 1063,1072 **** /* * To minimize work at checkpoint time, we want to try to keep all the * buffers clean; this motivates a scan that proceeds sequentially through ! * all buffers. But we are also charged with ensuring that buffers that ! * will be recycled soon are clean when needed; these buffers are the ones ! * just ahead of the StrategySyncStart point. We make a separate scan ! * through those. */ /* --- 1156,1162 ---- /* * To minimize work at checkpoint time, we want to try to keep all the * buffers clean; this motivates a scan that proceeds sequentially through ! 
* all buffers. */ /* *************** *** 1098,1103 **** --- 1188,1210 ---- } BgWriterStats.m_buf_written_all += num_written; } + } + + /* + * BgLruSweep -- Write out some lru dirty buffers in the pool. + */ + void + BgLruSweep(void) + { + int buf_id2; + int num_to_scan; + int num_written; + + /* + * The purpose of this sweep is to ensure that buffers that + * will be recycled soon are clean when needed; these buffers are the ones + * just ahead of the StrategySyncStart point. + */ /* * This loop considers only unpinned buffers close to the clock sweep *************** *** 1341,1349 **** * flushed. */ void ! FlushBufferPool(void) { ! BufferSync(); smgrsync(); } --- 1448,1459 ---- * flushed. */ void ! FlushBufferPool(bool immediate) { ! elog(DEBUG1, "CHECKPOINT: write phase"); ! BufferSync(immediate || CheckPointSmoothing <= 0); ! ! elog(DEBUG1, "CHECKPOINT: sync phase"); smgrsync(); } *************** *** 2132,2138 **** Assert(buf->flags & BM_IO_IN_PROGRESS); buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR); if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED)) ! buf->flags &= ~BM_DIRTY; buf->flags |= set_flag_bits; UnlockBufHdr(buf); --- 2242,2248 ---- Assert(buf->flags & BM_IO_IN_PROGRESS); buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR); if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED)) ! buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED); buf->flags |= set_flag_bits; UnlockBufHdr(buf); Index: src/backend/tcop/utility.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v retrieving revision 1.280 diff -c -r1.280 utility.c *** src/backend/tcop/utility.c 30 May 2007 20:12:01 -0000 1.280 --- src/backend/tcop/utility.c 20 Jun 2007 09:36:31 -0000 *************** *** 1089,1095 **** ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to do CHECKPOINT"))); ! 
RequestCheckpoint(true, false); break; case T_ReindexStmt: --- 1089,1095 ---- ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to do CHECKPOINT"))); ! RequestImmediateCheckpoint(); break; case T_ReindexStmt: Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.396 diff -c -r1.396 guc.c *** src/backend/utils/misc/guc.c 8 Jun 2007 18:23:52 -0000 1.396 --- src/backend/utils/misc/guc.c 20 Jun 2007 10:14:06 -0000 *************** *** 1487,1492 **** --- 1487,1503 ---- 30, 0, INT_MAX, NULL, NULL }, + + { + {"checkpoint_rate", PGC_SIGHUP, WAL_CHECKPOINTS, + gettext_noop("Minimum I/O rate used to write dirty buffers during checkpoints."), + NULL, + GUC_UNIT_BLOCKS + }, + &checkpoint_rate, + 100, 0.0, 100000, NULL, NULL + }, + { {"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS, gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."), *************** *** 1866,1871 **** --- 1877,1891 ---- 0.1, 0.0, 100.0, NULL, NULL }, + { + {"checkpoint_smoothing", PGC_SIGHUP, WAL_CHECKPOINTS, + gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."), + NULL + }, + &CheckPointSmoothing, + 0.3, 0.0, 0.9, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL Index: src/backend/utils/misc/postgresql.conf.sample =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v retrieving revision 1.216 diff -c -r1.216 postgresql.conf.sample *** src/backend/utils/misc/postgresql.conf.sample 3 Jun 2007 17:08:15 -0000 1.216 --- src/backend/utils/misc/postgresql.conf.sample 20 Jun 2007 10:03:17 -0000 *************** *** 168,173 **** --- 168,175 ---- #checkpoint_segments = 3 # in logfile segments, min 1, 16MB 
each #checkpoint_timeout = 5min # range 30s-1h + #checkpoint_smoothing = 0.3 # checkpoint duration, range 0.0 - 0.9 + #checkpoint_rate = 512.0KB # min. checkpoint write rate per second #checkpoint_warning = 30s # 0 is off # - Archiving - Index: src/include/access/xlog.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v retrieving revision 1.78 diff -c -r1.78 xlog.h *** src/include/access/xlog.h 30 May 2007 20:12:02 -0000 1.78 --- src/include/access/xlog.h 19 Jun 2007 14:10:07 -0000 *************** *** 171,179 **** extern void StartupXLOG(void); extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); ! extern void CreateCheckPoint(bool shutdown, bool force); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr GetRedoRecPtr(void); extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch); #endif /* XLOG_H */ --- 171,180 ---- extern void StartupXLOG(void); extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); ! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr GetRedoRecPtr(void); + extern XLogRecPtr GetInsertRecPtr(void); extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch); #endif /* XLOG_H */ Index: src/include/postmaster/bgwriter.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v retrieving revision 1.9 diff -c -r1.9 bgwriter.h *** src/include/postmaster/bgwriter.h 5 Jan 2007 22:19:57 -0000 1.9 --- src/include/postmaster/bgwriter.h 20 Jun 2007 09:27:20 -0000 *************** *** 20,29 **** extern int BgWriterDelay; extern int CheckPointTimeout; extern int CheckPointWarning; extern void BackgroundWriterMain(void); ! 
extern void RequestCheckpoint(bool waitforit, bool warnontime); extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno); extern void AbsorbFsyncRequests(void); --- 20,33 ---- extern int BgWriterDelay; extern int CheckPointTimeout; extern int CheckPointWarning; + extern double CheckPointSmoothing; extern void BackgroundWriterMain(void); ! extern void RequestImmediateCheckpoint(void); ! extern void RequestLazyCheckpoint(void); ! extern void RequestXLogFillCheckpoint(void); ! extern void CheckpointWriteDelay(double progress); extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno); extern void AbsorbFsyncRequests(void); Index: src/include/storage/buf_internals.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v retrieving revision 1.90 diff -c -r1.90 buf_internals.h *** src/include/storage/buf_internals.h 30 May 2007 20:12:03 -0000 1.90 --- src/include/storage/buf_internals.h 12 Jun 2007 11:42:23 -0000 *************** *** 35,40 **** --- 35,41 ---- #define BM_IO_ERROR (1 << 4) /* previous I/O failed */ #define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */ #define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */ + #define BM_CHECKPOINT_NEEDED (1 << 7) /* this needs to be written in checkpoint */ typedef bits16 BufFlags; Index: src/include/storage/bufmgr.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v retrieving revision 1.104 diff -c -r1.104 bufmgr.h *** src/include/storage/bufmgr.h 30 May 2007 20:12:03 -0000 1.104 --- src/include/storage/bufmgr.h 20 Jun 2007 10:28:43 -0000 *************** *** 36,41 **** --- 36,42 ---- extern double bgwriter_all_percent; extern int bgwriter_lru_maxpages; extern int bgwriter_all_maxpages; + extern int checkpoint_rate; /* in buf_init.c */ extern DLLIMPORT char *BufferBlocks; 
*************** *** 136,142 **** extern void ResetBufferUsage(void); extern void AtEOXact_Buffers(bool isCommit); extern void PrintBufferLeakWarning(Buffer buffer); ! extern void FlushBufferPool(void); extern BlockNumber BufferGetBlockNumber(Buffer buffer); extern BlockNumber RelationGetNumberOfBlocks(Relation relation); extern void RelationTruncate(Relation rel, BlockNumber nblocks); --- 137,143 ---- extern void ResetBufferUsage(void); extern void AtEOXact_Buffers(bool isCommit); extern void PrintBufferLeakWarning(Buffer buffer); ! extern void FlushBufferPool(bool immediate); extern BlockNumber BufferGetBlockNumber(Buffer buffer); extern BlockNumber RelationGetNumberOfBlocks(Relation relation); extern void RelationTruncate(Relation rel, BlockNumber nblocks); *************** *** 161,168 **** extern void AbortBufferIO(void); extern void BufmgrCommit(void); ! extern void BufferSync(void); ! extern void BgBufferSync(void); extern void AtProcExit_LocalBuffers(void); --- 162,170 ---- extern void AbortBufferIO(void); extern void BufmgrCommit(void); ! extern void BufferSync(bool immediate); ! extern void BgAllSweep(void); ! extern void BgLruSweep(void); extern void AtProcExit_LocalBuffers(void);
Heikki Linnakangas <heikki@enterprisedb.com> writes: > I added a spinlock to protect the signaling fields between bgwriter and > backends. The current non-locking approach gets really difficult as the > patch adds two new flags, and both are more important than the existing > ckpt_time_warn flag. That may be, but you could minimize the cost and notational ugliness by not using the spinlock where you don't have to. Put the sig_atomic_t fields back the way they were, and many of the uses of the spinlock go away. All you really need it for is the places where the additional flags are set or read. > In fact, I think there's a small race condition in CVS HEAD: Yeah, probably --- the original no-locking design didn't have any side flags. The reason you need the lock is for a backend to be sure that a newly-started checkpoint is using its requested flags. But the detection of checkpoint termination is still the same. Some other comments: I tend to agree with whoever said upthread that the combination of GUC variables proposed here is confusing and ugly. It'd make more sense to have min and max checkpoint rates in KB/s, with the max checkpoint rate only honored when we are predicting we'll finish before the next checkpoint time. The flow of control and responsibility between xlog.c, bgwriter.c and bufmgr.c seems to have reached the breaking point of unintelligibility. Can you think of a refactoring that would simplify it? We might have to resign ourselves to this much messiness, but I'd rather not... regards, tom lane
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> Here's an updated WIP patch for load distributed checkpoints.

> Since last patch, I did some clean up and refactoring, and added a bunch
> of comments, and user documentation.

The only thing I don't understand is the naming of 'checkpoint_smoothing'.
Can users imagine that the unit of 'smoothing' is a fraction?

You explain the parameter with the word 'fraction'.
Why don't you simply name it 'checkpoint_fraction'?

| Specifies the target length of checkpoints, as a fraction of
| the checkpoint interval. The default is 0.3.

Sorry if I'm missing discussions about the naming.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro wrote:
> The only thing I don't understand is the naming of 'checkpoint_smoothing'.
> Can users imagine the unit of 'smoothing' is a fraction?
>
> You explain the parameter with the word 'fraction'.
> Why don't you simply name it 'checkpoint_fraction'?
> | Specifies the target length of checkpoints, as a fraction of
> | the checkpoint interval. The default is 0.3.

I chose checkpoint_smoothing because it tells you what the parameter is
for. If you want more smoothing, tune it up, and if you want less, tune
it down. checkpoint_fraction makes you wonder what you can do with it
and why you would change it.

> Sorry if I'm missing discussions about the naming.

No, I chose _smoothing on my own. People didn't like
checkpoint_write_percent either (me included).

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I added a spinlock to protect the signaling fields between bgwriter and
>> backends. The current non-locking approach gets really difficult as the
>> patch adds two new flags, and both are more important than the existing
>> ckpt_time_warn flag.
>
> That may be, but you could minimize the cost and notational ugliness by
> not using the spinlock where you don't have to. Put the sig_atomic_t
> fields back the way they were, and many of the uses of the spinlock go
> away. All you really need it for is the places where the additional
> flags are set or read.

I find it easier to understand if it's used whenever any of the fields
are accessed. You don't need to read and write ckpt_failed and
ckpt_started/ckpt_done in a specific order anymore, for example.

> Some other comments:
>
> I tend to agree with whoever said upthread that the combination of GUC
> variables proposed here is confusing and ugly. It'd make more sense to
> have min and max checkpoint rates in KB/s, with the max checkpoint rate
> only honored when we are predicting we'll finish before the next
> checkpoint time.

Really? I thought everyone was happy with the current combination, and
that it was just the old trio of parameters controlling the write, nap
and sync phases that was ugly. One particularly nice thing about
defining the duration as a fraction of the checkpoint interval is that
we can come up with a reasonable default value that doesn't depend on
your hardware. How would a min and max rate work?

Anyone else have an opinion on the parameters?

> The flow of control and responsibility between xlog.c, bgwriter.c and
> bufmgr.c seems to have reached the breaking point of unintelligibility.
> Can you think of a refactoring that would simplify it? We might have
> to resign ourselves to this much messiness, but I'd rather not...
The problem we're trying to solve is doing a checkpoint while running
the normal bgwriter activity at the same time. The normal way to do two
things simultaneously is to have two different processes (or threads).
I thought about having a separate checkpoint process, but I gave up on
that thought because the communication needed between backends, bgwriter
and the checkpointer seems like a mess. The checkpointer would need the
pendingOpsTable so that it can do the fsyncs, and it would also need to
receive the forget-messages to that table. We could move that table
entirely to the checkpointer, but since bgwriter is presumably doing
most of the writes, there would be a lot of chatter between bgwriter
and the checkpointer.

The current approach is like co-operative multitasking. BufferSync
yields control to bgwriter every now and then.

The division of labor between xlog.c and other modules is not that bad,
IMO. CreateCheckPoint is the main entry point to create a checkpoint,
and it calls other modules to do their stuff, including bufmgr.c.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> In fact, I think there's a small race condition in CVS HEAD:
>
> Yeah, probably --- the original no-locking design didn't have any side
> flags. The reason you need the lock is for a backend to be sure that
> a newly-started checkpoint is using its requested flags. But the
> detection of checkpoint termination is still the same.

Actually, the race condition I outlined isn't related to the flags.
It's possible because RequestCheckpoint doesn't guarantee that a
checkpoint is performed when there's been no WAL activity since the
last one. I did use a new force-flag to fix it, but I'm sure there are
other ways.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I tend to agree with whoever said upthread that the combination of GUC
>> variables proposed here is confusing and ugly. It'd make more sense to
>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>> only honored when we are predicting we'll finish before the next
>> checkpoint time.

> Really? I thought everyone is happy with the current combination, and
> that it was just the old trio of parameters controlling the write, nap
> and sync phases that was ugly. One particularly nice thing about
> defining the duration as a fraction of checkpoint interval is that we
> can come up with a reasonable default value that doesn't depend on your
> hardware.

That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch. Do we need the min rate
at all? If so, why can't it be in the same units as the max (ie, a
fraction of checkpoint)?

> How would a min and max rate work?

Pretty much the same as the code does now, no? You either delay, or not.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I tend to agree with whoever said upthread that the combination of GUC
>>> variables proposed here is confusing and ugly. It'd make more sense to
>>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>>> only honored when we are predicting we'll finish before the next
>>> checkpoint time.
>
>> Really? I thought everyone is happy with the current combination, and
>> that it was just the old trio of parameters controlling the write, nap
>> and sync phases that was ugly. One particularly nice thing about
>> defining the duration as a fraction of checkpoint interval is that we
>> can come up with a reasonable default value that doesn't depend on your
>> hardware.
>
> That argument would hold some water if you weren't introducing a
> hardware-dependent min rate in the same patch. Do we need the min rate
> at all? If so, why can't it be in the same units as the max (ie, a
> fraction of checkpoint)?
>
>> How would a min and max rate work?
>
> Pretty much the same as the code does now, no? You either delay, or not.

I don't think you understand how the settings work. Did you read the
documentation? If you did, it's apparently not adequate.

The main tuning knob is checkpoint_smoothing, which is defined as a
fraction of the checkpoint interval (both checkpoint_timeout and
checkpoint_segments are taken into account). Normally, the write phase
of a checkpoint takes exactly that much time. So the length of a
checkpoint stays the same regardless of how many dirty buffers there
are (assuming you don't exceed the bandwidth of your hardware), but the
I/O rate varies.

There's another possible strategy: keep the I/O rate constant, but vary
the length of the checkpoint. checkpoint_rate allows you to do that.
I'm envisioning we set the defaults so that checkpoint_smoothing is the
effective control in a relatively busy system, and checkpoint_rate
ensures that we don't unnecessarily prolong checkpoints on an idle
system.

Now how would you replace checkpoint_smoothing with a max I/O rate? The
only real alternative way of defining it is directly as a time and/or
segments, similar to checkpoint_timeout and checkpoint_segments, but
then we'd really need two knobs instead of one. Though maybe we could
just hard-code it to 0.8, for example, and tell people to tune
checkpoint_rate if they want more aggressive checkpoints.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I don't think you understand how the settings work. Did you read the
> documentation? If you did, it's apparently not adequate.

I did read the documentation, and I'm not complaining that I don't
understand it. I'm complaining that I don't like the presented API
because it's self-inconsistent. You've got two parameters that are in
effect upper and lower bounds for the checkpoint write rate, but they
are named inconsistently and not even measured in the same kind of
unit. Nor do I agree that the inconsistency buys any ease of use.

> The main tuning knob is checkpoint_smoothing, which is defined as a
> fraction of the checkpoint interval (both checkpoint_timeout and
> checkpoint_segments are taken into account). Normally, the write phase
> of a checkpoint takes exactly that much time.

So the question is, why in the heck would anyone want the behavior that
"checkpoints take exactly X time"?? The useful part of this whole patch
is to cap the write rate at something that doesn't interfere too much
with foreground queries. I don't see why people wouldn't prefer
"checkpoints can take any amount of time up to the checkpoint interval,
but we do our best not to exceed Y writes/second".

Basically I don't see what useful values checkpoint_smoothing would have
other than 0 and 1. You might as well make it a bool.

> There's another possible strategy: keep the I/O rate constant, but vary
> the length of the checkpoint. checkpoint_rate allows you to do that.

But only from the lower side.

> Now how would you replace checkpoint_smoothing with a max I/O rate?

I don't see why you think that's hard. It looks to me like the
components of the decision are the same numbers in any case: you have to
estimate your progress towards checkpoint completion, your available
time till next checkpoint, and your write rate. Then you either delay
or not.

			regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> The main tuning knob is checkpoint_smoothing, which is defined as a >> fraction of the checkpoint interval (both checkpoint_timeout and >> checkpoint_segments are taken into account). Normally, the write phase >> of a checkpoint takes exactly that much time > > So the question is, why in the heck would anyone want the behavior that > "checkpoints take exactly X time"?? The useful part of this whole patch > is to cap the write rate at something that doesn't interfere too much > with foreground queries. I don't see why people wouldn't prefer > "checkpoints can take any amount of time up to the checkpoint interval, > but we do our best not to exceed Y writes/second". Because it's easier to tune. You don't need to know how much checkpoint I/O you can tolerate. The system will use just enough I/O bandwidth to meet the deadline, but not more than that. > Basically I don't see what useful values checkpoint_smoothing would have > other than 0 and 1. You might as well make it a bool. Well that's one option. It feels like a good thing to be able to control how much headroom you have until the next checkpoint, but maybe we can just hardcode it close to 1. It's also good to avoid spreading the checkpoints unnecessarily, to keep recovery times lower, but you can control that with the min rate setting as well. >> There's another possible strategy: keep the I/O rate constant, but vary >> the length of the checkpoint. checkpoint_rate allows you to do that. > > But only from the lower side. > >> Now how would you replace checkpoint_smoothing with a max I/O rate? > > I don't see why you think that's hard. It looks to me like the > components of the decision are the same numbers in any case: you have to > estimate your progress towards checkpoint completion, your available > time till next checkpoint, and your write rate. Then you either delay > or not. 
Let me put it this way: If you define a min and a max I/O rate, when would the max I/O rate limit take effect? If there are few dirty buffers in the pool, so that writing at the min rate will finish the checkpoint before the next one is due, that's what you'll use. If there are more, you'll need to write fast enough to finish the checkpoint in time, regardless of the max rate. Or would you let the next checkpoint slip and keep writing at the max rate? That seems like a footgun if someone sets it to too low a value. Or are you thinking that we have just one setting: checkpoint_rate? You describe it as a maximum, but I've been thinking of it as a minimum, because you *will* exceed it if the next checkpoint is due soon. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
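The min-vs-max framing above can be made concrete with a small sketch; the helper name and structure are invented for illustration and are not from the patch:

```c
/*
 * Hypothetical sketch (not patch code): choose the checkpoint write rate,
 * in pages/s, from a configured minimum rate and the rate needed to flush
 * the remaining dirty pages before the next checkpoint is due.  The
 * deadline-driven rate always wins when it exceeds the minimum, so a
 * separate maximum could only take effect if checkpoints were allowed to
 * overrun the interval.
 */
static double
checkpoint_write_rate(double min_rate, int pages_left, double secs_left)
{
    double deadline_rate = pages_left / secs_left;  /* rate that just makes the deadline */

    return deadline_rate > min_rate ? deadline_rate : min_rate;
}
```

With 100 dirty pages and 100 s to go, a 64-page/s floor applies; with 100,000 pages left, the deadline forces 1,000 pages/s regardless of any cap below that.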
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> So the question is, why in the heck would anyone want the behavior that >> "checkpoints take exactly X time"?? > Because it's easier to tune. You don't need to know how much checkpoint > I/O you can tolerate. The system will use just enough I/O bandwidth to > meet the deadline, but not more than that. Uh, not really. After consuming more caffeine and reading the patch more carefully, I think there are several problems here: 1. checkpoint_rate is used thusly: writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay); where writes_per_nap is the max number of dirty blocks to write before taking a bgwriter nap. Now surely this is completely backward: if BgWriterDelay is increased, the number of writes to allow per nap decreases? If you think checkpoint_rate is expressed in some kind of physical bytes/sec unit, that cannot be right; the number of blocks per nap has to increase if the naps get longer. (BTW, the patch seems a bit schizoid about whether checkpoint_rate is int or float.) 2. checkpoint_smoothing is used thusly: /* scale progress according to CheckPointSmoothing */ progress *= CheckPointSmoothing; where the progress value being scaled is the fraction so far completed of the total number of dirty pages we expect to have to write. This is then compared against estimates of the total fraction of the time-between-checkpoints that has elapsed; if less, we are behind schedule and should not nap, if more, we are ahead of schedule and may nap. (This is a bit odd, but I guess it's all right because it's equivalent to dividing the elapsed-time estimate by CheckPointSmoothing, which seems a more natural way of thinking about what needs to happen.) What's bugging me about this is that we are either going to be writing at checkpoint_rate if ahead of schedule, or max possible rate if behind; that's not "smoothing" to me, that's introducing some pretty bursty behavior.
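The comparison described in point 2 can be condensed into a sketch (hypothetical names; the patch's actual variables differ): scaling progress by the smoothing fraction means the writer tries to be finished once that fraction of the checkpoint interval has elapsed, and below that line it writes flat out.

```c
/*
 * Hypothetical condensation of the schedule check described above.
 * "progress" is the fraction of the expected dirty pages already
 * written; "elapsed" is the estimated fraction of the checkpoint
 * interval gone by.  Multiplying progress by the smoothing fraction is
 * equivalent to dividing elapsed by it: with smoothing = 0.3 the writer
 * is on schedule only if it will be done when 30% of the interval has
 * passed.
 */
static int
ahead_of_schedule(double progress, double elapsed, double smoothing)
{
    return progress * smoothing >= elapsed;   /* may nap if true */
}
```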
ISTM that actual "smoothing" would involve adjusting writes_per_nap up or down according to whether we are ahead or behind schedule, so as to have a finer degree of control over the I/O rate. (I'd also consider saving the last writes_per_nap value across checkpoints so as to have a more nearly accurate starting value next time.) In any case I still concur with Takahiro-san that "smoothing" doesn't seem the most appropriate name for the parameter. Something along the lines of "checkpoint_completion_target" would convey more about what it does, I think. And checkpoint_rate really needs to be named checkpoint_min_rate, if it's going to be a minimum. However, I question whether we need it at all, because as the code stands, with the default BgWriterDelay you would have to increase checkpoint_rate to 4x its proposed default before writes_per_nap moves off its minimum of 1. This says to me that the system's tested behavior has been so insensitive to checkpoint_rate that we probably need not expose such a parameter at all; just hardwire the minimum writes_per_nap at 1. regards, tom lane
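The incremental adjustment suggested above might look something like this (entirely hypothetical; the +/-1 step size is invented):

```c
/*
 * Hypothetical sketch of "actual smoothing": nudge writes_per_nap up
 * when behind schedule and down when ahead, rather than flipping
 * between the floor rate and writing flat out.  The step size and the
 * floor of 1 are invented for illustration; the resulting value could
 * also be carried over between checkpoints as a starting estimate, as
 * suggested above.
 */
static int
adjust_writes_per_nap(int writes_per_nap, double progress, double elapsed)
{
    if (progress < elapsed)
        writes_per_nap++;                 /* behind schedule: write more per nap */
    else if (writes_per_nap > 1)
        writes_per_nap--;                 /* ahead: back off, but keep a floor of 1 */

    return writes_per_nap;
}
```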
Tom Lane wrote: > 1. checkpoint_rate is used thusly: > > writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay); > > where writes_per_nap is the max number of dirty blocks to write before > taking a bgwriter nap. Now surely this is completely backward: if > BgWriterDelay is increased, the number of writes to allow per nap > decreases? If you think checkpoint_rate is expressed in some kind of > physical bytes/sec unit, that cannot be right; the number of blocks > per nap has to increase if the naps get longer. Uh, you're right, that's just plain wrong. checkpoint_rate is in pages/s, so that line should be writes_per_nap = Max(1, checkpoint_rate * BgWriterDelay / 1000) > (BTW, the patch seems > a bit schizoid about whether checkpoint_rate is int or float.) Yeah, I've gone back and forth on the data type. I wanted it to be a float, but guc code doesn't let you specify a float in KB, so I switched it to int. > 2. checkpoint_smoothing is used thusly: > > /* scale progress according to CheckPointSmoothing */ > progress *= CheckPointSmoothing; > > where the progress value being scaled is the fraction so far completed > of the total number of dirty pages we expect to have to write. This > is then compared against estimates of the total fraction of the time- > between-checkpoints that has elapsed; if less, we are behind schedule > and should not nap, if more, we are ahead of schedule and may nap. > (This is a bit odd, but I guess it's all right because it's equivalent > to dividing the elapsed-time estimate by CheckPointSmoothing, which > seems a more natural way of thinking about what needs to happen.) Yeah, it's a bit unnatural. I did it that way so we don't have to divide all three of the estimates: cached_elapsed, progress_in_time and progress_in_xlog. Maybe it's not worth micro-optimizing that.
> What's bugging me about this is that we are either going to be writing > at checkpoint_rate if ahead of schedule, or max possible rate if behind; > that's not "smoothing" to me, that's introducing some pretty bursty > behavior. ISTM that actual "smoothing" would involve adjusting > writes_per_nap up or down according to whether we are ahead or behind > schedule, so as to have a finer degree of control over the I/O rate. > (I'd also consider saving the last writes_per_nap value across > checkpoints so as to have a more nearly accurate starting value next > time.) That sounds a lot more complex. The burstiness at that level shouldn't matter much. The OS is still going to cache the writes, and should even out the bursts. Assuming time/xlogs elapse at a steady rate, we will write some multiple of writes_per_nap pages between each sleep. With a small writes_per_nap, writing just writes_per_nap isn't enough to catch up after we fall behind, so we'll write more than that between each sleep. That means that on each iteration, we'll write either N*writes_per_nap, or (N+1)*writes_per_nap. At worst, that means either writes_per_nap or 2*writes_per_nap pages on each iteration. That's not too bad, I think. > In any case I still concur with Takahiro-san that "smoothing" doesn't > seem the most appropriate name for the parameter. Something along the > lines of "checkpoint_completion_target" would convey more about what it > does, I think. Ok, I'm not wedded to smoothing. > And checkpoint_rate really needs to be named checkpoint_min_rate, if > it's going to be a minimum. However, I question whether we need it at > all, because as the code stands, with the default BgWriterDelay you > would have to increase checkpoint_rate to 4x its proposed default before > writes_per_nap moves off its minimum of 1. Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, using the non-broken formula above, we get: (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. 
So I think that's fine. (looking at the patch, I see that the default in guc.c is actually 100 pages / s, contrary to documentation; needs to be fixed) > This says to me that the > system's tested behavior has been so insensitive to checkpoint_rate > that we probably need not expose such a parameter at all; just hardwire > the minimum writes_per_nap at 1. I've set checkpoint_rate to a small value in my tests on purpose to control the feature with the other parameter. That must be why I haven't noticed the bogus calculation of it before. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
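The conversion in Heikki's arithmetic can be written out as code; the helper is hypothetical, but the constants (8192-byte blocks, a KB/s rate, a millisecond delay) follow the discussion:

```c
#define BLCKSZ 8192     /* default PostgreSQL block size in bytes */

/*
 * Hypothetical helper restating the arithmetic above: convert a
 * checkpoint_min_rate given in KB/s into whole pages written per
 * bgwriter nap of delay_ms milliseconds.  512 KB/s is 64 pages/s,
 * i.e. 12.8 pages per 200 ms nap, truncated to 12.  Note that short
 * enough naps truncate to 0 pages, which is why the real code needs
 * to clamp the result to a minimum of 1.
 */
static int
writes_per_nap(int rate_kb_per_s, int delay_ms)
{
    return rate_kb_per_s * 1024 / BLCKSZ * delay_ms / 1000;
}
```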
Tom Lane wrote: > And checkpoint_rate really needs to be named checkpoint_min_rate, if > it's going to be a minimum. However, I question whether we need it at > all, because as the code stands, with the default BgWriterDelay you > would have to increase checkpoint_rate to 4x its proposed default before > writes_per_nap moves off its minimum of 1. This says to me that the > system's tested behavior has been so insensitive to checkpoint_rate > that we probably need not expose such a parameter at all; just hardwire > the minimum writes_per_nap at 1. I agree on starting with as simple a GUC as possible and expanding it as the need arises in the field. 99% of people are never going to test this value as much as you are. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> (BTW, the patch seems >> a bit schizoid about whether checkpoint_rate is int or float.) > Yeah, I've gone back and forth on the data type. I wanted it to be a > float, but guc code doesn't let you specify a float in KB, so I switched > it to int. I seriously question trying to claim that it's blocks at all, seeing that the *actual* units are pages per unit time. Pretending that it's a memory unit does more to obscure the truth than document it. >> What's bugging me about this is that we are either going to be writing >> at checkpoint_rate if ahead of schedule, or max possible rate if behind; >> that's not "smoothing" to me, that's introducing some pretty bursty >> behavior. > That sounds a lot more complex. The burstiness at that level shouldn't > matter much. The OS is still going to cache the writes, and should even > out the bursts. With the changes you're proposing here, the burstiness would be quite severe. OTOH, if writes_per_nap is always 1, then bufmgr is going to recheck the delay situation after every page, so what you have actually tested is as granular as it could get. >> And checkpoint_rate really needs to be named checkpoint_min_rate, if >> it's going to be a minimum. However, I question whether we need it at >> all, > Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, > using the non-broken formula above, we get: > (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. > So I think that's fine. "Fine?" That's 12x the value you have actually tested. That's enough of a change to invalidate all your performance testing IMHO. I still think you've not demonstrated a need to expose this parameter. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Tom Lane wrote: >>> (BTW, the patch seems >>> a bit schizoid about whether checkpoint_rate is int or float.) > >> Yeah, I've gone back and forth on the data type. I wanted it to be a >> float, but guc code doesn't let you specify a float in KB, so I switched >> it to int. > > I seriously question trying to claim that it's blocks at all, seeing > that the *actual* units are pages per unit time. Pretending that it's > a memory unit does more to obscure the truth than document it. Hmm. What I wanted to avoid is that the I/O rate you get then depends on your bgwriter_delay, so if you change that you need to change checkpoint_min_rate as well. Now we already have that issue with bgwriter_all/lru_maxpages, and I don't like it there either. If you think it's better to let the user define it directly as pages/bgwriter_delay, fine. >>> And checkpoint_rate really needs to be named checkpoint_min_rate, if >>> it's going to be a minimum. However, I question whether we need it at >>> all, > >> Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, >> using the non-broken formula above, we get: > >> (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. > >> So I think that's fine. > > "Fine?" That's 12x the value you have actually tested. That's enough > of a change to invalidate all your performance testing IMHO. I'll reschedule the tests to be sure, after we settle on how we want to control this feature. > I still think you've not demonstrated a need to expose this parameter. Greg Smith wanted to explicitly control the I/O rate, and let the checkpoint duration vary. I personally think that fixing the checkpoint duration is better because it's easier to tune. But if we only do that, you might end up with ridiculously long checkpoints when there aren't many dirty pages.
If we want to avoid that, we need some way of telling what's a safe minimum rate to write at, because that can vary greatly depending on your hardware. But maybe we don't care about prolonging checkpoints, and don't really need any GUCs at all. We could then just hardcode writes_per_nap to some low value, and target duration close to 1.0. You would have a checkpoint running practically all the time, and you would use checkpoint_timeout/checkpoint_segments to control how long it takes. I'm a bit worried about jumping to such a completely different regime, though. For example, pg_start_backup needs to create a new checkpoint, so it would need to wait on average 1.5 * checkpoint_timeout/segments, and recovery would need to process on average 1.5 times as much WAL as before. Though with LDC, you should get away with shorter checkpoint intervals than before, because the checkpoints aren't as invasive. If we do that, we should remove the bgwriter_all_* settings. They wouldn't do much, because we would have a checkpoint running all the time, writing out dirty pages. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> I still think you've not demonstrated a need to expose this parameter. > Greg Smith wanted to explicitly control the I/O rate, and let the > checkpoint duration vary. I personally think that fixing the checkpoint > duration is better because it's easier to tune. > But if we only do that, you might end up with ridiculously long > checkpoints when there's not many dirty pages. If we want to avoid that, > we need some way of telling what's a safe minimum rate to write at, > because that can vary greatly depending on your hardware. As long as the minimum rate is at least 1/bgdelay, I don't think this is a big problem. > But maybe we don't care about prolonging checkpoints, and don't really > need any GUCs at all. We could then just hardcode writes_per_nap to some > low value, and target duration close to 1.0. You would have a checkpoint > running practically all the time, and you would use > checkpoint_timeout/checkpoint_segments to control how long it takes. I'm > a bit worried about jumping to such a completely different regime, > though. For example, pg_start_backup needs to create a new checkpoint, > so it would need to wait on average 1.5 * checkpoint_timeout/segments, Maybe I misread the patch, but I thought that if someone requested an immediate checkpoint, the checkpoint-in-progress would effectively flip to immediate mode. So that could be handled by offering an immediate vs extended checkpoint option in pg_start_backup. I'm not sure it's a problem though, since as previously noted you probably want pg_start_backup to be noninvasive. Also, one could do a manual CHECKPOINT command then immediately pg_start_backup if one wanted as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?) > and recovery would need to process on average 1.5 as much WAL as before. 
> Though with LDC, you should get away with shorter checkpoint intervals > than before, because the checkpoints aren't as invasive. No, you still want a pretty long checkpoint interval, because of the increase in WAL traffic due to more page images being dumped when the interval is short. > If we do that, we should remove bgwriter_all_* settings. They wouldn't > do much because we would have checkpoint running all the time, writing > out dirty pages. Yeah, I'm not sure that we've thought through the interactions with the existing bgwriter behavior. regards, tom lane
Tom Lane wrote: > Maybe I misread the patch, but I thought that if someone requested an > immediate checkpoint, the checkpoint-in-progress would effectively flip > to immediate mode. So that could be handled by offering an immediate vs > extended checkpoint option in pg_start_backup. I'm not sure it's a > problem though, since as previously noted you probably want > pg_start_backup to be noninvasive. Also, one could do a manual > CHECKPOINT command then immediately pg_start_backup if one wanted > as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?) Yeah, that's possible. >> and recovery would need to process on average 1.5 as much WAL as before. >> Though with LDC, you should get away with shorter checkpoint intervals >> than before, because the checkpoints aren't as invasive. > > No, you still want a pretty long checkpoint interval, because of the > increase in WAL traffic due to more page images being dumped when the > interval is short. > >> If we do that, we should remove bgwriter_all_* settings. They wouldn't >> do much because we would have checkpoint running all the time, writing >> out dirty pages. > > Yeah, I'm not sure that we've thought through the interactions with the > existing bgwriter behavior. I searched the archives a bit for the discussions when the current bgwriter settings were born, and found this thread: http://archives.postgresql.org/pgsql-hackers/2004-12/msg00784.php The idea of Load Distributed Checkpoints certainly isn't new :). Ok, if we approach this from the idea that there will be *no* GUC variables at all to control this, and we remove the bgwriter_all_* settings as well, does anyone see a reason why that would be bad? Here are the ones mentioned so far: 1. we need to keep 2x as many WAL segments around as before. 2. pg_start_backup will need to wait for a long time. 3. Recovery will take longer, because the last committed redo ptr will lag further behind. 1. and 3.
can be alleviated by using a smaller checkpoint_timeout/segments, though, as you pointed out, that leads to higher WAL traffic. 2. is not a big deal, and we can add an 'immediate' parameter to pg_start_backup if necessary. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 22 Jun 2007, Tom Lane wrote: > Yeah, I'm not sure that we've thought through the interactions with the > existing bgwriter behavior. The entire background writer mess needs a rewrite, and the best way to handle that is going to shift considerably with LDC applied. As the person who was complaining about corner cases I'm not in a position to talk more explicitly about, I can at least summarize my opinion of how I feel everyone should be thinking about this patch and you can take what you want from that. In the context of getting an 8.3 release finalized, I think you should be building a LDC implementation that accomplishes the main goal of spreading the checkpoints out, which is clearly working, and erring on the side of more knobs in cases where it's unclear whether they're needed. It's clear what my position is on which non-obvious knobs I think are important for some people. Which is worse: putting in a tuning setting that it's discovered everybody just uses the same value for, or discovering after release that there's a use-case for that setting and by hard-coding it you've made the DBAs for a class of applications you didn't think about miserable? In the cases where there's good evidence so far of the right setting, just make that the default, and the only harm is GUC bloat. Nothing should be done that changes the existing behavior if the LDC feature is turned off, so anything more obtrusive to the background writer is right out. Make reducing the knobs, optimizing the default behavior, and rewriting the background writer to better fit into its new context a major goal of 8.4. I know I've got a whole notebook full of stuff on that topic I've been ignoring so as not to distract you guys from getting 8.3 done. That's the low-risk plan, and the design/beta/release period here is short enough that I think going too experimental beyond that is a bad idea.
To pick an example, when I read this idea from Heikki: > You would have a checkpoint running practically all the time, and you > would use checkpoint_timeout/checkpoint_segments to control how long it > takes... If we do that, we should remove bgwriter_all_* settings Whether or not I think this is an awesome idea, the very idea of a change that big at this point gives me the willies. Just off the top of my head, there's a whole class of issues involving recycling xlog segments this would introduce I would be really unhappy with the implications of. Did anyone else ever notice that when a new xlog segment is created, the write to clear it out doesn't happen via direct I/O like the rest of the xlog writes do? That write goes through the regular buffered path instead. The checkpoint logging patch I submitted logs when this happens specifically because that particular issue messed with some operations and I found it important to be aware of. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Ok, if we approach this from the idea that there will be *no* GUC > variables at all to control this, and we remove the bgwriter_all_* > settings as well, does anyone see a reason why that would be bad? Here's > the ones mentioned this far: > 1. we need to keep 2x as much WAL segments around as before. No, only 50% more. We've always kept WAL back to the checkpoint-before-last, so the peak usage has been about 2 checkpoint distances (plus a bit), and now it'd tend to be about 3. > 2. pg_start_backup will need to wait for a long time. > 3. Recovery will take longer, because the distance last committed redo > ptr will lag behind more. True, you'd have to replay 1.5 checkpoint intervals on average instead of 0.5 (more or less, assuming checkpoints had been short). I don't think we're in the business of optimizing crash recovery time though. It might be that getting rid of checkpoint_smoothing is an overreaction, and that we should keep that parameter. IIRC Greg had worried about being able to turn this behavior off, so we'd still need at least a bool, and we might as well expose the fraction instead. (Still don't like the name "smoothing" though.) I agree with removing the non-LRU part of the bgwriter's write logic though; that should simplify matters a bit and cut down the overall I/O volume. regards, tom lane
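Tom's 50%-more estimate can be restated as a formula; the helper and the trailing in-flight segment are assumptions for illustration, with the checkpoint target fraction standing in for checkpoint_smoothing:

```c
/*
 * Hypothetical restatement of the WAL-retention estimate above: WAL is
 * kept back to the checkpoint-before-last, so peak usage is roughly two
 * checkpoint distances, plus the fraction of a distance consumed by the
 * spread-out in-progress checkpoint (plus one segment in flight, an
 * assumption here).  With a completion target near 1.0 this approaches
 * three distances -- 50% more than before, not 2x.
 */
static int
peak_wal_segments(int checkpoint_segments, double completion_target)
{
    return (int) ((2.0 + completion_target) * checkpoint_segments) + 1;
}
```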
Greg Smith <gsmith@gregsmith.com> writes: > As the person who was complaining about corner cases I'm not in a position > to talk more explicitly about, I can at least summarize my opinion of how > I feel everyone should be thinking about this patch and you can take what > you want from that. Sorry, but we're not going to make decisions on the basis of evidence you won't show us. Right at the moment the best thing to do seems to be to enable LDC with a low minimum write rate and a high target duration, and remove the thereby-obsoleted "all buffers" scan of the existing bgwriter logic. If you've got specific evidence why any of these things need to be parameterized, let's see it. Personally I think that we have a bad track record of exposing GUC variables as a substitute for understanding performance issues at the start, and this approach isn't doing any favors for DBAs. > Whether or not I think this is an awesome idea, the very idea of a change > that big at this point gives me the willies. Just off the top of my head, > there's a whole class of issues involving recycling xlog segments this > would introduce I would be really unhappy with the implications of. Really? Name one. AFAICS this should not affect xlog recycling in the slightest (other than that we have to bump up the target number of live segments from 2x to 3x the checkpoint distance). > Did anyone else ever notice that when a new xlog segment is created, > the write to clear it out doesn't happen via direct I/O like the rest > of the xlog writes do? It's not supposed to matter, because that path isn't supposed to be taken often. regards, tom lane
On Fri, 22 Jun 2007, Tom Lane wrote: > Greg had worried about being able to turn this behavior off, so we'd > still need at least a bool, and we might as well expose the fraction > instead. I agree with removing the non-LRU part of the bgwriter's write > logic though If you accept that being able to turn LDC off is important, people who aren't turning it on may still want the existing bgwriter_all_* settings so they can keep running things the way they are now. It's certainly reasonable to skip that code path when doing things the LDC way. > True, you'd have to replay 1.5 checkpoint intervals on average instead > of 0.5 (more or less, assuming checkpoints had been short). I don't > think we're in the business of optimizing crash recovery time though. If you're not, I think you should be. Keeping that replay interval time down was one of the reasons why the people I was working with were displeased with the implications of the very spread out style of some LDC tunings. They were already unhappy with the implied recovery time of how high they had to set checkpoint_segments for good performance, and making it that much bigger aggravates the issue. Given a knob where the LDC can be spread out a bit but not across the entire interval, that makes it easier to control how much expansion there is relative to the current behavior. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: >> True, you'd have to replay 1.5 checkpoint intervals on average instead >> of 0.5 (more or less, assuming checkpoints had been short). I don't >> think we're in the business of optimizing crash recovery time though. > > If you're not, I think you should be. Keeping that replay interval time > down was one of the reasons why the people I was working with were > displeased with the implications of the very spread out style of some > LDC tunings. They were already unhappy with the implied recovery time > of how high they had to set checkpoint_segments for good performance, > and making it that much bigger aggravates the issue. Given a knob where > the LDC can be spread out a bit but not across the entire interval, that > makes it easier to control how much expansion there is relative to the > current behavior. I agree on that one: we *should* optimize crash recovery time. It may not be the most important thing on earth, but it's a significant consideration for some systems. However, I think shortening the checkpoint interval is a perfectly valid solution to that. It does lead to more full page writes, but in 8.3 more full page writes can actually make the recovery go faster, not slower, because we no longer read in the previous contents of the page when we restore it from a full page image. In any case, while people sometimes complain that we have a large WAL footprint, it's not usually a problem. This is off-topic, but at PGCon in May, Itagaki-san and his colleagues whose names I can't remember, pointed out to me very clearly that our recovery is *slow*. So slow, that in the benchmarks they were running, their warm stand-by slave couldn't keep up with the master generating the WAL, even though both were running on the same kind of hardware. The reason is simple: There can be tens of backends doing I/O and generating WAL, but in recovery we serialize them.
If you have decent I/O hardware that could handle for example 10 concurrent random I/Os, at recovery we'll be issuing them one at a time. That's a scalability issue, and doesn't show up on a laptop or a small server with a single disk. That's one of the first things I'm planning to tackle when the 8.4 dev cycle opens. And I'm planning to look at recovery times in general; I've never even measured it before so who knows what comes up. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Greg Smith wrote: > On Fri, 22 Jun 2007, Tom Lane wrote: > >> Greg had worried about being able to turn this behavior off, so we'd >> still need at least a bool, and we might as well expose the fraction >> instead. I agree with removing the non-LRU part of the bgwriter's >> write logic though > > If you accept that being able to turn LDC off is important, people who > aren't turning it on may still want the existing bgwriter_all_* settings > so they can keep running things the way they are now. It's certainly > reasonable to skip that code path when doing things the LDC way. Well, if you don't want anything to change, don't upgrade. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
This message is going to come off as kind of angry, and I hope you don't take that personally. I'm very frustrated with this whole area right now but am unable to do anything to improve that situation. On Fri, 22 Jun 2007, Tom Lane wrote: > If you've got specific evidence why any of these things need to be > parameterized, let's see it. All I'm trying to suggest here is that you might want to pause and consider whether you want to make a change that might break existing, happily working installations based just on the small number of tests that have been done on this patch so far. A nice stack of DBT2 results is very informative, but the DBT2 workload is not everybody's workload. Did you see anybody else predicting issues with the LDC patch on overloaded systems as are starting to be seen in the 150 warehouse/90% latency figures in Heikki's most recent results? The way I remember it, it was just me pushing to expose that problem, because I knew it was there from my unfortunately private tests, but it was difficult to encounter the issue on other types of benchmarks (thanks again to Greg Stark and Heikki for helping with that). But that's fine; if you want to blow off the rest of my suggestions now just because the other things I'm worried about are also very hard problems to expose and I can't hand you over a smoking gun, that's your decision. > Personally I think that we have a bad track record of exposing GUC > variables as a substitute for understanding performance issues at the > start, and this approach isn't doing any favors for DBAs. I think this project has an awful track record of introducing new GUC variables and never having a plan to follow through with a process to figure out how they should be set. The almost complete lack of standardization and useful tools for collecting performance information about this database boggles my mind, and you're never going to get the performance related sections of the GUC streamlined without it.
We were just talking about the mess that is effective_cache_size recently. As a more topical example here, the background writer was officially released in early 2005, with a bizarre collection of tunables. I had to help hack on that code myself, over two years later, to even start exposing the internal statistics data needed to optimize it correctly. The main reason I can't prove some of my concerns is that I got so side-tracked adding the infrastructure needed to even show they exist that I wasn't able to nail down exactly what was going on well enough to generate a public test case before the project that exposed the issues wrapped up. > Right at the moment the best thing to do seems to be to enable LDC with > a low minimum write rate and a high target duration, and remove the > thereby-obsoleted "all buffers" scan of the existing bgwriter logic. I have reason to believe there's a set of use cases where a more accelerated LDC approach than everyone seems to be leaning toward is appropriate, which would then reinvigorate the need for the all-scan BGW component. I have a whole new design for the non-LRU background writer that fixes most of what's wrong with it, which I'm waiting for 8.4 to pass out and get feedback on, but if everybody is hell-bent on just yanking the whole thing out in preference to these really lazy checkpoints, go ahead and do what you want. My life would be easier if I just tossed all that out and forgot about the whole thing, and I'm real close to doing just that right now. >> Did anyone else ever notice that when a new xlog segment is created, >> the write to clear it out doesn't happen via direct I/O like the rest >> of the xlog writes do? > It's not supposed to matter, because that path isn't supposed to be > taken often. Yes, but during the situation it does happen in--when checkpoints take so much longer than expected that more segments have to be created, or in an archive logger failure--it badly impacts an already unpleasant situation. 
>> there's a whole class of issues involving recycling xlog segments this >> would introduce I would be really unhappy with the implications of. > Really? Name one. You already mentioned expansion of the log segments used which is a primary issue. Acting like all the additional segments used for some of the more extreme checkpoint spreading approaches are without cost is completely unrealistic IMHO. In the situation I just described above, I also noticed the way O_DIRECT sync writes get mixed with buffered WAL writes seems to cause some weird I/O scheduling issues in Linux that can make worst-case latency degrade. But since I can't prove that, I guess I might as well not even mention that either. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote: > However, I think shortening the checkpoint interval is a perfectly valid > solution to that. Agreed. That's what checkpoint_timeout is for. Greg can't choose to use checkpoint_segments as the limit and then complain about unbounded recovery time, because that was clearly a conscious choice. > In any case, while people > sometimes complain that we have a large WAL footprint, it's not usually > a problem. IMHO it's a huge problem. Turning on full_page_writes means that the amount of WAL generated varies fairly linearly with the number of blocks touched, which means large databases become a problem. Suzuki-san's team had results that showed this was a problem also. > This is off-topic, but at PGCon in May, Itagaki-san and his colleagues > whose names I can't remember, pointed out to me very clearly that our > recovery is *slow*. So slow, that in the benchmarks they were running, > their warm stand-by slave couldn't keep up with the master generating > the WAL, even though both are running on the same kind of hardware. > The reason is simple: There can be tens of backends doing I/O and > generating WAL, but in recovery we serialize them. If you have decent > I/O hardware that could handle for example 10 concurrent random I/Os, at > recovery we'll be issuing them one at a time. That's a scalability > issue, and doesn't show up on a laptop or a small server with a single disk. The results showed that the current recovery isn't scalable beyond a certain point, not that it was slow per se. The effect isn't noticeable on systems generating fewer writes (of any size) or ones where the cache hit ratio is high. It isn't accurate to say laptops and small servers only, but it would be accurate to say high-volume, I/O-bound OLTP is the place where the scalability of recovery does not match the performance scalability of the master server. Yes, we need to make recovery more scalable by de-serializing I/O. 
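A back-of-the-envelope model (my illustration, not anything from the patch or the real recovery code) of why serializing replay caps recovery throughput: if the master keeps, say, 10 random I/Os in flight across backends while recovery issues them strictly one at a time, replay needs roughly 10x the wall-clock time for the same amount of I/O.

```c
#include <stdbool.h>

/* Toy model only: constant per-I/O latency and perfect overlap on the
 * master.  The function names and numbers are illustrative. */

/* recovery: one outstanding read at a time */
static double serial_io_ms(int n_ios, double latency_ms)
{
    return n_ios * latency_ms;
}

/* master: up to 'depth' concurrent random I/Os in flight */
static double overlapped_io_ms(int n_ios, double latency_ms, int depth)
{
    return ((n_ios + depth - 1) / depth) * latency_ms;
}
```

With 1000 random reads at 5 ms each, the serial path takes 5 seconds while 10-way overlap takes roughly 0.5 seconds, which is the gap Heikki and Simon describe between the master and the standby.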
Slony also serializes changes onto the Slave nodes, so the general problem of scalability of recovery solutions needs to be tackled. > That's one of the first things I'm planning to tackle when the 8.4 dev > cycle opens. And I'm planning to look at recovery times in general; I've > never even measured it before so who knows what comes up. I'm planning on working on recovery also, as is Florian. Let's make sure we coordinate what we do to avoid patch conflicts. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote: > >> However, I think shortening the checkpoint interval is a perfectly valid >> solution to that. > > Agreed. That's what checkpoint_timeout is for. Greg can't choose to use > checkpoint_segments as the limit and then complain about unbounded > recovery time, because that was clearly a conscious choice. No, by checkpoint interval, I meant both checkpoint_timeout and checkpoint_segments. You can decrease checkpoint_segments as well to shorten recovery time. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2007-06-22 at 16:21 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > 3. Recovery will take longer, because the distance last committed redo > > ptr will lag behind more. > > True, you'd have to replay 1.5 checkpoint intervals on average instead > of 0.5 (more or less, assuming checkpoints had been short). I don't > think we're in the business of optimizing crash recovery time though. Agreed, though only for crash recovery. Most archive recoveries will not change noticeably - half a checkpoint isn't very long in PITR terms. > It might be that getting rid of checkpoint_smoothing is an overreaction, > and that we should keep that parameter. IIRC Greg had worried about > being able to turn this behavior off, so we'd still need at least a > bool, and we might as well expose the fraction instead. (Still don't > like the name "smoothing" though.) ISTM we don't need the checkpoint_smoothing parameter. I can't see why anyone would want to turn off smoothing: If they are doing many writes, then they will be affected by the sharp dive at checkpoint, which happens *every* checkpoint. The increase in crash recovery time is an easily acceptable trade-off for what is gained; if not, they can always make checkpoint_timeout smaller. If they aren't doing many writes they don't care what the setting of checkpoint_smoothing is and recovery will be very fast anyway. > I agree with removing the non-LRU > part of the bgwriter's write logic though; that should simplify matters > a bit and cut down the overall I/O volume. Yes, agreed. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Fri, 2007-06-22 at 16:57 -0400, Greg Smith wrote: > If you're not, I think you should be. Keeping that replay interval > time down was one of the reasons why the people I was working with > were displeased with the implications of the very spread out style of > some LDC tunings. They were already unhappy with the implied recovery > time of how high they had to set checkpoint_segments for good > performance, and making it that much bigger aggravates the issue. > Given a knob where the LDC can be spread out a bit but not across the > entire interval, that makes it easier to control how much expansion > there is relative to the current behavior. We won't need to set checkpoint_segments so high, since performance is smoothed across checkpoints by LDC and it's OK to allow them more frequently. So this concern need not apply with LDC. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Sun, 24 Jun 2007, Simon Riggs wrote: > I can't see why anyone would want to turn off smoothing: If they are > doing many writes, then they will be affected by the sharp dive at > checkpoint, which happens *every* checkpoint. There are service-level agreement situations where a short and sharp disruption is more acceptable than a more prolonged one. As some of the overloaded I/O tests are starting to show, the LDC may be a backward step for someone in that sort of environment. I am not a fan of introducing a replacement feature based on what I consider too limited testing, and I don't feel this one has been beaten on long enough yet to start pruning features that would allow better backward compatibility/transitioning. I think that's introducing an unnecessary risk to the design. > We won't need to set checkpoint_segments so high, since performance is > smoothed across checkpoints by LDC and it's OK to allow them more > frequently. So this concern need not apply with LDC. Performance *should* be smoothed across checkpoints by LDC and my concern *may* not apply. I think assuming it will always help based on the limited number of test results presented so far is extrapolation. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes: > I am not a fan of introducing a replacement feature based on what I > consider too limited testing, and I don't feel this one has been beat on > long yet enough to start pruning features that would allow better backward > compatibility/transitioning. I think that's introducing an unnecessary > risk to the design. While it's certainly true that we could do with wider testing of this patch, I'm not sure why you hold such strong allegiance to the status quo. We know that the status quo isn't working very well. And if you think that the current code had enormous amounts of testing before it went in, I've got to disillusion you :-( regards, tom lane
On Sun, 24 Jun 2007, Simon Riggs wrote: > Greg can't choose to use checkpoint_segments as the limit and then > complain about unbounded recovery time, because that was clearly a > conscious choice. I'm complaining only because everyone seems content to wander in a direction where the multiplier on checkpoint_segments for how many segments are actually active at once will go up considerably, which can make a known problem (recovery time) worse. I'm thinking about things like how the release notes for 8.3 are going to address this. Is the plan to say something like "whatever you set checkpoint_segments to before so you were happy with the crash recovery time, make sure to re-run all those tests again because we made a big change there, it may take a lot longer now with the same value, and too bad if you don't like it because there's no way to get the old behavior back. Suck it up, decrease checkpoint_segments, and get used to having more checkpoints if you have to keep the same recovery time"? -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mon, 25 Jun 2007, Tom Lane wrote: > I'm not sure why you hold such strong allegiance to the status quo. We > know that the status quo isn't working very well. Don't get me wrong here; I am a big fan of this patch, think it's an important step forward, and it's exactly the fact that I'm so shell-shocked from abuse by the status quo that I'm still mixed up in this mess (I really should be ignoring the lot of you and writing new code instead). LDC certainly makes things better in almost every case. My "allegiance" comes from having seen a class of transactions where LDC made things worse on a fast/overloaded system, in that it made some types of service guarantees harder to meet, and I just don't know who else might run into problems in that area. I'm worried that if it's not adjustable, you're introducing a risk that you'll take a step backward for some of this code's users, and that will be hard to undo given the way releases are structured here. I spent some time trading stocks for a living. There are sometimes situations you can get into there where there is a tiny chance that something very bad can happen with a trade, and many people get wiped out by such things. If it's possible in that situation to remove that risk with something inexpensive, you do it, even though the net expected value of the change might be slightly negative. This seems like such a situation to me. If it's possible to take away the risk of other people running into an unforeseen problem with the LDC patch just by keeping a knob that's already there, unless that's an expensive operation my opinion is that you should pick a good default but not remove it yet. > And if you think that the current code had enormous amounts of testing > before it went in, I've got to disillusion you :-( It's having been on the painful receiving end of that fact that makes me so paranoid now :) -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > LDC certainly makes things better in almost every case. My "allegiance" > comes from having seen a class of transactions where LDC made things > worse on a fast/overloaded system, in that it made some types of service > guarantees harder to meet, and I just don't know who else might run into > problems in that area. Please describe the class of transactions and the service guarantees so that we can reproduce that, and figure out what's the best solution. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote: > On Sun, 24 Jun 2007, Simon Riggs wrote: > > > Greg can't choose to use checkpoint_segments as the limit and then > > complain about unbounded recovery time, because that was clearly a > > conscious choice. > > I'm complaining I apologise for using that emotive phrase. > only because everyone seems content to wander in a > direction where the multiplier on checkpoint_segments for how many > segments are actually active at once will go up considerably, which can > make a known problem (recovery time) worse. +50% more. Recovery time is a consideration that can be adjusted for. We have done nothing to make recovery rate worse; the additional WAL leads to an increased recovery time *only* if you keep the same parameter settings. There is no reason to keep them the same, nor do we promise that parameters will keep the exact meaning they had in previous releases. As you say, we can put comments in the release notes to advise people of a 50% increase in recovery time if the parameters stay the same. That would be balanced by the comment that checkpoints are now considerably smoother than before and more frequent checkpoints are unlikely to be a problem, so it is OK to reduce the parameters from the settings you used in previous releases. I don't see any reason why we would want unsmooth checkpoints; similarly we don't want unsmooth bg writing, unsmooth archiving etc.. Performance will increase, response times will drop and people will be happy. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote: > On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote: > > On Sun, 24 Jun 2007, Simon Riggs wrote: > > > > > Greg can't choose to use checkpoint_segments as the limit and then > > > complain about unbounded recovery time, because that was clearly a > > > conscious choice. > > > > I'm complaining > > I apologise for using that emotive phrase. > > > only because everyone seems content to wander in a > > direction where the multiplier on checkpoint_segments for how many > > segments are actually active at once will go up considerably, which can > > make a known problem (recovery time) worse. > > +50% more. Recovery time is a consideration that can be adjusted for. We > have done nothing to make recovery rate worse; the additional WAL leads > to an increased recovery time *only* if you keep the same parameter > settings. There is no reason to keep them the same, nor do we promise > that parameters will keep the exact meaning they had previous releases. > > As you say, we can put comments in the release notes to advise people of > 50% increase in recovery time if the parameters stay the same. That > would be balanced by the comment that checkpoints are now considerably > smoother than before and more frequent checkpoints are unlikely to be a > problem, so it is OK to reduce the parameters from the settings you used > in previous releases. Didn't we already add other features that make recovery much *faster* than before? In that case, are they fast enough to neutralise this increased time (a guesstimate, of course)? Or did I mess that up with stuff we added for 8.2? :-) //Magnus
Magnus Hagander wrote: > On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote: >> As you say, we can put comments in the release notes to advise people of >> 50% increase in recovery time if the parameters stay the same. That >> would be balanced by the comment that checkpoints are now considerably >> smoother than before and more frequent checkpoints are unlikely to be a >> problem, so it is OK to reduce the parameters from the settings you used >> in previous releases. > > Didn't we already add other features that make recovery much *faster* than > before? Yes, we no longer read in a page just to overwrite it with a full page image. > In that case, are they fast enough to neutralise this increased > time (a guesstimate, of course) It depends a lot on your workload. Under the right circumstances, it will more than offset any increase caused by LDC, but sometimes it won't make any difference at all (for example with full_page_writes=off). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-06-25 at 12:56 +0200, Magnus Hagander wrote: > Didn't we already add other featuers that makes recovery much *faster* than > before? In that case, are they faster enugh to neutralise this increased > time (a guestimate, of course) > > Or did I mess that up with stuff we added for 8.2? :-) Only with full_page_writes on, but yes you're right, the predicted recovery times are changing for 8.3 anyway. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > I agree with removing the non-LRU > part of the bgwriter's write logic though; that should simplify matters > a bit and cut down the overall I/O volume. On further thought, there is one workload where removing the non-LRU part would be counterproductive: If you have a system with a very bursty transaction rate, it's possible that when it's time for a checkpoint, there hasn't been any WAL-logged activity since the last checkpoint, so we skip it. When that happens, the buffer cache might still be full of dirty pages, because of hint bit updates. That still isn't a problem on its own, but if you then do a huge batch update, you have to flush those dirty pages at that point. It would be better to trickle flush those dirty pages during the idle period. So we might still want to keep the non-LRU scan. Or alternatively, when the checkpoint is a no-op, we call BufferSync nevertheless. That's effectively the same thing, except that BufferSync would be controlled by the logic to estimate the deadline until next checkpoint, instead of the current bgwriter_all_*-settings. Greg, is this the kind of workload you're having, or is there some other scenario you're worried about? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > On further thought, there is one workload where removing the non-LRU > part would be counterproductive: > If you have a system with a very bursty transaction rate, it's possible > that when it's time for a checkpoint, there hasn't been any WAL logged > activity since last checkpoint, so we skip it. When that happens, the > buffer cache might still be full of dirty pages, because of hint bit > updates. That still isn't a problem on its own, but if you then do a > huge batch update, you have to flush those dirty pages at that point. It > would be better to trickle flush those dirty pages during the idle period. But wouldn't the LRU-based scan accomplish that? regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> On further thought, there is one workload where removing the non-LRU >> part would be counterproductive: > >> If you have a system with a very bursty transaction rate, it's possible >> that when it's time for a checkpoint, there hasn't been any WAL logged >> activity since last checkpoint, so we skip it. When that happens, the >> buffer cache might still be full of dirty pages, because of hint bit >> updates. That still isn't a problem on it's own, but if you then do a >> huge batch update, you have to flush those dirty pages at that point. It >> would be better to trickle flush those dirty pages during the idle period. > > But wouldn't the LRU-based scan accomplish that? It only scans bgwriter_lru_percent buffers ahead of the clock hand. If the hand isn't moving, it keeps scanning the same buffers over and over again. You can crank it all the way up to 100%, though, in which case it would work, but that starts to get expensive CPU-wise. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Heikki Linnakangas <heikki@enterprisedb.com> writes: >>> If you have a system with a very bursty transaction rate, it's possible >>> that when it's time for a checkpoint, there hasn't been any WAL logged >>> activity since last checkpoint, so we skip it. When that happens, the >>> buffer cache might still be full of dirty pages, because of hint bit >>> updates. That still isn't a problem on it's own, but if you then do a >>> huge batch update, you have to flush those dirty pages at that point. It >>> would be better to trickle flush those dirty pages during the idle period. >> >> But wouldn't the LRU-based scan accomplish that? > It only scans bgwriter_lru_percent buffers ahead of the clock hand. If > the hand isn't moving, it keeps scanning the same buffers over and over > again. You can crank it all the way up to 100%, though, in which case it > would work, but that starts to get expensive CPU-wise. Hmm. But if we're going to do that, we might as well have a checkpoint for our troubles, no? The reason for the current design is the assumption that a bgwriter_all scan is less burdensome than a checkpoint, but that is no longer true given this rewrite. AFAICS all the bgwriter_all scan will accomplish is induce extra I/O in most scenarios. And it's certainly extra code in an area that's already been rendered pretty darn baroque by this patch. Also, the design assumption here is that the bgwriter_lru scan is supposed to keep enough buffers clean to satisfy the system's need for new buffers. If it's not able to do so, it's not apparent to me that the bgwriter_all scan helps --- the latter is most likely flushing the wrong buffers, ie, those not close in front of the clock sweep. You'd be well advised to increase lru_percent anyway to keep more buffers clean, if this scenario is worth optimizing rather than others for your usage. regards, tom lane
On Mon, 25 Jun 2007, Heikki Linnakangas wrote: > Greg, is this the kind of workload you're having, or is there some other > scenario you're worried about? The way transitions between completely idle and all-out bursts happen was one problematic area I struggled with. Since the LRU point doesn't move during the idle parts, and the lingering buffers have a usage_count>0, the LRU scan won't touch them; the only way to clear out a bunch of dirty buffers left over from the last burst is with the all-scan. Ideally, you want those to write during idle periods so you're completely clean when the next burst comes. My plan for the code I wanted to put into 8.4 one day was to have something like the current all-scan that defers to the LRU and checkpoint, such that if neither of them are doing anything it would go searching for buffers it might blow out. Because the all-scan mainly gets in the way under heavy load right now I've only found mild settings helpful, but if it had a bit more information about what else was going on it could run much harder during slow spots. That's sort of the next stage to the auto-tuning LRU writer code in the grand design floating through my head. As a general comment on this subject, a lot of the work in LDC presumes you have an accurate notion of how close the next checkpoint is. On systems that can dirty buffers and write WAL really fast, I've found hyper-bursty workloads are a challenge for it to cope with. You can go from thinking you have all sorts of time to stream the data out to discovering the next checkpoint is coming up fast in only seconds. In that situation, you'd have been better off had you been writing faster during the period preceding the burst when the code thought it should be "smooth"[1]. That falls into the category of things I haven't found a good way for other people to test (I happened to have an internal bursty app that aggravated this area to use). 
[1] This is actually a reference to "Yacht Rock", one of my favorite web sites: http://www.channel101.com/shows/show.php?show_id=152 -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mon, 25 Jun 2007, Heikki Linnakangas wrote: > It only scans bgwriter_lru_percent buffers ahead of the clock hand. If the > hand isn't moving, it keeps scanning the same buffers over and over again. > You can crank it all the way up to 100%, though, in which case it would work, > but that starts to get expensive CPU-wise. In addition to being a CPU pig, that still won't necessarily work because the way the LRU writer ignores things with a non-zero usage_count. If it's dirty, it's probably been used recently as well. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
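The point Greg makes here can be shown with a toy model (the struct and function names below are mine, not the real bufmgr code): the LRU writer only writes buffers that are both dirty and immediately replaceable, so cranking bgwriter_lru_percent to 100 still skips exactly the buffers a burst leaves behind, which are dirty *and* recently used.

```c
#include <stdbool.h>

typedef struct
{
    bool dirty;
    int  usage_count;   /* bumped on access, decayed by the clock sweep */
} ToyBuffer;

/* Hedged model of the LRU writer's eligibility test: dirty AND
 * immediately replaceable (usage_count == 0).  Burst leftovers fail
 * this no matter how far ahead of the hand the scan looks. */
static bool lru_writer_would_write(const ToyBuffer *buf)
{
    return buf->dirty && buf->usage_count == 0;
}

/* The all-scan applies no usage_count filter: anything dirty gets written. */
static bool all_scan_would_write(const ToyBuffer *buf)
{
    return buf->dirty;
}
```

A buffer dirtied and touched during a burst (`dirty = true`, `usage_count = 1`) is invisible to the LRU writer but eligible for the all-scan, which is why removing the all-scan leaves no mechanism to trickle those pages out during idle periods.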
On Mon, 25 Jun 2007, Heikki Linnakangas wrote: > Please describe the class of transactions and the service guarantees so > that we can reproduce that, and figure out what's the best solution. I'm confident you're already moving in that direction by noticing how the 90th percentile numbers were kind of weird with your 150 warehouse DBT2 tests, and I already mentioned how that could be usefully fleshed out by more tests during beta. That number is the kind of service guarantee I'm talking about--if before 90% of transactions were <4.5ms, but now that number is closer to 6ms, that could be considered worse performance by some service metrics even if the average and worst-case performance were improved. The only thing I can think of if you wanted to make the problem more like what I was seeing would be switching the transaction mix on that around to do more UPDATEs relative to the other types of transactions; having more of those seemed to aggravate my LDC-related issues because they leave a different pattern of dirty+used buffers around than other operations. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes: > The way transitions between completely idle and all-out bursts happen were > one problematic area I struggled with. Since the LRU point doesn't move > during the idle parts, and the lingering buffers have a usage_count>0, the > LRU scan won't touch them; the only way to clear out a bunch of dirty > buffers leftover from the last burst is with the all-scan. One thing that might be worth changing is that right now, BgBufferSync starts over from the current clock-sweep point on each call --- that is, each bgwriter cycle. So it can't really be made to write very many buffers without excessive CPU work. Maybe we should redefine it to have some static state carried across bgwriter cycles, such that it would write at most N dirty buffers per call, but scan through X percent of the buffers, possibly across several calls, before returning to the (by now probably advanced) clock-sweep point. This would allow a larger value of X to be used than is currently practical. You might wish to recheck the clock sweep point on each iteration just to make sure the scan hasn't fallen behind it, but otherwise I don't see any downside. The scenario where somebody re-dirties a buffer that was cleaned by the bgwriter scan isn't a problem, because that buffer will also have had its usage_count increased and thereby not be a candidate for replacement. > As a general comment on this subject, a lot of the work in LDC presumes > you have an accurate notion of how close the next checkpoint is. Yeah; this is one reason I was interested in carrying some write-speed state across checkpoints instead of having the calculation start from scratch each time. That wouldn't help systems that sit idle a long time and suddenly go nuts, but it seems to me that smoothing the write rate across more than one checkpoint could help if the fluctuations occur over a timescale of a few checkpoints. regards, tom lane
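To make Tom's suggestion concrete, here is a rough sketch (my own illustration, with made-up names and a toy buffer pool, not patch code) of a scan that writes at most N dirty buffers per call but keeps its position in static state, so one sweep over X percent of the pool can span several bgwriter cycles before restarting from the clock-sweep point:

```c
#include <stdbool.h>

#define TOY_NBUFFERS 1000

/* static state carried across bgwriter cycles, per Tom's suggestion */
static int scan_pos  = 0;   /* next buffer to examine */
static int remaining = 0;   /* buffers left in the current sweep */

/* Each call writes at most max_writes dirty buffers.  A sweep covering
 * scan_percent of the pool may take several calls to finish, after which
 * the next call restarts from the (by then probably advanced) hand. */
int bgwriter_cycle(bool dirty[], int clock_hand,
                   int scan_percent, int max_writes)
{
    int nwritten = 0;

    if (remaining <= 0)
    {
        /* previous sweep finished: start over at the current hand */
        scan_pos  = clock_hand;
        remaining = TOY_NBUFFERS * scan_percent / 100;
    }
    while (remaining > 0 && nwritten < max_writes)
    {
        if (dirty[scan_pos])
        {
            dirty[scan_pos] = false;   /* "write" the buffer */
            nwritten++;
        }
        scan_pos = (scan_pos + 1) % TOY_NBUFFERS;
        remaining--;
    }
    return nwritten;
}
```

This is the shape of the trade-off Tom describes: a large scan fraction becomes affordable because each individual cycle still does a bounded amount of work.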
Tom Lane wrote: > Hmm. But if we're going to do that, we might as well have a checkpoint > for our troubles, no? The reason for the current design is the > assumption that a bgwriter_all scan is less burdensome than a > checkpoint, but that is no longer true given this rewrite. Per the comments in CreateCheckPoint, another reason to skip the extra checkpoints is to avoid writing two checkpoints to the same page, risking losing both on a crash: > * If this isn't a shutdown or forced checkpoint, and we have not inserted > * any XLOG records since the start of the last checkpoint, skip the > * checkpoint. The idea here is to avoid inserting duplicate checkpoints > * when the system is idle. That wastes log space, and more importantly it > * exposes us to possible loss of both current and previous checkpoint > * records if the machine crashes just as we're writing the update. > * (Perhaps it'd make even more sense to checkpoint only when the previous > * checkpoint record is in a different xlog page?) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Greg Smith <gsmith@gregsmith.com> writes: >> The way transitions between completely idle and all-out bursts happen were >> one problematic area I struggled with. Since the LRU point doesn't move >> during the idle parts, and the lingering buffers have a usage_count>0, the >> LRU scan won't touch them; the only way to clear out a bunch of dirty >> buffers leftover from the last burst is with the all-scan. > > One thing that might be worth changing is that right now, BgBufferSync > starts over from the current clock-sweep point on each call --- that is, > each bgwriter cycle. So it can't really be made to write very many > buffers without excessive CPU work. Maybe we should redefine it to have > some static state carried across bgwriter cycles, such that it would > write at most N dirty buffers per call, but scan through X percent of > the buffers, possibly across several calls, before returning to the (by > now probably advanced) clock-sweep point. This would allow a larger > value of X to be used than is currently practical. You might wish to > recheck the clock sweep point on each iteration just to make sure the > scan hasn't fallen behind it, but otherwise I don't see any downside. > The scenario where somebody re-dirties a buffer that was cleaned by the > bgwriter scan isn't a problem, because that buffer will also have had its > usage_count increased and thereby not be a candidate for replacement. Something along those lines could be useful. I've thought of that before, but it never occurred to me that if a page in front of the clock hand is re-dirtied, it's no longer a candidate for replacement anyway... I'm going to leave the all- and lru- bgwriter scans alone for now to get this LDC patch finished. We still have the bgwriter autotuning patch in the queue. Let's think about this in the context of that patch. 
>> As a general comment on this subject, a lot of the work in LDC presumes >> you have an accurate notion of how close the next checkpoint is. > > Yeah; this is one reason I was interested in carrying some write-speed > state across checkpoints instead of having the calculation start from > scratch each time. That wouldn't help systems that sit idle a long time > and suddenly go nuts, but it seems to me that smoothing the write rate > across more than one checkpoint could help if the fluctuations occur > over a timescale of a few checkpoints.

Hmm. This problem only applies to checkpoints triggered by checkpoint_segments; time tends to move forward at a constant rate. I'm not worried about small fluctuations or bursts. As I argued earlier, the OS still buffers the writes and should give some extra smoothing of the physical writes. I believe bursts of, say, 50-100 MB would easily fit in the OS cache, as long as there's enough of a gap between them. I haven't tested that, though.

Here's a proposal for an algorithm to smooth bigger bursts. The basic design is the same as before: we keep track of elapsed time and elapsed xlogs, and based on them we estimate how much progress in flushing the buffers we should have made by now, and then we catch up until we reach that. The estimate for time is the same. The estimate for xlogs gets more complicated. Let's have a few definitions first:

Ro = elapsed segments / elapsed time, from the previous checkpoint cycle. For example, 1.25 means that the checkpoint was triggered by checkpoint_segments, and we had spent 1/1.25 = 80% of checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean that the checkpoint was triggered by checkpoint_timeout, and we had spent 25% of checkpoint_segments by then.

Rn = elapsed segments / elapsed time, so far in the current in-progress checkpoint.
t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though it could be > 1 if the next checkpoint is already due)

s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 - 1.0, though again it could be > 1 if the next checkpoint is already due)

R = estimate for WAL segment consumption rate, as checkpoint_segments / checkpoint_timeout:

R = Ro * (1 - t) + Rn * t

In other words, at the beginning of the checkpoint we give more weight to the state carried over from the previous checkpoint. As we go forward, more weight is given to the rate calculated from the current cycle. From R, we extrapolate how much progress we should have made by now:

Target progress = R * t

This would require saving just one number from the previous cycle (its final Rn, which becomes the next cycle's Ro), and there is no requirement to call the estimation function at steady time intervals, for example. It gives a pretty steady I/O rate even if there are big bursts in WAL activity, but still reacts to changes in the rate.

I'm not convinced this is worth the effort, though. First of all, this is only a problem if you use checkpoint_segments to control your checkpoints, so that you can lower checkpoint_timeout to do more work during the idle periods. Secondly, with the optimization of not flushing buffers during checkpoint that were dirtied after the start of the checkpoint, the LRU sweep will also contribute to flushing the buffers and finishing the checkpoint. We don't count those towards the progress made at the moment, but we probably should. Lastly, distributing the writes even a little bit is going to be smoother than the current behavior anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
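The estimator above can be sketched in C (hypothetical function names; Ro would simply be the final Rn saved from the previous checkpoint cycle):

```c
#include <assert.h>

/*
 * Blend the previous cycle's WAL consumption rate (Ro) with the rate
 * observed so far in the current checkpoint (Rn).  Early in the
 * checkpoint (small t) the carried-over Ro dominates; as t grows,
 * the current cycle's Rn takes over.  Rates are expressed as
 * checkpoint_segments per checkpoint_timeout.
 */
double
estimate_rate(double Ro, double Rn, double t)
{
    return Ro * (1.0 - t) + Rn * t;
}

/* Fraction of the checkpoint work we should have completed by time t. */
double
target_progress(double Ro, double Rn, double t)
{
    return estimate_rate(Ro, Rn, t) * t;
}
```

Because the estimate is a pure function of the saved Ro and the current t and Rn, it can be evaluated at arbitrary moments — there is no need to call it at steady intervals.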
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Hmm. But if we're going to do that, we might as well have a checkpoint >> for our troubles, no? The reason for the current design is the >> assumption that a bgwriter_all scan is less burdensome than a >> checkpoint, but that is no longer true given this rewrite. > Per the comments in CreateCheckPoint, another reason to skip the extra > checkpoints is to avoid writing two checkpoints to the same page, which > risks losing both on a crash: That's not the reason for the current design --- that logic existed long before bgwriter did. Anyway, if there are no XLOG records since the last checkpoint, there's probably nothing in shared buffers that needs flushing. There might be some dirty hint-bits, but the only reason to push those out is to make some free buffers available, and doing that is not checkpoint's job (nor the all-buffers scan's job); that's what the LRU scan is for. The only reason that the all-buffers scan was put in was to try to make the next checkpoint cheaper. That reason falls to the ground with LDC. What we have left is merely excess I/O. regards, tom lane
Tom Lane wrote: > Anyway, if there are no XLOG records since the last checkpoint, there's > probably nothing in shared buffers that needs flushing. There might be > some dirty hint-bits, but the only reason to push those out is to make > some free buffers available, and doing that is not checkpoint's job (nor > the all-buffers scan's job); that's what the LRU scan is for. Yeah, except the LRU scan is not doing a very good job at that. It will ignore buffers with usage_count > 0, and it only scans bgwriter_lru_percent buffers ahead of the clock hand. One pathological case is a COPY of a table slightly smaller than shared_buffers. That will fill the buffer cache. If you then have a checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer cache will be full of pages with just hint-bit updates, but no WAL activity since the last checkpoint. But let's fix the LRU scan, rather than work around its deficiencies. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> ... that's what the LRU scan is for. > Yeah, except the LRU scan is not doing a very good job at that. It will > ignore buffers with usage_count > 0, and it only scans > bgwriter_lru_percent buffers ahead of the clock hand. Which is exactly in sync with which buffers are actually candidates for replacement. It could be looking further ahead of the clock hand, as per my previous suggestion, but there is no need to push out buffers with usage_count > 0 until after the sweep has decremented that to 0. Doing so would tend to create excess I/O --- it makes it more likely that multiple page-dirtying events on a page will result in separate writes instead of a single write. > One pathological case is a COPY of a table slightly smaller than > shared_buffers. That will fill the buffer cache. If you then have a > checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer > cache will be full of pages with just hint-bit-updates, but no WAL > activity since last checkpoint. This argument supposes that the bgwriter will do nothing while the COPY is proceeding. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> One pathological case is a COPY of a table slightly smaller than >> shared_buffers. That will fill the buffer cache. If you then have a >> checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer >> cache will be full of pages with just hint-bit-updates, but no WAL >> activity since last checkpoint. > > This argument supposes that the bgwriter will do nothing while the COPY > is proceeding.

It will clean buffers ahead of the COPY, but it won't write the buffers COPY leaves behind, since they have usage_count=1. Let me demonstrate this with an imaginary example with shared_buffers=4.

Initially:

buf_id  usage_count  dirty
1       0            f
2       0            f
3       0            f
4       0            f

After COPY:

buf_id  usage_count  dirty
1       1            t
2       1            t
3       1            t
4       1            t

After CHECKPOINT:

buf_id  usage_count  dirty
1       1            f
2       1            f
3       1            f
4       1            f

After VACUUM:

buf_id  usage_count  dirty
1       1            t
2       1            t
3       1            t
4       1            t

As soon as a backend asks for a buffer, the situation is defused: the backend will do a full clock sweep, decrementing the usage_count of each buffer to 0 and letting the bgwriter LRU scan clean them. Having the buffer cache full of dirty buffers is not a problem on its own; it only becomes a performance issue if you then issue another large COPY etc. that needs those buffers, so that you have to write them out at a busy time. This is a corner case that might not be worth worrying about. It's also mitigated by the fact that the OS cache is most likely clean after a period of idle time, and should be able to absorb the write burst. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
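The defusing step — a backend's clock sweep decrementing usage counts until a victim appears — can be sketched like this (a toy model, not the real StrategyGetBuffer):

```c
#include <assert.h>

#define NBUFFERS 4

typedef struct { int usage_count; int dirty; } Buffer;

Buffer pool[NBUFFERS];

/*
 * Toy clock sweep: decrement usage_count until we find a buffer with
 * usage_count == 0, which becomes the replacement victim.  With all
 * four buffers at usage_count = 1 (as after the VACUUM above), the
 * sweep passes over each buffer once before victimizing buffer 0 --
 * and only then do the rest become candidates for the bgwriter's
 * LRU scan.
 */
int
clock_sweep(void)
{
    int hand = 0;
    for (;;)
    {
        if (pool[hand].usage_count == 0)
            return hand;            /* victim found */
        pool[hand].usage_count--;   /* give it a second chance */
        hand = (hand + 1) % NBUFFERS;
    }
}
```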
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> This argument supposes that the bgwriter will do nothing while the COPY >> is proceeding. > It will clean buffers ahead of the COPY, but it won't write the buffers > COPY leaves behind since they have usage_count=1. Yeah, and they don't *need* to be written until the clock sweep has passed over them once. I'm not impressed with the idea of writing buffers because we might need them someday; that just costs extra I/O due to re-dirtying in too many scenarios. (Note that COPY per se will not trigger this behavior anyway, since it will act in a limited number of buffers because of the recent buffer access strategy patch.) regards, tom lane
Tom Lane wrote: > (Note that COPY per se will not trigger this behavior anyway, since it > will act in a limited number of buffers because of the recent buffer > access strategy patch.) Actually we dropped it from COPY, because it didn't seem to improve performance in the tests we ran. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> (Note that COPY per se will not trigger this behavior anyway, since it >> will act in a limited number of buffers because of the recent buffer >> access strategy patch.) > Actually we dropped it from COPY, because it didn't seem to improve > performance in the tests we ran. Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any other large heapscan. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Tom Lane wrote: >>> (Note that COPY per se will not trigger this behavior anyway, since it >>> will act in a limited number of buffers because of the recent buffer >>> access strategy patch.) > >> Actually we dropped it from COPY, because it didn't seem to improve >> performance in the tests we ran. > > Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any > other large heapscan. Umm, I'm talking about populating a table with COPY *FROM*. That's not a heap scan at all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any >> other large heapscan. > Umm, I'm talking about populating a table with COPY *FROM*. That's not a > heap scan at all. No wonder we're failing to communicate. I assumed you were talking about copy-to-file. Copy-from-file is going to be creating WAL entries hence the no-checkpoint case doesn't apply anyway, no? [ thinks ... ] Oh, you must be positing the case where the recently added skip-WAL-if-table-is-new-in-this-transaction optimization applies. Well, that thing could probably do with some more work anyway (I wonder why it's using shared buffers at all anymore). I don't really think that case should be allowed to drive our thinking about how the bgwriter should work. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Tom Lane wrote: >>> Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any >>> other large heapscan. > >> Umm, I'm talking about populating a table with COPY *FROM*. That's not a >> heap scan at all. > > No wonder we're failing to communicate. I assumed you were talking > about copy-to-file. Copy-from-file is going to be creating WAL entries > hence the no-checkpoint case doesn't apply anyway, no?

We are indeed having trouble communicating :(. No, I'm not talking about the new non-WAL-logged COPY optimization. COPY FROM *would* create WAL entries, and the next checkpoint would flush the dirty buffers. So far so good. But then you run VACUUM, as you often do after loading a table, to set all the hint bits. That will *not* generate WAL, and the next checkpoint is skipped.

To recap, the sequence is:

1. COPY FROM
2. checkpoint
3. VACUUM

Now you have a buffer cache full of dirty buffers with usage_count=1, and no WAL activity since the last checkpoint. They will not be flushed until:

a) WAL activity happens and the next checkpoint comes
b) the database is shut down, or a manual CHECKPOINT is issued
c) the clock hand advances and decrements the usage_counts

It's a corner case. Probably not a problem in practice; you usually run CREATE INDEX after loading a table, for example. But it exists. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > To recap, the sequence is: > 1. COPY FROM > 2. checkpoint > 3. VACUUM > Now you have buffer cache full of dirty buffers with usage_count=1, Well, it won't be very full, because VACUUM works in a limited number of buffers (and did even before the BufferAccessStrategy patch). I have no doubt that there are scenarios such as you are thinking about, but it definitely seems like a corner case that doesn't justify keeping the all-buffers scan. That scan is costing us extra I/O in ordinary non-corner cases, so it's not free to keep it. regards, tom lane
On Mon, 25 Jun 2007, Tom Lane wrote: > right now, BgBufferSync starts over from the current clock-sweep point > on each call --- that is, each bgwriter cycle. So it can't really be > made to write very many buffers without excessive CPU work. Maybe we > should redefine it to have some static state carried across bgwriter > cycles The LRU portion restarts like that, and it's certainly not optimal. But the auto-tuning LRU patch that's already near application makes this much less of an issue, because it only does real work when buffers have been allocated, so the sweep point will have moved along. I'll add this idea to my list of things that would be nice to have as part of a larger rewrite; I think it's more trouble than it's worth to chase right now. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Tue, 26 Jun 2007, Tom Lane wrote: > I have no doubt that there are scenarios such as you are thinking about, > but it definitely seems like a corner case that doesn't justify keeping > the all-buffers scan. That scan is costing us extra I/O in ordinary > non-corner cases, so it's not free to keep it. And scenarios I'm concerned about but can't diagram as easily fall into this category as well. I agree that an LDC-enabled config would ship with the all-buffers scan turned off as redundant and wasteful, in which case the only cost of keeping it is code baggage. But the fact that there are corner cases floating around this area is what makes me feel that removing it altogether is still a bit premature. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Tue, 26 Jun 2007, Tom Lane wrote: > I'm not impressed with the idea of writing buffers because we might need > them someday; that just costs extra I/O due to re-dirtying in too many > scenarios.

This is kind of an interesting statement to me, because it really highlights the difference between how I think about this problem and how you see it. As far as I'm concerned, there's a hierarchy of I/O the database needs to finish that goes like this:

1) Client back-end writes (blocked until a buffer appears)
2) LRU writes so (1) doesn't happen
3) Checkpoint writes
4) Dirty pages with a non-zero usage count

In my view of the world, there should be one parameter for a target rate of how much I/O you can stand under normal use, and the background writer should work its way as far down this chain as it can until it meets that. If there are plenty of clean buffers for the expected new allocations and there's no checkpoint going on, by all means write out some buffers we might re-dirty, if there's I/O to spare. If you write them twice, so what? You didn't even get to that point as an option until all the important stuff was taken care of and the system was near idle.

The elimination of the all-scan background writer means that true hot and dirty spots in the buffer cache, like popular index blocks on a heavily updated table that never get to a zero usage_count, are never going to be written out other than as part of the checkpoint process. That's OK for now, but I'd like it to be the case that one day the database's I/O scheduling would eventually get to those, in order to optimize performance in the kind of bursty scenarios I've been mentioning lately. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
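The hierarchy could be expressed as a simple budgeted scheduler (purely illustrative; the class names and page budget are stand-ins, not an actual bgwriter design):

```c
#include <assert.h>

typedef enum {
    IO_BACKEND_WRITE,   /* 1) client back-end writes */
    IO_LRU_WRITE,       /* 2) LRU writes so (1) doesn't happen */
    IO_CHECKPOINT,      /* 3) checkpoint writes */
    IO_SPECULATIVE,     /* 4) dirty pages with a non-zero usage count */
    IO_NCLASSES
} IoClass;

/*
 * Spend a per-cycle I/O budget (in pages) on the most important work
 * first; lower-priority classes get only the leftover budget, so
 * speculative writes of might-be-needed pages happen only when the
 * system is near idle.
 */
int
schedule_io(int budget, const int pending[IO_NCLASSES],
            int done[IO_NCLASSES])
{
    int total = 0;

    for (int c = 0; c < IO_NCLASSES; c++)
    {
        done[c] = pending[c] < budget ? pending[c] : budget;
        budget -= done[c];
        total += done[c];
    }
    return total;
}
```

With a busy system the budget is exhausted by the top classes; as the system idles, the leftover flows down to the speculative writes Greg describes.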
"Greg Smith" <gsmith@gregsmith.com> writes: > If you write them twice, so what? You didn't even get to that point as an > option until all the important stuff was taken care of and the system was > near idle. Well, even if it's near idle, you were still occupying the I/O system for a few milliseconds. If someone else came in with a block request at that time, you extended their response time by that much. > The elimination of the all-scan background writer means that true hot and dirty > spots in the buffer cache, like popular index blocks on a heavily updated table > that never get a zero usage_count, are never going to be written out other than > as part of the checkpoint process. If they're really popular blocks on a heavily updated table, then writing them out unnecessarily doesn't buy us anything. The case where it helps is when they weren't really popular, we never got around to writing them out, and then when we do need to write them out the system is busy; in that case we wasted a chance to write them out while the system was more idle. But don't forget that we're still going through the OS's buffers. That will smooth out a lot of the I/O we generate anyway. Just because Postgres is idle doesn't mean the OS isn't busy flushing the buffers we wrote out when it was busy. > That's OK for now, but I'd like it to be the case that one day the > database's I/O scheduling would eventually get to those, in order to > optimize performance in the kind of bursty scenarios I've been mentioning > lately. I think the feeling is that the bursty scenarios are really corner cases. Heikki described a case where you're dirtying tons of blocks without triggering any WAL. That can happen, but it's pretty peculiar. I can imagine a scenario where you have a system that's very busy for 60s and then idle for 60s repeatedly, and for some reason you configure a checkpoint_timeout on the order of 20m or so (assuming you're travelling precisely 60mph).
In that scenario, the bgwriter's LRU scan has to fight to keep up for 60s, mostly writing out dirty pages that it could have flushed during the idle time. Effectively you're only making use of half the I/O bandwidth, since the bgwriter doesn't do any work for half the duty cycle. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > I can imagine a scenario where you have a system that's very busy for 60s and > then idle for 60s repeatedly. And for some reason you configure a > checkpoint_timeout on the order of 20m or so (assuming you're travelling > precisely 60mph). Is that Scottish m? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.